Why neither vision nor force sensing alone is sufficient for unstructured robotic assembly
Multi-modal AI combining vision and force feedback improves assembly success rates in unstructured robot workcells because each modality has a fundamental blind spot that the other compensates for. Visual perception is highly sensitive to pose and posture changes between the objects being assembled, but it cannot reliably reflect contact state, especially when parts occlude each other during insertion. Force and torque sensing captures rich contact information but is insensitive to macroscopic changes in position and orientation. A robot relying only on cameras cannot detect the forces building during a failed insertion; a robot relying only on a force-torque sensor cannot locate the part it needs to pick up. This complementary failure structure, identified across more than 50 patents and research papers spanning 2013 to 2025, is the foundational argument for multi-modal fusion.
The research dataset shows a pronounced concentration of publications between 2020 and 2023, reflecting a field-wide shift from hand-coded force thresholds and fixed vision pipelines toward learned, adaptive multi-modal policies. Key assignees appearing multiple times include the University of Patras Laboratory for Manufacturing Systems and Automation, Harbin Institute of Technology, Agile Robots AG, KTH Royal Institute of Technology, Shenyang University of Technology, and Autodesk Research — institutions whose repeated contributions indicate sustained programmatic investment rather than one-off experiments.
The inadequacy of unimodal approaches is not merely theoretical. Benchmarking on the NIST Assembly Task Boards, evaluated across both 2D/3D pattern matching and hybrid force/motion control, confirmed that neither approach in isolation achieves reliably high success rates across the full task range, and that the bottleneck problems differ between vision-based localisation and force-based contact management. According to NIST assembly challenge standards, this gap between localisation accuracy and contact compliance remains the primary engineering obstacle in unstructured workcell deployment.
Sensor fusion architectures: tensor networks, Bayesian state estimators, and residual RL
Three distinct architectural approaches to fusing vision and force data have emerged as dominant in the literature, each making a different trade-off between interpretability, training complexity, and deployment robustness.
Tensor fusion networks
The most explicit commercialisation of deep multi-modal fusion appears in a 2025 patent from Harbin Institute of Technology, Shenzhen. The architecture provides several neural network models, including a reinforcement learning network and a tensor fusion network, trained simultaneously on visual data, tactile sensor data, robot motion feedback, and torque feedback. Tactile signals are used to indirectly infer multi-dimensional external forces, and the fused representation vector is used to generate robot motion instructions that flexibly adjust insertion force to complete the assembly. Crucially, this architecture decouples the sensing modalities at the hardware level while integrating them at the representation level, enabling the system to respond to contact events that pure vision would miss entirely.
A tensor fusion network jointly encodes multiple heterogeneous sensor streams — visual frames, tactile readings, motion state, and torque measurements — into a single unified representation vector. Rather than processing each modality sequentially and passing outputs between stages, tensor fusion performs a joint outer-product combination that preserves cross-modal interaction terms, allowing the policy to exploit correlations between, for example, a visual edge detection and a simultaneous force spike that neither stream would reveal alone.
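The outer-product combination described above can be illustrated with a minimal sketch. It assumes the convention common in the tensor fusion literature of appending a constant 1 to each modality embedding, so that the fused vector keeps the unimodal terms alongside every cross-modal product; the patent itself does not confirm this detail, and the function name and example values are hypothetical.

```python
from itertools import product

def tensor_fuse(*embeddings):
    """Fuse modality embeddings by a flattened outer product.

    Each embedding is extended with a constant 1.0 (an assumed convention),
    so the result contains the unimodal terms as well as every
    cross-modal product.
    """
    extended = [list(e) + [1.0] for e in embeddings]
    fused = []
    for combo in product(*extended):
        term = 1.0
        for value in combo:
            term *= value
        fused.append(term)
    return fused

# Hypothetical 2-D embeddings standing in for encoder outputs.
vision = [0.5, 2.0]   # e.g. visual edge features
force = [3.0, 0.0]    # e.g. force spike features
fused = tensor_fuse(vision, force)  # length (2 + 1) * (2 + 1) = 9
```

The fused vector grows multiplicatively with the number and size of the modalities, which is one reason practical systems would likely keep each per-modality embedding small before fusing.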
Bayesian state estimation
A single Bayesian framework that continuously tracks part poses using both visual observations and intrinsic tactile sensing was formalised by researchers at TU Munich in 2021. The framework enables the robot to maintain accurate state estimates even when occlusion blocks the vision system — the condition most likely to cause state estimation collapse in unstructured settings where parts move freely and are not constrained by fixtures. Object-centric assembly skills guided by these estimated poses can proceed robustly through the phases of an assembly cycle where either modality alone would fail.
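As a rough illustration of why a shared Bayesian estimate survives occlusion, the sketch below tracks a single scalar part coordinate with Gaussian updates. It is not the TU Munich framework: the scalar state, the noise variances, and the function names are all assumptions chosen to keep the example small.

```python
def gaussian_update(mean, var, measurement, meas_var):
    """One scalar Kalman-style measurement update."""
    gain = var / (var + meas_var)
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

def fuse_step(mean, var, vision=None, tactile=None,
              vision_var=4.0, tactile_var=1.0, process_var=0.5):
    """Predict, then fold in whichever observations are available.

    All variances are hypothetical placeholders, not measured values.
    """
    var += process_var  # the part may have moved since the last step
    if vision is not None:  # skipped when the camera view is occluded
        mean, var = gaussian_update(mean, var, vision, vision_var)
    if tactile is not None:  # skipped when there is no contact
        mean, var = gaussian_update(mean, var, tactile, tactile_var)
    return mean, var

# Step 1: camera sees the part. Step 2: view occluded, contact only.
mean, var = fuse_step(0.0, 10.0, vision=5.0)
mean, var = fuse_step(mean, var, tactile=5.2)
```

Because both sensors update the same belief, losing one of them only slows the reduction in uncertainty; the estimate does not reset.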
Visual-haptic residual reinforcement learning
Agile Robots AG demonstrated a third architectural pattern: visual residual reinforcement learning that fuses operational-space visual and haptic inputs. The method addresses target uncertainty in unstructured environments by incorporating both visual observation and torque feedback into the RL policy, and introduces a proactive action mechanism to resolve partial observability. The system was validated on RAM module insertion — a highly contact-rich, precision-demanding task — demonstrating that visual-haptic fusion specifically improves sample efficiency of policy learning, not only final success rate.
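The general residual reinforcement learning pattern can be sketched as a hand-designed base controller plus a bounded learned correction. The sketch below is a generic illustration under stated assumptions, not the Agile Robots method: the "learned" policy is replaced by a fixed linear map from torque feedback, and every gain, limit, and function name is hypothetical.

```python
def base_controller(target_xy, tool_xy, gain=0.5):
    """Proportional controller toward the visually estimated target."""
    return [gain * (t - p) for t, p in zip(target_xy, tool_xy)]

def residual_policy(torque_xy, weights=(-0.01, -0.01)):
    """Stand-in for a learned policy: a fixed linear map from torque feedback."""
    return [w * tau for w, tau in zip(weights, torque_xy)]

def clip(values, limit):
    """Clamp each component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]

def residual_action(target_xy, tool_xy, torque_xy, residual_limit=0.2):
    """Final action = base action + bounded residual correction."""
    base = base_controller(target_xy, tool_xy)
    residual = clip(residual_policy(torque_xy), residual_limit)
    return [b + r for b, r in zip(base, residual)]

# Free space: zero torque, so the action equals the base controller output.
free_space = residual_action([1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
# In contact: torque feedback shifts the action, limited by residual_limit.
in_contact = residual_action([1.0, 0.0], [0.0, 0.0], [10.0, -50.0])
```

Bounding the residual keeps the base controller in charge of gross motion, which is one plausible reason this pattern tends to need fewer training samples than learning the full action from scratch.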
Learning methodologies: manipulation primitives, imitation learning, and sim-to-real transfer
The choice of learning methodology determines how quickly a multi-modal policy can be trained, how well it transfers from simulation to the real world, and how gracefully it recovers from failure. Three approaches dominate the literature: contact-aware manipulation primitives, hierarchical imitation learning that separates trajectory and force acquisition, and sim-to-real transfer augmented by visual error estimators.
Manipulation primitives with contact semantics
Researchers at NTU Singapore (2021) demonstrated that decomposing assembly into semantically meaningful manipulation primitives — such as “Move down until contact” or “Slide along x while maintaining contact with the surface” — keeps the reinforcement learning search tree shallow while encoding physical contact semantics. Policies learned entirely in simulation achieved direct sim-to-real transfer on tasks including round peg insertion with 0.04 mm clearance without retraining. This result would be unreachable by vision alone at typical camera resolutions; the contact-awareness embedded in each primitive is precisely what allows the transfer to succeed.
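A primitive such as "Move down until contact" is essentially a guarded move: command small displacements and stop when a force condition is met. The sketch below shows that idea against a toy one-dimensional stand-in for a robot. The `ToyRobot` class, its stiffness model, and all thresholds are hypothetical and are not taken from the NTU Singapore work.

```python
class ToyRobot:
    """Hypothetical 1-D robot: tool height z above a rigid surface at z = 0."""

    def __init__(self, z, stiffness=1000.0):
        self.z = z                  # metres above the surface
        self.stiffness = stiffness  # newtons per metre of penetration

    def move_z(self, dz):
        self.z += dz

    def force_z(self):
        # Force is zero in free space and grows linearly with penetration.
        return self.stiffness * max(0.0, -self.z)

def move_down_until_contact(robot, step=0.001, force_threshold=2.0,
                            max_steps=500):
    """Guarded move: descend in small steps, stop once the force threshold is met.

    Returns True on contact, False if max_steps is exhausted first.
    """
    for _ in range(max_steps):
        if robot.force_z() >= force_threshold:
            return True
        robot.move_z(-step)
    return False

robot = ToyRobot(z=0.05)
made_contact = move_down_until_contact(robot)
```

Because the termination condition is a contact event rather than a target position, the same primitive behaves consistently whether the surface is exactly where the simulator placed it or slightly offset in the real cell.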
Hierarchical imitation learning: separating trajectory and force
Researchers at AIST Japan (2021) identified a critical asymmetry in learning from human demonstration: nominal trajectories can be captured reliably via kinesthetic teaching or teleoperation, but realistic contact force profiles cannot be obtained from simulation due to the reality gap. Their hierarchical imitation learning framework learns trajectory from demonstration while learning the force profile separately through physical interaction, then combines both to generalise across assembly configurations. This separation of concerns is architecturally significant: the vision stream governs trajectory generation; the force stream governs contact compliance. The approach is consistent with the broader principle, also noted by researchers publishing under IEEE standards, that contact-rich manipulation requires independent modelling of geometric and dynamic constraints.
Failure recovery through contact interpretation
Stanford University (2021) demonstrated a probabilistic approach that trains differentiable Bayesian filters on the tactile sensorimotor trace from failed assembly attempts to update belief about part position and type. By treating assembly failures as informative force-tactile signals rather than terminal states, the system closes the loop between failure detection via force and re-localisation via the updated visual-tactile belief state. This yields higher precision in position and type estimation and completes fitting tasks faster than retry-from-scratch baselines.
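The idea of treating a failed insertion as evidence can be shown with a hand-specified discrete Bayes update over candidate hole locations. This is a simplification for illustration only: the Stanford work learns differentiable filters from data, whereas the likelihood values, the directional "cue", and the function names below are all assumed.

```python
def normalise(belief):
    """Scale a list of non-negative weights so they sum to 1."""
    total = sum(belief)
    return [b / total for b in belief]

def update_after_failure(belief, tried, cue,
                         p_cue_correct=0.8, p_false_fail=0.1):
    """Bayes update over candidate hole cells after a failed insertion.

    tried: index of the cell where the insertion failed.
    cue:   +1 if the contact trace suggests the hole lies at a higher
           index, -1 if it suggests a lower index.
    The probabilities are hypothetical placeholders.
    """
    posterior = []
    for cell, prior in enumerate(belief):
        if cell == tried:
            likelihood = p_false_fail        # failed despite being at the hole
        elif (cell - tried) * cue > 0:
            likelihood = p_cue_correct       # cell agrees with the cue
        else:
            likelihood = 1.0 - p_cue_correct  # cell contradicts the cue
        posterior.append(prior * likelihood)
    return normalise(posterior)

prior = [0.2] * 5  # uniform belief over five candidate cells
posterior = update_after_failure(prior, tried=2, cue=+1)
next_try = posterior.index(max(posterior))
```

A retry-from-scratch strategy would discard the failed attempt, while this update concentrates probability on the cells consistent with the contact trace and steers the next attempt there.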
Sim-to-real transfer with visual error estimators
Hunan University (2023) directly addressed the success-rate degradation that occurs when simulation-trained policies encounter real-world visual noise. Their patent employs an RGB-D camera to capture 6D pose teaching trajectories from human demonstrations, uses these to pre-train an assembly policy in simulation, and then deploys a visual error estimator — derived from the gap between simulated and real robot states — to correct policy outputs during real-world execution. Domain randomisation further improves robustness. Siemens (2020) took a complementary approach using meta-reinforcement learning: training on a family of simulated insertion tasks and then adapting to real-world tasks in fewer than 20 real trials, substantially compressing the data requirement that would otherwise make force-and-vision RL impractical in production settings. The importance of bridging simulation and physical reality is also highlighted in robotics research published by Science and Nature as a foundational challenge for autonomous manufacturing systems.
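Domain randomisation, mentioned above, amounts to drawing new simulator parameters for each training episode so the policy never overfits to one rendering or contact model. The sketch below shows only the sampling step; the parameter names and ranges are hypothetical and are not taken from the Hunan University patent.

```python
import random

# Hypothetical parameter ranges; real ranges would come from measuring the workcell.
RANGES = {
    "friction": (0.2, 0.8),
    "clearance_mm": (0.05, 0.5),
    "camera_noise_mm": (0.0, 2.0),
    "light_intensity": (0.5, 1.5),
}

def sample_episode_params(rng, ranges=RANGES):
    """Draw one set of simulator parameters uniformly from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

rng = random.Random(0)  # seeded so training runs are repeatable
episodes = [sample_episode_params(rng) for _ in range(3)]
```

Each dictionary would be passed to the simulator before an episode starts; widening a range trades training speed for robustness to that source of variation.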
Engineering implementations: how multi-modal systems perform in real unstructured workcells
Industrial deployment of multi-modal vision-force systems reveals a consistent pattern: RGB-D or structured-light vision handles coarse part localisation and grasp planning, while force and torque sensing governs the contact-rich insertion phase. The handoff between modalities — and the quality of that handoff — is where implementation quality is most clearly differentiated.
Robotic Materials Inc. (2020) benchmarked this pattern directly across 20 experimental trials per task for peg-in-hole and hole-on-peg assemblies from the 2018 World Robotics Summit challenge. The system uses RGB-D sensing for initial part localisation and grasp planning, then transitions to spiral-based search and tilting insertion algorithms governed by hand-coded force/torque thresholds to detect critical assembly transitions. This industrial trial directly quantifies how the combination of 3D vision for coarse pose and force control for fine insertion enables reliable completion of tasks that neither modality could handle alone.
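A spiral search with a force-based stopping rule can be sketched as follows. The example assumes that the peg dropping into the hole shows up as the measured contact force falling below a threshold; that transition test, the Archimedean spiral parameters, and the toy force model are assumptions for illustration rather than details of the Robotic Materials system.

```python
import math

def spiral_waypoints(pitch=0.5, step=0.25, max_radius=3.0):
    """Yield (x, y) offsets along an Archimedean spiral r = pitch * theta / (2*pi)."""
    theta = 0.0
    while True:
        r = pitch * theta / (2.0 * math.pi)
        if r > max_radius:
            return
        yield (r * math.cos(theta), r * math.sin(theta))
        # Advance theta so consecutive points are roughly `step` apart.
        theta += step / max(r, step)

def spiral_search(contact_force_at, force_drop_threshold=1.0, **spiral_kwargs):
    """Return the first offset where the contact force falls below the threshold.

    Returns None if the spiral is exhausted without detecting a drop.
    """
    for x, y in spiral_waypoints(**spiral_kwargs):
        if contact_force_at(x, y) < force_drop_threshold:
            return (x, y)
    return None

def toy_force(x, y, hole=(1.0, 0.5), capture_radius=0.4):
    """Hypothetical sensor model: force vanishes once the peg is over the hole."""
    over_hole = math.hypot(x - hole[0], y - hole[1]) < capture_radius
    return 0.0 if over_hole else 5.0

found = spiral_search(toy_force)
```

The spiral pitch should be no larger than the capture radius of the hole (with chamfer), otherwise successive turns can pass on either side of it without triggering the force drop.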
The most striking result in the engineering literature is the 100% success rate achieved by Agile Robots AG (2021) using a pushing-based hybrid position/force skill. The key insight is that contact with the environment — a wall, a fixture edge, a mating surface — is an alignment resource rather than an obstacle. This paradigm shift is architecturally impossible with vision-only control: a camera cannot distinguish productive constraint contact from a damaging collision; a calibrated force-torque sensor can. By exploiting environmental constraints actively, the system eliminates the residual pose uncertainty that vision-based control cannot resolve at sub-millimeter tolerances.
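The classic hybrid position/force idea behind skills of this kind is a per-axis selection: some axes track position while others regulate contact force. The sketch below is a textbook-style simplification, not the Agile Robots implementation; the gains, the selection vector, and the function name are hypothetical.

```python
def hybrid_command(selection, pos_error, force_error, kp=1.0, kf=0.002):
    """Per-axis velocity command.

    selection[i] == 1 means axis i is position-controlled;
    selection[i] == 0 means axis i is force-controlled.
    Gains kp and kf are hypothetical placeholders.
    """
    return [
        s * kp * pe + (1 - s) * kf * fe
        for s, pe, fe in zip(selection, pos_error, force_error)
    ]

# Push against a constraint surface along x (force-controlled) while
# tracking position along y and z.
selection = [0, 1, 1]
pos_error = [0.0, 0.01, -0.02]   # metres
force_error = [6.0, 0.0, 0.0]    # newtons: desired push minus measured force
command = hybrid_command(selection, pos_error, force_error)
```

Regulating force on the pushing axis lets the surface itself fix the part's position along that axis, which is how contact becomes an alignment resource rather than an error source.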
CAD-informed adaptive assembly from Autodesk Research (2022) extends the unstructured workcell paradigm further: a two-robot workcell assembles interlocking 3D designs by combining CAD-derived simulation training with real-time visual inference. The system’s ability to operate without task-specific fixtures — relying instead on learned visual features to identify part poses — exemplifies the design-to-manufacture pipeline that multi-modal AI enables. This connects to broader trends in digital manufacturing tracked by organisations such as WIPO, whose global patent data confirms accelerating IP activity in AI-driven manufacturing automation through 2024.
Osaka University’s earlier multi-modal 3D vision work (2017) established the phase-switching principle that later papers extended with force feedback: AR markers during a teaching phase, point clouds and geometric constraints during robot execution. Switching sensing modality depending on task phase, rather than committing to a single sensor throughout, is a design pattern that the subsequent decade of research consistently validates and extends.
Benchmarking on the NIST Assembly Task Boards (2021) confirmed that end-to-end integrated force/motion and 2D/3D vision solutions outperform either modality deployed independently across the full task range. The bottleneck problems differ between vision-based localisation and force-based contact management, meaning each modality’s weaknesses are non-overlapping and fusion is necessary rather than merely beneficial.
Key institutions, leading assignees, and the direction of the field through 2025
The concentration of repeated contributors across the dataset reveals which institutions have built sustained research programmes rather than isolated publications, and their strategic emphases indicate where the field is heading.
The University of Patras Laboratory for Manufacturing Systems and Automation is the single most frequent institutional contributor, appearing across papers on knowledge-enabled cell design, AI-based scheduling, mobile dual-arm workers, ROS-based human-robot collaboration frameworks, augmented reality operator guidance, and machine-learning-based assembly monitoring. This breadth signals a systems-integration focus rather than a single-technology specialisation.
Harbin Institute of Technology spans visual teaching for task learning (2020), prediction-based human-robot collaboration via ConvLSTM (2022), and the visual-tactile tensor fusion patent (2025), indicating a longitudinal programme moving from demonstration-based learning toward end-to-end learned multi-modal control. Agile Robots AG contributes two high-impact papers grounded in unstructured industrial scenarios: the pushing-based hybrid position/force skill achieving 100% success rate and the proactive visual-haptic residual RL for torque-controlled insertion. Autodesk Research pursues the design-to-manufacture pipeline, with force-guided robot-agnostic assembly RL and CAD-informed adaptive assembly workcells, placing it at the intersection of generative design and autonomous fabrication.
A clear temporal trend emerges from the data: the field has moved from hand-coded force thresholds and fixed vision pipelines (2017–2019) toward learned, adaptive multi-modal policies (2020–2025), with increasing emphasis on sim-to-real transfer as the mechanism bridging training efficiency and deployment reliability in unstructured workcells. KTH Royal Institute of Technology’s contributions on multimodal human-robot collaboration — fusing haptic, gesture, and voice inputs for symbiotic assembly — indicate that the next frontier extends multi-modal sensing beyond the robot’s own proprioception to include the human operator as a sensor in the loop. This trajectory aligns with manufacturing digitalisation frameworks tracked by OECD in its industrial robotics and AI adoption reports.
The PatSnap R&D Intelligence platform and PatSnap IP Intelligence platform provide structured access to the full patent and literature corpus underlying this analysis, enabling R&D teams to map white spaces, track competitor filings, and identify licensing opportunities across the multi-modal robotics landscape.