Multimodal Vision-Vibration Fusion Diagnosis 2026
Combining optical imaging with vibration sensing through deep learning architectures is reshaping fault diagnosis, structural health monitoring, and autonomous perception. This dataset spans 2013–2026 across 8 named assignees and 6 jurisdictions.
Vision-Vibration Fusion: From Lab Concept to Deployable Systems
Multimodal vision-vibration fusion diagnosis systems jointly process optical or video data alongside vibration signals — acoustic, inertial, radar, or structural — using neural network fusion architectures. The core challenge is that single-modality sensing yields incomplete or ambiguous information about the health state of mechanical, structural, or biological systems.
Three principal sub-domains appear across the dataset: vibration-to-image transformation for deep classification, direct multi-sensor fusion with joint visual-vibration encoding, and video-based monitoring that extracts frequency and displacement information directly from camera frames without contact sensors.
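As a rough illustration of the first sub-domain, the sketch below turns a 1-D vibration trace into a 2-D time-frequency image using a complex Morlet wavelet transform. It is a dependency-free sketch, not the pipeline of any retrieved record: the function name `morlet_scalogram`, the wavelet parameter `omega0 = 6`, the sampling rate, and the synthetic 40 Hz tone are all assumptions made for the example.

```python
import cmath
import math


def morlet_scalogram(signal, fs, freqs, omega0=6.0):
    """Return |CWT| as a 2-D list: one row per analysis frequency, one column per sample."""
    n = len(signal)
    image = []
    for f in freqs:
        scale = omega0 / (2.0 * math.pi * f)   # scale (s) whose centre frequency is f
        half = int(4.0 * scale * fs)           # truncate the Gaussian envelope at +/- 4 scales
        # Conjugated complex Morlet wavelet, sampled on the signal's time grid.
        kernel = []
        for k in range(-half, half + 1):
            u = k / (fs * scale)
            kernel.append(cmath.exp(-1j * omega0 * u) * math.exp(-0.5 * u * u) / math.sqrt(scale))
        row = []
        for t in range(n):
            acc = 0j
            for j, w in enumerate(kernel):
                idx = t + j - half
                if 0 <= idx < n:               # samples outside the record count as zero
                    acc += signal[idx] * w
            row.append(abs(acc) / fs)          # 1/fs approximates the integration step
        image.append(row)
    return image


# Synthetic vibration trace: a 40 Hz tone sampled at 400 Hz for 0.5 s (made-up data).
fs = 400
freqs = [10, 20, 40, 80]
trace = [math.sin(2.0 * math.pi * 40.0 * i / fs) for i in range(200)]
image = morlet_scalogram(trace, fs, freqs)

# Row with the most energy at mid-record; for this tone it should be the 40 Hz row.
mid_column = [row[100] for row in image]
strongest = freqs[mid_column.index(max(mid_column))]
print(strongest)
```

The resulting frequency-by-time array can be rendered as an image and passed to an image classifier. A production pipeline would normally use an optimised wavelet library instead of these nested loops.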
The field spans roughly 2013–2026, with three discernible phases: a foundational phase (2013–2018) establishing pixel-intensity FFT and early multi-sensor fusion patents; a development phase (2019–2022) where deep learning architectures became dominant; and an acceleration phase (2023–2026) featuring system-level deployment and hardware integration.
This dataset covers 8 distinct assignees across 6 jurisdictions. Innovation in the retrieved records is moderately distributed: no single assignee spans all application areas, but SRI International and Momenta (Suzhou) Technology each hold 3 active patent records, which points to sustained portfolio building.
Technology Clusters and Filing Timeline in the Dataset
The retrieved records distribute across four technology clusters — vibration-to-image classification, spatiotemporal video monitoring, physics-guided embedding fusion, and BEV-based multi-modal fusion — with filing activity concentrated in 2019–2026.
Patent Records by Technology Cluster (Dataset Snapshot)
In this dataset, BEV-based multi-modal fusion and vibration-to-image classification hold the largest concentrations of patent and literature records, while physics-guided embedding and spatiotemporal video monitoring form smaller but sustained clusters.
Patent Filing Activity by Phase, 2013–2026 (Retrieved Records)
In this dataset, the acceleration phase (2023–2026) contains the highest concentration of commercial patent filings, with Momenta, Jiangsu University, Tongji University, and Qualcomm all filing between 2024 and 2026 — compared to 2 foundational records before 2019.
Key Application Domains in Vision-Vibration Fusion Diagnosis
The retrieved records span five distinct application domains — from industrial machinery fault diagnosis and structural health monitoring to autonomous driving perception, smart building occupant sensing, and multimodal medical diagnosis — each drawing on overlapping deep learning fusion architectures.
Industrial Machinery Fault Diagnosis
This is the most technically concentrated domain in the dataset and covers bearing and gearbox diagnostics. A Multi-Information Fusion ViT Model (2023) achieved 99.85% average accuracy on small-sample bearing datasets using multi-scale DWT decomposition and CWT maps. A multi-source SDP and VGG16 fusion method for gearboxes (2022) reported pre-fusion accuracy of 93–96% per sensor, with post-fusion performance exceeding the individual sensor results once the per-sensor outputs were combined through Dempster-Shafer evidence theory.
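To show how evidence-level fusion can lift confidence above any single sensor, here is a minimal sketch of Dempster's rule of combination. The fault classes and the mass values are hypothetical, and `combine_dempster` is an illustrative helper, not code from the 2022 gearbox paper.

```python
def combine_dempster(m1, m2):
    """Dempster's rule of combination for two mass functions keyed by frozenset."""
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb            # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; Dempster's rule is undefined")
    # Renormalise so that the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


# Hypothetical classifier outputs for two vibration sensors (assumed numbers).
WEAR, CRACK, NORMAL = frozenset({"wear"}), frozenset({"crack"}), frozenset({"normal"})
ANY = frozenset({"wear", "crack", "normal"})   # mass left on "unknown"

sensor_a = {WEAR: 0.6, CRACK: 0.1, ANY: 0.3}
sensor_b = {WEAR: 0.5, NORMAL: 0.2, ANY: 0.3}

fused = combine_dempster(sensor_a, sensor_b)
print(round(fused[WEAR], 3))  # 0.778
```

In this toy case both sensors lean towards "wear" (0.6 and 0.5), and the fused belief rises to about 0.78 because agreeing evidence reinforces itself while conflicting mass is normalised away.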
Technology cluster: Vibration-to-Image Classification
Structural Health Monitoring
Non-contact video-based vibration sensing was established in the 2013 foundational paper using pixel-intensity FFT from fixed digital camera frames for frequency extraction. The 2020 3DCNN-ConvLSTM system extended this using a Microsoft Kinect v2 RGB-D camera for low-frequency environmental vibration monitoring under unstable ambient light conditions, combining short-term 3D CNN and long-term ConvLSTM spatiotemporal features.
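The pixel-intensity idea can be sketched in a few lines: track the brightness of a pixel (or region) across frames, remove the mean, and look for the strongest spectral peak. The example below uses a naive discrete Fourier transform to stay dependency-free; an FFT routine computes the same spectrum faster. The function name `dominant_frequency`, the frame rate, and the synthetic 5 Hz signal are assumptions for illustration, not details from the 2013 paper.

```python
import cmath
import math


def dominant_frequency(intensity, fps):
    """Return the frequency (Hz) of the largest non-DC spectral peak in an intensity series."""
    n = len(intensity)
    mean = sum(intensity) / n
    centred = [v - mean for v in intensity]    # remove the DC offset (average brightness)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):             # positive-frequency bins only
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(centred))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fps / n                    # bin index -> Hz


# Made-up example: brightness of one pixel on a structure vibrating at 5 Hz,
# recorded for 2 s by a fixed 60 fps camera.
fps = 60
pixel = [128.0 + 10.0 * math.sin(2.0 * math.pi * 5.0 * i / fps) for i in range(120)]
print(dominant_frequency(pixel, fps))  # 5.0
```

The frequency resolution is `fps / n` (0.5 Hz here), and only vibrations below half the frame rate can be recovered.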
Technology cluster: Spatiotemporal Video Monitoring
Autonomous Driving Perception
This is the most patent-intensive application domain in the dataset. Jiangsu University (2026, US pending) describes a loosely coupled BEV fusion system for LiDAR, camera, and radar, with cascade coupling for trajectory tracking. Momenta (Suzhou) Technology holds two active US patents (filed 2023 and 2026) covering radar-vision feature fusion and target matching. GM Global Technology Operations holds active US patents for multi-mode heterogeneous sensor fusion (2020) and radar-vision fusion (2019).
Technology cluster: BEV Multi-Modal Fusion
Smart Buildings & Occupant Sensing
University of California filed a pending US patent in 2024 for cross-modal association between wearable biometric signals and structural floor-vibration sensor data using the AD-TCN (Association Distance Temporal Convolutional Network) architecture. The patent targets smart home, eldercare facility, and retail sensing environments, representing a direct extension of industrial fusion paradigms into inhabited built environments.
Technology cluster: Cross-Modal Embedding Fusion
Key Patent Assignees in Vision-Vibration Fusion (Retrieved Records)
In this dataset, SRI International (US) and Momenta (Suzhou) Technology Co., Ltd. (CN/US) each hold 3 active patent records, the highest filing volumes among the 8 named assignees. Commercial entities concentrate on autonomous driving perception fusion, while research institutions focus on cross-modal embedding frameworks.
Top Assignees by Filing Count in Retrieved Records (Dataset Snapshot)
SRI International
SRI International holds 3 active patent records in this dataset spanning WO 2021, US 2023, and US 2025, all covering physics-guided deep multimodal embeddings for task-specific data exploitation. The core invention constructs a common embedding space where sensor-data-specific neural networks produce modality vectors and cross-modal related vectors are mapped closer together. All three records are active, representing a sustained and expanding IP portfolio in cross-modal feature alignment applicable to any multi-sensor diagnostic system incorporating physics priors.
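The record summaries do not state the training objective, so the following is only a generic stand-in for "related vectors are mapped closer together": a standard pairwise contrastive loss over embeddings from two modality-specific encoders. The physics-guided part is not modelled here, and `contrastive_loss`, the margin, and the 2-D example vectors are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, margin=1.0):
    """Mean contrastive loss over (vec_a, vec_b, related) triples.

    Related pairs are penalised by their squared distance (pulled together);
    unrelated pairs are penalised only while closer than `margin` (pushed apart).
    """
    total = 0.0
    for a, b, related in pairs:
        d = euclidean(a, b)
        total += d ** 2 if related else max(0.0, margin - d) ** 2
    return total / len(pairs)


# Hypothetical encoder outputs in a shared 2-D embedding space.
camera_vec = [0.9, 0.1]          # image of a machine with a given fault
vibration_same = [0.8, 0.2]      # vibration signature of the same fault
vibration_other = [0.1, 0.9]     # vibration signature of a different condition

loss = contrastive_loss([
    (camera_vec, vibration_same, True),
    (camera_vec, vibration_other, False),
])
print(round(loss, 4))  # 0.01
```

Minimising a loss of this shape during training is one common way to obtain a shared space in which a query from one sensor can retrieve matching samples from another.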
Jurisdiction: United States
Momenta (Suzhou) Technology
Momenta (Suzhou) Technology Co., Ltd. holds 3 active patent records in this dataset: EP 2023, US 2023, and US 2026, all covering camera-radar vision fusion architectures for autonomous driving perception. The 2026 US patent covers fusing vision perception and radar perception features through a radar-vision feature fusion model with cross-modality target matching, and is currently active. The EP 2023 and US 2023 filings address multi-sensor data fusion apparatus and method, reflecting a parallel international filing strategy.
Jurisdiction: China (CN) / United States
Forward-Looking Technology Signals (2024–2026)
The most recent filings in the dataset (2024–2026) reveal five forward-looking directions: BEV unification of heterogeneous sensor streams, cross-modal association networks for built-environment sensing, adaptive online fusion with distribution-shift compensation, multimodal digital twins with visual-haptic-vibration fusion, and intelligent temporal multimodal encoding.
BEV Unification of Heterogeneous Sensor Streams
Converting LiDAR, camera, and millimeter-wave radar into Bird’s-Eye View feature spaces is becoming a standard architectural pattern in the 2025–2026 filings. Jiangsu University (2026, US pending) and Momenta (Suzhou) Technology (2026, US active) both deploy cascade coupling for trajectory-level diagnosis within unified BEV feature representations. This pattern is architecturally transferable from autonomous driving to industrial vibration-scene monitoring.
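A minimal sketch of the shared-grid idea follows: points from different sensors, already expressed in the vehicle frame, are binned into the same top-down grid so that each cell carries one channel per sensor. The function `rasterize_bev`, the grid extents, the 0.5 m cell size, and the sample points are assumptions. Lifting camera features into BEV requires depth estimation or a learned view transform and is not shown.

```python
def rasterize_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Count points per BEV cell. points: iterable of (x, y, z) in metres, vehicle frame."""
    n_rows = int((x_range[1] - x_range[0]) / cell)   # forward axis
    n_cols = int((y_range[1] - y_range[0]) / cell)   # lateral axis
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:                          # height is dropped in this simple view
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            row = int((x - x_range[0]) / cell)
            col = int((y - y_range[0]) / cell)
            grid[row][col] += 1
    return grid


# Made-up detections: three LiDAR returns in range, one out of range, one radar return.
lidar_points = [(10.2, 0.1, 0.5), (10.3, 0.2, 1.0), (35.0, -5.0, 0.2), (50.0, 0.0, 0.0)]
radar_points = [(10.4, 0.3, 0.0)]

lidar_grid = rasterize_bev(lidar_points)
radar_grid = rasterize_bev(radar_points)

# Stack per-sensor grids as channels of one BEV tensor: bev[channel][row][col].
bev = [lidar_grid, radar_grid]
print(bev[0][20][40], bev[1][20][40])  # 2 1
```

Because both sensors land in cell (20, 40), a downstream network sees their evidence at the same spatial location, which is the property that makes a BEV representation convenient for fusion.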
Cross-Modal Association Networks for Built-Environment Sensing
University of California’s 2024 pending US patent introduces the AD-TCN (Association Distance Temporal Convolutional Network) architecture for aligning wearable biometric signals with structural floor-vibration sensor data. This extends the fusion paradigm from industrial machinery to inhabited buildings, targeting smart home, eldercare facility, and retail sensing environments. This is the only retrieved patent directly claiming cross-modal alignment between structural vibration and wearable signals.
Vibration-to-Image Classification vs. BEV Multi-Modal Fusion
| Dimension | Vibration-to-Image Classification | BEV Multi-Modal Fusion |
|---|---|---|
| Input signals | Vibration / acoustic signals converted to TFR images | LiDAR point clouds, camera images, millimeter-wave radar |
| Intermediate representation | CWT / DWT time-frequency maps, SDP patterns, PSD energy maps | Unified Bird’s-Eye View (BEV) feature space |
| Model architectures | ViT (Vision Transformer), VGG16 convolutional networks | Cascade coupling BEV encoder, radar-vision feature fusion model |
| Reported performance | 99.85% average on small-sample bearing fault datasets (2023 ViT paper) | System-level deployment metrics; trajectory tracking validated in 2026 patents |
| Application domains | Industrial machinery: bearings, gearboxes (aerospace, wind energy) | Autonomous driving perception, UAV cluster surveillance |
| Key assignees and sources | Academic literature (unassigned); SRI International for embedding layer | Momenta (Suzhou) Technology, Jiangsu University, GM Global Tech Ops, Qualcomm |
| Patent status | Primarily literature-stage; patent protection sparse on TFR+ViT pipeline | Multiple active US patents filed 2023–2026 by commercial entities |
| Maturity and open challenges | Approaching commodity accuracy; differentiation lies in small-sample and variable-condition robustness | Distribution shift between training BEV features and real-time sensor data (addressed by Qualcomm 2024) |
Frequently Asked Questions: Vision-Vibration Fusion Diagnosis Patents
A 2023 paper on a Multi-Information Fusion ViT Model for bearing fault diagnosis reported 99.85% average accuracy on small-sample bearing fault datasets, using multi-scale DWT decomposition to generate CWT maps fed into a Vision Transformer classifier.
In this dataset, SRI International (US) and Momenta (Suzhou) Technology Co., Ltd. (CN/US) each hold 3 patent records. SRI International’s portfolio covers physics-guided deep multimodal embeddings (WO 2021, US 2023, US 2025, all active). Momenta’s portfolio covers camera-radar vision fusion for autonomous driving (EP 2023, US 2023, US 2026, all active).
The AD-TCN (Association Distance Temporal Convolutional Network) is a cross-modal alignment architecture introduced in a 2024 pending US patent from the University of California. It aligns wearable biometric signals with structural floor-vibration sensor data for indoor occupant sensing in smart homes, care facilities, and retail environments.
SRI International’s three-filing portfolio (WO 2021 → US 2023 → US 2025) covers physics-constrained common embedding spaces for multi-sensor diagnostic systems. According to the strategic analysis in this dataset, developers of physics-aware fusion architectures should conduct freedom-to-operate analysis against these claims before deployment.
Vibration-to-image classification converts acoustic or vibration signals into time-frequency representations (CWT/DWT maps) and classifies them using ViT or VGG architectures, primarily for industrial machinery bearing and gearbox fault diagnosis. BEV-based fusion converts heterogeneous sensor streams (LiDAR, camera, radar) into a unified Bird’s-Eye View spatial representation, primarily deployed in autonomous driving perception and UAV surveillance.
In this dataset, Jiangsu University and Tongji University both hold active or pending US patents filed in 2026. Jiangsu University filed a pending US patent in 2026 on a loosely coupled BEV multi-modal fusion system for intelligent driving. Tongji University holds an active US patent filed in 2026 covering radar-vision fusion for multi-view collaborative UAV tracking under low-luminance conditions.
Data and insights on this page are based on a limited patent and literature dataset and are for reference only. Figures may not represent the complete technology landscape.