When AI moves from digital screens into the physical world, #multimodal models need a different architecture. Om AI has officially launched VLX, which it describes as the world's first #on-device streaming multimodal model series for the physical world, and with it introduces the concept of "streaming multimodality." Unlike traditional video models, which slice video into frames for offline batch processing, the VLX series processes continuous real-time video streams using streaming encoding and incremental cache reasoning. According to Om AI, this achieves millisecond-level live perception and closes the perception-localization-action loop directly on the edge.
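The difference between offline batch processing and streaming encoding with an incremental cache can be sketched in a few lines of Python. This is a toy illustration, not Om AI's implementation: `encode_frame`, `StreamingEncoder`, and `batch_query` are hypothetical names, and the "encoder" is just a mean over pixel values. The point it shows is the cost model: the streaming path encodes each frame once on arrival and answers queries from cached features, while the batch path re-encodes its whole window on every query.

```python
from collections import deque
from typing import Deque, List, Tuple


def encode_frame(frame: List[float]) -> float:
    """Hypothetical stand-in for a visual encoder: mean pixel intensity."""
    return sum(frame) / len(frame)


class StreamingEncoder:
    """Encodes each frame once on arrival and keeps a bounded feature cache."""

    def __init__(self, cache_size: int = 4) -> None:
        # Oldest features are evicted automatically once the cache is full.
        self.cache: Deque[float] = deque(maxlen=cache_size)
        self.encode_calls = 0

    def push(self, frame: List[float]) -> None:
        # Incremental step: only the newly arrived frame is encoded.
        self.cache.append(encode_frame(frame))
        self.encode_calls += 1

    def query(self) -> float:
        # Answer from cached features; no frame is re-encoded here.
        return sum(self.cache) / len(self.cache)


def batch_query(frames: List[List[float]], window: int = 4) -> Tuple[float, int]:
    """Offline-style baseline: re-encode the whole window for every query."""
    features = [encode_frame(f) for f in frames[-window:]]
    return sum(features) / len(features), len(features)


# Six synthetic frames; one query is issued after each frame arrives.
frames = [[float(i)] * 8 for i in range(6)]
stream = StreamingEncoder(cache_size=4)
batch_calls = 0
for t, frame in enumerate(frames):
    stream.push(frame)
    _, calls = batch_query(frames[: t + 1], window=4)
    batch_calls += calls

print("streaming encoder calls:", stream.encode_calls)
print("batch encoder calls:", batch_calls)
```

Both paths return the same answer over the same window; only the amount of repeated encoding work differs, and that gap grows with the window size and the query rate.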
The VLX series consists of three cooperating models tailored for real-time physical intelligence. VLX-Flow handles continuous perception: it uses incremental encoding so the model keeps observing its environment and can respond instantly to queries. VLX-Seek handles precise localization: it turns coordinate generation into region retrieval (selecting regions instead of guessing coordinates) to provide reliable spatial awareness. VLX-Go executes actions: it translates visual understanding directly into short-horizon waypoints and motion trajectories for autonomous navigation and obstacle avoidance.
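The idea of "selecting regions instead of guessing coordinates" can be illustrated with a minimal retrieval sketch. Everything here is an assumption made for illustration: the fixed grid of candidate regions, the two-dimensional feature vectors, the dot-product score, and the names `grid_regions` and `retrieve_region` are not taken from VLX-Seek. What the sketch shows is the structural property: the output is always one of a known set of valid regions, never a free-form coordinate that could fall outside the frame.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def grid_regions(width: int, height: int, cols: int, rows: int) -> List[Box]:
    """Split the frame into a fixed grid of candidate regions (row-major)."""
    cw, ch = width // cols, height // rows
    return [
        (c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
        for r in range(rows)
        for c in range(cols)
    ]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product used as a similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_region(
    query: Sequence[float],
    region_feats: Sequence[Sequence[float]],
    regions: Sequence[Box],
) -> Tuple[Box, float]:
    """Return the candidate region whose feature best matches the query."""
    scores = [dot(query, feat) for feat in region_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return regions[best], scores[best]


# Hypothetical 2x2 grid over a 640x480 frame with made-up region features.
regions = grid_regions(640, 480, cols=2, rows=2)
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
query = [0.0, 1.0]  # stand-in for an encoded text query

box, score = retrieve_region(query, region_feats, regions)
print(box, score)
```

A finer grid or learned region proposals would give tighter boxes; the retrieval step itself stays the same.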
Under this new paradigm, visual data enters the model as a continuous stream rather than as single-frame captures. Instead of "analyzing after watching," the model understands on the fly and takes proactive action. This marks a qualitative leap in AI's autonomous capability, not just a better chatbot interface.
To handle the constraints of the physical world (continuous time, dynamic environments, and limited edge computing power), the VLX series is built natively for on-device deployment. Available in sizes from 0.6B to 10B parameters, VLX is positioned as fast (latency as low as 0.06 seconds, or 60 ms), compact, precise, and action-oriented.
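To put the quoted figures in on-device terms, a back-of-the-envelope calculation helps. The sketch below is plain arithmetic under stated assumptions: the announcement does not say which numeric precision the weights use, so the 2-byte and 0.5-byte per-parameter values are illustrative only, and the update rate assumes each update takes the full quoted latency and runs strictly one after another. The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Excludes activations, caches, and runtime overhead.
    """
    return params_billions * bytes_per_param


def max_sequential_updates_per_second(latency_s: float) -> float:
    """Upper bound on updates per second if each update takes latency_s."""
    return 1.0 / latency_s


# Assumed precisions (not stated in the source): 2 bytes and 0.5 bytes per weight.
for size in (0.6, 10.0):
    for bytes_per_param in (2.0, 0.5):
        gb = weight_memory_gb(size, bytes_per_param)
        print(f"{size}B params at {bytes_per_param} bytes/param: {gb:.1f} GB")

print(f"{max_sequential_updates_per_second(0.06):.1f} updates/s at 0.06 s latency")
```

Under these assumptions the smallest model's weights fit in roughly 0.3 to 1.2 GB and the largest in roughly 5 to 20 GB, and a 0.06-second latency corresponds to about 16 to 17 sequential updates per second.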
[AgentUpdate Depth Analysis] The launch of VLX points to a significant shift for Embodied AI, from offline batch reasoning to real-time streaming execution. Traditional AI Agent architectures rely heavily on cloud-based LLMs for the "Perception-Planning-Action" loop, which incurs high latency and bandwidth overhead. By combining streaming encoding with on-device reasoning, VLX lowers local hardware barriers and avoids the limitations of frame-by-frame processing. This "act-while-observing" capability is what physical Agents, such as #robotics and wearables, need to operate in complex, dynamic environments. Compared with cloud-dependent solutions, VLX's edge-native, closed-loop framework offers a scalable and cost-effective reference for the global AI Agent ecosystem and could accelerate the deployment of spatial intelligence at scale.