China's open-source OCR technology has reached a new milestone. Baidu recently open-sourced a new document parser, Unlimited OCR. The model parses long documents of dozens of pages in one continuous pass and sets a new state of the art (SOTA) on the OmniDocBench benchmark, surpassing the previous leader, DeepSeek OCR.
Unlike traditional OCR pipelines that rely on "page-by-page rendering and merging," Unlimited OCR mimics the workflow of a human scribe. Rather than memorizing everything it has already transcribed, it retains only the active context and tracks its progress through the document. With its Reference Sliding Window Attention (R-SWA) mechanism, memory footprint and per-step compute overhead remain virtually flat no matter how long the document grows.
Standard OCR architectures suffer from quadratic growth in compute. During decoding, each new token attends to all prior tokens, so the KV cache keeps growing with the output, which strains memory and slows generation. To work around this, traditional systems split the document into segments and process them in a for-loop, which inevitably disrupts textual coherence and degrades the mapping of logical structure across complex documents.
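To make that growth concrete, here is a rough count of how many cache entries and query-key pairs an unbounded cache implies. It is only an illustrative sketch: the function names and the token counts (256 reference tokens, 1,000 to 4,000 generated tokens) are assumptions chosen for the example, not figures from Unlimited OCR.

```python
# Illustrative count only: the sizes below are assumptions, not figures from Unlimited OCR.


def cache_entries(num_reference: int, num_generated: int) -> int:
    """Entries held in an unbounded KV cache after num_generated decoding steps."""
    return num_reference + num_generated


def full_attention_pairs(num_reference: int, num_generated: int) -> int:
    """Total query-key pairs scored over a whole decode with an unbounded cache.

    Decoding step t (1-based) attends to every reference token plus the
    t output tokens produced so far, itself included.
    """
    return sum(num_reference + t for t in range(1, num_generated + 1))


if __name__ == "__main__":
    NUM_REFERENCE = 256  # assumed number of visual tokens
    for num_generated in (1_000, 2_000, 4_000):
        print(
            num_generated,
            cache_entries(NUM_REFERENCE, num_generated),
            full_attention_pairs(NUM_REFERENCE, num_generated),
        )
```

The cache grows linearly with the output, while the total attention work contains a term proportional to the square of the output length, which is why long outputs become slow.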
R-SWA addresses this challenge. Inspired by human "soft forgetting," the mechanism maintains full visibility over the reference visual tokens (similar to keeping the source book open on a desk) while applying a sliding window to the generated text. By using a fixed-length queue for the output's KV cache, Unlimited OCR prevents memory inflation and maintains a constant processing speed.
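The cache layout described above can be sketched in a few lines. This is a toy model under stated assumptions, not Baidu's implementation: the class name ReferenceWindowCache, the window size of 3, and the string tokens are all hypothetical, and a real cache would hold key/value tensors per layer rather than strings.

```python
from collections import deque


class ReferenceWindowCache:
    """Toy cache: reference tokens stay visible, outputs live in a fixed-length queue."""

    def __init__(self, reference_tokens, window: int):
        self.reference = list(reference_tokens)  # never evicted
        self.recent = deque(maxlen=window)       # oldest output drops off automatically

    def append(self, output_token) -> None:
        """Record a newly generated token, evicting the oldest one if the queue is full."""
        self.recent.append(output_token)

    def visible(self) -> list:
        """Tokens the next decoding step may attend to."""
        return self.reference + list(self.recent)

    def size(self) -> int:
        return len(self.reference) + len(self.recent)


# Hypothetical sizes: 4 reference (visual) tokens, a window of 3 output tokens.
cache = ReferenceWindowCache(reference_tokens=["v0", "v1", "v2", "v3"], window=3)
sizes = []
for step in range(10):
    cache.append(f"y{step}")
    sizes.append(cache.size())

print(sizes)            # cache size after each decoding step
print(cache.visible())  # what the next step can attend to
```

Once the queue fills, the cache size stops growing: every later step sees all reference tokens plus only the most recent outputs, so the per-step cost stays bounded however long generation runs.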
On OmniDocBench v1.6, Unlimited OCR reached a peak score of 93.92%, setting a new SOTA. When parsing long documents of more than 40 pages, its edit distance remained low. Furthermore, at 6,000 generated tokens, its inference throughput in tokens per second (TPS) was approximately 35% higher than that of DeepSeek OCR, with stable latency.
The industry is converging on document parsing. From DeepSeek's OCR2 to Zhipu's GLM-OCR, and now Baidu's Unlimited OCR, tech giants are investing heavily in this area. The core driver is simple: high-quality web text is running out, whereas valuable enterprise data remains locked inside PDFs, invoices, and blueprints. Advanced OCR is rapidly shifting from a utility to a critical multimodal data gateway in the LLM era.
[AgentUpdate Depth Analysis] The paradigm shift in OCR technology, represented by Baidu's Unlimited OCR, marks a crucial step in the evolution of AI agents from simple text comprehenders into multimodal perceptual entities. Traditional OCR systems serve as isolated tools, whereas Unlimited OCR's R-SWA mechanism overcomes the long-document bottleneck by keeping the KV cache for generated text at a constant size. In the AI agent ecosystem, an agent's perceptual fidelity sets the limits of its planning and execution. By transforming large PDFs and charts into continuous, high-fidelity tokens, Unlimited OCR equips agents with persistent "visual working memory." This architecture significantly lowers inference overhead, laying the foundation for "autonomous reading agents" capable of scanning entire manuals or conducting deep financial audits. It suggests that the future of agentic workflows depends on how efficiently models can ingest and reason over real-world document structures without exploding compute budgets.