omniparse

case study · rust · open source

Universal document parser for Rust: 35+ formats, zero runtime dependencies, streaming-safe.

01 · problem

Problem

Rust had no pure-Rust equivalent to Apache Tika — a single library that could take arbitrary bytes, figure out the format, and return usable text. Every pipeline I wrote for document ingestion was reimplementing the same detection, extraction, and safety code, badly.

02 · shape

Shape

One crate with a parser registry keyed on magic bytes, content sniffing, and extension fallback. Each format lives behind a small trait so the registry stays flat and testable. Streaming is first-class: the parsers yield chunks and never load the full payload unless they have to.
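To make that shape concrete, here is a minimal sketch using only the standard library. Every name in it is an assumption made for illustration (`FormatParser`, `Registry`, `ToyLines`, and the made-up `#!toy` signature); none of them is taken from the crate's actual API. It shows a flat registry of boxed trait objects and a parser that hands text to a callback one chunk at a time instead of buffering the whole input.

```rust
use std::io::{self, BufRead};

/// Hypothetical per-format trait; the crate's real trait may look different.
pub trait FormatParser {
    /// Short identifier for the format.
    fn name(&self) -> &'static str;
    /// Magic-byte prefix this parser claims.
    fn magic(&self) -> &'static [u8];
    /// Streams extracted text to `sink` one chunk at a time.
    fn parse(&self, input: &mut dyn BufRead, sink: &mut dyn FnMut(&str)) -> io::Result<()>;
}

/// Toy format (invented for this sketch): a `#!toy` header line, then text lines.
pub struct ToyLines;

impl FormatParser for ToyLines {
    fn name(&self) -> &'static str {
        "toy-lines"
    }

    fn magic(&self) -> &'static [u8] {
        b"#!toy\n"
    }

    fn parse(&self, input: &mut dyn BufRead, sink: &mut dyn FnMut(&str)) -> io::Result<()> {
        // One reusable buffer: memory use is bounded by the longest line,
        // not by the size of the whole document.
        let mut line = String::new();
        let mut is_header = true;
        loop {
            line.clear();
            if input.read_line(&mut line)? == 0 {
                break; // end of input
            }
            if is_header {
                is_header = false; // drop the signature line
                continue;
            }
            sink(line.trim_end_matches(|c: char| c == '\r' || c == '\n'));
        }
        Ok(())
    }
}

/// Flat registry: a plain list of parsers, searched by magic-byte prefix.
pub struct Registry {
    parsers: Vec<Box<dyn FormatParser>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { parsers: Vec::new() }
    }

    pub fn register(&mut self, parser: Box<dyn FormatParser>) {
        self.parsers.push(parser);
    }

    /// Returns the first parser whose magic bytes prefix `head`.
    pub fn find(&self, head: &[u8]) -> Option<&dyn FormatParser> {
        self.parsers
            .iter()
            .map(|p| p.as_ref())
            .find(|p| head.starts_with(p.magic()))
    }
}
```

Because each format is just another `Box<dyn FormatParser>` in a list, a parser can be tested in isolation by feeding it a `std::io::Cursor` and collecting what reaches the sink.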

03 · build

Build

Detection runs in a fixed order: magic bytes first, content analysis second, extension last, so renaming a file cannot poison the result. Archives (zip, tar) and XML inherit bomb protection: depth caps, expansion-ratio caps, and a per-entry byte budget. v0.2.0 landed in November 2025 and added HTML, CSS, RTF, XLSX, PPTX, ODS, ODP, XLS, DOC, and PPT behind the same registry.
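The detection order and the three bomb limits can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `Format`, `detect`, `Limits`, `LimitError`, and `check_entry` are hypothetical names, only three formats are covered, and the limit values are chosen by the caller. The ZIP (`PK\x03\x04`) and PDF (`%PDF-`) signatures are the standard ones.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Format {
    Zip,
    Pdf,
    Html,
    Unknown,
}

/// Step 1: well-known file signatures.
fn by_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else if head.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else {
        None
    }
}

/// Step 2: content sniffing for formats that have no fixed signature.
fn by_content(head: &[u8]) -> Option<Format> {
    let text = String::from_utf8_lossy(head).to_ascii_lowercase();
    let text = text.trim_start();
    if text.starts_with("<!doctype html") || text.starts_with("<html") {
        Some(Format::Html)
    } else {
        None
    }
}

/// Step 3: the file extension, consulted only when the bytes say nothing.
fn by_extension(name: &str) -> Option<Format> {
    match name.rsplit_once('.')?.1.to_ascii_lowercase().as_str() {
        "zip" => Some(Format::Zip),
        "pdf" => Some(Format::Pdf),
        "html" | "htm" => Some(Format::Html),
        _ => None,
    }
}

/// Fixed order: magic bytes, then content, then extension.
pub fn detect(head: &[u8], name: &str) -> Format {
    by_magic(head)
        .or_else(|| by_content(head))
        .or_else(|| by_extension(name))
        .unwrap_or(Format::Unknown)
}

/// Hypothetical bomb-protection limits; the values are the caller's choice.
pub struct Limits {
    pub max_depth: u32,
    pub max_ratio: u64,
    pub max_entry_bytes: u64,
}

#[derive(Debug, PartialEq)]
pub enum LimitError {
    TooDeep,
    EntryTooLarge,
    RatioExceeded,
}

/// Checks one archive entry against the depth, byte-budget, and ratio caps.
pub fn check_entry(
    limits: &Limits,
    depth: u32,
    compressed: u64,
    expanded: u64,
) -> Result<(), LimitError> {
    if depth > limits.max_depth {
        return Err(LimitError::TooDeep);
    }
    if expanded > limits.max_entry_bytes {
        return Err(LimitError::EntryTooLarge);
    }
    // Multiply instead of dividing so a zero compressed size cannot panic.
    if expanded > compressed.max(1).saturating_mul(limits.max_ratio) {
        return Err(LimitError::RatioExceeded);
    }
    Ok(())
}
```

With this ordering, `detect(b"%PDF-1.7\n", "report.zip")` yields `Format::Pdf`, because the extension is never consulted once the bytes have identified the format. A real guard would presumably count bytes while decompressing rather than trust the sizes declared in archive headers; `check_entry` only shows the comparisons themselves.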

figure · service topology

04 · result

Result

140k+ downloads on crates.io. Benchmarks from the repo's PERFORMANCE_BENCHMARK_REPORT.md: HTML at 1 MB parses in under 0.6 ms (target 100 ms), a 10k-cell XLSX in under 0.9 ms (target 500 ms), a 100-slide PPTX in under 0.6 ms (target 1000 ms). Memory scales linearly with input size and the streaming path stays under 100 MB for a 50 MB file.

140k+ · crates.io downloads

35+ · supported formats

<1 ms · HTML 1 MB parse

<100 MB · peak memory on 50 MB input

stack

Rust · pdfium-render · calamine · scraper · zip · crates.io