case study · rust · open source

Universal document parser for Rust: 35+ formats, zero runtime dependencies, streaming-safe.

01 · problem

Problem

Rust had no pure-Rust equivalent to Apache Tika — a single library that could take arbitrary bytes, figure out the format, and return usable text. Every pipeline I wrote for document ingestion was reimplementing the same detection, extraction, and safety code, badly.

02 · shape

Shape

One crate with a parser registry keyed on magic bytes, content sniffing, and extension fallback. Each format lives behind a small trait so the registry stays flat and testable. Streaming is first-class: parsers yield chunks and never load the full payload unless they have to.
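
To make the shape concrete, here is a minimal sketch of a per-format trait, a flat registry, and a chunked parser. The names `FormatParser`, `PlainText`, and `Registry` are hypothetical and chosen for illustration; they are not omniparse's actual API, and the 4 KiB chunk size is an arbitrary assumption.

```rust
use std::io::{self, Read};

/// Hypothetical per-format trait; the real crate's trait may look different.
pub trait FormatParser {
    /// MIME type this parser handles.
    fn mime(&self) -> &'static str;
    /// Reads `input` incrementally and hands each extracted text chunk to `sink`.
    fn parse(&self, input: &mut dyn Read, sink: &mut dyn FnMut(&str)) -> io::Result<()>;
}

/// Plain-text parser that reads fixed-size chunks instead of the whole payload.
pub struct PlainText;

impl FormatParser for PlainText {
    fn mime(&self) -> &'static str {
        "text/plain"
    }

    fn parse(&self, input: &mut dyn Read, sink: &mut dyn FnMut(&str)) -> io::Result<()> {
        let mut buf = [0u8; 4096];
        loop {
            let n = input.read(&mut buf)?;
            if n == 0 {
                return Ok(());
            }
            // Lossy conversion keeps the sketch short. A real parser would carry
            // partial UTF-8 sequences across chunk boundaries.
            sink(&String::from_utf8_lossy(&buf[..n]));
        }
    }
}

/// Flat registry: a list of parsers looked up by MIME type.
#[derive(Default)]
pub struct Registry {
    parsers: Vec<Box<dyn FormatParser>>,
}

impl Registry {
    pub fn register(&mut self, parser: Box<dyn FormatParser>) {
        self.parsers.push(parser);
    }

    pub fn get(&self, mime: &str) -> Option<&dyn FormatParser> {
        for p in &self.parsers {
            if p.mime() == mime {
                return Some(p.as_ref());
            }
        }
        None
    }
}
```

Because each parser only sees a `Read` and a sink callback, a test can feed it an in-memory byte slice and collect the chunks into a `String`, which is what keeps a registry like this easy to test.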

03 · build

Build

Detection runs in a fixed order (magic bytes first, content analysis second, extension last) so a renamed file cannot poison the result. Archives (zip, tar) and XML inherit bomb protection: depth caps, expansion-ratio caps, and a per-entry byte budget. v0.2.0 landed in November 2025 and added HTML, CSS, RTF, XLSX, PPTX, ODS, ODP, XLS, DOC, and PPT behind the same registry.
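
The three bomb-protection caps can be sketched as a small guard that is consulted before each decompressed chunk is accepted. This is an illustration under assumptions: `Limits`, `EntryGuard`, and `BombError` are hypothetical names, and the crate's real defaults and types are not shown in this write-up.

```rust
/// Hypothetical limits; any values used with this are illustrative only.
#[derive(Clone, Copy, Debug)]
pub struct Limits {
    /// Maximum nesting depth (archive inside archive, or XML element depth).
    pub max_depth: u32,
    /// Uncompressed bytes allowed per compressed byte.
    pub max_ratio: u64,
    /// Hard ceiling on the uncompressed size of one entry.
    pub max_entry_bytes: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BombError {
    TooDeep,
    RatioExceeded,
    EntryTooLarge,
}

/// Tracks a single entry while it is being decompressed.
pub struct EntryGuard {
    limits: Limits,
    compressed_len: u64,
    written: u64,
}

impl EntryGuard {
    /// Rejects the entry up front if it is nested deeper than allowed.
    pub fn new(limits: Limits, depth: u32, compressed_len: u64) -> Result<Self, BombError> {
        if depth > limits.max_depth {
            return Err(BombError::TooDeep);
        }
        Ok(Self { limits, compressed_len, written: 0 })
    }

    /// Call before accepting each decompressed chunk of `chunk_len` bytes.
    pub fn admit(&mut self, chunk_len: u64) -> Result<(), BombError> {
        let total = self.written.saturating_add(chunk_len);
        if total > self.limits.max_entry_bytes {
            return Err(BombError::EntryTooLarge);
        }
        // Treat an empty compressed entry as one byte so the ratio cap still applies.
        let allowed = self.compressed_len.max(1).saturating_mul(self.limits.max_ratio);
        if total > allowed {
            return Err(BombError::RatioExceeded);
        }
        self.written = total;
        Ok(())
    }
}
```

Checking before the chunk is accepted, rather than after the entry is fully expanded, is what keeps a hostile archive from consuming memory before it is rejected.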

Figure: omniparse (detect, dispatch, parse). A file (path or bytes) is detected via magic bytes, content heuristics, or extension fallback. The type detector returns a MIME type with a confidence in [0, 1]. The ParserRegistry dispatches to one of four parser categories (text, document, image, archive) and emits an ExtractionResult { content, metadata, confidence }. Diagram banner: Rust · 25+ formats · no FFI.
figure · service topology
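
The detection stage in the figure can be sketched as three functions tried in a fixed order, each returning a MIME type with a confidence. Everything here is an assumption for illustration: the `Detection` struct, the function names, the short signature table, and the confidence values (0.95, 0.8, 0.3) are not taken from the crate.

```rust
/// Detection result: a MIME type plus a confidence in [0, 1].
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub mime: &'static str,
    pub confidence: f32,
}

/// A few well-known file signatures (PDF, ZIP local file header, PNG).
const MAGIC: &[(&[u8], &str)] = &[
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
];

fn by_magic(bytes: &[u8]) -> Option<Detection> {
    for &(sig, mime) in MAGIC {
        if bytes.starts_with(sig) {
            return Some(Detection { mime, confidence: 0.95 });
        }
    }
    None
}

fn by_content(bytes: &[u8]) -> Option<Detection> {
    // Only sniff the first 512 bytes; lossy decoding tolerates a split UTF-8 sequence.
    let head = &bytes[..bytes.len().min(512)];
    let text = String::from_utf8_lossy(head).trim_start().to_ascii_lowercase();
    if text.starts_with("<?xml") {
        Some(Detection { mime: "application/xml", confidence: 0.8 })
    } else if text.starts_with("<!doctype html") || text.starts_with("<html") {
        Some(Detection { mime: "text/html", confidence: 0.8 })
    } else {
        None
    }
}

fn by_extension(name: &str) -> Option<Detection> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    let mime = match ext.as_str() {
        "pdf" => "application/pdf",
        "html" | "htm" => "text/html",
        "txt" => "text/plain",
        "zip" => "application/zip",
        _ => return None,
    };
    Some(Detection { mime, confidence: 0.3 })
}

/// Magic bytes first, content second, extension last.
pub fn detect(bytes: &[u8], file_name: Option<&str>) -> Option<Detection> {
    by_magic(bytes)
        .or_else(|| by_content(bytes))
        .or_else(|| file_name.and_then(by_extension))
}
```

With this ordering, a PDF renamed to `notes.txt` is still reported as `application/pdf`, because the extension is consulted only when the bytes themselves give no answer.
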
04 · result

Result

140k+ downloads on crates.io. Benchmarks from the repo's PERFORMANCE_BENCHMARK_REPORT.md: HTML at 1 MB parses in under 0.6 ms (target 100 ms), a 10k-cell XLSX in under 0.9 ms (target 500 ms), a 100-slide PPTX in under 0.6 ms (target 1000 ms). Memory scales linearly with input size and the streaming path stays under 100 MB for a 50 MB file.

140k+ · crates.io downloads
35+ · supported formats
<1 ms · HTML 1 MB parse
<100 MB · peak memory on 50 MB input

stack

Rust · pdfium-render · calamine · scraper · zip · crates.io