Universal document parser for Rust: 35+ formats, zero runtime dependencies, streaming-safe.
Problem
Rust had no pure-Rust equivalent to Apache Tika — a single library that could take arbitrary bytes, figure out the format, and return usable text. Every pipeline I wrote for document ingestion was reimplementing the same detection, extraction, and safety code, badly.
Shape
One crate with a parser registry keyed on magic bytes, content sniffing, and extension fallback. Each format lives behind a small trait so the registry stays flat and testable. Streaming is first-class: parsers yield chunks and never load the full payload unless they have to.
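A minimal sketch of that shape, using only the standard library. This is not the crate's actual API: `FormatParser`, `PlainText`, and `Registry` are hypothetical names, and the NUL-byte check stands in for real format matching.

```rust
use std::io::{self, Read};

/// Hypothetical per-format trait; the crate's real trait may look different.
pub trait FormatParser {
    /// Short format name, e.g. "text".
    fn name(&self) -> &'static str;
    /// Returns true if the leading bytes look like this format.
    fn matches(&self, head: &[u8]) -> bool;
    /// Reads from `input` and hands extracted text to `sink` chunk by chunk.
    fn parse(&self, input: &mut dyn Read, sink: &mut dyn FnMut(&str)) -> io::Result<()>;
}

/// Toy parser: treats anything without NUL bytes as plain text.
pub struct PlainText;

impl FormatParser for PlainText {
    fn name(&self) -> &'static str {
        "text"
    }

    fn matches(&self, head: &[u8]) -> bool {
        !head.contains(&0)
    }

    fn parse(&self, input: &mut dyn Read, sink: &mut dyn FnMut(&str)) -> io::Result<()> {
        // Fixed-size buffer: memory use does not grow with the payload.
        let mut buf = [0u8; 4096];
        loop {
            let n = input.read(&mut buf)?;
            if n == 0 {
                return Ok(());
            }
            // Lossy on purpose: a multi-byte character split across two
            // chunks is replaced rather than buffered. Fine for a sketch.
            sink(&String::from_utf8_lossy(&buf[..n]));
        }
    }
}

/// Flat registry: an ordered list of parsers, first match wins.
pub struct Registry {
    parsers: Vec<Box<dyn FormatParser>>,
}

impl Registry {
    pub fn new() -> Self {
        Self { parsers: Vec::new() }
    }

    pub fn register(&mut self, parser: Box<dyn FormatParser>) {
        self.parsers.push(parser);
    }

    /// Returns the first registered parser that accepts the leading bytes.
    pub fn find(&self, head: &[u8]) -> Option<&dyn FormatParser> {
        for parser in &self.parsers {
            if parser.matches(head) {
                return Some(parser.as_ref());
            }
        }
        None
    }
}
```

Because every format sits behind the same trait, a test can register a single parser and exercise the registry without touching the others.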
Build
Detection runs in a fixed order (magic bytes first, content analysis second, extension last), so a renamed file cannot poison the result. Archives (zip, tar) and XML inherit bomb protection: depth caps, expansion-ratio caps, and a per-entry byte budget. v0.2.0 landed in November 2025 and added HTML, CSS, RTF, XLSX, PPTX, ODS, ODP, XLS, DOC, and PPT behind the same registry.
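Both ideas can be sketched with the standard library alone. Everything named here (`Format`, `detect`, `BudgetReader`) is an assumption for illustration, not the crate's API, and the signature table is deliberately tiny. The first half shows the magic, then content, then extension order. The second half shows one of the three bomb limits, the per-entry byte budget, as a `Read` wrapper.

```rust
use std::io::{self, Read};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Zip,
    Rtf,
    Html,
    Text,
}

/// Step 1: magic bytes at the very start of the payload.
fn by_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if head.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else if head.starts_with(b"{\\rtf") {
        Some(Format::Rtf)
    } else {
        None
    }
}

/// Step 2: cheap content sniffing on the leading bytes.
fn by_content(head: &[u8]) -> Option<Format> {
    let text = String::from_utf8_lossy(head).to_ascii_lowercase();
    let text = text.trim_start();
    if text.starts_with("<!doctype html") || text.starts_with("<html") {
        Some(Format::Html)
    } else {
        None
    }
}

/// Step 3: file extension, consulted only when the bytes say nothing.
fn by_extension(name: &str) -> Option<Format> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(Format::Pdf),
        "zip" => Some(Format::Zip),
        "rtf" => Some(Format::Rtf),
        "html" | "htm" => Some(Format::Html),
        "txt" => Some(Format::Text),
        _ => None,
    }
}

/// Runs the three steps in fixed order; an earlier step always wins.
pub fn detect(head: &[u8], file_name: Option<&str>) -> Option<Format> {
    by_magic(head)
        .or_else(|| by_content(head))
        .or_else(|| file_name.and_then(by_extension))
}

/// Reader wrapper that errors once an entry yields more than its budget.
pub struct BudgetReader<R> {
    inner: R,
    remaining: u64,
}

impl<R: Read> BudgetReader<R> {
    pub fn new(inner: R, budget: u64) -> Self {
        Self { inner, remaining: budget }
    }
}

impl<R: Read> Read for BudgetReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // The check runs after the inner read, so at most one buffer's
        // worth of extra bytes is produced before the error is raised.
        if n as u64 > self.remaining {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "entry exceeds byte budget",
            ));
        }
        self.remaining -= n as u64;
        Ok(n)
    }
}
```

In this sketch, `detect(b"%PDF-1.7\n", Some("notes.txt"))` still returns `Some(Format::Pdf)`, which is the rename case the fixed order is meant to handle. A decompressor's output would be wrapped in `BudgetReader` so an oversized entry fails early instead of filling memory.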
Result
140k+ downloads on crates.io. Benchmarks from the repo's PERFORMANCE_BENCHMARK_REPORT.md: HTML at 1 MB parses in under 0.6 ms (target 100 ms), a 10k-cell XLSX in under 0.9 ms (target 500 ms), a 100-slide PPTX in under 0.6 ms (target 1000 ms). Memory scales linearly with input size and the streaming path stays under 100 MB for a 50 MB file.
Stack
Rust, pdfium-render, calamine, scraper, zip, crates.io