marker

Convert PDF to markdown quickly with high accuracy

GPL-3.0 License

Downloads: 18.7K
Stars: 15.5K

marker - Significant speedup (Latest Release)

Published by VikParuchuri 4 months ago

This release brings a 15% speedup on GPU, 3x on CPU, and 7x on MPS. The speedup comes from new surya models for layout and text detection that are much more efficient.

This is a "best case" speedup. If you need to OCR or do equation recognition, the speedup will be lower, but it will still be a lot faster.

marker - Fix transformers bugs

Published by VikParuchuri 4 months ago

  • New transformers version introduces a new kwarg in donut models. Handle this case by ignoring it.
  • New transformers version breaks MPS compatibility by using torch.isin to do a comparison. Handle this by enabling the PyTorch MPS fallback setting.
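The MPS fix above most likely refers to PyTorch's `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which lets operations that MPS does not support run on the CPU instead of raising. A minimal sketch of setting it, assuming it must be in place before torch is first imported; the helper name `enable_mps_fallback` is hypothetical and is not marker's actual code.

```python
import os


def enable_mps_fallback() -> str:
    """Ask PyTorch to fall back to the CPU for ops that MPS lacks.

    Hypothetical helper for illustration only.
    """
    # setdefault keeps any value the user has already exported.
    return os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


# Call this before the first `import torch`; the assumption here is that
# PyTorch reads the variable when it is imported.
enable_mps_fallback()
```

The same effect can be had by exporting the variable in the shell before launching Python.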

marker - Pagination, bug fixes

Published by VikParuchuri 4 months ago

  • Add a setting to enable output pagination
  • Enable convert.py to use MPS (though it is less memory efficient than CPU/CUDA)
  • Fix bug with inference ram setting
  • Fix bug with pdf names with dots in them
  • Fix bug with images at the end of blocks
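The filename fix in this release matches a common pitfall: splitting a name on its first dot truncates files such as `report.v2.final.pdf`. The sketch below shows the general problem and one way around it. It makes no claim about marker's actual code, and both function names are hypothetical.

```python
import os


def naive_stem(pdf_path: str) -> str:
    # Buggy approach: everything after the FIRST dot is dropped.
    return os.path.basename(pdf_path).split(".")[0]


def output_stem(pdf_path: str) -> str:
    # splitext strips only the final extension, so inner dots survive.
    return os.path.splitext(os.path.basename(pdf_path))[0]


print(naive_stem("docs/report.v2.final.pdf"))   # report
print(output_stem("docs/report.v2.final.pdf"))  # report.v2.final
```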

marker - Fix convert.py bug

Published by VikParuchuri 5 months ago

Fix model device check.

marker - Specify page range

Published by VikParuchuri 5 months ago

  • Make it clearer that MPS can't be used with convert.py
  • Specify page range in convert with start_page and max_pages
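As a rough illustration of how a `start_page` / `max_pages` pair can select a page range, here is a minimal sketch. It assumes `start_page` is a 0-based index and `max_pages` caps the number of pages taken from that point; the release notes do not spell out either detail, and `select_pages` is a hypothetical helper rather than marker's API.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_pages(
    pages: Sequence[T],
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> List[T]:
    """Return the slice of pages described by start_page and max_pages."""
    start = start_page or 0  # assumption: 0-based, defaults to the first page
    end = len(pages) if max_pages is None else start + max_pages
    return list(pages[start:end])


pages = list(range(10))  # stand-in for ten page objects
print(select_pages(pages, start_page=2, max_pages=3))  # [2, 3, 4]
print(select_pages(pages, max_pages=4))                # [0, 1, 2, 3]
print(select_pages(pages, start_page=8))               # [8, 9]
```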

marker - Python 3.12 compatibility

Published by VikParuchuri 5 months ago

  • Remove ray to enable python 3.12 compatibility
  • Removing ray frees a lot of VRAM (since we can use torch shared tensors), so with convert.py each process now takes about 3GB of VRAM on average (down from between 4.5GB and 5GB). This enables much higher throughput.
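To see why lower per-process VRAM raises throughput, consider how many worker processes fit on one GPU. The arithmetic below uses the per-process figures from the note above; the 24 GB card is a hypothetical example, and the calculation ignores any other memory overhead.

```python
def max_processes(total_vram_gb: float, per_process_gb: float) -> int:
    """Number of worker processes that fit in the available VRAM."""
    return int(total_vram_gb // per_process_gb)


TOTAL_VRAM_GB = 24  # hypothetical GPU size, not from the release notes

for per_process in (5.0, 4.5, 3.0):
    print(per_process, max_processes(TOTAL_VRAM_GB, per_process))
# 5.0 4
# 4.5 5
# 3.0 8
```

Under these assumptions the same card runs 8 processes instead of 4 or 5.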

marker - OCR speedups

Published by VikParuchuri 5 months ago

  • Pull in new surya and pdftext versions for speedups in OCR and text extraction, respectively
  • Refine heuristics to reduce OCR false positives (and true positives, unfortunately)
  • Enable float batch multipliers
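A float batch multiplier lets batch sizes be scaled down as well as up, which is useful on devices with little memory. A minimal sketch of the idea, assuming the multiplier scales a base batch size and the result is truncated to an integer with a floor of 1; `scaled_batch_size` is a hypothetical name and marker's real logic may differ.

```python
def scaled_batch_size(base_batch_size: int, multiplier: float) -> int:
    """Scale a base batch size by a float multiplier, never going below 1."""
    return max(1, int(base_batch_size * multiplier))


print(scaled_batch_size(8, 1.5))   # 12
print(scaled_batch_size(8, 0.5))   # 4
print(scaled_batch_size(2, 0.25))  # 1
```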

marker - Speed improvements

Published by VikParuchuri 5 months ago

  • Enable parallel text extraction, with worker count settings
  • Bump surya version to pull in layout/line segmentation speed improvements, and OCR bug fix
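Parallel extraction with a configurable worker count can be sketched with the standard library. This is an illustration of the pattern only, not marker's implementation: `extract_all` and `fake_extract` are hypothetical names, and threads are used here to keep the example self-contained (CPU-bound extraction would more likely use processes).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def extract_all(
    pages: Sequence[str],
    extract: Callable[[str], str],
    workers: int = 4,
) -> List[str]:
    """Run `extract` over every page using up to `workers` threads."""
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        # Executor.map yields results in input order, so page order is kept.
        return list(pool.map(extract, pages))


def fake_extract(page: str) -> str:
    # Stand-in for real text extraction.
    return page.upper()


print(extract_all(["page one", "page two"], fake_extract, workers=2))
# ['PAGE ONE', 'PAGE TWO']
```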

marker - Faster OCR

Published by VikParuchuri 5 months ago

  • OCR is now ~2.5x faster, due to improvements in surya

marker - Speed up inference

Published by VikParuchuri 5 months ago

  • Faster OCR, line detection, and layout inference (from surya)
  • Unpin transformers version after testing

Should be significantly faster now, but haven't fully benchmarked, since I'm running low on time this week!

marker - Fix memory leak

Published by VikParuchuri 5 months ago

  • Fix a memory leak (fixed in surya, bumped the version). This caused high CPU memory usage on long docs.
  • Improve load_all_models to take device and dtype

marker - Marker v2

Published by VikParuchuri 6 months ago

Basically a full rewrite!

Main features:

  • Extracts and saves images
  • Improved table formatting
  • Better markdown wrapping
  • Better reading order on complex docs
  • Improved OCR engine with more language options
  • Simple pip package install (no more required system dependencies), so it can be used easily on Windows
  • Can be used commercially (pymupdf and layoutlmv3 dependencies removed)

It takes ~2x as long to run now, but seems like a decent tradeoff.

See the README for details.

Package Rankings
Top 6.64% on Proxy.golang.org
Top 35.67% on Pypi.org