Extractor

Extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG (retrieval-augmented generation) implementations, where external documents must be read as text before they can be vectorized.

Demo: https://extractor.gulu.ai

The demo may be slow to respond. It is for trial use only and should not be used in production.

Installation

Start the application server using Docker. The command below maps port 80 in the container to port 8080 on the host:

docker run -d --restart=always --name extractor \
     -p 8080:80 \
     mylxsw/extractor:1.0.0

API

Convert a PDF document to plain text by uploading it as a multipart form field named file:

curl -s -X POST http://127.0.0.1:8080/v1/extractor/file -F file=@'test.pdf'
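The same upload can be made from a program. The following is a minimal Python sketch using only the standard library, assuming the server is running at the address from the Docker example. The function names build_file_request and extract_file are hypothetical helpers, not part of Extractor. The response format is not documented here, so the sketch returns the response body as text without parsing it.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Assumption: the server address from the Docker example above.
BASE_URL = "http://127.0.0.1:8080"


def build_file_request(path, base_url=BASE_URL):
    """Build a multipart/form-data POST with a single part named 'file'."""
    path = Path(path)
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + path.read_bytes() + tail
    return urllib.request.Request(
        f"{base_url}/v1/extractor/file",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def extract_file(path, base_url=BASE_URL, timeout=60):
    """Upload the file and return the raw response body decoded as text."""
    request = build_file_request(path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the server running, extract_file("test.pdf") should return whatever the curl command above prints.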

Have the service download the document at a URL and convert it to plain text:

curl -s -X POST http://127.0.0.1:8080/v1/extractor/url -d 'url=https://example.com/test.pdf'
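For callers that prefer a script to curl, the URL endpoint can be reached with a form-encoded POST. This is a minimal Python sketch using only the standard library, assuming the server address from the Docker example. build_url_request and extract_url are hypothetical helper names, and the response is returned as undecoded text because its exact format is not documented here.

```python
import urllib.parse
import urllib.request

# Assumption: the server address from the Docker example above.
BASE_URL = "http://127.0.0.1:8080"


def build_url_request(document_url, base_url=BASE_URL):
    """Build a form-encoded POST carrying the document location in 'url'."""
    body = urllib.parse.urlencode({"url": document_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/url",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_url(document_url, base_url=BASE_URL, timeout=60):
    """Ask the service to fetch and convert the document; return the body as text."""
    request = build_url_request(document_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Unlike curl -d, urlencode percent-encodes the value, which keeps URLs containing & or = intact as a single form field.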

License

MIT
