Marker

#pdf,
#markdown,
#open-source

GitHub repository: https://github.com/VikParuchuri/marker

Journal entries linked to this note:

A colleague shared the Marker project with me (https://github.com/VikParuchuri/marker):

Marker converts documents to markdown, JSON, and HTML quickly and accurately.

Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages

Formats tables, forms, equations, inline math, links, references, and code blocks

Extracts and saves images

Removes headers/footers/other artifacts

Extensible with your own formatting and logic

Optionally boost accuracy with LLMs

Works on GPU, CPU, or MPS

source

Here is how Marker works:

Marker is a pipeline of deep learning models:

Extract text, OCR if necessary (heuristics, surya)

Detect page layout and find reading order (surya)

Clean and format each block (heuristics, texify, surya)

Optionally use an LLM to improve quality

Combine blocks and postprocess complete text

source
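To make the five steps above concrete for myself, here is a toy sketch of that kind of staged pipeline in plain Python. It is not Marker's code and does not use its API: the `Block` structure, the function names, and the `llm_pass` hook are all hypothetical, and the deep learning models (surya, texify) are replaced by trivial stand-ins such as sorting and whitespace cleanup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Block:
    """One layout block on a page (hypothetical structure, not Marker's)."""
    page: int
    order: int   # reading-order position within the page
    kind: str    # e.g. "heading", "text", "header", "footer"
    text: str


def extract_text(raw_pages: List[List[dict]]) -> List[Block]:
    """Step 1: turn raw page items into blocks.
    A real pipeline would run OCR here for pages without a text layer."""
    blocks = []
    for page_no, page in enumerate(raw_pages):
        for item in page:
            blocks.append(Block(page_no, item["order"], item["kind"], item["text"]))
    return blocks


def order_blocks(blocks: List[Block]) -> List[Block]:
    """Step 2: stand-in for layout detection, sorts by page then reading order."""
    return sorted(blocks, key=lambda b: (b.page, b.order))


def clean_blocks(blocks: List[Block]) -> List[Block]:
    """Step 3: drop headers/footers, normalize whitespace, format headings."""
    cleaned = []
    for b in blocks:
        if b.kind in ("header", "footer"):
            continue
        text = " ".join(b.text.split())
        if b.kind == "heading":
            text = "# " + text
        cleaned.append(Block(b.page, b.order, b.kind, text))
    return cleaned


def combine(blocks: List[Block]) -> str:
    """Step 5: join the blocks into one Markdown string."""
    return "\n\n".join(b.text for b in blocks) + "\n"


def convert(
    raw_pages: List[List[dict]],
    llm_pass: Optional[Callable[[List[Block]], List[Block]]] = None,
) -> str:
    """Run the stages in order; the LLM pass (step 4) is optional."""
    blocks = extract_text(raw_pages)
    blocks = order_blocks(blocks)
    blocks = clean_blocks(blocks)
    if llm_pass is not None:
        blocks = llm_pass(blocks)
    return combine(blocks)


# Made-up input: one page whose items arrive out of reading order.
pages = [[
    {"order": 1, "kind": "text", "text": "Marker  converts   documents."},
    {"order": 0, "kind": "heading", "text": "Intro"},
    {"order": 2, "kind": "footer", "text": "Page 1"},
]]
markdown = convert(pages)
print(markdown)
```

The point of the sketch is only the shape: each stage takes a list of blocks and returns a new one, so the optional LLM step can be plugged in or left out without touching the other stages.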