Journal entries linked to this note:

Journal entry for Wednesday, May 14, 2025 at 11:48 #JaiDécouvert, #OnMaPartagé, #OCR

A colleague shared the Marker project with me (https://github.com/VikParuchuri/marker):

Marker converts documents to markdown, JSON, and HTML quickly and accurately.

  • Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
  • Formats tables, forms, equations, inline math, links, references, and code blocks
  • Extracts and saves images
  • Removes headers/footers/other artifacts
  • Extensible with your own formatting and logic
  • Optionally boost accuracy with LLMs
  • Works on GPU, CPU, or MPS

source

Here is how Marker works:

Marker is a pipeline of deep learning models:

  • Extract text, OCR if necessary (heuristics, surya)
  • Detect page layout and find reading order (surya)
  • Clean and format each block (heuristics, texify, surya)
  • Optionally use an LLM to improve quality
  • Combine blocks and postprocess complete text

source
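
To make the idea of a staged pipeline more concrete, here is a minimal sketch in plain Python. It is not Marker's code or API: the `Block` structure, the stage functions, and `run_pipeline` are all hypothetical names, and each stage is a trivial stand-in for what the real models (surya, texify) would do. It only illustrates how blocks might flow through ordering, artifact removal, cleanup, and a final combine step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    """One layout block on a page (hypothetical structure, not Marker's)."""
    kind: str   # "heading", "text" or "footer"
    text: str
    y: float    # vertical position, a crude stand-in for reading order


# A stage takes a list of blocks and returns a new list of blocks.
Stage = Callable[[List[Block]], List[Block]]


def order_blocks(blocks: List[Block]) -> List[Block]:
    # Stand-in for layout detection and reading order: sort top to bottom.
    return sorted(blocks, key=lambda b: b.y)


def drop_artifacts(blocks: List[Block]) -> List[Block]:
    # Stand-in for header/footer removal.
    return [b for b in blocks if b.kind != "footer"]


def clean_blocks(blocks: List[Block]) -> List[Block]:
    # Stand-in for per-block cleanup: collapse runs of whitespace.
    return [Block(b.kind, " ".join(b.text.split()), b.y) for b in blocks]


def to_markdown(blocks: List[Block]) -> str:
    # Combine the blocks into a single Markdown string.
    parts = [f"# {b.text}" if b.kind == "heading" else b.text for b in blocks]
    return "\n\n".join(parts)


def run_pipeline(blocks: List[Block], stages: List[Stage]) -> str:
    # Apply each stage in order, then combine the result.
    for stage in stages:
        blocks = stage(blocks)
    return to_markdown(blocks)


STAGES: List[Stage] = [order_blocks, drop_artifacts, clean_blocks]

sample = [
    Block("text", "Marker  converts documents\nto markdown.", y=120.0),
    Block("footer", "Page 1", y=800.0),
    Block("heading", "Overview", y=40.0),
]
result = run_pipeline(sample, STAGES)
print(result)
```

Because the stages are just a list of functions, an optional step (such as the LLM pass mentioned above) could be modeled by appending one more function to `STAGES` without touching the others.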