Use case

PDF to Markdown for RAG

How to use mdcraft in retrieval and knowledge-ingestion workflows without pretending OCR is magic.

When the goal is retrieval or knowledge ingestion, the real job is not “OCR the file.” The real job is producing Markdown that is easier to inspect, chunk, and trust downstream.

Where mdcraft fits#

mdcraft is strongest when you need:

  • readable section order
  • recoverable headings and lists
  • Markdown that can be reviewed before indexing
  • warnings when the source document is ambiguous

Best-fit PDFs#

These usually work best:

  • text-first reports
  • manuals and guides
  • whitepapers
  • internal documents with clear section structure

Where review still matters#

Review is especially important for:

  • scans
  • multi-column layouts
  • dense tables
  • chart-heavy pages
  • PDFs that are visually designed rather than structurally authored

Suggested workflow#

A review-first sequence:

  1. Convert the PDF to Markdown.
  2. Review warnings and obvious structure issues.
  3. Fix headings, lists, or tables before indexing.
  4. Only then push the Markdown into your chunking or embedding pipeline.
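
The review step in the list above can be partly automated. The following is a minimal sketch, not part of mdcraft: it assumes the converted Markdown is already available as a string, and the function name `lint_markdown` and its two heuristics (skipped heading levels, uneven table rows) are hypothetical choices for illustration. It does not read mdcraft's own warnings, whose format is not described here.

```python
import re

# ATX-style heading: one to six '#' characters followed by text.
HEADING_RE = re.compile(r"^(#{1,6})\s+\S")


def lint_markdown(text):
    """Return a list of warnings about structure worth reviewing by hand.

    Hypothetical helper: a rough heuristic that ignores fenced code
    blocks and escaped pipes.
    """
    warnings = []
    lines = text.splitlines()

    # Collect heading levels in document order.
    levels = []
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            levels.append(len(match.group(1)))

    if not levels:
        warnings.append("no headings found")

    # A jump such as level 1 -> level 3 often means a heading was missed.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            warnings.append(f"heading level jumps from {prev} to {cur}")

    # Rows in one contiguous pipe table should have the same cell count.
    table_width = None
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("|"):
            width = stripped.strip("|").count("|") + 1
            if table_width is None:
                table_width = width
            elif width != table_width:
                warnings.append(
                    f"line {number}: table row has {width} cells, "
                    f"expected {table_width}"
                )
        else:
            table_width = None

    return warnings
```

An empty result does not mean the document is correct, only that these two checks found nothing. A reasonable gate is to hold back any file with warnings until someone has looked at it.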

Why Markdown helps#

Markdown gives you something easier to:

  • diff
  • version
  • inspect manually
  • clean up with scripts
  • feed into retrieval pipelines

That makes Markdown a better intermediate representation than raw OCR text when retrieval quality matters: structure problems stay visible and fixable before anything is indexed.
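
One way recovered headings pay off downstream is heading-aware chunking. The sketch below is an illustration under stated assumptions, not mdcraft functionality: it assumes ATX-style headings (`#`, `##`, and so on), and the function name `chunk_by_heading` and the `path`/`text` keys are hypothetical. Each chunk carries the trail of headings above it, which can be stored as metadata alongside the embedding.

```python
import re

# ATX-style heading: capture the '#' run and the heading title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(text):
    """Split Markdown into one chunk per heading section.

    Hypothetical helper: returns dicts with the heading trail ("path")
    and the section body ("text"). Sections with no body are skipped.
    """
    chunks = []
    path = []  # stack of (level, title) for the headings currently open
    body = []  # lines collected since the last heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            trail = " > ".join(title for _, title in path)
            chunks.append({"path": trail, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)

    flush()
    return chunks
```

Because the chunk boundaries come from the headings, a wrong or missing heading in the Markdown shows up directly as a bad chunk, which is another reason to fix structure before indexing.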