# mdcraft.ai Benchmark Suite

## Goal
Create a repeatable benchmark set that determines whether the MVP is genuinely better than common free converters and manual copy-paste workflows.
Run the benchmark before launch and again after any major rendering or extraction change.
## Target size
- 20 to 30 total documents
- Split across forward conversion and reverse conversion
- Include both everyday examples and failure-prone edge cases
## Benchmark categories

### A. Markdown -> PDF (+ shared HTML export checks)
- Developer documentation
  - API tables
  - code fences with long lines
  - Mermaid diagrams
  - nested lists
  - footnotes
- AI-generated product documents
  - PRDs
  - strategy memos
  - meeting summaries
  - callouts and task lists
- Consulting and business reports
  - title page
  - section dividers
  - images
  - quotes
  - executive summary layouts
- Student and educator documents
  - math
  - citations
  - lecture notes
  - dense headings
  - print-focused page counts
- Layout stress tests
  - wide tables
  - side-by-side image cases
  - code-plus-table on the same page
  - long TOCs
### B. PDF -> Markdown
- Text-first reports
- Whitepapers with headings and lists
- Documentation exports with code blocks
- Moderate table-heavy PDFs
- A few known-bad layout-heavy PDFs used to test graceful failure
## Quality rubric
Each benchmark document should be scored on a 1-5 scale across these dimensions.
### Forward conversion rubric
- Visual quality
  - typography looks polished
  - spacing feels deliberate
  - hierarchy is obvious
- Layout stability
  - no broken page breaks
  - no clipped tables or code
  - images remain aligned
- Syntax fidelity
  - headings, lists, tables, code, Mermaid, math, and footnotes render correctly
- Preview/export parity
  - preview and final PDF/HTML match closely
- Share-readiness
  - output looks safe to send externally without extra cleanup
### Reverse conversion rubric
- Structural accuracy
  - headings, lists, tables, quotes, and code fences are reconstructed correctly
- Readability
  - markdown is easy to read and edit
- Cleanup burden
  - low manual cleanup required for text-first documents
- Failure clarity
  - ambiguous extraction is surfaced for review instead of silently mangling content
- Reusability
  - output is genuinely useful for editors, repos, AI tools, or docs systems
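The rubric dimensions above can be captured in a simple per-document scoring record so results stay comparable across runs and scorers. A minimal sketch in TypeScript; every type and field name here is illustrative, not part of the mdcraft.ai codebase:

```typescript
// Hypothetical per-document scoring record for the 1-5 rubric.
type Score = 1 | 2 | 3 | 4 | 5;

interface ForwardScores {
  visualQuality: Score;
  layoutStability: Score;
  syntaxFidelity: Score;
  previewExportParity: Score;
  shareReadiness: Score;
}

interface ReverseScores {
  structuralAccuracy: Score;
  readability: Score;
  cleanupBurden: Score;
  failureClarity: Score;
  reusability: Score;
}

// Average a list of dimension scores for summary reporting.
function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

Keeping each dimension as an explicit field (rather than a free-form map) makes it obvious when a scorer skipped a dimension.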
## Minimum pass bar for MVP
### Markdown -> PDF (+ shared HTML export checks)
- Average score of 4 or better in visual quality and syntax fidelity
- No catastrophic failures on code, tables, images, Mermaid, or math in the benchmark set
- At least 80 percent of benchmark docs rated "ready to share" without manual restyling
### PDF -> Markdown
- At least 65 percent of text-first PDFs converted into usable markdown with light cleanup
- Known-bad layout-heavy samples must fail gracefully and clearly
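The pass bar can be encoded as a mechanical release-gate check so the thresholds are applied consistently rather than by eye. A sketch using the numbers above; the summary shapes and function name are assumptions, not the actual `test:benchmark:gate` implementation:

```typescript
// Hypothetical summaries computed from benchmark scores (names illustrative).
interface ForwardSummary {
  avgVisualQuality: number;
  avgSyntaxFidelity: number;
  catastrophicFailures: number; // code/tables/images/Mermaid/math blowups
  shareReadyFraction: number;   // 0..1, docs rated "ready to share"
}

interface ReverseSummary {
  usableTextFirstFraction: number; // 0..1, text-first PDFs usable with light cleanup
  knownBadFailedGracefully: boolean;
}

// Apply the MVP thresholds from the spec above.
function passesMvpBar(fwd: ForwardSummary, rev: ReverseSummary): boolean {
  return (
    fwd.avgVisualQuality >= 4 &&
    fwd.avgSyntaxFidelity >= 4 &&
    fwd.catastrophicFailures === 0 &&
    fwd.shareReadyFraction >= 0.8 &&
    rev.usableTextFirstFraction >= 0.65 &&
    rev.knownBadFailedGracefully
  );
}
```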
## Reference competitors to compare against
- Pandoc-based export workflow
- Typora export
- RenderMark
- markdown-to-pdf.org
- naive copy-paste from browser or PDF
## Test execution process
1. Run each input through mdcraft.ai.
2. Run the same input through the baseline competitors.
3. Score the outputs using the rubric.
4. Capture screenshots or output files for visual comparison.
5. Log issues by category: tables, code, pagination, math, Mermaid, extraction.
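The issue log from the last step is easiest to keep as a tally per category, which later feeds fix prioritization. A minimal sketch; the category list mirrors the step above and the helper name is hypothetical:

```typescript
// Issue categories as named in the test execution process.
type IssueCategory =
  | "tables"
  | "code"
  | "pagination"
  | "math"
  | "mermaid"
  | "extraction";

// Count logged issues per category (hypothetical helper).
function tallyIssues(issues: IssueCategory[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const issue of issues) {
    counts[issue] = (counts[issue] ?? 0) + 1;
  }
  return counts;
}
```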
## Files to collect next
- 6 markdown-heavy technical docs
- 5 AI-generated markdown docs
- 4 consulting-style reports
- 4 student notes or math-heavy docs
- 5 PDFs with varying complexity
## Release rule
If mdcraft.ai is not clearly better on polish and at least comparable on correctness, do not expand scope. Fix the benchmark failures first.
## Automation notes
- The repository includes `npm run test:benchmark`.
- The runner executes the full benchmark corpus and writes JSON results to `handoff/benchmark-results/latest.json`.
- The report includes:
  - forward and reverse score summaries
  - corpus coverage by fixture category
  - issue category counts (for example: `tables`, `lists`, `code`, `mermaid`, `math`, `layout`, `ocr`)
- Use issue category counts to prioritize rendering and extraction fixes each sprint.
- Run `npm run test:benchmark:gate` to enforce benchmark release thresholds.
- Run `npm run test:benchmark:golden` to refresh forward-fixture visual goldens in `handoff/benchmark-goldens`.
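Prioritizing fixes from the issue category counts can be as simple as sorting the report's counts. A sketch assuming a hypothetical shape for the results JSON; none of these field names are confirmed by the repository:

```typescript
// Hypothetical shape of the benchmark results report (field names assumed).
interface BenchmarkReport {
  forward: { averages: Record<string, number> };
  reverse: { averages: Record<string, number> };
  coverage: Record<string, number>;    // fixture count per category
  issueCounts: Record<string, number>; // e.g. tables, code, mermaid
}

// Return the n categories with the most logged issues, worst first.
function topIssueCategories(report: BenchmarkReport, n: number): string[] {
  const categories = Object.keys(report.issueCounts);
  categories.sort((a, b) => report.issueCounts[b] - report.issueCounts[a]);
  return categories.slice(0, n);
}
```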