☁️ The fastest HTML to Markdown converter built with JavaScript. Optimized for LLMs and supports streaming.
Made possible by my Sponsor Program 💖 Follow me @harlan_zw 🐦 • Join Discord for help
| Input Size | Rust (native) | mdream | Turndown | node-html-markdown |
|---|---|---|---|---|
| **160 KB** | 1.4ms | **3.2ms** | 11.7ms | 15.0ms |
| **420 KB** | 1.9ms | **6.6ms** | 14.0ms | 18.1ms |
| **1.8 MB** | 21ms | **60ms** | 295ms | 28,600ms |
See the Benchmark methodology for more details.
A zero-dependency alternative to Turndown, node-html-markdown, and html-to-markdown, built specifically for LLM input.
Traditional HTML to Markdown converters were not built for LLMs or humans. They tend to be slow and bloated, and produce output that's poorly suited for LLM token usage or human readability.
Other LLM-specific converters focus on supporting every document format, resulting in larger bundles and lower-quality Markdown output.
The mdream core is a highly optimized primitive for producing LLM-friendly Markdown from HTML. On top of it, mdream ships several packages to generate LLM artifacts such as llms.txt for your own sites, or to generate LLM context for any project you're working with.
Mdream is built to run anywhere, for any project or use case, and is available in the following packages:
| Package | Description |
|---|---|
| mdream | HTML to Markdown converter, usable anywhere: browser, edge runtime, Node, etc. Includes a CLI for stdin conversion and a package API. Minimal: no dependencies. |
| mdream (CDN) | Use mdream directly in browsers via unpkg/jsDelivr without any build step. |
| @mdream/crawl | Site-wide crawler to generate llms.txt artifacts from entire websites. |
| mdream Docker image | Pre-built Docker image with Playwright Chrome for containerized website crawling. |
| @mdream/vite | Generate automatic .md for your own Vite sites. |
| @mdream/nuxt | Generate automatic .md and llms.txt artifacts for Nuxt sites. |
| @mdream/action | Generate .md and llms.txt artifacts from your static .html output. |
```bash
pnpm add mdream
```
```js
import { htmlToMarkdown } from 'mdream'

const markdown = htmlToMarkdown('<h1>Hello World</h1>')
console.log(markdown) // # Hello World
```
**Core Functions:**
See the API Usage section for complete details.
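As a quick sketch of how the core function composes with the minimal preset, mirroring the imports used in the embeddings example below (the option names and composition here are assumptions, so verify against the API Usage docs):

```ts
import { htmlToMarkdown } from 'mdream'
import { withMinimalPreset } from 'mdream/preset/minimal'

// Sketch only: `withMinimalPreset` wraps the options object, as in the
// embeddings example below; `origin` resolves relative links and images.
// Exact option names may differ — check the API Usage docs.
const html = '<article><h1>Docs</h1><a href="/guide">Guide</a></article>'
const markdown = htmlToMarkdown(html, withMinimalPreset({
  origin: 'https://example.com',
}))
console.log(markdown)
```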
Need something that works in the browser or an edge runtime? Use Mdream.
The @mdream/crawl package crawls an entire site, generating LLM artifacts using mdream for Markdown conversion.
Individual Markdown files are written to the md/ directory.

```bash
# Interactive
npx @mdream/crawl

# Simple
npx @mdream/crawl https://harlanzw.com

# Glob patterns
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"

# Get help
npx @mdream/crawl -h
```
Feed website content directly to Claude or other AI tools:
```bash
# Analyze entire site with Claude
npx @mdream/crawl harlanzw.com
cat output/llms-full.txt | claude -p "summarize this website"

# Analyze specific documentation
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"
cat output/llms-full.txt | claude -p "explain key concepts"

# Analyze JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
cat output/llms-full.txt | claude -p "what features does this app have"

# Convert single page
curl -s https://en.wikipedia.org/wiki/Markdown | npx mdream --origin https://en.wikipedia.org | claude -p "summarize"
```
Generate llms.txt to help AI tools understand your site:
```bash
# Static sites
npx @mdream/crawl https://yoursite.com

# JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
```
Outputs:

- `output/llms.txt` - Optimized for LLM consumption
- `output/llms-full.txt` - Complete content with metadata
- `output/md/` - Individual markdown files per page

Crawl websites and generate embeddings for vector databases:
```js
import { crawlAndGenerate } from '@mdream/crawl'
import { embed } from 'ai'
import { withMinimalPreset } from 'mdream/preset/minimal'
import { htmlToMarkdownSplitChunks } from 'mdream/splitter'

const { createTransformersJS } = await import('@built-in-ai/transformers-js')
const embeddingModel = createTransformersJS().textEmbeddingModel('Xenova/bge-base-en-v1.5')

const embeddings = []
await crawlAndGenerate({
  urls: ['https://example.com'],
  onPage: async ({ url, html, title, origin }) => {
    const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
      chunkSize: 1000,
      chunkOverlap: 200,
      origin,
    }))
    for (const chunk of chunks) {
      const { embedding } = await embed({ model: embeddingModel, value: chunk.content })
      embeddings.push({ url, title, content: chunk.content, embedding })
    }
  },
})
// Save to vector database: await saveToVectorDB(embeddings)
```
Pull headers, images, or other elements during conversion:
```js
import { htmlToMarkdown } from 'mdream'
import { extractionPlugin } from 'mdream/plugins'

const headers = []
const images = []

htmlToMarkdown(html, {
  plugins: [
    extractionPlugin({
      'h1, h2, h3': el => headers.push(el.textContent),
      'img[src]': el => images.push({ src: el.attributes.src, alt: el.attributes.alt })
    })
  ]
})
```
Remove ads, navigation, and unwanted elements to reduce token costs:
```js
import { createPlugin, ELEMENT_NODE, htmlToMarkdown } from 'mdream'

const cleanPlugin = createPlugin({
  beforeNodeProcess({ node }) {
    if (node.type === ELEMENT_NODE) {
      const cls = node.attributes?.class || ''
      if (cls.includes('ad') || cls.includes('nav') || node.name === 'script')
        return { skip: true }
    }
  }
})

htmlToMarkdown(html, { plugins: [cleanPlugin] })
```
Mdream is much more minimal than Mdream Crawl. It provides a CLI designed to work exclusively with Unix pipes, giving you the flexibility to integrate it with other tools.
**Pipe Site to Markdown**
Fetches the Markdown Wikipedia page and converts it to Markdown, preserving the original links and images.
```bash
curl -s https://en.wikipedia.org/wiki/Markdown \
  | npx mdream --origin https://en.wikipedia.org --preset minimal \
  | tee streaming.md
```
Tip: the `--origin` flag fixes relative image and link paths.
**Local File to Markdown**
Converts a local HTML file to a Markdown file, using tee to write the output to a file and display it in the terminal.
```bash
cat index.html \
  | npx mdream --preset minimal \
  | tee streaming.md
```
- `--origin <url>`: Base URL for resolving relative links and images
- `--preset <preset>`: Conversion presets: minimal
- `--help`: Display help information
- `--version`: Display version information

Run @mdream/crawl with Playwright Chrome pre-installed for website crawling in containerized environments.
```bash
# Quick start
docker run harlanzw/mdream:latest site.com/docs/**

# Interactive mode
docker run -it harlanzw/mdream:latest

# Using Playwright for JavaScript sites
docker run harlanzw/mdream:latest spa-site.com --driver playwright
```
**Available Images:**
- `harlanzw/mdream:latest` - Latest stable release
- `ghcr.io/harlan-zw/mdream:latest` - GitHub Container Registry

See DOCKER.md for complete usage, configuration, and building instructions.
```bash
pnpm add @mdream/action
```
See the GitHub Actions README for usage and configuration.
```bash
pnpm install @mdream/vite
```
See the Vite README for usage and configuration.
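For orientation, Vite plugins are registered in `vite.config`; here is a minimal sketch in which the import name `mdreamVite` is hypothetical (check the Vite README for the actual export and its options):

```ts
// vite.config.ts — sketch only: `mdreamVite` is a hypothetical export name;
// see the Vite README for the real export and its configuration options.
import mdreamVite from '@mdream/vite'
import { defineConfig } from 'vite'

export default defineConfig({
  plugins: [mdreamVite()],
})
```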
```bash
pnpm add @mdream/nuxt
```
See the Nuxt Module README for usage and configuration.
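Nuxt modules are typically enabled in `nuxt.config`; a minimal sketch (any module-specific options are documented in the module README):

```ts
// nuxt.config.ts — minimal sketch; see the Nuxt Module README for options.
export default defineNuxtConfig({
  modules: ['@mdream/nuxt'],
})
```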
For browser environments, you can use mdream directly via CDN without any build step:
```html
<!DOCTYPE html>
<html>
  <head>
    <script src="https://unpkg.com/mdream/dist/iife.js"></script>
  </head>
  <body>
    <script>
      // Convert HTML to Markdown in the browser
      const html = '<h1>Hello World</h1><p>This is a paragraph.</p>'
      const markdown = window.mdream.htmlToMarkdown(html)
      console.log(markdown) // # Hello World\n\nThis is a paragraph.
    </script>
  </body>
</html>
```
**CDN Options:**
- https://unpkg.com/mdream/dist/iife.js
- https://cdn.jsdelivr.net/npm/mdream/dist/iife.js

Licensed under the MIT license.