☁️ Convert any site to clean markdown & llms.txt. Boost your site's AI discoverability or generate LLM context for a project you're working with.
Made possible by my Sponsor Program 💖 • Follow me @harlan_zw 🐦 • Join Discord for help
Traditional HTML to Markdown converters were not built for LLMs or humans. They tend to be slow and bloated, and they produce output that is poorly suited to both LLM token budgets and human readability.
Other LLM-specific converters focus on supporting every document format, resulting in larger bundles and lower-quality Markdown output.
Mdream's core is a highly optimized primitive for producing LLM-ready Markdown from HTML. On top of this, Mdream ships several packages that generate LLM artifacts such as llms.txt
for your own sites, or LLM context for any project you're working with.
Mdream is built to run anywhere for all projects and use cases and is available in the following packages:
| Package | Description |
|---|---|
| `mdream` | HTML to Markdown converter, use anywhere: browser, edge runtime, Node, etc. Includes a CLI for stdin conversion and a package API. Minimal: no dependencies. |
| CDN builds | Use mdream directly in browsers via unpkg/jsDelivr without any build step. |
| `@mdream/crawl` | Site-wide crawler to generate llms.txt artifacts from entire websites. |
| Docker image | Pre-built Docker image with Playwright Chrome for containerized website crawling. |
| `@mdream/vite` | Generate automatic .md for your own Vite sites. |
| `@mdream/nuxt` | Generate automatic .md and llms.txt artifacts for Nuxt sites. |
| `@mdream/action` | Generate .md and llms.txt artifacts from your static .html output. |
Feed website content directly to Claude or other AI tools:
```bash
# Analyze entire site with Claude
npx @mdream/crawl harlanzw.com
cat output/llms-full.txt | claude -p "summarize this website"

# Analyze specific documentation
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"
cat output/llms-full.txt | claude -p "explain key concepts"

# Analyze JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
cat output/llms-full.txt | claude -p "what features does this app have"

# Convert single page
curl -s https://en.wikipedia.org/wiki/Markdown | npx mdream --origin https://en.wikipedia.org | claude -p "summarize"
```
Generate llms.txt to help AI tools understand your site:
```bash
# Static sites
npx @mdream/crawl https://yoursite.com

# JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
```
Outputs:

- `output/llms.txt` - Optimized for LLM consumption
- `output/llms-full.txt` - Complete content with metadata
- `output/md/` - Individual markdown files per page

Crawl websites and generate embeddings for vector databases:
```ts
import { crawlAndGenerate } from '@mdream/crawl'
import { embed } from 'ai'
import { withMinimalPreset } from 'mdream/preset/minimal'
import { htmlToMarkdownSplitChunks } from 'mdream/splitter'

const { createTransformersJS } = await import('@built-in-ai/transformers-js')
const embeddingModel = createTransformersJS().textEmbeddingModel('Xenova/bge-base-en-v1.5')

const embeddings = []
await crawlAndGenerate({
  urls: ['https://example.com'],
  onPage: async ({ url, html, title, origin }) => {
    const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
      chunkSize: 1000,
      chunkOverlap: 200,
      origin,
    }))
    for (const chunk of chunks) {
      const { embedding } = await embed({ model: embeddingModel, value: chunk.content })
      embeddings.push({ url, title, content: chunk.content, embedding })
    }
  },
})

// Save to vector database: await saveToVectorDB(embeddings)
```
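Once the embeddings are collected, retrieval can be as simple as a cosine-similarity scan. The sketch below assumes the record shape built in the crawl example above; a real deployment would replace the linear scan with a vector database query.

```typescript
// Rank embedded chunks by cosine similarity to a query embedding.
interface EmbeddedChunk {
  url: string
  title: string
  content: string
  embedding: number[]
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the k chunks most similar to the query embedding.
function topK(chunks: EmbeddedChunk[], query: number[], k = 3): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(y.embedding, query) - cosineSimilarity(x.embedding, query))
    .slice(0, k)
}
```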
Pull headers, images, or other elements during conversion:
```ts
import { htmlToMarkdown } from 'mdream'
import { extractionPlugin } from 'mdream/plugins'

const headers = []
const images = []

htmlToMarkdown(html, {
  plugins: [
    extractionPlugin({
      'h1, h2, h3': el => headers.push(el.textContent),
      'img[src]': el => images.push({ src: el.attributes.src, alt: el.attributes.alt })
    })
  ]
})
```
Remove ads, navigation, and unwanted elements to reduce token costs:
```ts
import { createPlugin, ELEMENT_NODE, htmlToMarkdown } from 'mdream'

const cleanPlugin = createPlugin({
  beforeNodeProcess({ node }) {
    if (node.type === ELEMENT_NODE) {
      const cls = node.attributes?.class || ''
      if (cls.includes('ad') || cls.includes('nav') || node.name === 'script')
        return { skip: true }
    }
  }
})

htmlToMarkdown(html, { plugins: [cleanPlugin] })
```
Need something that works in the browser or an edge runtime? Use Mdream.
The @mdream/crawl package crawls an entire site and generates LLM artifacts, using mdream for the Markdown conversion.
Individual Markdown files for each page are written to the md/ directory.

```bash
# Interactive
npx @mdream/crawl

# Simple
npx @mdream/crawl https://harlanzw.com

# Glob patterns
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"

# Get help
npx @mdream/crawl -h
```
Mdream is much more minimal than Mdream Crawl. It provides a CLI designed to work exclusively with Unix pipes,
giving you the flexibility to integrate it with other tools.
**Pipe Site to Markdown**
Fetches the Markdown Wikipedia page and converts it to Markdown, preserving the original links and images.
```bash
curl -s https://en.wikipedia.org/wiki/Markdown \
  | npx mdream --origin https://en.wikipedia.org --preset minimal \
  | tee streaming.md
```
Tip: the `--origin` flag fixes relative image and link paths.
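Conceptually, the rewrite that `--origin` performs is standard WHATWG URL resolution of each relative link against the given base (an illustrative sketch, not mdream's internals):

```typescript
// Resolve a relative href against the origin, as --origin does
// conceptually for links and images in the converted Markdown.
const origin = 'https://en.wikipedia.org'
const absolute = new URL('/wiki/Plain_text', origin).href
console.log(absolute) // https://en.wikipedia.org/wiki/Plain_text
```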
**Local File to Markdown**
Converts a local HTML file to a Markdown file, using tee to write the output to a file and display it in the terminal.
```bash
cat index.html \
  | npx mdream --preset minimal \
  | tee streaming.md
```
- `--origin <url>`: Base URL for resolving relative links and images
- `--preset <preset>`: Conversion presets: minimal
- `--help`: Display help information
- `--version`: Display version information

Run @mdream/crawl with Playwright Chrome pre-installed for website crawling in containerized environments.
```bash
# Quick start
docker run harlanzw/mdream:latest site.com/docs/**

# Interactive mode
docker run -it harlanzw/mdream:latest

# Using Playwright for JavaScript sites
docker run harlanzw/mdream:latest spa-site.com --driver playwright
```
**Available Images:**
- `harlanzw/mdream:latest` - Latest stable release
- `ghcr.io/harlan-zw/mdream:latest` - GitHub Container Registry

See DOCKER.md for complete usage, configuration, and building instructions.
```bash
pnpm add @mdream/action
```
See the GitHub Actions README for usage and configuration.
```bash
pnpm install @mdream/vite
```
See the Vite README for usage and configuration.
```bash
pnpm add @mdream/nuxt
```
See the Nuxt Module README for usage and configuration.
For browser environments, you can use mdream directly via CDN without any build step:
```html
<!DOCTYPE html>
<html>
<head>
  <script src="https://unpkg.com/mdream/dist/iife.js"></script>
</head>
<body>
  <script>
    // Convert HTML to Markdown in the browser
    const html = '<h1>Hello World</h1><p>This is a paragraph.</p>'
    const markdown = window.mdream.htmlToMarkdown(html)
    console.log(markdown) // # Hello World\n\nThis is a paragraph.
  </script>
</body>
</html>
```
**CDN Options:**
- `https://unpkg.com/mdream/dist/iife.js`
- `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`

```bash
pnpm add mdream
```
```ts
import { htmlToMarkdown } from 'mdream'

const markdown = htmlToMarkdown('<h1>Hello World</h1>')
console.log(markdown) // # Hello World
```
See the Mdream Package README for complete documentation on API usage, streaming, presets, and the plugin system.
Mdream includes a LangChain-compatible Markdown splitter that runs efficiently in a single pass.
This provides a significant performance improvement over traditional multi-pass splitters and lets
you integrate your custom Mdream plugins.
```ts
import { htmlToMarkdownSplitChunks } from 'mdream/splitter'

const chunks = await htmlToMarkdownSplitChunks('<h1>Hello World</h1><p>This is a paragraph.</p>', {
  chunkSize: 1000,
  chunkOverlap: 200,
})

console.log(chunks) // Array of text chunks
```
See the Text Splitter Documentation for complete usage and configuration.
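To illustrate what `chunkSize` and `chunkOverlap` control, here is a simplified character-based sliding-window splitter. This is not mdream's implementation (which splits along Markdown structure in a single pass); it only shows the size/overlap semantics.

```typescript
// Naive character-window splitter: each chunk is at most `chunkSize`
// characters and repeats the last `chunkOverlap` characters of the
// previous chunk, so context is preserved across chunk boundaries.
function splitByCharacters(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const step = chunkSize - chunkOverlap
  if (step <= 0)
    throw new RangeError('chunkOverlap must be smaller than chunkSize')
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length)
      break
  }
  return chunks
}
```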
Licensed under the MIT license.