☁️ Convert any site to clean markdown & llms.txt. Boost your site's AI discoverability or generate LLM context for a project you're working with.
Made possible by my Sponsor Program 💖 • Follow me @harlan_zw 🐦 • Join Discord for help
Traditional HTML to Markdown converters were not built for LLMs or humans. They tend to be slow and bloated, and they produce output that is poorly suited to both LLM token budgets and human readability.
Other LLM-specific converters focus on supporting every document format, resulting in larger bundles and lower-quality Markdown output.
Mdream's core is a highly optimized primitive for producing LLM-ready Markdown from HTML. On top of this, Mdream ships several packages that generate LLM artifacts such as llms.txt
for your own sites, or LLM context for any project you're working with.
Mdream is built to run anywhere for all projects and use cases and is available in the following packages:
| Package | Description |
|---|---|
| `mdream` | HTML to Markdown converter, use anywhere: browser, edge runtime, Node, etc. Includes a CLI for stdin conversion and a package API. Minimal: no dependencies. |
| CDN builds | Use mdream directly in browsers via unpkg/jsDelivr without any build step. |
| `@mdream/crawl` | Site-wide crawler to generate llms.txt artifacts from entire websites. |
| Docker image | Pre-built Docker image with Playwright Chrome for containerized website crawling. |
| `@mdream/vite` | Generate automatic .md for your own Vite sites. |
| `@mdream/nuxt` | Generate automatic .md and llms.txt artifacts for Nuxt sites. |
| `@mdream/action` | Generate .md and llms.txt artifacts from your static .html output. |
Feed website content directly to Claude or other AI tools:
```bash
# Analyze entire site with Claude
npx @mdream/crawl harlanzw.com
cat output/llms-full.txt | claude -p "summarize this website"

# Analyze specific documentation
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"
cat output/llms-full.txt | claude -p "explain key concepts"

# Analyze JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
cat output/llms-full.txt | claude -p "what features does this app have"

# Convert single page
curl -s https://en.wikipedia.org/wiki/Markdown | npx mdream --origin https://en.wikipedia.org | claude -p "summarize"
```
Generate llms.txt to help AI tools understand your site:
```bash
# Static sites
npx @mdream/crawl https://yoursite.com

# JavaScript/SPA sites (React, Vue, Angular)
npx -p playwright -p @mdream/crawl crawl https://spa-site.com --driver playwright
```
Outputs:

- `output/llms.txt` - Optimized for LLM consumption
- `output/llms-full.txt` - Complete content with metadata
- `output/md/` - Individual markdown files per page

Crawl websites and generate embeddings for vector databases:
```ts
import { crawlAndGenerate } from '@mdream/crawl'
import { embed } from 'ai'
import { withMinimalPreset } from 'mdream/preset/minimal'
import { htmlToMarkdownSplitChunks } from 'mdream/splitter'

const { createTransformersJS } = await import('@built-in-ai/transformers-js')
const embeddingModel = createTransformersJS().textEmbeddingModel('Xenova/bge-base-en-v1.5')

const embeddings = []
await crawlAndGenerate({
  urls: ['https://example.com'],
  onPage: async ({ url, html, title, origin }) => {
    const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
      chunkSize: 1000,
      chunkOverlap: 200,
      origin,
    }))
    for (const chunk of chunks) {
      const { embedding } = await embed({ model: embeddingModel, value: chunk.content })
      embeddings.push({ url, title, content: chunk.content, embedding })
    }
  },
})

// Save to vector database: await saveToVectorDB(embeddings)
```
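Once the embeddings are collected, retrieval can be as simple as a cosine-similarity scan. The sketch below assumes the record shape built in the crawl example above; a real deployment would replace the linear scan with a vector database query.

```typescript
// Rank embedded chunks by cosine similarity to a query embedding.
interface EmbeddedChunk {
  url: string
  title: string
  content: string
  embedding: number[]
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the k chunks most similar to the query embedding.
function topK(chunks: EmbeddedChunk[], query: number[], k = 3): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(y.embedding, query) - cosineSimilarity(x.embedding, query))
    .slice(0, k)
}
```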
Pull headers, images, or other elements during conversion:
```ts
import { htmlToMarkdown } from 'mdream'
import { extractionPlugin } from 'mdream/plugins'

const headers = []
const images = []

htmlToMarkdown(html, {
  plugins: [
    extractionPlugin({
      'h1, h2, h3': el => headers.push(el.textContent),
      'img[src]': el => images.push({ src: el.attributes.src, alt: el.attributes.alt })
    })
  ]
})
```
Remove ads, navigation, and unwanted elements to reduce token costs:
```ts
import { createPlugin, ELEMENT_NODE, htmlToMarkdown } from 'mdream'

const cleanPlugin = createPlugin({
  beforeNodeProcess({ node }) {
    if (node.type === ELEMENT_NODE) {
      const cls = node.attributes?.class || ''
      if (cls.includes('ad') || cls.includes('nav') || node.name === 'script')
        return { skip: true }
    }
  }
})

htmlToMarkdown(html, { plugins: [cleanPlugin] })
```
Need something that works in the browser or an edge runtime? Use Mdream.
The @mdream/crawl package crawls an entire site and generates LLM artifacts, using mdream for the Markdown conversion.
Individual Markdown files for each page are written to the md/ directory.

```bash
# Interactive
npx @mdream/crawl

# Simple
npx @mdream/crawl https://harlanzw.com

# Glob patterns
npx @mdream/crawl "https://nuxt.com/docs/getting-started/**"

# Get help
npx @mdream/crawl -h
```
Mdream is much more minimal than Mdream Crawl. It provides a CLI designed to work exclusively with Unix pipes,
giving you the flexibility to integrate it with other tools.
**Pipe Site to Markdown**
Fetches the Markdown Wikipedia page and converts it to Markdown, preserving the original links and images.
```bash
curl -s https://en.wikipedia.org/wiki/Markdown \
  | npx mdream --origin https://en.wikipedia.org --preset minimal \
  | tee streaming.md
```
Tip: the `--origin` flag fixes relative image and link paths.
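Conceptually, the rewrite that `--origin` performs is standard WHATWG URL resolution of each relative link against the given base (an illustrative sketch, not mdream's internals):

```typescript
// Resolve a relative href against the origin, as --origin does
// conceptually for links and images in the converted Markdown.
const origin = 'https://en.wikipedia.org'
const absolute = new URL('/wiki/Plain_text', origin).href
console.log(absolute) // https://en.wikipedia.org/wiki/Plain_text
```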
**Local File to Markdown**
Converts a local HTML file to a Markdown file, using tee to write the output to a file and display it in the terminal.
```bash
cat index.html \
  | npx mdream --preset minimal \
  | tee streaming.md
```
- `--origin <url>`: Base URL for resolving relative links and images
- `--preset <preset>`: Conversion presets: minimal
- `--help`: Display help information
- `--version`: Display version information

Run @mdream/crawl with Playwright Chrome pre-installed for website crawling in containerized environments.
```bash
# Quick start
docker run harlanzw/mdream:latest site.com/docs/**

# Interactive mode
docker run -it harlanzw/mdream:latest

# Using Playwright for JavaScript sites
docker run harlanzw/mdream:latest spa-site.com --driver playwright
```
**Available Images:**
- `harlanzw/mdream:latest` - Latest stable release
- `ghcr.io/harlan-zw/mdream:latest` - GitHub Container Registry

See DOCKER.md for complete usage, configuration, and building instructions.
```bash
pnpm add @mdream/action
```
See the GitHub Actions README for usage and configuration.
```bash
pnpm install @mdream/vite
```
See the Vite README for usage and configuration.
```bash
pnpm add @mdream/nuxt
```
See the Nuxt Module README for usage and configuration.
For browser environments, you can use mdream directly via CDN without any build step:
```html
<!DOCTYPE html>
<html>
<head>
  <script src="https://unpkg.com/mdream/dist/iife.js"></script>
</head>
<body>
  <script>
    // Convert HTML to Markdown in the browser
    const html = '<h1>Hello World</h1><p>This is a paragraph.</p>'
    const markdown = window.mdream.htmlToMarkdown(html)
    console.log(markdown) // # Hello World\n\nThis is a paragraph.
  </script>
</body>
</html>
```
**CDN Options:**
- `https://unpkg.com/mdream/dist/iife.js`
- `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`

```bash
pnpm add mdream
```
```ts
import { htmlToMarkdown } from 'mdream'

const markdown = htmlToMarkdown('<h1>Hello World</h1>')
console.log(markdown) // # Hello World
```
See the Mdream Package README for complete documentation on API usage, streaming, presets, and the plugin system.
Mdream includes a LangChain-compatible Markdown splitter that runs efficiently in a single pass.
This provides a significant performance improvement over traditional multi-pass splitters and lets
you integrate your custom Mdream plugins.
```ts
import { htmlToMarkdownSplitChunks } from 'mdream/splitter'

const chunks = await htmlToMarkdownSplitChunks('<h1>Hello World</h1><p>This is a paragraph.</p>', {
  chunkSize: 1000,
  chunkOverlap: 200,
})

console.log(chunks) // Array of text chunks
```
See the Text Splitter Documentation for complete usage and configuration.
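To illustrate what `chunkSize` and `chunkOverlap` control, here is a simplified character-based sliding-window splitter. This is not mdream's implementation (which splits along Markdown structure in a single pass); it only shows the size/overlap semantics.

```typescript
// Naive character-window splitter: each chunk is at most `chunkSize`
// characters and repeats the last `chunkOverlap` characters of the
// previous chunk, so context is preserved across chunk boundaries.
function splitByCharacters(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const step = chunkSize - chunkOverlap
  if (step <= 0)
    throw new RangeError('chunkOverlap must be smaller than chunkSize')
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length)
      break
  }
  return chunks
}
```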
Licensed under the MIT license.