HTML to Markdown Conversion with Browser Rendering
This guide shows how to create a Cloudflare Worker that renders a page in a headless browser and converts the resulting HTML to Markdown. It uses Cloudflare's Browser Rendering API, Durable Objects, and the webforai library.
Create a new Worker project
Start by creating a new Cloudflare Worker project:
npm create cloudflare@latest
Follow the prompts to set up your project. Choose a project name like "html-to-markdown-worker".
Install dependencies
Install the necessary packages:
npm install @cloudflare/puppeteer hono @hono/valibot-validator valibot webforai
Configure wrangler.toml
Update your wrangler.toml file to include the Browser Rendering API binding and Durable Object:
name = "html-to-markdown-worker"
main = "src/index.ts"
# Calling Durable Object methods over RPC requires 2024-04-03 or later
compatibility_date = "2024-04-03"
# Browser Rendering API binding
browser = { binding = "MYBROWSER" }
# Binding to a Durable Object
[[durable_objects.bindings]]
name = "BROWSER"
class_name = "BrowserDO"
[[migrations]]
tag = "v1"
new_classes = ["BrowserDO"]
Create the main Worker script
Create a file named src/index.ts and add the following code:
import { DurableObject } from "cloudflare:workers";
import puppeteer from "@cloudflare/puppeteer";
import { vValidator } from "@hono/valibot-validator";
import { Hono } from "hono";
import { cache } from "hono/cache";
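import { url, literal, object, optional, pipe, string, union } from "valibot";
import { htmlToMarkdown } from "webforai";

type Bindings = {
  MYBROWSER: puppeteer.BrowserWorker;
  BROWSER: DurableObjectNamespace<BrowserDO>;
};

const app = new Hono<{ Bindings: Bindings }>();

// Requests are spread across a fixed set of Durable Object instances,
// which caps the number of browsers this Worker keeps open.
const BROWSER_KEYS = ["browser1", "browser2"];

// Valibot 0.31 and later compose validations with pipe();
// earlier releases used string([url()]) instead.
const schema = object({
  url: pipe(string(), url()),
  mode: optional(union([literal("readability"), literal("ai")])),
});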

app.get(
  "/",
  cache({ cacheName: "html-to-markdown", cacheControl: "max-age=3600" }),
  vValidator("query", schema),
  async (c) => {
    const { url, mode } = c.req.valid("query");

    const pickedKey = BROWSER_KEYS[Math.floor(Math.random() * BROWSER_KEYS.length)];
    const browser = c.env.BROWSER.get(c.env.BROWSER.idFromName(pickedKey));

    const result = await browser.renderUrl(url);
    if (!result.success) {
      return c.text(result.error, 500);
    }

    const aiModeOptions = { linkAsText: true, tableAsText: true, hideImage: true };
    const readabilityModeOptions = { linkAsText: false, tableAsText: false, hideImage: false };

    const markdown = htmlToMarkdown(result.html, {
      baseUrl: url,
      ...(mode === "ai" ? aiModeOptions : readabilityModeOptions),
    });

    return c.text(markdown);
  },
);

export default app;
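
The mode handling in the handler can also be pulled out into a pure function, which makes it easy to test without a browser. The sketch below is one way to do that; the helper name `optionsForMode` and the `ConversionOptions` type are hypothetical and not part of webforai, and the option values are copied from the handler above.

```typescript
// Hypothetical helper: the handler's mode branching, isolated for testing.
type Mode = "readability" | "ai";

interface ConversionOptions {
  linkAsText: boolean;
  tableAsText: boolean;
  hideImage: boolean;
}

function optionsForMode(mode?: Mode): ConversionOptions {
  // "ai" renders links and tables as plain text and hides images;
  // "readability" (or no mode at all) leaves all three switched off.
  const flatten = mode === "ai";
  return { linkAsText: flatten, tableAsText: flatten, hideImage: flatten };
}

const aiOptions = optionsForMode("ai");
const defaultOptions = optionsForMode();
console.log(aiOptions, defaultOptions);
```

Under this assumption, the handler could spread `optionsForMode(mode)` into the `htmlToMarkdown` options instead of choosing between two inline objects.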
Create the Durable Object
In the same src/index.ts file, add the following code for the Durable Object:
// ...existing code

const KEEP_BROWSER_ALIVE_IN_SECONDS = 60;

export class BrowserDO extends DurableObject<Bindings> {
  private browser: puppeteer.Browser | null = null;
  private keptAliveInSeconds = 0;

  async renderUrl(url: string): Promise<{ success: true; html: string } | { success: false; error: string }> {
    const normalizedUrl = new URL(url).toString();

    // Reuse the connected browser if there is one; otherwise attach to a
    // free session, and launch a new browser only when none is available.
    try {
      if (!this.browser?.isConnected()) {
        const sessions = await puppeteer.sessions(this.env.MYBROWSER);
        const freeSession = sessions.find((s) => !s.connectionId);
        if (freeSession) {
          this.browser = await puppeteer.connect(this.env.MYBROWSER, freeSession.sessionId);
        } else {
          this.browser = await puppeteer.launch(this.env.MYBROWSER);
        }
      }
    } catch (e) {
      console.error(e);
      return { success: false, error: "Failed to launch browser" };
    }
    // Narrow the type for the compiler; the block above always assigns a browser or returns.
    if (!this.browser) {
      return { success: false, error: "Failed to launch browser" };
    }

    // A new request restarts the idle countdown.
    this.keptAliveInSeconds = 0;

    const page = await this.browser.newPage();
    await page.goto(normalizedUrl, { waitUntil: "networkidle0" });

    // Strip script tags so they do not end up in the converted Markdown.
    await page.evaluate(() => {
      const scripts = document.querySelectorAll("script");
      for (const script of Array.from(scripts)) {
        script.remove();
      }
    });
    const html = await page.content();

    // Close the page and schedule the keep-alive alarm without delaying the response.
    const cleanup = async () => {
      await page.close();
      this.keptAliveInSeconds = 0;
      const currentAlarm = await this.ctx.storage.getAlarm();
      if (currentAlarm) {
        return;
      }
      const tenSeconds = 10 * 1000;
      await this.ctx.storage.setAlarm(Date.now() + tenSeconds);
    };
    this.ctx.waitUntil(cleanup());

    return { success: true, html };
  }

  // Runs every 10 seconds while the browser is idle.
  async alarm() {
    this.keptAliveInSeconds += 10;
    if (this.keptAliveInSeconds < KEEP_BROWSER_ALIVE_IN_SECONDS) {
      await this.ctx.storage.setAlarm(Date.now() + 10 * 1000);
      if (this.browser?.isConnected()) {
        // Lightweight round trip so the session is not treated as inactive.
        await this.browser.version();
      }
    } else {
      // Idle for the full keep-alive window: release the browser.
      await this.browser?.close();
      this.browser = null;
    }
  }
}
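
The keep-alive logic is easier to follow when the counter is separated from the alarm and browser APIs. The following is a minimal sketch of that counter only; the class name `KeepAliveCounter` and its methods are hypothetical, and the constants mirror `KEEP_BROWSER_ALIVE_IN_SECONDS` and the 10-second alarm interval used in the Durable Object.

```typescript
const KEEP_ALIVE_SECONDS = 60; // mirrors KEEP_BROWSER_ALIVE_IN_SECONDS
const TICK_SECONDS = 10; // mirrors the 10-second alarm interval

type TickResult = "reschedule" | "close";

// Hypothetical model of the idle countdown, with no alarms or browser attached.
class KeepAliveCounter {
  private elapsed = 0;

  // Corresponds to renderUrl(): a new request restarts the countdown.
  touch(): void {
    this.elapsed = 0;
  }

  // Corresponds to alarm(): advance the countdown and decide what to do next.
  tick(): TickResult {
    this.elapsed += TICK_SECONDS;
    return this.elapsed < KEEP_ALIVE_SECONDS ? "reschedule" : "close";
  }
}

// Simulate six consecutive alarms with no requests in between.
const counter = new KeepAliveCounter();
const results: TickResult[] = [];
for (let i = 0; i < 6; i++) {
  results.push(counter.tick());
}
console.log(results);
```

In this model the first five alarms reschedule and the sixth closes the browser, so an idle browser is released roughly 60 seconds after the last request. Any request in between calls `touch()` and restarts the countdown.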
Deploy the Worker
Deploy your Worker to Cloudflare's global network:
npx wrangler deploy
Use your Worker
After deployment, you can call your Worker by making a GET request to its URL with a url query parameter:
https://your-worker-url.workers.dev/?url=https://example.com&mode=readability
This will return the Markdown version of the specified URL's content.
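
If the target address has its own query string, it should be percent-encoded; otherwise an `&` in it would be read as a separator in the Worker's query string. A minimal sketch of a client-side helper follows. The function name `buildRequestUrl` is hypothetical, and the Worker origin is the placeholder from the example above.

```typescript
// Hypothetical helper that builds the request URL for the Worker.
function buildRequestUrl(
  workerOrigin: string,
  target: string,
  mode?: "readability" | "ai",
): string {
  const endpoint = new URL("/", workerOrigin);
  // searchParams.set() percent-encodes the value, including any "&" and "=".
  endpoint.searchParams.set("url", target);
  if (mode) {
    endpoint.searchParams.set("mode", mode);
  }
  return endpoint.toString();
}

const target = "https://example.com/search?q=a&b=c";
const requestUrl = buildRequestUrl("https://your-worker-url.workers.dev", target, "ai");
console.log(requestUrl);
```

The resulting string can be passed to `fetch()`, and the Markdown read from the response with `await response.text()`.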
This Worker provides an API endpoint that converts HTML pages to Markdown using browser rendering. It supports two modes: "readability" (default) and "ai", which affect how links, tables, and images are processed in the output Markdown.
The Durable Object manages browser sessions, improving performance by reusing sessions and reducing the number of concurrent sessions needed.
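
The session reuse decision can be sketched on its own as a pure function. The `SessionInfo` interface below is a simplified assumption that models only the two fields the Durable Object reads from `puppeteer.sessions()`, and `planBrowser` is a hypothetical name.

```typescript
// Simplified, assumed shape of a session entry: only the fields used here.
interface SessionInfo {
  sessionId: string;
  connectionId?: string; // set while some client is attached to the session
}

type Plan = { action: "connect"; sessionId: string } | { action: "launch" };

// Prefer attaching to a session nobody is connected to; launch otherwise.
function planBrowser(sessions: SessionInfo[]): Plan {
  const free = sessions.find((s) => !s.connectionId);
  return free ? { action: "connect", sessionId: free.sessionId } : { action: "launch" };
}

const busyOnly: SessionInfo[] = [{ sessionId: "a", connectionId: "c1" }];
const oneFree: SessionInfo[] = [...busyOnly, { sessionId: "b" }];
console.log(planBrowser(busyOnly), planBrowser(oneFree));
```

With only a busy session available the plan is to launch a new browser; once a session without a connection exists, the plan is to connect to it, which avoids the cost of starting another browser.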