How it works
Overview
The core function of webforai is converting HTML to Markdown, built on the Syntax Tree ecosystem. This process happens in two steps:
- Convert HTML to Hast. (Hypertext Abstract Syntax Tree)
- Convert Hast to Mdast. (Markdown Abstract Syntax Tree)
- Convert Mdast to Markdown.
What makes this special is the content extraction in step 1. This ensures that only the main content—the part humans care about—is extracted from the HTML. After that, the rest of the transformation is handled using fine-tuned utilities from the Syntax Tree ecosystem.
Extractor
In webforai, the process of extracting the main content from a web page is abstracted into a component called the Extractor. This is a flexible system designed to make content extraction simple and customizable.
Extractor Interface
The Extractor is a function that takes in two things:
- A Hast object, which represents the structure of the HTML.
- Optional metadata, such as the language or URL of the page.
The Extractor processes this input and returns a new Hast object that represents the cleaned-up, extracted content.
import type { Nodes as Hast } from "hast";
type ExtractParams = { hast: Hast; lang?: string; url?: string };
type Extractor = (params: ExtractParams) => Hast;
Default Extractor
By default, webforai provides a built-in Extractor called takumi-extractor
. This extractor is adjusted to produce a high average quality for a typical web page.
I do my best to adjust it to the best of my ability using various flags and scoring with reference to Mozilla's readability and other algorithms.
Customizing the Extraction
webforai allows you to define multiple extractors and chain them together. The Hast object is passed from one Extractor to the next in the order they are defined, allowing you to fine-tune the extraction process.
You can also create your own custom Extractor to implement specific algorithms or extraction logic.
import { htmlToMarkdown } from "webforai";
import { loadHtml } from "webforai/loaders/fetch";
import type { Extractor } from "webforai";
const customExtractor: Extractor = (params) => {
const { hast, url } = params;
// Your custom extraction logic here
return hast;
};
const html = await loadHtml("https://example.com");
const markdown = await htmlToMarkdown(html, {
extractors: [customExtractor],
});