Content Extraction Test Suite

Test pages for validating page content extraction (PCE) and form context capture. Each page presents a different HTML structure to test how Readability, Turndown, and our noise filters handle real-world patterns.

Test Pages

Simple Article (Content)

Clean article with headings, paragraphs, lists, and images. No forms, no noise. Baseline for content extraction.

Single Contact Form (1 Form)

Article content with a single contact form embedded in the middle. Tests form context extraction alongside page content.

Multiple Forms (3 Forms)

Page with three different forms: newsletter signup (top), contact form (middle), demo request (bottom). Tests multi-form extraction with different contexts.

Pricing Page (1 Form)

Pricing tiers with a "Get a Demo" form. Tests extraction of structured pricing content and a form in a commercial context.

Blog Post (Content)

Long-form blog article with author byline, publish date, code blocks, blockquotes, and a comments section. Tests content extraction quality on editorial content.

Noisy Landing Page (Noise)

Landing page with sticky header, notification bar, cookie banner, chat widget mock, newsletter popup, social sharing, and floating CTA. Tests noise filtering.

Product Page (1 Form)

E-commerce product detail page with JSON-LD structured data, an image gallery, a specs table, reviews, and a "Notify Me" form. Tests structured data extraction.

Support / FAQ Page (1 Form)

FAQ accordion with expandable answers and a support ticket form. Tests extraction of Q&A content and form context in a support setting.

Minimal Page (Content)

Almost empty page with just a title and one sentence. Tests how extraction handles pages with very little content.

Form-Heavy Page (5 Forms)

Page dominated by forms: login, registration, search, feedback, and newsletter. Minimal text content. Tests form extraction when forms ARE the content.

Hidden Forms (2 Forms)

One visible form and one hidden (display:none) form. Tests whether extraction distinguishes the visible form from the hidden one.

SPA-Style Page (Content)

Content loaded dynamically via JavaScript after page load. Tests whether Readability captures dynamically rendered content.

Edge Case Tests

Multiple Articles (Content)

Blog index with 8 to 10 article cards, a sidebar, and no single dominant article. Tests for Readability picking only one article and querySelector('article') returning only the first.

Multiple JSON-LD (4 Schemas)

Product page with 4 competing JSON-LD blocks: Organization, BreadcrumbList, Product, and FAQPage. Tests querySelector grabbing only the first block.
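
One way to avoid the first-block-only problem is to collect every JSON-LD payload and then look up the schema type you need. The sketch below is an illustration, not the suite's actual extractor: `parseJsonLdBlocks` and `findByType` are hypothetical names, and the raw strings are assumed to come from the text content of each `script[type="application/ld+json"]` element.

```typescript
// Hypothetical helpers (not part of the test suite).
// In a browser, rawBlocks could be gathered with something like:
//   Array.from(document.querySelectorAll('script[type="application/ld+json"]'),
//              (s) => s.textContent ?? "")
type JsonLdNode = Record<string, unknown>;

function parseJsonLdBlocks(rawBlocks: string[]): JsonLdNode[] {
  const nodes: JsonLdNode[] = [];
  for (const raw of rawBlocks) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // skip malformed blocks instead of failing the whole page
    }
    // A block may hold a single object or an array of objects.
    const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
    for (const item of items) {
      if (item !== null && typeof item === "object" && !Array.isArray(item)) {
        nodes.push(item as JsonLdNode);
      }
    }
  }
  return nodes;
}

// Return the first node whose @type equals (or includes) the requested type.
function findByType(nodes: JsonLdNode[], type: string): JsonLdNode | undefined {
  return nodes.find((node) => {
    const t = node["@type"];
    return Array.isArray(t) ? t.includes(type) : t === type;
  });
}
```

With the four schemas on this page, `findByType(nodes, "Product")` would return the Product node regardless of where it appears in the document.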

Form Attribute (1 Form)

Checkout form where fields are scattered outside the <form> tag using the form= attribute. Tests extractFormContext missing associated fields.
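
The ownership rule this page exercises can be sketched as follows: a field's explicit form attribute takes precedence over DOM nesting. The `FieldLike` shape and `fieldsOwnedBy` are hypothetical and exist only for illustration. In a real browser, `HTMLFormElement.elements` already includes fields associated through the form attribute, so iterating that collection rather than the form's descendants may be the simpler fix.

```typescript
// Hypothetical, DOM-free description of a form field (illustration only).
interface FieldLike {
  name: string;
  formAttr: string | null;       // value of the field's form="..." attribute, if any
  ancestorFormId: string | null; // id of the nearest enclosing <form>, if any
}

// Select the fields whose form owner is the form with the given id.
function fieldsOwnedBy(formId: string, allFields: FieldLike[]): FieldLike[] {
  return allFields.filter((field) =>
    // An explicit form attribute overrides nesting inside another form.
    field.formAttr !== null
      ? field.formAttr === formId
      : field.ancestorFormId === formId
  );
}
```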

Layout Tables (Noise)

Email-style content in nested layout tables mixed with one actual data table. Tests Turndown converting layout tables to garbled pipe-delimited markdown.
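
A possible mitigation is to classify each table before conversion and unwrap the ones used only for layout. The heuristic below is a rough sketch under stated assumptions: the `TableFacts` shape, the function name, and the signals chosen are all illustrative and are not taken from Turndown or from this suite.

```typescript
// Hypothetical summary of a <table>, gathered by the caller (illustration only).
interface TableFacts {
  role: string | null;          // value of the role attribute, if any
  hasHeaderCells: boolean;      // contains at least one <th>
  hasCaption: boolean;          // contains a <caption>
  containsNestedTable: boolean; // another <table> is nested inside
  rowCount: number;
  columnCount: number;
}

// Heuristic only: returns true when the table is probably used for layout.
function isLikelyLayoutTable(table: TableFacts): boolean {
  // Authors can mark layout tables explicitly.
  if (table.role === "presentation" || table.role === "none") return true;
  // Header cells or a caption strongly suggest tabular data.
  if (table.hasHeaderCells || table.hasCaption) return false;
  // Nesting is typical of email-style layouts.
  if (table.containsNestedTable) return true;
  // A single row or a single column rarely carries tabular data.
  return table.rowCount < 2 || table.columnCount < 2;
}
```

Tables flagged this way could be converted as plain block content, while the remaining data table keeps its pipe-delimited markdown form.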

Dynamic Forms (1 Form)

React/Vue-style page where the form and reviews are rendered by JavaScript after a delay. Tests for extraction firing before the content exists in the DOM.

Aria Labels (1 Form)

Modern signup form with NO <label> elements. Fields use only aria-label, aria-labelledby, and placeholder. Tests extractFieldSummary missing aria attributes.
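
A label lookup that covers this page might fall back through several sources in order. The sketch below is an assumption about how such a fallback could look, not the behavior of extractFieldSummary: the `FieldAttrs` shape, `resolveFieldLabel`, and the two lookup callbacks are hypothetical, and the ordering only loosely follows the usual accessible-name precedence (aria-labelledby, aria-label, associated label, placeholder).

```typescript
// Hypothetical attribute snapshot of a form field (illustration only).
interface FieldAttrs {
  id?: string;
  ariaLabel?: string;
  ariaLabelledby?: string; // space-separated list of element ids
  placeholder?: string;
  name?: string;
}

function resolveFieldLabel(
  field: FieldAttrs,
  labelTextFor: (fieldId: string) => string | undefined, // text of <label for="...">
  textById: (id: string) => string | undefined           // text of the element with that id
): string {
  const candidates: Array<string | undefined> = [
    // aria-labelledby may reference several elements; join their text.
    field.ariaLabelledby
      ?.split(/\s+/)
      .map((id) => textById(id) ?? "")
      .join(" "),
    field.ariaLabel,
    field.id !== undefined ? labelTextFor(field.id) : undefined,
    field.placeholder,
    field.name, // last resort: the raw field name
  ];
  for (const candidate of candidates) {
    const text = candidate?.replace(/\s+/g, " ").trim();
    if (text) return text;
  }
  return "";
}
```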

Wizard Form (1 Form)

4-step wizard with ~20 fields in DOM but only 4-6 visible at a time. Hidden steps use display:none. Tests visibility check on form vs individual fields.
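
The distinction between checking the form and checking each field can be sketched with a DOM-free model. `NodeLike`, `isRendered`, and `visibleFields` are hypothetical names. In a browser, the `display` value would typically come from `getComputedStyle`, and the parent chain from `parentElement`.

```typescript
// Hypothetical, DOM-free node model (illustration only).
interface NodeLike {
  display: string;        // computed display value, e.g. "block" or "none"
  hiddenAttr: boolean;    // true when the hidden attribute is present
  parent: NodeLike | null;
}

// A node is rendered only if neither it nor any ancestor is hidden.
function isRendered(node: NodeLike): boolean {
  for (let current: NodeLike | null = node; current !== null; current = current.parent) {
    if (current.hiddenAttr || current.display === "none") return false;
  }
  return true;
}

// Filter per field, so hidden wizard steps inside a visible form are excluded.
function visibleFields<T extends NodeLike>(fields: T[]): T[] {
  return fields.filter(isRendered);
}
```

A check applied only to the form element would report every field in the wizard as visible, because the form itself is rendered while its inactive steps are not.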

Accordion & Tabs (Content)

Tabbed content with 3 tabs and a 6-item accordion, all collapsed. Substantial hidden content plus a form inside an accordion panel.

WordPress Blocks (Content)

Gutenberg block patterns: wp-block-cover, gallery, columns, pullquote, and empty wp-block-latest-posts. Tests comment artifacts and empty dynamic blocks.
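
If block delimiter comments do leak into the output, one option is to strip them from the HTML before conversion. The function below is a minimal sketch with a hypothetical name. It assumes the delimiters use the `<!-- wp:name {...} -->`, `<!-- /wp:name -->`, and self-closing `<!-- wp:name /-->` comment forms, and it leaves ordinary comments alone.

```typescript
// Hypothetical pre-processing step (not part of the test suite).
// Removes Gutenberg block delimiter comments such as:
//   <!-- wp:paragraph -->   <!-- /wp:paragraph -->   <!-- wp:latest-posts {...} /-->
function stripGutenbergComments(html: string): string {
  // Lazy match up to the first "-->" so each comment is removed separately.
  return html.replace(/<!--\s*\/?wp:[\s\S]*?-->/g, "");
}
```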

Inline SVG (Noise)

Infographic with large inline SVGs containing <text> elements with data labels. Tests SVG text leaking into markdown and performance with many path elements.

Cookie Wall (Noise)

Full-screen consent overlay with non-standard class names (sp_message_container, qc-cmp2-container). Tests unrecognized CMP classes polluting output.
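
A noise filter could recognize these containers by matching class tokens against a pattern list. The sketch below is illustrative only: the two vendor prefixes come from the class names on this test page, while the generic `cookie` and `consent` patterns, the prefix matching, and the function name are assumptions.

```typescript
// Hypothetical pattern list (illustration only).
const CONSENT_CLASS_PATTERNS: RegExp[] = [
  /^sp_message_container/, // class name used on this test page
  /^qc-cmp2-/,             // prefix of the qc-cmp2-container class on this test page
  /cookie/i,               // generic fallback (assumed)
  /consent/i,              // generic fallback (assumed)
];

// Returns true when any class token matches a known consent-overlay pattern.
function isConsentNoise(className: string): boolean {
  return className
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .some((token) => CONSENT_CLASS_PATTERNS.some((pattern) => pattern.test(token)));
}
```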

ContentEditable (0 Forms)

Rich text compose interface using a contenteditable div, custom dropdown selects, and a submit button outside any <form> tag. All of it is invisible to form extraction.

Emoji & Unicode (Content)

Product reviews with heavy emoji, Arabic RTL text, CJK characters, math symbols, and ZWJ emoji sequences. Tests truncation and sentence boundary detection.
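
Truncating by UTF-16 code unit can split a ZWJ emoji sequence or a surrogate pair. One possible approach, sketched here with a hypothetical `truncateGraphemes` helper, is to count grapheme clusters with `Intl.Segmenter`. This assumes a runtime that ships `Intl.Segmenter`, and it is not necessarily how the extractor under test truncates.

```typescript
// Hypothetical helper (illustration only): truncate on grapheme boundaries.
function truncateGraphemes(text: string, maxGraphemes: number, ellipsis = "…"): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    // Stop before adding a grapheme that would exceed the limit.
    if (count === maxGraphemes) return result + ellipsis;
    result += segment;
    count++;
  }
  return result; // the text fit within the limit, so no ellipsis is added
}
```

The same API accepts `granularity: "sentence"`, which may help with the sentence boundary cases on this page.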

Image Gallery (1 Form)

Product gallery with mixed image types: good alt, empty alt, missing alt, lazy-loaded, and a <picture> element, plus a CSS background-image hero.

Slider / Carousel (Content)

CSS-only hero slider with 4 slides (only 1 visible) and a product card carousel. Tests multiple hidden slides and image-only slides producing empty content.

Video Embeds (Content)

YouTube iframe, Vimeo iframe, native <video>, and VideoObject JSON-LD. Tests iframe stripping losing video context while preserving surrounding text.
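
To keep some video context when iframes are stripped, an extractor could replace each recognized embed with a markdown link. The sketch below is one possible shape, not the suite's behavior: `describeVideoEmbed` is a hypothetical helper, and the host list is an assumption that covers only the two providers on this page.

```typescript
// Hypothetical helper (illustration only): turn an iframe src into a markdown link.
function describeVideoEmbed(src: string, title?: string): string | null {
  let url: URL;
  try {
    url = new URL(src);
  } catch {
    return null; // relative or malformed src: nothing useful to report
  }
  const host = url.hostname.replace(/^www\./, "");
  let provider: string | null = null;
  if (host === "youtube.com" || host === "youtube-nocookie.com" || host === "youtu.be") {
    provider = "YouTube";
  } else if (host === "player.vimeo.com" || host === "vimeo.com") {
    provider = "Vimeo";
  }
  if (provider === null) return null; // unknown provider: let the caller decide
  // Prefer the iframe's title attribute when it is present and non-empty.
  const label = title?.trim() || `${provider} video`;
  return `[${label}](${url.href})`;
}
```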

Hero Backgrounds (Noise)

Page where all visuals use CSS background-image: hero banner, feature icons, testimonial section, product thumbnails. Tests for visual content that is invisible to extraction.
