Content Processing

Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.

Content Cleaning

Understanding Clean Content

When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:

Basic Cleaning: Removes unwanted HTML elements and attributes
Content Relevance: Identifies and preserves meaningful content blocks
Layout Analysis: Understands page structure to identify main content areas

result = await crawler.arun(
    url="https://example.com",
    word_count_threshold=10,        # Remove blocks with fewer words
    excluded_tags=['form', 'nav'],  # Remove specific HTML tags
    remove_overlay_elements=True    # Remove popups/modals
)

# Get clean content
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Clean markdown version

Fit Markdown: Smart Content Extraction

One of Crawl4AI's most powerful features is fit_markdown. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.

How Fit Markdown Works

Analyzes content density and distribution
Identifies content patterns and structures
Removes boilerplate content (headers, footers, sidebars)
Preserves the most relevant content blocks
Maintains content hierarchy and formatting

Perfect For:

Blog posts and articles
News content
Documentation pages
Any page with a clear main content area

Not Recommended For:

E-commerce product listings
Search results pages
Social media feeds
Pages with multiple equal-weight content sections

result = await crawler.arun(url="https://example.com")

# Get the most relevant content
main_content = result.fit_markdown

# Compare with regular markdown
all_content = result.markdown

print(f"Fit Markdown Length: {len(main_content)}")
print(f"Regular Markdown Length: {len(all_content)}")

Example Use Case

async def extract_article_content(url: str) -> str:
    """Extract main article content from a blog or news site."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

        # fit_markdown will focus on the article content,
        # excluding navigation, ads, and other distractions
        return result.fit_markdown

Media Processing

Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.

Image Processing

The library handles various image scenarios, including: - Regular images - Lazy-loaded images - Background images - Responsive images - Image metadata and context

result = await crawler.arun(url="https://example.com")

for image in result.media["images"]:
    # Each image includes rich metadata
    print(f"Source: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Context: {image['context']}")  # Surrounding text
    print(f"Relevance score: {image['score']}")  # 0-10 score

Handling Lazy-Loaded Content

Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:

result = await crawler.arun(
    url="https://example.com",
    wait_for="css:img[data-src]",  # Wait for lazy images
    delay_before_return_html=2.0   # Additional wait time
)

Video and Audio Content

The library extracts video and audio elements with their metadata:

# Process videos
for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")

# Process audio
for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")

Link Analysis

Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.

Link Classification

The library automatically categorizes links into: - Internal links (same domain) - External links (different domains) - Social media links - Navigation links - Content links

result = await crawler.arun(url="https://example.com")

# Analyze internal links
for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.

# Analyze external links
for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")

Smart Link Filtering

Control which links are included in the results:

result = await crawler.arun(
    url="https://example.com",
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_social_media_domains=[                # Custom social media domains
        "facebook.com", "twitter.com", "instagram.com"
    ],
    exclude_domains=["ads.example.com"]   # Exclude specific domains
)

Metadata Extraction

Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:

result = await crawler.arun(url="https://example.com")

metadata = result.metadata
print(f"Title: {metadata['title']}")
print(f"Description: {metadata['description']}")
print(f"Keywords: {metadata['keywords']}")
print(f"Author: {metadata['author']}")
print(f"Published Date: {metadata['published_date']}")
print(f"Modified Date: {metadata['modified_date']}")
print(f"Language: {metadata['language']}")

Best Practices

Use Fit Markdown for Articles

# Perfect for blog posts, news articles, documentation
content = result.fit_markdown

Handle Media Appropriately

# Filter by relevance score
relevant_images = [
    img for img in result.media["images"]
    if img['score'] > 5
]

Combine Link Analysis with Content

# Get content links with context
content_links = [
    link for link in result.links["internal"]
    if link['type'] == 'content'
]

Clean Content with Purpose

# Customize cleaning based on your needs
result = await crawler.arun(
    url=url,
    word_count_threshold=20,      # Adjust based on content type
    keep_data_attributes=False,   # Remove data attributes
    process_iframes=True         # Include iframe content
)