11.1. JSON CSS - Crawl4AI Documentation

This outline for the JSON-CSS Extraction Strategy video covers the schema structure, field types, and nested and list structures supported in Crawl4AI.

10.1 JSON-CSS Extraction Strategy

1. Introduction to JSON-CSS Extraction

JSON-CSS Extraction pulls structured data from pages with repeated patterns, such as product listings, article feeds, or directories.
You define a schema of CSS selectors and data fields, which makes it straightforward to capture nested, list-based, or single elements.

2. Basic Schema Structure

Schema Fields: The schema has two main components:
- baseSelector: A CSS selector to locate the main elements you want to extract (e.g., each article or product block).
- fields: Defines the data fields for each element, supporting various data types and structures.
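To make the two-part shape concrete, here is a minimal sketch that builds a schema and checks it with a hypothetical helper, check_schema. The helper is not part of Crawl4AI; it only assumes that every field carries a name and a type, as the examples in this outline do.

```python
# Hypothetical helper (not a Crawl4AI API): checks the two top-level schema keys.
def check_schema(schema: dict) -> list:
    """Return a list of problems found in a JSON-CSS style schema."""
    problems = []
    if not isinstance(schema.get("baseSelector"), str):
        problems.append("baseSelector must be a CSS selector string")
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        problems.append("fields must be a non-empty list")
        return problems
    for i, field in enumerate(fields):
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"fields[{i}] is missing '{key}'")
    return problems


schema = {
    "baseSelector": ".product",  # matches each repeated block on the page
    "fields": [                  # what to pull out of each matched block
        {"name": "title", "selector": ".title", "type": "text"},
    ],
}

print(check_schema(schema))  # an empty list means no problems were found
```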

3. Simple Field Extraction

Example HTML:

<div class="product">
    <h2 class="title">Sample Product</h2>
    <span class="price">$19.99</span>
    <p class="description">This is a sample product.</p>
</div>

Schema:

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".description", "type": "text"}
    ]
}

Explanation: Each field captures text content from specified CSS selectors within each .product element.
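The way baseSelector and fields interact can be sketched with the standard library alone. This is a stand-in for illustration, not Crawl4AI's implementation: it uses xml.etree.ElementTree instead of a real CSS engine, and the by_class and extract helpers are hypothetical names that handle only bare ".class" selectors.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="product">
    <h2 class="title">Sample Product</h2>
    <span class="price">$19.99</span>
    <p class="description">This is a sample product.</p>
</div>
"""

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".description", "type": "text"},
    ],
}


def by_class(node, selector):
    """Find descendants for a bare '.class' selector (the only form handled here)."""
    return node.findall(f".//*[@class='{selector.lstrip('.')}']")


def extract(html, schema):
    # Wrap the fragment so the base element is a descendant of the parsed root.
    root = ET.fromstring(f"<body>{html}</body>")
    rows = []
    for base in by_class(root, schema["baseSelector"]):  # one row per base match
        row = {}
        for field in schema["fields"]:
            matches = by_class(base, field["selector"])  # search inside the base only
            row[field["name"]] = "".join(matches[0].itertext()).strip() if matches else None
        rows.append(row)
    return rows


print(extract(HTML, schema))
```

Each matched .product block becomes one dictionary, keyed by the field names in the schema.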

4. Supported Field Types: Text, Attribute, HTML, Regex

Field Type Options:
- text: Extracts visible text.
- attribute: Captures an HTML attribute (e.g., src, href).
- html: Extracts the raw HTML of an element.
- regex: Allows regex patterns to extract part of the text.

Example HTML (including an image):

<div class="product">
    <h2 class="title">Sample Product</h2>
    <img class="product-image" src="image.jpg" alt="Product Image">
    <span class="price">$19.99</span>
    <p class="description">Limited time offer.</p>
</div>

Schema:

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
        {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
        {"name": "description_html", "selector": ".description", "type": "html"}
    ]
}

Explanation:
- attribute: Extracts the src attribute from .product-image.
- regex: Applies the pattern to the element's text and keeps the captured group, so $19.99 yields 19.99.
- html: Retrieves the full HTML of the description element.
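The price pattern is worth trying in isolation before putting it in a schema. The sketch below uses only Python's re module and assumes the strategy keeps the first capture group of the first match; first_group is a hypothetical helper name.

```python
import re

PRICE_PATTERN = r"\$(\d+\.\d+)"  # same pattern as the price field above


def first_group(pattern, text):
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(first_group(PRICE_PATTERN, "$19.99"))      # 19.99
print(first_group(PRICE_PATTERN, "Now $5.00!"))  # 5.00
print(first_group(PRICE_PATTERN, "$20"))         # None: the pattern requires a decimal part
```

Because the pattern requires digits on both sides of the decimal point, whole-dollar prices such as $20 would not match; a pattern like r"\$(\d+(?:\.\d+)?)" would be one way to accept both.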

5. Nested Field Extraction

Use Case: Useful when content contains sub-elements, such as an article with author details within it.

Example HTML:

<div class="article">
    <h1 class="title">Sample Article</h1>
    <div class="author">
        <span class="name">John Doe</span>
        <span class="bio">Writer and editor</span>
    </div>
</div>

Schema:

schema = {
    "baseSelector": ".article",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "author", "type": "nested", "selector": ".author", "fields": [
            {"name": "name", "selector": ".name", "type": "text"},
            {"name": "bio", "selector": ".bio", "type": "text"}
        ]}
    ]
}

Explanation:
- nested: Extracts name and bio within .author, grouping the author details in a single author object.
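To show the shape a nested field produces, here is a small recursive sketch over the article HTML above. It is an illustration built on xml.etree.ElementTree, not Crawl4AI's code; by_class and extract_fields are hypothetical helpers that handle only bare ".class" selectors.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="article">
    <h1 class="title">Sample Article</h1>
    <div class="author">
        <span class="name">John Doe</span>
        <span class="bio">Writer and editor</span>
    </div>
</div>
"""

fields = [
    {"name": "title", "selector": ".title", "type": "text"},
    {"name": "author", "type": "nested", "selector": ".author", "fields": [
        {"name": "name", "selector": ".name", "type": "text"},
        {"name": "bio", "selector": ".bio", "type": "text"},
    ]},
]


def by_class(node, selector):
    """Find descendants for a bare '.class' selector (the only form handled here)."""
    return node.findall(f".//*[@class='{selector.lstrip('.')}']")


def extract_fields(node, fields):
    out = {}
    for field in fields:
        matches = by_class(node, field["selector"])
        if not matches:
            out[field["name"]] = None
        elif field["type"] == "nested":
            # Recurse: sub-fields are resolved relative to the nested element.
            out[field["name"]] = extract_fields(matches[0], field["fields"])
        else:
            out[field["name"]] = "".join(matches[0].itertext()).strip()
    return out


article = ET.fromstring(HTML)  # the root element is the .article block itself
print(extract_fields(article, fields))
```

The author details come back as a dictionary under the author key rather than as flat top-level fields.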

6. List and Nested List Extraction

List: Extracts every element matching the selector and returns the results as a list.
Nested List: Allows lists within lists, useful for items with sub-lists (e.g., specifications for each product). The example in this section shows the plain list case only.

Example HTML:

<div class="product">
    <h2 class="title">Product with Features</h2>
    <ul class="features">
        <li class="feature">Feature 1</li>
        <li class="feature">Feature 2</li>
        <li class="feature">Feature 3</li>
    </ul>
</div>

Schema:

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
            {"name": "feature", "type": "text"}
        ]}
    ]
}

Explanation:
- list: Captures each .feature item within .features, outputting an array of features under the features field.
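The list case can be sketched the same way with the standard library. The output shape here, one small object per list item keyed by the sub-field name, is an assumption about how list fields with sub-fields are returned, and the ElementTree lookups stand in for the ".features .feature" descendant selector.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="product">
    <h2 class="title">Product with Features</h2>
    <ul class="features">
        <li class="feature">Feature 1</li>
        <li class="feature">Feature 2</li>
        <li class="feature">Feature 3</li>
    </ul>
</div>
"""

product = ET.fromstring(HTML)

# ".features .feature" means: .feature elements anywhere inside a .features element.
items = []
for container in product.findall(".//*[@class='features']"):
    items.extend(container.findall(".//*[@class='feature']"))

record = {
    "title": "".join(product.find(".//*[@class='title']").itertext()).strip(),
    # One entry per matched element; the sub-field has no selector, so it
    # reads the text of the list item itself.
    "features": [{"feature": "".join(li.itertext()).strip()} for li in items],
}

print(record)
```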

7. Transformations for Field Values

Transformations allow you to modify extracted values (e.g., converting to lowercase).
Supported transformations: lowercase, uppercase, strip.

Example HTML:

<div class="product">
    <h2 class="title">Special Product</h2>
</div>

Schema:

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"}
    ]
}

Explanation: The transform property converts the extracted title to uppercase (SPECIAL PRODUCT), which helps keep output values consistent.
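The three named transformations map directly onto Python string methods. The following is a minimal sketch of that mapping, not Crawl4AI's code; apply_transform is a hypothetical helper, and leaving values unchanged for unknown transform names is a choice made here for illustration.

```python
# Transform names from the schema mapped to plain string methods.
TRANSFORMS = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "strip": str.strip,
}


def apply_transform(value, name=None):
    """Apply a named transform; missing values and unknown names pass through unchanged."""
    if value is None or name not in TRANSFORMS:
        return value
    return TRANSFORMS[name](value)


print(apply_transform("Special Product", "uppercase"))    # SPECIAL PRODUCT
print(apply_transform("Special Product", "lowercase"))    # special product
print(repr(apply_transform("  Special Product  ", "strip")))  # 'Special Product'
```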

8. Full JSON-CSS Extraction Example

The schema below combines several of the field types and transformations covered above in a single example:

Example HTML:

<div class="product">
    <h2 class="title">Featured Product</h2>
    <img class="product-image" src="product.jpg">
    <span class="price">$99.99</span>
    <p class="description">Best product of the year.</p>
    <ul class="features">
        <li class="feature">Durable</li>
        <li class="feature">Eco-friendly</li>
    </ul>
</div>

Schema:

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
        {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
        {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
        {"name": "description", "selector": ".description", "type": "html"},
        {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
            {"name": "feature", "type": "text"}
        ]}
    ]
}

Explanation: For each .product element, this schema returns the uppercased title, the image src, the numeric price, the description as HTML, and the list of features.
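A schema like this is typically handed to the crawler through JsonCssExtractionStrategy. The sketch below assumes the AsyncWebCrawler API of the Crawl4AI releases this outline was written for, where arun accepts an extraction_strategy argument; newer releases may route this through a run configuration object instead. The extract_products function name and the URL are placeholders, and the schema is shortened for brevity.

```python
import asyncio
import json


async def extract_products(url: str, schema: dict) -> list:
    # Imported lazily so the sketch can be loaded without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Assumption: this version of arun takes the strategy as a keyword argument.
        result = await crawler.arun(url=url, extraction_strategy=strategy)
    # extracted_content is expected to be a JSON string, one object per baseSelector match.
    return json.loads(result.extracted_content)


schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
        {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
    ],
}

# Example call (placeholder URL, not a real product page):
# products = asyncio.run(extract_products("https://example.com/products", schema))
```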

9. Wrap Up & Next Steps

Summarize JSON-CSS Extraction’s flexibility for structured, pattern-based extraction.
Tease the next video: 10.2 LLM Extraction Strategy, focusing on using language models to extract data based on intelligent content analysis.
