Summarization Example

This example demonstrates how to use Crawl4AI to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.

Step-by-Step Guide

  1. Import Necessary Modules

    First, import the necessary modules and classes.

    ```python
    import os
    import time
    import json
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    from crawl4ai.crawler_strategy import *
    from pydantic import BaseModel, Field
    ```

  2. Define the URL to be Crawled

    Set the URL of the web page you want to summarize.

    ```python
    url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
    ```

  3. Initialize the WebCrawler

    Create an instance of the WebCrawler and call the warmup method.

    ```python
    crawler = WebCrawler()
    crawler.warmup()
    ```

  4. Define the Data Model

    Use Pydantic to define the structure of the extracted data.

    ```python
    class PageSummary(BaseModel):
        title: str = Field(..., description="Title of the page.")
        summary: str = Field(..., description="Summary of the page.")
        brief_summary: str = Field(..., description="Brief summary of the page.")
        keywords: list = Field(..., description="Keywords assigned to the page.")
    ```
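
To see what the model contributes to the extraction step, the following sketch prints the field names from the JSON Schema generated by `model_json_schema()` (the Pydantic v2 method used later in this guide) and validates a hand-written dictionary against the model. The sample values are made up for illustration and do not come from a real crawl.

```python
from pydantic import BaseModel, Field, ValidationError


class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")


# The JSON Schema that is later passed to the extraction strategy.
schema = PageSummary.model_json_schema()
print(sorted(schema["properties"]))

# Hypothetical sample data, only to show validation against the model.
sample = {
    "title": "Example title",
    "summary": "A longer description of the page.",
    "brief_summary": "A short description.",
    "keywords": ["example", "demo"],
}
validated = PageSummary.model_validate(sample)

# A dictionary with missing required fields is rejected.
try:
    PageSummary.model_validate({"title": "Only a title"})
    missing_fields_rejected = False
except ValidationError:
    missing_fields_rejected = True
```

Because every field is declared with `Field(...)`, all four are required, so an incomplete dictionary raises `ValidationError`.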

  5. Run the Crawler

    Set up and run the crawler with the LLMExtractionStrategy. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.

    ```python
    result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            apply_chunking=False,
            instruction=(
                "From the crawled content, extract the following details: "
                "1. Title of the page "
                "2. Summary of the page, which is a detailed summary "
                "3. Brief summary of the page, which is a paragraph text "
                "4. Keywords assigned to the page, which is a list of keywords. "
                'The extracted JSON format should look like this: '
                '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
                '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
            )
        ),
        bypass_cache=True,
    )
    ```
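
Since `os.getenv` returns `None` when a variable is unset, a missing API key may only surface later as a provider error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of Crawl4AI.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value
```

With this helper, the strategy argument could be written as `api_token=require_env("OPENAI_API_KEY")`, so the script stops before any crawling starts if the key is absent.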

  6. Process the Extracted Data

    Parse the extracted content, which is returned as a JSON string, into a Python object and print it.

    ```python
    page_summary = json.loads(result.extracted_content)
    print(page_summary)
    ```
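
Before relying on the parsed data, it can help to check that the fields named in the instruction are present. The sketch below assumes the payload is either a single JSON object or a list of such objects; that shape is an assumption, not something this guide confirms. `parse_summaries` is a hypothetical helper, and the sample string stands in for `result.extracted_content`.

```python
import json

EXPECTED_KEYS = {"title", "summary", "brief_summary", "keywords"}


def parse_summaries(extracted_content: str) -> list:
    """Parse extracted JSON text and return a list of summary dictionaries."""
    data = json.loads(extracted_content)
    # Assumption: the payload is either one object or a list of objects.
    items = data if isinstance(data, list) else [data]
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("Each extracted item must be a JSON object.")
        missing = EXPECTED_KEYS - set(item)
        if missing:
            raise ValueError(f"Missing keys: {sorted(missing)}")
    return items


# Hypothetical payload standing in for result.extracted_content.
sample = json.dumps({
    "title": "Page Title",
    "summary": "Detailed summary of the page.",
    "brief_summary": "Brief summary in a paragraph.",
    "keywords": ["keyword1", "keyword2", "keyword3"],
})
summaries = parse_summaries(sample)
print(summaries[0]["title"])
```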

  7. Save the Extracted Data

    Save the extracted data to a file for further use.

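    ```python
    # Create the output directory first; open() does not create missing folders.
    os.makedirs(".data", exist_ok=True)
    with open(".data/page_summary.json", "w", encoding="utf-8") as f:
        f.write(result.extracted_content)
    ```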
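
A reusable variant of this step can be sketched as a small helper that checks the text is valid JSON before writing it and creates the target folder if needed. `save_json_text` is a hypothetical name chosen for this illustration.

```python
import json
from pathlib import Path


def save_json_text(content: str, path: str) -> Path:
    """Check that content is valid JSON, then write it, creating parent folders."""
    json.loads(content)  # raises json.JSONDecodeError (a ValueError) if invalid
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```

It could be called as `save_json_text(result.extracted_content, ".data/page_summary.json")`, which keeps malformed output from being written to disk unnoticed.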

Explanation

  • Importing Modules: Import the necessary modules, including WebCrawler and LLMExtractionStrategy from Crawl4AI.
  • URL Definition: Set the URL of the web page you want to crawl and summarize.
  • WebCrawler Initialization: Create an instance of WebCrawler and call the warmup method to prepare the crawler.
  • Data Model Definition: Define the structure of the data you want to extract using Pydantic's BaseModel.
  • Crawler Execution: Run the crawler with the LLMExtractionStrategy, providing the schema and detailed instructions for the extraction process.
  • Data Processing: Parse the extracted JSON string into a Python object and print it to verify the results.
  • Data Saving: Save the extracted data to a file for further use.

This example shows how Crawl4AI combines crawling with LLM-based extraction to produce a structured page summary (title, detailed summary, brief summary, and keywords) in a few lines of code.