Summarization Example
This example demonstrates how to use Crawl4AI to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
Step-by-Step Guide
- Import Necessary Modules
  First, import the necessary modules and classes.

  ```python
  import os
  import time
  import json
  from crawl4ai.web_crawler import WebCrawler
  from crawl4ai.chunking_strategy import *
  from crawl4ai.extraction_strategy import *
  from crawl4ai.crawler_strategy import *
  from pydantic import BaseModel, Field
  ```
- Define the URL to be Crawled
  Set the URL of the web page you want to summarize.

  ```python
  url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
  ```
- Initialize the WebCrawler
  Create an instance of `WebCrawler` and call the `warmup` method.

  ```python
  crawler = WebCrawler()
  crawler.warmup()
  ```
- Define the Data Model
  Use Pydantic to define the structure of the extracted data.

  ```python
  class PageSummary(BaseModel):
      title: str = Field(..., description="Title of the page.")
      summary: str = Field(..., description="Summary of the page.")
      brief_summary: str = Field(..., description="Brief summary of the page.")
      keywords: list = Field(..., description="Keywords assigned to the page.")
  ```
- Run the Crawler
  Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.

  ```python
  result = crawler.run(
      url=url,
      word_count_threshold=1,
      extraction_strategy=LLMExtractionStrategy(
          provider="openai/gpt-4o",
          api_token=os.getenv('OPENAI_API_KEY'),
          schema=PageSummary.model_json_schema(),
          extraction_type="schema",
          apply_chunking=False,
          instruction=(
              "From the crawled content, extract the following details: "
              "1. Title of the page "
              "2. Summary of the page, which is a detailed summary "
              "3. Brief summary of the page, which is a paragraph text "
              "4. Keywords assigned to the page, which is a list of keywords. "
              'The extracted JSON format should look like this: '
              '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
              '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
          )
      ),
      bypass_cache=True,
  )
  ```
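Note that `os.getenv('OPENAI_API_KEY')` returns `None` when the variable is unset, so a missing key would likely only surface once the LLM is called. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this illustration and is not part of Crawl4AI.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of Crawl4AI.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value


# Usage (assumes the key was exported in the shell beforehand):
# api_token = require_env("OPENAI_API_KEY")
```

The returned value could then be passed as `api_token` in place of the direct `os.getenv` call.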
- Process the Extracted Data
  Load the extracted content into a JSON object and print it.

  ```python
  page_summary = json.loads(result.extracted_content)
  print(page_summary)
  ```
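`json.loads` raises an exception if the extracted content is `None` or not valid JSON, and the parsed value may turn out to be a list of objects rather than a single object. The sketch below is one defensive way to handle both cases. `normalize_summary` and `REQUIRED_KEYS` are illustrative names, and the list-of-objects shape is an assumption rather than documented behavior.

```python
import json

# Keys taken from the PageSummary model defined in this example.
REQUIRED_KEYS = {"title", "summary", "brief_summary", "keywords"}


def normalize_summary(raw):
    """Parse extracted content and return a single summary dict.

    Hypothetical helper: accepts a JSON object or a JSON list of objects
    (assumed shape) and returns the first object with all required keys.
    """
    if not raw:
        raise ValueError("No extracted content to parse.")
    data = json.loads(raw)
    candidates = data if isinstance(data, list) else [data]
    for item in candidates:
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item):
            return item
    raise ValueError("No object with the expected keys was found.")


sample = '[{"title": "T", "summary": "S", "brief_summary": "B", "keywords": ["k"]}]'
print(normalize_summary(sample)["title"])  # prints: T
```

In the example above, this would be called as `normalize_summary(result.extracted_content)`.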
- Save the Extracted Data
  Save the extracted data to a file for further use.

  ```python
  with open(".data/page_summary.json", "w", encoding="utf-8") as f:
      f.write(result.extracted_content)
  ```
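The `open` call above fails with `FileNotFoundError` if the `.data` directory does not exist. A small sketch using `pathlib` creates the parent directory first. `save_summary` is a hypothetical helper name, and re-serializing with `indent=2` is a choice made here, not something the original example does.

```python
import json
from pathlib import Path


def save_summary(extracted_content: str, path: str) -> Path:
    """Write extracted JSON text to `path`, creating parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Parse and re-serialize so the file is valid, consistently indented JSON.
    data = json.loads(extracted_content)
    target.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


# Usage with the path from the example above:
# save_summary(result.extracted_content, ".data/page_summary.json")
```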
Explanation
- Importing Modules: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from Crawl4AI.
- URL Definition: Set the URL of the web page you want to crawl and summarize.
- WebCrawler Initialization: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
- Data Model Definition: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
- Crawler Execution: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
- Data Processing: Load the extracted content into a JSON object and print it to verify the results.
- Data Saving: Save the extracted data to a file for further use.
This example demonstrates how to use Crawl4AI to crawl a page and extract structured data from it with an LLM, using minimal code.