Crawl Request Parameters
The run function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the run function, along with their descriptions, possible values, and examples.
Parameters
url (str)
Description: The URL of the webpage to crawl.
Required: Yes
Example:
url = "https://www.nbcnews.com/business"
word_count_threshold (int)
Description: The minimum number of words a block must contain to be considered meaningful.
Required: No
Default Value: 5
Example:
word_count_threshold = 10
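To illustrate what this threshold does conceptually (this is a simplified sketch, not Crawl4AI's internal implementation, and filter_blocks is a hypothetical name), short boilerplate blocks are dropped while substantive blocks survive:

```python
def filter_blocks(blocks, word_count_threshold=5):
    # Keep only blocks with at least word_count_threshold words --
    # a rough stand-in for how a minimum-word filter behaves.
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = [
    "Subscribe now",
    "Markets rallied today as tech stocks led broad gains across major indexes.",
]
print(filter_blocks(blocks, word_count_threshold=10))
# ['Markets rallied today as tech stocks led broad gains across major indexes.']
```

With a higher threshold, navigation labels and button captions tend to fall away while article paragraphs remain.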
extraction_strategy (ExtractionStrategy)
Description: The strategy to use for extracting content from the HTML. It must be an instance of ExtractionStrategy. If not provided, the default is NoExtractionStrategy.
Required: No
Default Value: NoExtractionStrategy()
Example:
extraction_strategy = CosineStrategy(semantic_filter="finance")
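CosineStrategy keeps content that is semantically close to the filter term. As a rough, self-contained illustration of the idea only (using word overlap rather than the embedding-based cosine similarity the real strategy uses; similarity and semantic_filter_blocks are hypothetical names):

```python
def similarity(a, b):
    # Jaccard overlap of lowercase word sets -- a crude stand-in for
    # cosine similarity over text embeddings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_filter_blocks(blocks, query, threshold=0.1):
    # Keep blocks whose similarity to the query meets the threshold.
    return [b for b in blocks if similarity(b, query) >= threshold]

blocks = ["finance news on interest rates", "celebrity gossip roundup"]
print(semantic_filter_blocks(blocks, "finance interest rates"))
# ['finance news on interest rates']
```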
chunking_strategy (ChunkingStrategy)
Description: The strategy to use for chunking the text before processing. It must be an instance of ChunkingStrategy.
Required: No
Default Value: RegexChunking()
Example:
chunking_strategy = NlpSentenceChunking()
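The general idea behind a regex-based chunker such as RegexChunking can be sketched in a few lines (this is an illustration assuming a split on blank lines, not the library's actual code; regex_chunk is a hypothetical name):

```python
import re

def regex_chunk(text, pattern=r"\n\n+"):
    # Split text into chunks on the pattern (blank lines here) and
    # discard empty chunks.
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
print(regex_chunk(text))
# ['First paragraph.', 'Second paragraph.', 'Third.']
```

A sentence-based strategy like NlpSentenceChunking splits on sentence boundaries instead of a fixed pattern.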
bypass_cache (bool)
Description: Whether to force a fresh crawl even if the URL has been previously crawled.
Required: No
Default Value: False
Example:
bypass_cache = True
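The caching behavior this flag controls can be illustrated with a toy cache (a conceptual sketch only, not Crawl4AI's cache implementation; crawl and fetch here are hypothetical names):

```python
cache = {}

def crawl(url, bypass_cache=False, fetch=lambda u: f"<html for {u}>"):
    # Serve the cached page unless bypass_cache forces a fresh fetch;
    # either way, the latest result is stored back in the cache.
    if not bypass_cache and url in cache:
        return cache[url]
    page = fetch(url)
    cache[url] = page
    return page

crawl("https://example.com")                     # fetched and cached
crawl("https://example.com")                     # served from cache
crawl("https://example.com", bypass_cache=True)  # fetched fresh again
```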
css_selector (str)
Description: The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
Required: No
Default Value: None
Example:
css_selector = "div.article-content"
screenshot (bool)
Description: Whether to take a screenshot of the page.
Required: No
Default Value: False
Example:
screenshot = True
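Assuming the crawl result exposes the screenshot as a base64-encoded string (e.g. on a result.screenshot attribute; verify against your installed version), it could be written to disk like this (save_screenshot is a hypothetical helper name):

```python
import base64

def save_screenshot(screenshot_b64, path="page.png"):
    # Decode the base64 payload and write the raw image bytes to disk.
    with open(path, "wb") as f:
        f.write(base64.b64decode(screenshot_b64))

# e.g. save_screenshot(result.screenshot) after a run with screenshot=True
```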
user_agent (str)
Description: The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
Required: No
Default Value: None
Example:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
verbose (bool)
Description: Whether to enable verbose logging.
Required: No
Default Value: True
Example:
verbose = True
**kwargs
Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:
- only_text (bool): Whether to extract only text content, excluding HTML tags. Default is False.
Example:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True
)
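Conceptually, only_text strips markup and keeps the text nodes. A minimal sketch of that idea using only the standard library (not Crawl4AI's actual implementation; TextExtractor is a hypothetical name):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of a document, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

extractor = TextExtractor()
extractor.feed("<p>Stocks <b>rose</b> today.</p>")
print(extractor.text())  # Stocks rose today.
```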
Example Usage
Here's an example of how to use the run function with various parameters:
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

# Create the WebCrawler instance
crawler = WebCrawler()

# Run the crawler with custom parameters
result = crawler.run(
    url="https://www.nbcnews.com/business",
    word_count_threshold=10,
    extraction_strategy=CosineStrategy(semantic_filter="finance"),
    chunking_strategy=NlpSentenceChunking(),
    bypass_cache=True,
    css_selector="div.article-content",
    screenshot=True,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    verbose=True,
    only_text=True
)
print(result)
This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.