Core Classes and Functions

Overview

This section covers the core classes and functions of the Crawl4AI library: the WebCrawler class and the CrawlerStrategy, ChunkingStrategy, and ExtractionStrategy class families. Understanding these components will help you make full use of Crawl4AI for web crawling and data extraction.

WebCrawler Class

The WebCrawler class is the main class you'll interact with. It provides the interface for crawling web pages and extracting data.

Initialization

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

Methods

  • warmup(): Prepares the crawler for use, for example by loading the necessary models.
  • run(url: str, **kwargs): Runs the crawler on the specified URL, with optional keyword arguments for customization.

crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")
print(result)

CrawlerStrategy Classes

The CrawlerStrategy classes define how a page is fetched. The base class is CrawlerStrategy, which is extended by concrete implementations such as LocalSeleniumCrawlerStrategy.

CrawlerStrategy Base Class

An abstract base class that defines the interface for different crawler strategies.

from abc import ABC, abstractmethod
from typing import Callable

class CrawlerStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str:
        pass

    @abstractmethod
    def take_screenshot(self, save_path: str):
        pass

    @abstractmethod
    def update_user_agent(self, user_agent: str):
        pass

    @abstractmethod
    def set_hook(self, hook_type: str, hook: Callable):
        pass
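
To illustrate how the interface is meant to be filled in, here is a minimal sketch of a custom strategy. It redeclares the base class locally so the snippet runs without Crawl4AI installed. InMemoryCrawlerStrategy and its pages argument are hypothetical names invented for this example and are not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class CrawlerStrategy(ABC):
    """Local copy of the interface shown above, so this sketch runs standalone."""

    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str: ...

    @abstractmethod
    def take_screenshot(self, save_path: str): ...

    @abstractmethod
    def update_user_agent(self, user_agent: str): ...

    @abstractmethod
    def set_hook(self, hook_type: str, hook: Callable): ...


class InMemoryCrawlerStrategy(CrawlerStrategy):
    """Hypothetical strategy that serves canned HTML instead of using a browser."""

    def __init__(self, pages: Dict[str, str]):
        self.pages = pages
        self.user_agent = "in-memory-crawler"
        self.hooks: Dict[str, Callable] = {}

    def crawl(self, url: str, **kwargs) -> str:
        # Unknown URLs yield an empty document rather than raising
        return self.pages.get(url, "")

    def take_screenshot(self, save_path: str):
        raise NotImplementedError("no browser is involved, so there is nothing to capture")

    def update_user_agent(self, user_agent: str):
        self.user_agent = user_agent

    def set_hook(self, hook_type: str, hook: Callable):
        self.hooks[hook_type] = hook


strategy = InMemoryCrawlerStrategy({"https://www.example.com": "<html><body>Hello</body></html>"})
html = strategy.crawl("https://www.example.com")
```

Because every abstract method is implemented, the class can be instantiated. Leaving any one of them out would make Python raise a TypeError at construction time, which is the main benefit of declaring the interface with ABC.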

LocalSeleniumCrawlerStrategy Class

A concrete implementation of CrawlerStrategy that uses Selenium to crawl web pages.

Initialization

from crawl4ai.crawler_strategy import LocalSeleniumCrawlerStrategy

strategy = LocalSeleniumCrawlerStrategy(js_code=["console.log('Hello, world!');"])

Methods

  • crawl(url: str, **kwargs): Crawls the specified URL.
  • take_screenshot(save_path: str): Takes a screenshot of the current page.
  • update_user_agent(user_agent: str): Updates the user agent for the browser.
  • set_hook(hook_type: str, hook: Callable): Sets a hook for various events.

result = strategy.crawl("https://www.example.com")
strategy.take_screenshot("screenshot.png")
strategy.update_user_agent("Mozilla/5.0")
strategy.set_hook("before_get_url", lambda: print("About to get URL"))
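
The hook mechanism can be pictured as a small registry keyed by event name. The sketch below is a hypothetical illustration and not Crawl4AI's implementation: HookRegistry, fire, and the assumption that a hook receives the URL as its argument are all invented for the example.

```python
from typing import Any, Callable, Dict, List


class HookRegistry:
    """Hypothetical registry: stores one callable per event name."""

    def __init__(self) -> None:
        self._hooks: Dict[str, Callable[..., Any]] = {}

    def set_hook(self, hook_type: str, hook: Callable[..., Any]) -> None:
        # A later call with the same hook_type replaces the earlier hook
        self._hooks[hook_type] = hook

    def fire(self, hook_type: str, *args: Any) -> Any:
        # Events without a registered hook are silently skipped
        hook = self._hooks.get(hook_type)
        return hook(*args) if hook else None


events: List[str] = []
registry = HookRegistry()
registry.set_hook("before_get_url", lambda url: events.append(f"about to get {url}"))
registry.fire("before_get_url", "https://www.example.com")
registry.fire("after_get_url", "https://www.example.com")  # nothing registered: no-op
```

Keeping hooks in a dictionary means the crawler code only needs to call fire at each event point, while callers decide which events they care about.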

ChunkingStrategy Classes

The ChunkingStrategy classes define how the text from a web page is divided into chunks. Here are a few examples:

RegexChunking Class

Splits text using regular expressions.

from crawl4ai.chunking_strategy import RegexChunking

chunker = RegexChunking(patterns=[r'\n\n'])
# The sample contains a blank line so that the r'\n\n' pattern has something to split on
chunks = chunker.chunk("This is a sample text.\n\nIt will be split into chunks.")
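
For intuition, regex-based chunking can be approximated with the standard library alone. This is a sketch under assumptions, not the library's code: regex_chunk is a hypothetical helper, and applying the patterns one after another and dropping empty pieces are choices made for this example.

```python
import re
from typing import List


def regex_chunk(text: str, patterns: List[str]) -> List[str]:
    """Split text on each pattern in turn, dropping empty pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [c.strip() for c in chunks if c.strip()]


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = regex_chunk(sample, [r"\n\n"])
print(chunks)  # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Text that contains no match for any pattern comes back as a single chunk, so the choice of patterns should reflect how the source pages actually separate their content.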

NlpSentenceChunking Class

Uses NLP to split text into sentences.

from crawl4ai.chunking_strategy import NlpSentenceChunking

chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")

ExtractionStrategy Classes

The ExtractionStrategy classes define how meaningful content is extracted from the chunks. Here are a few examples:

CosineStrategy Class

Clusters text chunks based on cosine similarity.

from crawl4ai.extraction_strategy import CosineStrategy

extractor = CosineStrategy(semantic_filter="finance", word_count_threshold=10)
extracted_content = extractor.extract(url="https://www.example.com", html="<html>...</html>")
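
The idea of grouping chunks by cosine similarity can be sketched without any dependencies. The example below is an illustration only: it substitutes simple word-count vectors for whatever text representation Crawl4AI uses internally, and vectorize, cosine_similarity, cluster_chunks, and the 0.3 threshold are hypothetical choices for this sketch.

```python
import math
import re
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: lowercase token -> occurrence count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_chunks(chunks: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose seed chunk is similar enough."""
    clusters: List[List[str]] = []
    seeds: List[Counter] = []
    for chunk in chunks:
        vec = vectorize(chunk)
        for i, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                clusters[i].append(chunk)
                break
        else:
            # No existing cluster matched, so this chunk starts a new one
            clusters.append([chunk])
            seeds.append(vec)
    return clusters


chunks = [
    "Stock markets rallied as bank shares rose.",
    "Bank shares rose after stock markets opened.",
    "The new phone has a better camera.",
]
clusters = cluster_chunks(chunks)
```

Here the two finance sentences share five of their seven words and land in one cluster, while the phone sentence shares none and forms its own.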

LLMExtractionStrategy Class

Uses a large language model (LLM) to extract meaningful blocks from HTML.

from crawl4ai.extraction_strategy import LLMExtractionStrategy

extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url="https://www.example.com", html="<html>...</html>")

Conclusion

By understanding these core classes and functions, you can customize and extend Crawl4AI to suit your specific web crawling and data extraction needs. Happy crawling! 🕷️🤖