There are three ways to use Crawl4AI: install it as a library with pip, install it from source, or run it as a local server with Docker.
To install Crawl4AI as a library, follow these steps:
virtualenv venv
source venv/bin/activate
pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
# Download the models used by some of the built-in strategies
crawl4ai-download-models
To install Crawl4AI from source instead, clone the repository and install it in editable mode:

virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .[all]
To run Crawl4AI as a local server, build and run the Docker image:

docker build -t crawl4ai .
# For Mac users: docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
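Once the container is running, the server is reachable on port 8000 of the host. As a minimal sketch, assuming the server exposes a POST /crawl endpoint that accepts a JSON list of URLs (the exact path and payload shape are assumptions; the repository documents the actual REST API), a request could look like this:

import requests

# Hypothetical request; endpoint path and payload shape are assumptions,
# check the repository's server documentation for the real API.
response = requests.post(
    "http://localhost:8000/crawl",
    json={"urls": ["https://www.nbcnews.com/business"]},
)
print(response.json())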
To crawl a page with the library, create an instance of WebCrawler and call its warmup() function before the first run:

from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")
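The run() call returns a result object holding the different representations of the page. A small sketch of inspecting it (attribute names such as markdown and cleaned_html reflect the CrawlResult model and may differ slightly between versions):

print(result.success)             # whether the crawl succeeded
print(result.markdown[:500])      # Markdown rendering of the page content
print(result.cleaned_html[:500])  # sanitized HTML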
Set `bypass_cache` to True if you want to re-crawl a URL, for example to try different strategies on the same page; otherwise the cached result from the first crawl is returned:

result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)

You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
To exclude the raw HTML from the result, set `include_raw_html` to False:

result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
To always bypass the cache on an existing crawler, set `always_by_pass_cache` to True:

crawler.always_by_pass_cache = True
To capture a screenshot of the page, set `screenshot` to True and decode the base64-encoded image from the result:

import base64

result = crawler.run(
    url="https://www.nbcnews.com/business",
    screenshot=True
)
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
To control how the extracted text is split, pass a chunking strategy via `chunking_strategy`. For example, RegexChunking splits the content on blank lines:

from crawl4ai.chunking_strategy import RegexChunking

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
NlpSentenceChunking splits the content into sentences instead:

from crawl4ai.chunking_strategy import NlpSentenceChunking

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
To control what gets extracted, pass an extraction strategy via `extraction_strategy`. CosineStrategy clusters the page content by semantic similarity and keeps the most relevant clusters:

from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
LLMExtractionStrategy uses a language model to extract the relevant content:

import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
You can also pass an instruction to steer the extraction:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
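LLM-based extraction typically places its output as a JSON string in result.extracted_content; assuming that format (an assumption, not part of the original example), it can be parsed like this:

import json

# Assumes extracted_content is a JSON-encoded list of extracted blocks.
blocks = json.loads(result.extracted_content)
for block in blocks[:3]:
    print(block)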
To extract only the parts of the page that match a CSS selector, use `css_selector`:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]
crawler = WebCrawler(verbos=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js = js_code)
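The run() parameters shown above can also be combined in a single call, for example executing the JavaScript first and then restricting extraction to a CSS selector. This is a sketch, assuming `js` and `css_selector` are accepted together, which the separate examples above suggest but do not guarantee:

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p"
)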
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). We believe that building a business around this is not the right approach; instead, it should definitely be open-source. So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all.