Docker Deployment
Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.
Quick Start
Pull and run the basic version:
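```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```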
Test the deployment:
```python
import requests

# Test health endpoint
health = requests.get("http://localhost:11235/health")
print("Health check:", health.json())

# Test basic crawl
response = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]
print("Task ID:", task_id)
```
Available Images

- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
- `unclecode/crawl4ai:all` - Full installation with all features
- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features
Configuration Options
Environment Variables
```bash
docker run -p 11235:11235 \
    -e MAX_CONCURRENT_TASKS=5 \
    -e OPENAI_API_KEY=your_key \
    unclecode/crawl4ai:all
```
Volume Mounting
Mount a directory for persistent data:
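For example (the host path and the `/app/data` container path are illustrative, not documented defaults):

```bash
docker run -p 11235:11235 \
    -v $(pwd)/data:/app/data \
    unclecode/crawl4ai:all
```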
Resource Limits
Control container resources:
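For example, using Docker's standard resource flags (the limits shown are illustrative):

```bash
docker run -p 11235:11235 \
    --memory=4g \
    --cpus=2 \
    unclecode/crawl4ai:all
```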
Usage Examples
Basic Crawling
```python
import requests

request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
Structured Data Extraction
```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```
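Submitting this request follows the same pattern as the basic crawl. The sketch below assumes the extracted rows come back as a JSON string under `extracted_content` in the task result; adjust the field name if your server version differs:

```python
import json
import time

import requests

# Reuse the `request` dict defined above
response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Poll until the task completes (same pattern as the tester class below)
while True:
    status = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if status["status"] == "completed":
        break
    time.sleep(2)

# Assumption: the schema output is a JSON string in "extracted_content"
prices = json.loads(status["result"]["extracted_content"])
print(prices[:3])
```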
Dynamic Content Handling
```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```
AI-Powered Extraction (Full Version)
```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```
Platform-Specific Instructions
macOS
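With Docker Desktop installed, the commands are the same as on Linux:

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```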
Ubuntu
```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```
Windows (PowerShell)
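The commands match the Linux ones; run them from a PowerShell prompt with Docker Desktop running:

```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```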
Testing

Save this as test_docker.py:
```python
import requests
import json
import time
import sys


class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timeout")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)


def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")


if __name__ == "__main__":
    test_deployment()
```
Advanced Configuration
Crawler Parameters
The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:
```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser Configuration
        "headless": True,                    # Run in headless mode
        "browser_type": "chromium",          # chromium/firefox/webkit
        "user_agent": "custom-agent",        # Custom user agent
        "proxy": "http://proxy:8080",        # Proxy configuration

        # Performance & Behavior
        "page_timeout": 30000,               # Page load timeout (ms)
        "verbose": True,                     # Enable detailed logging
        "semaphore_count": 5,                # Concurrent request limit

        # Anti-Detection Features
        "simulate_user": True,               # Simulate human behavior
        "magic": True,                       # Advanced anti-detection
        "override_navigator": True,          # Override navigator properties

        # Session Management
        "user_data_dir": "./browser-data",   # Browser profile location
        "use_managed_browser": True,         # Use persistent browser
    }
}
```
Extra Parameters
The `extra` field allows passing additional parameters directly to the crawler's `arun` function:
```python
request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,  # Min words per block
        "only_text": True,           # Extract only text
        "bypass_cache": True,        # Force fresh crawl
        "process_iframes": True,     # Include iframe content
    }
}
```
Complete Examples

- Advanced News Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True    # Remove popups
    },
    "extra": {
        "word_count_threshold": 50,        # Longer content blocks
        "bypass_cache": True               # Fresh content
    },
    "css_selector": ".article-body"
}
```
- Anti-Detection Configuration
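A minimal sketch assembled from the anti-detection and network parameters listed above; the user agent string and headers are placeholders, not recommended values:

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,        # Simulate human behavior
        "magic": True,                # Advanced anti-detection
        "override_navigator": True,   # Override navigator properties
        # Placeholder user agent and headers - substitute your own
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "headers": {"Accept-Language": "en-US,en;q=0.9"}
    }
}
```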
- LLM Extraction with Custom Parameters
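A sketch of an LLM-based extraction request. The `llm` extraction type appears in the API reference below; the `provider`, `api_token`, and `instruction` parameter names follow the library's LLM extraction strategy and should be verified against your server version:

```python
import os

request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "llm",
        "params": {
            # Parameter names follow the library's LLM extraction strategy;
            # confirm them for your deployment.
            "provider": "openai/gpt-4o-mini",
            "api_token": os.environ["OPENAI_API_KEY"],
            "instruction": "Extract the main business stories as headlines with one-sentence summaries."
        }
    },
    "extra": {"word_count_threshold": 10}
}
```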
- Session-Based Dynamic Content

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}
```
- Screenshot with Custom Timing
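A sketch under the assumption that the service accepts the library's `screenshot` flag at the top level of the request; `delay_before_return_html` comes from the parameter table below:

```python
request = {
    "urls": "https://example.com",
    # Assumption: "screenshot" mirrors the library's arun() option;
    # verify it is honored by your server version.
    "screenshot": True,
    "crawler_params": {
        "headless": True
    },
    "extra": {
        "delay_before_return_html": 3.0   # Wait before capture (seconds)
    }
}
```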
Parameter Reference Table
| Category | Parameter | Type | Description |
|---|---|---|---|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture (seconds) |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |
Troubleshooting
Common Issues

- Connection Refused

  Solution: Ensure the container is running and ports are properly mapped.

- Resource Limits

  Solution: Increase MAX_CONCURRENT_TASKS or container resources.

- GPU Access

  Solution: Ensure proper NVIDIA drivers are installed and use the `--gpus all` flag.
Debug Mode
Access container for debugging:
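For example (replace `<container_id>` with your container's ID or name; fall back to `sh` if `bash` is not present in the image):

```bash
docker exec -it <container_id> /bin/bash
```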
View container logs:
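```bash
docker logs -f <container_id>
```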
Best Practices

- Resource Management
  - Set appropriate memory and CPU limits
  - Monitor resource usage via the health endpoint
  - Use the basic version for simple crawling tasks
- Scaling
  - Use multiple containers for high load
  - Implement proper load balancing
  - Monitor performance metrics
- Security
  - Use environment variables for sensitive data
  - Implement proper network isolation
  - Apply security updates regularly
API Reference
Health Check
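```bash
curl http://localhost:11235/health
```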
Submit Crawl Task
```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```
Get Task Status
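Returns the task status and, once completed, the crawl result:

```http
GET /task/{task_id}
```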
For more details, visit the official documentation.