1. BrowserConfig – Controlling the Browser

BrowserConfig focuses on how the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)

1.1 Parameter Highlights

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| browser_type | "chromium", "firefox", or "webkit" (default: "chromium") | Which browser engine to use. "chromium" is typical for most sites; "firefox" or "webkit" suit specialized tests. |
| headless | bool (default: True) | Headless means no visible UI. False is handy for debugging. |
| viewport_width | int (default: 1080) | Initial page width in px. Useful for testing responsive layouts. |
| viewport_height | int (default: 600) | Initial page height in px. |
| proxy | str (default: None) | Single proxy URL if you want all traffic to go through it, e.g. "http://user:pass@proxy:8080". |
| proxy_config | dict (default: None) | For advanced or multi-proxy needs, specify details like {"server": "...", "username": "...", ...}. |
| use_persistent_context | bool (default: False) | If True, uses a persistent browser context (keeps cookies and sessions across runs). Also sets use_managed_browser=True. |
| user_data_dir | str or None (default: None) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| ignore_https_errors | bool (default: True) | If True, continues despite invalid certificates (common in dev/staging). |
| java_script_enabled | bool (default: True) | Disable if you want no JS overhead, or if only static content is needed. |
| cookies | list (default: []) | Pre-set cookies, each a dict like {"name": "session", "value": "...", "url": "..."}. |
| headers | dict (default: {}) | Extra HTTP headers for every request, e.g. {"Accept-Language": "en-US"}. |
| user_agent | str (default: Chrome-based UA) | Your custom or random user agent. user_agent_mode="random" can shuffle it. |
| light_mode | bool (default: False) | Disables some background features for performance gains. |
| text_mode | bool (default: False) | If True, tries to disable images and other heavy content for speed. |
| use_managed_browser | bool (default: False) | For advanced "managed" interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| extra_args | list (default: []) | Additional flags for the underlying browser process, e.g. ["--disable-extensions"]. |

Tips:
- Set headless=False to visually debug how pages load or how interactions proceed.
- If you need authentication storage or repeated sessions, consider use_persistent_context=True and specify user_data_dir (see the sketch below).
- For large pages, you may need a bigger viewport_width and viewport_height to handle dynamic content.
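
For example, a minimal sketch of a persistent-session setup using the parameters above (the profile path is just an illustration):

from crawl4ai import BrowserConfig

persistent_cfg = BrowserConfig(
    use_persistent_context=True,           # keep cookies and sessions across runs
    user_data_dir="/path/to/my_profile",   # hypothetical path; any writable directory works
    headless=True,
)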


2. CrawlerRunConfig – Controlling Each Crawl

While BrowserConfig sets up the environment, CrawlerRunConfig details how each crawl operation should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
)

2.1 Parameter Highlights

We group them by category.

A) Content Processing

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| word_count_threshold | int (default: ~200) | Skips text blocks with fewer than this many words. Helps ignore trivial sections. |
| extraction_strategy | ExtractionStrategy (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| markdown_generator | MarkdownGenerationStrategy (default: None) | For specialized markdown output (citations, filtering, chunking, etc.). |
| content_filter | RelevantContentFilter (default: None) | Filters out irrelevant text blocks, e.g. PruningContentFilter or BM25ContentFilter. |
| css_selector | str (default: None) | Retains only the part of the page matching this selector. |
| excluded_tags | list (default: None) | Removes entire tags, e.g. ["script", "style"]. |
| excluded_selector | str (default: None) | Like css_selector but for exclusion, e.g. "#ads, .tracker". |
| only_text | bool (default: False) | If True, tries to extract text-only content. |
| prettiify | bool (default: False) | If True, beautifies the final HTML (slower, purely cosmetic). |
| keep_data_attributes | bool (default: False) | If True, preserves data-* attributes in the cleaned HTML. |
| remove_forms | bool (default: False) | If True, removes all <form> elements. |
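
As an illustration, a typical content-pruning setup might combine several of these; all selectors here are placeholders:

from crawl4ai import CrawlerRunConfig

content_cfg = CrawlerRunConfig(
    css_selector="main.article",         # keep only the main article region
    excluded_selector="#ads, .tracker",  # drop ad/tracker nodes inside it
    excluded_tags=["script", "style", "nav"],
    word_count_threshold=10,             # ignore blocks shorter than 10 words
)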

B) Caching & Session

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| cache_mode | CacheMode or None | Controls how caching is handled (ENABLED, BYPASS, DISABLED, etc.). If None, typically defaults to ENABLED. |
| session_id | str or None | Assign a unique ID to reuse a single browser session across multiple arun() calls. |
| bypass_cache | bool (default: False) | If True, acts like CacheMode.BYPASS. |
| disable_cache | bool (default: False) | If True, acts like CacheMode.DISABLED. |
| no_cache_read | bool (default: False) | If True, acts like CacheMode.WRITE_ONLY (writes to the cache but never reads). |
| no_cache_write | bool (default: False) | If True, acts like CacheMode.READ_ONLY (reads from the cache but never writes). |

Use these to control whether the crawler reads from or writes to its local content cache. Handy for large batch crawls or repeated visits to the same site.
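
For instance, two one-line sketches using the CacheMode values listed above:

from crawl4ai import CrawlerRunConfig, CacheMode

fresh_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)      # skip the cache entirely for this run
warm_cfg = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)   # refresh the cache without reading from it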


C) Page Navigation & Timing

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| wait_until | str (default: "domcontentloaded") | Condition for navigation to be considered "complete". Often "networkidle" or "domcontentloaded". |
| page_timeout | int (default: 60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| wait_for | str or None (default: None) | Wait for a CSS ("css:selector") or JS ("js:() => bool") condition before extracting content. |
| wait_for_images | bool (default: False) | Wait for images to load before finishing. Slows things down if you only want text. |
| delay_before_return_html | float (default: 0.1) | Additional pause (seconds) before the final HTML is captured. Good for last-second updates. |
| mean_delay, max_range | float (defaults: 0.1, 0.3) | For arun_many(), these define random delay intervals between crawls, helping avoid detection or rate limits. |
| semaphore_count | int (default: 5) | Max concurrency for arun_many(). Increase if you have resources for parallel crawls. |
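
A sketch of a patient configuration for a slow, dynamic page; the JS readiness condition is illustrative:

from crawl4ai import CrawlerRunConfig

timing_cfg = CrawlerRunConfig(
    wait_until="networkidle",      # wait for the network to settle, not just DOMContentLoaded
    page_timeout=90000,            # 90 s for slow sites
    wait_for="js:() => document.querySelectorAll('.item').length > 10",  # hypothetical readiness check
    delay_before_return_html=0.5,  # half-second grace period before capturing HTML
)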

D) Page Interaction

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| js_code | str or list[str] (default: None) | JavaScript to run after load, e.g. "document.querySelector('button')?.click();". |
| js_only | bool (default: False) | If True, indicates we're reusing an existing session and only applying JS; no full reload. |
| ignore_body_visibility | bool (default: True) | Skip checking whether <body> is visible. Usually best left True. |
| scan_full_page | bool (default: False) | If True, auto-scrolls the page to load dynamic content (infinite scroll). |
| scroll_delay | float (default: 0.2) | Delay between scroll steps when scan_full_page=True. |
| process_iframes | bool (default: False) | Inlines iframe content for single-page extraction. |
| remove_overlay_elements | bool (default: False) | Removes modals/popups that may block the main content. |
| simulate_user | bool (default: False) | Simulates user interactions (mouse movements) to avoid bot detection. |
| override_navigator | bool (default: False) | Overrides navigator properties in JS for stealth. |
| magic | bool (default: False) | Automatic handling of popups/consent banners. Experimental. |
| adjust_viewport_to_content | bool (default: False) | Resizes the viewport to match the page's content height. |

If your page is a single-page app with repeated JS updates, set js_only=True in subsequent calls, plus a session_id to reuse the same tab, as sketched below.
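
A rough sketch of that pattern; the URL, button selector, and item selector are all placeholders:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def load_more_example():
    async with AsyncWebCrawler() as crawler:
        # First call: a full page load that establishes the session
        first = await crawler.arun(
            url="https://example.com/feed",              # placeholder URL
            config=CrawlerRunConfig(session_id="feed"),
        )
        # Follow-up call: reuse the same tab and only run JS (no reload)
        more = await crawler.arun(
            url="https://example.com/feed",
            config=CrawlerRunConfig(
                session_id="feed",
                js_only=True,
                js_code="document.querySelector('button.load-more')?.click();",
                wait_for="css:.item:nth-child(20)",      # wait until more items appear
            ),
        )
        return first, more

if __name__ == "__main__":
    asyncio.run(load_more_example())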


E) Media Handling

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| screenshot | bool (default: False) | Captures a screenshot (base64) in result.screenshot. |
| screenshot_wait_for | float or None (default: None) | Extra wait time before taking the screenshot. |
| screenshot_height_threshold | int (default: ~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| pdf | bool (default: False) | If True, returns a PDF in result.pdf. |
| image_description_min_word_threshold | int (default: ~50) | Minimum words for an image's alt text or description to be considered valid. |
| image_score_threshold | int (default: ~3) | Filters out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| exclude_external_images | bool (default: False) | Excludes images hosted on other domains. |
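
For example, capturing both artifacts in one pass, a sketch based only on the fields above:

from crawl4ai import CrawlerRunConfig

media_cfg = CrawlerRunConfig(
    screenshot=True,           # base64 image lands in result.screenshot
    screenshot_wait_for=1.0,   # give the page an extra second first
    pdf=True,                  # PDF data lands in result.pdf
    exclude_external_images=True,
)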

F) Link/Domain Handling

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| exclude_social_media_domains | list (default: a built-in list of social media domains, e.g. Facebook/Twitter) | Any link to these domains is removed from the final output; the default list can be extended. |
| exclude_external_links | bool (default: False) | Removes all links pointing outside the current domain. |
| exclude_social_media_links | bool (default: False) | Strips links specifically to social sites (like Facebook or Twitter). |
| exclude_domains | list (default: []) | A custom list of domains to exclude, e.g. ["ads.com", "trackers.io"]. |

Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
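
For example, a sketch of an internal-only crawl configuration (the blocked domains are the examples from the table):

from crawl4ai import CrawlerRunConfig

links_cfg = CrawlerRunConfig(
    exclude_external_links=True,                 # stay on the current domain
    exclude_social_media_links=True,             # drop links to social sites
    exclude_domains=["ads.com", "trackers.io"],  # plus a custom blocklist
)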


G) Debug & Logging

| Parameter | Type / Default | What It Does |
|-----------|----------------|--------------|
| verbose | bool (default: True) | Prints logs detailing each step of crawling, interactions, or errors. |
| log_console | bool (default: False) | Logs the page's JavaScript console output for deeper JS debugging. |
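
Both can be enabled together while developing:

from crawl4ai import CrawlerRunConfig

debug_cfg = CrawlerRunConfig(
    verbose=True,      # step-by-step crawl logs
    log_console=True,  # surface the page's JS console output
)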

2.2 Example Usage

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

What's Happening:
- text_mode=True avoids loading images and other heavy resources, speeding up the crawl.
- cache_mode=CacheMode.BYPASS disables caching so we always fetch fresh content.
- css_selector="main.article" keeps only the main.article content.
- exclude_external_links=True drops links pointing outside the current domain.
- screenshot=True captures a quick screenshot before finishing.


3. Putting It All Together

  • Use BrowserConfig for global browser settings: engine, headless, proxy, user agent.
  • Use CrawlerRunConfig for each crawl’s context: how to filter content, handle caching, wait for dynamic elements, or run JS.
  • Pass the BrowserConfig to the AsyncWebCrawler constructor and the CrawlerRunConfig to each arun() call.
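
In skeleton form, the hand-off looks like this (a minimal runnable sketch):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)                  # environment: engine, headless, proxy, UA
    run_cfg = CrawlerRunConfig(exclude_external_links=True)     # per-crawl behavior
    async with AsyncWebCrawler(config=browser_cfg) as crawler:  # browser settings bind here
        result = await crawler.arun(url="https://example.com", config=run_cfg)  # crawl settings bind here
        print(result.success)

if __name__ == "__main__":
    asyncio.run(main())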