arun() Parameter Guide (New Approach)

In Crawl4AI’s latest configuration model, nearly all parameters that once went directly to arun() are now part of CrawlerRunConfig. When calling arun(), you provide:

await crawler.arun(
    url="https://example.com",  
    config=my_run_config
)

Below is an organized look at the parameters that can go inside CrawlerRunConfig, grouped by functional area. For browser-level settings (e.g., headless, browser_type), see BrowserConfig.


1. Core Usage

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,            # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.cleaned_html[:500])

if __name__ == "__main__":
    asyncio.run(main())

Key Fields:
- verbose=True logs each crawl step.
- cache_mode decides how to read/write the local crawl cache.


2. Cache Control

cache_mode (default: CacheMode.ENABLED)
Use a value from the CacheMode enum:
- ENABLED: Normal caching; reads if available, writes if missing.
- DISABLED: No caching; always refetch pages.
- READ_ONLY: Reads from cache only; no new writes.
- WRITE_ONLY: Writes to cache but doesn’t read existing data.
- BYPASS: Skips reading cache for this crawl (though it might still write if set up that way).

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)

Additional flags:
- bypass_cache=True acts like CacheMode.BYPASS.
- disable_cache=True acts like CacheMode.DISABLED.
- no_cache_read=True acts like CacheMode.WRITE_ONLY.
- no_cache_write=True acts like CacheMode.READ_ONLY.
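For instance, based on the mappings above, these two configurations should behave the same:

# Both configs write to the cache without reading existing entries:
cfg_enum = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)
cfg_flag = CrawlerRunConfig(no_cache_read=True)  # legacy-style flag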


3. Content Processing & Selection

3.1 Text Processing

run_config = CrawlerRunConfig(
    word_count_threshold=10,   # Ignore text blocks <10 words
    only_text=False,           # If True, keep only textual content (drop non-text elements)
    keep_data_attributes=False # If True, preserve data-* attributes in the output
)

3.2 Content Selection

run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)

3.3 Link Filtering

run_config = CrawlerRunConfig(
    exclude_external_links=True,         # Remove external links from final content
    exclude_social_media_links=True,     # Remove links to known social sites
    exclude_domains=["ads.example.com"], # Exclude links to these domains
    exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)

3.4 Media Filtering

run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)

4. Page Navigation & Timing

4.1 Basic Browser Flow

run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content", # Wait for .dynamic-content
    delay_before_return_html=2.0,    # Wait 2s before capturing final HTML
    page_timeout=60000,             # Navigation & script timeout (ms)
)

Key Fields:
- wait_for: either "css:selector" or "js:() => boolean", e.g. js:() => document.querySelectorAll('.item').length > 10 (see the snippet below).
- mean_delay & max_range: define random delays for arun_many() calls.
- semaphore_count: concurrency limit when crawling multiple URLs.
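For example, a minimal sketch of the js: form of wait_for, using the predicate quoted above:

run_config = CrawlerRunConfig(
    # Wait until more than 10 '.item' elements exist before capturing the page
    wait_for="js:() => document.querySelectorAll('.item').length > 10"
)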

4.2 JavaScript Execution

run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
- js_code can be a single string or a list of strings.
- js_only=True means “I’m continuing in the same session with new JS steps, no new full navigation.”

4.3 Anti-Bot

run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
- magic=True enables multiple stealth features at once.
- simulate_user=True simulates human-like mouse movements and random delays.
- override_navigator=True spoofs certain navigator properties (the kind inspected by user-agent and bot checks).


5. Session Management

session_id:

run_config = CrawlerRunConfig(
    session_id="my_session123"
)
If reused across subsequent arun() calls, the same tab/page context continues (helpful for multi-step tasks or stateful browsing), as sketched below.
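A minimal two-step sketch (the .load-more selector is hypothetical) showing how session_id pairs with js_only from Section 4.2:

# Step 1: initial navigation creates the session
result1 = await crawler.arun(
    url="https://example.com/page1",
    config=CrawlerRunConfig(session_id="my_session123")
)

# Step 2: reuse the same tab; run extra JS without a fresh navigation
result2 = await crawler.arun(
    url="https://example.com/page1",
    config=CrawlerRunConfig(
        session_id="my_session123",
        js_code="document.querySelector('.load-more')?.click();",  # hypothetical selector
        js_only=True
    )
)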


6. Screenshot, PDF & Media Options

run_config = CrawlerRunConfig(
    screenshot=True,             # Grab a screenshot as base64
    screenshot_wait_for=1.0,     # Wait 1s before capturing
    pdf=True,                    # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                # Filter out low-score images
)
Where they appear:
- result.screenshot → Base64 screenshot string.
- result.pdf → Byte array with PDF data.
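A small sketch of persisting both artifacts, assuming the formats listed above (base64-encoded string and raw bytes; the .png extension is an assumption about the capture format):

import base64

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))  # decode base64 into image bytes
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)  # PDF data is already raw bytes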


7. Extraction Strategy

For advanced data extraction (CSS/LLM-based), set extraction_strategy:

run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)

The extracted data will appear in result.extracted_content.
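A minimal sketch of reading that data back, assuming the strategy serializes its results to JSON:

import json

if result.extracted_content:
    data = json.loads(result.extracted_content)  # assumes JSON-serialized output
    print(data)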


8. Comprehensive Example

Below is a snippet combining many parameters:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link",  "selector": "a",  "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

What we covered:
1. Crawling the main content region, ignoring external links.
2. Running JavaScript to click “.show-more”.
3. Waiting for “.loaded-block” to appear.
4. Generating a screenshot & PDF of the final page.
5. Extracting repeated “article.post” elements with a CSS-based extraction strategy.


9. Best Practices

1. Use BrowserConfig for global browser settings (headless, user agent).
2. Use CrawlerRunConfig to handle the specific crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your parameters consistent in run configs—especially if you’re part of a large codebase with multiple crawls.
4. Limit large concurrency (semaphore_count) if the site or your system can’t handle it (see the sketch after this list).
5. For dynamic pages, set js_code or scan_full_page so you load all content.
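A minimal concurrency sketch, assuming arun_many() accepts the same config object (the URLs are placeholders):

run_config = CrawlerRunConfig(
    semaphore_count=5,   # crawl at most 5 URLs concurrently
    mean_delay=1.0,      # random delay between arun_many() requests
    max_range=0.5
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com/a", "https://example.com/b"],
        config=run_config
    )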


10. Conclusion

All parameters that used to be direct arguments to arun() now belong in CrawlerRunConfig. This approach:

- Makes code clearer and more maintainable.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create reusable config objects for different pages or tasks.

For a full reference, check out the CrawlerRunConfig Docs.

Happy crawling with your structured, flexible config approach!