The Web Crawler API enables efficient web scraping and data extraction from websites, optimized for automation workflows. It supports output formats like Markdown and JSON, making it ideal for data processing.
Method: POST
URL: {{base_url}}/webcrawl
Authentication:
Required Parameters:
| Parameter Name | Value | Description |
|---|---|---|
| client_id | your_client_id | API authentication ID |
Required Headers:
| Header Name | Value | Description |
|---|---|---|
| client-secret | your_client_secret | API authentication secret |
| Content-Type | application/json | Data sent in JSON format |
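As a minimal sketch, the authentication pieces above can be assembled like this in Python. The `base_url`, `your_client_id`, and `your_client_secret` values are placeholders (the real values come from the My Profile section of your Bizdata account), and treating `client_id` as a query parameter is an assumption, since the documentation only lists it as a required parameter:

```python
# Placeholders: obtain real values from the My Profile section
# of your Bizdata account.
base_url = "https://api.example.com"
client_id = "your_client_id"
client_secret = "your_client_secret"

endpoint = f"{base_url}/webcrawl"
params = {"client_id": client_id}  # assumption: sent as a query parameter
headers = {
    "client-secret": client_secret,
    "Content-Type": "application/json",
}
```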
Input Formats:
JSON Body:
- Basic Crawl (Markdown Output, Default)

      {
        "url": "https://example.com",
        "deep_crawl": "false",
        "max_pages": 10
      }
- Deep Crawl

      {
        "url": "https://example.com",
        "deep_crawl": "true",
        "max_pages": 10
      }
- Deep Crawl (All Pages)

      {
        "url": "https://example.com",
        "deep_crawl": "true",
        "max_pages": "all"
      }
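The three request bodies above follow one pattern, so they can be built with a small helper. This is an illustrative sketch, not part of the API: `build_crawl_body` is a hypothetical function name. Note that the documented examples pass `deep_crawl` as the string "true" or "false" rather than a JSON boolean, and the helper preserves that:

```python
def build_crawl_body(url, deep_crawl=False, max_pages=10):
    """Build a request body matching the documented examples.

    Hypothetical helper for illustration; the API itself only
    defines the JSON shape. deep_crawl is serialized as the
    string "true"/"false", as in the examples above. max_pages
    may be an int, or the string "all" for a full deep crawl.
    """
    return {
        "url": url,
        "deep_crawl": "true" if deep_crawl else "false",
        "max_pages": max_pages,
    }

# Bodies matching the three documented examples:
basic = build_crawl_body("https://example.com")
deep = build_crawl_body("https://example.com", deep_crawl=True)
deep_all = build_crawl_body("https://example.com", deep_crawl=True, max_pages="all")
```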
Features of Web Crawler API:
- Blazing Fast Performance: Outperforms many paid services.
- Output Formats: Supports JSON, cleaned HTML, and Markdown formats.
- Multi-URL Crawling: Crawl multiple URLs simultaneously.
- Media Extraction: Extracts image, audio, and video tags.
- Link Extraction: Captures internal and external links.
- Metadata Extraction: Retrieves metadata from web pages.
Notes:
- Basic Crawl: Performs a single-page crawl of the provided URL, ignoring max_pages.
- Deep Crawl: Crawls multiple pages up to the max_pages limit. If max_pages is set to "all", it crawls all pages reachable from the provided URL.
- Case Sensitivity: The max_pages value "all" is case-insensitive ("All", "all", and "ALL" are all accepted) and triggers a full crawl when combined with deep_crawl: "true".
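The full-crawl rule in the notes above can be captured in a short check. `triggers_full_crawl` is a hypothetical name used only for illustration; it mirrors the documented behavior that a full crawl requires both deep_crawl set to "true" and a case-insensitive "all" for max_pages:

```python
def triggers_full_crawl(max_pages, deep_crawl):
    # Hypothetical helper: a full crawl happens only when deep_crawl
    # is "true" and max_pages is "all" in any casing ("All", "ALL", ...).
    return (
        deep_crawl == "true"
        and isinstance(max_pages, str)
        and max_pages.lower() == "all"
    )
```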
File Support:
The Web Crawler API processes web pages and extracts data in the following formats:
- JSON
- Markdown
- Cleaned HTML
Authentication Instructions:
To acquire the Base URL and create your own Client ID and Secret, please refer to the My Profile section within your Bizdata account.
