Advanced Scraping Guide
Learn how to improve your Firecrawl scraping with advanced options.
This guide will walk you through the different Firecrawl endpoints and how to use them to their full potential with all available parameters.
Basic scraping with Firecrawl (`/scrape`)
To scrape a single page and get clean markdown content, you can use the `/scrape` endpoint.
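Below is a minimal sketch of such a request. The base URL `https://api.firecrawl.dev/v1` and the `Authorization` header are assumptions based on the hosted API; substitute your own API key.

```bash
# Scrape a single page; the response defaults to markdown content.
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com"
  }'
```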
Scraping PDFs
Firecrawl supports scraping PDFs by default. You can use the `/scrape` endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting `parsePDF` to `false`.
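For example, a sketch of a request that turns PDF parsing off (same assumed base URL and auth as above; `parsePDF` sits at the top level of the request body, as described):

```bash
# Scrape a PDF link but skip PDF-to-text parsing.
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "parsePDF": false
  }'
```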
Scrape Options
When using the `/scrape` endpoint, you can customize the scraping behavior with many parameters. Here are the available options:
Setting the content formats on the response with `formats`
- Type: `array`
- Enum: `["markdown", "links", "html", "rawHtml", "screenshot", "json"]`
- Description: Specify the formats to include in the response. Options include:
  - `markdown`: Returns the scraped content in Markdown format.
  - `links`: Includes all hyperlinks found on the page.
  - `html`: Provides the content in HTML format.
  - `rawHtml`: Delivers the raw HTML content, without any processing.
  - `screenshot`: Includes a screenshot of the page as it appears in the browser.
  - `json`: Extracts structured information from the page using an LLM.
- Default: `["markdown"]`
Getting the full page content as markdown with `onlyMainContent`
- Type: `boolean`
- Description: By default, the scraper will only return the main content of the page, excluding headers, navigation bars, footers, etc. Set this to `false` to return the full page content.
- Default: `true`
Setting the tags to include with `includeTags`
- Type: `array`
- Description: Specify the HTML tags, classes and IDs to include in the response.
- Default: undefined
Setting the tags to exclude with `excludeTags`
- Type: `array`
- Description: Specify the HTML tags, classes and IDs to exclude from the response.
- Default: undefined
Waiting for the page to load with `waitFor`
- Type: `integer`
- Description: To be used only as a last resort. Wait for a specified number of milliseconds for the page to load before fetching content.
- Default: `0`
Setting the maximum timeout with `timeout`
- Type: `integer`
- Description: Set the maximum duration in milliseconds that the scraper will wait for the page to respond before aborting the operation.
- Default: `30000` (30 seconds)
Example Usage
In this example, the scraper will:

- Return the full page content as markdown.
- Include the markdown, raw HTML, HTML, links and screenshot in the response.
- Include only the HTML tags `<h1>`, `<p>` and `<a>`, and elements with the class `.main-content`, while excluding any elements with the IDs `#ad` and `#footer`.
- Wait for 1000 milliseconds (1 second) for the page to load before fetching the content.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
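A sketch of the corresponding request (assumed base URL and auth as in the earlier examples):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
    "onlyMainContent": false,
    "waitFor": 1000,
    "timeout": 15000
  }'
```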
Here is the API Reference for it: Scrape Endpoint Documentation
Extractor Options
When using the `/scrape` endpoint, you can specify options for extracting structured information from the page content using the `extract` parameter. Here are the available options:
Using the LLM Extraction

`schema`
- Type: `object`
- Required: False if `prompt` is provided
- Description: The schema for the data to be extracted. This defines the structure of the extracted data.
`systemPrompt`
- Type: `string`
- Required: False
- Description: System prompt for the LLM.
`prompt`
- Type: `string`
- Required: False if `schema` is provided
- Description: A prompt for the LLM to extract the data in the correct structure.
- Example: `"Extract the features of the product"`
Example Usage
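A sketch of an extraction request using the options above. The schema here is an illustrative JSON Schema, and requesting the extract output via `"formats": ["extract"]` is an assumption; check the API reference for the exact wiring in your API version.

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/product",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "features": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["features"]
      },
      "prompt": "Extract the features of the product"
    }
  }'
```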
Actions
When using the `/scrape` endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Available Actions
wait
- Type: `object`
- Description: Wait for a specified number of milliseconds.
- Properties:
  - `type`: `"wait"`
  - `milliseconds`: Number of milliseconds to wait.
- Example:
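An illustrative object, per the properties above:

```json
{ "type": "wait", "milliseconds": 2000 }
```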
screenshot
- Type: `object`
- Description: Take a screenshot.
- Properties:
  - `type`: `"screenshot"`
  - `fullPage`: Should the screenshot be full-page or viewport sized? (default: `false`)
- Example:
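For instance:

```json
{ "type": "screenshot", "fullPage": true }
```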
click
- Type: `object`
- Description: Click on an element.
- Properties:
  - `type`: `"click"`
  - `selector`: Query selector to find the element by.
- Example:
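For instance (the selector is hypothetical):

```json
{ "type": "click", "selector": "#load-more" }
```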
write
- Type: `object`
- Description: Write text into an input field.
- Properties:
  - `type`: `"write"`
  - `text`: Text to type.
  - `selector`: Query selector for the input field.
- Example:
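For instance (selector and text are hypothetical):

```json
{ "type": "write", "text": "firecrawl", "selector": "#search-input" }
```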
press
- Type: `object`
- Description: Press a key on the page.
- Properties:
  - `type`: `"press"`
  - `key`: Key to press.
- Example:
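For instance:

```json
{ "type": "press", "key": "Enter" }
```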
scroll
- Type: `object`
- Description: Scroll the page.
- Properties:
  - `type`: `"scroll"`
  - `direction`: Direction to scroll (`"up"` or `"down"`).
  - `amount`: Amount to scroll in pixels.
- Example:
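For instance:

```json
{ "type": "scroll", "direction": "down", "amount": 500 }
```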
For more details about the actions parameters, refer to the API Reference.
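Putting it together, a sketch of a `/scrape` request that runs several actions before capturing the page (assuming actions are passed as an `actions` array in the request body; base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "actions": [
      { "type": "wait", "milliseconds": 2000 },
      { "type": "click", "selector": "#load-more" },
      { "type": "scroll", "direction": "down", "amount": 500 },
      { "type": "screenshot", "fullPage": true }
    ]
  }'
```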
Crawling Multiple Pages
To crawl multiple pages, you can use the `/crawl` endpoint. This endpoint allows you to specify a base URL you want to crawl, and all accessible subpages will be crawled.
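A sketch of a basic crawl request (assumed base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
```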
Returns an `id`.
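An illustrative response (the exact shape and the job id are hypothetical):

```json
{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000"
}
```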
Check Crawl Job
Used to check the status of a crawl job and get its result.
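A sketch of the status check, assuming the job id is appended to the `/crawl` path:

```bash
curl https://api.firecrawl.dev/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H 'Authorization: Bearer YOUR_API_KEY'
```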
Pagination/Next URL
If the content is larger than 10MB or if the crawl job is still running, the response will include a `next` parameter: a URL to the next page of results. Use it to fetch the next page.
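An illustrative in-progress response carrying a `next` URL (field names besides `next` are assumptions, and the `data` array is truncated here):

```json
{
  "status": "scraping",
  "completed": 25,
  "total": 120,
  "next": "https://api.firecrawl.dev/v1/crawl/550e8400-e29b-41d4-a716-446655440000?skip=25",
  "data": []
}
```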
Crawler Options
When using the `/crawl` endpoint, you can customize the crawling behavior with request body parameters. Here are the available options:
`includePaths`
- Type: `array`
- Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
- Example: `["/blog/*", "/products/*"]`
`excludePaths`
- Type: `array`
- Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
- Example: `["/admin/*", "/login/*"]`
`maxDepth`
- Type: `integer`
- Description: Maximum depth to crawl relative to the entered URL. A `maxDepth` of 0 scrapes only the entered URL. A `maxDepth` of 1 scrapes the entered URL and all pages one level deep. A `maxDepth` of 2 scrapes the entered URL and all pages up to two levels deep. Higher values follow the same pattern.
- Example: `2`
`limit`
- Type: `integer`
- Description: Maximum number of pages to crawl.
- Default: `10000`
`allowBackwardLinks`
- Type: `boolean`
- Description: This option permits the crawler to navigate to URLs that are higher in the directory structure than the base URL. For instance, if the base URL is `example.com/blog/topic`, enabling this option allows crawling to pages like `example.com/blog` or `example.com`, which are backward in the path hierarchy relative to the base URL.
- Default: `false`
`allowExternalLinks`
- Type: `boolean`
- Description: This option allows the crawler to follow links that point to external domains. Be careful with this option, as it can cause the crawl to stop based only on the `limit` and `maxDepth` values.
- Default: `false`
`scrapeOptions`
As part of the crawler options, you can also specify the `scrapeOptions` parameter. This parameter allows you to customize the scraping behavior for each page.

- Type: `object`
- Description: Options for the scraper.
- Example: `{"formats": ["markdown", "links", "html", "rawHtml", "screenshot"], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}`
- Default: `{ "formats": ["markdown"] }`
- See: Scrape Options
Example Usage
In this example, the crawler will:

- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Skip URLs that match the patterns `/admin/*` and `/login/*`.
- Return the full document data for each page.
- Crawl up to a maximum depth of 2.
- Crawl a maximum of 1000 pages.
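A sketch of the corresponding request (assumed base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "includePaths": ["/blog/*", "/products/*"],
    "excludePaths": ["/admin/*", "/login/*"],
    "maxDepth": 2,
    "limit": 1000
  }'
```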
Mapping Website Links with `/map`
The `/map` endpoint is adept at identifying URLs that are contextually related to a given website. This feature is crucial for understanding a site's contextual link environment, which can greatly aid in strategic site analysis and navigation planning.
Usage
To use the `/map` endpoint, you need to send a request with the URL of the page you want to map. Here is an example using `curl`:
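This sketch assumes the hosted v1 API, where the target URL is sent in a JSON body:

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{ "url": "https://example.com" }'
```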
This will return a JSON object containing links contextually related to the URL.
Example Response
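An illustrative response shape (the links are hypothetical):

```json
{
  "success": true,
  "links": [
    "https://example.com",
    "https://example.com/blog",
    "https://example.com/products"
  ]
}
```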
Map Options
`search`
- Type: `string`
- Description: Search for links containing specific text.
- Example: `"blog"`
`limit`
- Type: `integer`
- Description: Maximum number of links to return.
- Default: `100`
`ignoreSitemap`
- Type: `boolean`
- Description: Ignore the website sitemap when crawling.
- Default: `true`
`includeSubdomains`
- Type: `boolean`
- Description: Include subdomains of the website.
- Default: `false`
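A sketch combining the options above (same assumed base URL and auth):

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "search": "blog",
    "limit": 50,
    "includeSubdomains": true
  }'
```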
Here is the API Reference for it: Map Endpoint Documentation