Advanced Scraping Guide
Learn how to improve your Firecrawl scraping with advanced options.
This guide will walk you through the different Firecrawl endpoints and how to use them to their full potential with all available parameters.
Basic scraping with Firecrawl (`/scrape`)
To scrape a single page and get clean markdown content, you can use the `/scrape` endpoint.
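Below is a minimal sketch of such a request. The base URL `https://api.firecrawl.dev/v1` and the `Authorization` header are assumptions based on the hosted API; substitute your own API key.

```bash
# Scrape a single page; the response defaults to markdown content.
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com"
  }'
```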
Scraping PDFs
Firecrawl supports scraping PDFs by default. You can use the `/scrape` endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting `parsePDF` to `false`.
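For example, a sketch of a request that turns PDF parsing off (same assumed base URL and auth as above; `parsePDF` sits at the top level of the request body, as described):

```bash
# Scrape a PDF link but skip PDF-to-text parsing.
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "parsePDF": false
  }'
```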
Scrape Options
When using the `/scrape` endpoint, you can customize the scraping behavior with many parameters. Here are the available options:
Setting the content formats on the response with `formats`
- Type: `array`
- Enum: `["markdown", "links", "html", "rawHtml", "screenshot", "json"]`
- Description: Specify the formats to include in the response. Options include:
  - `markdown`: Returns the scraped content in Markdown format.
  - `links`: Includes all hyperlinks found on the page.
  - `html`: Provides the content in HTML format.
  - `rawHtml`: Delivers the raw HTML content, without any processing.
  - `screenshot`: Includes a screenshot of the page as it appears in the browser.
  - `json`: Extracts structured information from the page using an LLM.
- Default: `["markdown"]`
Getting the full page content as markdown with `onlyMainContent`
- Type: `boolean`
- Description: By default, the scraper will only return the main content of the page, excluding headers, navigation bars, footers, etc. Set this to `false` to return the full page content.
- Default: `true`
Setting the tags to include with `includeTags`
- Type: `array`
- Description: Specify the HTML tags, classes and IDs to include in the response.
- Default: undefined
Setting the tags to exclude with `excludeTags`
- Type: `array`
- Description: Specify the HTML tags, classes and IDs to exclude from the response.
- Default: undefined
Waiting for the page to load with `waitFor`
- Type: `integer`
- Description: To be used only as a last resort. Wait for a specified number of milliseconds for the page to load before fetching content.
- Default: `0`
Setting the maximum timeout with `timeout`
- Type: `integer`
- Description: Set the maximum duration in milliseconds that the scraper will wait for the page to respond before aborting the operation.
- Default: `30000` (30 seconds)
Example Usage
In this example, the scraper will:

- Return the full page content as markdown.
- Include the markdown, raw HTML, HTML, links and screenshot in the response.
- Include only the HTML tags `<h1>`, `<p>` and `<a>`, and elements with the class `.main-content`, while excluding any elements with the IDs `#ad` and `#footer`.
- Wait for 1000 milliseconds (1 second) for the page to load before fetching the content.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
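A sketch of the corresponding request (assumed base URL and auth as in the earlier examples):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
    "onlyMainContent": false,
    "waitFor": 1000,
    "timeout": 15000
  }'
```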
Here is the API Reference for it: Scrape Endpoint Documentation
Extractor Options
When using the `/scrape` endpoint, you can specify options for extracting structured information from the page content using the `extract` parameter. Here are the available options:
Using the LLM Extraction

`schema`
- Type: `object`
- Required: False if `prompt` is provided
- Description: The schema for the data to be extracted. This defines the structure of the extracted data.
`systemPrompt`
- Type: `string`
- Required: False
- Description: System prompt for the LLM.
`prompt`
- Type: `string`
- Required: False if `schema` is provided
- Description: A prompt for the LLM to extract the data in the correct structure.
- Example: `"Extract the features of the product"`
Example Usage
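A sketch of an extraction request using the options above. The schema here is an illustrative JSON Schema, and requesting the extract output via `"formats": ["extract"]` is an assumption; check the API reference for the exact wiring in your API version.

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/product",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "features": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["features"]
      },
      "prompt": "Extract the features of the product"
    }
  }'
```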
Actions
When using the `/scrape` endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Available Actions
wait
- Type: `object`
- Description: Wait for a specified number of milliseconds.
- Properties:
  - `type`: `"wait"`
  - `milliseconds`: Number of milliseconds to wait.
- Example:
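An illustrative object, per the properties above:

```json
{ "type": "wait", "milliseconds": 2000 }
```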
screenshot
- Type: `object`
- Description: Take a screenshot.
- Properties:
  - `type`: `"screenshot"`
  - `fullPage`: Should the screenshot be full-page or viewport sized? (default: `false`)
- Example:
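For instance:

```json
{ "type": "screenshot", "fullPage": true }
```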
click
- Type: `object`
- Description: Click on an element.
- Properties:
  - `type`: `"click"`
  - `selector`: Query selector to find the element by.
- Example:
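For instance (the selector is hypothetical):

```json
{ "type": "click", "selector": "#load-more" }
```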
write
- Type: `object`
- Description: Write text into an input field.
- Properties:
  - `type`: `"write"`
  - `text`: Text to type.
  - `selector`: Query selector for the input field.
- Example:
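For instance (selector and text are hypothetical):

```json
{ "type": "write", "text": "firecrawl", "selector": "#search-input" }
```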
press
- Type: `object`
- Description: Press a key on the page.
- Properties:
  - `type`: `"press"`
  - `key`: Key to press.
- Example:
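For instance:

```json
{ "type": "press", "key": "Enter" }
```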
scroll
- Type: `object`
- Description: Scroll the page.
- Properties:
  - `type`: `"scroll"`
  - `direction`: Direction to scroll (`"up"` or `"down"`).
  - `amount`: Amount to scroll in pixels.
- Example:
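For instance:

```json
{ "type": "scroll", "direction": "down", "amount": 500 }
```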
For more details about the actions parameters, refer to the API Reference.
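Putting it together, a sketch of a `/scrape` request that runs several actions before capturing the page (assuming actions are passed as an `actions` array in the request body; base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "actions": [
      { "type": "wait", "milliseconds": 2000 },
      { "type": "click", "selector": "#load-more" },
      { "type": "scroll", "direction": "down", "amount": 500 },
      { "type": "screenshot", "fullPage": true }
    ]
  }'
```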
Crawling Multiple Pages
To crawl multiple pages, you can use the `/crawl` endpoint. This endpoint allows you to specify a base URL you want to crawl, and all accessible subpages will be crawled.
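A sketch of a basic crawl request (assumed base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
```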
Returns an `id`.
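An illustrative response (the exact shape and the job id are hypothetical):

```json
{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000"
}
```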
Check Crawl Job
Used to check the status of a crawl job and get its result.
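A sketch of the status check, assuming the job id is appended to the `/crawl` path:

```bash
curl https://api.firecrawl.dev/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H 'Authorization: Bearer YOUR_API_KEY'
```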
Pagination/Next URL
If the content is larger than 10MB or if the crawl job is still running, the response will include a `next` parameter: a URL to the next page of results. Use it to fetch the next page.
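An illustrative in-progress response carrying a `next` URL (field names besides `next` are assumptions, and the `data` array is truncated here):

```json
{
  "status": "scraping",
  "completed": 25,
  "total": 120,
  "next": "https://api.firecrawl.dev/v1/crawl/550e8400-e29b-41d4-a716-446655440000?skip=25",
  "data": []
}
```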
Crawler Options
When using the `/crawl` endpoint, you can customize the crawling behavior with request body parameters. Here are the available options:
`includePaths`
- Type: `array`
- Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
- Example: `["/blog/*", "/products/*"]`
`excludePaths`
- Type: `array`
- Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
- Example: `["/admin/*", "/login/*"]`
`maxDepth`
- Type: `integer`
- Description: Maximum depth to crawl relative to the entered URL. A `maxDepth` of 0 scrapes only the entered URL. A `maxDepth` of 1 scrapes the entered URL and all pages one level deep. A `maxDepth` of 2 scrapes the entered URL and all pages up to two levels deep. Higher values follow the same pattern.
- Example: `2`
`limit`
- Type: `integer`
- Description: Maximum number of pages to crawl.
- Default: `10000`
`allowBackwardLinks`
- Type: `boolean`
- Description: This option permits the crawler to navigate to URLs that are higher in the directory structure than the base URL. For instance, if the base URL is `example.com/blog/topic`, enabling this option allows crawling to pages like `example.com/blog` or `example.com`, which are backward in the path hierarchy relative to the base URL.
- Default: `false`
`allowExternalLinks`
- Type: `boolean`
- Description: This option allows the crawler to follow links that point to external domains. Be careful with this option, as it can cause the crawl to stop based only on the `limit` and `maxDepth` values.
- Default: `false`
`scrapeOptions`
As part of the crawler options, you can also specify the `scrapeOptions` parameter. This parameter allows you to customize the scraping behavior for each page.

- Type: `object`
- Description: Options for the scraper.
- Example: `{"formats": ["markdown", "links", "html", "rawHtml", "screenshot"], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}`
- Default: `{ "formats": ["markdown"] }`
- See: Scrape Options
Example Usage
In this example, the crawler will:

- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Skip URLs that match the patterns `/admin/*` and `/login/*`.
- Return the full document data for each page.
- Crawl up to a maximum depth of 2.
- Crawl a maximum of 1000 pages.
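A sketch of the corresponding request (assumed base URL and auth as above):

```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "includePaths": ["/blog/*", "/products/*"],
    "excludePaths": ["/admin/*", "/login/*"],
    "maxDepth": 2,
    "limit": 1000
  }'
```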
Mapping Website Links with `/map`
The `/map` endpoint is adept at identifying URLs that are contextually related to a given website. This feature is crucial for understanding a site's contextual link environment, which can greatly aid in strategic site analysis and navigation planning.
Usage
To use the `/map` endpoint, you need to send a request with the URL of the page you want to map. Here is an example using `curl`:
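This sketch assumes the hosted v1 API, where the target URL is sent in a JSON body:

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{ "url": "https://example.com" }'
```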
This will return a JSON object containing links contextually related to the URL.
Example Response
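An illustrative response shape (the links are hypothetical):

```json
{
  "success": true,
  "links": [
    "https://example.com",
    "https://example.com/blog",
    "https://example.com/products"
  ]
}
```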
Map Options
`search`
- Type: `string`
- Description: Search for links containing specific text.
- Example: `"blog"`
`limit`
- Type: `integer`
- Description: Maximum number of links to return.
- Default: `100`
`ignoreSitemap`
- Type: `boolean`
- Description: Ignore the website sitemap when crawling.
- Default: `true`
`includeSubdomains`
- Type: `boolean`
- Description: Include subdomains of the website.
- Default: `false`
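A sketch combining the options above (same assumed base URL and auth):

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com",
    "search": "blog",
    "limit": 50,
    "includeSubdomains": true
  }'
```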
Here is the API Reference for it: Map Endpoint Documentation