9.1 Architecture Review: Who is Proxying Whom?
Before diving into code, we need to clarify the traffic flow, as it can be confusing. Many developers ask: "Does my Node.js scraper code need proxy configuration?"
The answer is: No!
In this architecture, your core application (Node.js backend) and the target website are separated by the Firecrawl proxy black box we built:
graph LR
    Node["Node.js Backend<br/>No proxy, real IP"] -->|HTTP POST| FCAPI[Local Firecrawl:3002]
    subgraph Proxy Black Box Region
        FCAPI -->|Passes Scraping Task| Playwright[Playwright Container]
        Playwright -->|Via PROXY_SERVER env| WARP[Host WARP SOCKS5:40000]
    end
    WARP -->|CF Edge IP Access| Target[Target Anti-Scraping Site]
    Target -.-> WARP -.-> Playwright -.-> FCAPI -.->|Markdown Content| Node

Core Advantages:
- Business Decoupling: Your backend Node.js code doesn't need to handle proxy logic, maintain proxy pools, or even know WARP exists.
- Fault Isolation: If the proxy fails, only the requests routed through Firecrawl are affected. Your Node.js connections to databases or LLM APIs (like Gemini) never touch the proxy and keep using the direct, faster network path.
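One way to keep the backend unaware of the proxy is to make the Firecrawl endpoint the only scraping-related setting it reads. The following is a minimal sketch, not part of Firecrawl itself: the helper name resolveFirecrawlUrl and the use of an environment variable called FIRECRAWL_API_URL are assumptions made for illustration, and the default mirrors the local port shown in the diagram.

```typescript
// Hypothetical helper: the variable name and default value are assumptions.
const DEFAULT_FIRECRAWL_URL = "http://localhost:3002";

export function resolveFirecrawlUrl(
  env: Record<string, string | undefined>
): string {
  const raw = env["FIRECRAWL_API_URL"]?.trim();
  const base = raw ? raw : DEFAULT_FIRECRAWL_URL;

  // The backend speaks plain HTTP(S) to Firecrawl. A SOCKS address here would
  // mean proxy details have leaked into the application config.
  if (!/^https?:\/\//i.test(base)) {
    throw new Error(`FIRECRAWL_API_URL must be an http(s) URL, got: ${base}`);
  }

  // Strip trailing slashes so `${base}/v1/scrape` never contains "//".
  return base.replace(/\/+$/, "");
}
```

In a Node.js process this would typically be called once at startup as resolveFirecrawlUrl(process.env). No WARP or SOCKS5 setting appears anywhere in the backend's configuration.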
9.2 Calling Firecrawl in Business Code
Once Firecrawl is deployed as infrastructure, calling it in Node.js becomes a simple HTTP/REST request.
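We recommend calling the endpoint with the native fetch API (available as a global in Node.js 18 and later) rather than pulling in an SDK, since this keeps the request payload fully under your control: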
// crawler/src/utils/firecrawl-client.ts
export async function scrapeWithFirecrawl(url: string) {
  const FIRECRAWL_API_URL = "http://localhost:3002";
  console.log(`🛡️ Scraping via local proxy cluster: ${url}`);
  try {
    const response = await fetch(`${FIRECRAWL_API_URL}/v1/scrape`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Local self-hosted version doesn't need a real API Key
        'Authorization': 'Bearer dummy-key'
      },
      body: JSON.stringify({
        url: url,
        formats: ['markdown', 'html'],
        // Key parameter: delay in ms before capture, so Playwright has time to render
        waitFor: 2000,
        // Desktop emulation (the default); set to true to emulate a mobile device
        mobile: false,
      }),
    });
    if (!response.ok) {
      throw new Error(`Firecrawl response error: ${response.status}`);
    }
    const data = await response.json();
    if (data.success && data.data) {
      return {
        success: true,
        content: data.data.markdown, // Cleaned Markdown of the page
        html: data.data.html
      };
    }
    return { success: false, error: "No data retrieved" };
  } catch (error) {
    console.error(`❌ Firecrawl scraping failed:`, error);
    return { success: false, error: String(error) };
  }
}
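The success check above accepts any truthy data.data, even when the Markdown field is missing or empty. A slightly stricter variant can validate the response shape before handing it to business code. This is only a sketch: the names ScrapeResult and parseScrapeResponse are illustrative and not part of Firecrawl, and the code assumes nothing beyond the { success, data: { markdown, html } } layout used in the function above.

```typescript
// Illustrative result type; not a Firecrawl export.
export type ScrapeResult =
  | { success: true; content: string; html?: string }
  | { success: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

export function parseScrapeResponse(body: unknown): ScrapeResult {
  if (!isRecord(body) || body["success"] !== true) {
    return { success: false, error: "No data retrieved" };
  }
  const data = body["data"];
  if (!isRecord(data)) {
    return { success: false, error: "No data retrieved" };
  }

  const markdown = data["markdown"];
  const html = data["html"];

  // An empty page often means the target served a block page or a blank shell.
  if (typeof markdown !== "string" || markdown.trim() === "") {
    return { success: false, error: "Response contained no Markdown" };
  }

  return typeof html === "string"
    ? { success: true, content: markdown, html }
    : { success: true, content: markdown };
}
```

Inside scrapeWithFirecrawl, the result of response.json() could be passed straight to this helper, which gives callers a discriminated union to branch on instead of an untyped object.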
9.3 Handling Extreme Anti-Scraping: Advanced Configurations
For specific websites, you can strengthen your bypass capabilities by modifying the payload sent to Firecrawl:
1. Handling Mandatory Popups and Consent Forms (the actions Parameter)
If a target site opens with an "Accept Cookies" overlay that blocks content extraction, have the browser click it away and pause briefly before the page is captured:
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "actions": [
    { "type": "click", "selector": "#accept-cookies-btn" },
    { "type": "wait", "milliseconds": 1000 }
  ]
}
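In TypeScript, the same payload can be assembled with a small typed helper so that a misspelled action type is caught at compile time. This is a sketch under assumptions: the type names and withConsentDismissal are hypothetical, and only the two action shapes shown in the JSON above (click and wait) are modeled.

```typescript
// Only the two action shapes used in this section are modeled here.
type ClickAction = { type: "click"; selector: string };
type WaitAction = { type: "wait"; milliseconds: number };
type Action = ClickAction | WaitAction;

interface ScrapePayload {
  url: string;
  formats: string[];
  actions?: Action[];
}

// Hypothetical helper: prepends "click the consent button, then wait".
export function withConsentDismissal(
  payload: ScrapePayload,
  selector: string,
  settleMs = 1000
): ScrapePayload {
  if (selector.trim() === "") {
    throw new Error("selector must not be empty");
  }
  const dismissal: Action[] = [
    { type: "click", selector },
    { type: "wait", milliseconds: settleMs },
  ];
  // Dismiss the overlay first, then run any actions the caller already had.
  return { ...payload, actions: [...dismissal, ...(payload.actions ?? [])] };
}
```

The returned object can be passed to JSON.stringify as the request body. The input payload is not mutated, so one base payload can be reused across sites with different selectors.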
2. Scraping Specific Regions Only
If a page is massive and you only want the main article Markdown to save on LLM token costs:
{
  "url": "https://news.ycombinator.com",
  "formats": ["markdown"],
  "includeTags": ["article", "main", ".story-content"],
  "excludeTags": ["nav", "footer", ".ads-banner"]
}
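Because these options are tuned per site, one possible arrangement is a lookup table keyed by hostname that is merged into a default payload. The sketch below is illustrative only: SITE_PROFILES, hostnameOf, and buildPayloadFor are hypothetical names, the selectors are copied from the example above, and the hostname parsing is deliberately simplified.

```typescript
interface SiteProfile {
  includeTags?: string[];
  excludeTags?: string[];
}

// Hypothetical profile table; the selectors mirror the example in this section.
const SITE_PROFILES: Record<string, SiteProfile> = {
  "news.ycombinator.com": {
    includeTags: ["article", "main", ".story-content"],
    excludeTags: ["nav", "footer", ".ads-banner"],
  },
};

// Simplified hostname extraction; does not handle credentials in the URL.
function hostnameOf(url: string): string {
  const host = /^https?:\/\/([^/:?#]+)/i.exec(url)?.[1];
  if (!host) {
    throw new Error(`Not an http(s) URL: ${url}`);
  }
  return host.toLowerCase();
}

export function buildPayloadFor(url: string): Record<string, unknown> {
  const profile = SITE_PROFILES[hostnameOf(url)] ?? {};
  // Sites without a profile get the plain default payload.
  return { url, formats: ["markdown"], ...profile };
}
```

Keeping the selectors in data rather than scattered through call sites makes it easier to adjust them when a target site changes its markup.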
9.4 Chapter Review
- Why should your Node.js application NOT use the WARP proxy when calling LLM APIs like Gemini?
- If your Firecrawl container and Node.js app are deployed in the same Docker Compose network, what should the FIRECRAWL_API_URL be?
- What critical role does the waitFor: 2000 parameter play in bypassing anti-scraping measures?