2.1 IP Banning: The Oldest and Most Effective Defense
IP banning is usually the first line of defense a website deploys. The site identifies the source IP of suspicious requests and adds it to a blocklist, which cuts off automated access from that address.
Types of IP Controls
| Type | Trigger Condition | Symptom |
|---|---|---|
| Single IP Ban | High request frequency from one IP | Returns 403 or 429 |
| ASN Ban | Requests come from a flagged network or IP range (e.g., a cloud provider's ranges) | No IP from that network or data center can access the site |
| Geo-Ban | Requests come from a blocked country or region | Rejected based on GeoIP match |
| Residential vs. Data Center | IP is classified as a data center address | Data center IPs are rejected outright |
Why is your server IP so easily banned?
IP ranges for cloud providers like AWS, GCP, and Alibaba Cloud are public. Websites can block all cloud servers with a simple rule:
if ip in DATACENTER_IP_RANGES: block()
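The one-line rule above can be made concrete with Python's standard `ipaddress` module. This is a minimal sketch, not a production blocklist: the CIDR blocks are placeholder documentation ranges, and `is_datacenter_ip` and `handle_request` are hypothetical names chosen for illustration.

```python
import ipaddress

# Placeholder CIDR blocks taken from the reserved documentation ranges (RFC 5737).
# A real deployment would load the ranges published by each cloud provider.
DATACENTER_IP_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in DATACENTER_IP_RANGES)


def handle_request(ip: str) -> int:
    """Return the HTTP status a site applying this rule might send."""
    return 403 if is_datacenter_ip(ip) else 200
```

Because the check is a plain membership test against public ranges, it costs the site almost nothing, which is why requests from cloud servers are so easy to reject.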
Rate Limiting
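The token bucket algorithm named in the sequence diagram of this section can be sketched in a few lines. This is an illustrative sketch under assumptions: the class name `TokenBucket`, the `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical, and a real WAF would typically keep one bucket per client IP.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise report that the limit is hit."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def status_for(bucket: TokenBucket) -> int:
    """Map the limiter decision to the status code a client would see."""
    return 200 if bucket.allow() else 429
```

The sketch covers only the 429 step. The temporary 403 ban in the diagram would need extra state, such as a per-IP ban expiry time, which is omitted here.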
sequenceDiagram
participant Bot as Scraper
participant WAF as WAF/CDN
participant Server as Origin Server
Bot->>WAF: GET /page (1st time)
WAF->>Server: Forward request
Server-->>Bot: 200 OK
Bot->>WAF: GET /page (50th time/sec)
WAF->>WAF: Trigger rate threshold
WAF-->>Bot: 429 Too Many Requests
Note over WAF: Sliding window counter or Token Bucket algorithm
Bot->>WAF: GET /page (during block period)
WAF-->>Bot: 403 Forbidden (IP temporarily banned)
2.2 TLS Fingerprinting: Your Handshake Ritual Exposed
This is an advanced anti-scraping method that many developers are unaware of. The TLS handshake that opens every HTTPS connection can itself reveal whether the client is a real browser or an automation tool, before any HTTP request is sent.
JA3 Fingerprinting
JA3 is a fingerprinting method that hashes selected fields of the TLS ClientHello message. The result identifies the client's TLS implementation and configuration rather than an individual user:
JA3 = MD5(SSLVersion, Ciphers, Extensions, EllipticCurves, EllipticCurvePointFormats)
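To make the formula concrete, here is a minimal sketch of the JA3 computation in Python. It assumes the five fields have already been parsed out of a ClientHello as decimal integers. The function names are hypothetical. The joining rules (commas between fields, hyphens within a field, GREASE values skipped) follow the published JA3 description.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE placeholders (RFC 8701) have the form 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Join the five fields: commas between fields, hyphens within a field."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join(
        "-".join(str(v) for v in field if not is_grease(v)) for field in fields
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """MD5 of the JA3 string, as a 32-character hex digest."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Since the hash covers the order of ciphers and extensions as well as their values, two clients that support the same features but list them differently get different fingerprints.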
Comparison Examples:
| Client | JA3 Hash |
|---|---|
| Chrome 120 (Windows) | cd08e31494f9531f560d64c695473da9 |
| Python requests | 3b5074b1b5d032e5620f69f9f700ff0e |
| curl | 7dc465ee29f9f4cde9001c75d09b1e65 |
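For a given version and TLS library build, Python's requests library and curl each produce a stable JA3 signature that differs from a browser's. Services such as Cloudflare Bot Management can therefore recognize them from the handshake alone, before any HTTP header is read. Treat the hashes above as examples: exact values depend on the client version and its TLS library, and recent Chrome releases randomize the order of TLS extensions, so their JA3 hash varies between connections.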
How to Detect Your TLS Fingerprint
# Check your curl fingerprint (using a dedicated detection service)
curl -s https://tls.peet.ws/api/all | python3 -m json.tool | grep ja3
# Check browser fingerprint
# Visit https://browserleaks.com/tls to view your browser's fingerprint
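The same check can be run from Python to see what fingerprint a script presents. The sketch below uses only the standard library, so it reports the fingerprint of `urllib`, which may differ from that of requests or other HTTP clients. The layout of the service's JSON response is not assumed: `find_ja3_fields` (a hypothetical helper) simply collects every key containing `ja3`, mirroring the `grep ja3` step above.

```python
import json
import urllib.request


def find_ja3_fields(node, path=""):
    """Recursively collect every scalar JSON entry whose key mentions 'ja3'."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if "ja3" in key.lower() and not isinstance(value, (dict, list)):
                found[child] = value
            found.update(find_ja3_fields(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.update(find_ja3_fields(item, f"{path}[{index}]"))
    return found


def fetch_ja3(url="https://tls.peet.ws/api/all", timeout=10):
    """Fetch the detection service's report using Python's own TLS stack."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        report = json.load(response)
    return find_ja3_fields(report)
```

Calling `fetch_ja3()` and comparing the result with the value your browser shows on the same service makes the gap between the two clients visible.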
2.3 HTTP Header Analysis
Modern anti-scraping systems meticulously analyze every HTTP request header:
# ❌ Typical scraper headers (easily identified)
headers = {
    'User-Agent': 'python-requests/2.31.0',  # Directly exposes the tool
}
# ✅ Simulating real Chrome browser headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    # Fetch metadata headers: describe how the navigation was initiated
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    # Client hints: must agree with the version and platform in the User-Agent
    'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}
Key Detection Point: Real browsers keep the Sec-Fetch-* and sec-ch-ua headers strictly consistent with each other and with the User-Agent.
If you spoof the UA but omit these headers, or if the header order doesn't follow Chrome conventions, the request is likely to be flagged.
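The kind of consistency check described above might look like the following from the server side. This is a simplified sketch, not any vendor's actual logic: the function name `header_inconsistencies` and the three rules it applies are assumptions chosen to illustrate the idea. It does not examine header order.

```python
import re


def header_inconsistencies(headers: dict) -> list:
    """Return reasons why a header set claiming to be Chrome looks inconsistent."""
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")
    match = re.search(r"Chrome/(\d+)", ua)
    if not match:
        # Not claiming to be Chrome. A real detector would judge such a
        # User-Agent by other rules; this sketch checks nothing further.
        return []
    problems = []
    # Rule 1: Chrome sends client hints and fetch metadata on navigations.
    for required in ("sec-ch-ua", "sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site"):
        if required not in h:
            problems.append(f"missing {required}")
    # Rule 2: the brand version in sec-ch-ua must match the UA major version.
    brand_versions = re.findall(
        r'"(?:Chromium|Google Chrome)";v="(\d+)"', h.get("sec-ch-ua", "")
    )
    if any(version != match.group(1) for version in brand_versions):
        problems.append("sec-ch-ua version differs from User-Agent")
    # Rule 3: the platform hint must agree with the UA platform token.
    platform = h.get("sec-ch-ua-platform", "").strip('"')
    if platform == "macOS" and "Mac OS X" not in ua:
        problems.append("sec-ch-ua-platform differs from User-Agent")
    return problems
```

A header set that only swaps in a Chrome User-Agent fails the first rule at once, which is why changing the UA string alone rarely helps.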
2.4 Chapter Review
- Why is setting the User-Agent to a Chrome string often still detected?
- What is a JA3 fingerprint, and how can a Python program generate the same JA3 as Chrome?
- What is the fundamental difference between a cloud server IP and a residential IP in anti-scraping detection?