When to use proxies for scraping
Use proxies for any site that serves a CAPTCHA after N requests from one IP or caps requests per minute per IP. Typical cases are e-commerce (Amazon), classifieds, search engines, news sites, and rate-limited APIs.
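
A rough way to notice that a site has started throttling a single IP is to look at the status code and the response body. The sketch below is a heuristic only: the function name, the status set, and the marker strings are assumptions chosen for illustration, and real sites signal blocks in many different ways.

```python
# Hypothetical heuristic: these statuses and markers are illustrative, not exhaustive.
BLOCK_STATUSES = {403, 429, 503}
CAPTCHA_MARKERS = ("captcha", "unusual traffic", "are you a robot")


def looks_blocked(status_code, body):
    """Return True if the response looks like a rate limit or CAPTCHA page."""
    if status_code in BLOCK_STATUSES:
        return True
    text = body.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)
```

If a check like this fires regularly from one IP, that is a sign the target enforces per-IP limits.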
How to build a proxy pool
- Load addresses into a pool manager such as scrapy-rotating-proxies (requests-ip-rotator rotates IPs through AWS API Gateway endpoints instead of a proxy list)
- Pick a random proxy per request
- On 429/503 mark the proxy as "cooling" for 1-3 minutes
- Track success rate so bad IPs fall out of the pool automatically
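
The steps above can be sketched as a small in-memory pool. This is a minimal illustration, not how any particular library does it: the class name ProxyPool, its methods, the 120-second cooldown, and the success-rate thresholds are all assumptions.

```python
import random
import time


class ProxyPool:
    """Toy proxy pool: random pick, cooldown on 429/503, success-rate tracking."""

    def __init__(self, proxies, cooldown=120.0, min_success_rate=0.5,
                 min_samples=10, clock=time.monotonic):
        self.cooldown = cooldown                  # seconds; assumed value in the 1-3 minute range
        self.min_success_rate = min_success_rate  # assumed threshold
        self.min_samples = min_samples            # do not judge a proxy on too few requests
        self.clock = clock                        # injectable so the pool can be tested
        self.stats = {p: {"ok": 0, "fail": 0, "cooling_until": 0.0} for p in proxies}

    def _healthy(self, proxy):
        s = self.stats[proxy]
        total = s["ok"] + s["fail"]
        if total < self.min_samples:
            return True
        return s["ok"] / total >= self.min_success_rate

    def available(self):
        """Proxies that are neither cooling down nor below the success threshold."""
        now = self.clock()
        return [p for p, s in self.stats.items()
                if s["cooling_until"] <= now and self._healthy(p)]

    def pick(self):
        candidates = self.available()
        if not candidates:
            raise RuntimeError("no proxies available right now")
        return random.choice(candidates)

    def report(self, proxy, status_code):
        """Record the outcome of a request; status_code=None means a connection error."""
        s = self.stats[proxy]
        if status_code in (429, 503):
            s["fail"] += 1
            s["cooling_until"] = self.clock() + self.cooldown
        elif status_code is None or status_code >= 500 or status_code in (403, 407):
            s["fail"] += 1
        else:
            s["ok"] += 1
```

A caller would run pool.pick() before each request and pool.report(proxy, response.status_code) afterwards. Which statuses count against a proxy is a judgment call; here a 404 is treated as a success because the proxy did deliver a response.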
Integrations
Python + Scrapy:
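DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy from the list for each request
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # Detects bans so dead proxies drop out of rotation
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
ROTATING_PROXY_LIST_PATH = 'proxies.txt'  # one proxy per line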
Python + Requests:
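import requests  # SOCKS proxies need the extra: pip install "requests[socks]"
proxies = {
    # socks5h:// instead of socks5:// resolves DNS through the proxy
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port',
}
r = requests.get(url, proxies=proxies, timeout=10)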
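Puppeteer / Playwright: in Puppeteer, pass --proxy-server=socks5://host:port in the args launch option. In Playwright, set the proxy launch option instead, e.g. proxy: { server: 'socks5://host:port' }.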
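
For Playwright's Python API, the proxy goes into the proxy argument of the browser launch call. The sketch below assumes the playwright package and a Chromium build are installed; the helper names proxy_launch_options and fetch_title are made up for illustration, and host and port stand in for a real proxy.

```python
def proxy_launch_options(host, port, scheme="socks5"):
    """Build keyword arguments for Playwright's browser launch call."""
    return {"proxy": {"server": f"{scheme}://{host}:{port}"}}


def fetch_title(url, host, port):
    """Open url through the proxy in headless Chromium and return the page title."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(**proxy_launch_options(host, port))
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            browser.close()
```

Keeping the option building separate from the launch makes it easy to swap in a different proxy per browser instance.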
Tips
- Cache responses locally so the same URL is never fetched twice
- Use realistic User-Agent strings and rotate them
- Use headless browsers only when JS rendering is unavoidable; a plain HTTP request is roughly 10x faster
- Respect robots.txt and site terms
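
Several of these tips fit into a few standard-library helpers. This is a sketch under assumptions: the function names, the on-disk cache layout, and the two User-Agent strings are illustrative, and fetch is any callable you supply that takes (url, headers) and returns the body as text.

```python
import hashlib
import random
from pathlib import Path
from urllib import robotparser

# Illustrative strings only; keep a list like this current in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


def random_headers():
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def allowed_by_robots(robots_txt, url, agent="*"):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def cached_fetch(url, fetch, cache_dir="cache"):
    """Return the body for url, calling fetch(url, headers) only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe file name.
    path = directory / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url, random_headers())
    path.write_text(body, encoding="utf-8")
    return body
```

The cache here never expires, which suits one-off crawls; a long-running scraper would probably want a time-to-live on each entry.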
