Scrapy End-to-End Guide
What is Scrapy?
Scrapy is an open-source and collaborative framework for extracting the data you need from websites. It's a fast, high-level web crawling and scraping framework for Python.
Think of it as a complete toolkit for building web spiders and bots that can:
- Crawl websites: Follow links from one page to another to discover all the content on a site.
- Scrape data: Extract specific information (e.g., product names, prices, article headlines) from the HTML of a webpage.
- Process and store data: Clean up the extracted data and save it to various formats like JSON, CSV, or a database.
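To preview what those pieces look like together, here is a minimal, self-contained spider. It is only a sketch, written against quotes.toscrape.com (a site that exists specifically for scraping practice), and can be run with scrapy runspider quotes_spider.py -o quotes.json once Scrapy is installed. The rest of this guide builds the same ideas into a full project.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrape: pull the text and author out of every quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Crawl: follow the "Next" link until there are no more pages
        yield from response.follow_all(css="li.next a", callback=self.parse)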
Setup
- Create Python environment:
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
- Install dependencies:
pip install scrapy pymysql scrapy-splash
scrapy-splash lets Scrapy render JavaScript-heavy pages through a Splash server; for sites that rely even more heavily on client-side JavaScript, Selenium is a common alternative.
Start a Scrapy Project
scrapy startproject myproject
cd myproject
Directory:
myproject/
├── myproject/
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
└── scrapy.cfg
Define Items (items.py)
import scrapy

class MyprojectItem(scrapy.Item):
    # One Field per piece of data the spider will extract
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
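Items behave like dicts once filled. If you want field-level cleaning in one place, an ItemLoader is a common companion. The sketch below would live in items.py next to the item; it is only an illustration and nothing later in this guide depends on it.
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class MyprojectItemLoader(ItemLoader):
    default_item_class = MyprojectItem
    default_output_processor = TakeFirst()   # keep a single value per field
    price_in = MapCompose(str.strip)         # strip stray whitespace from scraped prices
In a spider you would then build a loader per product, e.g. loader = MyprojectItemLoader(selector=product), call add_css('title', 'h2::text') and so on, and finally yield loader.load_item().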
Create a Spider
cd myproject/spiders
scrapy genspider products example.com
Here, products is the name of the spider; you will use this name to run it from the command line (e.g., scrapy crawl products).
products.py:
import scrapy
from myproject.items import MyprojectItem

class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Extract one item per product card on the page
        for product in response.css("div.product"):
            item = MyprojectItem()
            item['title'] = product.css("h2::text").get()
            item['price'] = product.css("span.price::text").get()
            item['url'] = product.css("a::attr(href)").get()
            yield item

        # Pagination: follow the "next" link until there is none
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
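While writing selectors like the ones above, scrapy shell is the quickest way to try them against a live page before committing them to the spider. A sketch (example.com/products is a placeholder, so substitute a real URL):
scrapy shell "https://example.com/products"
# then, inside the interactive shell:
>>> response.css("div.product h2::text").getall()
>>> response.css("span.price::text").getall()
>>> view(response)   # open the downloaded page in a browser to compare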
Handle JavaScript Pages (Optional)
Some sites render their content with JavaScript. Use Scrapy together with Splash:
- Pull and run the Splash Docker container:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
- Enable Splash in settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
- Use SplashRequest in spider:
from scrapy_splash import SplashRequest

# Inside the spider class, replace start_urls with a start_requests() method:
def start_requests(self):
    # Ask Splash to render the page, waiting 1 second for JavaScript to finish
    yield SplashRequest(
        url="https://example.com/products",
        callback=self.parse,
        args={'wait': 1},
    )
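One caveat: links followed from a Splash-rendered response are not rendered by Splash automatically. If the next page also needs JavaScript, yield another SplashRequest for it. A sketch, reusing the hypothetical a.next selector from the earlier parse method:
def parse(self, response):
    # ... extract items as before ...
    next_page = response.css("a.next::attr(href)").get()
    if next_page:
        # response.urljoin() resolves relative links before handing them to Splash
        yield SplashRequest(response.urljoin(next_page),
                            callback=self.parse, args={'wait': 1})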
Pipelines: Save to MySQL (pipelines.py)
import pymysql

class MySQLPipeline:
    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='1234', db='scrapy')
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products(
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255),
                price VARCHAR(50),
                url VARCHAR(255)
            )
        """)

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        sql = "INSERT INTO products (title, price, url) VALUES (%s, %s, %s)"
        self.cursor.execute(sql, (item['title'], item['price'], item['url']))
        self.conn.commit()
        return item
Enable pipeline in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.MySQLPipeline': 300,
}
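Hard-coded credentials are fine for a demo, but a common refinement is to read them from settings.py via a from_crawler() classmethod. A sketch; the MYSQL_* setting names are invented for this example:
import pymysql

class MySQLPipeline:
    def __init__(self, host, user, password, db):
        self.host, self.user, self.password, self.db = host, user, password, db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, so settings.py can hold the credentials
        s = crawler.settings
        return cls(
            host=s.get("MYSQL_HOST", "localhost"),
            user=s.get("MYSQL_USER", "root"),
            password=s.get("MYSQL_PASSWORD", ""),
            db=s.get("MYSQL_DB", "scrapy"),
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, user=self.user,
                                    password=self.password, db=self.db)
        self.cursor = self.conn.cursor()

    # close_spider() and process_item() stay the same as above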
Middlewares: Rotate User-Agent (middlewares.py)
import random

class RandomUserAgentMiddleware:
    # Truncated example strings; use full User-Agent values in practice
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
    ]

    def process_request(self, request, spider):
        # Pick a different User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
Enable in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
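Scrapy's built-in UserAgentMiddleware only sets the header when it is missing, so the custom value above wins either way; if you prefer to be explicit, you can disable the built-in one alongside registering your own:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}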
Run the Spider
# Run normally
scrapy crawl products
# Save to JSON or CSV (-o appends to an existing file; use -O to overwrite it)
scrapy crawl products -o products.json
scrapy crawl products -o products.csv
Schedule for Automation
Option 1: Cron Job (Linux/Mac)
0 2 * * * cd /path/to/myproject && /path/to/venv/bin/scrapy crawl products
Option 2: Scrapyd (Production-level)
pip install scrapyd scrapyd-client
scrapyd
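Once the Scrapyd daemon is running (it listens on port 6800 by default), the usual flow is to deploy the project with scrapyd-deploy and schedule runs through Scrapyd's HTTP API. A sketch, assuming the url line in the [deploy] section of scrapy.cfg has been uncommented to point at the daemon:
scrapyd-deploy -p myproject
curl http://localhost:6800/schedule.json -d project=myproject -d spider=products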
Option 3: Airflow DAG
Use a PythonOperator to run the spider daily or hourly, as sketched below.
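A minimal sketch of such a DAG (Airflow 2.x imports; the project path and schedule are placeholders):
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_products_spider():
    # Shell out to Scrapy inside the project directory (placeholder path)
    subprocess.run(["scrapy", "crawl", "products"], cwd="/path/to/myproject", check=True)

with DAG(
    dag_id="scrapy_products",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # daily at 02:00, matching the cron example above
    catchup=False,
) as dag:
    crawl = PythonOperator(task_id="crawl_products", python_callable=run_products_spider)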
Best Practices for Production
- Respect robots.txt:
ROBOTSTXT_OBEY = True
- Use download delays:
DOWNLOAD_DELAY = 1
- Log scraping:
LOG_LEVEL = 'INFO'
- Error handling in pipelines for DB inserts (see the sketch after this list)
- Modular spiders per website
- Use Splash or Selenium for JS-heavy sites
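For the pipeline error-handling point above, a minimal sketch of what that can look like, building on the earlier MySQLPipeline (DropItem discards items that cannot be stored instead of aborting the crawl):
import pymysql
from scrapy.exceptions import DropItem

class MySQLPipeline:
    # open_spider() and close_spider() as in the pipeline above ...

    def process_item(self, item, spider):
        sql = "INSERT INTO products (title, price, url) VALUES (%s, %s, %s)"
        try:
            self.cursor.execute(sql, (item['title'], item['price'], item['url']))
            self.conn.commit()
        except pymysql.MySQLError as exc:
            # Undo the failed transaction and drop the item rather than crash the spider
            self.conn.rollback()
            raise DropItem(f"DB insert failed for {item.get('url')}: {exc}")
        return item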
Further Learning
Watch the Python web scraping for beginners course on freeCodeCamp.org by Joe Kearney.