Scrapy End-to-End Guide

What is Scrapy?

Scrapy is an open-source, collaborative framework for extracting the data you need from websites: a fast, high-level web crawling and scraping framework for Python.

Think of it as a complete toolkit for building web spiders and bots that can:

  • Crawl websites: Follow links from one page to another to discover all the content on a site.
  • Scrape data: Extract specific information (e.g., product names, prices, article headlines) from the HTML of a webpage.
  • Process and store data: Clean up the extracted data and save it to various formats like JSON, CSV, or a database.
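
As a quick, self-contained illustration of all three steps, here is a minimal spider against the public practice site quotes.toscrape.com (a sketch; the CSS selectors are specific to that site), which you can run without creating a full project:

# quotes_spider.py -- run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrape: pull the text and author out of each quote block
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Crawl: follow the "Next" link until there are no more pages
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)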

Setup

  1. Create a Python virtual environment:
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

  2. Install dependencies:
pip install scrapy pymysql scrapy-splash

scrapy-splash lets Scrapy render JavaScript-heavy pages through a Splash server; for sites that rely entirely on client-side rendering, Selenium is another option.
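
To confirm the installation, check that the Scrapy CLI is available inside the activated environment:

scrapy version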

Start a Scrapy Project

scrapy startproject myproject
cd myproject

Directory structure:

myproject/
├── myproject/
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
└── scrapy.cfg

Define Items (items.py)

import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
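
Items behave like dictionaries, but they only accept the fields declared above, which catches typos early. A quick illustration with hypothetical values:

item = MyprojectItem(title="Sample product", price="19.99")
item['url'] = "https://example.com/products/sample"
# item['rating'] = 4.5  # raises KeyError because 'rating' is not a declared field
print(dict(item))       # convert to a plain dict for inspection or serialization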

Create a Spider

cd myproject/spiders
scrapy genspider products example.com

products: This is the name of the spider. You will use this name to run the spider from the command line (e.g., scrapy crawl products).

products.py:

import scrapy
from myproject.items import MyprojectItem

class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css("div.product"):
            item = MyprojectItem()
            item['title'] = product.css("h2::text").get()
            item['price'] = product.css("span.price::text").get()
            item['url'] = product.css("a::attr(href)").get()
            yield item

        # Pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
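
Before running the full spider, it helps to test selectors interactively in the Scrapy shell (the selectors below mirror the ones used in the spider and assume the same hypothetical page structure):

scrapy shell "https://example.com/products"
>>> response.css("div.product h2::text").getall()
>>> response.css("a.next::attr(href)").get()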

Handle JavaScript Pages (Optional)

Some sites render their content with JavaScript. To handle them, combine Scrapy with Splash:

  1. Pull and run the Splash Docker container:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash

  2. Enable Splash in settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

  3. Use SplashRequest in the spider:
from scrapy_splash import SplashRequest

# Inside the spider class (e.g. ProductsSpider), override start_requests:
def start_requests(self):
    yield SplashRequest(
        url="https://example.com/products",
        callback=self.parse,
        args={'wait': 1}
    )
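
To confirm the Splash container is reachable before wiring it into Scrapy, you can hit its render endpoint directly (a quick sanity check using the same wait parameter):

curl "http://localhost:8050/render.html?url=https://example.com/products&wait=1"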

Pipelines: Save to MySQL (pipelines.py)

import pymysql

class MySQLPipeline:
    def open_spider(self, spider):
        # Adjust host, user, password, and db to match your MySQL setup
        self.conn = pymysql.connect(host='localhost', user='root', password='1234', db='scrapy')
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products(
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255),
                price VARCHAR(50),
                url VARCHAR(255)
            )
        """)

    def close_spider(self, spider):
        # Close the cursor and the connection when the spider finishes
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = "INSERT INTO products (title, price, url) VALUES (%s, %s, %s)"
        self.cursor.execute(sql, (item['title'], item['price'], item['url']))
        self.conn.commit()
        return item

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.MySQLPipeline': 300,
}

Middlewares: Rotate User-Agent (middlewares.py)

import random

class RandomUserAgentMiddleware:
    # Truncated placeholders -- replace with full User-Agent strings
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...'
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent per request; returning None lets Scrapy continue processing
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

Enable in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}

Run the Spider

# Run normally
scrapy crawl products

# Save to JSON or CSV (-o appends to an existing file; Scrapy 2.x also supports -O to overwrite)
scrapy crawl products -o products.json
scrapy crawl products -o products.csv

Schedule for Automation

Option 1: Cron Job (Linux/Mac)

0 2 * * * cd /path/to/myproject && /path/to/venv/bin/scrapy crawl products

Option 2: Scrapyd (Production-level)

pip install scrapyd scrapyd-client
scrapyd
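
After the Scrapyd daemon is running, a typical workflow (sketched here with Scrapyd's default port 6800 and this project's names) is to deploy the project with scrapyd-client and then schedule runs through Scrapyd's HTTP API:

# Deploy the project (requires the url line in the [deploy] section of scrapy.cfg to be uncommented)
scrapyd-deploy -p myproject

# Schedule a run of the "products" spider
curl http://localhost:6800/schedule.json -d project=myproject -d spider=products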

Option 3: Airflow DAG

Use a PythonOperator to run the spider on a daily or hourly schedule, as sketched below.
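
A minimal sketch of such a DAG, assuming Airflow 2.x, a daily schedule, and hypothetical paths to the virtual environment and project:

# dags/scrapy_products_dag.py -- adjust paths and schedule to your setup
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_products_spider():
    # Run the spider with the project's virtualenv Scrapy binary (hypothetical paths)
    subprocess.run(
        ["/path/to/venv/bin/scrapy", "crawl", "products"],
        cwd="/path/to/myproject",
        check=True,
    )

with DAG(
    dag_id="scrapy_products",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="crawl_products",
        python_callable=run_products_spider,
    )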

Best Practices for Production

  1. Respect robots.txt:
ROBOTSTXT_OBEY = True
  2. Use download delays:
DOWNLOAD_DELAY = 1
  3. Log scraping activity:
LOG_LEVEL = 'INFO'
  4. Handle errors in pipelines for DB inserts (see the sketch below)
  5. Keep spiders modular, one per website
  6. Use Splash or Selenium for JS-heavy sites
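
A minimal sketch of point 4, wrapping the insert from the MySQLPipeline above in a try/except so that one bad record does not stop the crawl (the rollback and logging behaviour are assumptions about how you want to handle failures):

    def process_item(self, item, spider):
        sql = "INSERT INTO products (title, price, url) VALUES (%s, %s, %s)"
        try:
            self.cursor.execute(sql, (item['title'], item['price'], item['url']))
            self.conn.commit()
        except pymysql.MySQLError as e:
            # Roll back the failed insert and keep crawling
            self.conn.rollback()
            spider.logger.error("Failed to insert item %r: %s", item, e)
        return item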


Further Learning

Watch the Python web scraping for beginners course on freeCodeCamp.org by Joe Kearney.