Web Scraping with Scrapy for Data Extraction

Overview

Web scraping is the process of extracting data from websites and web applications. It is a common technique used by data scientists, researchers, and developers to collect data for analysis, machine learning, and natural language processing.

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a range of purposes, from data mining to monitoring and automated testing.

Some Important Terms

Items: The main goal in scraping is to extract structured data from unstructured sources, typically web pages. Spiders are used to gather the extracted data and present it as items, which are Python objects that consist of key-value pairs.
Item Pipeline: The responsibility of the Item Pipeline is to handle the processing of items once the spiders have extracted them. This involves performing tasks such as cleansing, validation, and persistence, such as storing the item in a database.
Spiders: Spiders are custom classes written by users to parse responses and either extract items from them or yield additional requests to follow.
Spider Middlewares: Spider middlewares are specialized hooks positioned between the Engine and the Spiders, enabling them to process both the input (responses) and output (items and requests) of the spider.
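To make the Item Pipeline idea concrete, here is a minimal sketch of a pipeline class. Scrapy pipelines are plain Python classes that expose a process_item(item, spider) method; the class name PriceToFloatPipeline and the assumption that items carry a "price" string are illustrative only and not part of the original demo.

```python
class PriceToFloatPipeline:
    """Hypothetical pipeline: converts a price string such as '£51.77' to a float."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() once per scraped item and expects
        # the (possibly modified) item to be returned.
        raw = item.get("price")
        if raw is not None:
            item["price"] = float(str(raw).replace("£", "").strip())
        return item


# Quick check with a plain dict standing in for a scraped item
sample = PriceToFloatPipeline().process_item({"name": "Example Book", "price": "£51.77"}, spider=None)
print(sample)
```

In a real project, a class like this would typically live in pipelines.py and be enabled through the ITEM_PIPELINES setting; the exact module path depends on the project layout.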

Architecture Overview

The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow inside the system.

Source: Scrapy architecture

The Engine controls the data flow in Scrapy, which proceeds as follows:

1. The Engine gets the initial requests to crawl from the Spider.
2. The Engine schedules the requests in the Scheduler and asks for the next request to crawl.
3. The Scheduler returns the next request to the Engine.
4. The Engine sends the request to the Downloader, passing through the Downloader Middleware.
5. Once the page finishes downloading, the Downloader generates a response and sends it back to the Engine through the Downloader Middleware.
6. The Engine receives the response and forwards it to the Spider for processing, passing through the Spider Middleware.
7. After processing the response, the Spider returns scraped items and new requests (for subsequent crawling) to the Engine, passing through the Spider Middleware.
8. The Engine sends the scraped items to the Item Pipelines, then sends the new requests to the Scheduler and asks for the next request to crawl.
9. The process repeats (from step 3) until there are no more requests from the Scheduler.
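The loop described above can be sketched as a toy model in plain Python. This is not how Scrapy is implemented (the real engine is asynchronous and routes everything through middlewares); the FAKE_SITE data and the function names download, spider_parse, and run_engine are hypothetical and exist only to show how requests and items circulate.

```python
from collections import deque

# Fake "web": URL -> (page body, links found on that page). Purely illustrative.
FAKE_SITE = {
    "page-1": ("books on page 1", ["page-2"]),
    "page-2": ("books on page 2", []),
}


def download(url):
    """Stand-in for the Downloader: returns a response-like dict."""
    body, links = FAKE_SITE[url]
    return {"url": url, "body": body, "links": links}


def spider_parse(response):
    """Stand-in for a Spider callback: yields items (dicts) and new requests (str)."""
    yield {"source": response["url"], "text": response["body"]}
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    scheduler = deque(start_urls)  # initial requests are scheduled first
    seen = set(start_urls)
    items = []
    while scheduler:  # repeat until the scheduler has no more requests
        request = scheduler.popleft()  # scheduler hands over the next request
        response = download(request)  # downloader turns it into a response
        for result in spider_parse(response):  # spider processes the response
            if isinstance(result, dict):
                items.append(result)  # items would go on to the item pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # new requests go back to the scheduler
    return items


print(run_engine(["page-1"]))
```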

Demo

In the demo, we will scrape a website containing a list of books and extract each book’s name, image link, and price using Scrapy.

Website to scrape: All products | Books to Scrape – Sandbox (http://books.toscrape.com/)

[Figure: example HTML for a single book]
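As a rough idea of the markup the demo targets, the fragment below is a hypothetical, abridged book entry. It assumes the class names used by the spider’s selectors (product_pod, image_container, product_price) plus an assumed price_color class; the real page may differ. The small parser uses only the standard library and is not how Scrapy parses pages; it simply pulls out the three fields of interest.

```python
from html.parser import HTMLParser

# Hypothetical, abridged markup modelled on one book entry; the real page may differ.
BOOK_HTML = """
<li>
  <article class="product_pod">
    <div class="image_container">
      <a href="catalogue/example-book_1/index.html">
        <img src="media/cache/example.jpg" alt="Example Book" class="thumbnail">
      </a>
    </div>
    <h3><a href="catalogue/example-book_1/index.html" title="Example Book">Example Book</a></h3>
    <div class="product_price">
      <p class="price_color">£51.77</p>
    </div>
  </article>
</li>
"""


class BookFields(HTMLParser):
    """Collects the three fields the demo extracts: name, price, and image link."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # name of the field whose text should be read next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.fields["link"] = attrs.get("src")
        elif tag == "a" and "title" in attrs:
            self._capture = "name"
        elif tag == "p" and attrs.get("class") == "price_color":
            self._capture = "price"

    def handle_data(self, data):
        # Store the first non-blank text after a tag we decided to capture
        if self._capture and data.strip():
            self.fields[self._capture] = data.strip()
            self._capture = None


parser = BookFields()
parser.feed(BOOK_HTML)
print(parser.fields)
```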

We will use CMD/Terminal or Anaconda Prompt for the following steps.

1. Install Scrapy using: pip install scrapy

2. Go to the folder where the project should be created and run the following command:

scrapy startproject bookscraper

Here, “bookscraper” is the project name.

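3. The command generates the following default project structure:

bookscraper/
    scrapy.cfg
    bookscraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py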

4. Inside the “spiders” folder, we create a file named “scarpybooks.py”.

5. Inside the file, we add the following code:

import scrapy
from scrapy.loader import ItemLoader

from ..item.book_item import BookItem


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # Each <li> in the product list holds one book
        for book in response.css('div.col-sm-8 section div:nth-child(2) > ol > li'):
            loader = ItemLoader(item=BookItem(), selector=book)
            loader.add_css('name', '.product_pod > h3 > a')
            loader.add_css('price', '.product_pod > div.product_price > p')
            loader.add_css('link', '.product_pod > div.image_container > a > img::attr(src)')

            yield loader.load_item()

        # .get() returns None on the last page, where there is no "next" link
        next_page = response.css('li.next > a::attr(href)').get()
        if next_page is not None:
            # response.follow() resolves relative URLs against the current page
            yield response.follow(next_page, callback=self.parse)
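Pagination depends on turning the href of the “next” link into an absolute URL. The sketch below shows that resolution with only the standard library, which is comparable to what response.follow does for relative links. The helper name next_page_url and the sample href values are assumptions for illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if href is None:  # last page: no "next" link was found
        return None
    return urljoin(current_url, href)


# On the first page the link is relative to the site root
print(next_page_url("http://books.toscrape.com/", "catalogue/page-2.html"))
# On later pages the link is relative to the catalogue folder
print(next_page_url("http://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# No link at all means the crawl stops
print(next_page_url("http://books.toscrape.com/catalogue/page-50.html", None))
```

Because relative links resolve against the current page, no manual string replacement on the href is needed.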

6. Inside the project package (the inner “bookscraper” folder, next to “spiders”), we create another folder called “item” to hold our item classes.

7. Inside the “item” folder, we create a “book_item.py” file with the following code:

from urllib.parse import urljoin

import scrapy
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags

BASE_URL = 'http://books.toscrape.com/'


def remove_currency(value):
    # Strip the pound sign and surrounding whitespace, e.g. '£51.77' -> '51.77'
    return value.replace('£', '').strip()


def get_full_url(value):
    # urljoin also normalizes relative paths that start with '../'
    return urljoin(BASE_URL, value)


class BookItem(scrapy.Item):
    # remove_tags strips the HTML tags and keeps only the text
    name = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(remove_tags, remove_currency), output_processor=TakeFirst())
    # TakeFirst() returns the first non-null/non-empty value from the values received
    link = scrapy.Field(input_processor=MapCompose(get_full_url), output_processor=TakeFirst())
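To show what the input and output processors do to a field, here is a simplified, standard-library-only model. The functions map_compose, take_first, and remove_tags_simple are hypothetical stand-ins written for this sketch; the real MapCompose, TakeFirst, and remove_tags handle more cases (for example, MapCompose also drops None results), and the sample values are assumed.

```python
import re


def remove_tags_simple(value):
    """Rough stand-in for w3lib.html.remove_tags (regex-based, illustration only)."""
    return re.sub(r"<[^>]+>", "", value)


def remove_currency(value):
    return value.replace("£", "").strip()


def map_compose(*functions):
    """Simplified model of MapCompose: apply each function to every value in turn."""
    def process(values):
        for func in functions:
            values = [func(v) for v in values]
        return values
    return process


def take_first(values):
    """Simplified model of TakeFirst: first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# The price selector can match more than one <p>; only the first result is kept
price_in = map_compose(remove_tags_simple, remove_currency)
extracted = ['<p class="price_color">£51.77</p>', '<p class="instock">In stock</p>']
print(take_first(price_in(extracted)))
```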

8. Finally, from the project folder, we run the following command:

scrapy crawl books -O books.json

This runs the “books” spider and stores all the extracted data in books.json. The -O option overwrites the file if it already exists.
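Once the crawl finishes, the exported file can be inspected with a few lines of Python. The sketch below assumes the file is a JSON list of objects with "name", "price", and "link" keys, and that prices are stored as strings without the currency symbol; the function name summarize is hypothetical.

```python
import json


def summarize(path):
    """Load the exported items and return (number of books, average price)."""
    with open(path, encoding="utf-8") as fh:
        books = json.load(fh)
    # Skip items whose price is missing or empty
    prices = [float(book["price"]) for book in books if book.get("price")]
    average = sum(prices) / len(prices) if prices else 0.0
    return len(books), average
```

Calling summarize("books.json") from the folder where the crawl was run would return the item count and the mean price.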

Conclusion

Scrapy is a powerful and flexible web crawling and scraping framework that enables users to extract structured data from websites and web applications efficiently. Its spider-based crawling architecture, support for asynchronous and concurrent requests, and convenient item pipelines make it a popular tool for data scientists, researchers, and developers. By following best practices for web scraping and using Scrapy, we can collect and process data for analysis, machine learning, and natural language processing.

FAQs

1. What is Scrapy?

ANS: – Scrapy is an open-source Python framework used for web scraping and crawling. It provides a powerful and flexible set of tools for extracting structured data from websites.

2. Are there alternatives to Scrapy?

ANS: – Yes, other web scraping tools, such as BeautifulSoup and Selenium, are available in Python. Each has its strengths and use cases, so it’s important to consider the specific requirements of the scraping project when choosing a framework.

3. Are there any limitations or ethical considerations with web scraping using Scrapy?

ANS: – Web scraping has certain legal and ethical implications. It’s important to respect website terms of service and robots.txt files and be mindful of the impact on the targeted website’s server. Additionally, some websites may have measures to prevent scraping, so it’s important to be aware of any limitations or restrictions.
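As one way to act on these considerations, a project’s settings.py can ask Scrapy to honor robots.txt and to slow down its requests. The setting names below are standard Scrapy settings, but the values and the contact URL are illustrative assumptions to be tuned per site.

```python
# settings.py (excerpt): values are illustrative assumptions, tune them per site
ROBOTSTXT_OBEY = True  # honor the rules in the site's robots.txt
DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True  # adapt the delay to observed server latency
USER_AGENT = "bookscraper (+https://example.com/contact)"  # hypothetical contact URL
```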