Data Extraction with Web Scraping Using the Python-Based Framework Scrapy

Introduction

Web scraping is the process of extracting data from websites automatically. It is a valuable tool for researchers, data analysts, and other professionals who need to collect and analyze data from the web.

One of the most popular tools for web scraping is Scrapy, a Python-based framework that allows you to create spiders that crawl through websites and extract data.

In this blog, we will explore how to use Scrapy for web scraping in more detail.

Why Use Scrapy for Web Scraping?

Scrapy is a powerful and flexible tool for web scraping with several features that make it a popular choice among developers. Some of the advantages of using Scrapy for web scraping include:

  1. Python-based: Scrapy is written in Python, a popular language among data scientists and developers. Python is known for its ease of use and readability, making writing and maintaining code easy.
  2. Scalability: Scrapy is designed to handle large-scale web scraping tasks. It can efficiently crawl through websites and extract data at high speeds, making it a valuable tool for scraping large datasets.
  3. Extensible: Scrapy is a highly extensible tool, with many plugins and extensions available that can be used to customize its functionality. This allows developers to tailor Scrapy to their specific needs and requirements.
  4. Built-in features: Scrapy has many built-in features that make web scraping easier, such as automatic handling of cookies and sessions, support for multiple types of data storage, and the ability to use different user agents and proxies.
  5. Active community: Scrapy has an active community of developers who contribute to its development and provide support for users. This community helps to ensure that Scrapy is always up-to-date and reliable.


Steps to Create a Folder for the Project

Create a folder for the project using CMD (md ScrapyTutorial) or by creating it manually.

Using the terminal in VS Code, go inside the folder where you will install the packages (cd folder_name), and open that folder in VS Code.

Install the required packages inside the folder using CMD or the terminal, starting with the environment (a combined example session follows the list):

  1. pip install pipenv
  2. pip freeze (to check that it installed properly)
  3. pip install scrapy (the main Scrapy framework)
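
Put together, a typical session looks like the following (Windows CMD assumed; the pipenv shell activation step is an assumption, since the post only says to start with the environment):

    md ScrapyTutorial
    cd ScrapyTutorial
    pip install pipenv
    pipenv shell
    pip install scrapy
    pip freeze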

Steps to Create a Scrapy Project

To start with Scrapy, you must first create a new Scrapy project. This can be done from the command line by running the following command:
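
(This is the standard Scrapy project-creation command; project_name is a placeholder.)

    scrapy startproject project_name

In this post the project appears to be named quotestutorial, judging by the QuotestutorialItem class defined later, so the actual command would likely be scrapy startproject quotestutorial.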

This will create a new directory with the name project_name, containing all the files needed for your Scrapy project. The most important files in this directory are:

  • scrapy.cfg: The project's configuration file, which points Scrapy to the project's settings module (settings.py), where options such as the user agent are configured.
  • items.py: This file defines the data items you want to extract from the website.
  • middlewares.py: This file contains middleware settings, such as the handling of cookies and user agents.
  • pipelines.py: This file defines the pipelines you want to use to store the extracted data.

Steps to Create a Spider Main File

A spider is the core component of Scrapy, defining how to crawl a website and what data to extract. You can create a spider by running the following command in your project directory:
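
(The standard command for this step is scrapy genspider; the spider name and target domain below are assumptions, chosen to be consistent with the quotes_spider.py file name and the quotes example used in this post.)

    scrapy genspider quotes_spider quotes.toscrape.com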

This command will create a new Python file with the name spider_name.py (here it is quotes_spider.py) in the spiders directory of your project. You can then edit this file to define how to crawl the website and extract data.

quotes_spider.py file
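
A minimal sketch of the spider, reconstructed from the description below; the target URL https://quotes.toscrape.com/ and the exact CSS selector are assumptions:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        # Name used to invoke the spider, e.g. "scrapy crawl quotes"
        name = "quotes"
        # Starting URL(s) for the crawl; multiple sites can be listed here
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Extract the page title using a built-in CSS selector
            # and yield it as the "title" result
            title = response.css("title::text").extract()
            yield {"title": title}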

Let’s discuss the code:

  • The QuoteSpider class extends the scrapy.Spider class, which provides the basic functionality for crawling websites and extracting data.
  • The name attribute specifies the name (quotes) of the Spider.
  • The start_urls attribute specifies the starting URLs for the Spider to crawl. Multiple sites can be used.
  • The parse method is called for each URL that Spider crawls. It extracts the data from the HTML response using Scrapy’s built-in CSS selectors and yields the results as a title object.

Define items.py

In Scrapy, an item is a container containing the data extracted from the website. We will define an item to hold the product data we extract from the website. To define the item, create a new file called items.py in the project directory and define the following class:
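
A minimal sketch of the item class, consistent with the description that follows:

    import scrapy

    class QuotestutorialItem(scrapy.Item):
        # Field that holds the extracted page title
        title = scrapy.Field()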

The QuotestutorialItem class extends the scrapy.Item class and defines the fields used to store the data we extract, such as title, each declared as a scrapy.Field().

Configure Settings

Scrapy provides a settings module that allows you to configure various settings for your Spider, such as the user agent, the download delay, and the maximum number of concurrent requests. To configure the settings, open the settings.py file in the project directory and make the changes you want.
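
For example, the settings named above can be set in settings.py like this (the values are illustrative, not from the original post):

    # settings.py (illustrative values)
    USER_AGENT = "quotestutorial"   # identify your crawler to the site
    DOWNLOAD_DELAY = 1              # seconds to wait between requests
    CONCURRENT_REQUESTS = 8         # cap on simultaneous requests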

Run the Spider

To run the Spider, open a terminal window in the project directory, and run the following command:
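
(The command below matches the spider name and the output file described next.)

    scrapy crawl quotes -o products.json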

This command tells Scrapy to run the QuoteSpider Spider and output the results to a JSON file called products.json. Scrapy will crawl the given website, extract the title data, and save it to the specified file. The data can also be stored in other formats, such as CSV or XML.

Conclusion

In this blog post, we discussed using Scrapy to extract data from a website with a real-world example. We walked through creating a Scrapy project, defining a Spider to crawl the website and extract data, defining an item to hold the extracted data, configuring the settings for the Spider, running the Spider, and processing the extracted data. Scrapy is a powerful tool for web scraping, and it provides a wide range of tools and features that can be used to extract data from websites.

Drop a query if you have any questions regarding web scraping with Scrapy, and we will get back to you quickly.


About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

To get started, go through our Consultancy page and Managed Services Package, CloudThat's offerings.

FAQs

1. What is Scrapy, and how does it work for web scraping?

ANS: – Scrapy is a powerful open-source Python framework for web scraping, data extraction, and crawling. It sends HTTP requests to the target website and extracts data from the HTML responses using XPath or CSS selectors.
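
For instance, both selector styles are available on the response object inside a spider's parse method (a minimal illustration):

    # Inside a spider's parse(self, response) method:
    response.css("h1::text").get()        # CSS selector
    response.xpath("//h1/text()").get()   # equivalent XPath selector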

2. What are the benefits of using Scrapy for web scraping?

ANS: – Scrapy offers several advantages for web scraping, such as high performance, built-in support for handling asynchronous requests, powerful parsing capabilities, and an extensible architecture that allows you to customize and extend its functionality.

3. How do you install Scrapy on your system?

ANS: – To install Scrapy, you can use pip, the package manager for Python. Open a command prompt or terminal window and run “pip install scrapy”. Make sure you have Python installed on your system before installing Scrapy.

WRITTEN BY Vinay Lanjewar
