- Web Scraper
- Web Crawler
- Scrapy
- Scrapy Installation
- Scrapy Packages
- Scrapy File Structure
- SCRAPY COMMAND LINE TOOL
- Global Commands
- Project-only Commands
- SPIDERS
- Scrapy.Spider
- Spider Arguments
- Generic Spiders
- Selectors
- ITEMS
- Item Types
- Working with Item Objects
- ITEM LOADERS
- SCRAPY SHELL
- ITEM PIPELINE
- FEED EXPORTS
- Settings
- Feeds
- REQUESTS AND RESPONSES
- Response subclasses
- LINK EXTRACTORS
- Link Extractor Reference
- SETTINGS
- EXCEPTIONS
Web Scraper
A web scraper is a tool used to extract data from a website.
It involves the following process:
- Figure out the target website
- Get the URL of the pages from which the data needs to be extracted.
- Obtain the HTML/CSS/JS of those pages.
- Find the locators, such as XPath expressions, CSS selectors or regexes, for the data that needs to be extracted.
- Save the data in a structured format such as JSON or CSV file.
Web Crawler
A web crawler is used to collect the URLs of websites and their child pages. The crawler collects all the links associated with a website, records (or copies) them, and stores them on the search engine's servers as a search index. The search engine uses this index to find and rank pages, and the pages are then displayed to the user based on that ranking.
The web crawler can also be called a web spider, spider bot, crawler or web bot.
Scrapy
Scrapy does the work of both a web crawler and a web scraper. Hence, Scrapy is quite handy for crawling a site, extracting data from it and storing the data in a structured format. Scrapy can also extract data from APIs.
Scrapy provides:
- Built-in support for selecting and extracting data using locators such as XPath expressions and CSS selectors, as well as regular expressions.
- An interactive shell console (the Scrapy shell) that can be used to try out expressions without running the entire spider. It is handy for debugging, writing and checking Scrapy code before the final spider is executed.
- Facility to store the data in structured formats such as:
- JSON
- JSON Lines
- CSV
- XML
- Pickle
- Marshal
- Facility to store the extracted data in:
- Local filesystems
- FTP
- S3
- Google Cloud Storage
- Standard output
- Facility to use APIs and signals (functions that are called when a particular event occurs)
- Facility to handle:
- HTTP features
- User-agent spoofing
- Robots.txt
- Crawl depth restriction
- Telnet console – a Python console that runs inside the Scrapy process to introspect and debug a running crawler.
- And more
Scrapy Installation
Scrapy can be installed by:
Using Anaconda / Miniconda.
Type the following command in the Conda shell:
conda install -c conda-forge scrapy
Alternatively, you could do the following.
pip install Scrapy
Scrapy Packages
- lxml – XML and HTML parser
- parsel – HTML/XML data extraction library that lies on top of lxml
- w3lib – multi-purpose helper for dealing with URLs and web page encodings
- twisted – asynchronous networking framework
- cryptography and pyOpenSSL – for network-level security needs.
Scrapy File Structure
A scrapy project will have two parts.
- Configuration file – scrapy.cfg sits in the project root directory and points to the project's settings module. Scrapy looks for the cfg file in the following places:
- System-wide – /etc/scrapy.cfg or c:\scrapy\scrapy.cfg
- Global (user-level) – ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME)
- Scrapy project root – scrapy.cfg
Settings from these files are merged with the following precedence (highest first):
- Project-wide settings (scrapy.cfg in the project root)
- User-level (global) values
- System-wide defaults
Environment variables through which Scrapy can be controlled are :
- SCRAPY_SETTINGS_MODULE
- SCRAPY_PROJECT
- SCRAPY_PYTHON_SHELL
- A project folder – It contains the following files:
- __init__.py
- items.py
- middlewares.py
- pipelines.py
- settings.py
- spiders/ – the folder where the spiders we create get stored.
A project’s configuration file can be shared between multiple projects having its own settings module.
SCRAPY COMMAND LINE TOOL
The Scrapy command line provides many commands. Those commands can be classified into two groups.
- Global commands
- Project-only commands
To see all the commands available type the following in the shell:
scrapy -h
To see the help for a particular command, run scrapy <command> -h. The general syntax of a command is:
scrapy <command> [options] [args]
Global Commands
These are the commands that work even without an active Scrapy project.
- startproject
scrapy startproject <project_name> [project_dir]
Usage: It is used to create a project with the specified project name under the specified project directory. If the directory is not mentioned, then the project directory will be the same as the project name.
Example:
scrapy startproject tutorial
This will create a directory named "tutorial" containing the project (also named "tutorial") and its configuration file.
- genspider
scrapy genspider [-t template] <name> <domain>
Usage: This is used to create a new spider in the current folder. It is best practice to create the spider after traversing inside the project's spiders folder. The spider's name is given by the <name> parameter, and <domain> is used to generate the spider's start_urls and allowed_domains attributes.
Example:
scrapy genspider tuts https://www.imdb.com/chart/top/
This will create a spider named tuts in tuts.py, with the domain of the given URL as the allowed domain. Use this command after traversing into the spiders folder.
- settings
scrapy settings [options]
Usage: It shows the Scrapy default settings when run outside a project and the project's settings when run inside one.
The following options can be used with the settings:
--help show this help message and exit
--get=SETTING print raw setting value
--getbool=SETTING print setting value, interpreted as a Boolean
--getint=SETTING print setting value, interpreted as an integer
--getfloat=SETTING print setting value, interpreted as a float
--getlist=SETTING print setting value, interpreted as a list
--logfile=FILE log file; if omitted, stderr will be used
--loglevel=LEVEL log level
--nolog disable logging completely
--profile=FILE write python cProfile stats to file
--pidfile=FILE write process ID to file
--set=NAME=VALUE set/override a setting
--pdb enable pdb on failure
Example:
scrapy crawl tuts -s LOG_FILE=scrapy.log
- runspider
scrapy runspider <spider.py>
Usage: To run a self-contained spider without having to create a project.
Example:
scrapy runspider tuts.py
- shell
scrapy shell [url]
Usage: Starts the Scrapy shell for the given URL (if any).
Options:
--spider=SPIDER (the mentioned spider will be used and auto-detection gets bypassed)
-c code (evaluates the code in the shell, prints the result and exits)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy shell https://www.imdb.com/chart/top/
Scrapy will start the shell on https://www.imdb.com/chart/top/ page.
- fetch
scrapy fetch <url>
Usage:
The Scrapy downloader downloads the given URL and writes the content to standard output.
Options:
--spider=SPIDER (the mentioned spider will be used and auto-detection gets bypassed)
--headers (the response's HTTP headers will be shown instead of the body)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy fetch https://www.imdb.com/chart/top/
Scrapy will download the https://www.imdb.com/chart/top/ page.
- view
scrapy view <url>
Usage:
Scrapy will open the mentioned URL in the default browser. This helps to view the page from the spider's perspective, i.e. as the spider "sees" it.
Options:
--spider=SPIDER (the mentioned spider will be used, and auto-detection gets bypassed)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy view https://www.imdb.com/chart/top/
Scrapy will open https://www.imdb.com/chart/top/ page in the default browser.
- version
Syntax: scrapy version [-v]
Usage:
Prints the Scrapy version. With -v it also prints Python, Twisted and platform information.
Project-only Commands
These are the commands that only work inside an active Scrapy project.
- crawl
Syntax:
scrapy crawl <spider>
Usage:
This will start the crawling.
Example:
scrapy crawl tuts
Scrapy will crawl the domains mentioned in the spider.
- check
Syntax:
scrapy check [-l] <spider>
Usage:
Runs contract checks, i.e. checks what the spider's callbacks return.
Example:
scrapy check tuts
Scrapy will check the spider's contracts and report the result (e.g. "OK").
- list
Syntax:
scrapy list
Usage:
Lists the names of all the spiders present in the project.
Example:
scrapy list
Scrapy will return all the spiders that are there in the project
- edit
Syntax:
scrapy edit <spider>
Usage:
This command is used to edit the spider. The editor set in the EDITOR environment variable (or the EDITOR setting) will open up. If it is not set, IDLE opens on Windows and vi opens on UNIX. The developer is not restricted to this editor and can use any other editor.
Example:
scrapy edit tuts
Scrapy will open tuts in the editor.
- parse
Syntax:
scrapy parse <url> [options]
Usage:
Scrapy fetches the given URL and parses it with the spider that handles it. The method passed with --callback is used; if none is given, parse() is used.
Options:
--spider=SPIDER (the mentioned spider will be used, and auto-detection gets bypassed)
-a NAME=VALUE (to set a spider argument)
--callback (spider method to use as callback for parsing)
--cb_kwargs (additional keyword arguments for the callback)
--meta (request meta passed to the callback method)
--pipelines (to process items through pipelines)
--rules (use CrawlSpider rules to discover the callback)
--noitems (hides scraped items)
--nocolour (removes colours from the output)
--nolinks (hides extracted links)
--depth (the depth level to which the requests need to be followed recursively)
--verbose (displays information for each depth level)
--output (stores the output in a file)
Example:
scrapy parse https://www.imdb.com/chart/top/
Scrapy will parse the https://www.imdb.com/chart/top/ page.
- bench
Syntax: scrapy bench
Usage:
To run a benchmark test.
Custom commands can be added through the COMMANDS_MODULE setting, whose value is the module that contains them, for example:
COMMANDS_MODULE = 'mybot.commands'
The scrapy.commands entry point in setup.py can also be used to add commands from an external library.
SPIDERS
The spiders folder contains the classes that define how a site is crawled and how data is scraped from it. They can be customised as per the requirement.
SPIDER SCRAPING CYCLE
There are different types of Spiders available for various purposes.
Scrapy.Spider
Class: scrapy.spiders.Spider
It is the simplest spider. It provides a default start_requests() method that sends requests for the URLs listed in start_urls and calls parse() for each resulting response.
name – the name of the spider. It should be unique, although more than one instance of the same spider can be instantiated. Best practice is to name the spider after the website being crawled.
allowed_domains – only the domains mentioned in this list are allowed to be crawled. To crawl a domain that is not in the list, OffsiteMiddleware has to be disabled.
start_urls – the list of URLs from which the spider starts crawling.
custom_settings – settings to override for this spider. It must be defined as a class attribute, since the settings are applied before the spider is instantiated.
crawler – the from_crawler() method sets this attribute. It links the crawler object with the spider object.
settings – the settings with which the spider/project is run.
logger – a Python logger created with the spider's name; it carries all of the spider's log messages.
from_crawler(crawler, *args, **kwargs) – the class method Scrapy uses to create spiders. It sets the crawler and settings attributes.
A. crawler – the object that binds the spider to the crawler
B. args – arguments that are passed to __init__()
C. kwargs – keyword arguments that are passed to __init__()
start_requests() – returns the first requests to crawl. It is called only once and, by default, generates a Request for each URL in start_urls.
parse(response) – the default callback; it receives the response and returns scraped data and/or further requests.
log(message, level, component) – sends a log message through the spider's logger.
closed(reason) – called when the spider closes; it is a shortcut for connecting to the spider_closed signal via signals.connect().
A minimal spider putting these attributes and methods together is sketched below.
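To illustrate, here is a minimal sketch of a scrapy.Spider subclass (the target site and CSS selectors are illustrative assumptions, not part of the original text):
import scrapy

class QuotesSpider(scrapy.Spider):
    # unique name, used with "scrapy crawl quotes"
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # default callback: called once for each downloaded response
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        self.logger.info("Parsed %s", response.url)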
Spider Arguments
Arguments can be given to spiders. The arguments are passed through the crawl command using the -a option.
The __init__() method takes these arguments and applies them as spider attributes.
Example:
scrapy crawl tuts -a category=electronics
__init__() should accept category as an argument for this command to work, as sketched below.
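A minimal sketch of a spider that accepts this argument (the URL pattern is a hypothetical placeholder used only for illustration):
import scrapy

class TutsSpider(scrapy.Spider):
    name = "tuts"

    def __init__(self, category=None, *args, **kwargs):
        # "scrapy crawl tuts -a category=electronics" ends up here
        super().__init__(*args, **kwargs)
        # hypothetical URL pattern, only for illustration
        self.start_urls = [f"https://example.com/categories/{category}"]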
Generic Spiders
These spiders can be used for rule-based crawling, crawling Sitemaps, or parsing XML/CSV feed.
CrawlSpider
Class – scrapy.spiders.CrawlSpider
This is the spider that crawls based on rules that can be custom written.
Attributes:
- rules – a list of Rule objects that define the crawling behaviour.
- parse_start_url(response, **kwargs) – called for each response produced for the start URLs. It must return an item object, a Request, or an iterable containing either.
Crawling Rules:
class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None,
process_links=None, process_request=None, errback=None)
link_extractor – a Link Extractor object defining how links are to be extracted from each crawled page. A Request object is generated for each extracted link.
callback – called for each extracted link. It receives a response as its first argument and must return an iterable of items and/or Requests.
cb_kwargs – a dict of keyword arguments for the callback function.
follow – a Boolean specifying whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links – called for the list of links extracted from each response.
process_request – called for each request generated by the rule.
errback – a callable to be called if an exception is raised while processing a request generated by the rule.
A CrawlSpider wired up with such rules is sketched below.
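A minimal sketch (the site, URL patterns and selectors are illustrative assumptions):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    rules = (
        # follow pagination links; no callback, so follow defaults to True
        Rule(LinkExtractor(allow=r"/page/\d+/")),
        # parse author pages; a callback is given, so follow defaults to False
        Rule(LinkExtractor(allow=r"/author/"), callback="parse_author"),
    )

    def parse_author(self, response):
        yield {
            "name": response.css("h3.author-title::text").get(),
            "born": response.css("span.author-born-date::text").get(),
        }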
XMLFeedSpider
Class – scrapy.spiders.XMLFeedSpider
It is used to parse XML feeds by iterating over their nodes with a chosen iterator (iternodes, xml or html); iternodes is recommended for performance reasons.
The following class attributes must be defined to set the iterator and the tag name:
- iterator – the iterator to use: iternodes, html or xml. The default is iternodes.
- itertag – the name of the node (tag) to iterate over.
- namespaces – a list of (prefix, uri) tuples defining the namespaces available in the document processed by this spider.
The following overridable methods are available as well:
- adapt_response(response) – receives the response as soon as it arrives and can modify its body before it is parsed; it must return a response.
- parse_node(response, selector) – must be overridden for the spider to work. It is called for each node matching itertag and should return an item object, a Request, or an iterable containing either.
- process_results(response, results) – performs any last-minute processing of the results if required.
A short sketch of an XMLFeedSpider follows.
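A minimal XMLFeedSpider sketch (the feed URL and tag names are assumptions for illustration):
from scrapy.spiders import XMLFeedSpider

class NewsFeedSpider(XMLFeedSpider):
    name = "newsfeed"
    start_urls = ["https://example.com/feed.xml"]  # hypothetical feed URL
    iterator = "iternodes"  # the default, recommended for performance
    itertag = "item"        # iterate over each <item> node

    def parse_node(self, response, node):
        # "node" is a selector scoped to the current <item>
        yield {
            "title": node.xpath("title/text()").get(),
            "link": node.xpath("link/text()").get(),
        }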
CSVFeedSpider
Class – scrapy.spiders.CSVFeedSpider
This spider iterates over the rows of a CSV feed; parse_row() is called for each row.
delimiter: the separator character for the fields of each row. The default is ",".
quotechar: the enclosure character for each field. The default is '"'.
headers: the column names of the CSV file.
parse_row(response, row): receives the response and a dict (one per row) with a key for each header of the CSV file. This spider also allows adapt_response and process_results to be overridden for pre- and post-processing.
SitemapSpider
Class – scrapy.spiders.SitemapSpider
It crawls a site by discovering sitemap URLs, for example from robots.txt.
- sitemap_urls – a list of URLs pointing to the sitemaps (or to robots.txt) whose URLs need to be crawled.
- sitemap_rules – a list of (regex, callback) tuples; URLs matching the regex are processed with the given callback.
- sitemap_follow – a list of regexes of sitemaps that should be followed.
- sitemap_alternate_links – specifies whether alternate links for a URL should be followed. Disabled by default.
- sitemap_filter(entries) – a filter function that can be overridden to select sitemap entries based on their attributes. A minimal SitemapSpider is sketched below.
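A minimal sketch (the URLs, rules and selectors are assumptions for illustration):
from scrapy.spiders import SitemapSpider

class StoreSpider(SitemapSpider):
    name = "store"
    # hypothetical robots.txt; Scrapy discovers the sitemaps listed in it
    sitemap_urls = ["https://example.com/robots.txt"]
    # route matching URLs to different callbacks
    sitemap_rules = [
        (r"/product/", "parse_product"),
        (r"/category/", "parse_category"),
    ]

    def parse_product(self, response):
        yield {"name": response.css("h1::text").get(), "url": response.url}

    def parse_category(self, response):
        self.logger.info("Category page: %s", response.url)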
Selectors
Scrapy uses CSS selectors or XPath expressions to select HTML elements.
Querying can be done using response.css() or response.xpath().
Example:
response.css("div::text").get()
Selector() can also be used if needed directly.
.get() or .getall() is used along with the response to extract the data.
.get() – will give a single result. None if nothing gets matched.
.getall() – will give a list of matches.
CSS pseudo-elements can be used to select text or attribute-nodes.
.get() has an alias .extract_first().
.get() returns None if no match is found. A default value can be given to replace None with some other value with the help of .get(default='value').
.attrib[] can also be used to query via attributes of a tag for CSS selectors.
Example:
response.css('a').attrib['href']
Non-standard pseudo-elements that are essential for web scraping are:
- ::text – selects the text nodes
- ::attr(name) – selects attributes values.
Adding a * in front of ::text (i.e. *::text) selects the text nodes of the current element and all of its descendants.
*::text
foo::text returns no result if the element exists but contains no text value; it can be used to check for that case.
Nesting Selectors
Selection methods return selectors of the same type, so further selections can be made on them; this is called nesting of selectors.
Example:
val = response.css("div::text")
val.getall()
Selecting element attributes
Attributes of an element can be obtained using XPath or CSS selectors.
XPath – the advantage of XPath is that @attributes can be used both to select values and as a filter, and it is a standard feature.
Example: response.xpath("//a/@href").get()
CSS selector: ::attr(…) can be used to get attribute values as well.
Example: response.css('img::attr(src)').get()
Or the .attrib property can also be used.
Example: response.css('img').attrib['src']
Using Selectors with regular expressions
.re() can be used to extract data along with Xpath or with CSS.
Example: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
.re_first() can also be used to extract the first element.
Some equivalents
| Selection (deprecated) | Current equivalent |
|---|---|
| SelectorList.extract_first() | SelectorList.get() |
| SelectorList.extract() | SelectorList.getall() |
| Selector.extract() | Selector.get() |
Selector.getall() – will return a list.
.get() returns single output
.getall() – return a list
.extract() may return either a single output or a list. To always get a single result, call extract_first() (or .get()) instead.
Working with relative XPATHS
Absolute XPath – an XPath that starts with '/' is absolute, even when it is applied to a nested selector.
The proper way to make it relative is to put a '.' in front of the '/'.
Example:
divs = response.xpath("//div")
for p in divs.xpath(".//p"):
    print(p.get())
or
for p in divs.xpath("p"):
    print(p.get())
More details on XPath can be found at https://www.w3.org/TR/xpath/all/#location-paths
Querying elements by class: use CSS
If this is done with XPath alone, the resulting expression ends up being quite complicated.
If '@class = "someclass"' is used, the output might have missing elements (elements that also carry other classes are skipped).
If 'contains(@class, "someclass")' is used, then more elements than needed might come up in the result.
As Scrapy allows chaining of selectors, a CSS selector can be used to select the class element, and XPath can then be chained to it to select the required elements.
Example:
response.css(".shout").xpath('./div').getall()
A '.' should be put in front of the '/' in the XPath that follows the CSS selector.
Difference between //node[1] and (//node)[1]
(//node)[1] – selects all the nodes in the document first, and then the first element of that list is selected.
//node[1] – selects the first node occurring under each of its respective parents. The snippet below illustrates the difference.
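A small sketch using parsel (the selector library bundled with Scrapy); the HTML fragment is made up purely for illustration:
from parsel import Selector

html = """
<ul><li>1</li><li>2</li></ul>
<ul><li>3</li><li>4</li></ul>
"""
sel = Selector(text=html)

# (//li)[1]: gather every <li> in the document, then keep the first -> ['1']
print(sel.xpath("(//li)[1]/text()").getall())

# //li[1]: the first <li> under each parent <ul> -> ['1', '3']
print(sel.xpath("//li[1]/text()").getall())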
Text nodes under condition
When .//text() is passed to contains() or starts-with(), it yields a node-set of text elements; when such a node-set is converted to a string, only the first element is used, so the condition may not match as expected. Hence it is better to use '.' alone instead of './/text()'.
Variables in Xpath expressions
$somevariable is used as a variable reference; its value is passed to the query and substituted at evaluation time.
Example:
response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
More examples on https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions
Removing namespaces
The selector.remove_namespaces() method can be used to remove all namespaces, so that the elements of that HTML/XML file can be queried by their plain tag names.
Example:
response.selector.remove_namespaces()
Namespaces are not removed by default by Scrapy because the namespaces of a page are sometimes needed and removing them requires extra work, so this method is called only when needed.
Using EXSLT extensions
| Prefix | Namespace | Usage |
|---|---|---|
| re | http://exslt.org/regular-expressions | regular expressions |
| set | http://exslt.org/sets | set manipulation |
Regular Expressions
The re:test() function is used when starts-with() and contains() are not sufficient.
Set operations
These are used when there is a need to exclude parts of the document before extraction.
Example
scope.xpath('set:difference(./descendant::*/@itemprop)')
Other Xpath extensions
has-class() returns True for nodes that match all of the given HTML classes and False for those that do not.
response.xpath('//p[has-class("foo")]')
Built-in Selectors reference
- Selector objects
Class – scrapy.selector.Selector(*args,**kwargs)
response – an HtmlResponse or an XmlResponse object.
text – a Unicode string or UTF-8 encoded text, for cases when a response is not available.
type – the selector type: "html" for HtmlResponse, "xml" for XmlResponse, or None.
xpath(query,namespaces=None,**kwargs) – SelectorList will be returned with flattened elements, where query is the Xpath query. Namespaces are optional and is nothing but dictionaries that are registered with register_namespace(prefix,uri)
css(query) – SelectorList is returned post application of the css where query containing the css selector is given as the argument.
get() – the matched content is returned, serialised as a single string.
attrib – Element’s attributes will be returned.
re(regex, replace_entities=True) – returns a list of Unicode strings obtained by applying the given regex. If replace_entities is True, character entity references are replaced by their corresponding characters.
re_first(regex, default=None, replace_entities=True) – returns the first matching Unicode string, or the default value if there is no match.
register_namespace(prefix,uri) – To register the namespaces
remove_namespaces() – Removes all namespaces
__bool__() – returns True if there is any real content selected.
getall() – Returns a list of matched content
- SelectorList objects –
xpath(query,namespaces=None,**kwargs) – SelectorList will be returned with flattened elements, where query is the Xpath query. Namespaces are optional and is nothing but dictionaries that are registered with register_namespace(prefix,uri)
css(query) – SelectorList is returned post application of the css where query containing the css selector is given as the argument.
get() – returns the result for the first element in the list
getall() – get() is called for each element in the list.
re(regex, replace_entities=True) – re() is called for each element in the list and a flattened list of results is returned.
re_first(regex, default=None, replace_entities=True) – returns the first result, or the default value if there is none.
attrib – the attributes of the first element in the list are returned.
ITEMS
Scraped data is usually returned as items, i.e. key-value pairs. Several different item types are supported.
Item Types
- Dictionaries – dict is convenient and familiar.
- Item Objects
Class – scrapy.item.Item([arg])
Item replicates the standard dict API and additionally allows defining field names, so that:
- KeyError is raised when an undefined field name is used.
- Item exporters export all declared fields by default, even if they have no value.
Item also allows field metadata to be defined, and trackref can track Item objects in order to find memory leaks.
Additional Item API members that can be used are copy(), deepcopy() and fields.
- Dataclass objects
Item classes with field names can be defined using dataclass(). A type and default value can be declared for each field, and dataclasses.field() can be used to define custom field metadata.
- attr.s objects
Item classes with field names can be defined with attr.s(). The type and default value of each field, as well as custom field metadata, can also be defined.
Working with Item Objects
Declaring Item subclasses
Simple class definition and Field objects can be used to declare Item subclasses.
Example:
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
Declaring Fields
Field objects are used to specify any kind of metadata for each field. Different components can use the Field object.
Class – scrapy.item.Field
Example
Creating items
product = Product(name='Desktop PC', price=1000)
Getting field values
product['price']
Setting field values
product['lala'] = 'test'
Accessing all populated values
product.keys()
product.items()
Copying items
product2 = product.copy()
product2 = product.deepcopy()
Extending Item Subclass
Items can also be extended by defining a subclass of the original item.
Metadata can be extended with previous metadata.
Supporting all Item Types
Class – itemadapter.ItemAdapter(item:Any)
A common interface to extract and set data on any supported item object.
itemadapter.is_item(obj: Any) -> bool
Returns True if the object belongs to one of the supported item types. A usage sketch follows below.
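A short sketch of the ItemAdapter interface (the Product item and its fields are illustrative assumptions):
import scrapy
from itemadapter import ItemAdapter, is_item

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = Product(name="Desktop PC", price=1000)

# the same read/write interface works for dicts, Items, dataclasses and attrs objects
adapter = ItemAdapter(item)
adapter["price"] = 900
print(adapter["name"], adapter["price"])
print(is_item(item))  # True for any supported item type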
ITEM LOADERS
This is used to populate the items.
Using Item Loaders to populate items
An item loader is instantiated through its __init__(), either with an item object or with an item class (in which case the item is created automatically). Selectors then load values into the item loader, and the loader joins and processes them using its processing functions.
add_xpath(), add_css() and add_value() are all used to collect data into an item loader. ItemLoader.load_item() then populates the item with the data collected through add_xpath(), add_css() and add_value(), as sketched below.
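A minimal sketch of this flow (the item, URL and selectors are assumptions for illustration):
import scrapy
from scrapy.loader import ItemLoader

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    tags = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/product/1"]  # hypothetical page

    def parse(self, response):
        loader = ItemLoader(item=Product(), response=response)
        # collect values with CSS, XPath or literal values
        loader.add_css("name", "h1.product-name::text")
        loader.add_xpath("price", "//span[@class='price']/text()")
        loader.add_value("tags", "scraped")
        # load_item() runs the processors and returns the populated item
        yield loader.load_item()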
Working with data class items
Data class items work with item loaders as well: values are collected with add_xpath(), add_css() and add_value() and loaded into the item automatically, and how values are passed to each field can be controlled through dataclasses.field() metadata.
Input and output processors
Each item loader has 1 input processor and 1 output processor.
The input processor processes the data as soon as it is collected through add_xpath(), add_css() or add_value(), and the result is stored inside the item loader.
ItemLoader.load_item() is then called to populate the item.
The output processor processes the collected data, and its result is the value assigned to the item field.
Declaring Item Loaders
Input processors are declared using _in suffix.
Output processors are declared using _out suffix.
Also can be declared using ItemLoader.default_input_processor and ItemLoader.default_output_processor.
Declaring Input and Output processors
Input/Output processors can also be declared using Item Field metadata, as shown in the sketch after the precedence list below.
Precedence order:
- Item loader field specific attributes
- Field metadata
- Item Loader defaults
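A sketch of both declaration styles (the processors and fields chosen are assumptions for illustration; the import path assumes a recent Scrapy, which bundles the itemloaders package):
import scrapy
from itemloaders.processors import MapCompose, TakeFirst, Join
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # input processor for the "name" field (the _in suffix)
    name_in = MapCompose(str.strip)
    # output processor for the "tags" field (the _out suffix)
    tags_out = Join(", ")

# the same processors can instead be declared in the Field metadata
class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )
    tags = scrapy.Field(output_processor=Join(", "))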
Item Loader Context
The Item Loader context is a dict of arbitrary key/values shared among all input and output processors; it can modify their behaviour and can be passed at any time.
A processor that declares a loader_context argument receives the currently active context (the Scrapy docs illustrate this with a parse_length processor that reads a unit from the context).
The context can be modified in the following ways:
- Modify the Item Loader context attribute
- On loader instantiation
- On item loader declaration
Item Loader Object
If no item is given in __init__(), one is instantiated from default_item_class.
- item – the item object being parsed by the item loader
- context – the currently active context
- default_item_class – used to instantiate the item when none is given in __init__()
- default_input_processor – the default input processor for fields that do not specify one
- default_output_processor – the default output processor for fields that do not specify one
- default_selector_class – the class used to construct the selector of the item loader; ignored if a selector is given in __init__()
- selector – the object from which data is extracted
- add_css(field_name, css, *processors, **kw) – the given CSS selector extracts a list of Unicode strings which are added to the field
- add_value(field_name, value, *processors, **kw) – the given value is passed through get_value() with the processors and kw, then through the field's input processor, and appended to the data collected for that field
- add_xpath(field_name, xpath, *processors, **kw) – the given XPath extracts a list of Unicode strings which are added to the field
- get_collected_values(field_name) – returns the values collected for the field
- get_css(css, *processors, **kw) – the CSS selector extracts a list of Unicode strings
- get_output_value(field_name) – returns the collected values parsed through the output processor, without populating the item
- get_value(value, *processors, **kw) – the given value is processed by the given processors
- get_xpath(xpath, *processors, **kw) – the XPath extracts a list of Unicode strings
- load_item() – used to populate and return the item
- nested_css(css, **context) – creates a nested loader with a CSS selector
- nested_xpath(xpath, **context) – creates a nested loader with an XPath selector
- replace_css(field_name, css, *processors, **kw) – like add_css() but replaces the collected data
- replace_value(field_name, value, *processors, **kw) – like add_value() but replaces the collected data
- replace_xpath(field_name, xpath, *processors, **kw) – like add_xpath() but replaces the collected data
Nested Loaders
Nested loaders can be used when values need to be parsed from a subsection of a document, as sketched below.
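A sketch of a nested loader inside a callback (the selectors are assumptions; a plain dict is used as the item so no field declarations are needed):
from scrapy.loader import ItemLoader

def parse_page(response):
    # a plain dict is used as the item here, so any field name is accepted
    loader = ItemLoader(item={}, response=response)
    loader.add_css("name", "h1::text")

    # nested loader scoped to the page footer; its selectors are relative to <footer>
    footer = loader.nested_css("footer")
    footer.add_css("social", "a.social::attr(href)")
    footer.add_css("email", "a.email::attr(href)")

    # values collected through the nested loader end up in the same item
    return loader.load_item()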
Reusing and Extending Item Loaders
Scrapy provides the support for python class inheritance and hence item loaders can be reused and extended.
SCRAPY SHELL
Scrapy shell can be used for testing and evaluating spiders before running the entire spider. Individual queries can be checked in this.
Configuring the shell
Scrapy works well with IPython and also supports bpython; IPython is recommended as it provides auto-completion and colourised output.
The shell can be chosen in the scrapy.cfg settings:
[settings]
shell = bpython
Launch the shell
To launch the shell
scrapy shell <url>
Using the shell
It is just a regular Python console with additional shortcuts.
Available shortcuts
- shelp() – prints a help list of the available objects and shortcuts
- fetch(url[, redirect=True]) – fetches a new response from the given URL
- fetch(request) – fetches a new response from the given request
- view(response) – opens the given response in the local browser
Available scrapy objects
- crawler – current crawler object
- spider – the spider that can handle the current URL
- request – the Request object of the last fetched page
- response – the Response object of the last fetched page
- settings – current scrapy settings
Invoking shell from spiders to inspect responses
To inspect a response from within a spider's callback, use the following, as sketched below:
scrapy.shell.inspect_response
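A minimal sketch of inspect_response inside a callback (the site and selector are illustrative):
import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = "debug"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        if not response.css("div.quote"):
            # drop into an interactive shell with this response preloaded
            inspect_response(response, self)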
ITEM PIPELINE
After items are scraped by a spider, the item pipeline processes them.
Typical uses of item pipelines:
- cleansing HTML data
- validating scraped data
- checking for duplicates
- storing scraped data
Writing item pipeline
Each item pipeline component is a Python class.
- process_item(self, item, spider) – called for every item pipeline component; it must either return an item object (or a Deferred) or raise DropItem. item is the scraped item and spider is the spider that scraped it.
- open_spider(self, spider) – called when the spider is opened.
- close_spider(self, spider) – called when the spider is closed.
- from_crawler(cls, crawler) – a class method that receives the crawler and returns a new instance of the pipeline.
Example application:
- price validation and dropping items with no prices
- write items to json file
- write items to mongodb
- take a screenshot of item
- duplicates filter
To activate a pipeline component, add it to the ITEM_PIPELINES setting, as in the sketch below.
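As an illustration, a sketch of the price-validation idea from the list above, plus its activation in settings.py (the module path assumes the "scrape" project created later in this tutorial):
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    """Drops items without a price and applies VAT to the rest (illustrative)."""

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("price") is None:
            raise DropItem(f"Missing price in {item!r}")
        adapter["price"] = adapter["price"] * self.vat_factor
        return item

# settings.py -- lower numbers run earlier (valid range 0-1000)
ITEM_PIPELINES = {
    "scrape.pipelines.PricePipeline": 300,
}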
FEED EXPORTS
Scrapy supports feed exports, i.e. exporting the scraped data to a storage backend in multiple formats.
Serialization formats
Item exporters are used for this process. The supported formats are :
| Serialization format | Feed setting format key | Exporter |
|---|---|---|
| JSON | json | JsonItemExporter |
| JSON lines | jsonlines | JsonLinesItemExporter |
| CSV | csv | CsvItemExporter |
| XML | xml | XmlItemExporter |
| Pickle | pickle | PickleItemExporter |
| Marshal | marshal | MarshalItemExporter |
Storages
Supported backend storage:
- Local filesystem
- FTP
- S3
- Google cloud storage
- Standard output
Storage URI parameters
%(time)s – replaced by a timestamp when the feed is being created
%(name)s – replaced by the spider name
Storage backends
| Storage backend | URI scheme | Example URI | Required external library |
|---|---|---|---|
| FTP | ftp | ftp://user:pass@ftp.example.com/path/to/export.csv | None |
| Amazon S3 | s3 | s3://mybucket/path/to/export.csv | botocore >= 1.4.87 |
| Google Cloud Storage | gs | gs://mybucket/path/to/export.csv | google-cloud-storage |
| Standard output | stdout | stdout: | None |
Notes:
- FTP supports two connection modes, active and passive; passive is the default. For an active connection, set FEED_STORAGE_FTP_ACTIVE = True.
- For Amazon S3, AWS credentials can be passed through AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and a custom ACL can be set with FEED_STORAGE_S3_ACL.
- For Google Cloud Storage, the project and Access Control List (ACL) can be set with GCS_PROJECT_ID and FEED_STORAGE_GCS_ACL.
Delayed File Delivery
The storage backends that use delayed file delivery are:
- FTP
- S3
- Google Cloud Storage
The file content is uploaded to the feed URI only once all of it has been collected.
To split the output into multiple files and start delivering items earlier, use FEED_EXPORT_BATCH_ITEM_COUNT.
Settings
Settings for feed exporters
- FEEDS (mandatory)
- FEED_EXPORT_ENCODING
- FEED_STORE_EMPTY
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
- FEED_STORAGES
- FEED_STORAGE_FTP_ACTIVE
- FEED_STORAGE_S3_ACL
- FEED_EXPORTERS
- FEED_EXPORT_BATCH_ITEM_COUNT
Feeds
Default : {}
FEEDS is a dictionary in which every key is a feed URI and every value is a nested dict of parameters for that feed.
| Accepted key | Fallback value |
|---|---|
| format | none (mandatory) |
| batch_item_count | FEED_EXPORT_BATCH_ITEM_COUNT |
| encoding | FEED_EXPORT_ENCODING |
| fields | FEED_EXPORT_FIELDS |
| indent | FEED_EXPORT_INDENT |
| item_export_kwargs | dict with keyword arguments for the corresponding item exporter class |
| overwrite | whether to overwrite the file if it already exists; the default depends on the backend (local filesystem: False, FTP: True, S3: True, standard output: False) |
| store_empty | FEED_STORE_EMPTY |
| uri_params | FEED_URI_PARAMS |
An example FEEDS declaration is sketched below.
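A sketch of a FEEDS declaration in settings.py (file names and options are illustrative):
# settings.py
FEEDS = {
    "items.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
        "overwrite": True,
    },
    "items.csv": {
        "format": "csv",
        "fields": ["title", "author", "date"],
    },
}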
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed. If unset or set to None, UTF-8 is used for everything except JSON output, which uses safe numeric encoding; set it to utf-8 if UTF-8 is wanted for JSON too.
FEED_EXPORT_FIELDS
Default: None
Use FEED_EXPORT_FIELDS to define the fields to export and their order.
When FEED_EXPORT_FIELDS is empty (the default), Scrapy uses the fields defined in the item objects yielded by the spider.
FEED_EXPORT_INDENT
Default: 0
If FEED_EXPORT_INDENT is a non-negative integer, array elements and object members are pretty-printed with that indent level.
If it is 0 (the default) or negative, each item is put on a new line.
None selects the most compact representation.
FEED_STORE_EMPTY
Default : False
FEED_STORAGES
Default : {}
FEED_STORAGE_FTP_ACTIVE
Default: False
Whether to use an active connection (True) or a passive one (False) when exporting feeds to an FTP server.
FEED_STORAGE_S3_ACL
Default: '' (empty string)
A string containing a custom ACL for feeds exported to Amazon S3.
FEED_STORAGES_BASE
Dict containing built-in feed storage.
FEED_EXPORTERS
Default: {}
Dict containing additional exporters
FEED_EXPORTERS_BASE
Dict containing the built-in feed exporters.
FEED_EXPORT_BATCH_ITEM_COUNT
Default: 0
If set to a number greater than 0, Scrapy generates multiple output files, each storing up to that number of items.
FEED_URI_PARAMS
Default: None
A string with the import path of a function that returns the parameters to apply to the feed URI.
REQUESTS AND RESPONSES
Scrapy uses Request and Response objects for crawling web sites.
Request Objects
PARAMETERS
- url – the URL of the request
- callback – the function that will be called with the response to this request
- method – the HTTP method of the request. Default: 'GET'
- meta – a dict of initial values for Request.meta
- body – the request body. If not given, an empty bytes object is stored
- headers – the headers of the request
- cookies – the request cookies
- encoding – the encoding of the request
- priority – the priority of the request
- dont_filter – indicates that this request should not be filtered by the duplicates filter
- errback – a function that will be called if an exception is raised while processing the request
- flags – flags sent to the request, which can be used for logging
- cb_kwargs – a dict of arbitrary data passed as keyword arguments to the callback
Passing additional data to callback functions
Request.cb_kwargs can be used to pass additional data to the callback function; this data can in turn be passed on to a second callback later if needed.
Using errbacks to catch exceptions in request processing
The errback receives a (Twisted) Failure as its first parameter, which can be used to track and handle errors.
Additional data can be accessed through failure.request.cb_kwargs, as sketched below.
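A sketch of a request that uses both cb_kwargs and an errback (the URL is a hypothetical placeholder):
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products_requests"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",   # hypothetical URL
            callback=self.parse_listing,
            errback=self.handle_error,
            cb_kwargs={"page": 1},             # extra data for the callback
        )

    def parse_listing(self, response, page):
        self.logger.info("Parsed page %d of %s", page, response.url)

    def handle_error(self, failure):
        # failure.request.cb_kwargs is still available here
        self.logger.error("Request failed: %s", failure.request.url)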
Request.meta special keys
Special keys:
- dont_redirect
- dont_retry
- handle_httpstatus_list
- handle_httpstatus_all
- dont_merge_cookies
- cookiejar
- dont_cache
- redirect_reasons
- redirect_urls
- bindaddress
- dont_obey_robotstxt
- download_timeout
- download_maxsize
- download_latency
- download_fail_on_dataloss
- proxy
- ftp_user
- ftp_password
- referrer_policy
- max_retry_times
bindaddress – the outgoing IP address to use to perform the request
download_timeout – the amount of time (in seconds) the downloader will wait before timing out
download_latency – the amount of time spent to fetch the response
download_fail_on_dataloss – whether or not to fail on broken responses
max_retry_times – the maximum number of retries per request
Stopping the download of response
A StopDownload exception can be raised to stop the download of a response.
Request subclasses
List of request subclasses
- FormRequest Objects
Parameters:
- formdata
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, …])
Parameters:
- response
- formname
- formid
- formxpath
- formcss
- formnumber
- formdata
- clickdata
- dont_click
Examples:
FormRequest to send data via HTTP POST
To simulate user login
- JsonRequest
Parameters:
- data
- dumps_kwargs
Response Objects
These are HTTP responses.
Parameters:
- url
- status
- headers
- body
- flags
- request
- certificate
- ip_address
- cb_kwargs
- copy()
- replace ([url, status, headers, body, request, flags, cls])
- urljoin(url)
- follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
- follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
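A sketch of the follow() method listed above, used inside a callback (the site and selectors are illustrative); follow() accepts relative URLs, Link objects and <a> selectors, and follow_all() does the same for a list of links:
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("span.text::text").getall():
            yield {"text": text}

        # follow the "next" pagination link, if any
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
        # equivalently: yield from response.follow_all(css="li.next a", callback=self.parse)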
Response subclasses
List of subclasses:
- TextResponse objects
- HtmlResponse objects
- XmlResponse objects
LINK EXTRACTORS
Link extractors extract links from responses.
LxmlLinkExtractor.extract_links returns a list of matching Link objects.
Link Extractor Reference
The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor
LxmlLinkExtractor
Parameters:
- allow
- deny
- allow_domains
- deny_domains
- deny_extensions
- restrict_xpaths
- restrict_css
- restrict_text
- tags
- attrs
- canonicalize
- unique
- process_value
- strip
- extract_links(response) – returns the list of Link objects extracted from the response (see the sketch below)
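A sketch of LxmlLinkExtractor in use (the URL patterns are illustrative assumptions):
from scrapy.linkextractors import LinkExtractor  # alias for LxmlLinkExtractor

# keep /chart/ links on imdb.com and drop anything under /help/
extractor = LinkExtractor(
    allow=r"/chart/",
    deny=r"/help/",
    allow_domains=["imdb.com"],
    unique=True,
)

# inside a spider callback:
# for link in extractor.extract_links(response):
#     yield response.follow(link, callback=self.parse_chart)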
Link
Link objects represent the links extracted by a link extractor.
Parameters:
- url
- text
- fragment
- nofollow
SETTINGS
Scrapy settings can be adjusted as needed
Designating the setting
The SCRAPY_SETTINGS_MODULE environment variable tells Scrapy which settings module to use.
Populating the settings
Settings can be populated in the following precedence :
- Command line options – "-s" (or "--set") is used to override settings
- Settings per-spider – This can be defined through “custom_settings” attribute
- Project settings module – This can be changed in the “settings.py” file.
- Default settings per-command – “default_settings” is used to define this
- Default global settings – scrapy.settings.default_settings is used to set this.
Import Paths and Classes
When a setting references a callable object to be imported by Scrapy, the value can be given as:
- a string containing the import path of the object, or
- the object itself.
How to access settings
Settings can be accessed through self.settings in a spider, or through scrapy.crawler.Crawler.settings on the Crawler passed to from_crawler(), as sketched below.
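A sketch of both access points (the setting names used are standard Scrapy settings):
import scrapy

class SettingsAwareSpider(scrapy.Spider):
    name = "settings_aware"
    custom_settings = {"DOWNLOAD_DELAY": 2}

    def parse(self, response):
        # self.settings is available once the spider has been initialised
        self.logger.info("User agent: %s", self.settings.get("USER_AGENT"))
        self.logger.info("Delay: %s", self.settings.getfloat("DOWNLOAD_DELAY"))

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # crawler.settings is the same Settings object
        print("Bot name:", crawler.settings.get("BOT_NAME"))
        return super().from_crawler(crawler, *args, **kwargs)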
Rationale for setting names
Setting names are usually prefixed with the name of the component they configure.
Built-in settings reference
The built-in settings include: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL, AWS_USE_SSL, AWS_VERIFY, AWS_REGION_NAME, ASYNCIO_EVENT_LOOP, BOT_NAME, CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DEFAULT_ITEM_CLASS, DEFAULT_REQUEST_HEADERS, DEPTH_LIMIT, DEPTH_PRIORITY, DEPTH_STATS_VERBOSE, DNSCACHE_ENABLED, DNSCACHE_SIZE, DNS_RESOLVER, DOWNLOADER, DOWNLOADER_HTTPCLIENTFACTORY, DOWNLOADER_CLIENTCONTEXTFACTORY, DOWNLOADER_CLIENT_TLS_CIPHERS, DOWNLOADER_CLIENT_TLS_METHOD, DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING, DOWNLOADER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES_BASE, DOWNLOADER_STATS, DOWNLOAD_DELAY, DOWNLOAD_HANDLERS, DOWNLOAD_HANDLERS_BASE, DOWNLOAD_TIMEOUT, DOWNLOAD_MAXSIZE, DOWNLOAD_WARNSIZE, DOWNLOAD_FAIL_ON_DATALOSS, DUPEFILTER_CLASS, DUPEFILTER_DEBUG, EDITOR, EXTENSIONS, EXTENSIONS_BASE, FEED_TEMPDIR, FEED_STORAGE_GCS_ACL, FTP_PASSIVE_MODE, FTP_PASSWORD, FTP_USER, GCS_PROJECT_ID, ITEM_PIPELINES, ITEM_PIPELINES_BASE, LOG_ENABLED, LOG_FILE, LOG_FORMAT, LOG_DATEFORMAT, LOG_FORMATTER, LOG_LEVEL, LOG_STDOUT, LOG_SHORT_NAMES, LOGSTATS_INTERVAL, MEMDEBUG_ENABLED, MEMDEBUG_NOTIFY, MEMUSAGE_ENABLED, MEMUSAGE_LIMIT_MB, MEMUSAGE_CHECK_INTERVAL_SECONDS, MEMUSAGE_WARNING_MB, NEWSPIDER_MODULE, RANDOMIZE_DOWNLOAD_DELAY, REACTOR_THREADPOOL_MAXSIZE, REDIRECT_PRIORITY_ADJUST, RETRY_PRIORITY_ADJUST, ROBOTSTXT_OBEY, ROBOTSTXT_PARSER, ROBOTSTXT_USER_AGENT, SCHEDULER, SCHEDULER_DEBUG, SCHEDULER_DISK_QUEUE, SCHEDULER_MEMORY_QUEUE, SCHEDULER_PRIORITY_QUEUE, SCRAPER_SLOT_MAX_ACTIVE_SIZE, SPIDER_CONTRACTS, SPIDER_CONTRACTS_BASE, SPIDER_LOADER_CLASS, SPIDER_LOADER_WARN_ONLY, SPIDER_MIDDLEWARES, SPIDER_MIDDLEWARES_BASE, SPIDER_MODULES, STATS_CLASS, STATS_DUMP, STATSMAILER_RCPTS, TELNETCONSOLE_ENABLED, TEMPLATES_DIR, TWISTED_REACTOR, URLLENGTH_LIMIT, USER_AGENT
EXCEPTIONS
Built-in Exceptions reference
- CloseSpider – raised when the spider needs to be closed
- DontCloseSpider – raised to stop the spider from being closed
- DropItem – raised by item pipeline stages to stop processing an item
- IgnoreRequest – raised to indicate that a request should be ignored
- NotConfigured – raised by extensions, item pipelines, downloader middlewares or spider middlewares to indicate that the component will remain disabled
- NotSupported – indicates that a feature is not supported
- StopDownload – indicates that nothing further should be downloaded for a response
A sample tutorial to try
1. Open command prompt and traverse to the folder where you want to store the scraped data.
2. Let’s create the project under the name “scrape”
Type the following in the conda shell
scrapy startproject scrape
The above command will create a folder with the name scrape containing a scrape folder and scrapy.cfg file.
3. Traverse inside this project, scrape.
4. Go inside the folder called spiders and then create a file called "project.py"
Type the following inside it:
import scrapy

#scrapy.Spider needs to be extended
class scrape(scrapy.Spider):
    #unique name that identifies the spider
    name = "posts"
    start_urls = ['https://blog.scrapinghub.com']

    #takes in response to process downloaded responses.
    def parse(self, response):
        #for crawling each and every link
        for post in response.css('div.post-item'):
            yield {
                #extracts title
                'title': post.css('.post-header h2 a::text')[0].get(),
                #extracts date
                'date': post.css('.post-header a::text')[1].get(),
                #extracts author name
                'author': post.css('.post-header a::text')[2].get()
            }

        #goes to next page
        next_page = response.css('a.next-posts-link::attr(href)').get()
        #if there is a next page then this parse method gets called again
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
5. Save the file
6. In the command prompt, run the spider with the following command:
scrapy crawl posts
7. All the links get crawled and, at the same time, the title, date and author get extracted for each post.
This brings us to the end of the Scrapy Tutorial. We hope that you were able to gain a comprehensive understanding of the same. If you wish to learn more such skills, check out the pool of Free Online Courses offered by Great Learning Academy.