- Web Scraper
- Web Crawler
- Scrapy
- Scrapy Installation
- Scrapy Packages
- Scrapy File Structure
- SCRAPY COMMAND LINE TOOL
- Global Commands
- Project-only Commands
- SPIDERS
- Scrapy.Spider
- Spider Arguments
- Generic Spiders
- Selectors
- ITEMS
- Item Types
- Working with Item Objects
- ITEM LOADERS
- SCRAPY SHELL
- ITEM PIPELINE
- FEED EXPORTS
- Settings
- Feeds
- REQUESTS AND RESPONSES
- Response subclasses
- LINK EXTRACTORS
- Link Extractor Reference
- SETTINGS
- EXCEPTIONS
Web Scraper
A web scraper is a tool used to extract data from a website.
It involves the following process:
- Figure out the target website
- Get the URL of the pages from which the data needs to be extracted.
- Obtain the HTML/CSS/JS of those pages.
- Find the locators, such as XPath expressions, CSS selectors or regexes, for the data that needs to be extracted.
- Save the data in a structured format such as JSON or CSV file.
Web Crawler
A web crawler is used to collect the URLs of websites and their child pages. The crawler collects all the links associated with a website, records (or copies) them, and stores them on the search engine's servers as a search index. The search engine uses this index to find and rank pages, and the pages are then displayed to the user based on that ranking.
The web crawler can also be called a web spider, spider bot, crawler or web bot.
Scrapy
Scrapy does the work of both a web crawler and a web scraper. Hence, Scrapy is quite handy for crawling a site, extracting data from it and storing the data in a structured format. Scrapy can also extract data from APIs.
Scrapy provides:
- Built-in support for selecting and extracting data using locators such as XPath expressions and CSS selectors, as well as regular expressions.
- An interactive shell console (the Scrapy shell) that can be used to try out expressions without running the entire spider. It is handy for debugging, writing and checking Scrapy code before the final spider is executed.
- Facility to store the data in structured formats such as:
- JSON
- JSON Lines
- CSV
- XML
- Pickle
- Marshal
- Facility to store the extracted data in:
- Local filesystems
- FTP
- S3
- Google Cloud Storage
- Standard output
- Facility to use APIs and signals (functions that are called when a particular event occurs)
- Facility to handle:
- HTTP features
- User-agent spoofing
- Robots.txt
- Crawl depth restriction
- Telnet console – a Python console that runs inside the Scrapy process to introspect and debug a running crawler.
- And more
Scrapy Installation
Scrapy can be installed by:
Using Anaconda / Miniconda.
Type the following command in the Conda shell:
conda install -c conda-forge scrapy
Alternatively, you could do the following.
pip install Scrapy
Scrapy Packages
- lxml – XML and HTML parser
- parsel – HTML/XML data extraction library that lies on top of lxml
- w3lib – multi-purpose helper for dealing with URLs and web page encodings
- twisted – asynchronous networking framework
- cryptography and pyOpenSSL – for network-level security needs.
Scrapy File Structure
A scrapy project will have two parts.
- Configuration file – scrapy.cfg sits in the project root directory and points to the project's settings module. Scrapy looks for the cfg file in the following places:
- System-wide – /etc/scrapy.cfg or c:\scrapy\scrapy.cfg
- Global (user-level) – ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME)
- Scrapy project root – scrapy.cfg
Settings from these files are merged with the following precedence (highest first):
- Project-wide settings (scrapy.cfg in the project root)
- User-level (global) values
- System-wide defaults
Environment variables through which Scrapy can be controlled are :
- SCRAPY_SETTINGS_MODULE
- SCRAPY_PROJECT
- SCRAPY_PYTHON_SHELL
- A project folder – It contains the following files:
- __init__.py
- items.py
- middlewares.py
- pipelines.py
- settings.py
- spiders/ – the folder where the spiders we create get stored.
A project’s configuration file can be shared between multiple projects having its own settings module.
SCRAPY COMMAND LINE TOOL
The Scrapy command line provides many commands. Those commands can be classified into two groups.
- Global commands
- Project-only commands
To see all the commands available type the following in the shell:
scrapy -h
To see the help for a particular command, run scrapy <command> -h. The general syntax of a command is:
scrapy <command> [options] [args]
Global Commands
These are the commands that work even without an active Scrapy project.
- startproject
scrapy startproject <project_name> [project_dir]
Usage: It is used to create a project with the specified project name under the specified project directory. If the directory is not mentioned, then the project directory will be the same as the project name.
Example:
scrapy startproject tutorial
This will create a directory named "tutorial" containing the project (also named "tutorial") and its configuration file.
- genspider
scrapy genspider [-t template] <name> <domain>
Usage: This is used to create a new spider in the current folder. It is best practice to create the spider after traversing inside the project's spiders folder. The spider's name is given by the <name> parameter, and <domain> is used to generate the spider's start_urls and allowed_domains attributes.
Example:
scrapy genspider tuts https://www.imdb.com/chart/top/
This will create a spider named tuts in tuts.py, with the domain of the given URL as the allowed domain. Use this command after traversing into the spiders folder.
- settings
scrapy settings [options]
Usage: It shows the Scrapy default settings when run outside a project and the project's settings when run inside one.
The following options can be used with the settings:
--help show this help message and exit
--get=SETTING print raw setting value
--getbool=SETTING print setting value, interpreted as a Boolean
--getint=SETTING print setting value, interpreted as an integer
--getfloat=SETTING print setting value, interpreted as a float
--getlist=SETTING print setting value, interpreted as a list
--logfile=FILE log file; if omitted, stderr will be used
--loglevel=LEVEL log level
--nolog disable logging completely
--profile=FILE write python cProfile stats to file
--pidfile=FILE write process ID to file
--set=NAME=VALUE set/override a setting
--pdb enable pdb on failure
Example:
scrapy crawl tuts -s LOG_FILE=scrapy.log
- runspider
scrapy runspider <spider.py>
Usage: To run a self-contained spider without having to create a project.
Example:
scrapy runspider tuts.py
- shell
scrapy shell [url]
Usage: Starts the Scrapy shell for the given URL (if any).
Options:
--spider=SPIDER (the mentioned spider will be used and auto-detection gets bypassed)
-c code (evaluates the code in the shell, prints the result and exits)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy shell https://www.imdb.com/chart/top/
Scrapy will start the shell on https://www.imdb.com/chart/top/ page.
- fetch
scrapy fetch <url>
Usage:
The Scrapy downloader downloads the given URL and writes the content to standard output.
Options:
--spider=SPIDER (the mentioned spider will be used and auto-detection gets bypassed)
--headers (the response's HTTP headers will be shown instead of the body)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy fetch https://www.imdb.com/chart/top/
Scrapy will download the https://www.imdb.com/chart/top/ page.
- view
scrapy view <url>
Usage:
Scrapy will open the mentioned URL in the default browser. This helps to view the page from the spider's perspective, i.e. as the spider "sees" it.
Options:
--spider=SPIDER (the mentioned spider will be used, and auto-detection gets bypassed)
--no-redirect (does not follow HTTP 3xx redirects)
Example:
scrapy view https://www.imdb.com/chart/top/
Scrapy will open https://www.imdb.com/chart/top/ page in the default browser.
- version
Syntax: scrapy version [-v]
Usage:
Prints the Scrapy version. With -v it also prints Python, Twisted and platform information.
Project-only Commands
These are the commands that only work inside an active Scrapy project.
- crawl
Syntax:
scrapy crawl <spider>
Usage:
This will start the crawling.
Example:
scrapy crawl tuts
Scrapy will crawl the domains mentioned in the spider.
- check
Syntax:
scrapy check [-l] <spider>
Usage:
Runs contract checks, i.e. checks what the spider's callbacks return.
Example:
scrapy check tuts
Scrapy will check the spider's contracts and report the result (e.g. "OK").
- list
Syntax:
scrapy list
Usage:
Lists the names of all the spiders present in the project.
Example:
scrapy list
Scrapy will return all the spiders that are there in the project
- edit
Syntax:
scrapy edit <spider>
Usage:
This command is used to edit the spider. The editor set in the EDITOR environment variable (or the EDITOR setting) will open up. If it is not set, IDLE opens on Windows and vi opens on UNIX. The developer is not restricted to this editor and can use any other editor.
Example:
scrapy edit tuts
Scrapy will open tuts in the editor.
- parse
Syntax:
scrapy parse <url> [options]
Usage:
Scrapy fetches the given URL and parses it with the spider that handles it. The method passed with --callback is used; if none is given, parse() is used.
Options:
--spider=SPIDER (the mentioned spider will be used, and auto-detection gets bypassed)
-a NAME=VALUE (to set a spider argument)
--callback (spider method to use as callback for parsing)
--cb_kwargs (additional keyword arguments for the callback)
--meta (request meta passed to the callback method)
--pipelines (to process items through pipelines)
--rules (use CrawlSpider rules to discover the callback)
--noitems (hides scraped items)
--nocolour (removes colours from the output)
--nolinks (hides extracted links)
--depth (the depth level to which the requests need to be followed recursively)
--verbose (displays information for each depth level)
--output (stores the output in a file)
Example:
scrapy parse https://www.imdb.com/chart/top/
Scrapy will parse the https://www.imdb.com/chart/top/ page.
- bench
Syntax: scrapy bench
Usage:
To run a benchmark test.
Custom commands can be added through the COMMANDS_MODULE setting, whose value is the module that contains them, for example:
COMMANDS_MODULE = 'mybot.commands'
The scrapy.commands entry point in setup.py can also be used to add commands from an external library.
SPIDERS
The spiders folder contains the classes that define how a site is crawled and how data is scraped from it. They can be customised as per the requirement.
SPIDER SCRAPING CYCLE
There are different types of Spiders available for various purposes.
Scrapy.Spider
Class: scrapy.spiders.Spider
It is the simplest spider. It provides a default start_requests() method that sends requests for the URLs listed in start_urls and calls parse() for each resulting response.
name – the name of the spider. It should be unique, although more than one instance of the same spider can be instantiated. Best practice is to name the spider after the website being crawled.
allowed_domains – only the domains mentioned in this list are allowed to be crawled. To crawl a domain that is not in the list, OffsiteMiddleware has to be disabled.
start_urls – the list of URLs from which the spider starts crawling.
custom_settings – settings to override for this spider. It must be defined as a class attribute, since the settings are applied before the spider is instantiated.
crawler – the from_crawler() method sets this attribute. It links the crawler object with the spider object.
settings – the settings with which the spider/project is run.
logger – a Python logger created with the spider's name; it carries all of the spider's log messages.
from_crawler(crawler, *args, **kwargs) – the class method Scrapy uses to create spiders. It sets the crawler and settings attributes.
A. crawler – the object that binds the spider to the crawler
B. args – arguments that are passed to __init__()
C. kwargs – keyword arguments that are passed to __init__()
start_requests() – returns the first requests to crawl. It is called only once and, by default, generates a Request for each URL in start_urls.
parse(response) – the default callback; it receives the response and returns scraped data and/or further requests.
log(message, level, component) – sends a log message through the spider's logger.
closed(reason) – called when the spider closes; it is a shortcut for connecting to the spider_closed signal via signals.connect().
A minimal spider putting these attributes and methods together is sketched below.
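To illustrate, here is a minimal sketch of a scrapy.Spider subclass (the target site and CSS selectors are illustrative assumptions, not part of the original text):
import scrapy

class QuotesSpider(scrapy.Spider):
    # unique name, used with "scrapy crawl quotes"
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # default callback: called once for each downloaded response
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        self.logger.info("Parsed %s", response.url)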
Spider Arguments
Arguments can be given to spiders. The arguments are passed through the crawl command using the -a option.
The __init__() method takes these arguments and applies them as spider attributes.
Example:
scrapy crawl tuts -a category=electronics
__init__() should accept category as an argument for this command to work, as sketched below.
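A minimal sketch of a spider that accepts this argument (the URL pattern is a hypothetical placeholder used only for illustration):
import scrapy

class TutsSpider(scrapy.Spider):
    name = "tuts"

    def __init__(self, category=None, *args, **kwargs):
        # "scrapy crawl tuts -a category=electronics" ends up here
        super().__init__(*args, **kwargs)
        # hypothetical URL pattern, only for illustration
        self.start_urls = [f"https://example.com/categories/{category}"]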
Generic Spiders
These spiders can be used for rule-based crawling, crawling Sitemaps, or parsing XML/CSV feed.
CrawlSpider
Class – scrapy.spiders.CrawlSpider
This is the spider that crawls based on rules that can be custom written.
Attributes:
- rules – a list of Rule objects that define the crawling behaviour.
- parse_start_url(response, **kwargs) – called for each response produced for the start URLs. It must return an item object, a Request, or an iterable containing either.
Crawling Rules:
class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None,
process_links=None, process_request=None, errback=None)
link_extractor – a Link Extractor object defining how links are to be extracted from each crawled page. A Request object is generated for each extracted link.
callback – called for each extracted link. It receives a response as its first argument and must return an iterable of items and/or Requests.
cb_kwargs – a dict of keyword arguments for the callback function.
follow – a Boolean specifying whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links – called for the list of links extracted from each response.
process_request – called for each request generated by the rule.
errback – a callable to be called if an exception is raised while processing a request generated by the rule.
A CrawlSpider wired up with such rules is sketched below.
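A minimal sketch (the site, URL patterns and selectors are illustrative assumptions):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    rules = (
        # follow pagination links; no callback, so follow defaults to True
        Rule(LinkExtractor(allow=r"/page/\d+/")),
        # parse author pages; a callback is given, so follow defaults to False
        Rule(LinkExtractor(allow=r"/author/"), callback="parse_author"),
    )

    def parse_author(self, response):
        yield {
            "name": response.css("h3.author-title::text").get(),
            "born": response.css("span.author-born-date::text").get(),
        }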
XMLFeedSpider
Class – scrapy.spiders.XMLFeedSpider
It is used to parse XML feeds by iterating over their nodes with a chosen iterator (iternodes, xml or html); iternodes is recommended for performance reasons.
The following class attributes must be defined to set the iterator and the tag name:
- iterator – the iterator to use: iternodes, html or xml. The default is iternodes.
- itertag – the name of the node (tag) to iterate over.
- namespaces – a list of (prefix, uri) tuples defining the namespaces available in the document processed by this spider.
The following overridable methods are available as well:
- adapt_response(response) – receives the response as soon as it arrives and can modify its body before it is parsed; it must return a response.
- parse_node(response, selector) – must be overridden for the spider to work. It is called for each node matching itertag and should return an item object, a Request, or an iterable containing either.
- process_results(response, results) – performs any last-minute processing of the results if required.
A short sketch of an XMLFeedSpider follows.
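A minimal XMLFeedSpider sketch (the feed URL and tag names are assumptions for illustration):
from scrapy.spiders import XMLFeedSpider

class NewsFeedSpider(XMLFeedSpider):
    name = "newsfeed"
    start_urls = ["https://example.com/feed.xml"]  # hypothetical feed URL
    iterator = "iternodes"  # the default, recommended for performance
    itertag = "item"        # iterate over each <item> node

    def parse_node(self, response, node):
        # "node" is a selector scoped to the current <item>
        yield {
            "title": node.xpath("title/text()").get(),
            "link": node.xpath("link/text()").get(),
        }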
CSVFeedSpider
Class – scrapy.spiders.CSVFeedSpider
This spider iterates over the rows of a CSV feed; parse_row() is called for each row.
delimiter: the separator character for the fields of each row. The default is ",".
quotechar: the enclosure character for each field. The default is '"'.
headers: the column names of the CSV file.
parse_row(response, row): receives the response and a dict (one per row) with a key for each header of the CSV file. This spider also allows adapt_response and process_results to be overridden for pre- and post-processing.
SitemapSpider
Class – scrapy.spiders.SitemapSpider
It crawls a site by discovering sitemap URLs, for example from robots.txt.
- sitemap_urls – a list of URLs pointing to the sitemaps (or to robots.txt) whose URLs need to be crawled.
- sitemap_rules – a list of (regex, callback) tuples; URLs matching the regex are processed with the given callback.
- sitemap_follow – a list of regexes of sitemaps that should be followed.
- sitemap_alternate_links – specifies whether alternate links for a URL should be followed. Disabled by default.
- sitemap_filter(entries) – a filter function that can be overridden to select sitemap entries based on their attributes. A minimal SitemapSpider is sketched below.
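A minimal sketch (the URLs, rules and selectors are assumptions for illustration):
from scrapy.spiders import SitemapSpider

class StoreSpider(SitemapSpider):
    name = "store"
    # hypothetical robots.txt; Scrapy discovers the sitemaps listed in it
    sitemap_urls = ["https://example.com/robots.txt"]
    # route matching URLs to different callbacks
    sitemap_rules = [
        (r"/product/", "parse_product"),
        (r"/category/", "parse_category"),
    ]

    def parse_product(self, response):
        yield {"name": response.css("h1::text").get(), "url": response.url}

    def parse_category(self, response):
        self.logger.info("Category page: %s", response.url)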
Selectors
Scrapy uses CSS selectors or XPath expressions to select HTML elements.
Querying can be done using response.css() or response.xpath().
Example:
response.css("div::text").get()
Selector() can also be used if needed directly.
.get() or .getall() is used along with the response to extract the data.
.get() – will give a single result. None if nothing gets matched.
.getall() – will give a list of matches.
CSS pseudo-elements can be used to select text or attribute-nodes.
.get() has an alias .extract_first().
.get() returns None if no match is found. A default value can be given to replace None with some other value with the help of .get(default='value').
.attrib[] can also be used to query via attributes of a tag for CSS selectors.
Example:
response.css('a').attrib['href']
Non-standard pseudo-elements that are essential for web scraping are:
- ::text – selects the text nodes
- ::attr(name) – selects attributes values.
Adding a * in front of ::text (i.e. *::text) selects the text nodes of the current element and all of its descendants.
*::text
foo::text returns no result if the element exists but contains no text value; it can be used to check for that case.
Nesting Selectors
Selection methods return selectors of the same type, so further selections can be made on them; this is called nesting of selectors.
Example:
val = response.css("div::text")
val.getall()
Selecting element attributes
Attributes of an element can be obtained using XPath or CSS selectors.
XPath – the advantage of XPath is that @attributes can be used both to select values and as a filter, and it is a standard feature.
Example: response.xpath("//a/@href").get()
CSS selector: ::attr(…) can be used to get attribute values as well.
Example: response.css('img::attr(src)').get()
Or the .attrib property can also be used.
Example: response.css('img').attrib['src']
Using Selectors with regular expressions
.re() can be used to extract data along with Xpath or with CSS.
Example: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
.re_first() can also be used to extract the first element.
Some equivalents
| Selection (deprecated) | Current equivalent |
|---|---|
| SelectorList.extract_first() | SelectorList.get() |
| SelectorList.extract() | SelectorList.getall() |
| Selector.extract() | Selector.get() |
Selector.getall() – will return a list.
.get() returns single output
.getall() – return a list
.extract() may return either a single output or a list. To always get a single result, call extract_first() (or .get()) instead.
Working with relative XPATHS
Absolute XPath – an XPath that starts with '/' is absolute, even when it is applied to a nested selector.
The proper way to make it relative is to put a '.' in front of the '/'.
Example:
divs = response.xpath("//div")
for p in divs.xpath(".//p"):
    print(p.get())
or
for p in divs.xpath("p"):
    print(p.get())
More details on XPath can be found at https://www.w3.org/TR/xpath/all/#location-paths
Querying elements by class: use CSS
If this is done with XPath alone, the resulting expression ends up being quite complicated.
If '@class = "someclass"' is used, the output might have missing elements (elements that also carry other classes are skipped).
If 'contains(@class, "someclass")' is used, then more elements than needed might come up in the result.
As Scrapy allows chaining of selectors, a CSS selector can be used to select the class element, and XPath can then be chained to it to select the required elements.
Example:
response.css(".shout").xpath('./div').getall()
A '.' should be put in front of the '/' in the XPath that follows the CSS selector.
Difference between //node[1] and (//node)[1]
(//node)[1] – selects all the nodes in the document first, and then the first element of that list is selected.
//node[1] – selects the first node occurring under each of its respective parents. The snippet below illustrates the difference.
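A small sketch using parsel (the selector library bundled with Scrapy); the HTML fragment is made up purely for illustration:
from parsel import Selector

html = """
<ul><li>1</li><li>2</li></ul>
<ul><li>3</li><li>4</li></ul>
"""
sel = Selector(text=html)

# (//li)[1]: gather every <li> in the document, then keep the first -> ['1']
print(sel.xpath("(//li)[1]/text()").getall())

# //li[1]: the first <li> under each parent <ul> -> ['1', '3']
print(sel.xpath("//li[1]/text()").getall())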
Text nodes under condition
When .//text() is passed to contains() or starts-with(), it yields a node-set of text elements; when such a node-set is converted to a string, only the first element is used, so the condition may not match as expected. Hence it is better to use '.' alone instead of './/text()'.
Variables in Xpath expressions
$somevariable is used as a variable reference; its value is passed to the query and substituted at evaluation time.
Example:
response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
More examples on https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions
Removing namespaces
The selector.remove_namespaces() method can be used to remove all namespaces, so that the elements of that HTML/XML file can be queried by their plain tag names.
Example:
response.selector.remove_namespaces()
Namespaces are not removed by default by Scrapy because the namespaces of a page are sometimes needed and removing them requires extra work, so this method is called only when needed.
Using EXSLT extensions
| Prefix | Namespace | Usage |
|---|---|---|
| re | http://exslt.org/regular-expressions | regular expressions |
| set | http://exslt.org/sets | set manipulation |
Regular Expressions
The re:test() function is used when starts-with() and contains() are not sufficient.
Set operations
These are used when there is a need to exclude parts of the document before extraction.
Example
scope.xpath('set:difference(./descendant::*/@itemprop)')
Other Xpath extensions
has-class() returns True for nodes that match all of the given HTML classes and False for those that do not.
response.xpath('//p[has-class("foo")]')
Built-in Selectors reference
- Selector objects
Class – scrapy.selector.Selector(*args,**kwargs)
response – an HtmlResponse or an XmlResponse object.
text – a Unicode string or UTF-8 encoded text, for cases when a response is not available.
type – the selector type: "html" for HtmlResponse, "xml" for XmlResponse, or None.
xpath(query,namespaces=None,**kwargs) – SelectorList will be returned with flattened elements, where query is the Xpath query. Namespaces are optional and is nothing but dictionaries that are registered with register_namespace(prefix,uri)
css(query) – SelectorList is returned post application of the css where query containing the css selector is given as the argument.
get() – the matched content is returned, serialised as a single string.
attrib – Element’s attributes will be returned.
re(regex, replace_entities=True) – returns a list of Unicode strings obtained by applying the given regex. If replace_entities is True, character entity references are replaced by their corresponding characters.
re_first(regex, default=None, replace_entities=True) – returns the first matching Unicode string, or the default value if there is no match.
register_namespace(prefix,uri) – To register the namespaces
remove_namespaces() – Removes all namespaces
__bool__() – returns True if there is any real content selected.
getall() – Returns a list of matched content
- SelectorList objects –
xpath(query,namespaces=None,**kwargs) – SelectorList will be returned with flattened elements, where query is the Xpath query. Namespaces are optional and is nothing but dictionaries that are registered with register_namespace(prefix,uri)
css(query) – SelectorList is returned post application of the css where query containing the css selector is given as the argument.
get() – returns the result for the first element in the list
getall() – get() is called for each element in the list.
re(regex, replace_entities=True) – re() is called for each element in the list and a flattened list of results is returned.
re_first(regex, default=None, replace_entities=True) – returns the first result, or the default value if there is none.
attrib – the attributes of the first element in the list are returned.
ITEMS
Scraped data is usually returned as items, i.e. key-value pairs. Several different item types are supported.
Item Types
- Dictionaries – dict is convenient and familiar.
- Item Objects
Class – scrapy.item.Item([arg])
Item replicates the standard dict API and additionally allows defining field names, so that:
- KeyError is raised when an undefined field name is used.
- Item exporters export all declared fields by default, even if they have no value.
Item also allows field metadata to be defined, and trackref can track Item objects in order to find memory leaks.
Additional Item API members that can be used are copy(), deepcopy() and fields.
- Dataclass objects
Item classes with field names can be defined using dataclass(). A type and default value can be declared for each field, and dataclasses.field() can be used to define custom field metadata.
- attr.s objects
Item classes with field names can be defined with attr.s(). The type and default value of each field, as well as custom field metadata, can also be defined.
Working with Item Objects
Declaring Item subclasses
Simple class definition and Field objects can be used to declare Item subclasses.
Example:
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
Declaring Fields
Field objects are used to specify any kind of metadata for each field. Different components can use the Field object.
Class – scrapy.item.Field
Example
Creating items
product = Product(name='Desktop PC', price=1000)
Getting field values
product['price']
Setting field values
product['lala'] = 'test'
Accessing all populated values
product.keys()
product.items()
Copying items
product2 = product.copy()
product2 = product.deepcopy()
Extending Item Subclass
Items can also be extended by defining a subclass of the original item.
Metadata can be extended with previous metadata.
Supporting all Item Types
Class – itemadapter.ItemAdapter(item:Any)
A common interface to extract and set data on any supported item object.
itemadapter.is_item(obj: Any) -> bool
Returns True if the object belongs to one of the supported item types. A usage sketch follows below.
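A short sketch of the ItemAdapter interface (the Product item and its fields are illustrative assumptions):
import scrapy
from itemadapter import ItemAdapter, is_item

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = Product(name="Desktop PC", price=1000)

# the same read/write interface works for dicts, Items, dataclasses and attrs objects
adapter = ItemAdapter(item)
adapter["price"] = 900
print(adapter["name"], adapter["price"])
print(is_item(item))  # True for any supported item type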
ITEM LOADERS
This is used to populate the items.
Using Item Loaders to populate items
An item loader is instantiated through its __init__(), either with an item object or with an item class (in which case the item is created automatically). Selectors then load values into the item loader, and the loader joins and processes them using its processing functions.
add_xpath(), add_css() and add_value() are all used to collect data into an item loader. ItemLoader.load_item() then populates the item with the data collected through add_xpath(), add_css() and add_value(), as sketched below.
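A minimal sketch of this flow (the item, URL and selectors are assumptions for illustration):
import scrapy
from scrapy.loader import ItemLoader

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    tags = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/product/1"]  # hypothetical page

    def parse(self, response):
        loader = ItemLoader(item=Product(), response=response)
        # collect values with CSS, XPath or literal values
        loader.add_css("name", "h1.product-name::text")
        loader.add_xpath("price", "//span[@class='price']/text()")
        loader.add_value("tags", "scraped")
        # load_item() runs the processors and returns the populated item
        yield loader.load_item()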
Working with data class items
Data class items work with item loaders as well: values are collected with add_xpath(), add_css() and add_value() and loaded into the item automatically, and how values are passed to each field can be controlled through dataclasses.field() metadata.
Input and output processors
Each item loader has 1 input processor and 1 output processor.
The input processor processes the data as soon as it is collected through add_xpath(), add_css() or add_value(), and the result is stored inside the item loader.
ItemLoader.load_item() is then called to populate the item.
The output processor processes the collected data, and its result is the value assigned to the item field.
Declaring Item Loaders
Input processors are declared using _in suffix.
Output processors are declared using _out suffix.
Also can be declared using ItemLoader.default_input_processor and ItemLoader.default_output_processor.
Declaring Input and Output processors
Input/Output processors can also be declared using Item Field metadata, as shown in the sketch after the precedence list below.
Precedence order:
- Item loader field specific attributes
- Field metadata
- Item Loader defaults
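A sketch of both declaration styles (the processors and fields chosen are assumptions for illustration; the import path assumes a recent Scrapy, which bundles the itemloaders package):
import scrapy
from itemloaders.processors import MapCompose, TakeFirst, Join
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # input processor for the "name" field (the _in suffix)
    name_in = MapCompose(str.strip)
    # output processor for the "tags" field (the _out suffix)
    tags_out = Join(", ")

# the same processors can instead be declared in the Field metadata
class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )
    tags = scrapy.Field(output_processor=Join(", "))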
Item Loader Context
The Item Loader context is a dict of arbitrary key/values shared among all input and output processors; it can modify their behaviour and can be passed at any time.
A processor that declares a loader_context argument receives the currently active context (the Scrapy docs illustrate this with a parse_length processor that reads a unit from the context).
The context can be modified in the following ways:
- Modify the Item Loader context attribute
- On loader instantiation
- On item loader declaration
Item Loader Object
If no item is given in __init__(), one is instantiated from default_item_class.
- item – the item object being parsed by the item loader
- context – the currently active context
- default_item_class – used to instantiate the item when none is given in __init__()
- default_input_processor – the default input processor for fields that do not specify one
- default_output_processor – the default output processor for fields that do not specify one
- default_selector_class – the class used to construct the selector of the item loader; ignored if a selector is given in __init__()
- selector – the object from which data is extracted
- add_css(field_name, css, *processors, **kw) – the given CSS selector extracts a list of Unicode strings which are added to the field
- add_value(field_name, value, *processors, **kw) – the given value is passed through get_value() with the processors and kw, then through the field's input processor, and appended to the data collected for that field
- add_xpath(field_name, xpath, *processors, **kw) – the given XPath extracts a list of Unicode strings which are added to the field
- get_collected_values(field_name) – returns the values collected for the field
- get_css(css, *processors, **kw) – the CSS selector extracts a list of Unicode strings
- get_output_value(field_name) – returns the collected values parsed through the output processor, without populating the item
- get_value(value, *processors, **kw) – the given value is processed by the given processors
- get_xpath(xpath, *processors, **kw) – the XPath extracts a list of Unicode strings
- load_item() – used to populate and return the item
- nested_css(css, **context) – creates a nested loader with a CSS selector
- nested_xpath(xpath, **context) – creates a nested loader with an XPath selector
- replace_css(field_name, css, *processors, **kw) – like add_css() but replaces the collected data
- replace_value(field_name, value, *processors, **kw) – like add_value() but replaces the collected data
- replace_xpath(field_name, xpath, *processors, **kw) – like add_xpath() but replaces the collected data
Nested Loaders
Nested loaders can be used when values need to be parsed from a subsection of a document, as sketched below.
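A sketch of a nested loader inside a callback (the selectors are assumptions; a plain dict is used as the item so no field declarations are needed):
from scrapy.loader import ItemLoader

def parse_page(response):
    # a plain dict is used as the item here, so any field name is accepted
    loader = ItemLoader(item={}, response=response)
    loader.add_css("name", "h1::text")

    # nested loader scoped to the page footer; its selectors are relative to <footer>
    footer = loader.nested_css("footer")
    footer.add_css("social", "a.social::attr(href)")
    footer.add_css("email", "a.email::attr(href)")

    # values collected through the nested loader end up in the same item
    return loader.load_item()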
Reusing and Extending Item Loaders
Scrapy provides the support for python class inheritance and hence item loaders can be reused and extended.
SCRAPY SHELL
Scrapy shell can be used for testing and evaluating spiders before running the entire spider. Individual queries can be checked in this.
Configuring the shell
Scrapy works well with IPython and also supports bpython; IPython is recommended as it provides auto-completion and colourised output.
The shell can be chosen in the scrapy.cfg settings:
[settings]
shell = bpython
Launch the shell
To launch the shell
scrapy shell <url>
Using the shell
It is just a regular Python console with additional shortcuts.
Available shortcuts
- shelp() – prints a help list of the available objects and shortcuts
- fetch(url[, redirect=True]) – fetches a new response from the given URL
- fetch(request) – fetches a new response from the given request
- view(response) – opens the given response in the local browser
Available scrapy objects
- crawler – current crawler object
- spider – the spider that can handle the current URL
- request – the Request object of the last fetched page
- response – the Response object of the last fetched page
- settings – current scrapy settings
Invoking shell from spiders to inspect responses
To inspect a response from within a spider's callback, use the following, as sketched below:
scrapy.shell.inspect_response
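A minimal sketch of inspect_response inside a callback (the site and selector are illustrative):
import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = "debug"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        if not response.css("div.quote"):
            # drop into an interactive shell with this response preloaded
            inspect_response(response, self)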
ITEM PIPELINE
After items are scraped by a spider, the item pipeline processes them.
Typical uses of item pipelines:
- cleansing HTML data
- validating scraped data
- checking for duplicates
- storing scraped data
Writing item pipeline
Each item pipeline component is a Python class.
- process_item(self, item, spider) – called for every item pipeline component; it must either return an item object (or a Deferred) or raise DropItem. item is the scraped item and spider is the spider that scraped it.
- open_spider(self, spider) – called when the spider is opened.
- close_spider(self, spider) – called when the spider is closed.
- from_crawler(cls, crawler) – a class method that receives the crawler and returns a new instance of the pipeline.
Example application:
- price validation and dropping items with no prices
- write items to json file
- write items to mongodb
- take a screenshot of item
- duplicates filter
To activate a pipeline component, add it to the ITEM_PIPELINES setting, as in the sketch below.
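As an illustration, a sketch of the price-validation idea from the list above, plus its activation in settings.py (the module path assumes the "scrape" project created later in this tutorial):
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class PricePipeline:
    """Drops items without a price and applies VAT to the rest (illustrative)."""

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("price") is None:
            raise DropItem(f"Missing price in {item!r}")
        adapter["price"] = adapter["price"] * self.vat_factor
        return item

# settings.py -- lower numbers run earlier (valid range 0-1000)
ITEM_PIPELINES = {
    "scrape.pipelines.PricePipeline": 300,
}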
FEED EXPORTS
Scrapy supports feed exports, i.e. exporting the scraped data to a storage backend in multiple formats.
Serialization formats
Item exporters are used for this process. The supported formats are :
| Serialization format | Feed setting format key | Exporter |
|---|---|---|
| JSON | json | JsonItemExporter |
| JSON lines | jsonlines | JsonLinesItemExporter |
| CSV | csv | CsvItemExporter |
| XML | xml | XmlItemExporter |
| Pickle | pickle | PickleItemExporter |
| Marshal | marshal | MarshalItemExporter |
Storages
Supported backend storage:
- Local filesystem
- FTP
- S3
- Google cloud storage
- Standard output
Storage URI parameters
%(time)s – replaced by a timestamp when the feed is being created
%(name)s – replaced by the spider name
Storage backends
| Storage backend | URI scheme | Example URI | Required external library |
|---|---|---|---|
| FTP | ftp | ftp://user:pass@ftp.example.com/path/to/export.csv | None |
| Amazon S3 | s3 | s3://mybucket/path/to/export.csv | botocore >= 1.4.87 |
| Google Cloud Storage | gs | gs://mybucket/path/to/export.csv | google-cloud-storage |
| Standard output | stdout | stdout: | None |
Notes:
- FTP supports two connection modes, active and passive; passive is the default. For an active connection, set FEED_STORAGE_FTP_ACTIVE = True.
- For Amazon S3, AWS credentials can be passed through AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and a custom ACL can be set with FEED_STORAGE_S3_ACL.
- For Google Cloud Storage, the project and Access Control List (ACL) can be set with GCS_PROJECT_ID and FEED_STORAGE_GCS_ACL.
Delayed File Delivery
The storage backends that use delayed file delivery are:
- FTP
- S3
- Google Cloud Storage
The file content is uploaded to the feed URI only once all of it has been collected.
To split the output into multiple files and start delivering items earlier, use FEED_EXPORT_BATCH_ITEM_COUNT.
Settings
Settings for feed exporters
- FEEDS (mandatory)
- FEED_EXPORT_ENCODING
- FEED_STORE_EMPTY
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
- FEED_STORAGES
- FEED_STORAGE_FTP_ACTIVE
- FEED_STORAGE_S3_ACL
- FEED_EXPORTERS
- FEED_EXPORT_BATCH_ITEM_COUNT
Feeds
Default : {}
FEEDS is a dictionary in which every key is a feed URI and every value is a nested dict of parameters for that feed.
| Accepted key | Fallback value |
|---|---|
| format | none (mandatory) |
| batch_item_count | FEED_EXPORT_BATCH_ITEM_COUNT |
| encoding | FEED_EXPORT_ENCODING |
| fields | FEED_EXPORT_FIELDS |
| indent | FEED_EXPORT_INDENT |
| item_export_kwargs | dict with keyword arguments for the corresponding item exporter class |
| overwrite | whether to overwrite the file if it already exists; the default depends on the backend (local filesystem: False, FTP: True, S3: True, standard output: False) |
| store_empty | FEED_STORE_EMPTY |
| uri_params | FEED_URI_PARAMS |
An example FEEDS declaration is sketched below.
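A sketch of a FEEDS declaration in settings.py (file names and options are illustrative):
# settings.py
FEEDS = {
    "items.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
        "overwrite": True,
    },
    "items.csv": {
        "format": "csv",
        "fields": ["title", "author", "date"],
    },
}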
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed. If unset or set to None, UTF-8 is used for everything except JSON output, which uses safe numeric encoding; set it to utf-8 if UTF-8 is wanted for JSON too.
FEED_EXPORT_FIELDS
Default: None
Use FEED_EXPORT_FIELDS to define the fields to export and their order.
When FEED_EXPORT_FIELDS is empty (the default), Scrapy uses the fields defined in the item objects yielded by the spider.
FEED_EXPORT_INDENT
Default: 0
If FEED_EXPORT_INDENT is a non-negative integer, array elements and object members are pretty-printed with that indent level.
If it is 0 (the default) or negative, each item is put on a new line.
None selects the most compact representation.
FEED_STORE_EMPTY
Default : False
FEED_STORAGES
Default : {}
FEED_STORAGE_FTP_ACTIVE
Default: False
Whether to use an active connection (True) or a passive one (False) when exporting feeds to an FTP server.
FEED_STORAGE_S3_ACL
Default: '' (empty string)
A string containing a custom ACL for feeds exported to Amazon S3.
FEED_STORAGES_BASE
Dict containing built-in feed storage.
FEED_EXPORTERS
Default: {}
Dict containing additional exporters
FEED_EXPORTERS_BASE
Dict containing the built-in feed exporters.
FEED_EXPORT_BATCH_ITEM_COUNT
Default: 0
If set to a number greater than 0, Scrapy generates multiple output files, each storing up to that number of items.
FEED_URI_PARAMS
Default: None
A string with the import path of a function that returns the parameters to apply to the feed URI.
REQUESTS AND RESPONSES
Scrapy uses Request and Response objects for crawling web sites.
Request Objects
PARAMETERS
- url – the URL of the request
- callback – the function that will be called with the response to this request
- method – the HTTP method of the request. Default: 'GET'
- meta – a dict of initial values for Request.meta
- body – the request body. If not given, an empty bytes object is stored
- headers – the headers of the request
- cookies – the request cookies
- encoding – the encoding of the request
- priority – the priority of the request
- dont_filter – indicates that this request should not be filtered by the duplicates filter
- errback – a function that will be called if an exception is raised while processing the request
- flags – flags sent to the request, which can be used for logging
- cb_kwargs – a dict of arbitrary data passed as keyword arguments to the callback
Passing additional data to callback functions
Request.cb_kwargs can be used to pass additional data to the callback function; this data can in turn be passed on to a second callback later if needed.
Using errbacks to catch exceptions in request processing
The errback receives a (Twisted) Failure as its first parameter, which can be used to track and handle errors.
Additional data can be accessed through failure.request.cb_kwargs, as sketched below.
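A sketch of a request that uses both cb_kwargs and an errback (the URL is a hypothetical placeholder):
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products_requests"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",   # hypothetical URL
            callback=self.parse_listing,
            errback=self.handle_error,
            cb_kwargs={"page": 1},             # extra data for the callback
        )

    def parse_listing(self, response, page):
        self.logger.info("Parsed page %d of %s", page, response.url)

    def handle_error(self, failure):
        # failure.request.cb_kwargs is still available here
        self.logger.error("Request failed: %s", failure.request.url)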
Request.meta special keys
Special keys:
- dont_redirect
- dont_retry
- handle_httpstatus_list
- handle_httpstatus_all
- dont_merge_cookies
- cookiejar
- dont_cache
- redirect_reasons
- redirect_urls
- bindaddress
- dont_obey_robotstxt
- download_timeout
- download_maxsize
- download_latency
- download_fail_on_dataloss
- proxy
- ftp_user
- ftp_password
- referrer_policy
- max_retry_times
bindaddress – the outgoing IP address to use to perform the request
download_timeout – the amount of time (in seconds) the downloader will wait before timing out
download_latency – the amount of time spent to fetch the response
download_fail_on_dataloss – whether or not to fail on broken responses
max_retry_times – the maximum number of retries per request
Stopping the download of response
A StopDownload exception can be raised to stop the download of a response.
Request subclasses
List of request subclasses
- FormRequest Objects
Parameters:
- formdata
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, …])
Parameters:
- response
- formname
- formid
- formxpath
- formcss
- formnumber
- formdata
- clickdata
- dont_click
Examples:
FormRequest to send data via HTTP POST
To simulate user login
- JsonRequest
Parameters:
- data
- dumps_kwargs
Response Objects
These are HTTP responses.
Parameters:
- url
- status
- headers
- body
- flags
- request
- certificate
- ip_address
- cb_kwargs
- copy()
- replace ([url, status, headers, body, request, flags, cls])
- urljoin(url)
- follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
- follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
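A sketch of the follow() method listed above, used inside a callback (the site and selectors are illustrative); follow() accepts relative URLs, Link objects and <a> selectors, and follow_all() does the same for a list of links:
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("span.text::text").getall():
            yield {"text": text}

        # follow the "next" pagination link, if any
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
        # equivalently: yield from response.follow_all(css="li.next a", callback=self.parse)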
Response subclasses
List of subclasses:
- TextResponse objects
- HtmlResponse objects
- XmlResponse objects
LINK EXTRACTORS
Link extractors extract links from responses.
LxmlLinkExtractor.extract_links returns a list of matching Link objects.
Link Extractor Reference
The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor
LxmlLinkExtractor
Parameters:
- allow
- deny
- allow_domains
- deny_domains
- deny_extensions
- restrict_xpaths
- restrict_css
- restrict_text
- tags
- attrs
- canonicalize
- unique
- process_value
- strip
- extract_links(response) – returns the list of Link objects extracted from the response (see the sketch below)
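A sketch of LxmlLinkExtractor in use (the URL patterns are illustrative assumptions):
from scrapy.linkextractors import LinkExtractor  # alias for LxmlLinkExtractor

# keep /chart/ links on imdb.com and drop anything under /help/
extractor = LinkExtractor(
    allow=r"/chart/",
    deny=r"/help/",
    allow_domains=["imdb.com"],
    unique=True,
)

# inside a spider callback:
# for link in extractor.extract_links(response):
#     yield response.follow(link, callback=self.parse_chart)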
Link
Link objects represent the links extracted by a link extractor.
Parameters:
- url
- text
- fragment
- nofollow
SETTINGS
Scrapy settings can be adjusted as needed
Designating the setting
The SCRAPY_SETTINGS_MODULE environment variable tells Scrapy which settings module to use.
Populating the settings
Settings can be populated in the following precedence :
- Command line options – "-s" (or "--set") is used to override settings
- Settings per-spider – This can be defined through “custom_settings” attribute
- Project settings module – This can be changed in the “settings.py” file.
- Default settings per-command – “default_settings” is used to define this
- Default global settings – scrapy.settings.default_settings is used to set this.
Import Paths and Classes
When a setting references a callable object to be imported by Scrapy, the value can be given as:
- a string containing the import path of the object, or
- the object itself.
How to access settings
Settings can be accessed through self.settings in a spider, or through scrapy.crawler.Crawler.settings on the Crawler passed to from_crawler(), as sketched below.
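A sketch of both access points (the setting names used are standard Scrapy settings):
import scrapy

class SettingsAwareSpider(scrapy.Spider):
    name = "settings_aware"
    custom_settings = {"DOWNLOAD_DELAY": 2}

    def parse(self, response):
        # self.settings is available once the spider has been initialised
        self.logger.info("User agent: %s", self.settings.get("USER_AGENT"))
        self.logger.info("Delay: %s", self.settings.getfloat("DOWNLOAD_DELAY"))

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # crawler.settings is the same Settings object
        print("Bot name:", crawler.settings.get("BOT_NAME"))
        return super().from_crawler(crawler, *args, **kwargs)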
Rationale for setting names
Setting names are usually prefixed with the name of the component they configure.
Built-in settings reference
The built-in settings include: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL, AWS_USE_SSL, AWS_VERIFY, AWS_REGION_NAME, ASYNCIO_EVENT_LOOP, BOT_NAME, CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DEFAULT_ITEM_CLASS, DEFAULT_REQUEST_HEADERS, DEPTH_LIMIT, DEPTH_PRIORITY, DEPTH_STATS_VERBOSE, DNSCACHE_ENABLED, DNSCACHE_SIZE, DNS_RESOLVER, DOWNLOADER, DOWNLOADER_HTTPCLIENTFACTORY, DOWNLOADER_CLIENTCONTEXTFACTORY, DOWNLOADER_CLIENT_TLS_CIPHERS, DOWNLOADER_CLIENT_TLS_METHOD, DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING, DOWNLOADER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES_BASE, DOWNLOADER_STATS, DOWNLOAD_DELAY, DOWNLOAD_HANDLERS, DOWNLOAD_HANDLERS_BASE, DOWNLOAD_TIMEOUT, DOWNLOAD_MAXSIZE, DOWNLOAD_WARNSIZE, DOWNLOAD_FAIL_ON_DATALOSS, DUPEFILTER_CLASS, DUPEFILTER_DEBUG, EDITOR, EXTENSIONS, EXTENSIONS_BASE, FEED_TEMPDIR, FEED_STORAGE_GCS_ACL, FTP_PASSIVE_MODE, FTP_PASSWORD, FTP_USER, GCS_PROJECT_ID, ITEM_PIPELINES, ITEM_PIPELINES_BASE, LOG_ENABLED, LOG_FILE, LOG_FORMAT, LOG_DATEFORMAT, LOG_FORMATTER, LOG_LEVEL, LOG_STDOUT, LOG_SHORT_NAMES, LOGSTATS_INTERVAL, MEMDEBUG_ENABLED, MEMDEBUG_NOTIFY, MEMUSAGE_ENABLED, MEMUSAGE_LIMIT_MB, MEMUSAGE_CHECK_INTERVAL_SECONDS, MEMUSAGE_WARNING_MB, NEWSPIDER_MODULE, RANDOMIZE_DOWNLOAD_DELAY, REACTOR_THREADPOOL_MAXSIZE, REDIRECT_PRIORITY_ADJUST, RETRY_PRIORITY_ADJUST, ROBOTSTXT_OBEY, ROBOTSTXT_PARSER, ROBOTSTXT_USER_AGENT, SCHEDULER, SCHEDULER_DEBUG, SCHEDULER_DISK_QUEUE, SCHEDULER_MEMORY_QUEUE, SCHEDULER_PRIORITY_QUEUE, SCRAPER_SLOT_MAX_ACTIVE_SIZE, SPIDER_CONTRACTS, SPIDER_CONTRACTS_BASE, SPIDER_LOADER_CLASS, SPIDER_LOADER_WARN_ONLY, SPIDER_MIDDLEWARES, SPIDER_MIDDLEWARES_BASE, SPIDER_MODULES, STATS_CLASS, STATS_DUMP, STATSMAILER_RCPTS, TELNETCONSOLE_ENABLED, TEMPLATES_DIR, TWISTED_REACTOR, URLLENGTH_LIMIT, USER_AGENT
EXCEPTIONS
Built-in Exceptions reference
- CloseSpider – raised when the spider needs to be closed
- DontCloseSpider – raised to stop the spider from being closed
- DropItem – raised by item pipeline stages to stop processing an item
- IgnoreRequest – raised to indicate that a request should be ignored
- NotConfigured – raised by extensions, item pipelines, downloader middlewares or spider middlewares to indicate that the component will remain disabled
- NotSupported – indicates that a feature is not supported
- StopDownload – indicates that nothing further should be downloaded for a response
A sample tutorial to try
1. Open command prompt and traverse to the folder where you want to store the scraped data.
2. Let’s create the project under the name “scrape”
Type the following in the conda shell
scrapy startproject scrape
The above command will create a folder with the name scrape containing a scrape folder and scrapy.cfg file.
3. Traverse inside this project, scrape.
4. Go inside the folder called spiders and then create a file called "project.py"
Type the following inside it:
import scrapy

#scrapy.Spider needs to be extended
class scrape(scrapy.Spider):
    #unique name that identifies the spider
    name = "posts"
    start_urls = ['https://blog.scrapinghub.com']

    #takes in response to process downloaded responses.
    def parse(self, response):
        #for crawling each and every link
        for post in response.css('div.post-item'):
            yield {
                #extracts title
                'title': post.css('.post-header h2 a::text')[0].get(),
                #extracts date
                'date': post.css('.post-header a::text')[1].get(),
                #extracts author name
                'author': post.css('.post-header a::text')[2].get()
            }

        #goes to next page
        next_page = response.css('a.next-posts-link::attr(href)').get()
        #if there is a next page then this parse method gets called again
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
5. Save the file
6. In the command prompt, run the spider with the following command:
scrapy crawl posts
7. All the links get crawled and, at the same time, the title, date and author get extracted for each post.
This brings us to the end of the Scrapy Tutorial. We hope that you were able to gain a comprehensive understanding of the same. If you wish to learn more such skills, check out the pool of Free Online Courses offered by Great Learning Academy.