Web Scraping Tutorial – A Complete Introduction


Getting Started with Python

What is Python?

Python is a general-purpose, high-level, and very popular scripting language. It was created by Guido van Rossum and first released in 1991. Python is widely appreciated for its clean syntax and readable code, and it suits you well if you are just starting your programming career. While other languages lean heavily on punctuation, Python uses English keywords and has fewer syntactical constructions than most other languages.

Python does not require you to declare a variable's data type in advance. The type of a variable is automatically inferred by Python from the value it holds.

For example:

var1 = "Hello World"

The string "Hello World" is assigned to the variable var1 in the code above, so the type of var1 is string.

Suppose that later in the program we reassign var1 the value 2, that is:

var1 = 2

Now, the var1 variable is of type int.

If you have coded in Perl, Rexx, or JavaScript, you may have noticed that they automatically convert data of one type to another. Such languages are considered weakly typed.

For example, consider the following JavaScript code:

2 + "3"

Output: '23'

Before the addition (+) is performed, the integer 2 is converted to the string "2" and concatenated with "3", resulting in the string "23". Python does not allow such automatic conversions, so the same expression produces an error. Therefore, Python is said to be strongly typed.
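
For a quick comparison, here is what happens if you try the same expression in a Python interpreter; the addition is rejected unless you convert the types yourself:

2 + "3"        # raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
2 + int("3")   # 5, once the conversion is made explicit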

Generally, a Python application is only a fraction of the length of the equivalent Java code, often quoted as around 10%. This means that we can write far less code in Python to accomplish the same task as in Java.

When you run a Python program, the interpreter translates and executes the source code line by line, whereas in compiled languages such as C or C++ the compiler translates all of the statements at once. On the other hand, compiled languages can be a nightmare for beginners compared to Python.

In addition, in compiled languages you often have to write even the simplest operations yourself, such as splitting a string, calculating the length of an array, or searching for a substring. You therefore need to take care of many technical details before you can start solving your actual problem. In Python, there is usually no need to write such small utility functions or basic data structures yourself.

The Python programming language (Python 3) is used for web development, deep learning systems, and much of the groundbreaking computing done in industry. Although it serves advanced programmers well, Python is especially suitable for beginners. The most recent version is Python 3; however, Python 2 is still found in older projects.

Why learn Python?

Even if you have only a little coding experience, you will notice a big difference between Python and other languages. The points below summarize what was mentioned in the introduction.

  • Python runs on an interpreter system, which means application development can be very rapid.
  • Python has a very simple syntax, similar to the English language.
  • Python supports procedural, object-oriented, and functional programming styles.
  • Python works on various platforms, from Windows and Linux to Raspberry Pi and Mac.
  • Python's syntax lets developers write programs with fewer lines than many other languages.

Features

Its features make Python quite a versatile language. Some of them are listed below.

  • It is open source, and the Python language enjoys wide community support.
  • It can run on any platform.
  • Its broad standard library support makes real-life programming much simpler.
  • To get the best of all worlds, you can combine Python with bits of other languages such as C and C++.
  • It has a simple syntax that is easy to use.
  • It is object-oriented (OOP), which makes it well suited to programming real-world applications.
  • As discussed earlier, it is an interpreted language.
  • It can be used to write very complex programs and also to design applications with a GUI (Graphical User Interface).
  • Python has different implementations, such as CPython and Jython, so it can be described as both a compiled and an interpreted language.

Applications

  • Python can be used to build web applications on a server.
  • Python can be used alongside other software to build workflows.
  • It can connect to databases and read or update files.
  • It can be used to handle big data and perform advanced mathematics.
  • Python's strongest asset is its large set of libraries, which can be used for purposes such as:
    • Machine Learning and AI, where it is widely used by data scientists.
    • Image processing, text processing, and GUI (Graphical User Interface) development.
    • Test frameworks.
    • And, of course, web scraping.

You can write Python in any text editor or in an IDE (Integrated Development Environment) such as PyCharm, Thonny, NetBeans, or Eclipse, which are best for managing large collections of Python files.

Are these reasons not good enough for you to get started with Python?

What is Web Scraping?

Web Scraping, also called Screen Scraping, Web Harvesting, or Web Data Extraction, is a technique used to retrieve large quantities of data from websites or web pages and save it to a local file, such as a spreadsheet or a database, on your computer.

Copying a large amount of data by hand, especially when a website does not let you save the page, can be a very lengthy process, so people prefer web scraping tools or software that reduce the time involved. Websites can also contain very large amounts of valuable data, and these tools help you cut to the chase in your search.

The software can be locally downloaded into your system or can be accessed through the cloud.

If the above options do not satisfy your goals of data retrieval then you can hire a software developer to build a custom data extraction tool for your specific needs.  

Python is usually the preferred language for building web scraping tools because development in it is very fast and its scraping libraries are mature.

Web scraping falls into a grey area, so it is important to check each website's policies before extracting data, where possible, to make sure the scraping is legal.

How does it work?

Initially, a web scraper is given one or more URLs to load before scraping. It then loads the entire HTML for those pages, including the CSS and JavaScript components.

The scraper then extracts either all of the page data or specific data selected by the user.

Ideally, the user selects the specific data they want from the website before the project is run. For instance, you might want to scrape a Flipkart product page for prices and models but not be interested in the product reviews.

Python Modules/Libraries for Web Scraping

1. Requests:

It is a basic but powerful HTTP library for Python web scraping, used to fetch web pages. With requests we can get the raw HTML of a page, which can then be parsed to extract the data. It can also access APIs, and its documentation famously bills it as the only Non-GMO HTTP library for Python.
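
As a minimal sketch of how requests is typically used in a scraper (the URL below is only a placeholder), you fetch a page and hand its raw HTML to a parser:

import requests

# Fetch a page; any reachable URL can be substituted here
r = requests.get('https://example.com/')
print(r.status_code)   # 200 means the request succeeded
html = r.text          # raw HTML, ready to be passed to a parser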

2. Beautiful Soup 4:

Beautiful Soup (BS4) is a parsing library that can work with multiple parsers. A parser's job is essentially to extract data from HTML and XML files.

Choose Beautiful Soup if you need to tackle messy documents. 

Beautiful Soup's default parser (html.parser) comes from Python's standard library. It is quite slow but flexible, and it can manage HTML documents with special characters.

It also makes common tasks easier by helping you navigate the parsed document and search for what you need.

3. LXML:

lxml is a high-quality Python binding for the C libraries libxml2 and libxslt. It is a superior parsing library for HTML and XML, known as one of the most feature-rich yet easy-to-use modules for processing XML and HTML in Python. Go for lxml if you need speed; if you are familiar with XPath or CSS selectors, it is pretty quick to learn.

4. Scrapy: 

Scrapy is an open-source web crawling framework written in Python. It extracts data from websites with the help of XPath-based selectors. Compared to plain libraries it is really fast, and it provides everything we need for extracting, processing, and structuring data from web pages.
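
To give a feel for the framework, here is a minimal, hypothetical spider; the spider name, URL, and selector below are placeholders, not part of this tutorial's examples. It can be run with scrapy runspider example_spider.py -o titles.json.

import scrapy

class ExampleSpider(scrapy.Spider):
   # Scrapy downloads each URL in start_urls and calls parse() with the response
   name = 'example'
   start_urls = ['https://example.com/']

   def parse(self, response):
      # Extract the page title with a CSS selector and yield it as an item
      yield {'title': response.css('title::text').get()}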

5. Selenium

Selenium is an open-source, web-based automation testing tool that works across multiple browsers. It is a suite of tools, each playing a different role in supporting test automation, with language bindings for Ruby, Java, Python, C#, and JavaScript.

All Selenium WebDriver functionality can be accessed naturally from Python through the Selenium Python Application Programming Interface. The Python versions supported at the time of writing were 2.7, 3.5, and above.

Web scraping can be appreciated or disliked, depending on whom you ask. Web scraping has existed for a very long time, roughly since the World Wide Web was born in 1989, and it is a key foundation of the internet.

For instance, 'nice bots' let search engines index website content, price-comparison sites save money for customers, and market analysts evaluate sentiment on social media.

However, 'evil bots' retrieve data from a site in order to exploit it for purposes outside the control of the site owner. Evil bots account for roughly 20% to 25% of all website traffic and are used for a range of malicious operations, such as denial-of-service (DoS) attacks, data theft, online fraud, account hijacking, spam, and digital advertising fraud.

So, is it legal or illegal? If the data scraped or crawled from sites is only for personal use, there is usually no issue. But if you intend to re-post that data somewhere, you should first request permission from the website owner, or at least do some background analysis of the site's policies and of the data you plan to scrape.

The bitter truth is that start-up companies love web scraping because it is a cheap and efficient way to obtain data without the need for partnerships. Big companies do not want other businesses to use bots against them, even though they themselves happily use web scrapers.

For example, LinkedIn's attempt to prohibit hiQ, an analytics firm, from scraping its records was rejected by the US Court of Appeals in late 2019. The decision was a momentous event in the age of data protection and data control: it established that, for web crawlers, any data that is freely accessible and not copyrighted is fair game.

Even so, scraped data cannot simply be used for commercial purposes; the ruling did not give hiQ permission to use LinkedIn's scraped data commercially.

Another example is YouTube video titles: you can search for them and reuse them as they are, but the content of a video is copyrighted and therefore cannot be scraped and reused.

With Python, we can scrape any website or particular elements of a web page. As you scroll down, you will find out exactly what you should keep in mind while web scraping and what is required.

Some files to read before scraping any site

Before approaching a website to scrape data from it, we ought to understand its size and layout. Below are some files that we need to review before performing web scraping.

  • Sitemap files

The World Wide Web changes every second, and if your work demands up-to-date data you would otherwise have to re-scrape every possible page of a site to find the updated information. This is unhealthy for websites because it increases their server traffic. Websites therefore provide a 'sitemap file' to help scrapers and crawlers locate the updated data, which also saves a great deal of time. The standard for sitemap files is defined at http://www.sitemaps.org/protocol.html.
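
As a small sketch of how a sitemap can be consumed, assuming the site exposes one at /sitemap.xml (the exact location varies and is often listed in robots.txt):

import requests
import xml.etree.ElementTree as ET

# Download the sitemap and collect the page URLs it lists
sitemap = requests.get('https://example.com/sitemap.xml').text
root = ET.fromstring(sitemap)

# Sitemap entries are <url><loc>...</loc></url> elements under the sitemap namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
urls = [loc.text for loc in root.findall('sm:url/sm:loc', ns)]
print(len(urls), 'URLs listed in the sitemap')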

  • robots.txt

The human-readable robots.txt file is used to identify which areas of a site crawlers are permitted to scrape and which they are not. There is no single rigid format for robots.txt, and website publishers can adapt it to their needs.

In general, publishers allow programmers to crawl their sites only to a certain degree, or want only particular parts of the website crawled. A set of rules is therefore needed to state which sections may be crawled and which may not, and this is where robots.txt comes into the picture.

We can check the robots.txt file for a specific website by appending /robots.txt to the domain's URL. For example, to check it for yahoo.com we would type https://www.yahoo.com/robots.txt.
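
Python's standard library can also check the rules for you. A minimal sketch using urllib.robotparser (the user agent and path below are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.yahoo.com/robots.txt')
rp.read()

# True if a crawler identifying itself as '*' is allowed to fetch this path
print(rp.can_fetch('*', 'https://www.yahoo.com/news/'))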

Technology used by websites

A further critical question is whether the technology a website uses affects how we crawl it. Yes, it does. But how can we find out which technologies a web page uses? A Python library called builtwith lets us discover the technology stack a website is built on.
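
A minimal sketch, assuming the builtwith package has been installed with pip install builtwith:

import builtwith

# Returns a dictionary mapping technology categories to the products detected,
# for example web servers, JavaScript frameworks, and analytics tools
print(builtwith.parse('http://example.webscraping.com'))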

Owner of website

The website owner also matters: if the owner is known to block crawlers, the crawlers must be cautious while scraping the site's data. Using a protocol called Whois, we can find out who operates a website.
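
A minimal sketch, assuming the python-whois package is installed (pip install python-whois); it is imported as the whois module:

import whois

# Look up registration details for a domain; the result behaves like a dictionary
record = whois.whois('wikipedia.org')
print(record.get('registrar'))
print(record.get('creation_date'))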

Size of Website?

Does the size of a website, i.e. the number of its web pages, affect the way we crawl? It surely does. If we have only a few hundred web pages to crawl, performance is not a major problem, but if a website has hundreds of thousands of pages, like intel.com, downloading each page sequentially would take several months, and performance becomes a serious concern.

Data Extraction

Understanding and analysing the structure of the website is important. Below is an approach to help you understand how to extract data through web scraping, along with the steps to follow.

Analysing the Web page

Analysing the web page is really important because, before extraction, the user does not know the format (structured or unstructured) of the data they will obtain from the website.

We can analyse the web page in the following ways:

1. The first step would be to view the page source in HTML format:

As mentioned above, analysing a web page essentially means understanding its structure by reading its source code. Right-click the page and choose the View page source option; the data will appear in the form of an HTML file. The biggest issue here is the formatting and whitespace, which are difficult to handle.

2. The next step is to inspect page source

This is another method of web page analysis. The difference is that it solves the problem of formatting and whitespace in the page's source code. Right-click and select the Inspect element option from the menu; you will be shown the data for a specific area or element of that web page.

Extracting Data from Web Page Using Different Methods

The following are the methods that are mostly used for extracting the data from a web page −

  • Regular Expressions (RE)

Regular expressions are a specialised language embedded in Python, available through the re module. Also known as RE patterns or regexes, they let us specify rules for the set of possible strings we want to match in the data.

If you are interested in learning more about regular expressions, the documentation of Python's re module is a good place to start.

Example:

We will scrape data about Bangladesh from http://example.webscraping.com by matching the contents of the page's <td> elements with a regular expression.

import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/Bangladesh-19')
html = response.read()
text = html.decode()
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))


Output
The output is shown below −
[
   '<img src="/places/static/images/flags/in.png" />',
   '144,000 square kilometres',
   ' 156,118,464',
   ' BD',
   'Bangladesh',
   ' Dhaka',
   '<a href="/places/default/continent/AS">AS</a>',
   ' .bd',
   ' BDT',
   ' Taka',
   '880',
   '####',
   ' ^(\d{4})$',
   ' bn-BD,en',
   '<div>
      <a href="/places/default/iso/CN">MM </a>
      <a href="/places/default/iso/NP">IN </a>
   </div>'
]

  • Beautiful Soup

If we want to gather all the hyperlinks from a web page, we can use a parser called BeautifulSoup, described in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In short, BeautifulSoup is a Python library for extracting information from HTML and XML files. It is usually used together with requests, because it needs an input document to create a soup object; it cannot fetch a web page on its own.

Installing Beautiful Soup

Using the pip command, we can install beautifulsoup.

(base) D:\ProgramData>pip install bs4
Collecting bs4
   Downloading
https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89
a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\sitepackages
(from bs4) (4.6.0)
Building wheels for collected packages: bs4
   Running setup.py bdist_wheel for bs4 ... done
   Stored in directory:
C:\Users\pratibha\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d
52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

Example:

Notice that in this case we are extending the example above by using the Python requests module. We use r.text to create a soup object, which we then use to retrieve details such as the title of the web page.

First, we need to import the required Python modules −

import requests
from bs4 import BeautifulSoup

In this code we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

Now create a Soup object −

soup = BeautifulSoup(r.text, 'lxml')
print (soup.title)
print (soup.title.text)

Output
The subsequent output is shown here −
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

  • Lxml

Another Python library for web scraping is lxml, which we discussed earlier. It is a high-performance library for parsing HTML and XML, and it is relatively fast and simple. You can read more about it at https://lxml.de/.

Installing lxml

Using the pip command, we will install lxml.

(base) D:\ProgramData>pip install lxml
Collecting lxml
   Downloading
https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e
3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl
(3.
6MB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5

Example: Data extraction using lxml and requests

In the following example, we scrape a particular element of a web page from mygreatlearning.com by using lxml and requests −

First, import requests and the html module from the lxml library −

import requests
from lxml import html 

Now provide the URL of the web page to scrape −

url = 'https://www.mygreatlearning.com/blog/understanding-causality-in-machine-learning-and-future-of-hiring-with-ai-weekly-guide/'

Now give the XPath to the particular element of that web page −

path = '//*[@id="post-22569"]/div[2]/div[1]/div/div[3]/p[2]/a/strong'
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)
tree = source_code.xpath(path)
print(tree[0].text_content())


Output:

“Understanding Causality Is the Next Challenge for Machine Learning”

Data Processing

In the preceding sections, we learned about extracting data from web pages, i.e. scraping them, with different Python modules. In this section, let us look at the different techniques used to process the data that has been scraped.

Introduction

To process the data that has been scraped, we must store it on our local machine in a particular format, such as a spreadsheet (CSV) or JSON, or in a database such as MySQL.

CSV and JSON Data Processing

First, we will write the data to a CSV file (spreadsheet) after retrieving it from a web page. Let us learn through a simple example in which we first capture the information using the BeautifulSoup module, as we did before, and then use the Python csv module to write the text information to a CSV file.

Import the required Python libraries as follows −

import requests
from bs4 import BeautifulSoup
import csv

In the following piece of code, we use requests to make an HTTP GET request for the URL https://wikipedia.org/.

r = requests.get('https://wikipedia.org/')

Now, we create a Soup object −

soup = BeautifulSoup(r.text, 'lxml')

Now, we need to write the documented data to a CSV file called dataprocessing.csv.

f = csv.writer(open('dataprocessing.csv', 'w'))
f.writerow(['Title'])
f.writerow([soup.title.text])

After this script is executed, the text information, i.e. the title of the website, will be stored in the CSV file specified above on your local machine.

In the same way, we can save the collected information in a JSON file. The following is an easy-to-understand Python script that captures the same information as the previous script but stores it in JSONFile.txt using the json Python module.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://wikipedia.org/')
soup = BeautifulSoup(r.text, 'lxml')
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(soup.title.text, outfile)

After executing this script, the captured information, i.e. the title of the webpage, will be saved to your local machine in the text file described above.

Data Processing using AWS S3

Often we may want to keep scraped data in a local folder, but what if we need to store and analyse this data at scale? The answer is a cloud storage service such as Amazon S3 (Simple Storage Service). Amazon S3 provides a simple web services interface that you can use to store and retrieve any amount of data from anywhere on the web at any time.

We will follow the following steps to store data in AWS S3 (cloud) –

Step 1 − First, we need an AWS account, which provides the secret keys to use in our Python script when storing the data. It will let us create an S3 bucket in which we can store our files.

Step 2 − Next, we need to install the Python boto3 library to access the S3 bucket. It can be installed with the following command −

pip install boto3

Step 3 − Next, we will use the following Python script to scrape data from the web page and save it to the AWS S3 bucket.

First, we need to import the Python libraries for scraping; here we use requests, and boto3 to save the data to the S3 bucket.

import requests
import boto3

Now scrape the data from our URL.

data = requests.get("Enter the URL").text

Now, to store the data to the S3 bucket, create an S3 client as follows −

s3 = boto3.client('s3')
bucket_name = "our-content"

The following code will create the S3 bucket −

s3.create_bucket(Bucket=bucket_name, ACL='public-read')
# The object key below is just an example name for the stored file
s3.put_object(Bucket=bucket_name, Key='scraped_page.html', Body=data, ACL='public-read')

You can now check the bucket named our-content in your AWS account.

Data processing using MySQL

Let us explore how to use MySQL to process data. If you would like to learn more about MySQL, you can visit https://www.mysql.com/.

With the help of the following steps, data can be scraped and stored in a MySQL table −

Step 1 − First, we need to create a database and a table in MySQL in which we would like to save our scraped data. For example, we create the table with the following query −

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000),PRIMARY KEY(id));

Step 2 − Second, we have to deal with Unicode. Note that MySQL does not handle Unicode by default. We turn this feature on with the following commands, which change the default character set for the database, the table, and both of its columns −

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Step 3 − Now, we can integrate MySQL with Python. For this we need PyMySQL, which can be installed with the following command.

pip install PyMySQL 

Step 4 − Now, the database named scrap that we created earlier is ready to receive the data, once scraped from the web, into the table named Scrap_pages. In this example we will scrape data from Wikipedia and save it to our database.

First of all, we need to import the necessary Python modules.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

Let us make a connection that integrates this with Python.

conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='mysql',
   charset='utf8')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed(datetime.datetime.now())
def store(title, content):
   cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   cur.connection.commit()

Now, let us connect with Wikipedia and retrieve data from it.

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org' + articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
   store(title, content)
   return bs.find('div', {'id':'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Ar-Rahman')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links)-1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)

In the end close both cursor and connection.

finally:
   cur.close()
   conn.close()

This will save the information gathered from Wikipedia into the table named scrap_pages. If you are acquainted with MySQL and web scraping, the code above should not be hard to grasp.

Data processing using PostgreSQL

PostgreSQL, developed by a worldwide team of volunteers, is an open-source relational database management system (RDBMS). The process of handling scraped data with PostgreSQL is similar to that of MySQL, with two changes: the commands differ from MySQL's, and here we use the psycopg2 Python module to integrate with Python.

If you are not familiar with PostgreSQL, you can read about it at https://www.postgresql.org/.

Install psycopg2
pip install psycopg2 
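
A minimal sketch of the PostgreSQL equivalent of the MySQL workflow above, assuming a database named scrap with a scrap_pages(title, content) table already exists; the connection credentials are placeholders:

import psycopg2
import requests
from bs4 import BeautifulSoup

# Scrape the page title and the first paragraph
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(r.text, 'lxml')
title = soup.find('h1').get_text()
content = soup.find('p').get_text()

# Connect and insert; values are passed as parameters to avoid SQL injection
conn = psycopg2.connect(host='127.0.0.1', dbname='scrap',
   user='postgres', password='your_password')
cur = conn.cursor()
cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)',
   (title, content))
conn.commit()
cur.close()
conn.close()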

Processing Images and Videos

Web scraping typically involves downloading, saving, and processing web media content. In this section, let us understand how to process content downloaded from the web.

Introduction

The web media content that we obtain while scraping may be photographs, audio, and video files, in the form of non-web pages as well as data files. But can we trust the downloaded data, particularly the extension of the data that we are going to download and store on our computer? This makes it important to know the type of data we are going to store locally.

Getting Media Content from Web Page

In this section, we will learn how to retrieve media content and correctly identify its media type based on information from the web server. We can do this with the help of the Python requests module, as we did earlier.

First, we need to import the necessary Python modules as follows −

import requests

Now, provide the URL of the media content that we want to download and store locally.

url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080×180.jpg"

Create an HTTP response object.

r = requests.get(url) 

With the following line of code, we can save the retrieved content as a .png file.

with open("ThinkBig.png",'wb') as f:
   f.write(r.content) 

After executing the above Python script, we’ll get a file called ThinkBig.png that will have the image downloaded.

Extracting Filename from URL

After downloading content from a website, we may want to store it in a file named after the file name found in the URL. We should also check whether the URL contains additional fragments or query parameters. To do this, we need to extract the actual file name from the URL.

With the following Python script, we can extract the file name from the URL by using urlparse.

from urllib.parse import urlparse
import os

url = "https://upload.wikimedia.org/wikipedia/commons/4/40/Erdfunkstelle_Raisting_2.jpg"
a = urlparse(url)
print(os.path.basename(a.path))

After execution, you will get the filename from the URL.

Generating Thumbnail for Images

A thumbnail is a very small preview image. A user may want to save only the thumbnail of a large image, or save both the image and its thumbnail. In this section, we will generate a thumbnail of the image ThinkBig.png that was downloaded in the previous section, "Getting Media Content from Web Page."

For this Python script, we need to download a library called Pillow, a fork of the Python Imaging Library (PIL) with useful functions for manipulating images.

pip install pillow

The following Python script will generate a thumbnail of the image and save it to the current working directory with a Th_ prefix on the file name.

import glob
from PIL import Image

for infile in glob.glob("ThinkBig.png"):
   img = Image.open(infile)
   # Resize in place to at most 128x128 pixels, preserving the aspect ratio
   img.thumbnail((128, 128), Image.LANCZOS)
   # Skip files that are already thumbnails
   if infile[0:3] != "Th_":
      img.save("Th_" + infile, "png")

Check the thumbnail file in the current directory.

Screenshot from Website

In web scraping, a very common task is to take a screenshot of a website. For implementing this, we are going to use selenium and webdriver. 

The following code will take a screenshot of the site and save it to the current directory.

from selenium import webdriver

path = r'C:\\Users\\pratibha\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path=path)
browser.get('https://wikipedia.org/')
screenshot = browser.save_screenshot('screenshot.png')
browser.quit()
The output is shown below −
DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-
a571-892dc4c90eb7
<bound method WebDriver.quit of <selenium.webdriver.chrome.webdriver.WebDriver
(session="37e8e440e2f7807ef41ca7aa20cec887c6")>>

After executing the code, check your current directory for screenshot.png file.

Thumbnail Generation for Video

Assume we have downloaded videos from a website and want to create thumbnails for them so that a particular video can be selected from its thumbnail. To create a thumbnail for a video, we need a basic tool called ffmpeg, which can be downloaded from www.ffmpeg.org and installed according to our operating system.

The subsequent Python script will create a thumbnail of the video and save it to our local directory −

import subprocess

video_MP4_file = r"C:\Users\pratibha\Desktop\Dynamite.mp4"
thumbnail_image_file = 'thumbnail_Dynamite_video.jpg'
subprocess.call(['ffmpeg', '-i', video_MP4_file, '-ss', '00:00:20.000',
   '-vframes', '1', thumbnail_image_file, "-y"])


We will get the thumbnail as thumbnail_Dynamite_video.jpg saved in the local directory.

Ripping an MP4 video to an MP3

Suppose you have downloaded a video file from a website but only need the audio from it. This can be done in Python with the library named moviepy, which can be installed with the following command −

pip install moviepy

Now, after you have successfully installed moviepy, you can convert MP4 to MP3 with the following script.

import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\pratibha\Desktop\1234.mp4")
clip.audio.write_audiofile("Love_yourself_audio.mp3")

The output is shown below −

[MoviePy] Writing audio in movie_audio.mp3
100%|¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 674/674 [00:01<00:00,
476.30it/s]
[MoviePy] Done.

The MP3 file is saved in the local directory.

Python Web Scraping – Dealing with Text

In the previous section, we saw how to work with the videos and photos that we obtain as part of web scraping. In this section, we will deal with text analysis using a Python library and learn about it in depth.

Introduction

You can do text analysis using the Python library called the Natural Language Toolkit (NLTK). Before moving on to NLTK concepts, let us consider the relationship between text analysis and web scraping.

Analysing the words in a text tells us which words are important, which words are unusual, and how words are clustered together. This analysis eases the task of web scraping.

Getting started with NLTK

The Natural Language Toolkit (NLTK) is a series of Python libraries developed primarily to define and mark parts of speech used in natural language text such as English.

Installing NLTK

You can use the following command to install NLTK in Python −

pip install nltk

If you are using Anaconda, you can install NLTK with the following conda command −

conda install -c anaconda nltk

Downloading NLTK’s Data

After installing NLTK, we need to download its preset text repositories. Before downloading them, we need to import NLTK with the import command as follows −

import nltk

Then download the NLTK data −

nltk.download()

Downloading all the available NLTK packages will take some time, but it is still advised to install them all.

Installing Other Necessary packages

We will need several other Python packages, such as gensim and pattern, to perform text analysis and build natural language processing applications using NLTK.

gensim − A robust semantic modelling library that is useful for many applications. It can be installed with the following command −

pip install gensim

Pattern − Used to make the gensim package work correctly. To install −

pip install pattern

Tokenization

Tokenization is the process of breaking the given text into smaller units called tokens. Tokens can be words, numbers, or punctuation marks. It is also called word segmentation.

The NLTK module offers various packages for tokenization, which we can use according to our requirements. Some of these packages are listed here −

  • sent_tokenize package − This package splits the text of the input into sentences. 

To import this package −

from nltk.tokenize import sent_tokenize

  • word_tokenize package − This package splits the input text into words. 

To import this package −

from nltk.tokenize import word_tokenize

  • WordPunctTokenizer package − This package splits the input text into words, treating punctuation marks as separate tokens.

To import this package −

from nltk.tokenize import WordPunctTokenizer
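
A short usage sketch of the first two packages (running nltk.download('punkt') once may be needed to fetch the tokenizer models):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Web scraping is fun. Python makes it easy."
print(sent_tokenize(text))   # ['Web scraping is fun.', 'Python makes it easy.']
print(word_tokenize(text))   # ['Web', 'scraping', 'is', 'fun', '.', 'Python', ...]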

Stemming

Every language contains many variations of words due to grammatical reasons; consider, for instance, the terms democracy, democratic, and democratization. For artificial intelligence as well as web scraping projects, it is important for machines to recognise that these different terms share the same base form. It is therefore useful to extract the base forms of words while processing the text.

This can be accomplished by stemming, a heuristic process of extracting the base forms of words by chopping off their endings.

The NLTK module includes a variety of packages for stemming, which we can use according to our requirements. Some of them are listed here −

  • PorterStemmer package − This Python stemming package uses Porter's algorithm to extract the base form.

Command to import the package −

from nltk.stem.porter import PorterStemmer

  • LancasterStemmer package − This Python stemming package uses Lancaster's algorithm to extract the base form.

Command to import the package −

from nltk.stem.lancaster import LancasterStemmer

  • SnowballStemmer package − This Python stemming package uses the Snowball algorithm to extract the base form.

Command to import the package −

from nltk.stem.snowball import SnowballStemmer
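
A quick sketch comparing the three stemmers on the same word (the exact outputs depend on the algorithm):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = 'writing'
print(PorterStemmer().stem(word))              # 'write'
print(LancasterStemmer().stem(word))           # 'writ'
print(SnowballStemmer('english').stem(word))   # 'write'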

Lemmatization

Another way to obtain the base form of words is lemmatization, which removes inflectional endings by using vocabulary and morphological analysis. The base form of a word after lemmatization is called its lemma.

The NLTK module includes the following package for lemmatization −

  • WordNetLemmatizer package − Extracts the base form of a word depending on whether it is used as a noun or a verb.

Command to import the package −

from nltk.stem import WordNetLemmatizer
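
A short usage sketch (running nltk.download('wordnet') once may be needed):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))   # 'run'
print(lemmatizer.lemmatize('mice'))               # 'mouse'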

Chunking

Chunking, which means dividing data into small chunks, is one of the important processes in natural language analysis, used to identify parts of speech and short phrases such as noun phrases. Chunking works on top of tagged tokens, and with its help we can get the structure of a sentence.
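
A minimal noun-phrase chunking sketch with NLTK's RegexpParser; the grammar is a simple illustrative pattern, and nltk.download('averaged_perceptron_tagger') may be needed once for the POS tagger:

import nltk
from nltk.tokenize import word_tokenize

sentence = "The little yellow dog barked at the cat"
tokens = nltk.pos_tag(word_tokenize(sentence))

# Chunk grammar: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tokens)
print(tree)   # noun phrases appear as NP subtrees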

Dynamic Websites

In this section, let us learn how to perform web scraping on dynamic websites and the concepts involved, in depth.

Introduction

Web scraping is a challenging task, and the difficulty multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.

Dynamic Website Example

Let us consider an example of a dynamic website and see why it is hard to scrape. Here we take the search facility of the site http://example.webscraping.com/places/default/search. But how can we tell that this website is dynamic? We can judge that from the output of the following Python script, which tries to scrape data from the above web page −

import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))


Approaches for Scraping data from Dynamic Websites

We have seen that such a scraper cannot extract information from a dynamic website, because the data is loaded dynamically with JavaScript and the script above returns an empty list. In such cases, the following two approaches can be used to scrape data from dynamic, JavaScript-dependent websites −

  • Reverse Engineering JavaScript
  • Rendering JavaScript

Reverse Engineering JavaScript

The process called reverse engineering helps us learn how data is loaded dynamically into web pages. To do this, open the Inspect element panel for the URL, then open the NETWORK tab to find all the requests made for that web page, such as search.json with an /ajax path.

We can also do this with the aid of the Python script −

import requests
url=requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
url.json() 

Rendering JavaScript

In the preceding section we reverse engineered the web page to learn how its API works and how we can use it to retrieve results in a single request. However, reverse engineering can run into the following difficulties −

  • Websites can sometimes be very complicated. For example, if a website is built with a sophisticated browser tool such as Google Web Toolkit (GWT), the resulting JavaScript code is machine-generated and difficult to understand and reverse engineer.
  • Some higher-level frameworks, such as React.js, make reverse engineering difficult by abstracting already complex JavaScript logic.

The solution to these difficulties is to use a browser rendering engine that parses HTML, applies CSS formatting, and executes JavaScript to display the web page exactly as a browser would.
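
A minimal rendering sketch using Selenium, which was introduced earlier; the driver path is the placeholder used elsewhere in this tutorial, and newer Selenium versions configure the driver differently (without executable_path):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

path = r'C:\\Users\\pratibha\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path=path)
browser.get('http://example.webscraping.com/places/default/search')

# Give the page's JavaScript a moment to populate the results
time.sleep(2)

# page_source now contains the JavaScript-rendered HTML
soup = BeautifulSoup(browser.page_source, 'lxml')
print(soup.title.text)
browser.quit()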

Form based Websites

In the previous section, we scraped dynamic websites. In this section, let us understand how to scrape websites that work on user-provided input, that is, form-based websites.

Introduction

These days, the web is dominated by social media and user-generated content, so the question arises: how can we access this kind of information beyond the login page? For this, we need to work with forms and logins.

In earlier sections, we used the HTTP GET method to request information, but in this section we will work with the HTTP POST method, which sends information to the web server for storage and processing.

Interacting with Login forms

You have probably dealt with login forms many times while browsing the web. They can be very simple, with just a few HTML fields, a submit button, and an action page, or they can be complex, with additional fields such as email, a message box, or a CAPTCHA for security purposes.

In this section, we will handle a basic submit form with the help of the Python requests library.

First, we need to import the requests library as follows −

import requests

Provide the data for the fields of the login form.

parameters = {'Name': 'Enter name', 'Email id': 'Your Emailid', 'Message': 'Write your message here'}

Provide the URL.

r = requests.post("enter the URL", data=parameters)

print(r.text)

If you want to upload an image with the form, it is really simple with requests.post(). You can understand this from the following Python script −

import requests

file = {'Uploadfile': open(r'C:\Users\pratibha\Desktop\Hello.png', 'rb')}

r = requests.post("enter the URL", files=file)

print(r.text)

Loading Cookies from the Web Server

A cookie, also referred to as a web cookie, is a tiny piece of data sent by a website and saved by our computer in a file within our web browser.

In connection with login forms, cookies can be of two kinds. The first kind, dealt with in the previous section, allows us to send information to the website; the second allows us to remain in a permanent "logged-in" state throughout our visit to the website. For the second kind of form, websites use cookies to keep track of who is logged in.

What do cookies do?

Most websites use cookies for tracking these days. We can understand how cookies work with the help of the following steps −

Step 1 − First, the site verifies our login credentials and saves them in our browser's cookie. This cookie usually contains a server-generated token, a time-out, and tracking information.

Step 2 − Then the website uses the cookie as proof of authentication, which is presented each time we access the site.

Cookies can be tricky for web scrapers to handle: if a scraper does not keep track of them, a submitted form is bounced back, and on the next page it appears as if the scraper never logged in. Cookies are very easy to track with the Python requests library, as shown below −

import requests

parameters = {'Name': 'Enter name', 'Email id': 'Your Emailid', 'Message': 'Write your message here'}
r = requests.post("enter the URL", data=parameters)

print('Cookie is:')
print(r.cookies.get_dict())
print(r.text)

We will get the cookies from the last request.

There is another problem with cookies: websites sometimes change them without warning. Such a situation can be handled with requests.Session(), as follows −

import requests

session = requests.Session()
parameters = {'Name': 'Enter name', 'Email id': 'Your Emailid', 'Message': 'Write your message here'}
r = session.post("enter the URL", data=parameters)

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)


Processing CAPTCHA

In this section, let us learn how web scraping handles CAPTCHA, which is used to tell whether a client is a human or a bot.

What is CAPTCHA?

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, which makes it explicit that it is a test to decide whether the client is human or a bot.

A CAPTCHA is a distorted image that is normally not readily identified by a software program but can somehow be interpreted by a human being. Most websites use CAPTCHA to prevent bots from interacting with them.

Loading CAPTCHA with Python

Assume we want to register on a website that has a form with a CAPTCHA. Before loading the CAPTCHA image, we need to know the basic information required by the form. With the following Python script, we can understand the field requirements of the registration form on the site http://example.webscraping.com.

import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib

def form_parsing(html):
   tree = lxml.html.fromstring(html)
   data = {}
   for e in tree.cssselect('form input'):
      if e.get('name'):
         data[e.get('name')] = e.get('value')
   return data

REGISTER_URL = 'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(REGISTER_URL).read()
form = form_parsing(html)
pprint.pprint(form)

In the code block above, we first define a function that parses the form using the lxml module; the script then prints the form fields as follows −

{
   '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
   '_formname': 'register',
   '_next': '/places/default/index',
   'email': '',
   'first_name': '',
   'last_name': '',
   'password': '',
   'password_two': '',
   'recaptcha_response_field': None
}

You can see from the output above that all the fields except recaptcha_response_field are understandable and straightforward. The question now is how we can handle this complicated field and download the CAPTCHA. It can be done with the help of the Pillow Python library, as follows.

Pillow Python Package

Pillow is a fork of the Python Imaging Library that has useful features for manipulating images. It can be installed with the following command −

pip install pillow
Now we will use it to load the CAPTCHA −
from io import BytesIO
import base64
import lxml.html
from PIL import Image

def load_captcha(html):
   tree = lxml.html.fromstring(html)
   img_data = tree.cssselect('div#recaptcha img')[0].get('src')
   # The src attribute is a data URI; keep only the base64 payload after the comma
   img_data = img_data.partition(',')[-1]
   binary_img_data = base64.b64decode(img_data)
   file_like = BytesIO(binary_img_data)
   img = Image.open(file_like)
   return img

The script above uses the Pillow package and defines a function for loading the CAPTCHA image. It is meant to be used with the form_parsing() function defined in the previous script to obtain information about the registration form. It saves the CAPTCHA image in a usable format that can then be extracted as a string.
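
One common way to turn the saved CAPTCHA image into a string is optical character recognition. A sketch assuming the pytesseract package and the Tesseract OCR engine are installed; this only works for simple CAPTCHAs, since heavily distorted ones are designed to defeat plain OCR:

import pytesseract
from PIL import Image

# Assume the image returned by load_captcha() was saved to disk as captcha.png
img = Image.open('captcha.png')
captcha_text = pytesseract.image_to_string(img)
print(captcha_text.strip())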

Testing with Scrapers

This segment explores how to do web scraper tests in Python.

Introduction

In large web projects, automated testing of a website's backend is performed regularly, but frontend testing is often skipped. The main reason is that a website's code is a net of various markup and programming languages: we can write unit tests for one language, but it becomes challenging when the interaction happens in another. That is why we need a suite of tests to make sure our code works as we expect.

Testing using Python

When we talk about testing here, we mean unit testing. Before diving into testing in Python, we need to know about unit tests. Here are some of their characteristics −

  • Each unit test should check at least one aspect of a component's functionality.
  • Each unit test is self-contained and can run independently.
  • A unit test must not affect the success or failure of any other test.
  • Unit tests can be run in any order.

Unittest − Python Module

The Python module for unit testing, unittest, ships with every standard Python installation. We only need to import it; the rest is handled by the unittest.TestCase class, which does the following −

  • The unittest.TestCase class provides setUp and tearDown functions, which can run before and after each unit test.
  • It also provides assert statements that allow tests to pass or fail.
  • It runs all methods whose names start with test as unit tests.

Example:

In this example we combine web scraping with unittest. We test the Wikipedia page for the string 'Ruby'. There are two tests: the first checks whether the page title equals the search string 'Ruby', and the second makes sure the page has a content div.

First, we import the necessary Python modules: BeautifulSoup for scraping and unittest for testing.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest

Now we define a class that extends unittest.TestCase. A global object bs is shared between all of the tests; the setUpClass function provided by unittest accomplishes this. We then define two functions, one to test the page title and the other to test the page content.

class Test(unittest.TestCase):
   bs = None
   def setUpClass():
      url = 'https://en.wikipedia.org/wiki/Ruby'
      Test.bs = BeautifulSoup(urlopen(url), 'html.parser')
   def test_titleText(self):
      pageTitle = Test.bs.find('h1').get_text()
      self.assertEqual('Ruby', pageTitle);
   def test_contentExists(self):
      content = Test.bs.find('div',{'id':'mw-content-text'})
      self.assertIsNotNone(content)
if __name__ == '__main__':
   unittest.main()

Output −
----------------------------------------------------------------------
Ran 2 tests in 2.573s

OK

SystemExit: False

D:\ProgramData\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
 warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

Testing with Selenium

Let us see how to use Selenium for testing; this is also called Selenium testing. unittest and Selenium do not have much in common. We have seen that Selenium sends standard Python commands to different browsers, despite variations in their design. Recall that we already installed and worked with Selenium in previous sections. Now we will create test scripts with it.

Example:

With the help of the following Python script, we build a test script to automate the Facebook login page. You can modify the example to automate other forms and logins of your choice; the principle will be the same.

First, to link to the web browser, we can import the webdriver from the selenium module −

from selenium import webdriver
Now import Keys from selenium module.
from selenium.webdriver.common.keys import Keys
Provide the username and password to log in to your Facebook account.
user = "pratibhadimri2@gmail.com"
pwd = ""
Provide the path to the web driver for the browser.
path = r'C:\\Users\\pratibha\\Desktop\\Chromedriver'
driver = webdriver.Chrome(executable_path=path)
driver.get("http://www.facebook.com")
Verify the condition.
assert "Facebook" in driver.title

element = driver.find_element_by_id("email")
element.send_keys(user)

element = driver.find_element_by_id("pass")
element.send_keys(pwd)

element.send_keys(Keys.RETURN)
Close the browser.
driver.close()

After running the above script, the browser window will open, and you will see the username and password being entered and the login button pressed.

This brings us to the end of the blog on Web Scraping Tutorial. We hope that you found this comprehensive Web Scraping Tutorial helpful and were able to gain the required knowledge. If you wish to upskill and learn more such concepts, you can check out the pool of Free Online courses on Great Learning Academy.
