According to a Forbes report, 2019 was a record year for enterprises' interest in data science, AI, and machine learning features in their business strategies and goals. This means businesses around the world have started making corporate decisions based on the data they have collected over the years, using machine learning and deep learning methods.
Programming languages like Python and R provide a great platform for anyone starting out in machine learning and AI to analyze data and extract useful insights for businesses. Dealing with data for analysis and visualization is an essential part of machine learning and artificial intelligence.
What is Pandas?
In short, Pandas is a software library written for the Python programming language; its job is data analysis and manipulation.
We all know that Python is, first and foremost, a general-purpose programming language. However, with the introduction of data handling libraries like NumPy and Pandas, data visualization libraries like Seaborn and Matplotlib, and its simple, easy-to-understand syntax, Python is rapidly gaining popularity among data science and ML professionals. The picture below shows a Google Trends page comparing the growth (in terms of Google searches) of Python and R over the past 15 years. It is evident that Python is seeing exponential growth, while interest in R is declining.
In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis on a given dataset.
So, what is Pandas and how is it used in AI?
Artificial Intelligence is about executing machine learning algorithms on products that we use every day. Any ML algorithm, for it to be effective, needs the following prerequisite steps to be done.
- Data Collection – Conducting opinion Surveys, scraping the internet, etc.
- Data Handling – Viewing data as a table, performing cleaning activities like checking for spellings, removal of blanks and wrong cases, removal of invalid values from data, etc.
- Data Visualization – plotting appealing graphs, so anyone who looks at the data can know what story the data tells us.
“Pandas” – short for “Panel Data” (A panel is a 3D container of data) – is a library in python which contains in-built functions to clean, transform, manipulate, visualize and analyze data.
Getting started…
NumPy – Numerical Python – forms the basis of what Pandas is all about. While NumPy deals with "Arrays" and "Matrices", Pandas deals with "Series" and "Data Frames". To work with Pandas, Python must first be installed on your system. Download and install Python from here – https://www.python.org/downloads/windows/
You can verify python installation by entering “python” in the command prompt. The command gives the installed version of python.
Alternatively, Python gets installed automatically with an application called "Anaconda", which simplifies package/library management when many basic packages need to be installed for a project.
Anaconda installer download – https://www.anaconda.com/distribution/#windows.
Once Anaconda is installed, you can navigate to the 'lib' folder within the Anaconda installation to see all the packages that were installed by default. One such package is "Pandas". To import and work with Pandas, we will use a "Jupyter Notebook" in this article.
Jupyter Notebook is basically a web application, mainly used in data science and machine learning to develop and share code. Jupyter Notebook is part of the Anaconda installation and can be accessed through Anaconda's UI as shown below.
Click "Launch" to open Jupyter Notebook. Each cell in the notebook can hold one or more Python commands. Typing and executing the following command imports "Pandas" into our work environment.
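The conventional command imports the library under the short alias 'pd':
import pandas as pd
# print the installed version to confirm the import worked
print(pd.__version__)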
Now that we have installed Pandas successfully, let us learn how to do some analysis on data.
Data Types
A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.
Data type | Use |
int | Integer number, eg: 10, 12 |
float | Floating point number, eg: 100.2, 3.1415 |
bool | True/False value |
object | Text, or a combination of text and numeric values, eg: Apple |
DateTime | Date and time values |
category | A finite list of values |
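As a quick illustration (a minimal sketch with made-up values), the dtypes attribute of a DataFrame reports which of these types each column holds:
import pandas as pd

df = pd.DataFrame({
    'count': [10, 12],              # int
    'price': [100.2, 3.1415],       # float
    'in_stock': [True, False],      # bool
    'fruit': ['Apple', 'Banana']    # object (text)
})
print(df.dtypes)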
What does Pandas deal with?
There are two major categories of data that you can come across while doing data analysis.
- One dimensional data
- Two-dimensional data
This data can be of any type: character, number, or even an object.
Series in Pandas is one-dimensional data, and data frames are 2-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.
In the example shown below, "Types of Vehicles" is a series of the datatype "Object" and is treated as a character array. "Count" is another series of the type "Integer". "Number Of wheels" is a third series, again of the type "Integer". Each individual series is one-dimensional and holds only one data type. The data frame as a whole, however, is two-dimensional and heterogeneous in nature.
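A minimal sketch of such a data frame (the values here are only illustrative):
import pandas as pd

vehicles = pd.DataFrame({
    'Types of Vehicles': ['Car', 'Bike', 'Truck'],   # object (character) series
    'Count': [120, 300, 45],                          # integer series
    'Number Of wheels': [4, 2, 6]                     # integer series
})
print(vehicles)
print(vehicles.dtypes)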
This is the reason why Pandas is so powerful and so widely used in the data science world today.
Data Frame
Creating Series & data frames in Python
A series can be created in 3 different ways – by converting an array, a list, or a dictionary into a series. We will see an example for each of these.
Array: We first create an array using the 'NumPy' package and then convert it into a series using the "Series()" function, as shown below.
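For example:
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # create a NumPy array
s = pd.Series(arr)                  # convert it into a Pandas Series
print(s)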
The same can be done for lists and dictionaries as well.
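A short sketch for a list and a dictionary:
import pandas as pd

s_from_list = pd.Series([10, 20, 30, 40])             # from a list
s_from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})  # from a dictionary - keys become the index
print(s_from_list)
print(s_from_dict)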
Creating a data frame can be done using the following command.
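For instance, converting a list into a single-column data frame:
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40], columns=['Count'])   # single-column data frame
print(df)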
We can also create data frames with multiple series by using dictionaries and converting them with the DataFrame() function.
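A minimal sketch with illustrative values:
import pandas as pd

data = {
    'Types of Vehicles': pd.Series(['Car', 'Bike', 'Truck']),
    'Count': pd.Series([120, 300, 45])
}
df = pd.DataFrame(data)   # each dictionary entry becomes a column
print(df)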
All statistical functions
Function | Description |
count() | Returns the number of times an element/data has occurred (non-null) |
sum() | Returns sum of all values |
mean() | Returns the average of all values |
median() | Returns the median of all values |
mode() | Returns the mode |
std() | Returns the standard deviation |
min() | Returns the minimum of all values |
max() | Returns the maximum of all values |
abs() | Returns the absolute value |
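A brief sketch of how these functions are applied to a column (the salary figures reappear in the aggregation example below):
import pandas as pd

df = pd.DataFrame({'Salary': [1000, 80000, 79000, 93000]})
print(df['Salary'].count())    # number of non-null values
print(df['Salary'].sum())      # sum of all values
print(df['Salary'].mean())     # average
print(df['Salary'].median())   # median
print(df['Salary'].std())      # standard deviation
print(df['Salary'].min(), df['Salary'].max())   # minimum and maximum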
Input and Output
Often, you won't be creating data from scratch; you will already have it in some form and will want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but it also lets you save your data in your desired format.
The below table shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.
Input type | Reader | Writer |
CSV | read_csv | to_csv |
JSON | read_json | to_json |
HTML | read_html | to_html |
Excel | read_excel | to_excel |
SAS | read_sas | – |
Python Pickle Format | read_pickle | to_pickle |
SQL | read_sql | to_sql |
Google Big Query | read_gbq | to_gbq |
In the below example, I have shown how to read a CSV file.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
The example below shows how to save a DataFrame to a CSV file.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
# Saving to CSV
df.to_csv("avengers.csv")
Aggregation
Aggregate functions can be applied to one or more columns. You can either apply the same aggregate function across several columns or different aggregate functions to different columns.
Commonly used aggregate functions: sum(), min(), max(), mean().
Example: Same aggregate function on all columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate(['sum', 'min', 'max', 'mean'])
Output
Name Salary Age
sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0 180.0
min Brad Pitt 1000.0 33.0
max jennifer Lawrence 93000.0 52.0
mean NaN 63250.0 45.0
Example: Different aggregate functions for different columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate({'Salary':['sum', 'mean'],
              'Age':['min', 'max']})
Output
Salary Age
max NaN 52.0
mean 63250.0 NaN
min NaN 33.0
sum 253000.0 NaN
Groupby
The Pandas groupby() function is used to split a DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output:
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
Let's group the data by the players' college names.
# group the data by college
gd = df.groupby('college')
gd.first()
Output:
name year_start year_end position height weight birth_date
college
Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946
Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989
Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986
Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959
Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950
… … … … … … … …
Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976
Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922
Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965
Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947
Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921
Let's print the values in any one of the groups. When the data is grouped by more than one column (here by position and name, as in the next example), get_group() takes a tuple key:
gd2 = df.groupby(['position', 'name'])
gd2.get_group(('C', 'A.J. Bramlett'))
Output
Year_start year_end height weight birth_date college
435 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
Let's create groups based on more than one category.
# group the data by position and name
gd = df.groupby(['position', 'name'])
gd.first()
Output
year_start year_end height weight birth_date college
position name
C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University
Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh
Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University
Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University
… … … … … … … …
G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis
Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco
Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University
Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University
Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN
Merging, Joining and Concatenation
Before I start with the Pandas join and merge functions, let me introduce you to the four different types of joins: inner join, left join, right join, and outer join.
- Full outer join: Combines results from both DataFrames. The result will have all columns from both DataFrames.
- Inner join: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
- Right join: Right join uses all records from DataFrame B and matching records from DataFrame A.
- Left join: Left join uses all records from DataFrame A and matching records from DataFrame B.
Merging
Merging two DataFrames on a single common key.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on='key')
res
Output
key Name Age
0 K0 Mercy 27
1 K1 Prince 24
2 K2 John 22
3 K3 Cena 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Age Address Qualification
0 K0 Mercy 27 Canada Btech
1 K1 Prince 24 UK B.A
2 K2 John 22 India MS
3 K3 Cena 32 USA Phd
Merging DataFrames using multiple keys.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Left merge
In pd.merge() I pass the argument ‘how = left’ to perform a left merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K1 Prince Australia 24 NaN
2 K2 John India 22 MS
3 K3 Cena Japan 32 NaN
Right merge
In pd.merge() I pass the argument how='right' to perform a right merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 NaN UK NaN B.A
2 K2 John India 22.0 MS
3 K3 NaN USA NaN Phd
Outer Merge
In pd.merge(), I pass the argument how='outer' to perform an outer merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 Prince Australia 24.0 NaN
2 K2 John India 22.0 MS
3 K3 Cena Japan 32.0 NaN
4 K1 NaN UK NaN B.A
5 K3 NaN USA NaN Phd
Inner Merge
In pd.merge(), I pass the argument how='inner' to perform an inner merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Join
join() is used to combine the columns of two DataFrames based on their index values.
Example
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2)
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Performing a join with the 'how' parameter. The possible inputs to 'how' are inner, outer, left, and right.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2, how='inner')
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Concatenation
Concatenating using ‘.concat()’ function
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames)
res
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
The resultant DataFrame has a repeated index. If you want the new Dataframe to have its own index, set ‘ignore_index’ to True.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, ignore_index=True)
res
Output
Name Age Address Qualification
0 Mercy 27.0 NaN NaN
1 Prince 24.0 NaN NaN
2 John 22.0 NaN NaN
3 Cena 32.0 NaN NaN
4 NaN NaN Canada Btech
5 NaN NaN UK B.A
6 NaN NaN India MS
7 NaN NaN USA Phd
The second DataFrame is concatenated below the first one, so the resultant DataFrame gains new rows. If you want the second DataFrame to be added as columns instead, pass the argument axis=1.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, axis=1, ignore_index=True)
res
Output
0 1 2 3
K0 Mercy 27 Canada Btech
K1 Prince 24 UK B.A
K2 John 22 India MS
K3 Cena 32 USA Phd
Concatenating using ‘.append()’ function
The append() function concatenates along axis = 0 only. It can take multiple objects as input. Note that DataFrame.append() has been deprecated in newer versions of Pandas (and removed in Pandas 2.0); pd.concat() is the recommended replacement.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
df1.append(df2)
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
Date Time
You will often encounter time data. Pandas is a very useful tool when working with time series data.
Generating a range of datetimes
In the code below I generate a range of datetimes with an hourly frequency.
import pandas as pd
# Create dates dataframe with frequency
date = pd.date_range('10/28/2011', periods=5, freq='H')
date
Output
DatetimeIndex([‘2011-10-28 00:00:00’, ‘2011-10-28 01:00:00’,
‘2011-10-28 02:00:00’, ‘2011-10-28 03:00:00’,
‘2011-10-28 04:00:00’],
dtype=’datetime64[ns]’, freq=’H’)
In the code below I generate datetimes using a range with a start value, an end value, and a 'periods' argument that specifies how many samples I want.
import pandas as pd
date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
date
Output
DatetimeIndex([‘2018-09-28 00:00:00’, ‘2018-10-01 08:00:00’,
‘2018-10-04 16:00:00’, ‘2018-10-08 00:00:00’,
‘2018-10-11 08:00:00’, ‘2018-10-14 16:00:00’,
‘2018-10-18 00:00:00’, ‘2018-10-21 08:00:00’,
‘2018-10-24 16:00:00’, ‘2018-10-28 00:00:00’],
dtype=’datetime64[ns]’, freq=None)
To turn the resulting DatetimeIndex into a Pandas Series or DataFrame, just pass it to the respective constructor.
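For example, a minimal sketch using the range generated above:
import pandas as pd

date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
s = pd.Series(date)                 # DatetimeIndex as a Series
df = pd.DataFrame({'date': date})   # DatetimeIndex as a DataFrame column
print(s.head())
print(df.head())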
Converting to timestamps
You can use the ‘to_datetime’ function to convert a Pandas Series or list-like object. When passed a Series, it returns a Series. If you pass a string, it returns a timestamp.
import pandas as pd
date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']))
date
Output
0 2020-07-04
1 2020-10-28
dtype: datetime64[ns]
In the below code I have specified the format of my input datetime. This speeds up the processing.
import pandas as pd
date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')
date
Output
Timestamp(‘1994-07-04 00:00:00’)
Dividing datetime into its features
A datetime can be split into its components using the following accessors:
pandas.Series.dt.year returns the year.
pandas.Series.dt.month returns the month.
pandas.Series.dt.day returns the day.
pandas.Series.dt.hour returns the hour.
pandas.Series.dt.minute returns the minute.
import pandas as pd
# Create datetime with dataframe
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')
# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute
# Print the dates divided into features
date.head()
Output
date year month day hour minute
0 2020-10-28 00:00:00 2020 10 28 0 0
1 2020-10-28 01:00:00 2020 10 28 1 0
2 2020-10-28 02:00:00 2020 10 28 2 0
3 2020-10-28 03:00:00 2020 10 28 3 0
4 2020-10-28 04:00:00 2020 10 28 4 0
Visualization
Pandas can also be used to visualize data.
Line plot
In the below code I am generating a line plot. I am using random normal values generated by NumPy as input.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))
df.plot()
Bar/Horizontal Bar plot
Bar plot can be made by using ‘.plot.bar()’. Pass the argument ‘stacked = True’ if you want stacked bars.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])
df.plot.bar()
# using stacked bars
df.plot.bar(stacked=True)
To generate a horizontal bar graph, use ‘.plot.barh()’. You can also pass the argument ‘stacked = True’ if you want the bars to be stacked.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])
# using stacked bars
df.plot.barh(stacked=True)
Histograms
To generate a histogram use ‘DataFrame.plot.hist()’. Pass the argument ‘bins’ specifying how many bins you want.
Example – df.plot.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.hist(bins=20)
To plot separate histograms for all your inputs, use your DataFrame name followed by ‘.hist()’. Pass the argument ‘bins’ specifying how many bins you want.
Example- df.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.hist(bins=20)
To plot a single histogram for any one of your columns, pass the column name in square brackets followed by '.hist()'.
Example- df[‘A’].hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df['A'].hist(bins=20)
Scatter plot
Scatter plot can be created using DataFrame.plot.scatter() method.
Example- df.plot.scatter()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.scatter(x='A', y='B')
Pie chart
To generate a pie chart use ‘.plot.pie()’
Example – df.plot.pie()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])
df.plot.pie(subplots=True)
Data Handling with Pandas..
- Reading from a CSV or an Excel file – Pandas provides two functions, read_csv() and read_excel(), to read data from a CSV and an Excel file respectively (see the example after this list).
- Viewing data – Viewing data from a data frame can be done in three ways:
- using the data frame’s name – returns the top and bottom 5 rows in the data frame.
- using dataframe.head() function
- using dataframe.tail() function
- To see more details on the data frame, the info() function can be used. info() gives an idea about what datatype each series in a data frame points to.
- The following functions are used to explore the unique entries within a series/column of a data frame.
- dataframe['column'].unique() – returns the unique values
- dataframe['column'].nunique() – returns the count of unique values
- dataframe['column'].value_counts() – returns the frequency of each of the categories in the column
- In our example, the titanic dataset contains a column called “Survived” which tells if the particular passenger survived the tragedy. Since this value could only be either 0 or 1, we can convert the data type from integer to object.
- dataframe.astype() is the function which lets us do the conversion
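A short sketch of these data handling steps, assuming a local file named 'titanic.csv' that contains a 'Survived' column (both the file name and the column are assumptions for illustration):
import pandas as pd

df = pd.read_csv('titanic.csv')        # assumed local file
print(df.head())                        # first 5 rows
print(df.tail())                        # last 5 rows
df.info()                               # data type of each column
print(df['Survived'].unique())          # unique values in the column
print(df['Survived'].nunique())         # number of unique values
print(df['Survived'].value_counts())    # frequency of each category
df['Survived'] = df['Survived'].astype(object)   # convert int to object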
Missing Values – Identification and Imputation..
How to identify missing values?
Pandas provides the following three functions to find out whether the data frame has missing or null values (a short sketch follows this list).
- dataframe.isna()
- dataframe.isnull()
- dataframe.isna().sum() – gives the count of NA’s in each column/series of the dataframe.
- dataframe.isna().sum().sum() – gives the count of NA's in the whole dataframe.
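A minimal sketch, using the same assumed 'titanic.csv' file:
import pandas as pd

df = pd.read_csv('titanic.csv')     # assumed local file
print(df.isna().head())             # True/False mask of missing values
print(df.isna().sum())              # count of NA's per column
print(df.isna().sum().sum())        # count of NA's in the whole data frame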
Imputation – Drop or replace??
- Pandas provides the following functions to deal with imputation (sketched briefly after this list).
- dataframe.dropna()
- dataframe.fillna()
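A brief sketch of both options (again using the assumed 'titanic.csv' file; the 'Age' column is also an assumption):
import pandas as pd

df = pd.read_csv('titanic.csv')                  # assumed local file
df_dropped = df.dropna()                          # drop every row that contains an NA
df_filled = df.fillna(0)                          # replace every NA with a constant
df['Age'] = df['Age'].fillna(df['Age'].mean())    # or impute a column with its mean (assumed 'Age' column)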
Indexing & Filtering in pandas
- We can access any row in a dataframe using the following indexers:
- dataframe.loc[] – returns rows based on the label of the index.
- dataframe.iloc[] – returns rows based on the integer position of the index.
- We can filter out the required data with the help of '[]', as shown in the example below.
- ‘&’ is used when the dataframe has to be filtered by multiple conditions
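A brief sketch using the small employee data from the merge examples above:
import pandas as pd

df = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John', 'Cena'],
                   'Age': [27, 24, 22, 32]},
                  index=['a', 'b', 'c', 'd'])
print(df.loc['b'])        # row whose index label is 'b'
print(df.iloc[1])         # row at integer position 1
print(df[df['Age'] > 25])                              # filter with a single condition
print(df[(df['Age'] > 25) & (df['Name'] != 'Cena')])   # combine conditions with '&'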
To Summarize
In this article we discussed the basics of Pandas, including creating data frames, handling missing values, and data retrieval methods. It is said that 80% of a data scientist's job is data handling and manipulation. So, if you choose to go with Python for your ML project, it is very important that you know how Pandas operates.