According to a Forbes report, 2019 was a record year for enterprises' interest in data science, AI, and machine learning features in their business strategies and goals. This means businesses around the world have started making corporate decisions based on the data they have collected over the years, using machine learning and deep learning methods.
Programming languages like Python and R provide a great platform for anyone starting out in machine learning and AI to analyze data and extract useful insights for businesses. Dealing with data for analysis and visualization is an essential part of machine learning and artificial intelligence.
What is Pandas?
In short, Pandas is a software library written for the Python programming language; its job is data analysis and manipulation.
We all know that Python is, first and foremost, a general-purpose programming language. However, with the introduction of data handling libraries like NumPy and Pandas, data visualization libraries like Seaborn and Matplotlib, and its simple, easy-to-understand syntax, Python is rapidly gaining popularity among data science and ML professionals. The picture below shows a Google Trends page comparing the growth (in terms of Google searches) of Python and R over the past 15 years. It is evident that Python is seeing exponential growth, while interest in R is declining.
In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis on a given dataset.
So, what is Pandas and how is it used in AI?
Artificial Intelligence is about executing machine learning algorithms on products that we use every day. Any ML algorithm, for it to be effective, needs the following prerequisite steps to be done.
- Data Collection – Conducting opinion Surveys, scraping the internet, etc.
- Data Handling – Viewing data as a table, performing cleaning activities like checking for spellings, removal of blanks and wrong cases, removal of invalid values from data, etc.
- Data Visualization – plotting appealing graphs, so anyone who looks at the data can know what story the data tells us.
“Pandas” – short for “Panel Data” (A panel is a 3D container of data) – is a library in python which contains in-built functions to clean, transform, manipulate, visualize and analyze data.
Getting started…
NumPy – Numerical Python – forms the basis of what Pandas is all about. While NumPy deals with "Arrays" and "Matrices", Pandas deals with "Series" and "Data Frames". To work with Pandas, Python must first be installed on your system. Download and install Python from here – https://www.python.org/downloads/windows/
You can verify python installation by entering “python” in the command prompt. The command gives the installed version of python.
Alternatively, Python gets installed automatically with an application called "Anaconda", which simplifies package/library management when many basic packages need to be installed for a project.
Anaconda installer download – https://www.anaconda.com/distribution/#windows.
Once Anaconda is installed, you can navigate to the 'lib' folder within the Anaconda installation to see all the packages that were installed by default. One such package is "Pandas". To import and work with Pandas, we will use a "Jupyter Notebook" in this article.
Jupyter Notebook is basically a web application, mainly used in data science and machine learning to develop and share code. Jupyter Notebook is part of the Anaconda installation and can be accessed through Anaconda's UI as shown below.
Click "Launch" to open Jupyter Notebook. Each cell in the notebook can hold one or more Python commands. Typing and executing the following command imports "Pandas" into our work environment.
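The conventional command imports the library under the short alias 'pd':
import pandas as pd
# print the installed version to confirm the import worked
print(pd.__version__)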
Now that we have installed Pandas successfully, let us learn how to do some analysis on data.
Data Types
A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.
Data type | Use |
int | Integer number, eg: 10, 12 |
float | Floating point number, eg: 100.2, 3.1415 |
bool | True/False value |
object | Text, or a combination of text and numeric values, eg: Apple |
DateTime | Date and time values |
category | A finite list of values |
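As a quick illustration (a minimal sketch with made-up values), the dtypes attribute of a DataFrame reports which of these types each column holds:
import pandas as pd

df = pd.DataFrame({
    'count': [10, 12],              # int
    'price': [100.2, 3.1415],       # float
    'in_stock': [True, False],      # bool
    'fruit': ['Apple', 'Banana']    # object (text)
})
print(df.dtypes)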
What does Pandas deal with?
There are two major categories of data that you can come across while doing data analysis.
- One dimensional data
- Two-dimensional data
This data can be of any type: character, number, or even an object.
Series in Pandas is one-dimensional data, and data frames are 2-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.
In the example shown below, "Types of Vehicles" is a series of the datatype "Object" and is treated as a character array. "Count" is another series of the type "Integer". "Number Of wheels" is a third series, again of the type "Integer". Each individual series is one-dimensional and holds only one data type. The data frame as a whole, however, is two-dimensional and heterogeneous in nature.
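A minimal sketch of such a data frame (the values here are only illustrative):
import pandas as pd

vehicles = pd.DataFrame({
    'Types of Vehicles': ['Car', 'Bike', 'Truck'],   # object (character) series
    'Count': [120, 300, 45],                          # integer series
    'Number Of wheels': [4, 2, 6]                     # integer series
})
print(vehicles)
print(vehicles.dtypes)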
This is the reason why Pandas is so powerful and so widely used in the data science world today.
Data Frame
Creating Series & data frames in Python
A series can be created in 3 different ways – by converting an array, a list, or a dictionary into a series. We will see an example for each of these.
Array: We first create an array using the 'NumPy' package and then convert it into a series using the "Series()" function, as shown below.
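For example:
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # create a NumPy array
s = pd.Series(arr)                  # convert it into a Pandas Series
print(s)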
The same can be done for lists and dictionaries as well.
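A short sketch for a list and a dictionary:
import pandas as pd

s_from_list = pd.Series([10, 20, 30, 40])             # from a list
s_from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})  # from a dictionary - keys become the index
print(s_from_list)
print(s_from_dict)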
Creating a data frame can be done using the following command.
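For instance, converting a list into a single-column data frame:
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40], columns=['Count'])   # single-column data frame
print(df)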
We can also create data frames with multiple series by using dictionaries and converting them with the DataFrame() function.
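A minimal sketch with illustrative values:
import pandas as pd

data = {
    'Types of Vehicles': pd.Series(['Car', 'Bike', 'Truck']),
    'Count': pd.Series([120, 300, 45])
}
df = pd.DataFrame(data)   # each dictionary entry becomes a column
print(df)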
All statistical functions
Function | Description |
count() | Returns the number of times an element/data has occurred (non-null) |
sum() | Returns sum of all values |
mean() | Returns the average of all values |
median() | Returns the median of all values |
mode() | Returns the mode |
std() | Returns the standard deviation |
min() | Returns the minimum of all values |
max() | Returns the maximum of all values |
abs() | Returns the absolute value |
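A brief sketch of how these functions are applied to a column (the salary figures reappear in the aggregation example below):
import pandas as pd

df = pd.DataFrame({'Salary': [1000, 80000, 79000, 93000]})
print(df['Salary'].count())    # number of non-null values
print(df['Salary'].sum())      # sum of all values
print(df['Salary'].mean())     # average
print(df['Salary'].median())   # median
print(df['Salary'].std())      # standard deviation
print(df['Salary'].min(), df['Salary'].max())   # minimum and maximum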
Input and Output
Often, you won't be creating data from scratch; you will already have it in some form and will want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but it also lets you save your data in your desired format.
The below table shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.
Input type | Reader | Writer |
CSV | read_csv | to_csv |
JSON | read_json | to_json |
HTML | read_html | to_html |
Excel | read_excel | to_excel |
SAS | read_sas | – |
Python Pickle Format | read_pickle | to_pickle |
SQL | read_sql | to_sql |
Google Big Query | read_gbq | to_gbq |
In the below example, I have shown how to read a CSV file.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
The example below shows how to save a DataFrame to a CSV file.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
# Saving to CSV
df.to_csv("avengers.csv")
Aggregation
Aggregate functions can be applied to one or more columns. You can either apply the same aggregate function across several columns or different aggregate functions to different columns.
Commonly used aggregate functions: sum(), min(), max(), mean().
Example: Same aggregate function on all columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate(['sum', 'min', 'max', 'mean'])
Output
Name Salary Age
sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0 180.0
min Brad Pitt 1000.0 33.0
max jennifer Lawrence 93000.0 52.0
mean NaN 63250.0 45.0
Example: Different aggregate functions for different columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate({'Salary':['sum', 'mean'],
              'Age':['min', 'max']})
Output
Salary Age
max NaN 52.0
mean 63250.0 NaN
min NaN 33.0
sum 253000.0 NaN
Groupby
The Pandas groupby() function is used to split a DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output:
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
Let's group the data by the players' college names.
# group the data by college
gd = df.groupby('college')
gd.first()
Output:
name year_start year_end position height weight birth_date
college
Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946
Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989
Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986
Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959
Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950
… … … … … … … …
Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976
Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922
Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965
Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947
Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921
Let's print the values in any one of the groups. When the data is grouped by more than one column (here by position and name, as in the next example), get_group() takes a tuple key:
gd2 = df.groupby(['position', 'name'])
gd2.get_group(('C', 'A.J. Bramlett'))
Output
Year_start year_end height weight birth_date college
435 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
Let's create groups based on more than one category.
# group the data by position and name
gd = df.groupby(['position', 'name'])
gd.first()
Output
year_start year_end height weight birth_date college
position name
C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University
Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh
Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University
Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University
… … … … … … … …
G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis
Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco
Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University
Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University
Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN
Merging, Joining and Concatenation
Before I start with the Pandas join and merge functions, let me introduce you to the four different types of joins: inner join, left join, right join, and outer join.
- Full outer join: Combines results from both DataFrames. The result will have all columns from both DataFrames.
- Inner join: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
- Right join: Right join uses all records from DataFrame B and matching records from DataFrame A.
- Left join: Left join uses all records from DataFrame A and matching records from DataFrame B.
Merging
Merging two DataFrames on a single common key.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on='key')
res
Output
key Name Age
0 K0 Mercy 27
1 K1 Prince 24
2 K2 John 22
3 K3 Cena 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Age Address Qualification
0 K0 Mercy 27 Canada Btech
1 K1 Prince 24 UK B.A
2 K2 John 22 India MS
3 K3 Cena 32 USA Phd
Merging DataFrames using multiple keys.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Left merge
In pd.merge() I pass the argument ‘how = left’ to perform a left merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K1 Prince Australia 24 NaN
2 K2 John India 22 MS
3 K3 Cena Japan 32 NaN
Right merge
In pd.merge() I pass the argument how='right' to perform a right merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 NaN UK NaN B.A
2 K2 John India 22.0 MS
3 K3 NaN USA NaN Phd
Outer Merge
In pd.merge(), I pass the argument how='outer' to perform an outer merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 Prince Australia 24.0 NaN
2 K2 John India 22.0 MS
3 K3 Cena Japan 32.0 NaN
4 K1 NaN UK NaN B.A
5 K3 NaN USA NaN Phd
Inner Merge
In pd.merge(), I pass the argument how='inner' to perform an inner merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
Join
join() is used to combine the columns of two DataFrames based on their index values.
Example
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2)
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Performing a join with the 'how' parameter. The possible inputs to 'how' are inner, outer, left, and right.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2, how='inner')
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Concatenation
Concatenating using ‘.concat()’ function
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames)
res
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
The resultant DataFrame has a repeated index. If you want the new Dataframe to have its own index, set ‘ignore_index’ to True.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, ignore_index=True)
res
Output
Name Age Address Qualification
0 Mercy 27.0 NaN NaN
1 Prince 24.0 NaN NaN
2 John 22.0 NaN NaN
3 Cena 32.0 NaN NaN
4 NaN NaN Canada Btech
5 NaN NaN UK B.A
6 NaN NaN India MS
7 NaN NaN USA Phd
The second DataFrame is concatenated below the first one, so the resultant DataFrame gains new rows. If you want the second DataFrame to be added as columns instead, pass the argument axis=1.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, axis=1, ignore_index=True)
res
Output
0 1 2 3
K0 Mercy 27 Canada Btech
K1 Prince 24 UK B.A
K2 John 22 India MS
K3 Cena 32 USA Phd
Concatenating using ‘.append()’ function
The append() function concatenates along axis = 0 only. It can take multiple objects as input. Note that DataFrame.append() has been deprecated in newer versions of Pandas (and removed in Pandas 2.0); pd.concat() is the recommended replacement.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
df1.append(df2)
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
Date Time
You will often encounter time data. Pandas is a very useful tool when working with time series data.
Generating a range of datetimes
In the code below I generate a range of datetimes with an hourly frequency.
import pandas as pd
# Create dates dataframe with frequency
date = pd.date_range('10/28/2011', periods=5, freq='H')
date
Output
DatetimeIndex([‘2011-10-28 00:00:00’, ‘2011-10-28 01:00:00’,
‘2011-10-28 02:00:00’, ‘2011-10-28 03:00:00’,
‘2011-10-28 04:00:00’],
dtype=’datetime64[ns]’, freq=’H’)
In the code below I generate datetimes using a range with a start value, an end value, and a 'periods' argument that specifies how many samples I want.
import pandas as pd
date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
date
Output
DatetimeIndex([‘2018-09-28 00:00:00’, ‘2018-10-01 08:00:00’,
‘2018-10-04 16:00:00’, ‘2018-10-08 00:00:00’,
‘2018-10-11 08:00:00’, ‘2018-10-14 16:00:00’,
‘2018-10-18 00:00:00’, ‘2018-10-21 08:00:00’,
‘2018-10-24 16:00:00’, ‘2018-10-28 00:00:00’],
dtype=’datetime64[ns]’, freq=None)
To turn the resulting DatetimeIndex into a Pandas Series or DataFrame, just pass it to the respective constructor.
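For example, a minimal sketch using the range generated above:
import pandas as pd

date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
s = pd.Series(date)                 # DatetimeIndex as a Series
df = pd.DataFrame({'date': date})   # DatetimeIndex as a DataFrame column
print(s.head())
print(df.head())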
Converting to timestamps
You can use the ‘to_datetime’ function to convert a Pandas Series or list-like object. When passed a Series, it returns a Series. If you pass a string, it returns a timestamp.
import pandas as pd
date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']))
date
Output
0 2020-07-04
1 2020-10-28
dtype: datetime64[ns]
In the below code I have specified the format of my input datetime. This speeds up the processing.
import pandas as pd
date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')
date
Output
Timestamp(‘1994-07-04 00:00:00’)
Dividing datetime into its features
A datetime can be split into its components using the following accessors:
pandas.Series.dt.year returns the year.
pandas.Series.dt.month returns the month.
pandas.Series.dt.day returns the day.
pandas.Series.dt.hour returns the hour.
pandas.Series.dt.minute returns the minute.
import pandas as pd
# Create datetime with dataframe
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')
# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute
# Print the dates divided into features
date.head()
Output
date year month day hour minute
0 2020-10-28 00:00:00 2020 10 28 0 0
1 2020-10-28 01:00:00 2020 10 28 1 0
2 2020-10-28 02:00:00 2020 10 28 2 0
3 2020-10-28 03:00:00 2020 10 28 3 0
4 2020-10-28 04:00:00 2020 10 28 4 0
Visualization
Pandas can also be used to visualize data.
Line plot
In the below code I am generating a line plot. I am using random normal values generated by NumPy as input.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))
df.plot()
Bar/Horizontal Bar plot
Bar plot can be made by using ‘.plot.bar()’. Pass the argument ‘stacked = True’ if you want stacked bars.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])
df.plot.bar()
# using stacked bars
df.plot.bar(stacked=True)
To generate a horizontal bar graph, use ‘.plot.barh()’. You can also pass the argument ‘stacked = True’ if you want the bars to be stacked.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])
# using stacked bars
df.plot.barh(stacked=True)
Histograms
To generate a histogram use ‘DataFrame.plot.hist()’. Pass the argument ‘bins’ specifying how many bins you want.
Example – df.plot.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.hist(bins=20)
To plot separate histograms for all your inputs, use your DataFrame name followed by ‘.hist()’. Pass the argument ‘bins’ specifying how many bins you want.
Example- df.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.hist(bins=20)
To plot a single histogram for any one of your columns, pass the column name in square brackets followed by '.hist()'.
Example- df[‘A’].hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df['A'].hist(bins=20)
Scatter plot
Scatter plot can be created using DataFrame.plot.scatter() method.
Example- df.plot.scatter()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.scatter(x='A', y='B')
Pie chart
To generate a pie chart use ‘.plot.pie()’
Example – df.plot.pie()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])
df.plot.pie(subplots=True)
Data Handling with Pandas..
- Reading from a CSV or an Excel file – Pandas provides two functions, read_csv() and read_excel(), to read data from a CSV and an Excel file respectively (see the example after this list).
- Viewing data – Viewing data from a data frame can be done in three ways:
- using the data frame’s name – returns the top and bottom 5 rows in the data frame.
- using dataframe.head() function
- using dataframe.tail() function
- To see more details on the data frame, the info() function can be used. info() gives an idea about what datatype each series in a data frame points to.
- The following functions are used to explore the unique entries within a series/column of a data frame.
- dataframe['column'].unique() – returns the unique values
- dataframe['column'].nunique() – returns the count of unique values
- dataframe['column'].value_counts() – returns the frequency of each of the categories in the column
- In our example, the titanic dataset contains a column called “Survived” which tells if the particular passenger survived the tragedy. Since this value could only be either 0 or 1, we can convert the data type from integer to object.
- dataframe.astype() is the function which lets us do the conversion
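A short sketch of these data handling steps, assuming a local file named 'titanic.csv' that contains a 'Survived' column (both the file name and the column are assumptions for illustration):
import pandas as pd

df = pd.read_csv('titanic.csv')        # assumed local file
print(df.head())                        # first 5 rows
print(df.tail())                        # last 5 rows
df.info()                               # data type of each column
print(df['Survived'].unique())          # unique values in the column
print(df['Survived'].nunique())         # number of unique values
print(df['Survived'].value_counts())    # frequency of each category
df['Survived'] = df['Survived'].astype(object)   # convert int to object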
Missing Values – Identification and Imputation..
How to identify missing values?
Pandas provides the following three functions to find out whether the data frame has missing or null values (a short sketch follows this list).
- dataframe.isna()
- dataframe.isnull()
- dataframe.isna().sum() – gives the count of NA’s in each column/series of the dataframe.
- dataframe.isna().sum().sum() – gives the count of NA's in the whole dataframe.
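A minimal sketch, using the same assumed 'titanic.csv' file:
import pandas as pd

df = pd.read_csv('titanic.csv')     # assumed local file
print(df.isna().head())             # True/False mask of missing values
print(df.isna().sum())              # count of NA's per column
print(df.isna().sum().sum())        # count of NA's in the whole data frame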
Imputation – Drop or replace??
- Pandas provides the following functions to deal with imputation (sketched briefly after this list).
- dataframe.dropna()
- dataframe.fillna()
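A brief sketch of both options (again using the assumed 'titanic.csv' file; the 'Age' column is also an assumption):
import pandas as pd

df = pd.read_csv('titanic.csv')                  # assumed local file
df_dropped = df.dropna()                          # drop every row that contains an NA
df_filled = df.fillna(0)                          # replace every NA with a constant
df['Age'] = df['Age'].fillna(df['Age'].mean())    # or impute a column with its mean (assumed 'Age' column)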
Indexing & Filtering in pandas
- We can access any row in a dataframe using the following indexers:
- dataframe.loc[] – returns rows based on the label of the index.
- dataframe.iloc[] – returns rows based on the integer position of the index.
- We can filter out the required data with the help of '[]', as shown in the example below.
- ‘&’ is used when the dataframe has to be filtered by multiple conditions
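A brief sketch using the small employee data from the merge examples above:
import pandas as pd

df = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John', 'Cena'],
                   'Age': [27, 24, 22, 32]},
                  index=['a', 'b', 'c', 'd'])
print(df.loc['b'])        # row whose index label is 'b'
print(df.iloc[1])         # row at integer position 1
print(df[df['Age'] > 25])                              # filter with a single condition
print(df[(df['Age'] > 25) & (df['Name'] != 'Cena')])   # combine conditions with '&'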
To Summarize
In this article we discussed the basics of Pandas, including creating data frames, handling missing values, and data retrieval methods. It is said that 80% of a data scientist's job is data handling and manipulation. So, if you choose to go with Python for your ML project, it is very important that you know how Pandas operates.