What is Data Science?
Data science is the process of extracting meaningful information from massive amounts of data. In simple terms, it means reading and studying data to draw useful, intuitive insights. Data science combines various tools, algorithms, and machine learning and deep learning concepts to discover hidden patterns in raw, unstructured data.
Why do we need Data Science?
In the past, most data was available in a structured format, but as the volume of data keeps increasing, the share of structured data keeps shrinking. Unstructured and semi-structured data are collected from many different sources, so we cannot assume the data will arrive in a clean, consistent format.
Our conventional systems cannot cope with such massive amounts of unstructured data, and this is the problem data science emerged to solve. Let's look at an estimate of how much data will be semi-structured or unstructured in the coming years:
According to industry estimates, 80-90% of data will be unstructured because of the significant growth of the industry.
Applications of Data Science
Some of the popular applications of data science are:
Product Recommendation
Product recommendation has become one of the most popular techniques for influencing customers to buy related products. Let's see an example.
Suppose a salesperson at Big Bazaar is trying to increase the store's sales by bundling products together and offering discounts on the bundles. He bundles a shampoo and a conditioner and offers a discount on the pair. Customers are likely to buy them together because they get both at a discounted price.
Future Forecasting
Predictive analysis is one of the most widely used areas of data science. We are all familiar with weather forecasting, which predicts future conditions from various types of data collected from many sources. In the same way, we might want to forecast COVID-19 cases to get an overview of the coming days during a pandemic.
On the basis of the collected data and data science techniques, we can forecast such future conditions.
Fraud and Risk Detection
As online transactions increase, so does the risk of losing your personal data. One of the most valuable applications of data science is fraud and risk detection.
For example, credit card fraud detection depends on the amount, merchant, location, time, and other variables as well.
If any of these signals looks unusual, the transaction will be automatically cancelled and your card will be blocked for 24 hours or more.
Self-Driving Car
In today's world, the self-driving car is one of the most remarkable inventions. Based on previous data, we train the car to make decisions on its own, and we can penalize the model whenever it performs poorly.
The car (the model) becomes more intelligent over time as it keeps learning from real-time experience.
Image Recognition
When you want to recognize an image, data science can detect the object in it, classify it, and then recognize it. The most popular example of image recognition is the face-recognition feature on our smartphones.
First, the system detects a face, then it classifies it as a human face, and finally it decides whether the phone belongs to the actual owner or not.
Quite interesting, right? Data science has plenty of exciting applications to work on.
Speech-to-Text Conversion
Speech recognition is the process by which computers understand natural spoken language. We are all familiar with Google Assistant, but have you ever wondered how it works?
Google Assistant first recognizes our speech and then converts it into text with the help of speech-recognition algorithms.
Isn’t it so exciting? Let’s try to look at all the technologies involved in building these amazing applications.
What are the Components of Data Science?
1. Statistics: Statistics is used to analyze large amounts of data and draw out insights about its essential components.
2. Mathematics: Mathematics is the most critical part of data science. It is used to study structure, quantity, space, and change in data. Every aspiring data scientist needs a solid grounding in mathematics to build meaningful insights from data.
3. Visualization: Visualization represents data and its insights in a visual format, which helps us understand huge volumes of data at a glance.
4. Data engineering: Data engineering helps us acquire, store, retrieve, and transform data, and it also covers managing metadata.
5. Domain Expertise: Domain expertise is specialized knowledge of the field the data comes from, and it helps us interpret the data and the results correctly.
6. Advanced computing: Advanced computing covers designing, writing, debugging, and maintaining the source code of computer programs.
7. Machine learning: Machine learning is the most essential part of data science. It helps identify the right features and build accurate predictive models.
Now that we have a rough idea of the important domains in data science, let's have a look at the tools used.
Tools used in Data Science
The main advantage of these tools is that they do not require much explicit programming; they come with predefined functions and algorithms that are easy to use.
They can be divided into four categories:
- Data Storage
- Exploratory Data Analysis
- Data Modelling
- Data Visualization
- Data Storage:
- Apache Hadoop
- Microsoft HD Insights
- Exploratory Data Analysis: EDA is an approach for analyzing large datasets to summarize their main characteristics.
- Informatica
- SAS
- MATLAB
- Data modelling: Data modelling tools come with inbuilt ML algorithms. All you need to do is pass the processed data to train your model.
- H20.ai
- BigML
- DataRobot
- Scikit Learn
- Data Visualization: Once all the processes are complete, we need to visualize our data to find all the insights and hidden patterns from it. We also need to prepare reports.
- Tableau
- Matplotlib
- Seaborn
Life cycle of Data Science
- Understand the business requirement
- Collection of data (Data Mining)
- Data pre-processing
- Data cleaning
- Data Exploration (EDA)
- Build Model
- Feature engineering
- Model Training
- Model Evaluation
- Data Visualization
- Deploy the model
Understand the business requirement
Suppose you are a doctor: every day you work with patients presenting new symptoms, and your job is to figure out the root cause of the problem and prescribe a proper solution.
As a data scientist, the very first thing you do is the same: understand the root cause of the problem. To frame the problem, we have to answer a few questions:
- How much or how many? (regression)
- Which category does the problem belong to? (classification)
- Which group does the data fall into? (clustering)
- Is this normal? (anomaly detection)
- Which option should we go for? (recommendation)
In this phase, you should find out the problem’s objectives and the variables that need to be predicted.
- If the problem you are solving is, say, weather forecasting, you would choose regression, because regression analysis predicts a continuous value.
- Other problems may require clustering customers of the same type together in order to understand them.
Let us discuss these points in detail.
Collection of Data (Data Mining):
Now that we have an idea about the objectives, we must gather data that needs to be analyzed. Through the data mining process, we can collect relevant data from massive pools of data and find hidden patterns. Data mining is also known as knowledge discovery.
What are the types of data mining?
1. Classification:
Classification is used to retrieve important and relevant information from data and metadata. This process helps sort the data into different classes.
2. Clustering:
Clustering analysis is used to group data points that are most similar to each other. This technique helps us understand the differences and similarities within the data.
3. Regression:
Regression analysis helps us identify and analyze the relationships between variables, and to estimate the likely value of one variable given the values of the others.
4. Association Rules:
Association rules help to find the association between two or more items. It discovers the hidden pattern in the dataset.
5. Outlier detection:
Outlier detection, also called outlier analysis or outlier mining, is used to find the data points that are not similar to most of the others. This technique is commonly used to detect fraud; a short sketch of it appears after this list.
6. Sequential Patterns:
It helps to find the sequence of the data. We need sequential data for most of the text processing.
7. Prediction:
Prediction combines the other data mining techniques, such as trends, sequential patterns, clustering, and classification, and analyzes past data patterns to predict the future.
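As promised above, here is a minimal, hypothetical sketch of outlier detection using scikit-learn's IsolationForest; the transaction amounts and parameters are made up purely for illustration:
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up transaction amounts with one obvious outlier
amounts = np.array([[25], [30], [27], [22], [31], [29], [5000], [26]])

model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(amounts)  # -1 marks an outlier, 1 marks a normal point
print(labels)  # the 5000 transaction is expected to be flagged with -1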
Data preprocessing
Data cleaning: After collecting the data, we need to clean it before further use. This is the most time-consuming step, because there are so many ways in which the data can still be noisy.
Let's have a look at some examples:
- Data can be inconsistent within the same column. Some of the data can be labelled as 0 or 1, and some of them as ‘yes’ or ‘no’
- Data types can be inconsistent
- Categorical values can be written incorrectly. Example: Male, Female; or male, female
There are so many problems you will deal with in this step, which is why it is considered the most time-consuming part of the process.
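To make this concrete, here is a minimal, hypothetical pandas sketch of the kind of clean-up described above; the column names and values are made up purely for illustration:
import pandas as pd

# Hypothetical messy columns: mixed 0/1 and 'yes'/'no' labels, inconsistent capitalization
df = pd.DataFrame({'purchased': ['yes', 'no', 1, 0, 'Yes'],
                   'gender': ['Male', 'female', 'FEMALE', 'male', ' Female ']})

# Standardize the labels to a single convention
df['purchased'] = df['purchased'].astype(str).str.lower().map({'yes': 1, 'no': 0, '1': 1, '0': 0})
df['gender'] = df['gender'].str.strip().str.lower()

print(df)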
EDA (Exploratory Data Analysis): After this process, you will get clean data to work on. In this phase, we will try to analyze our data using some techniques.
Using all of the previous information, you are ready to assume hypotheses about your data and the problem.
Example: Suppose you are trying to figure out the causes of obesity based on food habits. You would form a hypothesis about which habits matter and then test it against the data.
Building the Model
- Feature extraction: One of the most essential steps before you build your model is feature extraction. The base features you choose largely decide how your model is going to perform, so choose them wisely. Feature selection removes the features that add more noise than information, and feature extraction helps avoid the curse of dimensionality, which makes models needlessly complex. (A short sketch of these steps appears after this list.)
- Train the model: Suppose you are making a cake and have all the ingredients ready; all you need to do is mix them properly and bake it.
Training the model is much the same: you pass the prepared data to a suitable algorithm to train your model.
- Model Evaluation: In model evaluation, you test your model on a new set of data it has never seen. Once it performs well there, the model is ready to predict on unknown data.
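Here is a minimal, hypothetical end-to-end sketch of these three steps on a toy dataset; the feature selector and the regressor are illustrative choices, not the only ones you could use:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy data: 10 features, only 4 of which carry real signal
X, y = make_regression(n_samples=500, n_features=10, n_informative=4, noise=5, random_state=0)

# Feature selection: keep the 4 most informative features
X_selected = SelectKBest(score_func=f_regression, k=4).fit_transform(X, y)

# Train the model on 80% of the data
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the unseen 20%
print(mean_absolute_error(y_test, model.predict(X_test)))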
Data Visualization
Last but not least, we need to visualize our data with the help of data visualization tools. Before that, let's take a brief look at machine learning, since it makes up one of the major parts of data science.
There are three types of Machine Learning:
- Supervised ML
- Unsupervised ML
- Reinforcement learning ML
What is supervised learning?
As the name suggests, supervised learning works like a supervisor or teacher. In supervised learning, we train the machine with labeled data (data tagged with a predefined class). We then test the model on a new, unseen set of data and predict the labels for it.
What is unsupervised learning?
Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you need to allow the model to work on its own to discover information. It mainly deals with unlabeled data.
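As a small, hypothetical illustration of unsupervised learning, the sketch below groups unlabeled points with k-means; the data and the number of clusters are assumptions made for this example:
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled, made-up 2-D points forming two loose groups
points = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]])

# Ask k-means to discover 2 clusters on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # a cluster assignment for each point, learned without any labels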
What is Reinforcement Learning?
Reinforcement learning is about taking suitable action to maximize reward in a particular situation. It is used to define the best sequence of decisions that allow the agent to solve a problem while maximizing a long-term reward.
Let’s discuss a few algorithms under supervised and unsupervised machine learning.
What is regression?
Regression is a supervised technique that predicts the value of variable ‘y’ based on the values of variable ‘x’.
In simple terms, regression helps us find the relation between two things.
For example:
As winter comes and the temperature drops, jacket sales start increasing. Thus, we can conclude that jacket sales depend on the season.
What is Linear Regression?
Linear regression is a supervised algorithm used to find a linear relationship between independent and dependent variables; it works with two or more continuous variables.
This algorithm is mostly used in forecasting and prediction. Because it models a linear relationship between the input and output variables, it is called linear regression.
The equation for linear regression is:
Y = MX + C
where Y is the dependent variable, X is the independent variable, M is the slope, and C is the intercept.
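A minimal sketch of fitting this equation in Python, using made-up x and y values, might look like this:
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that roughly follows y = 2x + 1
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # approximately the slope M and the intercept C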
What is Logistic regression?
Logistic regression is a simple approach to classification problems. It is named after the logistic function it uses, which is also called the sigmoid function.
The sigmoid has an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1.
sigmoid(value) = 1 / (1 + e^(-value))
Let’s take a real-life example: Consider that we need to classify whether an email is a spam email or not.
If: the email is spam → 0; the email is not spam → 1
In that case, we have to specify a threshold value to get the result.
If our prediction is close to 1, the email is not spam; if it is close to 0, the email is spam.
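The sigmoid and the threshold step can be sketched in a few lines of Python; the score value and the 0.5 cut-off here are assumptions made for illustration:
import numpy as np

def sigmoid(value):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-value))

score = 2.3                   # hypothetical model output for one email
probability = sigmoid(score)  # roughly 0.91
label = "not spam" if probability >= 0.5 else "spam"  # using the mapping spam = 0, not spam = 1
print(probability, label)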
Case study: PUBG Data analysis
In this tutorial, we will perform data analysis on the PUBG dataset.
What is PUBG?
PUBG stands for PlayerUnknown's Battlegrounds. The game is essentially a battle royale, similar to The Hunger Games: you start with nothing and, as time goes by, you scavenge and collect weapons and equipment. It is ultimately a fight to be the last player standing among 100 players on an 8 x 8 km island. The game modes are Solo, Duo, and Squad.
To perform the analysis, we will download the data from Kaggle; here is the source for the data. Let's have a look at the data description, which was taken from Kaggle itself.
Feature descriptions (From Kaggle)
- DBNOs – Number of enemy players knocked.
- assists – Number of enemy players this player damaged that were killed by teammates.
- boosts – Number of boost items used.
- damageDealt – Total damage dealt. Note: Self inflicted damage is subtracted.
- headshotKills – Number of enemy players killed with headshots.
- heals – Number of healing items used.
- Id – Player’s Id
- killPlace – Ranking in match of number of enemy players killed.
- killPoints – Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- killStreaks – Max number of enemy players killed in a short amount of time.
- kills – Number of enemy players killed.
- longestKill – Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- matchDuration – Duration of match in seconds.
- matchId – ID to identify match. There are no matches that are in both the training and testing set.
- matchType – String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- rankPoints – Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- revives – Number of times this player revived teammates.
- rideDistance – Total distance traveled in vehicles measured in meters.
- roadKills – Number of kills while in a vehicle.
- swimDistance – Total distance traveled by swimming measured in meters.
- teamKills – Number of times this player killed a teammate.
- vehicleDestroys – Number of vehicles destroyed.
- walkDistance – Total distance traveled on foot measured in meters.
- weaponsAcquired – Number of weapons picked up.
- winPoints – Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- groupId – ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- numGroups – Number of groups we have data for in the match.
- maxPlace – Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- winPlacePerc – The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
I hope this gives you a brief idea of what the game is about and what the dataset contains.
So let’s divide the whole project into a few parts:
- Load the dataset
- Import the libraries
- Clean the data
- Perform Exploratory Data analysis
- Perform Feature engineering
- Build a Linear regression model
- Predict the model
- Visualize actual and predicted value using matplotlib and seaborn library
- Load the dataset: We will load the dataset from Dropbox. We have already uploaded the Kaggle dataset to Dropbox because it is easier to fetch it from there.
https://www.dropbox.com/s/kqu004pn2xpg0tr/train_V2.csv
To fetch the dataset from Dropbox, we use the !wget command followed by the link:
!wget https://www.dropbox.com/s/kqu004pn2xpg0tr/train_V2.csv
!wget https://www.dropbox.com/s/5rl09pble4g6dk1/test_V2.csv
So our dataset is divided into two parts:
- train_V2.csv
- test_V2.csv
- Import the libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import gc
import os
import sys
%matplotlib inline
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression,LinearRegression
2. Use a memory saving function:
As the dataset is very large, we use a memory saving function that reduces memory usage by downcasting numeric columns to smaller types.
The function is taken from Kaggle itself:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ Iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB --> {:.2f} MB (Decreased by {:.1f}%)'.format(
        start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
3. Store the training data and use memory saving function to reduce memory usage:
train_data=pd.read_csv("train_V2.csv")
train_data= reduce_mem_usage(train_data)
train_data → This is the variable which holds the training part of the dataset
Output: Memory usage of dataframe is 983.90 MB --> 339.28 MB (Decreased by 65.5%)
4. Store the test data and use memory saving function to reduce memory usage:
test_data=pd.read_csv("/content/test_V2.csv")
test_data= reduce_mem_usage(test_data)
test_data → This is the variable which holds the testing part of the dataset
Output: Memory usage of dataframe is 413.18 MB --> 140.19 MB (Decreased by 66.1%)
5. Now we will check the shape of the training and testing datasets:
- The shape of training dataset:
- Input: train_data.shape
- Output: (4446966, 29)–> 4446966 rows and 29 columns
- The shape of the testing dataset:
- Input: test_data.shape
- Output: (1934174, 28)–> 1,934,174 rows and 28 columns
6. Print the training data: print the top 5 rows
train_data.head()
The head() method returns the first five rows of the dataset.
7. Print the testing data: print the top 5 rows
test_data.head()
Data cleaning:
- Checking the null values in the dataset: train_data.isna().any()
Output:
Id                 False
groupId            False
matchId            False
assists            False
boosts             False
damageDealt        False
DBNOs              False
headshotKills      False
heals              False
killPlace          False
killPoints         False
kills              False
killStreaks        False
longestKill        False
matchDuration      False
matchType          False
maxPlace           False
numGroups          False
rankPoints         False
revives            False
rideDistance       False
roadKills          False
swimDistance       False
teamKills          False
vehicleDestroys    False
walkDistance       False
weaponsAcquired    False
winPoints          False
winPlacePerc        True
dtype: bool
So from the output, we can conclude that no column has null values except winPlacePerc.
Get the percentage of null values in each column:
null_columns=pd.DataFrame({'Columns':train_data.isna().sum().index,'No. Null values':train_data.isna().sum().values,'Percentage':train_data.isna().sum().values/train_data.shape[0]})
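To inspect the result, you can simply display the DataFrame, for example sorted by the percentage of null values (this extra line is just a convenience, not part of the original code):
null_columns.sort_values('Percentage', ascending=False).head()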
Exploratory Data Analysis:
Get the Statistical description of the dataset:
train_data.describe()
Now we will find the unique ID we have in the dataset:
- Input: train_data[“Id”].nunique()
- Output:4446966
The .nunique() function returns the number of unique values in a column.
Now we will find the unique group ID and match ID we have in the dataset:
- Input: train_data[“groupId”].nunique()
- Output: 2026745
- Input: train_data[“matchId”].nunique()
- Output: 47965
Match Type in the Game
There are 3 game modes in the game.
- One can play solo,
- with a friend (duo),
- or with 3 other friends (squad).
Input:
train_data.groupby(["matchType"]).count()
We use the groupby() function to group the data based on the specified column.
Visualize the data using Python's plotting libraries:
fig, ax = plt.subplots(figsize=(12, 4))
train_data.groupby('matchId')['matchType'].first().value_counts().plot.bar(ax=ax)
We know PUBG has three basic types of match, but the dataset contains more, right?
That is because PUBG also has perspective modes, FPP (first-person perspective) and TPP (third-person perspective), plus event and custom matches. To simplify the analysis, we map the data down to the three basic match types.
Map the match types with a helper function:
Input:
new_train_data = train_data

def mapthematch(data):
    # Collapse every match type (fpp/tpp/event variants) into solo, duo, or squad
    mapping = lambda y: 'solo' if ('solo' in y) else 'duo' if ('duo' in y) or ('crash' in y) else 'squad'
    data['matchType'] = data['matchType'].apply(mapping)
    return data

data = mapthematch(new_train_data)
data.groupby('matchId')['matchType'].first().value_counts().plot.bar()
So we have mapped our data into the three basic match types.
Find the invalid match:
Input:
data[data['winPlacePerc'].isnull()]
This returns the row where winPlacePerc is null; we drop that row because its data is not valid.
data.drop(2744604, inplace=True)
Display histograms of the data. First, visualize the match duration:
data['matchDuration'].hist(bins=50)
Team kills based on Match Type
- Solo
- Duo
- Squad
Input:
d=data[['teamKills','matchType']]
d.groupby('matchType').hist(bins=80)
Normalize the columns:
data['killsNormalization'] = data['kills']*((100-data['kills'])/100 + 1)
data['damageDealtNormalization'] = data['damageDealt']*((100-data['damageDealt'])/100 + 1)
data['maxPlaceNormalization'] = data['maxPlace']*((100-data['maxPlace'])/100 + 1)
data['matchDurationNormalization'] = data['matchDuration']*((100-data['matchDuration'])/100 + 1)
Let’s compare the actual and normalized data:
New_normalized_column = data[['Id','matchDuration','matchDurationNormalization','kills','killsNormalization','maxPlace','maxPlaceNormalization','damageDealt','damageDealtNormalization']]
Feature Engineering:
Before we apply feature engineering, let’s see what it is.
Feature engineering is the process of creating new features from existing data, which helps us understand the data more deeply.
Create new features:
# Create new feature healsandboosts
data['healsandboostsfeature'] = data['heals'] + data['boosts']
data[['heals', 'boosts', 'healsandboostsfeature']].tail()
Total distance travelled:
data['totalDistancetravelled'] = data['rideDistance'] + data['walkDistance'] + data['swimDistance']
data[['rideDistance', 'walkDistance', 'swimDistance', 'totalDistancetravelled']].tail()
# headshot_rate feature (kills can be 0, which makes the ratio NaN, so we fill those with 0)
data['headshot_rate'] = data['headshotKills'] / data['kills']
data['headshot_rate'] = data['headshot_rate'].fillna(0)
data['headshot_rate']
Now we will split our training data into two parts:
- 80% to train the model
- 20% to test the model
- For validation, we will use test_V2.csv
x=data[['killsNormalization', 'damageDealtNormalization','maxPlaceNormalization', 'matchDurationNormalization','healsandboostsfeature','totalDistancetravelled']]
# the target variable we want to predict
y=data['winPlacePerc']
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)
Now create and train your own linear regression model:
linear=LinearRegression()
linear.fit(xtrain, ytrain)
After training, predict with your model using the .predict() function on the unseen test data:
y_pred=linear.predict(xtest)
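Before visualizing, it is worth checking how well the model performs; this is a minimal sketch using the metrics module imported earlier (the choice of error metrics is our own, not part of the original walkthrough):
# Evaluate the predictions on the held-out 20% split
print('MAE :', metrics.mean_absolute_error(ytest, y_pred))
print('MSE :', metrics.mean_squared_error(ytest, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(ytest, y_pred)))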
Lastly, we will visualize the actual and the predicted values of the model:
# Put the actual and predicted values side by side for plotting
df = pd.DataFrame({'Actual': ytest.values, 'Predicted': y_pred})
df1 = df.head(25)
df1.plot(kind='bar',figsize=(26,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
This brings us to the end of this article.