Contributed by: Mr. Ankush Bansal
LinkedIn profile: https://www.linkedin.com/in/ankush-bansal-84bab017/
This blog introduces the elementary steps in data analytics and brings different methodologies and techniques together into one framework for problem solving. Using a sample dataset from the UCI repository, we work through the classification problem of diabetes detection. So, let's get started.
1. Introduction to Machine Learning with Scikit-Learn
Machine learning is on the rise, and organisations across the world are trying to harness the power of data. As a result, multiple tools and software packages are being researched and developed to make analysis simpler and easier. Python is one of the favourites among data scientists as it offers rich libraries and tools for analysis.
Scikit-learn, or sklearn, is a free machine learning library for Python. It offers algorithms for clustering, regression and classification problems, such as k-means, random forests and regressions, and it works easily with other Python libraries such as NumPy and SciPy.
Read What is Machine Learning?
2. Prerequisites to Learn Scikit
Scikit is mainly for creating and building models, so one must have a basic understanding of the various supervised and unsupervised models, their evaluation metrics and the underlying mathematics. One should also be comfortable with the basics of Python programming and its other commonly used libraries.
3. Download and Install Scikit Learn
To use scikit-learn, the system is assumed to already have Python (preferably Python 3, since recent versions of scikit-learn no longer support Python 2.7), NumPy and SciPy; in fact, sklearn is built on top of SciPy itself.
Scikit is open-source software (BSD licensed), hence no paid license is required.
Sklearn can be installed using either the pip or conda command in the terminal (pip install scikit-learn or conda install scikit-learn).
Once installed, we can simply bring it into our notebook with the import statement (import is how any library is made available in a notebook). Note that the package is imported under the name sklearn:
import sklearn
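As a quick sanity check (a minimal sketch, assuming the installation above completed without errors), we can print the installed version:
import sklearn
print(sklearn.__version__)  # shows the installed scikit-learn version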
4. Loading an Example Dataset
There are several ways in which one can load a dataset. Scikit has some built-in datasets which can be used for practice exercises; details of these datasets can be found at https://scikit-learn.org/stable/datasets/index.html
To use these datasets, we can use the following commands:
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
df_diabetes = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
In case you have the data in a csv file on the local system, you can use the read_csv command from the pandas library (pandas is another important Python library, used quite often for dataframes and data manipulation).
Code:
import pandas as pd
diabetesData = pd.read_csv("Downloads/diabetes.csv")
(In case the data is being read from a URL, we can pass the link to the csv file directly to the read_csv command.)
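For example (a sketch only; the URL below is a hypothetical placeholder, not the actual location of this file):
# Reading the csv directly from a URL (placeholder URL for illustration)
url = "https://example.com/path/diabetes.csv"
diabetesData = pd.read_csv(url)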
The dataset used in this example can be found in this csv.
5. Data Exploration, Data Visualization
One of the first and most important steps in data analytics is to prepare and understand your data. Data exploration is done with the aim of finding anomalies in the data and checking whether any transformation or feature re-engineering would help predict/classify the target variable. We can also check for missing values, extreme values and outliers in the dataset.
As in other languages, we can use the head and tail commands to check that the data has been read properly in Python:
diabetesData.head()
Once we are sure that data is read correctly, we can begin Exploratory analysis.
diabetesData.describe()
This command gives a descriptive analysis of the variables; apart from the count, it also shows the five-point summary for all variables in the dataframe in one shot.
Here all variables are numeric; one can use the dtypes attribute to see each variable's type.
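For instance, a quick check of the column types (a small sketch):
# Inspect the data type of each column
print(diabetesData.dtypes)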
Missing values – these are another common problem with real-world datasets; we should always check for them and, where possible, impute them before modelling.
print (diabetesData.isnull().sum())
This command can be used to detect missing values. Thankfully, our dataset does not have any. In case missing values do exist, the mean, median or mode are quite popular imputation choices. Scikit-learn also provides a KNNImputer class for replacing missing values using the k nearest neighbours.
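A minimal sketch of KNN-based imputation, assuming scikit-learn 0.22 or later where KNNImputer is available (not actually needed for this dataset, which has no missing values):
from sklearn.impute import KNNImputer

# Replace any missing values with the average of the 5 nearest neighbours
imputer = KNNImputer(n_neighbors=5)
imputedData = pd.DataFrame(imputer.fit_transform(diabetesData),
                           columns=diabetesData.columns)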
Seaborn is a great visualization library in Python. We can use its pairplot function to see the distribution of each variable and its relationship with the other variables.
import seaborn as sn
sn.pairplot(diabetesData)
From the plot we can see that variables like Insulin and DiabetesPedigreeFunction have skewed distributions, while BMI and SkinThickness, and Glucose and Insulin, show a roughly linear relationship in certain patches.
sn.boxplot(data=diabetesData, orient="h")
This command gives a univariate summary of the data; a boxplot also helps to identify any outliers.
Looking at the boxplot, we can see that there are 0 values for BMI, Glucose and BloodPressure; these must be anomalies, as a value of 0 for these variables does not make much sense, so we drop those rows for now. Apart from these, Insulin shows outliers at the upper end as well.
In this process we have removed 44 rows, which is around 5% of the datapoints.
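A sketch of how that filtering might be done, assuming the Pima dataset column names Glucose, BloodPressure and BMI:
# Drop rows where Glucose, BloodPressure or BMI is recorded as 0
zeroCols = ['Glucose', 'BloodPressure', 'BMI']
diabetesData = diabetesData[(diabetesData[zeroCols] != 0).all(axis=1)]
print(diabetesData.shape)  # around 44 fewer rows than the original data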
We cannot do a similar removal for the Insulin or SkinThickness variables, as that would remove a lot of information from the dataset. Instead, we replace their 0 values with the median (other approaches are possible, but we consciously keep it simple here):
diabetesData['Insulin'] = diabetesData['Insulin'].replace(0, diabetesData['Insulin'].median())
diabetesData['SkinThickness'] = diabetesData['SkinThickness'].replace(0, diabetesData['SkinThickness'].median())
As a last visualization step, let us examine the correlation among the variables:
correlation = diabetesData.corr()
sn.heatmap(correlation, annot=True)
We can see that there is not much correlation among the variables.
6. Learning and Predicting
Before we start with machine learning, let us go back to the results of the descriptive analysis once more. Looking at the mean values, we can see that the variables are not on similar scales, so variables with large values would gain more importance in any distance-based calculation. We should therefore normalize the variables using a standard or min-max scaler so that all of them are on similar scales before modelling.
Once the independent variables are scaled, we can split the dataset into training and test portions. This helps to evaluate whether our model is good enough to use. Here we use 75% of the data to train the models and check their performance using the prediction accuracy on the remaining 25% held out as test data.
Code:
X = diabetesData.drop('Outcome', axis=1)
Y = diabetesData['Outcome']
These lines create the feature and target variables. Here our target is already in 0/1 format, so we don't need any separate label encoding.
Scaling code:
from sklearn.preprocessing import StandardScaler as Scaler
scaler = Scaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
For train and test set creation:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.25, random_state=1)  # split the scaled features
Here we have split the data into train and test sets in a 75:25 ratio. This can be changed depending on data availability. There are other sampling techniques, such as stratified sampling, which keeps the percentage of each target class the same in both sets, as sketched below.
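For example, train_test_split can stratify on the target directly (a sketch of an alternative split, using the same 75:25 ratio):
# Stratified split: keeps the proportion of 0s and 1s similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, Y, test_size=0.25, random_state=1, stratify=Y)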
Now our data is ready for modelling.
Scikit-learn supports many classification models; here we will work with logistic regression, random forest, KNN, naive Bayes, decision trees and boosting algorithms. In scikit, models are fitted on separate feature and target arrays. (This is a bit different from the traditional approach in R, where everything is often done in one line with all values in one dataframe. Here we need to create a model instance and then fit it on the features and target.)
For example, the code for logistic regression:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
accuracyTrain=lr.score(X_train,y_train)
accuracyTest=lr.score(X_test,y_test)
print(accuracyTrain,accuracyTest)
This gives an accuracy of 0.7863720073664825 on the train set and 0.7182320441988951 on the test set.
Note: Training accuracy does not mean much in itself; since the model is trained on the same data, we expect it to score well there. What matters is the comparison of train and test accuracy, and here they do not differ much. As a rule of thumb, if this difference is more than about 10% we say the model is overfit (overfitting is the condition where the model has imitated the training data so closely that its predictions on unseen data are unreliable); otherwise the model can be used for prediction on unseen data.
Here the accuracy score is calculated as the ratio of correct predictions to the total number of samples.
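Equivalently (a small sketch), the same number can be computed from explicit predictions with accuracy_score:
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_test)            # predicted classes on the test set
print(accuracy_score(y_test, y_pred))  # same value as lr.score(X_test, y_test)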
The code for the other models is essentially the same; we just change the model type and then fit it on the training data. Here are the corresponding imports for the models:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
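Putting this together, here is a sketch that fits each model with default parameters and prints its train and test accuracy (the exact numbers may differ slightly from the table below depending on library versions and random seeds):
models = {
    'Logistic': LogisticRegression(),
    'RF': RandomForestClassifier(random_state=1),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'GBM': GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)                   # fit on the training data
    print(name, model.score(X_train, y_train),    # train accuracy
          model.score(X_test, y_test))            # test accuracy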
The table below gives the accuracy scores (in %) on the train and test sets for the different models:
Model | Train | Test
Logistic | 79 | 72
RF | 97 | 71
KNN | 80 | 70
Naïve Bayes | 76 | 72
Decision Tree | 77 | 70
GBM | 93 | 73
From the accuracy figures we can clearly see that RF and GBM are overfit. Logistic regression and KNN are better models even though their accuracy is lower, because they do not show a large gap between the train and test sets; we can expect more predictable results from these two algorithms.
Sklearn also provides K-fold cross validation, which can help reduce the problem of overfitting. Instead of manually splitting the dataset into train and test sets, we specify a value of K, say K=10. The data is then split into 10 parts; the model is trained on 9 parts and validated on the remaining part. In the second round the model is again trained on 9 parts, this time including the part left out in the first iteration and leaving out another, and it is validated on the newly left-out part. This is repeated K times and the accuracy is averaged over the iterations, giving a more reliable estimate because the model is validated on multiple subsets of the data.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

gb = GradientBoostingClassifier()  # the gradient boosting model used above
kfold = KFold(n_splits=10, shuffle=True, random_state=10)
score = cross_val_score(gb, X, Y, cv=kfold, scoring='accuracy').mean()
print(score)
This gives an overall accuracy of 77%, against the earlier test accuracy of 71%.
One can also check the confusion matrix for the model output:
from sklearn.metrics import confusion_matrix
conMat = confusion_matrix(y_test,lr.predict(X_test))
sn.heatmap(conMat, annot=True)
This shows diagrammatically, for each class, how many observations were predicted correctly and how many were not.
For example, for class 0, out of 112 observations, 100 were predicted correctly and 12 were predicted as 1.
A more detailed analysis of the results can be seen in the classification report:
from sklearn.metrics import classification_report
print(classification_report(y_test,lr.predict(X_test)))
Here we can see the following terms in the output:
Precision
Recall
F1 score
Support
What are these terms, and how do they relate to the model output?
1. Precision – the ratio of correct predictions for a class to the total predictions made for that class. Here, for class 0 it is 100/139 ~ 72% and for class 1 it is 30/42 ~ 71%.
2. Recall – this denotes, out of all actual values of the class, how many are correctly predicted. Here, for class 0, 100 of 112 are correctly predicted, so recall is 100/112 ~ 89%. For class 1, 30 of the 69 actual positives are correctly predicted, giving recall = 30/69 ~ 43%.
3. F1 score – this is based on the above two metrics; its formula is 2*Precision*Recall/(Precision+Recall). For class 0, for example, this gives 2*0.72*0.89/(0.72+0.89) ~ 0.80.
4. Support – simply the number of occurrences of each class in the test set.
We may choose to prioritize a high precision value when working on medical problem statements, whereas in a marketing domain we might need high recall values. Which criterion matters depends on the data and, again, on the problem statement.
The accuracy of the models can be improved further by tuning the model hyperparameters; grid search (GridSearchCV) is a function that can help with this. One can also opt for other model evaluation techniques such as ROC curves, the Gini index, etc.
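A minimal sketch of grid search on the KNN model; the parameter grid below is illustrative, not a tuned choice:
from sklearn.model_selection import GridSearchCV

# Illustrative (assumed) grid of neighbour counts to try
paramGrid = {'n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), paramGrid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)  # best parameter setting and its CV accuracy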
7. Model Persistence
The data taken here for demo purposes was quite small, with around 800 rows only, but in real-world problems we may get huge datasets. Training and testing take time on such datasets, and we may not want to retrain/refresh the model that often. So we have the option of saving the model we built for later use, even after the notebook is closed.
We can use the dump and load commands from the joblib library (installed alongside scikit-learn; in older versions it was available as sklearn.externals.joblib) to save and reuse the model at a later point in time.
import joblib
joblib.dump(gb, 'boostModel.pkl')
Here we have saved the boosting model.
Now, in order to reuse it, we can load it with the following command:
recModel = joblib.load('boostModel.pkl')
If we now compute the accuracy on the test set with recModel, we get the same accuracy for the boosting model as before.
In this blog, I have tried to cover the basics of any modelling exercise, highlighting the main steps, which include EDA and data cleaning/pre-processing.
Building various models, K-fold cross validation, interpreting model output and saving/reusing models are a few of the other topics touched on. Hopefully it provides some insight and clarity to anyone seeking to build up data analytics skills. Do post your views on the blog.
Additionally, check out AIML courses on Great Learning (PGP- Machine Learning, PGP-Artificial Intelligence and Machine Learning)