What is data preprocessing?
For machine learning, we need data. Lots of it. The more we have, the better our model. Machine learning algorithms are data-hungry. But there’s a catch. They need data in a specific format.
In the real world, terabytes of data are generated by countless sources, but not all of it is directly usable. Audio, video, images, text, charts, and logs all contain data, but this data needs to be cleaned and converted into a usable format before machine learning algorithms can produce meaningful results.
The process of cleaning raw data so that it can be used for machine learning is known as data preprocessing. It's the first and foremost step in a machine learning project, and it is generally the most time-consuming phase as well.
Why data preprocessing?
Real-world data is often noisy, incomplete with missing entries, and more often than not unsuitable for direct use for building models or solving complex data-related problems. There might be erroneous data, or the data might be unordered, unstructured, and unformatted.
These issues render the collected data unusable for machine learning purposes. The same data, once formatted and cleaned, produces more accurate and reliable results when fed to machine learning models than its unprocessed counterpart.
Data pre-processing steps
Data preprocessing involves several stages or steps, listed below –
- Data Collection
- Data Import
- Data Inspection
- Data Encoding
- Data Interpolation
- Data Splitting into train and test sets
- Feature Scaling
Data Collection
Data collection is the stage in which we gather data from various sources. Data may be lying across several storage systems or servers, and we need to bring it together in a single location for ease of access.
Data is present in many formats, so we need to devise a common format for data collection. All the required data should be converted to a single format so that common operations can be performed on it. Data from chat servers is often in JSON, while data from business applications is generally tabular. So, if we want to use both kinds of data, we need to convert everything to JSON, or everything to CSV or XLSX. Sometimes data is also present as HTML text, which likewise needs to be cleaned.
Data Import
Data import is the process of bringing data into software such as R or Python for cleaning. Sometimes the data is so large that we have to take special care when importing it into the processing server or software. Tools like pandas, Dask, NumPy, and Matplotlib are handy when operating on such huge volumes of data.
Pandas
pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.
Installing pandas
Download Anaconda for your operating system and the latest Python version, run the installer, and follow the steps. Please note:
You do not need to (and should not) install Anaconda as root or administrator. When asked whether you wish to initialize Anaconda3, answer yes. Restart the terminal after completing the installation. Detailed instructions on how to install Anaconda can be found in the Anaconda documentation.
In the Anaconda prompt (or terminal in Linux or macOS), start JupyterLab:
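jupyter lab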
Importing pandas
In JupyterLab, create a new (Python 3) notebook:
In the first cell of the notebook, you can import pandas and check the version with:
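import pandas as pd
pd.__version__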
Now we are ready to use pandas, and you can write your code in the next cells.
NumPy
NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
Importing NumPy
To import NumPy and check that it's installed, use the following code.
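import numpy as np
print(np.__version__)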
Here we imported NumPy and gave it an alias np. The alias np is further used to refer to NumPy.
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.
Importing matplotlib
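import matplotlib
print(matplotlib.__version__)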
Here we imported matplotlib and printed its version. This is a good check to verify that matplotlib is installed correctly.
Dask
Analysts often use tools like Pandas, Scikit-Learn, Numpy, and the rest of the Python ecosystem to analyze data on their personal computers. They like these tools because they are efficient, intuitive, and widely trusted. However, when they choose to apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. And so, the analyst rewrites their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.
Dask provides ways to scale pandas, scikit-learn, and NumPy workflows more natively, with minimal rewriting. It integrates well with these tools, copying most of their APIs and using their data structures internally.
Dask Installation
To install Dask in our existing conda environment, we open the Anaconda prompt as before and execute the following command. We need to install Dask explicitly if it is not already included in our Anaconda installation.
conda install dask
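Once installed, Dask can be used in place of pandas with almost no code changes. A minimal sketch (assuming the same Boston.csv file used later in this article):

import dask.dataframe as dd
# dask.dataframe mirrors the pandas read_csv API but works on partitioned, larger-than-memory data
ddf = dd.read_csv("Boston.csv")
print(ddf.head())   # reads just enough of the file to show the first rows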
Having installed all these libraries, let's load a sample dataset and see how it's done in Python.
We use a common dataset, Boston.csv (the Boston housing dataset).
import pandas as pd
import numpy as np
df=pd.read_csv("Boston.csv")
print(df)
Output:
     Unnamed: 0     crim    zn  indus  chas  ...  tax  ptratio   black  lstat  medv
0             1  0.00632  18.0   2.31     0  ...  296     15.3  396.90   4.98  24.0
1             2  0.02731   0.0   7.07     0  ...  242     17.8  396.90   9.14  21.6
2             3  0.02729   0.0   7.07     0  ...  242     17.8  392.83   4.03  34.7
3             4  0.03237   0.0   2.18     0  ...  222     18.7  394.63   2.94  33.4
4             5  0.06905   0.0   2.18     0  ...  222     18.7  396.90   5.33  36.2
..          ...      ...   ...    ...   ...  ...  ...      ...     ...    ...   ...
501         502  0.06263   0.0  11.93     0  ...  273     21.0  391.99   9.67  22.4
502         503  0.04527   0.0  11.93     0  ...  273     21.0  396.90   9.08  20.6
503         504  0.06076   0.0  11.93     0  ...  273     21.0  396.90   5.64  23.9
504         505  0.10959   0.0  11.93     0  ...  273     21.0  393.45   6.48  22.0
505         506  0.04741   0.0  11.93     0  ...  273     21.0  396.90   7.88  11.9
[506 rows x 15 columns]
First, we import pandas. Then we use pandas' read_csv() function to read the file into memory.
We passed only the dataset name to read_csv() because the file is in the same directory as the Python script. Had it been elsewhere, we would have passed the full path to the file.
Once the line containing read_csv() executes, the file is read and the contents of Boston.csv are loaded into a DataFrame named df.
To verify that the file has been loaded correctly, we can use df.head(), which displays the first five rows of the dataset by default.
In a similar manner, pandas.read_json() can be used to read a dataset in JSON format, and pandas.read_table() can be used to read a delimited text file.
Data Inspection
After the data is imported, it is inspected for missing values, and several sanity checks are performed to ensure the consistency of the data. Domain knowledge comes in handy in such scenarios.
Checking for missing data
To check for missing data, we look for rows and columns that contain null or missing values.
If any are found, we have to make decisions based on the scenario and our intuition.
Again, domain knowledge comes in handy when deciding the importance of certain columns.
It is generally considered good practice to discard a column entirely if more than 40 percent of its data is missing.
If the percentage of missing data is smaller than that, various interpolation and replacement techniques can be employed to fill in the gaps. The most common is replacing nulls with a measure of central tendency: the mean, median, or mode.
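As a quick illustration, here is a minimal sketch of both rules with pandas (the 40 percent threshold and the median fill are simply the choices discussed above; df is assumed to be an already loaded DataFrame):

# drop columns in which more than 40 percent of the values are missing
df = df.loc[:, df.isna().mean() <= 0.4]
# fill the remaining missing numeric values with each column's median
df = df.fillna(df.median(numeric_only=True))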
Statistical significance tests can also be used to decide which columns to keep and which to drop while building the model, but that's a story for another time.
Implementation
The isna() function is used to check for null values in pandas.
import pandas as pd
import numpy as np
array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
print(array)
print(pd.isna(array))
Output
[[ 1. nan  3.]
 [ 4.  5. nan]]
[[False  True False]
 [False False  True]]
For indexes, an array of booleans is returned.
index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
"2017-07-08"])
print(index)
Output
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None)
#checking for nulls in indexes
pd.isna(index)
Output
array([False, False, True, False])
For Series and DataFrames, the same type is returned, containing booleans.
#checking for nulls in a dataframe
df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
print(df)
Output
     0     1    2
0  ant   bee  cat
1  dog  None  fly
#use of isna() in df
pd.isna(df)
Output
       0      1      2
0  False  False  False
1  False   True  False
#checking for nulls in column 1
pd.isna(df[1])
Output
0    False
1     True
Name: 1, dtype: bool
So now let's try to inspect the given dataset.
First of all, we do a quantitative analysis. The descriptive statistics of each column give us a good idea about the dataset. We use the describe() function of pandas for this.
import pandas as pd
import numpy as np
df=pd.read_csv("Boston.csv")
print(df)
print(df.describe())
Output:
       Unnamed: 0        crim          zn  ...       black       lstat        medv
count  506.000000  506.000000  506.000000  ...  506.000000  506.000000  506.000000
mean   253.500000    3.613524   11.363636  ...  356.674032   12.653063   22.532806
std    146.213884    8.601545   23.322453  ...   91.294864    7.141062    9.197104
min      1.000000    0.006320    0.000000  ...    0.320000    1.730000    5.000000
25%    127.250000    0.082045    0.000000  ...  375.377500    6.950000   17.025000
50%    253.500000    0.256510    0.000000  ...  391.440000   11.360000   21.200000
75%    379.750000    3.677082   12.500000  ...  396.225000   16.955000   25.000000
max    506.000000   88.976200  100.000000  ...  396.900000   37.970000   50.000000

[8 rows x 15 columns]
The count for each column is 506, which means each column has 506 non-null values.
Similarly, the standard deviation, the minimum and maximum values, and the first, second, and third quartiles of each column are also printed.
We can use these values for manual elimination as well. For example, if we know the valid ranges of values for each column beforehand, we can check the values for consistency and eliminate erroneous ones.
This has to be done manually, however. In our case the Boston dataset is a standard dataset, so we can use the given values without worrying about the quality or correctness of the data. Real-world datasets are messier, and all of these checks have to be taken care of.
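As a quick check, we can count the missing values in each column of the DataFrame we loaded above:

# number of missing values per column of the Boston dataset
print(df.isna().sum())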
In this dataset, there are no missing values. Had there been any, we could do something like this:
#Filling null values with a single value
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'A': [100, 90, np.nan, 95],
        'B': [30, 45, 56, np.nan],
        'C': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
print(df)
# filling missing values using fillna()
print(df.fillna(0))
Output:
       A     B     C
0  100.0  30.0   NaN
1   90.0  45.0  40.0
2    NaN  56.0  80.0
3   95.0   NaN  98.0
       A     B     C
0  100.0  30.0   0.0
1   90.0  45.0  40.0
2    0.0  56.0  80.0
3   95.0   0.0  98.0
In the above example, we have a dictionary with three keys, A, B, and C, and we use it to create a dataframe.
We see that the dataframe shows missing values as NaN, so we replace them all with 0 using fillna().
Data Encoding
Data is in general of two types, quantitative and qualitative.
Quantitative data deals with numbers and things that can be measured, such as:
- Dimensions (height, width, and length)
- Temperature
- Humidity
- Prices
- Area and volume
There are many more examples where data of quantitative nature is used.
Qualitative data deals with characteristics and descriptors that can’t be easily measured, but can be observed subjectively—such as smells, tastes, textures, attractiveness, and color.
Broadly speaking, when we measure something and give it a numeric value, we generate quantitative data. When we classify or judge something, we generate qualitative data.
There are also different types of quantitative and qualitative data.
The type of data we are concerned with here is categorical data. Categorical data assigns labels to observations in order to distinguish between different classes or categories.
Since machine learning algorithms work on numeric data, we have to convert these labels into numbers. This can be done in primarily two ways:
Label Encoding – assigns a numeric label to each category; there are as many distinct labels as there are categories.
One Hot Encoding – creates one extra column per category; each row has a 1 in the column of its own category and 0 everywhere else, i.e. a presence/absence indicator vector.
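As a quick illustration of the difference, here is a minimal sketch on a made-up colour column (pandas' get_dummies is a convenient alternative to sklearn's OneHotEncoder for quick exploration):

import pandas as pd
colours = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
# label encoding: one numeric code per category
colours["colour_label"] = colours["colour"].astype("category").cat.codes
print(colours)
# one hot encoding: one 0/1 column per category
print(pd.get_dummies(colours["colour"], prefix="colour"))

The same two transformations, applied to the iris dataset with sklearn, look like this: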
import pandas as pd
import numpy as np
df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Encoding for dummy variables
onehot_encoder= OneHotEncoder()
X=onehot_encoder.fit_transform(df_iris["species"].values.reshape(-1,1))
print(X)
label_encoder_x= LabelEncoder()
df_iris["species"]= label_encoder_x.fit_transform(df_iris["species"])
print(df_iris)
Output:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
  (0, 0)    1.0
  (1, 0)    1.0
  (2, 0)    1.0
  (3, 0)    1.0
  (4, 0)    1.0
  :    :
  (145, 2)    1.0
  (146, 2)    1.0
  (147, 2)    1.0
  (148, 2)    1.0
  (149, 2)    1.0
(The sparse one-hot matrix listing is abridged; every stored entry is 1.0, placed in the column of the encoded class.)
     sepal_length  sepal_width  petal_length  petal_width  species
0             5.1          3.5           1.4          0.2        0
1             4.9          3.0           1.4          0.2        0
2             4.7          3.2           1.3          0.2        0
3             4.6          3.1           1.5          0.2        0
4             5.0          3.6           1.4          0.2        0
..            ...          ...           ...          ...      ...
145           6.7          3.0           5.2          2.3        2
146           6.3          2.5           5.0          1.9        2
147           6.5          3.0           5.2          2.0        2
148           6.2          3.4           5.4          2.3        2
149           5.9          3.0           5.1          1.8        2

[150 rows x 5 columns]
Data Interpolation
Interpolation is the process of using known data values to estimate unknown data values. Various interpolation techniques are often used in the atmospheric sciences. One of the simplest methods, linear interpolation, requires knowledge of two points and the constant rate of change between them.
In data preprocessing, interpolation is used to fill in missing values in columns that have empty cells.
There are many strategies for interpolation; the most prominent are mean (average) interpolation, linear interpolation, and KNN imputation.
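For plain linear interpolation along a column, pandas offers interpolate(); a minimal sketch on a made-up series:

import pandas as pd
import numpy as np
s = pd.Series([10.0, np.nan, np.nan, 40.0])
# linear interpolation assumes a constant rate of change between the known points
print(s.interpolate(method="linear"))   # 10.0, 20.0, 30.0, 40.0

For the mean-replacement strategy, scikit-learn's imputer can be used, as in the example below.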
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'A': [100, 90, np.nan, 95],
        'B': [30, 45, 56, np.nan],
        'C': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
trainingData = df.iloc[:, :].values
dataset = df.iloc[:, :].values
# Imputer was removed from newer versions of scikit-learn; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(trainingData[:, 1:2])
dataset[:, 1:2] = imputer.transform(dataset[:, 1:2])
print(dataset)
Output
[[100.          30.                  nan]
 [ 90.          45.          40.        ]
 [         nan  56.          80.        ]
 [ 95.          43.66666667  98.        ]]
#Filling null values with the column mean using SimpleImputer
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'A': [100, 90, np.nan, 95],
        'B': [30, 45, 56, np.nan],
        'C': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df)
print(imp.transform(df))
Output
[[100.          30.          72.66666667]
 [ 90.          45.          40.        ]
 [ 95.          56.          80.        ]
 [ 95.          43.66666667  98.        ]]
Data Splitting
Before being fed to machine learning algorithms, data is divided into training and test (validation) sets.
Python's sklearn library provides a dedicated function, train_test_split, for this. We specify the fraction of data we want as the test set, and the function divides the given data into train and test sets.
It returns four arrays: the training independent variables, the testing independent variables, the training dependent variable, and the testing dependent variable.
Let's work through an example:
import pandas as pd
import numpy as np
df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)
x=df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y=df_iris[['species']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
print(x_train,y_train)
print(x_test,y_test)
Output
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
     sepal_length  sepal_width  petal_length  petal_width
137           6.4          3.1           5.5          1.8
84            5.4          3.0           4.5          1.5
27            5.2          3.5           1.5          0.2
127           6.1          3.0           4.9          1.8
132           6.4          2.8           5.6          2.2
..            ...          ...           ...          ...
9             4.9          3.1           1.5          0.1
103           6.3          2.9           5.6          1.8
67            5.8          2.7           4.1          1.0
117           7.7          3.8           6.7          2.2
47            4.6          3.2           1.4          0.2

[120 rows x 4 columns]

        species
137   virginica
84   versicolor
27       setosa
127   virginica
132   virginica
..          ...
9        setosa
103   virginica
67   versicolor
117   virginica
47       setosa

[120 rows x 1 columns]

     sepal_length  sepal_width  petal_length  petal_width
114           5.8          2.8           5.1          2.4
62            6.0          2.2           4.0          1.0
33            5.5          4.2           1.4          0.2
107           7.3          2.9           6.3          1.8
7             5.0          3.4           1.5          0.2
..            ...          ...           ...          ...
22            4.6          3.6           1.0          0.2
44            5.1          3.8           1.9          0.4
97            6.2          2.9           4.3          1.3
93            5.0          2.3           3.3          1.0
26            5.0          3.4           1.6          0.4

        species
114   virginica
62   versicolor
33       setosa
107   virginica
7        setosa
..          ...
22       setosa
44       setosa
97   versicolor
93   versicolor
26       setosa

(x_test and y_test contain the 30 test rows; their listings are abridged here.)
Feature Scaling
Feature scaling standardizes the data so that no independent variable carries more weight than another simply because of its scale.
Each column is standardized individually so that it has zero mean and unit variance. This is typically the last step in data preprocessing.
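Standardization transforms each value as z = (x - mean) / standard deviation, computed per column. Here is a minimal sketch of the same idea by hand, on a made-up array, before handing the job over to sklearn:

import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()   # zero mean, unit variance
print(z)

sklearn's StandardScaler does exactly this for every column: it learns the means and standard deviations from the training set and applies the same transformation to the test set: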
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
print(x_train,y_train)
print(x_test,y_test)
Output
[[ 0.61303014  0.10850105  0.94751783  0.73603967]
 [-0.56776627 -0.12400121  0.38491447  0.34808318]
 [-0.80392556  1.03851009 -1.30289562 -1.3330616 ]
 [ 0.25879121 -0.12400121  0.60995581  0.73603967]
 [ 0.61303014 -0.58900572  1.00377816  1.25331499]
 ...
 [ 0.49495049 -0.35650346  1.00377816  0.73603967]
 [-0.09544771 -0.82150798  0.15987312 -0.29851096]
 [ 2.14806547  1.73601687  1.62264186  1.25331499]
 [-1.5124034   0.34100331 -1.35915595 -1.3330616 ]]
     species
137        2
84         1
27         0
127        2
132        2
..       ...
9          0
103        2
67         1
117        2
47         0

[120 rows x 1 columns]

[[-0.09544771 -0.58900572  0.72247648  1.51195265]
 [ 0.14071157 -1.98401928  0.10361279 -0.29851096]
 [-0.44968663  2.66602591 -1.35915595 -1.3330616 ]
 ...
 [ 0.37687085 -0.35650346  0.27239379  0.08944552]
 [-1.04008484 -1.75151702 -0.29020957 -0.29851096]
 [-1.04008484  0.80600783 -1.24663528 -1.07442394]]
     species
114        2
62         1
33         0
107        2
7          0
..       ...
22         0
44         0
97         1
93         1
26         0

(The scaled arrays are abridged here; x_train has 120 rows and x_test has 30 rows, each with 4 standardized columns.)
The complete script, combining the imports, encoding, splitting, and feature scaling, is:
import pandas as pd
import numpy as np
df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Encoding for dummy variables
onehot_encoder= OneHotEncoder()
X=onehot_encoder.fit_transform(df_iris["species"].values.reshape(-1,1))
print(X)
label_encoder_x= LabelEncoder()
df_iris["species"]= label_encoder_x.fit_transform(df_iris["species"])
x=df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y=df_iris[['species']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
print(x_train,y_train)
print(x_test,y_test)
We hope you find the above code reusable for all your future endeavours in machine learning.
To conclude, data preprocessing is a very important step in machine learning and should be performed diligently.