Data Preprocessing: Introduction, Concepts, and Definitions


What is data preprocessing?

For machine learning, we need data. Lots of it. The more we have, the better our model. Machine learning algorithms are data-hungry. But there’s a catch. They need data in a specific format.

In the real world, terabytes of data are generated by many different sources, but not all of it is directly usable. Audio, video, images, text, charts, and logs all contain data, but it needs to be cleaned into a usable format before machine learning algorithms can produce meaningful results.

The process of cleaning raw data so that it can be used for machine learning activities is known as data preprocessing. It is the first and foremost step of a machine learning project, and generally the most time-consuming phase as well.

Why data preprocessing?

Real-world data is often noisy, incomplete with missing entries, and more often than not unsuitable for direct use for building models or solving complex data-related problems. There might be erroneous data, or the data might be unordered, unstructured, and unformatted. 

The above issues render the collected data unusable for machine learning purposes. The same data, once formatted and cleaned, produces more accurate and reliable results from machine learning models than its unprocessed counterpart.

Data pre-processing steps

Data pre-processing involves several stages or steps, listed below:

  • Data collection
  • Data import
  • Data inspection
  • Data encoding
  • Data interpolation
  • Data splitting into train and test sets
  • Feature scaling

Data Collection

Data collection is the stage in which we gather data from various sources. The data might be lying across several storage systems or servers, and we need to collect it all in one location for ease of access.

Data comes in many formats, so we need to settle on a common format for collection, and all the required data should be converted to that format so common operations can be performed on it. Data from chat servers is often in JSON, while data from business applications is generally tabular. So, if we want to use both kinds of data, we need to convert everything to JSON, or everything to CSV or xlsx. Sometimes data also arrives as HTML text, which needs to be cleaned as well.
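
As a small illustration, the sketch below (assuming a hypothetical file named chats.json containing chat records) loads JSON with pandas and writes it back out as CSV so it shares a common tabular format with the other sources:

# hypothetical example: convert JSON chat records to CSV
import pandas as pd

chats = pd.read_json("chats.json")      # chats.json is an assumed example file
chats.to_csv("chats.csv", index=False)  # now in the same tabular format as the other data sources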

Data Import 

Data import is the process of bringing the data into software such as R or Python for data cleaning purposes. Sometimes the data is so large that we have to take special care when importing it into the processing server or software. Tools like pandas, Dask, NumPy, and matplotlib come in handy when operating on such large volumes of data.

Pandas 

pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.

Installing pandas

Download Anaconda for your operating system and the latest Python version, run the installer, and follow the steps. Please note:

You do not need to (and should not) install Anaconda as root or administrator. When asked if you wish to initialize Anaconda3, answer yes. Restart the terminal after completing the installation. Detailed instructions on installing Anaconda can be found in the Anaconda documentation.

In the Anaconda prompt (or terminal in Linux or macOS), start JupyterLab:
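
jupyter lab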

Importing pandas

In JupyterLab, create a new (Python 3) notebook:

In the first cell of the notebook, you can import pandas and check the version with:
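
import pandas as pd
pd.__version__   # shows the installed pandas version in the notebook output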

Now we are ready to use pandas, and you can write your code in the next cells.

NumPy

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Importing Numpy

To import NumPy and check if it’s installed use the following code.
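
import numpy as np
print(np.__version__)   # confirms that NumPy is installed and prints its version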

Here we imported NumPy and gave it an alias np. The alias np is further used to refer to NumPy.

Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.

Importing matplotlib
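
import matplotlib
print(matplotlib.__version__)   # prints the installed matplotlib version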

Here we imported matplotlib and printed the version. This is a good check to verify that matplotlib was installed correctly.

Dask

Analysts often use tools like Pandas, Scikit-Learn, Numpy, and the rest of the Python ecosystem to analyze data on their personal computers. They like these tools because they are efficient, intuitive, and widely trusted. However, when they choose to apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. And so, the analyst rewrites their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.

Dask provides ways to scale Pandas, Scikit-Learn, and NumPy workflows more natively, with minimal rewriting. It integrates well with these tools, copying most of their APIs and using their data structures internally.

Dask Installation

To install Dask in our existing conda environment, we open the Anaconda prompt as before and execute the following command. We need to install Dask explicitly if it is not already present in the environment.

conda install dask
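
Once installed, a quick sanity check and a first look at Dask's pandas-like API might look like the sketch below (large_dataset.csv is only a placeholder file name):

import dask
import dask.dataframe as dd

print(dask.__version__)                  # confirms Dask is installed
ddf = dd.read_csv("large_dataset.csv")   # placeholder file; Dask reads it lazily in partitions
print(ddf.head())                        # computes and shows only the first few rows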

Having installed all these libraries, let's load a sample dataset and see how it's done in Python.

We use a common dataset, Boston.csv.

import pandas as pd
import numpy as np
df=pd.read_csv("Boston.csv")
print(df)

Output:

     Unnamed: 0     crim    zn  indus  chas  ...  tax  ptratio   black  lstat  medv
0             1  0.00632  18.0   2.31     0  ...  296     15.3  396.90   4.98  24.0
1             2  0.02731   0.0   7.07     0  ...  242     17.8  396.90   9.14  21.6
2             3  0.02729   0.0   7.07     0  ...  242     17.8  392.83   4.03  34.7
3             4  0.03237   0.0   2.18     0  ...  222     18.7  394.63   2.94  33.4
4             5  0.06905   0.0   2.18     0  ...  222     18.7  396.90   5.33  36.2
..          ...      ...   ...    ...   ...  ...  ...      ...     ...    ...   ...
501         502  0.06263   0.0  11.93     0  ...  273     21.0  391.99   9.67  22.4
502         503  0.04527   0.0  11.93     0  ...  273     21.0  396.90   9.08  20.6
503         504  0.06076   0.0  11.93     0  ...  273     21.0  396.90   5.64  23.9
504         505  0.10959   0.0  11.93     0  ...  273     21.0  393.45   6.48  22.0
505         506  0.04741   0.0  11.93     0  ...  273     21.0  396.90   7.88  11.9

[506 rows x 15 columns]

First, we import pandas. Then we use the read_csv() function of pandas to read the file into memory.

Inside read_csv() we have passed just the dataset name as an argument, because the dataset sits in the same directory as the Python file. Had it been in a different location, we would have passed the full path to the file.

Once the line containing read_csv() is executed, the file is read and the contents of Boston.csv are loaded into a data frame called df.

To verify that the file has been loaded correctly, we can print the data frame as above, or call df.head(), which displays the first five rows of the dataset by default.

In a similar manner, pandas.read_json() can be used to read a dataset in JSON format, and pandas.read_table() can be used to read a delimited text file.

Data Inspection

After the data is imported, it is inspected for missing values, and several sanity checks are done to ensure the consistency of the data. Domain knowledge comes in handy in such scenarios.

Checking for missing data

To check for missing data, we look for rows and columns that have null or missing entries.

If any are found, we have to make decisions based on the scenario and on intuition.

Again, domain knowledge comes in handy when deciding the importance of certain columns.

If a column has more than about 40 percent of its data missing, it is generally considered good practice to discard the column entirely.

If the percentage of missing data is smaller than that, various interpolation and replacement techniques can be employed to fill in the gaps. The most common is replacing nulls with a measure of central tendency: the mean, median, or mode.
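
As a rough sketch of how these two rules could be applied with pandas (assuming df is the data frame loaded earlier; the 40 percent threshold and the median are just the choices discussed above):

# sketch: drop very sparse columns, fill the rest with column medians
missing_share = df.isna().mean()                                 # fraction of missing values per column
df = df.drop(columns=missing_share[missing_share > 0.4].index)   # drop columns with more than 40% missing
df = df.fillna(df.median(numeric_only=True))                     # fill remaining gaps with each column's median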

Statistical significance tests can also be used to decide which columns to keep and which to drop while building the model, but that's a story for another time.

Implementation

The isna() function is used to check for null values in pandas.
import pandas as pd
import numpy as np
array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
print(array)
print(pd.isna(array))

Output

 [[ 1. nan  3.]
 [ 4.  5. nan]]
[[False  True False]
 [False False  True]]

For indexes, an array of booleans is returned.

index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
                          "2017-07-08"])
print(index)

Output

 DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
              dtype='datetime64[ns]', freq=None)

#checking for nulls in indexes

pd.isna(index)

Output

 array([False, False,  True, False])

#checking for nulls in Series and DataFrames

For Series and DataFrame, the same type is returned, containing booleans.

df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
print(df)

Output

   0     1    2
0  ant   bee  cat
1  dog  None  fly

#use of isna() in df

pd.isna(df)

Output

  0      1      2
0  False  False  False
1  False   True  False

#checking for nulls in column 1 (the second column)

pd.isna(df[1])

Output

 0    False
1     True
Name: 1, dtype: bool

Now let's try to inspect the given dataset.

First of all, we do a quantitative analysis. The descriptive statistics of each column give us a good idea about the dataset. We use the describe() function of pandas for this.

import pandas as pd
import numpy as np

df=pd.read_csv("Boston.csv")
print(df)
print (df.describe())

Output:

 Unnamed: 0        crim          zn  ...       black       lstat        medv
count  506.000000  506.000000  506.000000  ...  506.000000  506.000000  506.000000
mean   253.500000    3.613524   11.363636  ...  356.674032   12.653063   22.532806
std    146.213884    8.601545   23.322453  ...   91.294864    7.141062    9.197104
min      1.000000    0.006320    0.000000  ...    0.320000    1.730000    5.000000
25%    127.250000    0.082045    0.000000  ...  375.377500    6.950000   17.025000
50%    253.500000    0.256510    0.000000  ...  391.440000   11.360000   21.200000
75%    379.750000    3.677082   12.500000  ...  396.225000   16.955000   25.000000
max    506.000000   88.976200  100.000000  ...  396.900000   37.970000   50.000000

[8 rows x 15 columns]

The count for each column is 506, which means each column has 506 non-null values, i.e. there are no missing entries.

Similarly, the mean, standard deviation, minimum, maximum, and the first, second, and third quartile values of each column are also printed.

We can use these values for manual elimination of erroneous entries as well. For example, if we know the valid range of values for each column beforehand, we can check the values for consistency and eliminate the ones that fall outside it.
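
For illustration, a manual range check could look like the sketch below; the bounds chosen for the crim column are made-up values for the example, not real domain limits:

# hypothetical range check on the crim column; bounds are assumed for illustration only
lower, upper = 0.0, 100.0
df = df[(df["crim"] >= lower) & (df["crim"] <= upper)]   # keep only rows inside the assumed valid range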

But these checks have to be done manually. In our case the Boston dataset is a standard dataset, so we can use the given values without worrying about their quality or correctness. Real-world datasets are messier, and all these measures have to be taken care of.

In this dataset, there are no missing values. Had there been missing values, we could do something like this:

#Filling null values with a single value
# importing pandas as pd 
import pandas as pd 
  
# importing numpy as np 
import numpy as np 
  
# dictionary of lists 
dict = {'A':[100, 90, np.nan, 95], 
        'B': [30, 45, 56, np.nan], 
        'C':[np.nan, 40, 80, 98]} 
  
# creating a dataframe from dictionary 
df = pd.DataFrame(dict) 

print (df)
# filling missing values with 0 using fillna()
print(df.fillna(0))

Output:

  A     B     C
0  100.0  30.0   NaN
1   90.0  45.0  40.0
2    NaN  56.0  80.0
3   95.0   NaN  98.0

       A     B     C
0  100.0  30.0   0.0
1   90.0  45.0  40.0
2    0.0  56.0  80.0
3   95.0   0.0  98.0

In the above example, we have a dictionary with three keys A, B and C. We use the dictionary to create a dataframe.

We see that the data frame shows the missing values as NaN, so we replace all of them with 0 using fillna(0).

Data Encoding

Data is in general of two types, quantitative and qualitative.

Quantitative data deals with numbers and things that can be measured, such as:

  • Dimensions (height, width, and length)
  • Temperature
  • Humidity
  • Prices
  • Area and volume

There are many more examples where data of quantitative nature is used.

Qualitative data deals with characteristics and descriptors that can’t be easily measured, but can be observed subjectively—such as smells, tastes, textures, attractiveness, and color. 

Broadly speaking, when we measure something and give it a numeric value, we generate quantitative data. When we classify or judge something, we generate qualitative data.

There are also different types of quantitative and qualitative data.

The type of data we are concerned with here is categorical data. Categorical data assigns labels to observations so that they can be grouped into distinct classes or categories.

Since machine learning algorithms work on numeric data, we have to convert these labels into numbers. This can be done primarily in two ways:

Label Encoding – Label encoding assigns a numeric label to each category; there are as many distinct labels as there are categories.

One Hot Encoding – One hot encoding creates a separate binary column for each category, marking the presence or absence of that category in each row.
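
Before the scikit-learn version below, it can help to see the two encodings on a tiny made-up column; pandas offers factorize() and get_dummies() for the same ideas:

# made-up categorical column, for illustration only
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"])
labels, categories = pd.factorize(colors)    # label encoding: one integer per category -> [0 1 2 1]
dummies = pd.get_dummies(colors)             # one hot encoding: one binary column per category
print(labels)
print(dummies)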

import pandas as pd
import numpy as np

df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)


from sklearn.preprocessing import LabelEncoder, OneHotEncoder  

#Encoding for dummy variables  
onehot_encoder= OneHotEncoder()    
X=onehot_encoder.fit_transform(df_iris["species"].values.reshape(-1,1))
print(X)

label_encoder_x= LabelEncoder()  
df_iris["species"]= label_encoder_x.fit_transform(df_iris["species"])  
print(df_iris)

Output:

 Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
  (0, 0)        1.0
  (1, 0)        1.0
  (2, 0)        1.0
  (3, 0)        1.0
  (4, 0)        1.0
  (5, 0)        1.0
  (6, 0)        1.0
  (7, 0)        1.0
  (8, 0)        1.0
  (9, 0)        1.0
  (10, 0)       1.0
  (11, 0)       1.0
  (12, 0)       1.0
  (13, 0)       1.0
  (14, 0)       1.0
  (15, 0)       1.0
  (16, 0)       1.0
  (17, 0)       1.0
  (18, 0)       1.0
  (19, 0)       1.0
  (20, 0)       1.0
  (21, 0)       1.0
  (22, 0)       1.0
  (23, 0)       1.0
  (24, 0)       1.0
  :     :
  (125, 2)      1.0
  (126, 2)      1.0
  (127, 2)      1.0
  (128, 2)      1.0
  (129, 2)      1.0
  (130, 2)      1.0
  (131, 2)      1.0
  (132, 2)      1.0
  (133, 2)      1.0
  (134, 2)      1.0
  (135, 2)      1.0
  (136, 2)      1.0
  (137, 2)      1.0
  (138, 2)      1.0
  (139, 2)      1.0
  (140, 2)      1.0
  (141, 2)      1.0
  (142, 2)      1.0
  (143, 2)      1.0
  (144, 2)      1.0
  (145, 2)      1.0
  (146, 2)      1.0
  (147, 2)      1.0
  (148, 2)      1.0
  (149, 2)      1.0
     sepal_length  sepal_width  petal_length  petal_width  species
0             5.1          3.5           1.4          0.2        0
1             4.9          3.0           1.4          0.2        0
2             4.7          3.2           1.3          0.2        0
3             4.6          3.1           1.5          0.2        0
4             5.0          3.6           1.4          0.2        0
..            ...          ...           ...          ...      ...
145           6.7          3.0           5.2          2.3        2
146           6.3          2.5           5.0          1.9        2
147           6.5          3.0           5.2          2.0        2
148           6.2          3.4           5.4          2.3        2
149           5.9          3.0           5.1          1.8        2

[150 rows x 5 columns]

Data Interpolation

Interpolation is the process of using known data values to estimate unknown data values. Various interpolation techniques are often used in the atmospheric sciences. One of the simplest methods, linear interpolation, requires knowledge of two points and the constant rate of change between them.

Data interpolation is used to fill in missing values in columns that contain empty cells.

There are many different strategies for interpolation, the most prominent being mean interpolation and k-NN imputation.
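
For example, pandas can perform simple linear interpolation directly on a Series; a minimal sketch:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])
print(s.interpolate())   # linear interpolation fills the gaps: 1.0, 2.0, 3.0, 5.0, 7.0

The scikit-learn based mean imputation shown below is another common option: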

# importing pandas as pd 
import pandas as pd 
  
# importing numpy as np 
import numpy as np 
  
# dictionary of lists 
dict = {'A':[100, 90, np.nan, 95], 
        'B': [30, 45, 56, np.nan], 
        'C':[np.nan, 40, 80, 98]} 
  
# creating a dataframe from dictionary 
df = pd.DataFrame(dict) 

trainingData = df.iloc[:, :].values
dataset = df.iloc[:, :].values

from sklearn.impute import SimpleImputer   # sklearn.preprocessing.Imputer has been removed in recent scikit-learn versions
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(trainingData[:, 1:2])
dataset[:, 1:2] = imputer.transform(dataset[:, 1:2])

print(dataset)

Output 

 [[100.          30.                  nan]
 [ 90.          45.          40.        ]
 [         nan  56.          80.        ]
 [ 95.          43.66666667  98.        ]]

#Filling null values with the column mean using SimpleImputer

# importing pandas as pd 
import pandas as pd 
  
# importing numpy as np 
import numpy as np 
  
# dictionary of lists 
dict = {'A':[100, 90, np.nan, 95], 
        'B': [30, 45, 56, np.nan], 
        'C':[np.nan, 40, 80, 98]} 
  
# creating a dataframe from dictionary 
df = pd.DataFrame(dict) 
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df)
print(imp.transform(df))

Output

 [[100.          30.          72.66666667]
 [ 90.          45.          40.        ]
 [ 95.          56.          80.        ]
 [ 95.          43.66666667  98.        ]]

Data Splitting

Before being fed into machine learning algorithms, data is divided into training and test sets.

The sklearn library of Python provides a special function, train_test_split, for this. We specify the fraction of data we want as the test set, and the function divides the given data into train and test sets.

It returns four arrays: the training independent variables, the testing independent variables, the training dependent variable, and the testing dependent variable.

Here is an example:

import pandas as pd
import numpy as np
 
df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)
 
x=df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y=df_iris[['species']]
 
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0) 
 
print(x_train,y_train)
 
print(x_test,y_test)

Output

 Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
     sepal_length  sepal_width  petal_length  petal_width
137           6.4          3.1           5.5          1.8
84            5.4          3.0           4.5          1.5
27            5.2          3.5           1.5          0.2
127           6.1          3.0           4.9          1.8
132           6.4          2.8           5.6          2.2
..            ...          ...           ...          ...
9             4.9          3.1           1.5          0.1
103           6.3          2.9           5.6          1.8
67            5.8          2.7           4.1          1.0
117           7.7          3.8           6.7          2.2
47            4.6          3.2           1.4          0.2
 
[120 rows x 4 columns]         species
137   virginica
84   versicolor
27       setosa
127   virginica
132   virginica
..          ...
9        setosa
103   virginica
67   versicolor
117   virginica
47       setosa
 
[120 rows x 1 columns]

     sepal_length  sepal_width  petal_length  petal_width
114           5.8          2.8           5.1          2.4
62            6.0          2.2           4.0          1.0
33            5.5          4.2           1.4          0.2
107           7.3          2.9           6.3          1.8
7             5.0          3.4           1.5          0.2
100           6.3          3.3           6.0          2.5
40            5.0          3.5           1.3          0.3
86            6.7          3.1           4.7          1.5
76            6.8          2.8           4.8          1.4
71            6.1          2.8           4.0          1.3
134           6.1          2.6           5.6          1.4
51            6.4          3.2           4.5          1.5
73            6.1          2.8           4.7          1.2
54            6.5          2.8           4.6          1.5
63            6.1          2.9           4.7          1.4
37            4.9          3.1           1.5          0.1
78            6.0          2.9           4.5          1.5
90            5.5          2.6           4.4          1.2
45            4.8          3.0           1.4          0.3
16            5.4          3.9           1.3          0.4
121           5.6          2.8           4.9          2.0
66            5.6          3.0           4.5          1.5
24            4.8          3.4           1.9          0.2
8             4.4          2.9           1.4          0.2
126           6.2          2.8           4.8          1.8
22            4.6          3.6           1.0          0.2
44            5.1          3.8           1.9          0.4
97            6.2          2.9           4.3          1.3
93            5.0          2.3           3.3          1.0
26            5.0          3.4           1.6          0.4         species
114   virginica
62   versicolor
33       setosa
107   virginica
7        setosa
100   virginica
40       setosa
86   versicolor
76   versicolor
71   versicolor
134   virginica
51   versicolor
73   versicolor
54   versicolor
63   versicolor
37       setosa
78   versicolor
90   versicolor
45       setosa
16       setosa
121   virginica
66   versicolor
24       setosa
8        setosa
126   virginica
22       setosa
44       setosa
97   versicolor
93   versicolor
26       setosa

Feature Scaling

Feature scaling standardizes the data so that no independent variable dominates the others simply because of its scale.

Each column is standardized individually. StandardScaler, used below, transforms each value x to z = (x − μ) / σ, where μ is the column mean and σ its standard deviation, so every column ends up with zero mean and unit variance. This is the last step in data preprocessing.

from sklearn.preprocessing import StandardScaler  

st_x= StandardScaler()  
x_train= st_x.fit_transform(x_train)  
x_test= st_x.transform(x_test)  
print(x_train,y_train)
print(x_test,y_test)

Output

 [[ 0.61303014  0.10850105  0.94751783  0.73603967]
 [-0.56776627 -0.12400121  0.38491447  0.34808318]
 [-0.80392556  1.03851009 -1.30289562 -1.3330616 ]
 [ 0.25879121 -0.12400121  0.60995581  0.73603967]
 [ 0.61303014 -0.58900572  1.00377816  1.25331499]
 [-0.80392556 -0.82150798  0.04735245  0.21876435]
 [-0.21352735  1.73601687 -1.19037495 -1.20374277]
 [ 0.14071157 -0.82150798  0.72247648  0.47740201]
 [ 0.02263193 -0.12400121  0.21613346  0.34808318]
 [-0.09544771 -1.05401024  0.10361279 -0.03987331]
 [ 1.0853487  -0.12400121  0.94751783  1.12399616]
 [-1.39432376  0.34100331 -1.41541629 -1.3330616 ]
 [ 1.20342834  0.10850105  0.72247648  1.38263382]
 [-1.04008484  1.03851009 -1.24663528 -0.81578628]
 [-0.56776627  1.50351461 -1.30289562 -1.3330616 ]
 [-1.04008484 -2.4490238  -0.1776889  -0.29851096]
 [ 0.73110978 -0.12400121  0.94751783  0.73603967]
 [ 0.96726906  0.57350557  1.0600385   1.64127148]
 [ 0.14071157 -1.98401928  0.66621615  0.34808318]
 [ 0.96726906 -1.2865125   1.11629884  0.73603967]
 [-0.33160699 -1.2865125   0.04735245 -0.16919214]
 [ 2.14806547 -0.12400121  1.28507985  1.38263382]
 [ 0.49495049  0.57350557  0.49743514  0.47740201]
 [-0.44968663 -1.51901476 -0.00890789 -0.16919214]
 [ 0.49495049 -0.82150798  0.60995581  0.73603967]
 [ 0.49495049 -0.58900572  0.72247648  0.34808318]
 [-1.15816448 -1.2865125   0.38491447  0.60672084]
 [ 0.49495049 -1.2865125   0.66621615  0.8653585 ]
 [ 1.32150798  0.34100331  0.49743514  0.21876435]
 [ 0.73110978 -0.12400121  0.77873682  0.99467733]
 [ 0.14071157  0.80600783  0.38491447  0.47740201]
 [-1.27624412  0.10850105 -1.24663528 -1.3330616 ]
 [-0.09544771 -0.82150798  0.72247648  0.8653585 ]
 [-0.33160699 -0.82150798  0.21613346  0.08944552]
 [-0.33160699 -0.35650346 -0.12142856  0.08944552]
 [-0.44968663 -1.2865125   0.10361279  0.08944552]
 [ 0.25879121 -0.12400121  0.4411748   0.21876435]
 [ 1.55766726  0.34100331  1.22881951  0.73603967]
 [-0.68584591  1.50351461 -1.30289562 -1.3330616 ]
 [-1.86664232 -0.12400121 -1.52793696 -1.46238043]
 [ 0.61303014 -0.82150798  0.83499716  0.8653585 ]
 [-0.21352735 -0.12400121  0.21613346 -0.03987331]
 [-0.56776627  0.80600783 -1.19037495 -1.3330616 ]
 [-0.21352735  3.13103043 -1.30289562 -1.07442394]
 [ 1.20342834  0.10850105  0.60995581  0.34808318]
 [-1.5124034   0.10850105 -1.30289562 -1.3330616 ]
 [ 0.02263193 -0.12400121  0.72247648  0.73603967]
 [-0.9220052  -1.2865125  -0.45899058 -0.16919214]
 [-1.5124034   0.80600783 -1.35915595 -1.20374277]
 [ 0.37687085 -1.98401928  0.38491447  0.34808318]
 [ 1.55766726  1.27101235  1.28507985  1.64127148]
 [-0.21352735 -0.35650346  0.21613346  0.08944552]
 [-1.27624412 -0.12400121 -1.35915595 -1.46238043]
 [ 1.43958762 -0.12400121  1.17255917  1.12399616]
 [ 1.20342834  0.34100331  1.0600385   1.38263382]
 [ 0.73110978 -0.12400121  1.11629884  1.25331499]
 [ 0.61303014 -0.58900572  1.00377816  1.12399616]
 [-0.9220052   1.73601687 -1.24663528 -1.3330616 ]
 [-1.27624412  0.80600783 -1.24663528 -1.3330616 ]
 [ 0.73110978  0.34100331  0.72247648  0.99467733]
 [ 0.96726906  0.57350557  1.0600385   1.12399616]
 [-1.63048304 -1.75151702 -1.41541629 -1.20374277]
 [ 0.37687085  0.80600783  0.89125749  1.38263382]
 [-1.15816448 -0.12400121 -1.35915595 -1.3330616 ]
 [-0.21352735 -1.2865125   0.66621615  0.99467733]
 [ 1.20342834  0.10850105  0.89125749  1.12399616]
 [-1.74856268  0.34100331 -1.41541629 -1.3330616 ]
 [-1.04008484  1.27101235 -1.35915595 -1.3330616 ]
 [ 1.55766726 -0.12400121  1.11629884  0.47740201]
 [-0.9220052   1.03851009 -1.35915595 -1.20374277]
 [-1.74856268 -0.12400121 -1.41541629 -1.3330616 ]
 [-0.56776627  1.96851913 -1.19037495 -1.07442394]
 [-0.44968663 -1.75151702  0.10361279  0.08944552]
 [ 1.0853487   0.34100331  1.17255917  1.38263382]
 [ 2.02998583 -0.12400121  1.56638153  1.12399616]
 [-0.9220052   1.03851009 -1.35915595 -1.3330616 ]
 [-1.15816448  0.10850105 -1.30289562 -1.46238043]
 [-0.80392556  0.80600783 -1.35915595 -1.3330616 ]
 [-0.21352735 -0.58900572  0.38491447  0.08944552]
 [ 0.84918942 -0.12400121  0.32865413  0.21876435]
 [-1.04008484  0.34100331 -1.47167663 -1.3330616 ]
 [-0.9220052   0.57350557 -1.19037495 -0.94510511]
 [ 0.61303014 -0.35650346  0.27239379  0.08944552]
 [-0.56776627  0.80600783 -1.30289562 -1.07442394]
 [ 2.14806547 -1.05401024  1.73516253  1.38263382]
 [-1.15816448 -1.51901476 -0.29020957 -0.29851096]
 [ 2.38422475  1.73601687  1.45386085  0.99467733]
 [ 0.96726906  0.10850105  0.32865413  0.21876435]
 [-0.80392556  2.43352365 -1.30289562 -1.46238043]
 [ 0.14071157 -0.12400121  0.55369548  0.73603967]
 [-0.09544771  2.20102139 -1.47167663 -1.3330616 ]
 [ 2.14806547 -0.58900572  1.62264186  0.99467733]
 [-0.9220052   1.73601687 -1.30289562 -1.20374277]
 [-1.39432376  0.34100331 -1.24663528 -1.3330616 ]
 [ 1.79382654 -0.58900572  1.28507985  0.8653585 ]
 [-1.04008484  0.57350557 -1.35915595 -1.3330616 ]
 [ 0.49495049  0.80600783  1.00377816  1.51195265]
 [-0.21352735 -0.58900572  0.15987312  0.08944552]
 [-0.09544771 -0.82150798  0.04735245 -0.03987331]
 [-0.21352735 -1.05401024 -0.1776889  -0.29851096]
 [ 0.61303014  0.34100331  0.83499716  1.38263382]
 [ 0.96726906 -0.12400121  0.77873682  1.38263382]
 [ 0.49495049 -1.2865125   0.60995581  0.34808318]
 [ 0.96726906 -0.12400121  0.66621615  0.60672084]
 [-1.04008484 -0.12400121 -1.24663528 -1.3330616 ]
 [-0.44968663 -1.51901476 -0.06516822 -0.29851096]
 [ 0.96726906  0.10850105  1.00377816  1.51195265]
 [-0.09544771 -0.82150798  0.72247648  0.8653585 ]
 [-0.9220052   0.80600783 -1.30289562 -1.3330616 ]
 [ 0.84918942 -0.35650346  0.4411748   0.08944552]
 [-0.33160699 -0.12400121  0.15987312  0.08944552]
 [ 0.02263193  0.34100331  0.55369548  0.73603967]
 [ 0.49495049 -1.75151702  0.32865413  0.08944552]
 [-0.44968663  1.03851009 -1.41541629 -1.3330616 ]
 [-0.9220052   1.50351461 -1.30289562 -1.07442394]
 [-1.15816448  0.10850105 -1.30289562 -1.46238043]
 [ 0.49495049 -0.35650346  1.00377816  0.73603967]
 [-0.09544771 -0.82150798  0.15987312 -0.29851096]
 [ 2.14806547  1.73601687  1.62264186  1.25331499]
 [-1.5124034   0.34100331 -1.35915595 -1.3330616 ]]      species
137        2
84         1
27         0
127        2
132        2
..       ...
9          0
103        2
67         1
117        2
47         0
 
[120 rows x 1 columns]
[[-0.09544771 -0.58900572  0.72247648  1.51195265]
 [ 0.14071157 -1.98401928  0.10361279 -0.29851096]
 [-0.44968663  2.66602591 -1.35915595 -1.3330616 ]
 [ 1.6757469  -0.35650346  1.39760052  0.73603967]
 [-1.04008484  0.80600783 -1.30289562 -1.3330616 ]
 [ 0.49495049  0.57350557  1.22881951  1.64127148]
 [-1.04008484  1.03851009 -1.41541629 -1.20374277]
 [ 0.96726906  0.10850105  0.49743514  0.34808318]
 [ 1.0853487  -0.58900572  0.55369548  0.21876435]
 [ 0.25879121 -0.58900572  0.10361279  0.08944552]
 [ 0.25879121 -1.05401024  1.00377816  0.21876435]
 [ 0.61303014  0.34100331  0.38491447  0.34808318]
 [ 0.25879121 -0.58900572  0.49743514 -0.03987331]
 [ 0.73110978 -0.58900572  0.4411748   0.34808318]
 [ 0.25879121 -0.35650346  0.49743514  0.21876435]
 [-1.15816448  0.10850105 -1.30289562 -1.46238043]
 [ 0.14071157 -0.35650346  0.38491447  0.34808318]
 [-0.44968663 -1.05401024  0.32865413 -0.03987331]
 [-1.27624412 -0.12400121 -1.35915595 -1.20374277]
 [-0.56776627  1.96851913 -1.41541629 -1.07442394]
 [-0.33160699 -0.58900572  0.60995581  0.99467733]
 [-0.33160699 -0.12400121  0.38491447  0.34808318]
 [-1.27624412  0.80600783 -1.07785427 -1.3330616 ]
 [-1.74856268 -0.35650346 -1.35915595 -1.3330616 ]
 [ 0.37687085 -0.58900572  0.55369548  0.73603967]
 [-1.5124034   1.27101235 -1.5841973  -1.3330616 ]
 [-0.9220052   1.73601687 -1.07785427 -1.07442394]
 [ 0.37687085 -0.35650346  0.27239379  0.08944552]
 [-1.04008484 -1.75151702 -0.29020957 -0.29851096]
 [-1.04008484  0.80600783 -1.24663528 -1.07442394]]      species
114        2
62         1
33         0
107        2
7          0
100        2
40         0
86         1
76         1
71         1
134        2
51         1
73         1
54         1
63         1
37         0
78         1
90         1
45         0
16         0
121        2
66         1
24         0
8          0
126        2
22         0
44         0
97         1
93         1
26         0
Putting it all together, the complete preprocessing code is:

import pandas as pd
import numpy as np
 
df_iris = pd.read_csv("iris.csv")
print(df_iris.columns)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder  

#Encoding for dummy variables  
onehot_encoder= OneHotEncoder()    
X=onehot_encoder.fit_transform(df_iris["species"].values.reshape(-1,1))
print(X)
 
label_encoder_x= LabelEncoder()  
df_iris["species"]= label_encoder_x.fit_transform(df_iris["species"])  
 
x=df_iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y=df_iris[['species']]

from sklearn.model_selection import train_test_split  

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0) 
 
 
 
#Feature Scaling of datasets  
from sklearn.preprocessing import StandardScaler  
st_x= StandardScaler()  
x_train= st_x.fit_transform(x_train)  
x_test= st_x.transform(x_test)  
print(x_train,y_train)
print(x_test,y_test)

We hope you find the above code reusable for all your future endeavours in machine learning.

To conclude, data preprocessing is a very important step in machine learning and should be performed diligently.
