- Kinds of categorical data
- Python package to do the job
- 1. Ordinal Encoding:
- 2. One Hot Encoding:
- 3. Label Encoder for binary class columns:
- 4. Feature Hashing (or Hashing Trick):
- 5. Binary Encoding:
- 6. Target Encoder:
- How did we get these encoded values for class variables based on information from the target variable?
- Benefits:
- Limitations:
- Sample code on GitHub:
- Conclusion
Feature engineering is a crucial step in building a performant machine learning model. Understanding categorical variables and encoding them with the right techniques is paramount during the data cleaning and preparation stage. A survey published in Forbes says that data preparation accounts for about 80% of data scientists’ work, with 60% of their time spent cleaning and organizing data.
This article walks through various encoding techniques in Python, with examples, to better understand how to transform categorical data and make it model-ready.
Contributed by: Sheikh Mohamed
Kinds of categorical data
- Ordinal data: the categories have an inherent order (for example, college degrees: Bachelors, Masters, PhD)
- Nominal data: the categories have no inherent order (for example, Indian cities: Chennai, Bangalore, Mumbai)
When encoding ordinal data, please keep in mind that we need to retain the order’s information.
Python package to do the job
For encoding categorical data, we have the Python package category_encoders. The following command installs it in a Jupyter notebook. For binary class encoding, we can use the pandas.Categorical() function from the pandas package, which we will discuss shortly.
!pip install category_encoders
1. Ordinal Encoding:
Ordinal encoding is applied to categories where we need to preserve information about the order. Let’s look at an example to understand this better.
Code snippet to import the required libraries, create a data frame with a Degree column, and apply ordinal encoding:
import category_encoders as ce
import pandas as pd

# Create a data frame with a Degree column
train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'Bachelors',
                                    'Masters', 'Phd', 'High school', 'High school']})

# Create an OrdinalEncoder object with an explicit mapping that preserves the order
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'None': 0, 'High school': 1, 'Diploma': 2,
                                                  'Bachelors': 3, 'Masters': 4, 'Phd': 5}}])

# Original data
train_df

# Fit and transform the train data
df_train_transformed = encoder.fit_transform(train_df)
df_train_transformed
2. One Hot Encoding:
One hot encoding is applied to nominal categories (no order associated). It creates a dummy variable for each level of the variable; each dummy variable takes a value of either 0 or 1, where 0 represents the absence of the category and 1 represents its presence.
One hot encoding is not an ideal technique for high cardinality variables: the many dummy variables created make the model computationally intensive and induce sparsity (far more zero than non-zero values). A hash encoder (feature hashing) is ideal in such cases, which we will discuss shortly.
Let’s look at a Python example to understand this better.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Bangalore', 'Delhi']})

# Create object for one hot encoding
encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan', return_df=True, use_cat_names=True)

# Original data
data

# Fit and transform the data
data_encoded = encoder.fit_transform(data)
data_encoded
The pandas pd.get_dummies() function also does the same job, with the drop_first argument set to False.

# Encode the data
data_encoded = pd.get_dummies(data=data, drop_first=False)
data_encoded
drop_first set to True will get k-1 dummies out of k categorical levels by removing the first level.
# Encode the data
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded
In this example, we can see that the Bangalore level has been dropped: a row of all zeros now represents Bangalore. For more information on the get_dummies function, refer to pandas.get_dummies — pandas 1.2.1 documentation (pydata.org).
3. Label Encoder for binary class columns:
For binary class variables, i.e., variables with only two levels, we can apply the pandas.Categorical() function for encoding:

import pandas as pd

train_df = pd.DataFrame({'Gender': ['male', 'female']})

# To encode, call pd.Categorical and take the category codes
train_df['Gender'] = pd.Categorical(train_df['Gender']).codes
train_df

# female -> 0
# male   -> 1

By default, pd.Categorical orders the levels alphabetically, which is why female is encoded as 0 and male as 1.
4. Feature Hashing (or Hashing Trick):
Feature hashing is a one hot encoding-like technique, but with fewer dimensions. The user can fix the number of dimensions after transformation using the n_components argument. Here is what it means: a feature with five categories can be represented using N new features, and a high cardinality feature with, say, 200 levels can likewise be transformed into a much smaller set of new features.
You can explore this technique when you come across high cardinality features; otherwise, one hot encoding is sufficient.
By default, the hashing encoder uses the md5 hashing algorithm, but the user can pass any algorithm of their choice using the hash_method argument; a short illustration follows the example below.
Let’s see it in action with an example:
import category_encoders as ce
import pandas as pd

# Create the data frame
data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February',
                               'June', 'July', 'June', 'September']})

# Create object for the hashing encoder with 6 output dimensions
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and transform the data
encoder.fit_transform(data)
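To illustrate the hash_method argument mentioned above, here is a minimal sketch; hash_method accepts any algorithm name understood by Python’s hashlib, and sha256 is just one assumed choice:

# Same encoder, but hashing with SHA-256 instead of the default md5
encoder_sha = ce.HashingEncoder(cols='Month', n_components=6, hash_method='sha256')
encoder_sha.fit_transform(data)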
An issue faced by hashing encoders is collision: when a large number of features are transformed into fewer dimensions, multiple distinct values can be represented by the same hash value. This is known as a collision.
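We can force a collision by squeezing the months above into only two output columns; by the pigeonhole principle, several distinct months must then share the same encoding. A minimal sketch, reusing the data frame from the previous example:

# With only 2 output columns, 7 distinct months cannot all get unique encodings
tight_encoder = ce.HashingEncoder(cols='Month', n_components=2)
tight_encoder.fit_transform(data)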
5. Binary Encoding:
Binary encoding is similar to one hot encoding, but it stores the categories as binary bit strings, so k levels need only on the order of log2(k) columns rather than k.
Binary encoding is an ideal choice for categories with many levels.
# Import the libraries
import category_encoders as ce
import pandas as pd

# Create the data frame
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})

# Create object for binary encoding
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)

# Original data
data

# Fit and transform the data
data_encoded = encoder.fit_transform(data)
data_encoded
6. Target Encoder:
Target encoding is a Bayesian encoding technique that uses information from the dependent (target) variable to encode the categorical variables.
For a continuous target: each category is replaced with a blend of the expected value of the target given that particular categorical value and the expected value of the target over all the training data.
For a categorical target: each category is replaced with a blend of the posterior probability of the target given that particular categorical value and the prior probability of the target over all the training data.
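One common way to write this blend (my notation, not the library’s) is:

encoded(c) = w(n_c) * posterior(c) + (1 - w(n_c)) * prior

where n_c is the number of training rows in category c and w is a weight that grows toward 1 as n_c increases, so frequent categories lean on their own statistics while rare ones fall back to the global prior.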
Please refer to the documentation here – Target Encoder — Category Encoders 2.2.2 documentation (scikit-learn.org).
We will look at the case of categorical targets with a simple example to better understand what the above statement means.
# Import the libraries
import pandas as pd
import category_encoders as ce

# Create the data frame with a class column and a binary target
Data = pd.DataFrame({'class': ['Chennai', 'Bangalore', 'Chennai', 'Chennai', 'Bangalore'],
                     'Target': [1, 0, 1, 0, 1]})

# Create the target encoding object
encoder = ce.TargetEncoder(cols='class')

# Original data
Data
We have a data frame with two classes (Chennai and Bangalore) and a binary target variable.
# Fit and transform the train data
encoder.fit_transform(Data['class'], Data['Target'])
How did we get these encoded values for class variables based on information from the target variable?
Let’s walk through the process behind these values in detail.
- Group the data by each category and count the occurrences of each target value.

# Count the occurrences of each target value within each class
Data.groupby(['class', 'Target']).size()
- Next, calculate the probability of Target 1 occurring given each specific city. From the original data: the Target 1 probability for Chennai = 2/3 ≈ 0.66, and the Target 1 probability for Bangalore = 1/2 = 0.5.
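These per-city posteriors can be verified directly with pandas, since the mean of a 0/1 target is the probability of 1:

# Posterior probability of Target 1 for each city
Data.groupby('class')['Target'].mean()
# Bangalore    0.500000
# Chennai      0.666667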
When you look at the encoded values for the class column, they are slightly different, because, as the documentation says, the value is a “blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data”.
What we calculated above is the posterior probability. The target encoder also looks at the prior probability of Target 1 over all the training data, which in our case is 3/5 = 0.6, and uses this to smooth the encoded value.
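As a rough sketch of the smoothing, the following mirrors the sigmoid blend used by category_encoders’ TargetEncoder at its default parameters; treat the exact formula and defaults as assumptions about the library version:

import numpy as np

prior = Data['Target'].mean()                      # 3/5 = 0.6
stats = Data.groupby('class')['Target'].agg(['count', 'mean'])

# Weight grows toward 1 as the category count grows, so frequent categories
# lean on their own posterior while rare ones fall back to the prior
min_samples_leaf, smoothing = 1, 1.0               # assumed library defaults
weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))

encoded = prior * (1 - weight) + stats['mean'] * weight
encoded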
Benefits:
- A quick and efficient encoding method, as categories are encoded by capturing information from the target variable.
- Doesn’t add to the dimensionality of the dataset.
Limitations:
- Since target encoding depends entirely on the distribution of the target variable, careful validation is crucial, as naive use may lead to data leakage and overfitting; a simple safeguard is sketched below.
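One common safeguard is to fit the encoder on the training split only and merely transform the held-out split. A minimal sketch, where train_X, train_y, and test_X are hypothetical splits of your data:

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['class'])
# Learn the encodings from the training data only
train_encoded = encoder.fit_transform(train_X, train_y)
# Apply them unchanged to held-out data, so test targets never leak into the encoding
test_encoded = encoder.transform(test_X)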
Sample code on GitHub:
If you would like to refer to the code described in this article, I’ve uploaded the Python Jupyter notebook on GitHub.
Conclusion
We have gone through some of the encoding techniques predominant in the industry. I strongly recommend exploring these and the other techniques available in the category_encoders Python package. If you wish to learn more about Python, you can join the Python for Machine Learning free course offered by Great Learning Academy. You can also check out the wide range of courses offered on Great Learning Academy and learn in-demand skills today.