Contributed by: Balaji Sundararaman
What is Dimensionality?
In any Machine Learning project, it all starts with the problem statement. The problem statement may point towards a particular feature (the ‘Target’ feature) that we need to be able to predict, in which case it is a Supervised Learning problem. Otherwise, it becomes an Unsupervised Learning problem. In the case of Supervised Learning, we next come up with various hypotheses regarding the possible features that can help us predict the value (Regression) or class label (Classification) of the Target feature. These hypotheses determine the ‘width’ of the data, or the number of features (aka variables / columns) in our data. The number of observations or datapoints available across these features constitutes the ‘length’ of the data.
These two figures, the number of features and the number of observations across these features, are called the dimensions of the data or simply Dimensionality. For example, the dimensionality of a dataset with 1000 rows and 15 columns (features) is denoted as 1000 x 15 or (1000, 15).
What is Dimensionality Reduction?
You got it! Dimensionality Reduction is simply the reduction in the number of features or the number of observations or both, resulting in a dataset with a lower value of either or both dimensions. Intuitively, one might expect that the more hypothesized features we have, and the more observations across them, the better our prediction of the target feature. But that is not always the case. Let us now look at some of the common reasons why we need to consider Dimensionality Reduction.
From the perspective of building a predictive model using a machine learning algorithm, it is more common to reduce dimensionality by eliminating a subset of features than by eliminating entire observations. Though in this article we are only trying to understand the basic concepts around Dimensionality Reduction, it should be noted that in a predictive modelling exercise aimed at addressing real-world problems, the pros and cons of removing columns or observations have to be weighed carefully against the possible impact on the reliability and accuracy of the model. While there are no rules set in stone and there may be some rules of thumb available, each problem or domain may be unique, and the decisions have to be made in that particular context, with domain knowledge or experience.
Why Dimensionality Reduction
Some of the prominent reasons which compel us to go in for dimensionality reduction are:
Irrelevant Data / Missing Data
Based on our hypotheses about the possible features that may impact our prediction of the target feature, we set about collecting data. However, it is very rare that all datapoints or observations across all features are available at a single location or from a single source. More often than not, data from disparate sources and in different formats have to be stitched together to arrive at the final ‘clean’ or ‘tidy’ format of the data, where each row is a complete observation and each feature is contained in its own column.
This may result in irrelevant (from a predictive modelling perspective) or missing data. Examples of irrelevant data can be the Employee ID column, Loan Application Number, Account Number, Serial number etc.
Gaps in data collection, collation and other errors may result in missing data in features across many observations.
Dropping features due to irrelevance (e.g. ID columns) or a high proportion of missing values results in a reduction in the second dimension (columns). We can also delete observations (rows) which do not have values for all the features (are not ‘complete cases’), resulting in a reduction in the number of rows.
Features with zero/low variance
Continuous features with zero variance (a constant value throughout all the observations) or with very low variance do not contribute to the predictive capability of the model. Such features are prime candidates for being dropped from the dataset prior to modelling. For non-zero variance, it is a judgement call, based on domain knowledge and experience, as to the threshold value of variance below which such features should be dropped. An important point to note is that prior to comparing the variances of the continuous features, one should normalize the features. Continuous features with a narrow range (e.g. values between 1 and 5) will have a lower variance compared to features with a wide range (e.g. between 1 and 100,000).
With categorical features, the decision will be based on the relative proportion of the levels. Features with an overwhelming proportion of observations in one level may be dropped. Again, the exact proportion of minority levels below which a categorical feature can be dropped will be a judgement call based on experience and domain knowledge of the problem at hand.
Highly Correlated Features
Independent features that are highly correlated with each other are of limited utility in enhancing the predictive power of the model in supervised machine learning problems. In fact, in linear machine learning algorithms, high correlation between independent variables severely degrades model reliability and performance. Of a pair of highly correlated features, we can drop the one which has the lower correlation with the target variable and retain the other. The threshold value of the correlation coefficient for this method is not hard and fast, but as a rule of thumb 0.5 may be considered, above which we may drop one of the features.
Model Complexity, Over-fitting and Interpretability
Having a large number of features in the dataset taken as the input for a machine learning algorithm makes the resulting model very complex and frequently results in over-fitting. Such models perform very well in predictions on the training data, but the performance metrics dip drastically on test/unseen data. Features that are not strongly related to the variation of the target variable result in the model learning the noise as well, which adversely impacts performance on test data.
Internal stakeholders, or the external clients for whom the machine learning assignment is being executed, may expect to understand the interaction between the independent variables in the model and their impact on the prediction of the target feature. They may expect to be able to do a ‘what-if’ analysis using the model to check the impact of a change in each independent variable’s value on the target variable. Such interpretability may not be possible, or may be very difficult, with a huge number of independent features in the model. This problem is especially acute with Linear Regression models.
Preliminary Evaluation of modelling approach
In projects involving a huge number of observations (possibly running into millions of rows), or domains/problems not modelled before, it may be necessary to try out alternative modelling approaches to see which model (or combination of model predictions, referred to as an Ensemble) is likely to be best suited for the task at hand.
Given this challenge, it is prudent to try out the various approaches on a smaller subset of the overall observations in the data to save the time, cost and effort required in arriving at an optimal solution. However, it is important to ensure that the subset is large enough and representative of the characteristics of the whole dataset. Various sampling techniques and methods are available to achieve this, but these are outside the scope of this article.
Computational Cost, Capacity, Time & Storage Constraints
High-dimensionality datasets also come up against the constraints of computational cost, processing capacity, storage, and the time taken to train the models and generate predictions. Hence dimensionality reduction plays an important role in mitigating these challenges.
Curse of Dimensionality & Principle of Parsimony
All of the above-mentioned challenges and more, in dealing with high-dimensional data, are encapsulated in the term Curse of Dimensionality, coined by Richard E. Bellman. All the efforts to reduce dimensionality are geared towards the Principle of Parsimony, which advocates the simplest possible explanation or prediction of a phenomenon with the fewest possible variables and assumptions. From a Machine Learning / Predictive modelling perspective, this translates to developing the simplest possible model, i.e. the one with the least number of independent features that can deliver the same or an acceptably similar level of performance as a model trained on a larger number of independent features.
Basic Dimensionality Reduction Methods
Let’s now look at the Python implementation of some of the common and basic dimensionality reduction methods that are used in Machine Learning projects. These methods can be categorized based on when, or at what stage in the Machine Learning process flow, they are used.
- Methods listed under Feature Selection/Elimination below are usually employed during the data pre-processing stage or prior to the predictive modelling stage.
- Methods listed under Feature Importance involve fitting a predictive model to the dataset. The objective here is to identify the top independent features that contribute to the variations in the target feature. This is done either by starting with an individual feature and adding more to see the impact on model performance, or by starting with the full model and eliminating one feature at a time to assess the impact on model performance.
We will be using an adapted version, or a subset, of the Bike Rental Dataset from the UCI Machine Learning Repository to demonstrate these methods. The full dataset can be downloaded from here. The dataset has observations on the number of bike rentals on a given day, along with details of weather conditions like temperature, humidity and windspeed, whether the day is a holiday or a working day, and the season.
Feature Selection / Elimination (Pre-Modelling)
Dropping Features with Missing Values
Dropping features or observations with missing values should be the option of last resort; it is always advisable to go back to the data source and try to plug in the missing values or, if that is not possible, to impute them. However, dropping columns with a percentage or ratio of missing values above a threshold is also an option, though there is no hard and fast threshold cut-off.
In the code snippet below, we load the dataset and look at the first few observations, and can see that most features have missing values.
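A minimal sketch of this step is shown below; the file name bike_rental.csv is a placeholder for your own copy of the adapted dataset, not the article's actual file.

```python
import pandas as pd

# Load the adapted Bike Rental data (file name is a placeholder)
df = pd.read_csv('bike_rental.csv')

print(df.shape)   # dimensionality of the dataset: (rows, columns)
print(df.head())  # first few observations; several features show NaN values
```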
Let us check the feature-wise count of missing values as well as their proportion relative to the overall number of observations.
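A sketch of this check, continuing from the snippet above:

```python
# Count of missing values per feature and their share of the total observations
missing_count = df.isnull().sum()
missing_pct = df.isnull().mean() * 100  # proportion as a percentage

print(pd.DataFrame({'missing_count': missing_count,
                    'missing_pct': missing_pct.round(2)}))
```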
The bulk of the missing values seem to be in holiday and windspeed. If we decide to drop features with more than 40% of values missing, then these two would be the candidates. This can be achieved using the drop() function of the DataFrame object.
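One way to do this, naming the two features identified above and keeping the original DataFrame intact by assigning the result to a new variable:

```python
# Drop the two features whose missing-value share exceeds the 40% threshold
df_dropped = df.drop(columns=['holiday', 'windspeed'])
```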
For very high-dimensionality datasets, with features running into dozens or even hundreds, we can achieve the same result with the code below.
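A sketch of the programmatic alternative, where the 40% threshold is applied to every column instead of naming the columns explicitly:

```python
# Drop every feature whose missing-value proportion exceeds the threshold
threshold = 0.40
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_dropped = df.drop(columns=cols_to_drop)
```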
Dropping Features with Low Variance
As noted earlier, before we start comparing the variance of the features, we need to ensure that the feature values are normalized. Here we take a subset of the continuous features in the same dataset and normalize them.
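The sketch below assumes a fresh subset of continuous columns (the column names are taken from the dataset description) and uses a simple min-max rescaling; other normalization choices are equally valid.

```python
# Assumed subset of continuous features from the original dataset
cont_features = ['temp', 'atemp', 'hum', 'windspeed']
df_cont = df[cont_features]

# Min-max normalization to a common 0-1 scale so the variances are comparable
df_norm = (df_cont - df_cont.min()) / (df_cont.max() - df_cont.min())
```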
The code snippet below and its output show how we can set a variance threshold and drop features whose variance falls below it. Let’s say we want to retain only those features with a variance of more than 0.2; windspeed is dropped here.
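A sketch of the thresholding logic on the normalized subset; the exact variances you see will depend on the data and the normalization used.

```python
# Compare normalized variances against the chosen threshold
var_threshold = 0.2
variances = df_norm.var()
print(variances)

# Drop the features that fall below the threshold (windspeed in the article's example)
low_var_cols = variances[variances <= var_threshold].index
df_norm = df_norm.drop(columns=low_var_cols)
```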
Dropping Highly Correlated Features
We take another subset of the same dataset with a few continuous features. Based on a correlation coefficient threshold, say 0.4, we will drop features having a correlation coefficient higher than the threshold.
We will now check the correlation between the independent features. Since in this dataset the cnt feature is the target, we will save it to another variable so that it can be added back later, just prior to the modelling stage.
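A sketch of this step; the list of continuous columns (including the target cnt) is an assumption based on the features discussed in the article.

```python
# Assumed subset of continuous features plus the target
corr_cols = ['temp', 'atemp', 'hum', 'windspeed', 'weathersit', 'cnt']
df_corr = df[corr_cols]

target = df_corr['cnt']              # saved to add back just before modelling
X = df_corr.drop(columns=['cnt'])

# Pairwise correlations among the independent features
corr_matrix = X.corr()
print(corr_matrix.round(2))
```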
atemp/temp and hum/weathersit, with correlation coefficients of 0.99 and 0.42 respectively, seem to be the only two pairs having any substantial correlation.
We plot the heatmap of the correlation matrix below.
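One possible way to plot it, using seaborn on the correlation matrix computed above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the correlation matrix as an annotated heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```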
As we will be dropping features based on the strength of correlation and not direction, we take the absolute values of the correlation coefficients. Moreover, since the upper triangle of the correlation matrix is identical to the lower, we take only the values in the upper triangle.
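A sketch using numpy's triu to keep only the absolute values in the upper triangle (excluding the diagonal):

```python
import numpy as np

# Absolute correlations, masked so only the upper triangle remains
upper = corr_matrix.abs().where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
print(upper.round(2))
```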
We then drop those features which have a correlation coefficient of more than 0.4. In this case these features are atemp and hum.
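A sketch of the drop step. Note that which member of a correlated pair lands in the drop list depends on the column order (or, as discussed earlier, on each feature's correlation with the target), so the exact names may differ from the article's atemp and hum.

```python
# Drop any feature correlated above the threshold with another feature
corr_threshold = 0.4
to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
print(to_drop)                      # atemp and hum in the article's example
X_reduced = X.drop(columns=to_drop)
```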
Feature Importance (Model Performance Based)
The two methods mentioned below involve fitting a model to the data and selecting / eliminating features to view the impact on the model’s predictive performance, retaining those features that have the dominant influence. These two methods differ from the ones discussed so far in that the earlier methods did not involve fitting a model to the dataset, whereas now we do.
We use a subset of the same Bike Rental dataset.
Forward Feature Selection
The implementation in Python is very simple and takes barely a couple of lines of code. However, under the hood, the algorithm fits several models to the dataset. To begin with, individual features are used in the model and performance is evaluated. The best performing model with a single feature is taken, then options with a second feature added are tried out, and so on. We need to specify in the sequential feature selector constructor the final number of features required (lower than the total number of independent features in the dataset). Below, we have used the SequentialFeatureSelector class from the mlxtend module. Similar implementations are also available in the sklearn module.
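A minimal sketch with mlxtend's SequentialFeatureSelector; the feature subset, the dropna() step and the scoring/cv settings are illustrative assumptions rather than the article's exact configuration.

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Assumed subset of features; rows with missing values dropped for simplicity
model_df = df[['temp', 'atemp', 'hum', 'windspeed', 'weathersit', 'cnt']].dropna()
X = model_df.drop(columns=['cnt'])
y = model_df['cnt']

sfs_forward = SFS(LinearRegression(),
                  k_features=3,      # final number of features to retain
                  forward=True,      # add one feature at a time
                  scoring='r2',
                  cv=5)
sfs_forward = sfs_forward.fit(X, y)
```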
From this fitted feature selection model, we can now extract the top n best features using the k_feature_names_ attribute. We create a subset with lower dimensions from the original data using these selected features.
Since we had specified to the feature selector that we wanted only the top 3 features, k_feature_names_ contains the temp, atemp and hum features, using which we can create the new subset new_df with lower dimensionality that can then be used for modelling.
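Continuing the sketch above:

```python
# Names of the selected features and a lower-dimensional subset built from them
selected = list(sfs_forward.k_feature_names_)
print(selected)                 # e.g. ['temp', 'atemp', 'hum'] in the article

new_df = model_df[selected]     # target can be added back before modelling
```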
Backward Feature Elimination
Here we go the other way around, building a full model and then evaluating the model performance as features are eliminated. The features with the least impact on being removed from the model are dropped, until the specified number of features is left over. In code, the only difference is that we set the forward parameter of the SequentialFeatureSelector instance to False.
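The corresponding sketch, reusing X and y from the forward-selection snippet:

```python
# Backward elimination: start from the full feature set and remove one
# feature at a time; only the forward flag changes from the previous snippet
sfs_backward = SFS(LinearRegression(),
                   k_features=3,
                   forward=False,    # eliminate features instead of adding them
                   scoring='r2',
                   cv=5)
sfs_backward = sfs_backward.fit(X, y)
print(sfs_backward.k_feature_names_)
```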
It is common to get different sets of features in the forward and backward exercise with the features on the fringe of the cut-off number of features getting added or excluded.
In the above examples of model-based dimensionality reduction techniques, we chose Linear Regression as the model to be used for feature selection or elimination. However, many regression algorithms implemented in Python, like Random Forest, have built-in functionality for rank-ordering all the independent features based on importance scores. This output can be used to decide on pruning the number of features to decrease dimensionality in such cases.
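As an illustration (not part of the article's worked example), a sketch of importance-based ranking with scikit-learn's RandomForestRegressor:

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest and rank the features by their importance scores
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # low-scoring features are pruning candidates
```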
Summing Up
We have seen what Dimensionality is, the reasons why we need to reduce the dimensions of the data, and the basic methods of dimensionality reduction. We have also seen two broad categories of these basic methods, based on when in the project life cycle they are employed: pre-processing or modelling. These methods are basic in the sense that we were looking to reduce dimensionality simply by removing irrelevant or relatively unimportant features.
Do try out these methods on the dataset repositories available online. We would love to hear your feedback and comments.
In part 2 of this article, we will look at two advanced dimensionality reduction methods: Principal Component Analysis (PCA) and Factor Analysis (FA).