
Feature Selection using Correlation Matrix (Numerical) | Machine Learning | Python

Author: Hackers Realm

The correlation matrix measures the linear relationship between pairs of features in a dataset. Each entry indicates how strongly, and in which direction, two features are related. A correlation value ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.
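To make the two ends of that range concrete, here is a minimal sketch on small made-up arrays (not the bike sharing data). The helper name `pearson` is only an illustration; it wraps NumPy's `corrcoef`.

```python
import numpy as np

# Made-up illustrative data, not taken from the bike sharing dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1      # rises with x along a perfect straight line
y_neg = -3 * x + 10    # falls with x along a perfect straight line


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson(x, y_pos)   # close to +1
r_neg = pearson(x, y_neg)   # close to -1
print(round(r_pos, 6), round(r_neg, 6))
```

Real features rarely sit exactly on a line, so in practice the values fall somewhere between these two extremes.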

Feature selection using Correlation Matrix

Correlation-based feature selection is best suited to problems where the relationships between features are expected to be linear. If you suspect non-linear relationships, you may need to explore other methods, such as model-based feature importance or feature engineering techniques.
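The limitation with non-linear relationships can be seen in a small sketch. The data here is made up for illustration: `y` is fully determined by `x`, yet the relationship is a parabola rather than a line, so the Pearson coefficient comes out close to zero.

```python
import numpy as np

# Made-up data: y depends on x completely, but not linearly
x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y = x ** 2

r = float(np.corrcoef(x, y)[0, 1])
print(f"Pearson r between x and x**2: {r:.4f}")
```

A feature like this would look unimportant in a correlation matrix even though it carries all the information about the target.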



You can also watch the video tutorial for a step-by-step explanation.


Load the Dataset


First, we load the data

import pandas as pd
# load the bike sharing dataset from the 'data' directory
df = pd.read_csv('data/bike sharing dataset.csv')
df.head()
  • We read the CSV file 'bike sharing dataset.csv' located in the 'data' directory into the DataFrame df using the read_csv() function

  • The head() method is called on df to display the first five rows of the DataFrame

First few rows of DataFrame



Finding Correlation Matrix


Next, we create the correlation matrix of the dataset

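# numeric_only=True skips any non-numeric columns, which pandas 2.x would otherwise reject
corr = df.corr(numeric_only=True)
corr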
  • The above code snippet calculates the correlation matrix for the numerical features in the DataFrame df and stores it in the variable corr. Evaluating corr displays the matrix, which shows the pairwise correlations between all the features in the dataset
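Recent pandas versions raise an error when `corr()` meets a text column unless `numeric_only=True` is passed. The sketch below shows this on a toy frame; the column names and values are hypothetical and only mimic the style of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy frame with one text column and three numeric columns
toy = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04"],
    "temp": [0.24, 0.22, 0.30, 0.35],
    "atemp": [0.29, 0.27, 0.33, 0.36],
    "cnt": [16, 40, 32, 55],
})

# numeric_only=True leaves the text column out of the calculation
toy_corr = toy.corr(numeric_only=True)
print(toy_corr.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, since every feature is perfectly correlated with itself.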

Correlation Matrix


Display Correlation Matrix


Next, we display the correlation matrix as a heatmap, which makes it much easier to analyze

import matplotlib.pyplot as plt
import seaborn as sns

# display correlation matrix in heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
  • First we will calculate the correlation matrix

  • We will set the figure size using plt.figure(figsize=(14, 9)) to make the heatmap larger and easier to read

  • Then, we use sns.heatmap() to create the heatmap, passing the correlation matrix corr as the data. The annot=True argument adds the correlation values to the heatmap cells. The cmap='coolwarm' argument sets the color map for the heatmap

  • Finally, we use plt.show() to display the heatmap

Heatmap of correlation matrix
  • cnt is the target variable in this dataset

  • From the heatmap we can infer that the casual and registered attributes are highly correlated with the target variable (in the UCI bike sharing data, cnt is simply the sum of these two columns)

  • Attributes that are highly correlated with the target are treated as important, because they help us predict the target variable

  • Any attribute whose correlation with the target is above +0.05 or below -0.05 has some importance for predicting it. Here you can see that the temperature attribute has a positive correlation of 0.4

  • Based on the hour attribute you can also predict how many bikes will be rented by users in a particular hour

  • You can also infer that the attributes holiday, weekday and workingday are not very important, as their correlations with the target lie between -0.05 and +0.05

  • To eliminate some of the input features, you should also check the correlations among the features themselves. Here you can clearly see that atemp and temp have a correlation of 0.99, which is a very high positive value. If two features have a correlation above 0.7 in absolute value, you can drop either one of them, as both represent a similar pattern

  • You can also see that yr (year) is highly correlated with instant. The instant attribute contains only serial numbers, which are of little importance, so we can drop instant. Similarly, mnth (month) is highly correlated with season, so you can drop either one of them

  • Studying the correlation matrix more carefully can reveal further information of this kind
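The 0.05 rule of thumb from the observations above can be sketched as a small helper. This is an illustration under assumptions, not the article's own code: the function name `rank_by_target_corr`, the column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_target_corr(df, target, threshold=0.05):
    """Return features whose absolute correlation with `target` exceeds
    `threshold`, sorted from strongest to weakest."""
    corr = df.corr(numeric_only=True)
    scores = corr[target].drop(target).abs()
    return scores[scores > threshold].sort_values(ascending=False)


# Synthetic stand-in data with hypothetical column names
t = np.arange(-5, 6, dtype=float)
demo = pd.DataFrame({
    "feature_a": t + np.cos(t),   # strongly related to the target
    "feature_b": t ** 2,          # no linear relation to the target
    "target": t,
})

kept = rank_by_target_corr(demo, "target")
print(kept)
```

On the tutorial's data, a call such as `rank_by_target_corr(df, "cnt")` would list the columns that pass the threshold. The cutoff itself is a judgment call and is worth tuning per problem.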



Final Thoughts

  • A correlation matrix allows you to quickly identify highly correlated features, which helps in spotting redundant or overlapping information

  • By removing highly correlated features, you can reduce dimensionality, improve model interpretability, and potentially enhance model performance by reducing noise and overfitting

  • Correlation matrix-based feature selection considers pairwise relationships, but it may not account for the combined influence of multiple features on the target variable

  • Pearson correlation, which this method relies on, captures only linear relationships and can be distorted by outliers or heavily skewed data. If the relationship is not linear, the correlation results may not be accurate or meaningful

  • Features with low correlation may still be important if they have non-linear or complex relationships with the target variable, which are not captured by correlation analysis alone

  • It is important to consider domain knowledge, as well as the performance of the selected features in a chosen model, to ensure the most relevant and informative features are selected

  • In summary, correlation matrix-based feature selection is a valuable technique to identify and remove highly correlated features. However, it should be used as part of a broader feature selection strategy, considering other methods and domain knowledge, to ensure a comprehensive and accurate selection of features for your specific machine learning problem
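Removing highly correlated features can be sketched as follows. This is one possible approach, not the article's own code: the helper name `redundant_features`, the 0.7 default, and the synthetic columns are assumptions for illustration. It inspects only the upper triangle of the matrix, so each pair is checked once and the later column of a correlated pair is flagged.

```python
import numpy as np
import pandas as pd


def redundant_features(df, threshold=0.7):
    """List columns whose absolute correlation with some earlier column
    exceeds `threshold`, so one feature of each such pair can be dropped."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic stand-in data with hypothetical column names
t = np.arange(-5, 6, dtype=float)
demo = pd.DataFrame({
    "a": t,
    "a_copy": 2 * t + 1,   # perfectly correlated with "a"
    "b": t ** 2,           # not linearly related to "a"
})

to_drop = redundant_features(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Which member of a pair to keep is a judgment call. This sketch always keeps the earlier column, whereas domain knowledge may suggest otherwise, and the target column should be excluded before running it.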

In this article we have explored how to perform feature selection using a correlation matrix. In future articles we will explore other feature selection methods.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
