The correlation matrix measures the linear relationship between pairs of features in a dataset. It indicates how strongly, and in which direction, two features are related. A correlation value ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.
Correlation-based feature selection is best suited to problems where the relationships between features are expected to be linear. If you suspect non-linear relationships, you may need to explore other methods, such as model-based feature importance or feature engineering techniques.
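To see why linearity matters, here is a minimal sketch on made-up data. The names x, linear and quadratic are illustrative only and are not part of the bike sharing dataset. A perfectly linear relationship gives a correlation of about 1, while a perfectly deterministic but symmetric quadratic relationship gives a correlation of about 0.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: x runs symmetrically from -5 to 5.
x = pd.Series(np.arange(-5, 6), dtype=float)
linear = 3 * x + 2      # exact linear dependence on x
quadratic = x ** 2      # exact, but non-linear, dependence on x

# Series.corr computes the Pearson correlation by default.
r_linear = x.corr(linear)
r_quadratic = x.corr(quadratic)

print(f"linear: {r_linear:.2f}, quadratic: {r_quadratic:.2f}")
```

Even though quadratic is fully determined by x, Pearson correlation does not detect the relationship, which is why a low value should not be read as "unrelated".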
You can watch the video tutorial, with a step-by-step explanation, below.
Load the Dataset
First, we load the data.
import pandas as pd
df = pd.read_csv('data/bike sharing dataset.csv')
df.head()
Here the read_csv() function reads the CSV file 'bike sharing dataset.csv' from the 'data' directory into the DataFrame df.
The head() method then displays the first five rows of df by default.
Finding Correlation Matrix
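Next, we will create a correlation matrix of the dataset.
# numeric_only=True ignores non-numeric columns (e.g. date strings), which pandas 2.0+ would otherwise reject
corr = df.corr(numeric_only=True)
corr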
The code above calculates the correlation matrix for the features in the DataFrame df and stores it in the variable corr. Displaying corr shows the pairwise correlations between all the numeric features in the dataset.
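As a rough illustration of how to read the matrix, the sketch below builds a small DataFrame with invented values and pulls out the column of correlations with the target. The frame toy and its numbers are hypothetical and are not the bike sharing data. The matrix is symmetric with 1s on the diagonal, so a single column is enough to rank features against the target.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "temp": [10, 15, 20, 25, 30],
    "humidity": [80, 70, 65, 50, 45],
    "cnt": [120, 180, 260, 310, 400],
})

toy_corr = toy.corr()

# Correlation of each feature with the target, ordered by absolute strength.
target_corr = (
    toy_corr["cnt"]
    .drop("cnt")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the real dataset, the same pattern would be applied to corr with 'cnt' as the target column.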
Display Correlation Matrix
Next, we will display the correlation matrix as a heatmap, which makes it much easier to analyze.
import matplotlib.pyplot as plt
import seaborn as sns
# display correlation matrix in heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
First, we calculate the correlation matrix.
We set the figure size using plt.figure(figsize=(14, 9)) to make the heatmap larger and easier to read.
Then, we use sns.heatmap() to create the heatmap, passing the correlation matrix corr as the data. The annot=True argument writes the correlation value in each heatmap cell. The cmap='coolwarm' argument selects a diverging color map in which negative correlations appear blue and positive correlations appear red.
Finally, we use plt.show() to display the heatmap.
cnt is the target variable in this dataset.
From the heatmap we can infer that the casual and registered attributes are highly correlated with the target variable.
Attributes that are highly correlated with the target are treated as important, because they help predict it. Keep in mind that if cnt is simply the sum of casual and registered, as it is in the UCI version of the bike sharing data, those two columns leak the target and should not be used as input features.
Any attribute whose correlation with the target has an absolute value above 0.05 carries some importance. Here you can see that the temperature attribute has a positive correlation of 0.4.
Based on the hour attribute, you can also predict how many vehicles will be rented by users in a particular hour.
You can also infer that the attributes holiday, weekday and workingday are not very important, as the absolute values of their correlations with the target are below 0.05.
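The thresholding rule described above can be sketched as a small helper. The function name features_above_threshold and the toy data are hypothetical, and the 0.05 cutoff is simply the rule of thumb used in this article, not a universal standard.

```python
import pandas as pd

def features_above_threshold(df, target, threshold=0.05):
    """Return the features whose absolute correlation with `target` exceeds `threshold`."""
    target_corr = df.corr()[target].drop(target)
    return target_corr[target_corr.abs() > threshold].index.tolist()

# Hypothetical toy data, invented for illustration.
# "noise" is constructed so that its correlation with cnt is zero.
toy = pd.DataFrame({
    "temp": [10, 14, 21, 24, 31, 35],
    "noise": [1, 2, 3, 3, 2, 1],
    "cnt": [100, 200, 300, 400, 500, 600],
})

selected = features_above_threshold(toy, target="cnt")
print(selected)
```

Here only temp passes the cutoff, because noise has no linear relationship with cnt.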
To eliminate redundant input features, check the correlations among the features themselves. Here you can clearly see that the attributes atemp and temp have a correlation of 0.99, which is a very high positive value. If two features have a correlation above 0.7, you can drop either one of them, as both represent a similar pattern.
You can also see that yr (year) is highly correlated with instant. The instant attribute contains serial numbers, which are of little importance, so we can drop instant. Similarly, mnth (month) is highly correlated with season, so you can drop either one of them.
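One possible way to automate this step is sketched below. The helper drop_highly_correlated and the toy values are hypothetical. As a simplifying assumption, it always drops the later column of each highly correlated pair, whereas in practice you would choose which one to drop using domain knowledge, as with instant above.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.7):
    """Drop the later column of every pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data: atemp closely tracks temp, windspeed does not.
toy = pd.DataFrame({
    "temp": [0.2, 0.4, 0.5, 0.7, 0.9, 0.3],
    "atemp": [0.22, 0.41, 0.52, 0.69, 0.88, 0.31],
    "windspeed": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.7)
print(dropped)
print(reduced.columns.tolist())
```

Apply a helper like this to the input features only, not to the target column, so that the target is never a candidate for removal.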
You can study the correlation matrix more carefully and infer much more from it.
Final Thoughts
A correlation matrix allows you to quickly identify highly correlated features, which helps in spotting redundant or overlapping information.
By removing highly correlated features, you can reduce dimensionality, improve model interpretability, and potentially enhance model performance by reducing noise and overfitting.
Correlation matrix-based feature selection considers only pairwise relationships, so it may not account for the combined influence of multiple features on the target variable.
The Pearson correlation used here measures only linear association, and significance tests based on it assume approximately normally distributed variables. If these assumptions are violated, the correlation results may not be accurate or meaningful.
Features with a low correlation to the target may still be important if they have non-linear or complex relationships with it, which are not captured by correlation analysis alone.
It is important to consider domain knowledge, as well as the performance of the selected features in a chosen model, to ensure the most relevant and informative features are selected.
In summary, correlation matrix-based feature selection is a valuable technique for identifying and removing highly correlated features. However, it should be used as part of a broader feature selection strategy, alongside other methods and domain knowledge, to ensure a comprehensive and accurate selection of features for your specific machine learning problem.
In this article we have explored how to perform feature selection using a correlation matrix. In future articles we will explore different methods of feature selection.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm