Ensemble techniques are a popular approach in machine learning and data science that combine multiple models to improve predictive performance. The fundamental idea is that, by aggregating the predictions of several individual models, the strengths of each model can compensate for the weaknesses of the others, resulting in more robust and accurate predictions. This tutorial implements three such techniques in Python.
Ensemble techniques can improve predictive performance compared to individual models because they tend to reduce overfitting, increase model stability, and capture complex relationships present in the data.
Load the dataset
Let us first read the data.
import pandas as pd
df = pd.read_csv('data/winequality.csv')
df = df.drop(columns=['type'], axis=1)
df = df.fillna(-2)
df.head()
pd.read_csv('data/winequality.csv') reads the CSV file 'winequality.csv' located in the 'data' directory and loads it into a pandas DataFrame named 'df'.
df = df.drop(columns=['type'], axis=1) drops the column named 'type' from the DataFrame 'df'. The drop() function removes rows or columns from a DataFrame. The columns parameter lists the column names to drop; because columns is given explicitly, axis=1 is redundant here and could be omitted.
df = df.fillna(-2) fills any missing values (NaN) in the DataFrame 'df' with the value '-2'. The fillna() function is used to handle missing data by replacing NaN or null values with specified values.
df.head() displays the first few rows of the DataFrame 'df'. The head() function is used to view the top rows of the DataFrame, which gives a quick preview of its contents.
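Before filling missing values, it can help to check how many there are. The following is a minimal sketch on a small made-up DataFrame; the name toy and its values are hypothetical stand-ins, not the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "type": ["white", "red", "white"],
    "pH": [3.0, np.nan, 3.3],
    "quality": [6, 5, 7],
})

# Count missing values per column before filling
missing_before = toy.isna().sum()

# Same preprocessing as above: drop 'type', then fill NaN with the sentinel -2
clean = toy.drop(columns=["type"]).fillna(-2)
missing_after = clean.isna().sum()

print(missing_before)
print(missing_after)
```

Filling with a sentinel such as -2 keeps every row, but the model will treat -2 as a real measurement, so it is worth knowing how many values were affected.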
Next, let us separate the input features from the target.
X = df.drop(columns=['quality'], axis=1)
y = df['quality']
Create a new DataFrame 'X' by dropping the 'quality' column from the original DataFrame 'df'. The columns parameter names the column to remove, so 'X' contains all the features except 'quality'.
Next create a new Series 'y' by extracting the 'quality' column from the original DataFrame 'df'. The 'quality' column represents the target variable, which is the variable we want to predict. Therefore, 'y' will contain the target values.
Next, let us split the data into training and testing sets for building and evaluating a machine learning model.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
Import the train_test_split function from the sklearn.model_selection module, which is part of scikit-learn, a popular machine learning library in Python.
train_test_split() splits the data into training and testing sets. The parameters used in the function call are as follows:
X: The DataFrame containing the input features (independent variables) for the machine learning model. This would be the DataFrame 'X' that you created earlier, which contains all the features except the 'quality' column.
y: The Series containing the target variable (dependent variable) you want to predict. This would be the Series 'y' that you created earlier, which contains the 'quality' column.
test_size=0.25: This parameter specifies the proportion of the data that should be used for testing. Here, it's set to 0.25, which means 25% of the data will be used for testing, and the remaining 75% will be used for training.
random_state=42: This parameter is used to set a random seed, ensuring reproducibility. When you set a fixed random_state, the data will be split the same way each time you run the code.
stratify=y: This parameter is used for stratified sampling. It ensures that the class distribution in the target variable 'y' is maintained in both the training and testing sets. This is particularly useful when dealing with imbalanced datasets, where some classes may have significantly fewer samples than others.
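The effect of stratify can be checked on a small example. This sketch uses made-up imbalanced labels (X_demo and y_demo are hypothetical, not the wine data) and compares the share of the minority class in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Share of class 1 in each split; stratification keeps both at the original 20%
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Note that stratified splitting needs at least two samples of every class, otherwise train_test_split raises an error.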
Next, let us train a baseline model on the training set.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.4664615384615385
Import the LogisticRegression class from the sklearn.linear_model module. Logistic regression is a common algorithm used for binary and multi-class classification tasks.
model = LogisticRegression() creates an instance of the LogisticRegression class and initializes it with default hyperparameters. The model is now ready to be trained on your training data.
model.fit(x_train, y_train) trains the logistic regression model using the fit method. The x_train variable contains the input features for training, and y_train contains the corresponding target values. The model will learn the relationship between the input features and the target variable during this training process.
model.score(x_test, y_test) evaluates the trained model's performance on the test dataset using the score method. The x_test variable contains the input features for testing, and y_test contains the corresponding true target values. The score method calculates the mean accuracy of the model on the test data.
Running this snippet returns 0.4664615384615385, the mean accuracy of the logistic regression model on the test dataset. In other words, about 47% of the test samples were classified correctly. Accuracy is a common metric for classification models and is the proportion of correctly predicted samples out of all the samples in the test dataset. This single-model score serves as a baseline for the ensembles that follow.
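To make the meaning of score concrete, the sketch below recomputes accuracy by hand and compares it with score. It runs on synthetic data from make_classification, which is an assumption made so the example is self-contained; the names clf, X_demo, and y_demo are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Accuracy by hand: fraction of test samples whose predicted label matches the true label
manual_accuracy = np.mean(clf.predict(X_te) == y_te)
print(manual_accuracy, clf.score(X_te, y_te))
```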
Different Ensemble Techniques
1) Voting Classifier
A Voting Classifier is an ensemble machine learning technique that combines multiple individual classifiers (models) to make predictions by aggregating their outputs. It allows different models to "vote" on the final prediction, either by majority class or by averaged probabilities. scikit-learn's VotingClassifier handles classification tasks; for regression, scikit-learn provides a separate VotingRegressor that averages the predictions of the individual regressors.
Hard Voting Classifier: In hard voting, each individual classifier predicts the class label, and the final prediction is based on the majority vote. For example, if three individual classifiers predict the classes A, A, and B for a particular instance, the hard voting classifier would predict class A because it is the majority vote.
Soft Voting Classifier: In soft voting, each individual classifier produces a probability distribution over all possible classes for the input instance. The class probabilities are averaged across the individual classifiers, and the class with the highest average probability becomes the final prediction. Soft voting often works better than hard voting, as it takes into account the confidence levels of individual classifiers in their predictions.
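The two voting rules can disagree. The sketch below works through a single instance with made-up probabilities from three classifiers (the numbers are hypothetical), where two weakly confident votes for A are outweighed by one strongly confident vote for B under soft voting.

```python
import numpy as np

classes = np.array(["A", "B"])

# Hypothetical predicted probabilities from three classifiers for one instance
probas = np.array([
    [0.55, 0.45],
    [0.60, 0.40],
    [0.10, 0.90],
])

# Hard voting: each classifier votes for its most likely class, majority wins
votes = classes[np.argmax(probas, axis=1)]
labels, counts = np.unique(votes, return_counts=True)
hard_prediction = labels[np.argmax(counts)]

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = classes[np.argmax(mean_proba)]

print(hard_prediction, soft_prediction)  # A B
```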
Let us see how voting classifier works
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()
# voting='soft' averages the predicted probabilities; voting='hard' takes the majority class
model = VotingClassifier(estimators=[('lr', model1), ('kn', model2), ('rf', model3)], voting='soft')
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.6344615384615384
First, import the necessary classes: VotingClassifier, RandomForestClassifier, LogisticRegression, and KNeighborsClassifier.
Instantiate three base classifiers: LogisticRegression, KNeighborsClassifier, and RandomForestClassifier. These will be the individual models used within the Voting Classifier.
model = VotingClassifier(estimators=[('lr', model1), ('kn', model2), ('rf', model3)], voting='soft') creates the Voting Classifier instance. The estimators parameter is set to a list of tuples, where each tuple contains a string identifier and the corresponding classifier instance. The string identifier is used to label each individual model, and it will be helpful when inspecting the results or accessing individual classifiers later. The voting parameter is set to 'soft', indicating that soft voting (based on probabilities) will be used.
The Voting Classifier is trained using the fit method with the training data (x_train and y_train).
The score method evaluates the Voting Classifier on the test dataset (x_test and y_test) and returns its mean accuracy. Here it returns 0.6344615384615384, compared with 0.4664615384615385 for logistic regression alone.
The soft voting approach is often preferred as it considers the confidence levels of individual classifiers, leading to better overall performance in many cases. However, you can also use 'hard' for hard voting (majority voting) by setting voting='hard' when creating the Voting Classifier.
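To compare the two modes directly, the following sketch fits the same three base models with voting='hard' and voting='soft' and then looks up a fitted base model by its string identifier. It uses synthetic data from make_classification as a stand-in (an assumption, so the scores will not match the wine results), and the names scores, fitted, and rf_fitted are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("kn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
]

scores = {}
fitted = {}
for voting in ("hard", "soft"):
    # VotingClassifier fits clones, so the same estimator list can be reused
    ensemble = VotingClassifier(estimators=estimators, voting=voting)
    ensemble.fit(X_tr, y_tr)
    scores[voting] = ensemble.score(X_te, y_te)
    fitted[voting] = ensemble

print(scores)

# Fitted base models can be looked up by the identifiers given in `estimators`
rf_fitted = fitted["soft"].named_estimators_["rf"]
print(type(rf_fitted).__name__)
```

Which mode scores higher depends on the data and the base models, so it is worth trying both.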
2) Averaging
Averaging is a simple and effective ensemble technique used in machine learning to improve the predictive performance of models. It involves combining the predictions from multiple individual models by averaging their outputs. Averaging can be applied to both regression and classification problems.
Let us see how we can implement averaging
model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
final_pred = (pred1+pred2+pred3)/3
The three classifiers (model1, model2, and model3) are initialized and trained on the training data.
The predict_proba method returns a 2-D array with one row per instance in x_test and one column per class, where each entry is the predicted probability of that class. Since you have three classifiers, you get three arrays (pred1, pred2, and pred3) of the same shape.
Next, you are averaging the predicted probabilities from all three models by element-wise addition and then dividing by the total number of models (3 in this case).
The resulting final_pred will contain the averaged class probabilities for each instance in x_test. To get the final predicted class label for each instance, you can choose the class with the highest probability.
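Averaging probabilities by hand is the same computation that soft voting performs. The sketch below checks this on synthetic data (an assumption, used so the example is self-contained), with random_state fixed on the random forest so both fits are reproducible; the names manual, voter, and same are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

lr = LogisticRegression(max_iter=1000)
kn = KNeighborsClassifier()
rf = RandomForestClassifier(random_state=0)

# Manual averaging: fit each model, then average the three probability arrays
manual = sum(m.fit(X_tr, y_tr).predict_proba(X_te) for m in (lr, kn, rf)) / 3

# Soft voting with the same base models
voter = VotingClassifier(
    estimators=[("lr", lr), ("kn", kn), ("rf", rf)], voting="soft"
)
voter.fit(X_tr, y_tr)

# The averaged probabilities should agree up to floating-point error
same = np.allclose(manual, voter.predict_proba(X_te))
print(same)
```

Each row of the averaged array still sums to 1, so it can be read as a probability distribution over the classes.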
Next let us visualize the different classes
import seaborn as sns
sns.countplot(x=y)
We can see that there are 7 different classes.
Next let us display final prediction
final_pred
Since there are 7 classes, each row of final_pred contains 7 probabilities, one per class.
Next we will get the actual class result
import numpy as np
pred = []
for res in final_pred:
    pred.append(np.argmax(res)+3)
Initialize an empty list called pred to store the final predicted class labels.
Iterate over each row of the final_pred array.
For each row (res), find the index of the class with the highest probability using the np.argmax function.
The np.argmax function returns the index of the maximum value in the array res, which corresponds to the class with the highest probability.
Add 3 to the index to obtain the final predicted class label, because the class labels in this dataset start from 3 while array indices start from 0.
Append the final predicted class label to the pred list.
After the loop, the list pred will contain the final predicted class labels for each instance in x_test. Note that the +3 is specific to this dataset, where the class labels start from 3 and are consecutive integers. If your class labels start from a different value, adjust the offset accordingly; if they are not consecutive, a fixed offset will not work and the labels have to be looked up instead.
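A more general alternative to the fixed offset is to look the labels up in the classes_ attribute of a fitted classifier, whose order matches the columns of predict_proba. The sketch below uses a hand-written classes array and made-up probabilities (both hypothetical) to show the lookup.

```python
import numpy as np

# Hypothetical class labels, in the order a fitted classifier stores them in `classes_`
classes = np.array([3, 4, 5, 6, 7, 8, 9])

# Hypothetical averaged probabilities for two instances (each row sums to 1)
final_pred_demo = np.array([
    [0.05, 0.05, 0.50, 0.20, 0.10, 0.05, 0.05],
    [0.00, 0.05, 0.10, 0.25, 0.45, 0.10, 0.05],
])

# Column index of the largest probability in each row
best_columns = np.argmax(final_pred_demo, axis=1)

# Look the labels up instead of adding a fixed offset
pred_demo = classes[best_columns]
print(pred_demo)  # [5 7]
```

With the models from this article, model1.classes_ would play the role of classes, and the result matches the +3 version whenever the labels are the consecutive integers 3 to 9.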
Next let us calculate the accuracy of the final predicted class labels.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
0.6350769230769231
Import the accuracy_score function from scikit-learn.
Calculate the accuracy of the final predicted class labels (pred) compared to the true class labels from the test dataset (y_test).
The accuracy_score function takes two arguments: the true class labels (y_test) and the predicted class labels (pred). It then compares the predicted labels to the true labels and calculates the proportion of correct predictions (accuracy) over all instances in the test dataset.
The returned value from accuracy_score will be a floating-point number representing the accuracy of the model on the test dataset. The accuracy score ranges from 0 to 1, where 1 indicates that all predictions are correct, and 0 means that none of the predictions are correct.
By using this metric, you can assess the performance of your ensemble model with the averaged predictions (pred) on the test data (x_test and y_test). A higher accuracy score generally indicates better predictive performance, but it's essential to consider other evaluation metrics as well, depending on the specific requirements and characteristics of your problem.
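As one illustration of why a second metric can matter, the sketch below compares accuracy with balanced accuracy (the mean of per-class recall) on made-up imbalanced labels. The choice of balanced_accuracy_score is an example picked here, not something the analysis above used.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 9 samples of class 0 and 1 of class 1
y_true_demo = [0] * 9 + [1]
# A model that always predicts the majority class
y_pred_demo = [0] * 10

acc = accuracy_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)
print(acc, bal_acc)  # 0.9 0.5
```

Accuracy looks high because the majority class dominates, while balanced accuracy shows that the minority class is never predicted correctly.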
3) Weighted Average
Weighted average, also known as weighted mean or weighted sum, is a statistical method that assigns different weights to the data points before calculating the average. The weights represent the relative importance or significance of each data point in the average. In the context of ensemble techniques in machine learning, weighted averaging is used to combine the predictions from individual models while giving different importance to each model's contribution.
The weighted averaging approach gives you greater flexibility in tailoring the ensemble to suit the specific characteristics of your data and the performance of individual models.
Let us see how we can implement weighted average
model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
final_pred = pred1*0.25 + pred2*0.25 + pred3*0.5
The three classifiers (model1, model2, and model3) are initialized and trained on the training data (x_train and y_train).
The predict_proba method returns a 2-D array with one row per instance in x_test and one column per class, where each entry is the predicted probability of that class. Since you have three classifiers, you get three arrays (pred1, pred2, and pred3) of the same shape.
Next, multiply the predicted probabilities from each model element-wise by the corresponding weight (0.25, 0.25, and 0.5, respectively) and sum the results. Because the weights already sum to 1 (0.25 + 0.25 + 0.5 = 1), no further normalization is needed. If the weights did not sum to 1, you would divide the result by the sum of the weights.
The resulting final_pred will contain the final weighted averaged class probabilities for each instance in x_test. To get the final predicted class label for each instance, you can choose the class with the highest probability.
Weighted averaging allows you to give different importance to each model's prediction based on their relative performance or other considerations. By tuning the weights, you can optimize the ensemble's performance on your specific problem.
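The weighted sum can also be written with np.average, which divides by the sum of the weights automatically. The sketch below uses made-up probabilities for a single instance (hypothetical values) and shows that weights of 1, 1, 2 give the same result as 0.25, 0.25, 0.5.

```python
import numpy as np

# Hypothetical probabilities from three models for one instance over two classes
pred1_demo = np.array([[0.6, 0.4]])
pred2_demo = np.array([[0.7, 0.3]])
pred3_demo = np.array([[0.2, 0.8]])

# Weights that sum to 1: a plain weighted sum is already normalized
manual = pred1_demo * 0.25 + pred2_demo * 0.25 + pred3_demo * 0.5

# np.average divides by the sum of the weights, so they need not sum to 1
via_numpy = np.average(
    [pred1_demo, pred2_demo, pred3_demo], axis=0, weights=[1, 1, 2]
)

print(manual)
print(via_numpy)
```

Here the third model's stronger weight moves the final prediction to the second class, even though two of the three models favor the first.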
Next we will get the actual class result.
import numpy as np
pred = []
for res in final_pred:
    pred.append(np.argmax(res)+3)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
0.6652307692307692
The loop works exactly as in the averaging section. For each row of final_pred, np.argmax returns the index of the class with the highest probability, and adding 3 converts that index into a class label, since the class labels in this dataset start from 3. After the loop, the list pred contains the final predicted class label for each instance in x_test.
The accuracy_score function then compares the predicted class labels (pred) with the true class labels (y_test) and returns the proportion of correct predictions over all instances in the test dataset, which is 0.6652307692307692 here.
By using this metric, you can assess the performance of your ensemble model with the weighted averaged predictions (pred) on the test data (x_test and y_test). A higher accuracy score generally indicates better predictive performance, but it's essential to consider other evaluation metrics as well, depending on the specific requirements and characteristics of your problem.
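The weights 0.25, 0.25, and 0.5 used above were fixed by hand. One possible way to choose them is a small grid search on a held-out validation split, sketched below on synthetic data. The data, the candidate weights 1 to 3, and the names best_weights and best_score are all assumptions made for illustration.

```python
import itertools

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    RandomForestClassifier(random_state=0),
]
# Validation-set probabilities from each fitted model
probas = [m.fit(X_fit, y_fit).predict_proba(X_val) for m in models]
classes = models[0].classes_

best_weights, best_score = None, -1.0
# Try every combination of integer weights 1..3 for the three models
for weights in itertools.product([1, 2, 3], repeat=3):
    blended = np.average(probas, axis=0, weights=weights)
    score = accuracy_score(y_val, classes[np.argmax(blended, axis=1)])
    if score > best_score:
        best_weights, best_score = weights, score

print(best_weights, best_score)
```

Weights picked this way are tuned to the validation split, so the final accuracy should be confirmed on data that was not used for the search.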
Final Thoughts
Ensembles often outperform individual models, especially when combining diverse models that capture different aspects of the data. By leveraging the collective knowledge of multiple models, ensembles can achieve higher accuracy and robustness.
Ensemble methods can reduce overfitting, particularly in complex models, by combining several models with different sources of error. This diversity helps to smooth out the individual models' errors, leading to more stable and reliable predictions.
For ensembles to be effective, it's crucial to ensure that the individual models are diverse. Diversity can be achieved through using different algorithms, varying hyperparameters, or employing different subsets of the training data for each model.
There are various ensemble techniques, such as Bagging, Boosting, Stacking, and Voting, each with its strengths and suitable use cases. The choice of ensemble method depends on the problem, dataset, and models used.
Ensembles can be computationally expensive and may require more resources for training and prediction compared to single models. However, the trade-off between computational cost and performance gain is often worthwhile for significant improvements in accuracy.
Ensemble methods tend to be more effective on large and diverse datasets. On small datasets, the improvement might be limited, and overfitting can still occur if individual models are too complex.
Ensemble models can be more challenging to interpret than individual models since they combine multiple predictions. This can make it harder to understand the underlying decision-making process.
While diversity is generally beneficial, the inclusion of unreliable or poorly performing models in an ensemble can lead to a degradation in overall performance. Careful model selection is essential to ensure that all included models add value to the ensemble.
Ensembles often have additional hyperparameters to tune, such as the number of base models, their weights, or specific parameters for each ensemble technique. Careful hyperparameter tuning is essential for optimizing ensemble performance.
In this article, we have seen how ensembles can be a key component in building accurate and reliable predictive models for a wide range of real-world applications. We covered three techniques: the Voting Classifier, Averaging, and Weighted Average. Other ensemble techniques, such as Bagging, Boosting, and Stacking, are worth exploring further.
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm