This tutorial explores the Iris dataset with Python. It covers classification techniques and machine learning algorithms for classifying Iris flowers based on their features: you will preprocess the data, train several models, and evaluate their performance.
In this project tutorial, we are going to analyze the tabular data with various visualizations and build a robust machine learning model to predict the class of the flower.
Dataset Information
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
species
Iris Setosa
Iris Versicolour
Iris Virginica
Download the Iris Dataset here
Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
warnings - used to control how warning messages are displayed
filterwarnings('ignore') suppresses the warnings raised by the modules (gives cleaner output)
Loading the Dataset
# load the csv data
df = pd.read_csv('Iris.csv')
df.head()
pd.read_csv() loads the CSV (comma-separated values) data into a dataframe
df.head() displays the first 5 rows of the dataframe
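If Iris.csv is not at hand, a similar dataframe can be built from the copy of the dataset bundled with scikit-learn. This is only a sketch: the column names and the 'Iris-' prefix are assumptions chosen to match the CSV used in this tutorial, the name df_alt is illustrative, and a few values in the bundled copy may differ slightly from the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris copy that ships with scikit-learn as pandas objects.
iris = load_iris(as_frame=True)
df_alt = iris.data.copy()

# Rename the columns to the names used by the CSV in this tutorial (assumed).
df_alt.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# Map the integer targets back to strings such as 'Iris-setosa'.
species_names = ['Iris-' + name for name in iris.target_names]
df_alt['Species'] = [species_names[t] for t in iris.target]

print(df_alt.head())
```

There is no Id column in this version, so the drop step that follows is not needed for it.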
# delete a column
df = df.drop(columns = ['Id'])
df.head()
# to display stats about data
df.describe()
# to get basic info about datatypes
df.info()
All the input attributes (columns 0-3) are of type float64 and the output attribute (column 4) is of type object (string)
# to display no. of samples on each class
df['Species'].value_counts()
value_counts() returns a pandas Series holding the count of each unique value.
We have 50 samples in each output class
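As a small illustration of what value_counts() returns, here is a sketch on a made-up three-element label column (the toy values are not taken from the dataset):

```python
import pandas as pd

# Toy label column, only for illustration.
labels = pd.Series(['Iris-setosa', 'Iris-setosa', 'Iris-virginica'], name='Species')

counts = labels.value_counts()

# The result is a Series indexed by the unique values, not a dict.
print(type(counts).__name__)
# It can be converted to a dict when one is needed.
print(counts.to_dict())
```

A balanced dataset like Iris shows the same count for every class, which means plain accuracy is a reasonable metric later on.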
Preprocessing the Dataset
Let's check for NULL values in the dataset
# check for null values
df.isnull().sum()
There are no NULL values present in the dataset.
If any NULL values were present, we would have to fill them (or drop the affected rows) before proceeding to model training.
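The Iris data has no gaps, but the handling described above could look like the following sketch. The toy dataframe and the choice of the median as fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing measurement and one missing label.
toy = pd.DataFrame({
    'PetalLengthCm': [1.4, np.nan, 4.7, 5.1],
    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', None],
})

# Count the missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Fill a numeric gap with the column median.
filled = toy.copy()
filled['PetalLengthCm'] = filled['PetalLengthCm'].fillna(filled['PetalLengthCm'].median())

# Rows without a label cannot be used for supervised training, so drop them.
cleaned = filled.dropna(subset=['Species'])
print(cleaned)
```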
Exploratory Data Analysis
In Exploratory Data Analysis (EDA), we visualize the data with different kinds of plots for inference. It helps to find patterns or relations within the data.
# histograms
df['SepalLengthCm'].hist()
df['SepalWidthCm'].hist()
df['PetalLengthCm'].hist()
df['PetalWidthCm'].hist()
Sepal length and sepal width are roughly normally distributed
Petal length and petal width each show two separate peaks, because the measurements come from different species
Let's create some scatter plots for inference
# create list of colors and class labels
colors = ['red', 'orange', 'blue']
species = ['Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
df[df['Species'] == species[i]] - filters samples for each class label
plt.scatter() - generates a scatterplot for the data
plt.xlabel() - label for x-axis
plt.ylabel() - label for y-axis
plt.legend() - display the legend for the plot
for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalLengthCm'], x['SepalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['PetalLengthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.legend()
for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalLengthCm'], x['PetalLengthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalWidthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Width")
plt.ylabel("Petal Width")
plt.legend()
Here we can see that Iris-setosa is easily separable from the other 2 classes
In the petal length vs. petal width plot, the three classes form distinct clusters with very little overlap
In the other plots, some samples of Iris-versicolor and Iris-virginica overlap
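The four scatter plots above repeat the same loop with different columns. One way to avoid the repetition is a small helper function. This is a sketch, not part of the original notebook: the function name scatter_by_species and the toy dataframe are made up for illustration, and the 'Agg' backend line is only there so the snippet runs without a display (it can be dropped inside a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def scatter_by_species(data, x_col, y_col, species, colors):
    """Plot y_col against x_col with one colour per species and return the Axes."""
    fig, ax = plt.subplots()
    for name, color in zip(species, colors):
        subset = data[data['Species'] == name]  # filter data on each class
        ax.scatter(subset[x_col], subset[y_col], c=color, label=name)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
    return ax

# Toy values, only to show the call; use the real df in the tutorial.
toy = pd.DataFrame({
    'PetalLengthCm': [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    'PetalWidthCm': [0.2, 0.3, 1.4, 1.5, 2.1, 2.3],
    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})
ax = scatter_by_species(
    toy, 'PetalLengthCm', 'PetalWidthCm',
    species=['Iris-virginica', 'Iris-versicolor', 'Iris-setosa'],
    colors=['red', 'orange', 'blue'],
)
```

Each call creates its own figure, so the plots do not end up drawn on top of each other when the code is run as a script.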
Correlation Matrix
A correlation matrix is a table of correlation coefficients between variables. Each cell shows the correlation between two variables, with a value in the range -1 to 1. If two input variables are highly correlated, they carry largely redundant information, so one of the two can be dropped.
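By default pandas computes the Pearson correlation coefficient. To make the definition concrete, the sketch below computes it by hand with NumPy and compares it with the pandas result. The helper name pearson and the toy columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Toy columns: 'b' rises with 'a', 'c' falls as 'a' rises.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.1, 5.9, 8.2],
    'c': [4.0, 3.0, 2.0, 1.0],
})

r_manual = pearson(toy['a'], toy['b'])
r_pandas = toy['a'].corr(toy['b'])
r_negative = pearson(toy['a'], toy['c'])

print(r_manual, r_pandas, r_negative)
```

A value close to 1 means the two variables rise together, and a value close to -1 means one falls as the other rises.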
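# display the correlation matrix
# 'Species' is still a string column at this point, so exclude it
# (recent pandas versions raise an error on non-numeric columns)
corr = df.drop(columns=['Species']).corr()
corr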
# plot the heat map
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(corr, annot=True, ax=ax, cmap = 'coolwarm')
Petal length and petal width have a high positive correlation of 0.96
If the petal length increases, the petal width tends to increase as well
Sepal length has a high positive correlation with petal length and petal width
Sepal width has a negative correlation with petal length and petal width
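Reading the heat map by eye works for four features. For wider tables, the highly correlated pairs can be listed programmatically. The following is a minimal sketch: the function name highly_correlated_pairs, the 0.9 threshold, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include='number').corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Toy data: 'b' is almost a multiple of 'a'; 'label' is ignored as non-numeric.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.5],
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
    'label': ['x', 'y', 'x', 'y', 'x'],
})
pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Whether to actually drop one feature of a reported pair is a judgment call; with only four inputs, this tutorial keeps them all.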
Label Encoder
In machine learning, we usually deal with datasets that contain labels in one or more columns. These labels can be words or numbers. Label encoding converts the labels into numeric form so that they are machine-readable.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# transform the string labels to integer
df['Species'] = le.fit_transform(df['Species'])
df.head()
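To see which integer was assigned to which species, the fitted encoder can be inspected. The sketch below uses a short hand-written label list rather than the dataframe; LabelEncoder numbers the classes in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# A few labels in arbitrary order, only for illustration.
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels; the position in the list is the integer code.
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to turn model predictions back into species names later.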
Model Training and Testing
Now that the preprocessing is done, let's perform the model training and testing
from sklearn.model_selection import train_test_split
## train - 70%
## test - 30%
# input data
X = df.drop(columns=['Species'])
# output data
Y = df['Species']
# split the data for train and test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30)
X - contains input attributes
Y - contains the output attribute
train_test_split() - splits the data for training and testing (here 70% of the data for training and 30% for testing). The split is random and no random_state is set, so the accuracies below will vary from run to run.
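If a repeatable split is wanted, train_test_split() accepts a random_state, and its stratify argument keeps the class proportions equal in both parts. The sketch below uses placeholder arrays with the same shape as the Iris data; the seed 42 is an arbitrary choice, not something the tutorial uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like Iris: 150 samples, 4 features, 3 classes of 50.
X_toy = np.arange(150 * 4).reshape(150, 4)
y_toy = np.repeat([0, 1, 2], 50)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,
    random_state=42,   # fixed seed, so the same split is produced every run
    stratify=y_toy,    # keep the 50/50/50 class balance in both parts
)

print(x_tr.shape, x_te.shape)   # (105, 4) (45, 4)
print(np.bincount(y_te))        # [15 15 15]
```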
Let's import some models and train
# logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# model training
model.fit(x_train, y_train)
fit() - used for training the model with the data
# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)
Accuracy: 91.11111111111111
model.score() - gives the accuracy for the test data
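For a classifier, score() is the fraction of test samples predicted correctly. With 45 test samples, an accuracy of 91.11% corresponds to 41 correct predictions (41 / 45). The sketch below checks this equivalence on the scikit-learn copy of the dataset. The seed and max_iter=1000 are assumptions added so the example is repeatable and converges cleanly, and its accuracy will not necessarily match the figure above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(x_tr, y_tr)

predictions = clf.predict(x_te)

# Three equivalent ways to get the accuracy on the test split.
manual_accuracy = (predictions == y_te).mean()
metric_accuracy = accuracy_score(y_te, predictions)
score_accuracy = clf.score(x_te, y_te)

print(manual_accuracy, metric_accuracy, score_accuracy)
```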
# knn - k-nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(x_train, y_train)
# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)
Accuracy: 100.0
# decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)
Accuracy: 91.11111111111111
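Since a single 45-sample test split can favour one model by chance, a steadier comparison of the three models is k-fold cross-validation. The sketch below uses the scikit-learn copy of the dataset. The choice of 5 folds, the max_iter value, and the fixed seed for the decision tree are assumptions, and no particular scores are claimed here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, estimator in candidates.items():
    # Each model is trained and scored 5 times, each time on a different held-out fold.
    scores = cross_val_score(estimator, X_cv, y_cv, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean tells how well a model does on average, and the standard deviation gives a feel for how much a single split can move the number.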
Final Thoughts
We got 100% accuracy for KNN on this particular test split; with only 45 test samples and a random split, the exact figures will change between runs
You can also try out other machine learning models in the same way
More EDA can be done with boxplots, violin plots, bar plots, etc.
In this project tutorial, we have learned how to train a machine learning classification model for the Iris flower dataset. We also covered data analysis, visualization, data transformation, and model creation.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm