Learn how to detect and remove outliers in data using Python for machine learning tasks. This tutorial covers detection with statistical methods and visualization tools, and handling with trimming and capping. Understanding how outliers affect machine learning models can help improve the accuracy and reliability of your predictions.
There is no one-size-fits-all approach to outliers: how you handle them depends on the context and goals of your analysis. Removing outliers changes the distribution and characteristics of your data, so base the decision on a thorough understanding of the data, consider the implications, and document the choices you make during detection and removal.
Load the Dataset
We will read the data from the CSV file
import pandas as pd
import seaborn as sns
df = pd.read_csv('data/winequality.csv')
df.head()
The code reads the CSV file 'winequality.csv' into a pandas DataFrame named df and then displays the first five rows using the head() function
Next we will see the statistical summary of the DataFrame
df.describe()
The describe() function in Pandas provides a statistical summary of the DataFrame, including various descriptive statistics such as count, mean, standard deviation, minimum value, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum value for each numerical column in the DataFrame
We will use the residual sugar column to detect and remove the outliers
Visualize the Data
Next we will plot the data
sns.histplot(df['residual sugar'], kde=True)
This generates a distribution plot for the 'residual sugar' column: a histogram showing the frequency of values, overlaid with a smooth kernel density estimate (KDE) curve. We use histplot() with kde=True because distplot() is deprecated in recent Seaborn versions
The plot is strongly right-skewed, with a long tail toward high values, which suggests that there are outliers at the upper end
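The skew seen in the plot can also be checked numerically. The sketch below is only an illustration on a small made-up sample (the values list is hypothetical, not taken from the wine dataset); it uses the pandas Series.skew() method and compares the mean with the median.

```python
import pandas as pd

# Hypothetical sample standing in for a right-skewed column such as 'residual sugar'
values = pd.Series([1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 3.0, 8.5, 14.0, 65.8])

# Series.skew() returns the sample skewness; values well above 0 indicate a long right tail
skewness = values.skew()
mean_value = values.mean()
median_value = values.median()

print('skewness:', round(skewness, 2))
print('mean:', mean_value, 'median:', median_value)
```

A mean that sits well above the median is another hint that a few large values are pulling the distribution to the right.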
Next we will use boxplot to see the outliers clearly
# to see outliers clearly
sns.boxplot(df['residual sugar'])
This uses the sns.boxplot() function from the Seaborn library to create a box plot of the 'residual sugar' column in the DataFrame df
The box represents the interquartile range (IQR), with the line inside representing the median. The whiskers extend to the minimum and maximum values within 1.5 times the IQR from the first and third quartiles. Any points outside of the whiskers are considered potential outliers
Methods to remove Outliers
There are several methods we can use to remove outliers. Let us look at a few of them
Z-Score Method
The z-score method is a statistical technique used to detect outliers by measuring how many standard deviations a data point is away from the mean. A z-score tells you how relatively far a data point is from the mean in terms of standard deviations
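To make the definition concrete, here is a minimal sketch that computes a z-score for every value and flags those more than 3 standard deviations from the mean. The sample data is hypothetical; in the tutorial the same expression would be applied to df['residual sugar'].

```python
import pandas as pd

# Hypothetical sample; in the article this would be df['residual sugar']
values = pd.Series([1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.0, 2.2,
                    2.1, 2.3, 2.4, 2.5, 2.2, 2.3, 2.1, 2.4, 2.2, 30.0])

# z = (x - mean) / std; pandas' std() uses the sample standard deviation (ddof=1)
z_scores = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations away from the mean
is_outlier = z_scores.abs() > 3
print(values[is_outlier])
```

Note that the mean and standard deviation are themselves pulled by extreme values, so this method tends to work best on reasonably large, roughly bell-shaped data.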
First we will get the upper and lower limits
# find the limits
upper_limit = df['residual sugar'].mean() + 3*df['residual sugar'].std()
lower_limit = df['residual sugar'].mean() - 3*df['residual sugar'].std()
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
This snippet computes the upper limit as the mean plus three standard deviations (mean + 3 * std) and the lower limit as the mean minus three standard deviations (mean - 3 * std)
Under the z-score method, any data point outside this range is treated as an outlier. The factor of 3 is a common convention rather than a fixed rule, and it works best when the data is roughly normally distributed
Next let us find outliers using the limits
# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
The code snippet uses the upper and lower limits calculated earlier to identify outliers in the 'residual sugar' column of the DataFrame df. It uses boolean indexing to filter the DataFrame and select rows where the 'residual sugar' values are outside the calculated limits
The .loc[] method is used to access the rows in df that meet the specified condition
The condition (df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit) checks whether the 'residual sugar' values are greater than the upper limit or less than the lower limit, indicating outliers
The resulting DataFrame contains only the rows where the 'residual sugar' value is an outlier
Next we will trim the outliers. Trimming is a data transformation technique where outliers are removed or "trimmed" from the dataset, rather than replacing or imputing their values. Trimming involves setting a threshold or cutoff value, and any data points exceeding this threshold are removed from the dataset
# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
The code snippet performs trimming by removing the outlier data from the DataFrame df based on the upper and lower limits calculated earlier
It creates a new DataFrame named new_df that contains only the rows with 'residual sugar' values within the calculated limits
The code calculates the length of df before removing outliers using len(df)
It then calculates the length of new_df after removing outliers using len(new_df)
Finally, it calculates the number of outliers removed by subtracting the length of new_df from the length of df
By printing these values, you can see the number of rows in df before and after removing outliers, as well as the count of outliers that were removed
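The trimming step can also be wrapped in a small reusable function. The sketch below is one possible way to do it; trim_outliers is a hypothetical helper name (not part of pandas), and the toy DataFrame and limits are made up for illustration. It relies on Series.between(), which is inclusive on both ends by default.

```python
import pandas as pd

def trim_outliers(data, column, lower, upper):
    """Return only the rows whose `column` value lies within [lower, upper]."""
    # Series.between is inclusive on both ends by default
    mask = data[column].between(lower, upper)
    return data.loc[mask]

# Hypothetical toy data and limits
toy = pd.DataFrame({'residual sugar': [1.5, 2.0, 2.5, 3.0, 40.0]})
trimmed = trim_outliers(toy, 'residual sugar', lower=0.0, upper=10.0)

print('before removing outliers:', len(toy))
print('after removing outliers:', len(trimmed))
```

The same helper could then be reused with the limits from any of the methods in this article.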
Next let us plot the data after trimming outliers
sns.boxplot(new_df['residual sugar'])
Next we will perform capping. Capping, a technique closely related to Winsorization, handles outliers by replacing values beyond a threshold with the threshold itself. Here the thresholds are the z-score limits computed above, whereas classic Winsorization uses percentiles. This reduces the impact of outliers on the dataset without removing any rows
# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>=upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<=lower_limit), 'residual sugar'] = lower_limit
This performs capping on a copy of the data: new_df is created with df.copy(), so the original DataFrame df is left unchanged
The .loc[] indexer selects the rows where 'residual sugar' is at or above the upper limit (or at or below the lower limit) and assigns the corresponding limit to those cells
As a result, every value in the 'residual sugar' column of new_df lies within the range defined by lower_limit and upper_limit, and no rows are dropped
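As an alternative to the two .loc[] assignments, pandas provides Series.clip(), which caps both ends in one call. The sketch below uses a made-up toy DataFrame and made-up limits purely for illustration.

```python
import pandas as pd

# Hypothetical toy data and limits
toy = pd.DataFrame({'residual sugar': [-5.0, 1.5, 2.0, 2.5, 40.0]})
lower, upper = 0.0, 10.0

# Work on a copy so the original data stays untouched
capped = toy.copy()

# clip() replaces values below `lower` with `lower` and values above `upper` with `upper`
capped['residual sugar'] = capped['residual sugar'].clip(lower=lower, upper=upper)

print(capped['residual sugar'].tolist())  # [0.0, 1.5, 2.0, 2.5, 10.0]
```

The number of rows stays the same; only the extreme values change.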
Next let us plot the data after performing capping
sns.boxplot(new_df['residual sugar'])
Here we have not deleted any data; we have only capped it. We can check by printing the length of the data
len(new_df)
6497
The length of the new DataFrame is 6497, the same as the original DataFrame
Interquartile Range Method
The Interquartile Range (IQR) method is another statistical technique used to detect and handle outliers in a dataset. The IQR represents the range between the first quartile (Q1) and the third quartile (Q3) of a dataset
First let us calculate the first quartile (Q1), third quartile (Q3), and interquartile range (IQR) of the 'residual sugar' column in the DataFrame df
q1 = df['residual sugar'].quantile(0.25)
q3 = df['residual sugar'].quantile(0.75)
iqr = q3-q1
In this code, q1 is calculated as the value at the 25th percentile (first quartile) of the 'residual sugar' column using the .quantile() function with a parameter of 0.25
Similarly, q3 is calculated as the value at the 75th percentile (third quartile)
Finally, iqr is computed as the difference between q3 and q1, representing the interquartile range
By calculating the Q1, Q3, and IQR, you obtain important descriptive statistics that can help in understanding the spread and distribution of the 'residual sugar' data.
These values are commonly used in the Interquartile Range (IQR) method for outlier detection and other data analysis techniques
q1, q3, iqr
(1.8, 8.1, 6.3)
These are the values of Q1, Q3, and IQR for the 'residual sugar' data in your DataFrame
Next let us calculate the upper and lower limit using the Interquartile Range (IQR) method
upper_limit = q3 + (1.5 * iqr)
lower_limit = q1 - (1.5 * iqr)
lower_limit, upper_limit
(-7.6499999999999995, 17.549999999999997)
upper_limit is computed by adding 1.5 times the IQR to Q3 (q3 + (1.5 * iqr)), while lower_limit is calculated by subtracting 1.5 times the IQR from Q1 (q1 - (1.5 * iqr))
These values define the range within which data points are considered non-outliers according to the IQR method. Since the lower limit is negative and residual sugar values are not, only the upper limit flags rows in this dataset
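The IQR limit calculation can be packaged as a small function so it is easy to reuse on other columns. This is a sketch under assumptions: iqr_limits and its parameter k are hypothetical names introduced here, and the sample values are made up. The default k=1.5 matches the multiplier used above.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower_limit, upper_limit) as Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obviously extreme value
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
low, high = iqr_limits(values)

# Select the values that fall outside the limits
flagged = values[(values < low) | (values > high)]
print('limits:', low, high)
print('outliers:', flagged.tolist())
```

A larger k (for example 3) makes the rule stricter about what counts as an outlier, so fewer points are flagged.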
Next let us plot the data
sns.boxplot(df['residual sugar'])
Next we will find the outliers using upper and lower limit calculated earlier
# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
Next let us perform trimming of the outliers
# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
The code snippet performs outlier removal using the calculated upper and lower limits based on the Interquartile Range (IQR) method. It creates a new DataFrame named new_df that includes only the rows with 'residual sugar' values within the calculated limits
By printing the lengths of df and new_df, you can see the number of rows in the DataFrame before and after removing outliers. Additionally, the difference in lengths (len(df) - len(new_df)) gives you the count of outliers that were removed
Next let us plot the data after trimming outliers
sns.boxplot(new_df['residual sugar'])
Next let us perform capping of the outliers
# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit
The code snippet performs capping by replacing the outlier values in the 'residual sugar' column with the upper or lower limit values. The changes are made in new_df, which is a copy of the original DataFrame df, so df itself is not modified
Next let us plot the data after performing capping
sns.boxplot(new_df['residual sugar'])
Percentile Method
The percentile method can be used to handle outliers in a dataset. The percentile method involves setting a threshold based on percentiles and capping or truncating the outlier values accordingly
First let us calculate the upper and lower limit
upper_limit = df['residual sugar'].quantile(0.99)
lower_limit = df['residual sugar'].quantile(0.01)
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
The quantile() function in pandas is used to calculate the desired percentiles of the 'residual sugar' column in the DataFrame df
upper_limit is calculated as the value at the 99th percentile (0.99) of the 'residual sugar' column, and lower_limit is calculated as the value at the 1st percentile (0.01)
By printing these values, you can obtain the specific upper and lower limits that define the range within which data points are considered non-outliers according to the percentile method. These limits are calculated based on the specified percentiles and can be used to handle outliers in the 'residual sugar' column of your dataset
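Both percentile limits can be obtained in a single quantile() call by passing a list. The sketch below runs on a made-up series of the numbers 1 to 100 so the result is easy to check by hand; it is an illustration, not the wine data.

```python
import pandas as pd

# Hypothetical data: the values 1.0, 2.0, ..., 100.0
values = pd.Series(range(1, 101), dtype='float64')

# Passing a list returns a Series with one entry per requested quantile
limits = values.quantile([0.01, 0.99])
lower_limit, upper_limit = limits.tolist()

# Count how many values fall outside the limits
n_outside = int(((values < lower_limit) | (values > upper_limit)).sum())

print('lower limit:', lower_limit)
print('upper limit:', upper_limit)
print('values outside the limits:', n_outside)
```

Keep in mind that, by construction, the 1st and 99th percentiles always place roughly 2% of the rows outside the limits, whether or not those rows are truly unusual.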
Next let us plot the data
sns.boxplot(df['residual sugar'])
Next we will find the outliers using upper and lower limit calculated earlier
# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
Next let us perform trimming of the outliers
# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
The code snippet filters the DataFrame df based on the upper and lower limits calculated using the percentile method. It creates a new DataFrame named new_df that includes only the rows with 'residual sugar' values within the calculated limits
Next let us plot the data after trimming outliers
sns.boxplot(new_df['residual sugar'])
Next let us perform capping of the outliers
# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit
This code performs capping by replacing the outlier values in the 'residual sugar' column with the upper or lower limit values. The changes are made in new_df, which is a copy of the original DataFrame df, so df itself is not modified
Next let us plot the data after performing capping
sns.boxplot(new_df['residual sugar'])
Next let us plot the distribution for both the old and new DataFrames
sns.histplot(df['residual sugar'], kde=True)
sns.histplot(new_df['residual sugar'], kde=True)
Final Thoughts
Outliers are data points that deviate significantly from the majority of the dataset, and they can have a significant impact on statistical measures and model performance
It's crucial to have a good understanding of the data and the domain in which it is collected. Outliers may arise due to various reasons, such as measurement errors, data entry mistakes, or genuinely rare events. Understanding the nature of the data helps in making informed decisions about whether to remove or retain outliers
There are several methods available for detecting outliers, including statistical techniques like z-score, modified z-score, and box plots. These methods help identify observations that fall outside a certain threshold. Additionally, domain-specific knowledge and visual exploration of the data can also aid in outlier detection.
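The modified z-score mentioned above was not demonstrated in this article, so here is a minimal sketch of one common formulation. It replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The constant 0.6745 and the cutoff of 3.5 are widely used conventions and are assumptions here, as is the made-up sample.

```python
import pandas as pd

# Hypothetical sample with one extreme value
values = pd.Series([1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 30.0])

# Median and median absolute deviation (MAD) are far less sensitive to outliers
median = values.median()
mad = (values - median).abs().median()

# Modified z-score; 0.6745 rescales MAD to be comparable to a standard deviation
# for normally distributed data (conventional constant, assumed here)
modified_z = 0.6745 * (values - median) / mad

# 3.5 is a commonly suggested cutoff (assumed here)
is_outlier = modified_z.abs() > 3.5
print(values[is_outlier].tolist())
```

If more than half of the values are identical, the MAD is zero and this formula breaks down, so that case needs separate handling.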
Outliers can significantly influence statistical measures such as mean, variance, and correlation coefficients. Therefore, it is essential to assess the impact of outliers on the analysis and decide whether their presence distorts the results. Sometimes, outliers may contain valuable information, and removing them can lead to biased or inaccurate conclusions
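A tiny example shows how strongly a single extreme value can move these measures. The numbers below are made up for illustration, and the sketch uses only Python's built-in statistics module.

```python
import statistics

# Hypothetical measurements, with and without one extreme value
clean = [2.0, 2.1, 2.2, 2.3, 2.4]
with_outlier = clean + [60.0]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)
stdev_clean = statistics.stdev(clean)
stdev_outlier = statistics.stdev(with_outlier)

print('mean  :', mean_clean, '->', mean_outlier)
print('median:', median_clean, '->', median_outlier)
print('stdev :', stdev_clean, '->', stdev_outlier)
```

The mean and standard deviation jump sharply, while the median barely moves, which is why median-based measures are often described as robust.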
Once outliers are detected, the next step is to decide how to handle them. Common approaches include removing them (trimming), replacing them with less extreme values (capping), and treating them separately from the rest of the data
While outlier detection and removal can be valuable, exercise caution and watch for common pitfalls: overzealous removal, reduced sample size and statistical power, and an unclear or arbitrary definition of what counts as an outlier
In summary, detecting and removing outliers should be approached with careful consideration of the data, domain knowledge, and the goals of the analysis. It is a crucial step in data preprocessing, but it requires judgment and an understanding of the potential impact on subsequent analyses or models
In this article we have explored how to detect and remove outliers using the z-score method, the interquartile range method, and the percentile method, and we have seen how to perform trimming and capping with each of these methods.
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm