Filling missing values in a dataset is an essential step in data preprocessing to ensure the accuracy and reliability of your analysis or machine learning model. There are several techniques you can use to fill missing values based on the nature of your data and the characteristics of the missing values.
We will see basic to advanced techniques to fill the missing values in the dataset. It's important to understand the characteristics of the missing data and choose an appropriate approach that preserves the integrity and meaningfulness of your dataset.
You can watch the video-based tutorial with step by step explanation down below.
Load the Dataset
We will read the data from the CSV file.
import pandas as pd

df = pd.read_csv('data/Loan Prediction Dataset.csv')
df.head()
The code snippet reads the CSV file 'Loan Prediction Dataset.csv' into a Pandas DataFrame named df and then displays the first few rows of the DataFrame using the head() method.
Check the missing values
First we will see the count of missing values in each column.
# check null values
df.isnull().sum()
Calling the isnull() method on the DataFrame df returns a DataFrame of the same shape with True where a value is missing (null) and False where it is present.
The sum() method is then applied to the result, summing the True values along each column to give the count of missing values per column.
Different Approaches to fill missing values
Let us see basic to advanced approaches to fill missing values in the dataset.
1) Fill with negative values
Filling missing values with a sentinel negative value can be suitable in certain scenarios: when the absence of data is itself informative, when the column is numerical and zero is a valid value, when you want to preserve any bias or pattern in the missingness, or when downstream algorithms can handle such sentinel values.
First let us copy the existing dataframe to a new dataframe.
new_df = df.copy()
The code snippet creates a new DataFrame named new_df that is a copy of an existing DataFrame named df. This approach is commonly used to create a separate copy of the original DataFrame, ensuring that any modifications made to new_df do not affect the original DataFrame df.
Next let us fill the missing values with a negative value.
new_df = df.fillna(-999)
new_df.isnull().sum()
The code snippet fills the missing values in df with a specific sentinel value, -999, and then checks the number of null values in each column of new_df using new_df.isnull().sum().
We can see that the count of missing values has become zero after filling it with negative values.
2) Consider NULL Value as new category
Instead of filling null values with a specific value, we can treat them as a distinct category or label in the dataset by assigning them a unique value or string that represents the null category.
First let us copy the existing dataframe to new dataframe.
new_df = df.copy()
Next let us see the count of each category in a particular column.
df['Gender'].value_counts()
Next let us consider null values as a different category.
# consider nan as category
new_df['Gender'] = df['Gender'].fillna('nan')
The fillna() method replaces the NaN values in the 'Gender' column with the string 'nan', and the result is assigned to the 'Gender' column of new_df, leaving the original df unchanged.
Next let us see the count of each category after filling the null values.
new_df['Gender'].value_counts()
We can see that there is a new category nan with a count of 13, which matches the count of missing values in the Gender column.
3) Drop rows which have NULL values
If the missing values are sparse and don't significantly affect your dataset's overall integrity, you may choose to drop the rows or columns containing missing values. However, this approach should be used cautiously to avoid losing valuable information.
First let us copy the existing dataframe with missing values to new dataframe.
new_df = df.copy()
Next let us check the length of the dataframe with missing values.
len(new_df)
614
Next let us drop the rows that have null values.
new_df = df.dropna(axis=0)
len(new_df)
480
df.dropna(axis=0) drops all rows from df that contain any NaN values along the row axis (axis=0). The resulting DataFrame is stored in new_df.
The len() function is then used to calculate the number of rows in new_df, which represents the length of the DataFrame.
We can see that the length of the new dataframe is less than that of the old one, as the rows with null values have been removed.
Next let us check the count of null values in each column after dropping the rows with missing values.
new_df.isnull().sum()
4) Fill missing value with mean, median and mode
For numerical data, you can replace missing values with the mean or median, and for categorical data with the mode of the available values in that column. This approach assumes that the missing values are most likely to follow the same distribution as the available data.
First let us copy the existing dataframe to new dataframe.
new_df = df.copy()
Next let us calculate the mean value for one of the columns. Here we have considered the LoanAmount column as it is a numerical column.
df['LoanAmount'].mean()
Next let us plot the distribution of the column.
import seaborn as sns

sns.histplot(df['LoanAmount'], kde=True)
This plots the distribution of the 'LoanAmount' column from the DataFrame df as a histogram with a kernel density estimate overlaid. (The older sns.distplot API is deprecated in recent seaborn releases; histplot with kde=True is the current equivalent.)
Next let us fill the null values in LoanAmount column with the calculated mean value.
# fill missing value for numerical
new_df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
new_df.isnull().sum()
The code snippet fills the missing values in the 'LoanAmount' column of new_df with the mean value of the 'LoanAmount' column from df. This approach assumes that the missing values are most likely to have the same distribution as the available data.
Finally, new_df.isnull().sum() calculates the count of null values in each column of new_df, showing the remaining missing values after filling the 'LoanAmount' column with the mean.
We can see that the count of null values in the LoanAmount column is zero as we have filled the null values with the mean value of that column.
Next let us fill another numerical column Loan_Amount_Term with median value.
new_df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median())
new_df.isnull().sum()
This fills the missing values in the 'Loan_Amount_Term' column of new_df with the median value of the 'Loan_Amount_Term' column from df. Filling with the median is a common strategy when dealing with numerical data and can be more robust to outliers compared to using the mean.
Finally, new_df.isnull().sum() calculates the count of null values in each column of new_df, showing the remaining missing values after filling the 'Loan_Amount_Term' column with the median.
We can see that the count of null values in the Loan_Amount_Term column is zero as we have filled the null values with the median value of that column.
Next let us fill a categorical column with the mode value. Let us plot a count plot of the column Self_Employed.
sns.countplot(x='Self_Employed', data=df)
This plots the count of each unique value in the 'Self_Employed' column from the DataFrame df. The resulting plot displays the number of occurrences for each category.
Next let us find the mode of the Self_Employed column.
df['Self_Employed'].mode()[0]
'No'
Next let us fill the missing values in Self_Employed with the mode value.
# fill missing value for categorical
new_df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
new_df.isnull().sum()
This fills the missing values in the 'Self_Employed' column of new_df with the mode value of the 'Self_Employed' column from df. Filling with the mode is a common strategy when dealing with categorical data, as it replaces missing values with the most frequently occurring category.
Finally, new_df.isnull().sum() calculates the count of null values in each column of new_df, showing the remaining missing values after filling the 'Self_Employed' column with the mode.
We can see that the count of null values in the Self_Employed column is zero as we have filled the null values with the mode value of that column.
5) Fill missing value based on grouping category
Filling missing values based on grouping categories can be a suitable approach in situations where the missing values exhibit a relationship or dependence on other categorical variables. By considering the means of grouping categories, you can utilize the available information from similar groups to estimate and fill the missing values more accurately.
First let us copy the existing dataframe to new dataframe.
new_df = df.copy()
Next let us calculate the mean value for each group by performing a groupby on the category column.
mean_df = df.groupby('Loan_Status')['LoanAmount'].mean()
mean_df
df.groupby('Loan_Status')['LoanAmount'].mean() groups the DataFrame df by the 'Loan_Status' column, selects the 'LoanAmount' column, and calculates the mean for each group.
The result, stored in mean_df, is a Series with the average 'LoanAmount' for each 'Loan_Status' category.
By printing mean_df, you will see the mean 'LoanAmount' value for each unique 'Loan_Status' category.
The mean value of the LoanAmount column for the Loan_Status category N is 151.220994, and for the Y category it is 144.294404.
Next let us fill the missing values with the group mean value.
# fill missing value for numerical column
new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'].fillna(mean_df['N'])
new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'].fillna(mean_df['Y'])
new_df.loc[new_df['Loan_Status'] == 'N', 'LoanAmount'] selects the 'LoanAmount' values where the 'Loan_Status' is 'N' and fillna(mean_df['N']) fills the corresponding missing values with the mean value for the 'N' category from mean_df. Similarly, the second line fills missing values for the 'Y' category.
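The two explicit .loc fills above can also be written in one step with groupby().transform(), which fills each row's missing value with its own group's mean. A minimal sketch on a toy frame (the values below are hypothetical, not from the loan dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the loan dataset (hypothetical values)
df = pd.DataFrame({
    'Loan_Status': ['N', 'N', 'Y', 'Y', 'Y'],
    'LoanAmount':  [100.0, np.nan, 120.0, 160.0, np.nan],
})

# Fill each row's missing LoanAmount with the mean of its Loan_Status group
df['LoanAmount'] = (
    df.groupby('Loan_Status')['LoanAmount']
      .transform(lambda s: s.fillna(s.mean()))
)

print(df['LoanAmount'].tolist())  # [100.0, 100.0, 120.0, 160.0, 140.0]
```

This scales to any number of categories without writing one assignment per category.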
Next let us see the count of missing values in each column.
new_df.isnull().sum()
We can see that the count of null values in the LoanAmount column is zero as we have filled the null values with the mean value from mean_df.
Next let us print each key (category) in the mean_df series. This will be helpful when there are multiple categories.
for val in mean_df.keys():
    print(val)
N
Y
mean_df.keys() returns an iterable of the keys (categories) in the mean_df series.
Next let us get the mean value for Loan_Amount_Term column.
mean_df = df.groupby('Loan_Status')['Loan_Amount_Term'].mean()
mean_df
df.groupby('Loan_Status')['Loan_Amount_Term'].mean() groups the DataFrame df by the 'Loan_Status' column and calculates the mean value for the 'Loan_Amount_Term' column for each group.
The resulting series, mean_df, will contain the average 'Loan_Amount_Term' value for each unique 'Loan_Status' category.
By printing mean_df, you will see the mean 'Loan_Amount_Term' value for each category in the 'Loan_Status' column.
Next let us fill the missing values in the Loan_Amount_Term column with the group mean value.
for val in mean_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'] = new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'].fillna(mean_df[val])
The loop for val in mean_df.keys() iterates over each key (category) in mean_df. new_df.loc[new_df['Loan_Status'] == val, 'Loan_Amount_Term'] selects the 'Loan_Amount_Term' values where the 'Loan_Status' matches the current category (val), and fillna(mean_df[val]) fills the corresponding missing values with the mean value for that category from mean_df.
Next let us see the count of missing values in each column.
new_df.isnull().sum()
Next let us fill the missing values for the categorical column Self_Employed. First let us find the mode of the column.
# fill missing value for categorical
mode_df = df.groupby('Loan_Status')['Self_Employed'].agg(pd.Series.mode)
mode_df
df.groupby('Loan_Status')['Self_Employed'] groups the DataFrame df by the 'Loan_Status' column and selects the 'Self_Employed' column.
The agg(pd.Series.mode) part calculates the mode for each category in the 'Loan_Status' column.
The resulting series, mode_df, will contain the mode value for 'Self_Employed' within each 'Loan_Status' category.
By printing mode_df, you will see the mode 'Self_Employed' value for each unique 'Loan_Status' category.
Next let us fill the missing values for Self_Employed column.
for val in mode_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'] = new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'].fillna(mode_df[val])
The loop for val in mode_df.keys() iterates over each key (category) in mode_df. new_df.loc[new_df['Loan_Status'] == val, 'Self_Employed'] selects the 'Self_Employed' values where the 'Loan_Status' matches the current category (val), and fillna(mode_df[val]) fills the corresponding missing values with the mode value for that category from mode_df.
Next let us see the count of missing values in each column.
new_df.isnull().sum()
6) Fill missing value using ML Model
You can use machine learning algorithms to predict missing values based on other features in your dataset. This approach involves treating missing values as the target variable and training a model to predict their values based on the available data. Regression algorithms or advanced techniques like K-nearest neighbors (KNN) can be used for this purpose.
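As an example of the KNN approach, scikit-learn provides a KNNImputer that fills each missing entry using the k nearest complete rows. A minimal sketch on toy numeric data (assuming scikit-learn is installed; this is not the article's loan dataset):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with one missing entry per column
X = np.array([
    [1.0,    2.0],
    [3.0,    4.0],
    [np.nan, 6.0],
    [8.0,    np.nan],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest rows,
# with distances computed on the observed features only
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

KNNImputer avoids hand-training a model per column, at the cost of computing pairwise distances over the dataset.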
First let us copy the existing dataframe to new dataframe.
new_df = df.copy()
Next let us select few specific columns.
new_df = new_df[['LoanAmount', 'Loan_Amount_Term', 'ApplicantIncome', 'CoapplicantIncome']]
new_df.head()
The first line selects the specified columns from new_df and assigns the result back to new_df, effectively creating a new DataFrame that contains only the selected columns.
The new_df.head() statement then displays the first few rows of the new DataFrame, allowing you to inspect the selected columns.
Next let us check the length of the new dataframe.
len(new_df)
614
Next let us first drop the rows with missing values.
col = "LoanAmount"
# fill numerical values
new_df_temp = new_df.dropna(subset=[col], axis=0)
print(col, len(new_df_temp))
LoanAmount 592
new_df.dropna(subset=[col], axis=0) drops the rows with missing values in the specified column (col) from the DataFrame new_df.
The subset=[col] parameter specifies the column(s) to consider when dropping the rows. The resulting DataFrame is stored in new_df_temp.
The print(col, len(new_df_temp)) statement then prints the column name (col) and the length of new_df_temp, which indicates the number of rows remaining in the DataFrame after dropping the missing values in the specified column.
Next let us split the data into input features and output.
# input and output split
X = new_df_temp.drop(columns=[col], axis=1)
y = new_df_temp[col]
new_df_temp.drop(columns=[col], axis=1) drops the column specified by col from the DataFrame new_df_temp to form the input features matrix X.
The columns=[col] parameter specifies the column(s) to drop, and axis=1 indicates that the columns should be dropped.
The line y = new_df_temp[col] assigns the column specified by col as the output variable y.
Next let us fit the ML model.
from lightgbm import LGBMRegressor
model = LGBMRegressor(use_missing=False)
model.fit(X, y)
LGBMRegressor(use_missing=False) creates an instance of the LightGBM regressor model.
The parameter use_missing=False indicates that missing values should not be treated as a special category and should be ignored during model training.
The model.fit(X, y) line fits the model to the input features X and the target variable y, training the model on the provided data.
Next let us populate with the predicted values from the model.
d = {}
temp = new_df.drop(columns=[col], axis=1)
d[col] = list(model.predict(temp))
d = {} initializes an empty dictionary d to store the predictions.
The line temp = new_df.drop(columns=[col], axis=1) creates a temporary DataFrame temp that excludes the column specified by col. This DataFrame is used as input features to generate predictions.
The line d[col] = list(model.predict(temp)) populates the dictionary d with predictions generated by calling model.predict(temp).
The predictions are stored as a list and associated with the key col in the dictionary.
Next let us fill the missing values in the column with the corresponding predicted values.
i = 0
for val, d_val in zip(new_df[col], d[col]):
    if pd.isna(val):
        new_df.at[i, col] = d_val
    i += 1
i is initialized as 0 to track the index of the iteration.
The loop for val, d_val in zip(new_df[col], d[col]) iterates over the values of the col column in new_df and the corresponding predicted values from d[col].
For any missing value (NaN) encountered in the col column, the line new_df.at[i, col] = d_val assigns the corresponding predicted value d_val to that position in the DataFrame.
By the end of the loop, the missing values in the col column of new_df will be replaced with the corresponding predicted values.
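When the DataFrame has a default RangeIndex (as in this dataset), the same fill can be done without an explicit loop by wrapping the predictions in a Series aligned to the index. A small sketch with hypothetical values:

```python
import pandas as pd
import numpy as np

# Toy column with gaps, plus one prediction per row (hypothetical values)
new_df = pd.DataFrame({'LoanAmount': [100.0, np.nan, 150.0, np.nan]})
preds = [101.0, 130.0, 149.0, 155.0]

# Align predictions to the frame's index; fillna only touches the NaN slots
col = 'LoanAmount'
new_df[col] = new_df[col].fillna(pd.Series(preds, index=new_df.index))

print(new_df[col].tolist())  # [100.0, 130.0, 150.0, 155.0]
```

Existing values are untouched; only the missing positions pick up the corresponding prediction.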
Next let us see the count of missing values in each column.
new_df.isnull().sum()
Next let us see few of the rows in the dataframe.
new_df.head()
Final Thoughts
It's crucial to analyze the missing data pattern in your dataset. Determine if the missing values are random or if there is a systematic reason behind them. This understanding can help guide your approach to filling missing values.
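One quick way to start this analysis, sketched here on toy data, is to inspect the fraction of missing values per column and the correlation between missingness indicators; strongly correlated indicators suggest the values are not missing at random:

```python
import pandas as pd
import numpy as np

# Toy frame where 'income' and 'loan' are missing on the same rows
df = pd.DataFrame({
    'income': [50, np.nan, 70, np.nan, 90],
    'loan':   [10, np.nan, 30, np.nan, 50],
    'age':    [25, 30, 35, 40, 45],
})

# Fraction of missing values per column
print(df.isnull().mean())

# Correlation between missingness indicators: values near 1 mean the two
# columns tend to be missing together
print(df.isnull().astype(int).corr())
```

Here 'income' and 'loan' have a missingness correlation of 1.0, a systematic pattern that a simple per-column fill would ignore.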
The method of filling missing values depends on the type of variable. Numerical variables can be filled using statistical measures like mean, median, or regression-based imputation. Categorical variables can be filled with mode or by creating a new category to represent missing values.
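This rule of thumb can be wrapped in a small helper that dispatches on dtype; a sketch with hypothetical column names, filling numeric columns with the median and other columns with the mode:

```python
import pandas as pd
import numpy as np

def fill_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median, other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical example frame
df = pd.DataFrame({
    'amount': [100.0, np.nan, 300.0],
    'gender': ['Male', None, 'Male'],
})
filled = fill_by_dtype(df)
print(filled['amount'].tolist())  # [100.0, 200.0, 300.0]
print(filled['gender'].tolist())  # ['Male', 'Male', 'Male']
```

Note this sketch assumes every non-numeric column has at least one observed value, otherwise mode()[0] would fail.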
When filling missing values, be aware that the imputed values can introduce bias to the data. Ensure that the imputation method aligns with the characteristics and distribution of the variable. Evaluate the impact of imputation on your analysis and consider sensitivity analysis to assess the robustness of the results.
Utilize your domain knowledge or subject matter expertise to guide the imputation process. Consider any relevant information that may help inform the filling of missing values. Domain knowledge can provide insights into potential relationships between variables and guide the choice of imputation method.
Assess the quality and reliability of the imputed values. Compare imputed values with known values, perform cross-validation, or conduct statistical tests to ensure the imputed values align with the overall data distribution.
If the missing values are not completely random, you may want to consider advanced techniques like multiple imputation. Multiple imputation generates multiple imputed datasets, capturing the uncertainty associated with the missing values and allowing for more accurate estimates and statistical inference.
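scikit-learn offers an IterativeImputer along these lines (still marked experimental, hence the explicit enable import), which models each feature with missing values as a function of the other features; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where the second column is roughly twice the first
X = np.array([
    [1.0,    2.0],
    [2.0,    4.0],
    [3.0,    np.nan],
    [np.nan, 8.0],
])

# Each feature with NaNs is regressed on the others, iterating to convergence
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(np.round(X_filled, 1))
```

Running the imputer several times with different random seeds and pooling the results approximates the multiple-imputation workflow described above.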
In this article we have explored basic to advanced techniques to fill the missing values in the dataset. However, the choice of method depends on the specific dataset, the nature of the missing values, and the analysis goals. It is essential to carefully analyze your data, understand the underlying patterns, and make informed decisions when filling missing values.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm