Combat SMS spam using Python! This tutorial covers NLP techniques and machine learning algorithms for spam detection. Learn to preprocess text data, extract meaningful features, and build models that can distinguish between legitimate and spam messages. Along the way you will practice natural language processing and machine learning, and contribute to a safer communication environment.
In this project tutorial, we are going to analyze the text messages in the dataset and classify them as spam or ham using classification models wrapped in pipelines.
Dataset Information
The "spam" concept is diverse: advertisements for products or web sites, make-money-fast schemes, chain letters, pornography, etc.
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.
Attributes
SMS Messages
Label (spam/ham)
Download the dataset here
Import modules
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
nltk – the Natural Language Toolkit, a natural language processing library (bundled with Anaconda)
re – the regular expression module, used to find particular patterns in text and process them
stopwords - used to remove stop words from the text data
Loading the dataset
Now we load the dataset for preprocessing
df = pd.read_csv('spam.csv')
df.head()
The relevant columns are v1 (the label) and v2 (the message text)
The other columns are mostly null and are not needed for processing
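Depending on how the CSV file was exported, pd.read_csv with its default UTF-8 decoding may raise a UnicodeDecodeError. Below is a minimal sketch of a fallback loader. It assumes the file uses a single-byte Western encoding such as Latin-1, and the helper name load_sms_csv is hypothetical, not part of the original notebook.

```python
import pandas as pd

def load_sms_csv(path):
    """Read a CSV as UTF-8, falling back to Latin-1 if decoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        # Assumption: the file is Latin-1 encoded. Latin-1 can decode
        # any byte sequence, so this second attempt will not raise a
        # decode error (though characters may be wrong if the guess is off).
        return pd.read_csv(path, encoding="latin-1")
```

If the file loads without errors using the plain pd.read_csv('spam.csv') call, this fallback is not needed.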
Let us extract the relevant data for preprocessing
# get necessary columns for processing
df = df[['v2', 'v1']]
# df.rename(columns={'v2': 'messages', 'v1': 'label'}, inplace=True)
df = df.rename(columns={'v2': 'messages', 'v1': 'label'})
df.head()
The columns are renamed so that their names are more meaningful in the code
Two ways to rename the columns are listed (in place, or by reassignment); either one works
Preprocessing the dataset
# check for null values
df.isnull().sum()
messages    0
label       0
dtype: int64
This checks and shows the number of null values in each of the two columns.
If there are null values, you must filter them out before further processing
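The dataset here has no null values, but if it did, the filtering step could look like the following minimal sketch. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "messages": ["Free entry now", None, "See you at 5"],
    "label": ["spam", "ham", "ham"],
})

# Count missing values per column, as in the tutorial.
null_counts = toy.isnull().sum()

# Drop rows where either column is missing, then renumber the index.
toy_clean = toy.dropna(subset=["messages", "label"]).reset_index(drop=True)

print(null_counts)
print(len(toy_clean))  # 2 rows remain
```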
# requires the NLTK stop word corpus: run nltk.download('stopwords') once
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    # convert to lowercase
    text = text.lower()
    # replace special characters with spaces
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text
The cleaning steps are wrapped in a function so they can be reused, and extended in one place if further cleaning is needed
set(stopwords.words('...')) - loads the list of common stop words for the specified language as a set of unique tokens
Stop words carry little meaning on their own, so removing them usually has little effect on the results
Text is converted to lower case so that the same word in different cases is not treated as different words
Special characters and extra spaces are removed
Stop words are removed by splitting the text into words and dropping every word found in the STOPWORDS set
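To see what the cleaning steps do to a single message, here is a minimal sketch that follows the same steps as clean_text. It assumes a tiny hand-picked stop word set (DEMO_STOPWORDS) in place of the NLTK list so that it runs without downloading any corpus; the sample message and the name clean_text_demo are made up.

```python
import re

# Assumption: a tiny hand-picked set standing in for
# set(stopwords.words('english')), so no NLTK data is needed.
DEMO_STOPWORDS = {"you", "have", "a", "to", "the", "is", "now"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    # convert to lowercase
    text = text.lower()
    # replace anything that is not a letter or digit with a space
    text = re.sub(r'[^0-9a-z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # drop stop words
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "WINNER!! You have won a FREE prize. Call 09061701461 now!"
print(clean_text_demo(sample))
# winner won free prize call 09061701461
```

With the full NLTK list, more words would be removed than with this small demo set.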
Now let us clean the text messages
# clean the messages
df['clean_text'] = df['messages'].apply(clean_text)
df.head()
New column created to visualize the results from the text cleaning
Input Split
Let us separate the input and output attributes for training
X = df['clean_text']
y = df['label']
X - input attribute
y - output attribute
Model Training
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
def classify(model, X, y):
    # train test split (stratified, so both sets keep the spam/ham ratio)
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # model training
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    print('Accuracy:', pipeline_model.score(x_test, y_test)*100)
    # optional: cross-validate the whole pipeline (the bare model cannot take raw text)
    # cv_score = cross_val_score(pipeline_model, X, y, cv=5)
    # print("CV Score:", np.mean(cv_score)*100)
    y_pred = pipeline_model.predict(x_test)
    print(classification_report(y_test, y_pred))
Pipeline - chains a linear sequence of data transforms together with a final estimator, so the whole sequence can be fitted and evaluated as a single model.
train_test_split() - splits the data into training and test sets, so the model is evaluated on messages it was not trained on. stratify=y keeps the spam/ham ratio the same in both sets.
cross_val_score() - splits the data into cv folds, trains on all but one fold and scores on the held-out fold, repeats this for every fold, and returns one score per fold.
CountVectorizer - converts a collection of texts into a matrix of word counts, with one row per message and one column per word in the vocabulary.
TfidfVectorizer - computes word counts and applies TF-IDF weighting in one step, equivalent to CountVectorizer followed by TfidfTransformer. It is imported here but not used in the pipeline.
TfidfTransformer - converts the count matrix into TF-IDF weights, which down-weight words that appear in many messages, giving a numeric representation the model can be fitted on.
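With default settings, TfidfVectorizer produces the same matrix as CountVectorizer followed by TfidfTransformer. A minimal sketch checking this on a toy corpus (the three messages are made up and are not from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus (made-up messages).
docs = ["free prize call now", "call me later", "free free entry"]

# Two-step route, as used in the tutorial's pipeline.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

print(counts.shape)  # (number of messages, vocabulary size)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Either route can be used in the pipeline. Keeping the two steps separate makes it easy to drop the TF-IDF step and train on raw counts for comparison.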
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)
Accuracy: 96.8413496051687
Results using the Logistic Regression model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, X, y)
Accuracy: 96.69777458722182
The accuracy is slightly lower than with the logistic regression model
from sklearn.svm import SVC
model = SVC(C=3)
classify(model, X, y)
Accuracy: 98.27709978463747
The SVC model gives better results than the two models above
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, X, y)
Accuracy: 97.4156496769562
The accuracy is slightly lower than with the SVC model
Final Thoughts
The SVC model has the best accuracy, at 98.28%
You may use other machine learning models of your preference for comparison
Pipeline is used to chain multiple estimators into one and automate the machine learning process. This is extremely useful, as there is often a fixed sequence of steps in processing the data.
Simplifying and filtering the text gives cleaner data to process, which can lead to better results
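When comparing several models, cross-validation can be run on the whole pipeline rather than on the bare classifier, so that the vectorizer is refitted inside each fold. The following is a minimal sketch, not the notebook's code: the messages and labels are made up to keep it self-contained, and the helper name make_text_pipeline is hypothetical. On the real data, texts and labels would be df['clean_text'] and df['label'].

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned messages (6 spam, 6 ham).
texts = [
    "free prize call now", "win cash prize today", "free entry win cash",
    "claim free prize now", "urgent call win cash", "free cash claim today",
    "see you at lunch", "call me later tonight", "are we meeting today",
    "thanks for the lift", "lunch at home today", "meeting moved to later",
]
labels = ["spam"] * 6 + ["ham"] * 6

def make_text_pipeline(model):
    # Same three steps as the tutorial's classify() function.
    return Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', model)])

candidates = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
}

results = {}
for name, model in candidates.items():
    # cv=3 only because the toy data is tiny; 5 folds is a common choice.
    scores = cross_val_score(make_text_pipeline(model), texts, labels, cv=3)
    results[name] = scores.mean()
    print(name, "mean CV accuracy:", results[name] * 100)
```

Scores on such a small made-up sample say nothing about the real dataset; the point is only the structure of the comparison loop.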
In this project tutorial, we have explored the SMS Spam Detection Analysis dataset as a classification machine learning project in NLP. The data has been preprocessed with custom cleaning functions and processed using pipelines.
Get the project notebook from here
Thanks for reading the article!
Check out more project videos on the YouTube channel Hackers Realm