Equip yourself with the tools to combat fake news using Python! This tutorial explores LSTM classification, a powerful deep learning technique, for detecting and analyzing fake news. Learn to build a robust model that can identify misinformation and enhance your skills in natural language processing. Dive into the world of deep learning and gain insights into the fascinating field of fake news detection. Arm yourself with knowledge and contribute to a more informed society. #FakeNewsDetection #Python #LSTM #DeepLearning #NaturalLanguageProcessing #Misinformation
In this project tutorial, we are going to analyze and classify a dataset of articles as reliable or unreliable and visualize the most frequent words with word cloud plots.
You can watch the step-by-step explanation video tutorial down below
Dataset Information
Develop a deep learning program to identify whether an article might be fake news or not.
Attributes
id: unique id for a news article
title: the title of a news article
author: author of the news article
text: the text of the article; could be incomplete
label: a label that marks the article as potentially unreliable
1: unreliable
0: reliable
Download the dataset here
Import Modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re
import nltk
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
wordcloud - used to generate word cloud visualizations of text
re - provides regular expressions to find particular patterns in text and process them
nltk - a natural language processing toolkit
warnings - to manipulate warning details
%matplotlib inline - to enable inline plotting
filterwarnings('ignore') - ignores the warnings thrown by the modules (gives clean output)
Loading the Dataset
df = pd.read_csv('train.csv')
df.head()
We can see the top 5 samples of the data
The important information is in the 'text' and 'label' columns, so the other columns are irrelevant for this process
Let us visualize the title and the text of the first article.
df['title'][0]
'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'
df['text'][0]
'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. \nAs we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was reviewing emails it had recently discovered in order to see if they contained classified information. Not long after this letter went out, Oversight Committee Chairman Jason Chaffetz set the political world ablaze with this tweet. FBI Dir just informed me, "The FBI has learned of the existence of emails that appear to be pertinent to the investigation." Case reopened \n— Jason Chaffetz (@jasoninthehouse) October 28, 2016 \nOf course, we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of “an unrelated case”–which we now know to be Anthony Weiner’s sexting with a teenager. But apparently such little things as facts didn’t matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary wins–at least two years’ worth, and possibly an entire term’s worth of them. Apparently Chaffetz thought the FBI was already doing his work for him–resulting in a tweet that briefly roiled the nation before cooler heads realized it was a dud. \nBut according to a senior House Democratic aide, misreading that letter may have been the least of Chaffetz’ sins. That aide told Shareblue that his boss and other Democrats didn’t even know about Comey’s letter at the time–and only found out when they checked Twitter. “Democratic Ranking Members on the relevant committees didn’t receive Comey’s letter until after the Republican Chairmen. In fact, the Democratic Ranking Members didn’ receive it until after the Chairman of the Oversight and Government Reform Committee, Jason Chaffetz, tweeted it out and made it public.” \nSo let’s see if we’ve got this right. The FBI director tells Chaffetz and other GOP committee chairmen about a major development in a potentially politically explosive investigation, and neither Chaffetz nor his other colleagues had the courtesy to let their Democratic counterparts know about it. Instead, according to this aide, he made them find out about it on Twitter. \nThere has already been talk on Daily Kos that Comey himself provided advance notice of this letter to Chaffetz and other Republicans, giving them time to turn on the spin machine. That may make for good theater, but there is nothing so far that even suggests this is the case. After all, there is nothing so far that suggests that Comey was anything other than grossly incompetent and tone-deaf. \nWhat it does suggest, however, is that Chaffetz is acting in a way that makes Dan Burton and Darrell Issa look like models of responsibility and bipartisanship. 
He didn’t even have the decency to notify ranking member Elijah Cummings about something this explosive. If that doesn’t trample on basic standards of fairness, I don’t know what does. \nGranted, it’s not likely that Chaffetz will have to answer for this. He sits in a ridiculously Republican district anchored in Provo and Orem; it has a Cook Partisan Voting Index of R+25, and gave Mitt Romney a punishing 78 percent of the vote in 2012. Moreover, the Republican House leadership has given its full support to Chaffetz’ planned fishing expedition. But that doesn’t mean we can’t turn the hot lights on him. After all, he is a textbook example of what the House has become under Republican control. And he is also the Second Worst Person in the World. About Darrell Lucus \nDarrell is a 30-something graduate of the University of North Carolina who considers himself a journalist of the old school. An attempt to turn him into a member of the religious right in college only succeeded in turning him into the religious right\'s worst nightmare--a charismatic Christian who is an unapologetic liberal. His desire to stand up for those who have been scared into silence only increased when he survived an abusive three-year marriage. You may know him on Daily Kos as Christian Dem in NC . Follow him on Twitter @DarrellLucus or connect with him on Facebook . Click here to buy Darrell a Mello Yello. Connect'
The text contains punctuation and escape characters; these can be filtered out to keep only the meaningful information
Let us see the data types and number of samples in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      20800 non-null  int64
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB
A total of 20800 articles are in the dataset
The title, author and text columns have fewer non-null entries than the total, meaning they contain null values (a quick check of this is shown below)
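To see exactly which columns contain the null values, you can run a quick check like the one below; this cell is an optional addition and is not part of the original notebook.
# optional check: count the missing values in each column
df.isnull().sum()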
Data Preprocessing
Now we filter the data for processing
# drop unnecessary columns
df = df.drop(columns=['id', 'title', 'author'], axis=1)
# drop null values
df = df.dropna(axis=0)
len(df)
20761
dropna(axis=0) drops an entire row if it contains a null value
# convert the text to lower case
df['clean_news'] = df['text'].str.lower()
df['clean_news']
0        house dem aide: we didn’t even see comey’s let...
1        ever get the feeling your life circles the rou...
2        why the truth might get you fired october 29, ...
3        videos 15 civilians killed in single us airstr...
4        print \nan iranian woman has been sentenced to...
                               ...
20795    rapper t. i. unloaded on black celebrities who...
20796    when the green bay packers lost to the washing...
20797    the macy’s of today grew from the union of sev...
20798    nato, russia to hold parallel exercises in bal...
20799      david swanson is an author, activist, journa...
Name: clean_news, Length: 20761, dtype: object
str.lower() - converts all characters to lower case
Now we proceed to remove the punctuation and special characters
df['clean_news'] = df['clean_news'].str.replace(r'[^A-Za-z0-9\s]', '', regex=True)
df['clean_news'] = df['clean_news'].str.replace('\n', '')
df['clean_news'] = df['clean_news'].str.replace(r'\s+', ' ', regex=True)
df['clean_news']
0        house dem aide we didnt even see comeys letter...
1        ever get the feeling your life circles the rou...
2        why the truth might get you fired october 29 2...
3        videos 15 civilians killed in single us airstr...
4        print an iranian woman has been sentenced to s...
                               ...
20795    rapper t i unloaded on black celebrities who m...
20796    when the green bay packers lost to the washing...
20797    the macys of today grew from the union of seve...
20798    nato russia to hold parallel exercises in balk...
20799     david swanson is an author activist journalis...
Name: clean_news, Length: 20761, dtype: object
All special characters and punctuation are removed
Escape characters are removed
Extra spaces are removed (a short standalone illustration of these cleaning steps follows below)
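For a quick sense of what these replacements do, here is a standalone illustration using the re module imported earlier; this cell is an addition for illustration only and is not part of the original notebook.
# illustrate the cleaning steps on a made-up string
sample = "Breaking!!  News:\nSomething   happened, again."
cleaned = re.sub(r'[^A-Za-z0-9\s]', '', sample.lower())   # drop punctuation and special characters
cleaned = re.sub(r'\s+', ' ', cleaned).strip()            # collapse whitespace and newlines
print(cleaned)   # -> 'breaking news something happened again'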
# remove stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')  # download the stop word list if it is not already available
stop = stopwords.words('english')
df['clean_news'] = df['clean_news'].apply(lambda x: " ".join([word for word in x.split() if word not in stop]))
df.head()
Stop words carry little meaning; removing them simplifies the text data for better feature extraction
Stop words are removed by splitting the text into words and keeping only the words that are not in the stop word list, as the small example below shows
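As a small illustration (not from the dataset), this is how a short sentence changes after stop word removal; this cell is an optional addition.
# illustrate stop word removal on a made-up sentence
example = "this is an example of a sentence with many stop words"
print(" ".join(word for word in example.split() if word not in stop))
# -> 'example sentence many stop words'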
Exploratory Data Analysis
# visualize the frequent words
all_words = " ".join([sentence for sentence in df['clean_news']])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Concatenation of all the sentences from clean_news column
The most frequent words are larger and less frequent words are smaller
Next, let us visualize the frequent words for genuine and fake news separately.
# visualize the frequent words for genuine news
all_words = " ".join([sentence for sentence in df['clean_news'][df['label']==0]])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Concatenation of sentences of genuine news only
Visualization of most frequent words of genuine news
# visualize the frequent words for fake news
all_words = " ".join([sentence for sentence in df['clean_news'][df['label']==1]])
wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)
# plot the graph
plt.figure(figsize=(15, 9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Concatenation of sentences of fake news only
Visualization of most frequent words of fake news
Compared with the plot for genuine news, the word frequencies differ and some different words appear; the snippet below quantifies this with raw counts
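To quantify that difference, a simple frequency count per class can complement the word clouds; this snippet is an optional addition and top_words is a hypothetical helper name, not something defined in the original notebook.
from collections import Counter

# count the most common words per class as a numeric complement to the word clouds
def top_words(texts, n=10):
    return Counter(" ".join(texts).split()).most_common(n)

print(top_words(df['clean_news'][df['label'] == 0]))   # genuine news
print(top_words(df['clean_news'][df['label'] == 1]))   # fake news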
Create Word Embeddings
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
Tokenizer - used to load the text and convert it into tokens
pad_sequences - used to make all the sequences the same length by filling the remaining positions with zeros (see the toy illustration below)
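To make that behavior concrete, here is a tiny illustration on two made-up sentences using the Tokenizer and pad_sequences imported above; this cell is an addition for illustration only.
# toy example of how Tokenizer and pad_sequences behave
toy = Tokenizer()
toy.fit_on_texts(["fake news spreads fast", "real news"])
toy_seq = toy.texts_to_sequences(["fake news spreads fast", "real news"])
print(pad_sequences(toy_seq, maxlen=5, padding='post'))
# prints something like [[2 1 3 4 0]
#                        [5 1 0 0 0]]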
# tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['clean_news'])
word_index = tokenizer.word_index
vocab_size = len(word_index)
vocab_size
199536
fit_on_texts builds a word index of all the unique words as tokens
vocab_size is the total number of unique words in the data
# padding data
sequences = tokenizer.texts_to_sequences(df['clean_news'])
padded_seq = pad_sequences(sequences, maxlen=500, padding='post', truncating='post')
Padding the data equalizes the length of all the sequences
For this project we set the maximum length to 500 words for faster processing; normally you would determine the maximum sentence length across the whole dataset (an optional check for this follows below)
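If you prefer to base the maximum length on the data rather than a fixed 500, an optional check like the following shows the longest sentence and the 95th-percentile length; this cell is not part of the original notebook.
# optional check: inspect the sentence lengths after cleaning
lengths = df['clean_news'].str.split().str.len()
print(lengths.max(), lengths.quantile(0.95))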
# create embedding index
embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs
You must download the GloVe embedding file for this step and place it in the same folder as the notebook
The GloVe embedding file maps each word to a 100-dimensional vector and covers most common English words (a quick sanity check of the loaded embeddings is shown below)
You may download the Glove embedding file here
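After the file is loaded, a quick sanity check (not part of the original notebook) can confirm the embeddings were read correctly; 'news' is used here only as an example word.
# optional check: inspect the loaded GloVe embeddings
print(len(embedding_index))          # number of words in glove.6B (about 400,000)
print(embedding_index['news'][:5])   # first 5 of the 100 dimensions for the word 'news'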
# create embedding matrix
embedding_matrix = np.zeros((vocab_size+1, 100))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
embedding_matrix[1]
array([-0.13128 , -0.45199999, 0.043399 , -0.99798 , -0.21053 , -0.95867997, -0.24608999, 0.48413 , 0.18178 , 0.47499999, -0.22305 , 0.30063999, 0.43496001, -0.36050001, 0.20245001, -0.52594 , -0.34707999, 0.0075873 , -1.04970002, 0.18673 , 0.57369 , 0.43814 , 0.098659 , 0.38769999, -0.22579999, 0.41911 , 0.043602 , -0.73519999, -0.53583002, 0.19276001, -0.21961001, 0.42515001, -0.19081999, 0.47187001, 0.18826 , 0.13357 , 0.41839001, 1.31379998, 0.35677999, -0.32172 , -1.22570002, -0.26635 , 0.36715999, -0.27586001, -0.53245997, 0.16786 , -0.11253 , -0.99958998, -0.60706002, -0.89270997, 0.65156001, -0.88783997, 0.049233 , 0.67110997, -0.27553001, -2.40050006, -0.36989 , 0.29135999, 1.34979999, 1.73529994, 0.27000001, 0.021299 , 0.14421999, 0.023784 , 0.33643001, -0.35475999, 1.09210002, 1.48450005, 0.49430001, 0.15688001, 0.34678999, -0.57221001, 0.12093 , -1.26160002, 1.05410004, 0.064335 , -0.002732 , 0.19038001, -1.76429999, 0.055068 , 1.47370005, -0.41782001, -0.57341999, -0.12129 , -1.31690001, -0.73882997, 0.17682 , -0.019991 , -0.49175999, -0.55247003, 1.06229997, -0.62879002, 0.29098001, 0.13237999, -0.70414001, 0.67128003, -0.085462 , -0.30526 , -0.045495 , 0.56509 ])
The vectors in the embedding matrix are stored as float32 values
The 100 values together represent a single word (words without a pretrained vector keep a row of zeros, as the check below shows)
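Not every token from the dataset exists in GloVe, and those rows of the embedding matrix remain zero vectors. The optional check below counts them; it is not part of the original notebook.
# optional check: how many vocabulary words have no pretrained vector
missing = sum(1 for word in word_index if word not in embedding_index)
print(missing, 'of', vocab_size, 'words are not in GloVe; their rows stay as zeros')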
Input Split
padded_seq[1]
array([ 258, 28, 1557, 92, 4913, 27340, 415, 2246, 2067, 377, 532, 1558, 5339, 29, 12, 796, 179, 361, 1917, 17459, 829, 20147, 2990, 2626, 640, 747, 252, 2025, 3113, 10995, 125, 39, 2086, 78618, 3022, 3646, 3561, 3113, 835, 153, 3458, 29, 9775, 51963, 3724, 18, 218, 20, 3234, 20147, 10024, 625, 11, 481, 2494, 2417, 8173, 442, 701, 613, 147, 14, 22280, 902, 324, 8, 164, 3712, 60, 11541, 867, 2644, 16, 864, 4422, 176, 5305, 2086, 4253, 40, 257, 835, 192, 10, 2403, 10, 2086, 9775, 58, 8372, 11246, 104297, 20952, 3713, 20953, 78619, 104298, 5459, 31169, 25044, 7998, 19120, 65806, 4403, 168, 261, 25045, 4403, 162, 355, 904, 1581, 424, 1302, 20, 344, 37, 1963, 187, 394, 59, 8107, 3658, 18529, 177, 1356, 745, 7401, 2379, 7787, 1602, 2532, 152, 12, 458, 10153, 11900, 17701, 8681, 128, 102, 22769, 10582, 10025, 13518, 9418, 316, 7, 136, 626, 480, 370, 95, 47538, 2439, 19434, 1139, 9775, 7163, 3591, 8173, 4, 840, 169, 625, 14079, 414, 51, 465, 177, 1, 446, 1139, 446, 1078, 1139, 10, 39, 369, 182, 446, 1139, 8031, 51, 51, 1557, 30058, 1703, 516, 16, 2633, 19772, 1139, 8031, 957, 11901, 165, 60, 493, 957, 16, 588, 6, 19772, 13107, 35329, 1635, 1688, 3751, 2121, 254, 12, 104, 19772, 1099, 287, 8032, 12768, 1159, 19121, 52, 14721, 8208, 22, 6, 3, 20548, 3724, 69, 3241, 69, 292, 893, 2020, 17201, 37, 1615, 250, 448, 2825, 14721, 12, 562, 104299, 471, 7358, 1910, 2322, 1438, 1502, 1212, 592, 448, 674, 1452, 22, 6, 2420, 1387, 592, 197, 12000, 142, 192, 42, 49, 6, 102, 14885, 1502, 230, 292, 973, 1019, 137, 209, 627, 994, 17202, 8, 15, 6, 4785, 3640, 29, 12, 9944, 907, 86, 2648, 1521, 229, 176, 13108, 1376, 20147, 481, 95, 11, 164, 2557, 12, 9203, 70, 146, 604, 1732, 2688, 263, 25735, 41482, 4166, 21, 20147, 13639, 4977, 118, 39, 43, 8681, 86, 320, 2478, 447, 1049, 335, 1304, 1273, 447, 1049, 247, 891, 1871, 335, 179, 361, 1917, 4311, 361, 44, 41, 7472, 489, 1464, 16, 335, 1453, 683, 737, 1032, 169, 934, 30, 3341, 557, 11, 361, 5797, 7952, 20954, 6089, 148, 51964, 7203, 1387, 637, 418, 1615, 37, 53, 8, 1809, 47539, 11442, 3561, 53, 19773, 981, 12649, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
This shows the word indices of the second padded sequence
It is a good example of a padded sentence: the remaining positions are filled with zeros to reach the maximum length
Now we proceed in splitting the data for training
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(padded_seq, df['label'], test_size=0.20, random_state=42, stratify=df['label'])
80% of the data is used for training and the remaining 20% for testing
stratify keeps the class proportions the same in the train and test splits (verified in the optional check below)
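A quick way to confirm what stratify does is to compare the label proportions in the two splits; this optional check is not part of the original notebook.
# optional check: confirm the class proportions match in both splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))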
Model Training
from keras.layers import LSTM, Dropout, Dense, Embedding
from keras import Sequential
# model = Sequential([
#     Embedding(vocab_size+1, 100, weights=[embedding_matrix], trainable=False),
#     Dropout(0.2),
#     LSTM(128, return_sequences=True),
#     LSTM(128),
#     Dropout(0.2),
#     Dense(512),
#     Dropout(0.2),
#     Dense(256),
#     Dense(1, activation='sigmoid')
# ])
model = Sequential([
    Embedding(vocab_size+1, 100, weights=[embedding_matrix], trainable=False),
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(256),
    Dense(1, activation='sigmoid')
])
Embedding - maps the word index to the corresponding vector representation
LSTM - processes sequences of data
Dense - a fully connected linear layer
Dropout helps avoid overfitting, which is especially important when no data augmentation is applied
activation='sigmoid' - used for binary classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, None, 100) 19953700 _________________________________________________________________ dropout_5 (Dropout) (None, None, 100) 0 _________________________________________________________________ lstm_3 (LSTM) (None, 128) 117248 _________________________________________________________________ dropout_6 (Dropout) (None, 128) 0 _________________________________________________________________ dense_5 (Dense) (None, 1) 129 ================================================================= Total params: 20,071,077 Trainable params: 117,377 Non-trainable params: 19,953,700 _________________________________________________________________
model.compile() - compiles the model
optimizer='adam' - automatically adjusts the learning rate for the model over the number of epochs
loss='binary_crossentropy' - loss function for binary outputs
# train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=256, validation_data=(x_test, y_test))
Epoch 1/10
65/65 [==============================] - 42s 617ms/step - loss: 0.6541 - accuracy: 0.6098 - val_loss: 0.6522 - val_accuracy: 0.6152
Epoch 2/10
65/65 [==============================] - 39s 607ms/step - loss: 0.6436 - accuracy: 0.6241 - val_loss: 0.5878 - val_accuracy: 0.6769
Epoch 3/10
65/65 [==============================] - 40s 611ms/step - loss: 0.6057 - accuracy: 0.6688 - val_loss: 0.5908 - val_accuracy: 0.7144
Epoch 4/10
65/65 [==============================] - 40s 613ms/step - loss: 0.5693 - accuracy: 0.7239 - val_loss: 0.6280 - val_accuracy: 0.6326
Epoch 5/10
65/65 [==============================] - 40s 612ms/step - loss: 0.5990 - accuracy: 0.6699 - val_loss: 0.5887 - val_accuracy: 0.6959
Epoch 6/10
65/65 [==============================] - 40s 614ms/step - loss: 0.6060 - accuracy: 0.6593 - val_loss: 0.5807 - val_accuracy: 0.6766
Epoch 7/10
65/65 [==============================] - 40s 609ms/step - loss: 0.5546 - accuracy: 0.6906 - val_loss: 0.5704 - val_accuracy: 0.6641
Epoch 8/10
65/65 [==============================] - 39s 606ms/step - loss: 0.5517 - accuracy: 0.6973 - val_loss: 0.5553 - val_accuracy: 0.6689
Epoch 9/10
65/65 [==============================] - 33s 508ms/step - loss: 0.5400 - accuracy: 0.6855 - val_loss: 0.5281 - val_accuracy: 0.7226
Epoch 10/10
65/65 [==============================] - 40s 609ms/step - loss: 0.5244 - accuracy: 0.7236 - val_loss: 0.5442 - val_accuracy: 0.6988
Set the number of epochs and batch size according to your hardware specifications
Training accuracy and validation accuracy generally trend upward over the epochs, though with some fluctuation
Training loss and validation loss generally trend downward, also with some fluctuation
The maximum validation accuracy is 72.26% and may increase further if we train for more epochs (an optional sketch using early stopping follows below)
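If you do train for more epochs, one common option (not used in the original notebook) is an EarlyStopping callback so training stops once the validation loss stops improving; the sketch below assumes the same model, data, and batch size as above.
# optional sketch: train longer but stop early when validation loss stops improving
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(x_train, y_train, epochs=30, batch_size=256,
                    validation_data=(x_test, y_test), callbacks=[early_stop])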
Now we visualize the results through a plot graph
# visualize the results
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend(['Train', 'Test'])
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend(['Train', 'Test'])
plt.show()
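Once trained, the model can also classify new text. The helper below is a sketch that is not part of the original notebook; predict_article is a hypothetical name, and it reuses the cleaning steps, stop word list, tokenizer, and 500-word padding defined above.
# sketch: classify a new piece of text with the trained model
def predict_article(text, threshold=0.5):
    cleaned = re.sub(r'[^A-Za-z0-9\s]', '', text.lower())            # same cleaning as before
    cleaned = re.sub(r'\s+', ' ', cleaned)
    cleaned = " ".join(w for w in cleaned.split() if w not in stop)   # remove stop words
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=500, padding='post', truncating='post')
    prob = model.predict(padded)[0][0]                                # probability of being unreliable
    return 'unreliable' if prob >= threshold else 'reliable'

predict_article("Replace this with the text of an article you want to check")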
Final Thoughts
Training the model for more epochs can give better and more accurate results.
Processing large amounts of data can take a lot of time and system resources.
This is a basic deep learning model with a small network; adding more layers may improve the results.
We trained an LSTM model to detect fake news; other architectures like GRU, Bi-LSTM, and Transformers can be used to improve performance.
In this project tutorial, we have explored fake news detection as an NLP classification deep learning project. Various techniques were used to process the data, including padding the sequences and creating an embedding matrix, and the results were visualized through several plots.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm