Email and text messaging have become an essential part of daily life for users worldwide. Both carry a great deal of sensitive information that attackers try to steal, which is known as data theft. It is therefore critical to distinguish between spam and ham messages.
Ham refers to genuine mail or text that is informative and important to the user. Spam, on the other hand, is unsolicited mail or text sent from untrustworthy sources with malicious intent.
In this notebook, we create vectors for the text data using basic techniques such as Bag of Words and TF-IDF.
# To read and manipulate the data
import pandas as pd
pd.set_option('max_colwidth', None)
# To visualise the graphs
import matplotlib.pyplot as plt
from matplotlib import cycler
colors = cycler('color',
                ['#EE6666', '#3388BB', '#9988DD',
                 '#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
       axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='black')
plt.rc('ytick', direction='out', color='black')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
import seaborn as sns
# Helps to display the images
from PIL import Image
# Helps to remove the punctuation
import string
# Helps to create the counter
from collections import Counter
# Helps to create train and test data
from sklearn.model_selection import train_test_split
# Importing the vectorization classes
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
# Importing the Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import accuracy_score,classification_report, confusion_matrix
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
We load the cleaned data produced in the HAM vs Spam notebook to avoid repeating the preprocessing steps. As mentioned earlier, let's now learn and implement vectorization techniques other than one-hot encoding.
filepath = "/content/drive/MyDrive/data_vertical/Final/1. Week-1: NLP/Notebooks/1.2 SMS_spam.csv"
messages = pd.read_csv(filepath, index_col = [0])
# Creating the copy of the data frame
data = messages.copy()
# View the first 5 rows of the dataset
data.head(5)
 | type | text
---|---|---
0 | ham | hope having good checking
1 | ham | dong cbe bt pay
2 | ham | ask mummy father
3 | ham | fyi usf swing room
4 | ham | sure thing big hockey election longer hour
# View the last five rows of the dataset
data.tail(5)
 | type | text
---|---|---
4023 | spam | cd congratulation ur awarded ps500 cd gift voucher ps125 gift guaranteed reentry 2 ps100 draw xt music 87066 tn
4024 | spam | mobile 11myths update free orange latest colour camera mobile unlimited weekend mobile upd8 freeform 08000839402 2stoptxt
4025 | spam | 3 lion england reply lion 4 mono lion 4 4 2 original n tone 3gb network operator rate
4026 | spam | ur balance ur question sang 2 answer txt ur answer good
4027 | spam | ac energy u know 2channel 2day ur leadership skill r reply an reply end sco
# Checking the shape of the dataset
data.shape
(4028, 2)
# Checking the datatypes and columns
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4028 entries, 0 to 4027
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   type    4028 non-null   object
 1   text    4028 non-null   object
dtypes: object(2)
memory usage: 94.4+ KB
# Checking for duplicate values
data.duplicated().sum()
0
# Checking for missing values
data.isna().sum()
type    0
text    0
dtype: int64
data['type'].value_counts()
ham     3414
spam     614
Name: type, dtype: int64
# this library is used to expand contractions
!pip install contractions
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: contractions in /usr/local/lib/python3.7/dist-packages (0.1.73)
Requirement already satisfied: textsearch>=0.0.21 in /usr/local/lib/python3.7/dist-packages (from contractions) (0.0.24)
Requirement already satisfied: pyahocorasick in /usr/local/lib/python3.7/dist-packages (from textsearch>=0.0.21->contractions) (1.4.4)
Requirement already satisfied: anyascii in /usr/local/lib/python3.7/dist-packages (from textsearch>=0.0.21->contractions) (0.3.1)
# Helps to extract the data using regular expressions
import re
import nltk
import contractions
nltk.download('punkt')
nltk.download('all')
from nltk.corpus import stopwords
from nltk import word_tokenize
# Used in Lemmatization
from nltk.stem import WordNetLemmatizer
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]   ... (all packages already up-to-date)
[nltk_data] Done downloading collection all
# function for text pre-processing
def clean_text(text):
    """
    Clean a single message: remove URLs, HTML tags, punctuation and stop words,
    lowercase, tokenize, drop non-alphabetic tokens, lemmatize and expand contractions.
    """
    # Cleaning the urls
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Cleaning the html elements
    text = re.sub(r'<.*?>', '', text)
    # Removing the punctuation using a regular expression,
    # i.e. removing anything which is not a word or whitespace character
    text = re.sub(r'[^\w\s]', '', text)
    # Converting the text to lower case
    text = text.lower()
    # Removing stop words
    text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])
    # Cleaning the whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize the data
    tokens = word_tokenize(text)
    # Remove numbers (keep alphabetic tokens only)
    tokens = [t for t in tokens if t.isalpha()]
    # Lemmatize the data
    tokens = [WordNetLemmatizer().lemmatize(t) for t in tokens]
    # Fix contractions (example: "'cause" -> "because", "could've" -> "could have", etc.)
    return ' '.join([contractions.fix(t) for t in tokens])
data["text"] = data["text"].astype(str)
data['clean_text'] = data['text'].apply(clean_text)
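As a quick, illustrative sanity check (the sample message below is made up, not taken from the dataset), we can run a single string through clean_text and inspect a couple of cleaned rows; the exact tokens kept depend on the NLTK stop-word list and lemmatizer.
# Illustrative check: compare a raw sample message with its cleaned form
sample = "FREE entry!! Visit www.example.com to claim your prize today"
print(clean_text(sample))
# Compare the original and cleaned columns for the first two messages
data[['text', 'clean_text']].head(2)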
Let's take a look at the text data to understand the most frequent words.
# spam messages
data_spam = data[data['type'] == 'spam']
# ham messages
data_ham = data[data['type'] == 'ham']
data_spam.head(5)
 | type | text | clean_text
---|---|---|---
3414 | spam | complimentary 4 star biz holiday cash need urgent 09066364349 landing lose | complimentary star biz holiday cash need urgent landing lose
3415 | spam | dear dave final notice collect tenerife holiday cash 09061743806 tc sae box326 cw25wx 150ppm | dear dave final notice collect tenerife holiday cash tc sae
3416 | spam | marvel mobile play official ultimate game ur mobile right text spider 83338 game ll send u free 8ball wallpaper | marvel mobile play official ultimate game you are mobile right text spider game send you free wallpaper
3417 | spam | u win ps100 music gift voucher week starting txt word draw 87066 sc | you win music gift voucher week starting txt word draw sc
3418 | spam | u won nokia 6230 plus free digital u u win free send nokia 83383 16 | you nokia plus free digital you you win free send nokia
# Returns the n most frequent n-grams (by total count) in the corpus, excluding English stop words
def top_n_ngram(corpus, n=None, ngram=1):
    vec = CountVectorizer(stop_words='english', ngram_range=(ngram, ngram)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
UNIGRAM
words1 = top_n_ngram(data_spam['clean_text'], 20, 1)
df_s = pd.DataFrame(words1, columns=['Frequent Words in Spam Texts', 'frequency'])
df_s.plot(kind='bar', x='Frequent Words in Spam Texts')
[Bar chart: 20 most frequent words in spam texts]
words2 = top_n_ngram(data_ham['clean_text'], 20, 1)
df_h = pd.DataFrame(words2, columns=['Frequent Words in Ham Texts', 'frequency'])
df_h.plot(kind='bar', x='Frequent Words in Ham Texts')
[Bar chart: 20 most frequent words in ham texts]
words3 = top_n_ngram(data['clean_text'], 10, 1)
df_w = pd.DataFrame(words3, columns=['Frequent Words in whole Texts', 'frequency'])
df_w.plot(kind='bar', x='Frequent Words in whole Texts')
[Bar chart: 10 most frequent words across all texts]
Now let’s create the vectors for the text data using Bag-of-Words and TF-IDF techniques.
A bag of words is a representation of text that describes the occurrence of words within a document.
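To make this concrete, here is a minimal illustrative sketch on a tiny made-up corpus (not the SMS data), showing how CountVectorizer turns each document into a vector of word counts:
# Toy example of the Bag of Words representation
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["free prize win now",
              "call me when you are free",
              "win a free holiday"]
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus)
print(toy_cv.vocabulary_)      # word -> column index
print(toy_counts.toarray())    # one row of counts per document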
# Creating the Bag of Words model
# limiting the vocabulary to the 200 most frequent words
cv = CountVectorizer(max_features = 200)
CountVectorizer has several parameters that are useful for building an effective model. We mostly use the following three parameters of CountVectorizer():
ngram_range: the lower and upper bounds of the range of n-values for the n-grams to extract. All values of n such that min_n <= n <= max_n are used. An ngram_range of (1, 1), for example, means only unigrams; (1, 2) means unigrams and bigrams; and (2, 2) means only bigrams.
analyzer {'word', 'char', 'char_wb'} or callable, default='word': whether features should be made of word n-grams or character n-grams. The option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
max_features (int), default=None: if not None, builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
There are other parameters that can be useful for data cleaning; refer to the scikit-learn documentation to explore them further. A short sketch on a toy corpus follows below.
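For instance (an illustrative sketch on a made-up corpus, not part of the original notebook), combining the three parameters might look like this:
# Keep only the 5 most frequent word-level unigrams and bigrams from a toy corpus
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["win a free prize now", "claim your free prize today"]
cv_demo = CountVectorizer(ngram_range=(1, 2), analyzer='word', max_features=5)
cv_demo.fit(toy_corpus)
print(cv_demo.vocabulary_)   # the 5 retained terms and their column indices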
# Fit and transform the cleaned text into count vectors
vectors = cv.fit_transform(data['clean_text']).toarray()
# Printing the identified Unique words along with their indices
print("Vocabulary: ", cv.vocabulary_)
# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vectors)
Vocabulary: {'hope': 69, 'good': 53, 'pay': 115, 'ask': 4, 'room': 135, 'sure': 154, 'thing': 162, 'big': 11, 'hour': 70, 'work': 191, 'night': 108, 'tell': 157, 'told': 167, 'you': 199, 'wan': 179, 'come': 27, 'coming': 28, 'wish': 189, 'gun': 57, 'sent': 140, 'think': 163, 'cost': 30, 'contact': 29, 'love': 91, 'need': 104, 'stop': 152, 'are': 3, 'today': 166, 'plan': 121, 'sleep': 144, 'said': 137, 'cannot': 17, 'wait': 177, 'hear': 62, 'text': 158, 'oh': 113, 'got': 54, 'job': 73, 'is': 72, 'yeah': 195, 'use': 174, 'help': 64, 'let': 81, 'like': 83, 'leave': 79, 'min': 97, 'pick': 119, 'tomorrow': 168, 'dun': 38, 'hey': 65, 'want': 180, 'co': 26, 'going': 51, 'enjoy': 41, 'book': 13, 'minute': 98, 'thought': 164, 'that': 161, 'soon': 147, 'mean': 93, 'time': 165, 'stuff': 153, 'phone': 118, 'look': 88, 'lot': 90, 'word': 190, 'do': 36, 'not': 110, 'holiday': 67, 'send': 139, 'heart': 63, 'feel': 43, 'day': 33, 'go': 50, 'miss': 99, 'long': 87, 'know': 75, 'remember': 132, 'wat': 182, 'meeting': 95, 'right': 134, 'home': 68, 'ok': 114, 'meet': 94, 'week': 185, 'working': 192, 'guy': 58, 'maybe': 92, 'called': 16, 'getting': 48, 'chance': 21, 'dear': 34, 'probably': 127, 'gon': 52, 'na': 103, 'bit': 12, 'way': 184, 'watching': 183, 'went': 186, 'people': 116, 'life': 82, 'da': 32, 'ya': 194, 'thanks': 160, 'lol': 86, 'babe': 7, 'eat': 39, 'thank': 159, 'happy': 61, 'late': 76, 'hi': 66, 'new': 106, 'better': 10, 'play': 122, 'actually': 1, 'talk': 156, 'place': 120, 'year': 196, 'car': 18, 'pls': 123, 'money': 101, 'did': 35, 'buy': 15, 'girl': 49, 'reply': 133, 'free': 46, 'yes': 197, 'box': 14, 'txt': 173, 'finish': 45, 'xxx': 193, 'will': 187, 'bad': 8, 'guess': 56, 'am': 2, 'real': 130, 'try': 171, 'sorry': 148, 'find': 44, 'ready': 129, 'check': 23, 'nice': 107, 'message': 96, 'run': 136, 'half': 60, 'great': 55, 'little': 84, 'left': 80, 'friend': 47, 'care': 19, 'number': 111, 'speak': 150, 'morning': 102, 'wanted': 181, 'say': 138, 'haha': 59, 'later': 77, 'looking': 89, 'start': 151, 'end': 40, 'person': 117, 'trying': 172, 'yo': 198, 'sweet': 155, 'ill': 71, 'asked': 5, 'account': 0, 'shall': 142, 'mobile': 100, 'smile': 146, 'join': 74, 'video': 175, 'offer': 112, 'tone': 169, 'class': 25, 'best': 9, 'waiting': 178, 'po': 124, 'live': 85, 'sound': 149, 'customer': 31, 'sm': 145, 'cash': 20, 'draw': 37, 'win': 188, 'tonight': 170, 'chat': 22, 'service': 141, 'reach': 128, 'show': 143, 'nokia': 109, 'receive': 131, 'pound': 125, 'network': 105, 'entry': 42, 'latest': 78, 'voucher': 176, 'awarded': 6, 'prize': 126, 'claim': 24} Encoded Document is: [[0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] ... [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 3] [0 0 0 ... 0 0 2]]
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
Model Building
# Independent feature
X = vectors
# Target feature
y = data["type"].map({'ham':0,'spam':1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize = True))
Shape of Training set :  (3222, 200)
Shape of test set :  (806, 200)
Percentage of classes in training set:
0    0.84761
1    0.15239
Name: type, dtype: float64
Percentage of classes in test set:
0    0.847395
1    0.152605
Name: type, dtype: float64
# initializing the Random Forest model
model = RandomForestClassifier(random_state = 1)
# fitting the model on training set
model.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Model performance on the training data
# making predictions on the training set
y_pred_train = model.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2731
           1       0.99      0.93      0.96       491

    accuracy                           0.99      3222
   macro avg       0.99      0.97      0.98      3222
weighted avg       0.99      0.99      0.99      3222
Model performance on the testing data
# making predictions on the test set
y_pred = model.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       683
           1       0.83      0.82      0.82       123

    accuracy                           0.95       806
   macro avg       0.90      0.90      0.90       806
weighted avg       0.95      0.95      0.95       806
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. TF-IDF vectorization computes a TF-IDF score for each word in each document of the corpus and stores those scores in a vector, so that words common across many documents are down-weighted relative to rarer, more distinctive words.
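As a rough illustration (a toy corpus, not the SMS data; note that scikit-learn uses a smoothed IDF and L2-normalizes each row), words that occur in many documents receive lower scores within a document than words unique to that document:
# Toy example of TF-IDF scores
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_corpus = ["free prize win",
              "free call tonight",
              "call me tomorrow"]
tfidf_demo = TfidfVectorizer()
toy_tfidf = tfidf_demo.fit_transform(toy_corpus)
cols = sorted(tfidf_demo.vocabulary_, key=tfidf_demo.vocabulary_.get)
# 'free' and 'call' appear in two documents each, so within a row their
# scores are lower than those of words unique to that document
print(pd.DataFrame(toy_tfidf.toarray(), columns=cols).round(2))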
# Creating the object to the TfidfVectorizer class
vectorizer = TfidfVectorizer(max_features = 200)
TfidfVectorizer has several parameters that are useful for building an effective model. We mostly use the following three parameters of TfidfVectorizer():
ngram_range: the lower and upper bounds of the range of n-values for the n-grams to extract. All values of n such that min_n <= n <= max_n are used. An ngram_range of (1, 1), for example, means only unigrams; (1, 2) means unigrams and bigrams; and (2, 2) means only bigrams.
max_df (float or int), default=1.0: when building the vocabulary, ignore terms with a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; otherwise it is an absolute count. This parameter is ignored if vocabulary is not None.
max_features (int), default=None: if not None, builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
There are other parameters that can be useful for data cleaning; refer to the scikit-learn documentation to explore them further. A short sketch on a toy corpus follows below.
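For example (an illustrative sketch on a made-up corpus, not part of the original notebook), max_df can drop a term such as 'free' that appears in every toy document:
# Toy example of max_df, ngram_range and max_features with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["free prize now", "free call tonight", "free win big"]
# 'free' occurs in all three documents, so max_df=0.5 removes it from the vocabulary
tfidf_demo = TfidfVectorizer(max_df=0.5, ngram_range=(1, 2), max_features=10)
tfidf_demo.fit(toy_corpus)
print(tfidf_demo.vocabulary_)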
# Fit and transform the text column into TF-IDF vectors
tf_vectors = vectorizer.fit_transform(data['text']).toarray()
# get indexing
print('\nWord indexes:')
print(vectorizer.vocabulary_)
# tf-idf values
print('\ntf-idf values:')
print(tf_vectors)
Word indexes: {'hope': 70, 'having': 62, 'good': 52, 'pay': 116, 'ask': 5, 'room': 137, 'sure': 154, 'thing': 162, 'big': 11, 'hour': 71, 'work': 193, 'night': 110, 'tell': 157, 'told': 167, 'come': 27, 'coming': 28, 'wish': 189, 'gun': 57, 'sent': 142, 'think': 163, 'cost': 30, 'contact': 29, 'love': 92, 'need': 106, 'stop': 152, 'ur': 174, 'today': 166, 'plan': 123, 'sleep': 146, 'said': 139, 'cant': 17, 'wait': 178, 'hear': 63, 'text': 158, 'oh': 114, 'got': 53, 'job': 74, 'yeah': 197, 'use': 175, 'help': 65, 'let': 82, 'like': 84, 'leave': 80, 'min': 99, 'pick': 121, 'tomorrow': 168, 'dun': 38, 'hey': 66, 'want': 181, 'co': 26, 'going': 50, 'enjoy': 41, 'book': 13, 'minute': 100, 'thought': 164, 'thats': 161, 'soon': 148, 'mean': 95, 'time': 165, 'stuff': 153, 'phone': 119, 'look': 89, 'lot': 91, 'word': 192, 'dont': 36, 'holiday': 68, 'send': 141, 'heart': 64, 'feel': 42, 'day': 33, 'go': 49, 'miss': 101, 'long': 88, 'remember': 134, 'wat': 183, 'meeting': 97, 'right': 136, 'home': 69, 'ok': 115, 'meet': 96, 'week': 186, 'working': 194, 'guy': 58, 'maybe': 94, 'called': 16, 'getting': 47, 'chance': 21, 'dear': 34, 'probably': 128, 'gonna': 51, 'bit': 12, 'way': 185, 'watching': 184, 'went': 187, 'people': 117, 'life': 83, 'da': 32, 'ya': 196, 'know': 76, 'thanks': 160, '1st': 1, 'lol': 87, 'babe': 7, 'eat': 39, 'thank': 159, 'happy': 61, 'late': 77, 'hi': 67, 'new': 108, 'better': 10, 'play': 124, 'actually': 4, 'talk': 156, 'place': 122, 'year': 198, 'car': 18, 'pls': 125, 'money': 103, 'didnt': 35, 'buy': 15, 'girl': 48, 'reply': 135, 'free': 45, 'yes': 199, 'box': 14, 'txt': 173, 'finish': 44, 'xxx': 195, 'wont': 191, 'bad': 8, 'guess': 56, 'im': 73, 'real': 132, 'try': 171, 'sorry': 149, 'find': 43, 'ready': 131, 'check': 23, 'nice': 109, 'msg': 105, 'run': 138, 'half': 60, 'great': 54, 'ma': 93, 'little': 85, 'left': 81, 'friend': 46, 'pic': 120, 'care': 19, 'number': 112, 'speak': 150, 'morning': 104, 'wanted': 182, 'say': 140, 'message': 98, 'haha': 59, 'later': 78, 'looking': 90, 'start': 151, 'end': 40, 'person': 118, 'trying': 172, 'sweet': 155, 'ill': 72, 'account': 3, 'shall': 144, 'mobile': 102, 'smile': 147, 'join': 75, 'wanna': 180, 'video': 176, 'offer': 113, 'tone': 169, 'class': 25, 'best': 9, 'waiting': 179, 'po': 126, 'live': 86, 'customer': 31, 'cash': 20, '2nd': 2, 'draw': 37, 'win': 188, 'tonight': 170, 'chat': 22, 'service': 143, 'reach': 130, 'won': 190, 'show': 145, 'nokia': 111, 'receive': 133, 'network': 107, 'latest': 79, '150ppm': 0, 'voucher': 177, 'awarded': 6, 'prize': 127, 'claim': 24, 'guaranteed': 55, 'ps1000': 129} tf-idf values: [[0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] ... [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.]]
# Independent feature
X = tf_vectors
# Target feature
y = data["type"].map({'ham':0,'spam':1})
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize = True))
Shape of Training set :  (3222, 200)
Shape of test set :  (806, 200)
Percentage of classes in training set:
0    0.84761
1    0.15239
Name: type, dtype: float64
Percentage of classes in test set:
0    0.847395
1    0.152605
Name: type, dtype: float64
# initializing the Random Forest model
model = RandomForestClassifier(random_state = 1)
# fitting the model on training set
model.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Model performance on the training data
# making predictions on the training set
y_pred_train = model.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2731
           1       0.98      0.93      0.96       491

    accuracy                           0.99      3222
   macro avg       0.99      0.97      0.98      3222
weighted avg       0.99      0.99      0.99      3222
Let's look at the test-set scores to verify whether the model generalizes well or has overfit the training data.
Model performance on the testing data
# making predictions on the test set
y_pred = model.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       683
           1       0.86      0.78      0.82       123

    accuracy                           0.95       806
   macro avg       0.91      0.88      0.89       806
weighted avg       0.95      0.95      0.95       806
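To quantify the gap flagged above, a minimal check (reusing the fitted model and the train/test split from this section) is to compare the overall train and test accuracy directly:
# A large train/test accuracy gap would indicate overfitting
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, model.predict(X_test)))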