Email and text messaging have become an essential part of daily life for users worldwide. Both carry a great deal of sensitive information that attackers try to steal, which is known as data theft. It is therefore critical to distinguish between spam and ham messages.
Ham refers to genuine mail or text that is informative and important to the user. Spam, on the other hand, is unsolicited mail or text sent from untrustworthy sources with malicious intent.
In this notebook, we create vectors for the text data using basic techniques such as Bag of Words and TF-IDF.
# To read and manipulate the data
import pandas as pd
pd.set_option('max_colwidth', None)
# To visualise the graphs
import matplotlib.pyplot as plt
from matplotlib import cycler
colors = cycler('color',
                ['#EE6666', '#3388BB', '#9988DD',
                 '#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
       axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='black')
plt.rc('ytick', direction='out', color='black')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
import seaborn as sns
# Helps to display the images
from PIL import Image
# Helps to remove the punctuation
import string
# Helps to create the counter
from collections import Counter
# Helps to create train and test data
from sklearn.model_selection import train_test_split
# Importing the vectorization classes
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
# Importing the Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import accuracy_score,classification_report, confusion_matrix
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
We load the cleaned data produced in the HAM vs Spam notebook to avoid repeating the preprocessing steps. As mentioned earlier, let's now learn and implement vectorization techniques other than one-hot encoding.
filepath = "/content/drive/MyDrive/data_vertical/Final/1. Week-1: NLP/Notebooks/1.2 SMS_spam.csv"
messages = pd.read_csv(filepath, index_col = [0])
# Creating the copy of the data frame
data = messages.copy()
# View the first 5 rows of the dataset
data.head(5)
 | type | text
---|---|---
0 | ham | hope having good checking
1 | ham | dong cbe bt pay
2 | ham | ask mummy father
3 | ham | fyi usf swing room
4 | ham | sure thing big hockey election longer hour
# View the last five rows of the dataset
data.tail(5)
 | type | text
---|---|---
4023 | spam | cd congratulation ur awarded ps500 cd gift voucher ps125 gift guaranteed reentry 2 ps100 draw xt music 87066 tn
4024 | spam | mobile 11myths update free orange latest colour camera mobile unlimited weekend mobile upd8 freeform 08000839402 2stoptxt
4025 | spam | 3 lion england reply lion 4 mono lion 4 4 2 original n tone 3gb network operator rate
4026 | spam | ur balance ur question sang 2 answer txt ur answer good
4027 | spam | ac energy u know 2channel 2day ur leadership skill r reply an reply end sco
# Checking the shape of the dataset
data.shape
(4028, 2)
# Checking the datatypes and columns
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4028 entries, 0 to 4027
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   type    4028 non-null   object
 1   text    4028 non-null   object
dtypes: object(2)
memory usage: 94.4+ KB
# Checking for duplicate values
data.duplicated().sum()
0
# Checking for missing values
data.isna().sum()
type    0
text    0
dtype: int64
data['type'].value_counts()
ham     3414
spam     614
Name: type, dtype: int64
# this library is used to expand contractions
!pip install contractions
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: contractions in /usr/local/lib/python3.7/dist-packages (0.1.73)
Requirement already satisfied: textsearch>=0.0.21 in /usr/local/lib/python3.7/dist-packages (from contractions) (0.0.24)
Requirement already satisfied: pyahocorasick in /usr/local/lib/python3.7/dist-packages (from textsearch>=0.0.21->contractions) (1.4.4)
Requirement already satisfied: anyascii in /usr/local/lib/python3.7/dist-packages (from textsearch>=0.0.21->contractions) (0.3.1)
# Helps to extract the data using regular expressions
import re
import nltk
import contractions
nltk.download('punkt')
nltk.download('all')
from nltk.corpus import stopwords
from nltk import word_tokenize
# Used in Lemmatization
from nltk.stem import WordNetLemmatizer
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]   ... (all packages already up-to-date)
[nltk_data] Done downloading collection all
# function for text pre-processing
def clean_text(text):
    """
    Clean a single message: remove URLs, HTML tags, punctuation and stop words,
    lowercase, tokenize, drop non-alphabetic tokens, lemmatize and expand contractions.
    """
    # Cleaning the urls
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Cleaning the html elements
    text = re.sub(r'<.*?>', '', text)
    # Removing the punctuation using a regular expression,
    # i.e. removing anything which is not a word or whitespace character
    text = re.sub(r'[^\w\s]', '', text)
    # Converting the text to lower case
    text = text.lower()
    # Removing stop words
    text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])
    # Cleaning the whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize the data
    tokens = word_tokenize(text)
    # Remove numbers (keep alphabetic tokens only)
    tokens = [t for t in tokens if t.isalpha()]
    # Lemmatize the data
    tokens = [WordNetLemmatizer().lemmatize(t) for t in tokens]
    # Fix contractions (example: "'cause" -> "because", "could've" -> "could have", etc.)
    return ' '.join([contractions.fix(t) for t in tokens])
data["text"] = data["text"].astype(str)
data['clean_text'] = data['text'].apply(clean_text)
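As a quick, illustrative sanity check (the sample message below is made up, not taken from the dataset), we can run a single string through clean_text and inspect a couple of cleaned rows; the exact tokens kept depend on the NLTK stop-word list and lemmatizer.
# Illustrative check: compare a raw sample message with its cleaned form
sample = "FREE entry!! Visit www.example.com to claim your prize today"
print(clean_text(sample))
# Compare the original and cleaned columns for the first two messages
data[['text', 'clean_text']].head(2)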
Let's take a look at the text data to understand the most frequent words.
# spam messages
data_spam = data[data['type'] == 'spam']
# ham messages
data_ham = data[data['type'] == 'ham']
data_spam.head(5)
 | type | text | clean_text
---|---|---|---
3414 | spam | complimentary 4 star biz holiday cash need urgent 09066364349 landing lose | complimentary star biz holiday cash need urgent landing lose
3415 | spam | dear dave final notice collect tenerife holiday cash 09061743806 tc sae box326 cw25wx 150ppm | dear dave final notice collect tenerife holiday cash tc sae
3416 | spam | marvel mobile play official ultimate game ur mobile right text spider 83338 game ll send u free 8ball wallpaper | marvel mobile play official ultimate game you are mobile right text spider game send you free wallpaper
3417 | spam | u win ps100 music gift voucher week starting txt word draw 87066 sc | you win music gift voucher week starting txt word draw sc
3418 | spam | u won nokia 6230 plus free digital u u win free send nokia 83383 16 | you nokia plus free digital you you win free send nokia
# Returns the n most frequent n-grams (by total count) in the corpus, excluding English stop words
def top_n_ngram(corpus, n=None, ngram=1):
    vec = CountVectorizer(stop_words='english', ngram_range=(ngram, ngram)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
UNIGRAM
words1 = top_n_ngram(data_spam['clean_text'], 20, 1)
df_s = pd.DataFrame(words1, columns=['Frequent Words in Spam Texts', 'frequency'])
df_s.plot(kind='bar', x='Frequent Words in Spam Texts')
[Bar chart: 20 most frequent words in spam texts]
words2 = top_n_ngram(data_ham['clean_text'], 20, 1)
df_h = pd.DataFrame(words2, columns=['Frequent Words in Ham Texts', 'frequency'])
df_h.plot(kind='bar', x='Frequent Words in Ham Texts')
[Bar chart: 20 most frequent words in ham texts]
words3 = top_n_ngram(data['clean_text'], 10, 1)
df_w = pd.DataFrame(words3, columns=['Frequent Words in whole Texts', 'frequency'])
df_w.plot(kind='bar', x='Frequent Words in whole Texts')
[Bar chart: 10 most frequent words across all texts]
Now let’s create the vectors for the text data using Bag-of-Words and TF-IDF techniques.
A bag of words is a representation of text that describes the occurrence of words within a document.
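To make this concrete, here is a minimal illustrative sketch on a tiny made-up corpus (not the SMS data), showing how CountVectorizer turns each document into a vector of word counts:
# Toy example of the Bag of Words representation
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["free prize win now",
              "call me when you are free",
              "win a free holiday"]
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus)
print(toy_cv.vocabulary_)      # word -> column index
print(toy_counts.toarray())    # one row of counts per document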
# Creating the Bag of Words model
# limiting the vocabulary to the 200 most frequent words
cv = CountVectorizer(max_features = 200)
CountVectorizer has several parameters that are useful for building an effective model. We mostly use the following three parameters of CountVectorizer():
ngram_range: the lower and upper bounds of the range of n-values for the n-grams to extract. All values of n such that min_n <= n <= max_n are used. An ngram_range of (1, 1), for example, means only unigrams; (1, 2) means unigrams and bigrams; and (2, 2) means only bigrams.
analyzer {'word', 'char', 'char_wb'} or callable, default='word': whether features should be made of word n-grams or character n-grams. The option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
max_features (int), default=None: if not None, builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
There are other parameters that can be useful for data cleaning; refer to the scikit-learn documentation to explore them further. A short sketch on a toy corpus follows below.
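For instance (an illustrative sketch on a made-up corpus, not part of the original notebook), combining the three parameters might look like this:
# Keep only the 5 most frequent word-level unigrams and bigrams from a toy corpus
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["win a free prize now", "claim your free prize today"]
cv_demo = CountVectorizer(ngram_range=(1, 2), analyzer='word', max_features=5)
cv_demo.fit(toy_corpus)
print(cv_demo.vocabulary_)   # the 5 retained terms and their column indices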
# Fit and transform the cleaned text into count vectors
vectors = cv.fit_transform(data['clean_text']).toarray()
# Printing the identified Unique words along with their indices
print("Vocabulary: ", cv.vocabulary_)
# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vectors)
Vocabulary: {'hope': 69, 'good': 53, 'pay': 115, 'ask': 4, 'room': 135, 'sure': 154, 'thing': 162, 'big': 11, 'hour': 70, 'work': 191, 'night': 108, 'tell': 157, 'told': 167, 'you': 199, 'wan': 179, 'come': 27, 'coming': 28, 'wish': 189, 'gun': 57, 'sent': 140, 'think': 163, 'cost': 30, 'contact': 29, 'love': 91, 'need': 104, 'stop': 152, 'are': 3, 'today': 166, 'plan': 121, 'sleep': 144, 'said': 137, 'cannot': 17, 'wait': 177, 'hear': 62, 'text': 158, 'oh': 113, 'got': 54, 'job': 73, 'is': 72, 'yeah': 195, 'use': 174, 'help': 64, 'let': 81, 'like': 83, 'leave': 79, 'min': 97, 'pick': 119, 'tomorrow': 168, 'dun': 38, 'hey': 65, 'want': 180, 'co': 26, 'going': 51, 'enjoy': 41, 'book': 13, 'minute': 98, 'thought': 164, 'that': 161, 'soon': 147, 'mean': 93, 'time': 165, 'stuff': 153, 'phone': 118, 'look': 88, 'lot': 90, 'word': 190, 'do': 36, 'not': 110, 'holiday': 67, 'send': 139, 'heart': 63, 'feel': 43, 'day': 33, 'go': 50, 'miss': 99, 'long': 87, 'know': 75, 'remember': 132, 'wat': 182, 'meeting': 95, 'right': 134, 'home': 68, 'ok': 114, 'meet': 94, 'week': 185, 'working': 192, 'guy': 58, 'maybe': 92, 'called': 16, 'getting': 48, 'chance': 21, 'dear': 34, 'probably': 127, 'gon': 52, 'na': 103, 'bit': 12, 'way': 184, 'watching': 183, 'went': 186, 'people': 116, 'life': 82, 'da': 32, 'ya': 194, 'thanks': 160, 'lol': 86, 'babe': 7, 'eat': 39, 'thank': 159, 'happy': 61, 'late': 76, 'hi': 66, 'new': 106, 'better': 10, 'play': 122, 'actually': 1, 'talk': 156, 'place': 120, 'year': 196, 'car': 18, 'pls': 123, 'money': 101, 'did': 35, 'buy': 15, 'girl': 49, 'reply': 133, 'free': 46, 'yes': 197, 'box': 14, 'txt': 173, 'finish': 45, 'xxx': 193, 'will': 187, 'bad': 8, 'guess': 56, 'am': 2, 'real': 130, 'try': 171, 'sorry': 148, 'find': 44, 'ready': 129, 'check': 23, 'nice': 107, 'message': 96, 'run': 136, 'half': 60, 'great': 55, 'little': 84, 'left': 80, 'friend': 47, 'care': 19, 'number': 111, 'speak': 150, 'morning': 102, 'wanted': 181, 'say': 138, 'haha': 59, 'later': 77, 'looking': 89, 'start': 151, 'end': 40, 'person': 117, 'trying': 172, 'yo': 198, 'sweet': 155, 'ill': 71, 'asked': 5, 'account': 0, 'shall': 142, 'mobile': 100, 'smile': 146, 'join': 74, 'video': 175, 'offer': 112, 'tone': 169, 'class': 25, 'best': 9, 'waiting': 178, 'po': 124, 'live': 85, 'sound': 149, 'customer': 31, 'sm': 145, 'cash': 20, 'draw': 37, 'win': 188, 'tonight': 170, 'chat': 22, 'service': 141, 'reach': 128, 'show': 143, 'nokia': 109, 'receive': 131, 'pound': 125, 'network': 105, 'entry': 42, 'latest': 78, 'voucher': 176, 'awarded': 6, 'prize': 126, 'claim': 24} Encoded Document is: [[0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 0] ... [0 0 0 ... 0 0 0] [0 0 0 ... 0 0 3] [0 0 0 ... 0 0 2]]
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
Model Building
# Independent feature
X = vectors
# Target feature
y = data["type"].map({'ham':0,'spam':1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize = True))
Shape of Training set :  (3222, 200)
Shape of test set :  (806, 200)
Percentage of classes in training set:
0    0.84761
1    0.15239
Name: type, dtype: float64
Percentage of classes in test set:
0    0.847395
1    0.152605
Name: type, dtype: float64
# initializing the Random Forest model
model = RandomForestClassifier(random_state = 1)
# fitting the model on training set
model.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Model performance on the training data
# making predictions on the training set
y_pred_train = model.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2731
           1       0.99      0.93      0.96       491

    accuracy                           0.99      3222
   macro avg       0.99      0.97      0.98      3222
weighted avg       0.99      0.99      0.99      3222
Model performance on the testing data
# making predictions on the test set
y_pred = model.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       683
           1       0.83      0.82      0.82       123

    accuracy                           0.95       806
   macro avg       0.90      0.90      0.90       806
weighted avg       0.95      0.95      0.95       806
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. TF-IDF vectorization computes a TF-IDF score for each word in each document of the corpus and stores those scores in a vector, so that words common across many documents are down-weighted relative to rarer, more distinctive words.
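As a rough illustration (a toy corpus, not the SMS data; note that scikit-learn uses a smoothed IDF and L2-normalizes each row), words that occur in many documents receive lower scores within a document than words unique to that document:
# Toy example of TF-IDF scores
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_corpus = ["free prize win",
              "free call tonight",
              "call me tomorrow"]
tfidf_demo = TfidfVectorizer()
toy_tfidf = tfidf_demo.fit_transform(toy_corpus)
cols = sorted(tfidf_demo.vocabulary_, key=tfidf_demo.vocabulary_.get)
# 'free' and 'call' appear in two documents each, so within a row their
# scores are lower than those of words unique to that document
print(pd.DataFrame(toy_tfidf.toarray(), columns=cols).round(2))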
# Creating the object to the TfidfVectorizer class
vectorizer = TfidfVectorizer(max_features = 200)
TfidfVectorizer has several parameters that are useful for building an effective model. We mostly use the following three parameters of TfidfVectorizer():
ngram_range: the lower and upper bounds of the range of n-values for the n-grams to extract. All values of n such that min_n <= n <= max_n are used. An ngram_range of (1, 1), for example, means only unigrams; (1, 2) means unigrams and bigrams; and (2, 2) means only bigrams.
max_df (float or int), default=1.0: when building the vocabulary, ignore terms with a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; otherwise it is an absolute count. This parameter is ignored if vocabulary is not None.
max_features (int), default=None: if not None, builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
There are other parameters that can be useful for data cleaning; refer to the scikit-learn documentation to explore them further. A short sketch on a toy corpus follows below.
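For example (an illustrative sketch on a made-up corpus, not part of the original notebook), max_df can drop a term such as 'free' that appears in every toy document:
# Toy example of max_df, ngram_range and max_features with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["free prize now", "free call tonight", "free win big"]
# 'free' occurs in all three documents, so max_df=0.5 removes it from the vocabulary
tfidf_demo = TfidfVectorizer(max_df=0.5, ngram_range=(1, 2), max_features=10)
tfidf_demo.fit(toy_corpus)
print(tfidf_demo.vocabulary_)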
# Fit and transform the text column into TF-IDF vectors
tf_vectors = vectorizer.fit_transform(data['text']).toarray()
# get indexing
print('\nWord indexes:')
print(vectorizer.vocabulary_)
# tf-idf values
print('\ntf-idf values:')
print(tf_vectors)
Word indexes: {'hope': 70, 'having': 62, 'good': 52, 'pay': 116, 'ask': 5, 'room': 137, 'sure': 154, 'thing': 162, 'big': 11, 'hour': 71, 'work': 193, 'night': 110, 'tell': 157, 'told': 167, 'come': 27, 'coming': 28, 'wish': 189, 'gun': 57, 'sent': 142, 'think': 163, 'cost': 30, 'contact': 29, 'love': 92, 'need': 106, 'stop': 152, 'ur': 174, 'today': 166, 'plan': 123, 'sleep': 146, 'said': 139, 'cant': 17, 'wait': 178, 'hear': 63, 'text': 158, 'oh': 114, 'got': 53, 'job': 74, 'yeah': 197, 'use': 175, 'help': 65, 'let': 82, 'like': 84, 'leave': 80, 'min': 99, 'pick': 121, 'tomorrow': 168, 'dun': 38, 'hey': 66, 'want': 181, 'co': 26, 'going': 50, 'enjoy': 41, 'book': 13, 'minute': 100, 'thought': 164, 'thats': 161, 'soon': 148, 'mean': 95, 'time': 165, 'stuff': 153, 'phone': 119, 'look': 89, 'lot': 91, 'word': 192, 'dont': 36, 'holiday': 68, 'send': 141, 'heart': 64, 'feel': 42, 'day': 33, 'go': 49, 'miss': 101, 'long': 88, 'remember': 134, 'wat': 183, 'meeting': 97, 'right': 136, 'home': 69, 'ok': 115, 'meet': 96, 'week': 186, 'working': 194, 'guy': 58, 'maybe': 94, 'called': 16, 'getting': 47, 'chance': 21, 'dear': 34, 'probably': 128, 'gonna': 51, 'bit': 12, 'way': 185, 'watching': 184, 'went': 187, 'people': 117, 'life': 83, 'da': 32, 'ya': 196, 'know': 76, 'thanks': 160, '1st': 1, 'lol': 87, 'babe': 7, 'eat': 39, 'thank': 159, 'happy': 61, 'late': 77, 'hi': 67, 'new': 108, 'better': 10, 'play': 124, 'actually': 4, 'talk': 156, 'place': 122, 'year': 198, 'car': 18, 'pls': 125, 'money': 103, 'didnt': 35, 'buy': 15, 'girl': 48, 'reply': 135, 'free': 45, 'yes': 199, 'box': 14, 'txt': 173, 'finish': 44, 'xxx': 195, 'wont': 191, 'bad': 8, 'guess': 56, 'im': 73, 'real': 132, 'try': 171, 'sorry': 149, 'find': 43, 'ready': 131, 'check': 23, 'nice': 109, 'msg': 105, 'run': 138, 'half': 60, 'great': 54, 'ma': 93, 'little': 85, 'left': 81, 'friend': 46, 'pic': 120, 'care': 19, 'number': 112, 'speak': 150, 'morning': 104, 'wanted': 182, 'say': 140, 'message': 98, 'haha': 59, 'later': 78, 'looking': 90, 'start': 151, 'end': 40, 'person': 118, 'trying': 172, 'sweet': 155, 'ill': 72, 'account': 3, 'shall': 144, 'mobile': 102, 'smile': 147, 'join': 75, 'wanna': 180, 'video': 176, 'offer': 113, 'tone': 169, 'class': 25, 'best': 9, 'waiting': 179, 'po': 126, 'live': 86, 'customer': 31, 'cash': 20, '2nd': 2, 'draw': 37, 'win': 188, 'tonight': 170, 'chat': 22, 'service': 143, 'reach': 130, 'won': 190, 'show': 145, 'nokia': 111, 'receive': 133, 'network': 107, 'latest': 79, '150ppm': 0, 'voucher': 177, 'awarded': 6, 'prize': 127, 'claim': 24, 'guaranteed': 55, 'ps1000': 129} tf-idf values: [[0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] ... [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.] [0. 0. 0. ... 0. 0. 0.]]
# Independent feature
X = tf_vectors
# Target feature
y = data["type"].map({'ham':0,'spam':1})
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize = True))
Shape of Training set :  (3222, 200)
Shape of test set :  (806, 200)
Percentage of classes in training set:
0    0.84761
1    0.15239
Name: type, dtype: float64
Percentage of classes in test set:
0    0.847395
1    0.152605
Name: type, dtype: float64
# initializing the Random Forest model
model = RandomForestClassifier(random_state = 1)
# fitting the model on training set
model.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Model performance on the training data
# making predictions on the training set
y_pred_train = model.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2731
           1       0.98      0.93      0.96       491

    accuracy                           0.99      3222
   macro avg       0.99      0.97      0.98      3222
weighted avg       0.99      0.99      0.99      3222
Let's look at the test-set scores to verify whether the model generalizes well or has overfit the training data.
Model performance on the testing data
# making predictions on the test set
y_pred = model.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       683
           1       0.86      0.78      0.82       123

    accuracy                           0.95       806
   macro avg       0.91      0.88      0.89       806
weighted avg       0.95      0.95      0.95       806
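To quantify the gap flagged above, a minimal check (reusing the fitted model and the train/test split from this section) is to compare the overall train and test accuracy directly:
# A large train/test accuracy gap would indicate overfitting
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, model.predict(X_test)))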