Introduction

Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in natural language processing.

The text we want to classify is given as input to an algorithm. The algorithm analyzes the text’s content and assigns the input to one of the tags or categories defined in advance.

Input → Classifying Algorithm → Classification of Input

Real-life examples:

    Sentiment analysis: how does the writer of the sentence feel about their subject, positively or negatively? Ex. restaurant reviews.

    Topic labeling: given sentences and a set of topics, which topic does each sentence fall under? Ex. is this essay about history? Math?

    Spam detection: is this email an important one or spam? Ex. email filtering.

Example: a restaurant wants to evaluate its reviews but doesn’t want to read through all of them, so it wants a computer algorithm to do the work. It simply wants to know whether each customer’s review is positive or negative.

Here’s an example of a customer’s review and a simple way an algorithm could classify their review.

Input: “The food here was too salty and too expensive”

Algorithm: Goes through every word in the sentence and counts how many positive words and how many negative words are in the sentence.

    “The, food, here, was, too, and” are all neutral words

    “Salty, expensive” are negative words.

    Negative words: 2
    Positive words: 0

Classification: Negative Review, because there are more negative words (2) than positive (0).

However, this algorithm fails in many cases.

For example, “The food here was good, not expensive and not salty” would be classified as negative, because it contains two negative words (expensive, salty) and only one positive word (good). It’s actually a positive review.
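The word-counting approach above can be sketched in a few lines. The word lists here are tiny, hand-picked assumptions for illustration only (they are not part of the dataset used later), and countClassify is a hypothetical helper name.

```python
import string

# Hypothetical, hand-picked word lists (assumptions for illustration only).
POSITIVE_WORDS = {"good", "great", "tasty", "loved"}
NEGATIVE_WORDS = {"salty", "expensive", "nasty", "angry"}

def countClassify(sentence):
    # Lower-case, strip punctuation, and split into words.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    # Ties fall to "negative" here; that is an arbitrary choice.
    return "positive" if positive > negative else "negative"

print(countClassify("The food here was too salty and too expensive"))
print(countClassify("The food here was good, not expensive and not salty"))
```

Both calls print negative: the counter sees "expensive" and "salty" in the second review but has no idea that "not" flips their meaning.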

Language and text can get very complicated, which makes creating these algorithms difficult. Complications include words with multiple meanings, negation words (such as "not"), and slang.

Set up data and imports

Library imports

This section imports the Python libraries that we'll need for the rest of this notebook. Some packages may need to be installed first, since they are not part of the Python 3 standard library.

!pip3 install seaborn
!pip3 install plotly --user
!pip3 install scikit-learn

import sys
import string
from scipy import sparse
from pprint import pprint
import pandas as pd
import seaborn as sns
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected = True)
import matplotlib.pyplot as plt
import numpy as np
from html import escape
from IPython.display import display, HTML
from string import Template
from sklearn.metrics import classification_report
import json

HTML('<script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>')

# Our two files that contain our data, split up into a training set and a testing set.

trainingFile = "trainingSet.txt"
testingFile = "testSet.txt"


Getting our data

Below is a definition of getData, a basic function to pull from the trainingSet.txt and testSet.txt. The data that we're using for this example is a set of reviews written by users on Yelp, classified as positive (1) or negative (0).

The function opens the file and reads it line by line, splitting each line into a sentence and its label.

def getData(fileName):
    sentences = []
    sentiments = []

    # Each line holds a sentence and its label, separated by a tab.
    # The with-block closes the file for us when we're done.
    with open(fileName) as f:
        for line in f:
            sentence, sentiment = line.split('\t')
            sentences.append(sentence.strip())
            sentiments.append(int(sentiment.strip())) # Sentiment in {0,1}

    return sentences, np.array(sentiments)

trainingSentences, trainingLabels = getData(trainingFile)
testingSentences, testingLabels = getData(testingFile)
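To make the expected file layout concrete, here's a small sketch that writes two lines in the assumed "sentence, tab, label" format to a temporary file and parses them back. parseLine is a hypothetical helper, not part of the notebook; it splits on the last tab, which is a slightly more defensive choice than getData makes.

```python
import os
import tempfile

def parseLine(line):
    # Split on the last tab so a tab inside the review text would not break parsing.
    sentence, sentiment = line.rstrip("\n").rsplit("\t", 1)
    return sentence.strip(), int(sentiment)

# Two lines written in the assumed "review<TAB>label" format.
sample = "Wow... Loved this place.\t1\nNot tasty and the texture was just nasty.\t0\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path) as f:
    parsed = [parseLine(line) for line in f]
os.remove(path)

print(parsed)
```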

Let's take a peek at what this data looks like:

sentences = []
sentiments = []

with open("trainingSet.txt") as f:
    for line in f:
        sentence, sentiment = line.split('\t')
        sentences.append(sentence.strip())
        sentiments.append(int(sentiment.strip()))
    
print("Sample sentences:")
pprint(sentences[:10]) 
print("Corresponding sentiments:")
pprint(sentiments[:10])

Sample sentences:
['Wow... Loved this place.',
 'Not tasty and the texture was just nasty.',
 'Stopped by during the late May bank holiday off Rick Steve recommendation '
 'and loved it.',
 'The selection on the menu was great and so were the prices.',
 'Now I am getting angry and I want my damn pho.',
 "Honeslty it didn't taste THAT fresh.)",
 'The potatoes were like rubber and you could tell they had been made up ahead '
 'of time being kept under a warmer.',
 'The fries were great too.',
 'A great touch.',
 'Service was very prompt.']
Corresponding sentiments:
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]

Pre-processing our data

We need to tokenize these sentences, splitting each one into a list of lowercase words, so that we can feed our model individual words along with the sentiment (negative / positive) of the sentence they came from.

def preProcess(sentences):

    def cleanText(text):
        # Make lower case
        text = text.lower()

        # Replace punctuation characters with spaces
        nonText = string.punctuation
        text = text.translate(str.maketrans(nonText, ' ' * len(nonText)))

        # Split sentences into individual words - tokenize
        words = text.split()

        return words

    return list(map(cleanText, sentences))

trainingTokens = preProcess(trainingSentences)
testingTokens = preProcess(testingSentences)

Let's look at what these tokenized sentences look like now:

print("Training tokens:")
pprint(trainingTokens[:2]) 
print("Testing tokens:")
pprint(testingTokens[:3])

Training tokens:
[['wow', 'loved', 'this', 'place'],
 ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty']]
Testing tokens:
[['crust', 'is', 'not', 'good'],
 ['would', 'not', 'go', 'back'],
 ['i', 'was', 'shocked', 'because', 'no', 'signs', 'indicate', 'cash', 'only']]
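Here's a minimal sketch of the same cleaning steps applied to a single sentence from the sample above, so we can see exactly what happens to punctuation and contractions.

```python
import string

sentence = "Honeslty it didn't taste THAT fresh.)"

# Map every punctuation character to a space, as cleanText does.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
tokens = sentence.lower().translate(table).split()

print(tokens)
# ['honeslty', 'it', 'didn', 't', 'taste', 'that', 'fresh']
```

Notice that the apostrophe splits "didn't" into 'didn' and 't', and that the misspelling "honeslty" is kept as its own word. This simple tokenizer does no spelling correction or contraction handling.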

Vectorizing our data

Now that our sentences are tokenized, notice that the training tokens are nested lists (one list of words per sentence). We want to flatten them into a single vocabulary: the set of distinct words that appear in the training data.

def getVocab(sentences):
    vocab = set()
    for sentence in sentences:
        for word in sentence:
            vocab.add(word)
    return sorted(vocab)

vocabulary = getVocab(trainingTokens)

Let's peek at a slice of our vocabulary, which is an alphabetically sorted list of words:

pprint(vocabulary[50:70])

['amount',
 'an',
 'and',
 'angry',
 'another',
 'anticipated',
 'any',
 'anything',
 'anytime',
 'anyway',
 'apologize',
 'app',
 'appalling',
 'appetizers',
 'apple',
 'approval',
 'are',
 'area',
 'aren',
 'aria']

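To feed these sentences to our model (which we'll create below), we need to turn each one into a numeric vector. The createVector function builds a sparse matrix with one row per sentence and one column per vocabulary word; an entry is 1 if the word appears in the sentence.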

def createVector(vocab, sentences):
    # Map each word to its column once; calling vocab.index(word) for
    # every token would rescan the whole list each time.
    wordToIndex = {word: index for index, word in enumerate(vocab)}

    rows = []
    columns = []
    wordOccurrences = []

    for sentenceIndex, sentence in enumerate(sentences):
        # We only record {0,1} for the presence of a word (Bernoulli NB), so
        # each word is counted once per sentence. For multinomial NB we would
        # count every occurrence of the word instead.
        alreadyCounted = set()
        for word in sentence:
            if word in wordToIndex and word not in alreadyCounted:
                rows.append(sentenceIndex)         # which sentence
                columns.append(wordToIndex[word])  # which word
                wordOccurrences.append(1)
                alreadyCounted.add(word)

    sentenceVectors = sparse.csr_matrix((wordOccurrences, (rows, columns)), dtype=int, shape=(len(sentences), len(vocab)))

    return sentenceVectors

training = createVector(vocabulary, trainingTokens)
testing = createVector(vocabulary, testingTokens)

Our training and test data have now been transformed into sparse matrices. Here's what the training data looks like; each line shows a (sentence index, vocabulary index) pair followed by the stored value:

print("Training data:")
print(training[:2])

Training data:
  (0, 694)	1
  (0, 884)	1
  (0, 1186)	1
  (0, 1335)	1
  (1, 52)	1
  (1, 640)	1
  (1, 768)	1
  (1, 788)	1
  (1, 1158)	1
  (1, 1166)	1
  (1, 1171)	1
  (1, 1281)	1

A Naive Bayes model

Creating and Training our Model

Below is our Naive Bayes classifier, which is the model we've chosen to use for our sentiment analysis of restaurant reviews.
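As a rough summary of what the class below computes (the notation is ours, introduced for illustration): for a class c (positive or negative) and the set W of vocabulary words present in a review, the classifier scores the review as

```latex
\operatorname{score}(c) = \log P(c) + \sum_{w \in W} \log P(w \mid c),
\qquad
P(w \mid c) = \frac{n_{w,c} + \alpha}{N_c + 2\alpha}
```

Here n_{w,c} is the number of class-c training reviews that contain w, N_c is the number of class-c training reviews, and α is the smoothing parameter (dirichlet in the code, 1 by default). The review is labeled positive when score(positive) is greater than score(negative). Note that a textbook Bernoulli Naive Bayes model would also add a log(1 - P(w | c)) term for every vocabulary word absent from the review; the class below sums over the present words only.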

class NaiveBayesClassifier:
    def __init__(self):
        self.priorPositive = None  # Probability that a review is positive
        self.priorNegative = None  # Probability that a review is negative
        self.positiveLogConditionals = None
        self.negativeLogConditionals = None

    def computePriorProbabilities(self, labels):
        self.priorPositive = len([y for y in labels if y == 1]) / len(labels)
        self.priorNegative = 1 - self.priorPositive

    def computeConditionProbabilities(self, examples, labels, dirichlet=1):
        _, vocabularyLength = examples.shape

        # How many of each word are there in all of the positive reviews
        positiveCounts = np.array([dirichlet for _ in range(vocabularyLength)])
        # How many of each word are there in all of the negative reviews
        negativeCounts = np.array([dirichlet for _ in range(vocabularyLength)])

        # Here's how to iterate through a sparse matrix
        coordinates = examples.tocoo()  # Converted to a `coordinate` format
        for exampleIndex, featureIndex, observationCount in zip(coordinates.row, coordinates.col, coordinates.data):
            # For sentence {exampleIndex}, for word at index {featureIndex}, the word occurred {observationCount} times
            if labels[exampleIndex] == 1:
                positiveCounts[featureIndex] += observationCount
            else:
                negativeCounts[featureIndex] += observationCount

        positiveReviewCount = len([y for y in labels if y == 1])
        negativeReviewCount = len([y for y in labels if y == 0])

        # [!] Store the logs of the probabilities, so predictions can add them
        # up instead of multiplying many small numbers together.
        # We are using Bernoulli NB (single occurrence of a word per review)
        self.positiveLogConditionals = np.log(positiveCounts) - np.log(positiveReviewCount + dirichlet*2)
        self.negativeLogConditionals = np.log(negativeCounts) - np.log(negativeReviewCount + dirichlet*2)

        # This works for multinomial NB (multiple occurrences of a word)
        # self.positiveLogConditionals = np.log(positiveCounts) - np.log(sum(positiveCounts))
        # self.negativeLogConditionals = np.log(negativeCounts) - np.log(sum(negativeCounts))

    # Calculate all of the parameters for making a naive bayes classification
    def fit(self, trainingExamples, trainingLabels):
        # Compute the probability of positive/negative review
        self.computePriorProbabilities(trainingLabels)

        # Compute the log-probability of each word given a positive/negative review
        self.computeConditionProbabilities(trainingExamples, trainingLabels)

    def computeLogPosteriors(self, sentence):
        return ((np.log(self.priorPositive) + sum(sentence * self.positiveLogConditionals)),
                (np.log(self.priorNegative) + sum(sentence * self.negativeLogConditionals)))
 
    # Have the model try predicting if a review is positive or negative
    def predict(self, examples):
        totalReviewCount, _ = examples.shape
        conf_list = []

        predictions = np.array([0 for _ in range(totalReviewCount)])

        for index, sentence in enumerate(examples):
            logProbabilityPositive, logProbabilityNegative = self.computeLogPosteriors(
                sentence)
            conf_list.append([np.exp(logProbabilityPositive), np.exp(logProbabilityNegative)])
            predictions[index] = 1 if logProbabilityPositive > logProbabilityNegative else 0

        return conf_list, predictions
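The dirichlet parameter in computeConditionProbabilities is a smoothing term. A small sketch with made-up counts (4 positive reviews and a 3-word vocabulary, both assumptions for illustration) shows why it matters: without it, a word never seen in positive reviews gets probability 0, and its log is minus infinity.

```python
import numpy as np

# Assumed toy numbers: 4 positive reviews and a 3-word vocabulary.
reviewsWithWord = np.array([3, 0, 1])  # positive reviews containing each word
positiveReviewCount = 4
dirichlet = 1                          # smoothing strength

unsmoothed = reviewsWithWord / positiveReviewCount
smoothed = (reviewsWithWord + dirichlet) / (positiveReviewCount + 2 * dirichlet)

print(unsmoothed)        # the middle word has probability 0
print(smoothed)          # every word keeps a small non-zero probability
print(np.log(smoothed))  # all finite, so they are safe to add up
```

With smoothing, a single unseen word lowers a class's score instead of ruling the class out entirely.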

Create an instance of the model and fit it with our training data and the corresponding labels.

nbClassifier = NaiveBayesClassifier()
nbClassifier.fit(training, trainingLabels)

def accuracy(predictions, actual):
    return sum((predictions == actual)) / len(actual)

Let's take our model for a spin on both the training set and the testing set. You may notice a discrepancy in accuracy between training and testing. Why is that?

train_confidence_scores, trainingPredictions = nbClassifier.predict(training)
test_confidence_scores, testingPredictions = nbClassifier.predict(testing)

print("Training accuracy:", accuracy(trainingPredictions, trainingLabels))
print("Testing accuracy:", accuracy(testingPredictions, testingLabels))

Training accuracy: 0.9519038076152304
Testing accuracy: 0.7947686116700201

Visualizing Results

Here's another way to visualize our results: a confusion matrix.

data = {'Actual':    testingLabels,
        'Predicted': testingPredictions
        }

df = pd.DataFrame(data, columns=['Actual','Predicted'])
confusion_matrix = pd.crosstab(df['Actual'], df['Predicted'], rownames=['Actual'], colnames=['Predicted'])

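ax = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="YlGnBu")  # fmt="d" prints whole-number counts
ax.set_ylim(2.0, 0)  # Keeps the top and bottom rows from being cut off in some matplotlib versions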

plt.title('Confusion Matrix of Testing')
plt.show()
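If you'd like to see where the numbers in the heatmap come from, here's a sketch that counts the four cells by hand. The labels below are made up for illustration; they are not the notebook's results.

```python
import numpy as np

# Toy labels (assumed for illustration): 1 = positive, 0 = negative.
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are the actual class, columns are the predicted class.
matrix = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1

print(matrix)
print("accuracy:", np.trace(matrix) / matrix.sum())
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total number of examples.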

A Closer Look

Let's look at the overall results for our model. In particular, we can look at its precision and recall when predicting negative and positive sentiment.

from sklearn.metrics import classification_report

target_names = ['negative', 'positive']
print(classification_report(testingLabels, testingPredictions, target_names=target_names))
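For reference, precision and recall for one class can be computed by hand as in the sketch below. precisionRecall is a hypothetical helper and the labels are toy values, not the notebook's results.

```python
import numpy as np

def precisionRecall(actual, predicted, label):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    truePositives = np.sum((predicted == label) & (actual == label))
    predictedCount = np.sum(predicted == label)  # everything we labeled as `label`
    actualCount = np.sum(actual == label)        # everything that truly is `label`
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = truePositives / predictedCount if predictedCount else 0.0
    recall = truePositives / actualCount if actualCount else 0.0
    return float(precision), float(recall)

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print("positive:", precisionRecall(actual, predicted, 1))
print("negative:", precisionRecall(actual, predicted, 0))
```

Precision asks "of the reviews we called positive, how many really were?", while recall asks "of the reviews that really were positive, how many did we find?".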

Now we want to build an interactive confusion matrix so we can see exactly which reviews are classified correctly and which are misclassified, along with the confidence the model assigned to each prediction.
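One caveat about those confidences: the values predict stores are exponentiated log scores, so the pair for a review does not sum to 1 and can be extremely small for long reviews. If a probability-like confidence is wanted, one option (a sketch under that assumption, not what the notebook does) is to normalize the two log scores. normalizeLogScores is a hypothetical helper.

```python
import numpy as np

def normalizeLogScores(logPositive, logNegative):
    logs = np.array([logPositive, logNegative], dtype=float)
    # Subtract the larger log score before exponentiating to avoid underflow.
    shifted = np.exp(logs - logs.max())
    return shifted / shifted.sum()

# Made-up log scores, chosen small enough that np.exp alone would underflow to 0.
probabilities = normalizeLogScores(-1050.0, -1052.0)
print(probabilities)
```

The result is a pair [P(positive), P(negative)] that sums to 1, and only the difference between the two log scores affects it.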

# Dump the model results to a JSON file for later use

import json

output_filename = "predict.json"
data = []
for i in range(len(testingPredictions)):
    data.append({
        'index': i,
        'true_label': int(testingLabels[i]),
        'predicted_label': int(testingPredictions[i]),
        'confidence_score': test_confidence_scores[i],
        'text': testingSentences[i]
    })

with open(output_filename, 'w') as outfile:
    json.dump(data, outfile, indent=4, sort_keys=False)
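Once the results are on disk, they are easy to load back and filter. The sketch below uses two made-up records in the same shape as the entries written above (the labels, scores, and the file name predict_example.json are all assumptions for illustration) and pulls out the misclassified ones.

```python
import json
import os
import tempfile

# Toy records in the same shape as the entries written above (values are made up).
records = [
    {"index": 0, "true_label": 0, "predicted_label": 0,
     "confidence_score": [1e-9, 3e-8], "text": "Crust is not good."},
    {"index": 1, "true_label": 0, "predicted_label": 1,
     "confidence_score": [2e-7, 1e-7], "text": "Would not go back."},
]

path = os.path.join(tempfile.mkdtemp(), "predict_example.json")
with open(path, "w") as outfile:
    json.dump(records, outfile, indent=4)

with open(path) as infile:
    loaded = json.load(infile)

# Keep only the reviews where the prediction disagrees with the true label.
misclassified = [r for r in loaded if r["true_label"] != r["predicted_label"]]
for r in misclassified:
    print(r["index"], r["text"], r["confidence_score"])
```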

from IPython.display import display, HTML
from string import Template


json_filepath = "\"" + output_filename + "\""
HTML('<script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>')