Text Classification
Chapter 1 - learn about using a naive Bayes classifier to label text by sentiment (positive or negative).
Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in natural language processing.
The text we want to classify is given as input to an algorithm. The algorithm analyzes the text's content and then categorizes the input under one of the tags or categories it was given.
Input → Classifying Algorithm → Classification of Input
Real-life examples:
- Sentiment analysis: how does the writer of the sentence feel about what they are writing about? Do they think positively or negatively of the subject? Ex. restaurant reviews.
- Topic labeling: given sentences and a set of topics, which topic does a sentence fall under? Ex. is this essay about history? Math?
- Spam detection: is this message legitimate or spam? Ex. email filtering.
Example: A restaurant wants to evaluate its ratings but doesn't want to read through all of them, so it wants a computer algorithm to do the work. It simply wants to know whether each customer's review is positive or negative.
Here’s an example of a customer’s review and a simple way an algorithm could classify their review.
Input: “The food here was too salty and too expensive”
Algorithm: Goes through every word in the sentence and counts how many positive words and how many negative words are in the sentence.
"The", "food", "here", "was", "too", and "and" are all neutral words.
"Salty" and "expensive" are negative words.
Negative words: 2
Positive words: 0
Classification: Negative Review, because there are more negative words (2) than positive (0).
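Here's what that counting approach might look like in code. This is a minimal sketch: the word lists below are made up for illustration, and a real system would need far more extensive ones.
# A tiny, hypothetical lexicon of positive and negative words
positiveWords = {"good", "great", "delicious", "friendly"}
negativeWords = {"salty", "expensive", "bad", "rude"}

def countingClassifier(sentence):
    words = sentence.lower().split()
    positiveCount = sum(1 for word in words if word in positiveWords)
    negativeCount = sum(1 for word in words if word in negativeWords)
    # More negative words than positive -> negative review
    return "Negative" if negativeCount > positiveCount else "Positive"

print(countingClassifier("The food here was too salty and too expensive"))  # Negative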
However, this algorithm obviously doesn’t work in a lot of cases.
For example, “The food here was good, not expensive and not salty” would be classified as negative but it’s actually a positive review.
Language and text can get very complicated, which makes creating these algorithms difficult. Things that make language hard include words with multiple meanings, negation words (such as "not"), slang, and more.
!pip3 install seaborn
!pip3 install plotly --user
!pip3 install scikit-learn
import sys
import string
from scipy import sparse
from pprint import pprint
import pandas as pd
import seaborn as sns
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected = True)
import matplotlib.pyplot as plt
import numpy as np
from html import escape
from IPython.core.display import display, HTML
from string import Template
from sklearn.metrics import classification_report
import json
HTML('<script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>')
# Our two files that contain our data, split up into a training set and a testing set.
trainingFile = "trainingSet.txt"
testingFile = "testSet.txt"
Getting our data
Below is a definition of getData, a basic function that pulls data from trainingSet.txt and testSet.txt. The data we're using for this example is a set of reviews written by users on Yelp, each classified as positive (1) or negative (0).
We open the file, create temporary arrays, and pull from the file line by line.
Open the cell if you'd like to peek into what the function looks like.
def getData(fileName):
    with open(fileName) as f:
        lines = f.readlines()
    sentences = []
    sentiments = []
    for line in lines:
        sentence, sentiment = line.split('\t')
        sentences.append(sentence.strip())
        sentiments.append(int(sentiment.strip()))  # Sentiment is 0 or 1
    return sentences, np.array(sentiments)
trainingSentences, trainingLabels = getData(trainingFile)
testingSentences, testingLabels = getData(testingFile)
Let's take a peek at what this data looks like:
print("Sample sentences:")
pprint(trainingSentences[:10])
print("Corresponding sentiments:")
pprint(list(trainingLabels[:10]))
Next, we preprocess the sentences: make everything lower case, replace punctuation with spaces, and split each sentence into individual words (tokens).
def preProcess(sentences):
    def cleanText(text):
        # Make lower case
        text = text.lower()
        # Replace punctuation characters with spaces
        nonText = string.punctuation
        text = text.translate(str.maketrans(nonText, ' ' * len(nonText)))
        # Split sentences into individual words - tokenize
        words = text.split()
        return words
    return list(map(cleanText, sentences))
trainingTokens = preProcess(trainingSentences)
testingTokens = preProcess(testingSentences)
Let's look at what these tokenized sentences look like now:
print("Training tokens:")
pprint(trainingTokens[:2])
print("Testing tokens:")
pprint(testingTokens[:3])
Next, we build our vocabulary: a sorted list of every unique word that appears in the training sentences.
def getVocab(sentences):
    vocab = set()
    for sentence in sentences:
        for word in sentence:
            vocab.add(word)
    return sorted(vocab)
vocabulary = getVocab(trainingTokens)
We can now peek at our vocabulary, an alphabetically sorted list of words, at an arbitrary range of indices:
pprint(vocabulary[50:70])
We want our token arrays to become proper feature vectors we can feed to our model, which we'll create below as well. This function, createVector, transforms our tokenized sentences into sparse bag-of-words vectors.
def createVector(vocab, sentences):
    indices = []
    wordOccurrences = []
    for sentenceIndex, sentence in enumerate(sentences):
        alreadyCounted = set()  # Keep track of words so we don't double count.
        for word in sentence:
            if (word in vocab) and word not in alreadyCounted:
                # If we just want {0,1} for the presence of the word (Bernoulli NB),
                # only count each word once. Otherwise (multinomial NB) count each
                # occurrence of the word.
                indices.append((sentenceIndex, vocab.index(word)))  # (which sentence, which word)
                wordOccurrences.append(1)
                alreadyCounted.add(word)
    # Unzip the (row, column) pairs into separate lists for the sparse matrix
    rows = [row for row, _ in indices]
    columns = [column for _, column in indices]
    sentenceVectors = sparse.csr_matrix((wordOccurrences, (rows, columns)),
                                        dtype=int, shape=(len(sentences), len(vocab)))
    return sentenceVectors
training = createVector(vocabulary, trainingTokens)
testing = createVector(vocabulary, testingTokens)
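One design note: createVector looks words up with vocab.index(word), which scans the whole vocabulary list for every word. For this small dataset that's fine, but a dictionary from word to index makes the lookup constant-time. A possible variant (the wordIndex name is our own, not part of the code above):
# Hypothetical speed-up: precompute a word -> column index mapping.
wordIndex = {word: i for i, word in enumerate(vocabulary)}
# Inside the loop, `word in wordIndex` and `wordIndex[word]` would then
# replace the linear scans `word in vocab` and `vocab.index(word)`.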
Our training and testing data have now been transformed into sparse vectors. Here's what the training data looks like:
print("Training data:")
print(training[:2])
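Now we're ready to build the classifier itself: a naive Bayes model. The idea, in brief: for a sentence containing words w1, ..., wn, pick the class c (positive or negative) that maximizes P(c) * P(w1|c) * ... * P(wn|c), treating the words as independent given the class. Because multiplying many small probabilities underflows floating point, the code below works with log P(c) + log P(w1|c) + ... + log P(wn|c) instead, and it smooths the conditional probabilities with a Dirichlet (add-one) prior so that unseen words don't produce log(0).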
class NaiveBayesClassifier:
    def __init__(self):
        self.priorPositive = None  # Probability that a review is positive
        self.priorNegative = None  # Probability that a review is negative
        self.positiveLogConditionals = None  # log P(word | positive) for each word
        self.negativeLogConditionals = None  # log P(word | negative) for each word

    def computePriorProbabilities(self, labels):
        self.priorPositive = len([y for y in labels if y == 1]) / len(labels)
        self.priorNegative = 1 - self.priorPositive

    def computeConditionProbabilities(self, examples, labels, dirichlet=1):
        _, vocabularyLength = examples.shape
        # How many of each word there are in all of the positive reviews,
        # starting from the smoothing count `dirichlet`
        positiveCounts = np.array([dirichlet for _ in range(vocabularyLength)])
        # How many of each word there are in all of the negative reviews
        negativeCounts = np.array([dirichlet for _ in range(vocabularyLength)])
        # Here's how to iterate through a sparse array
        coordinates = examples.tocoo()  # Converted to a `coordinate` format
        for exampleIndex, featureIndex, observationCount in zip(coordinates.row, coordinates.col, coordinates.data):
            # For sentence {exampleIndex}, the word at index {featureIndex} occurred {observationCount} times
            if labels[exampleIndex] == 1:
                positiveCounts[featureIndex] += observationCount
            else:
                negativeCounts[featureIndex] += observationCount
        # [!] Make sure to use the logs of the probabilities
        positiveReviewCount = len([y for y in labels if y == 1])
        negativeReviewCount = len([y for y in labels if y == 0])
        # We are using Bernoulli NB (single occurrence of a word), with Laplace
        # smoothing: P(word | class) = (count + dirichlet) / (N_class + 2 * dirichlet)
        self.positiveLogConditionals = np.log(positiveCounts) - np.log(positiveReviewCount + dirichlet * 2)
        self.negativeLogConditionals = np.log(negativeCounts) - np.log(negativeReviewCount + dirichlet * 2)
        # This works for multinomial NB (multiple occurrences of a word)
        # self.positiveLogConditionals = np.log(positiveCounts) - np.log(sum(positiveCounts))
        # self.negativeLogConditionals = np.log(negativeCounts) - np.log(sum(negativeCounts))

    # Calculate all of the parameters for making a naive Bayes classification
    def fit(self, trainingExamples, trainingLabels):
        # Compute the probability of a positive/negative review
        self.computePriorProbabilities(trainingLabels)
        # Compute the (log) conditional probability of each word given each class
        self.computeConditionProbabilities(trainingExamples, trainingLabels)

    def computeLogPosteriors(self, sentence):
        # Log prior plus the sum of log conditionals for the words in the sentence
        return ((np.log(self.priorPositive) + sum(sentence * self.positiveLogConditionals)),
                (np.log(self.priorNegative) + sum(sentence * self.negativeLogConditionals)))

    # Have the model try predicting whether a review is positive or negative
    def predict(self, examples):
        totalReviewCount, _ = examples.shape
        conf_list = []
        predictions = np.array([0 for _ in range(totalReviewCount)])
        for index, sentence in enumerate(examples):
            logProbabilityPositive, logProbabilityNegative = self.computeLogPosteriors(sentence)
            # Note: these exponentiated scores are unnormalized joint probabilities,
            # so they can be extremely small; only their relative size matters.
            conf_list.append([np.exp(logProbabilityPositive), np.exp(logProbabilityNegative)])
            predictions[index] = 1 if logProbabilityPositive > logProbabilityNegative else 0
        return conf_list, predictions
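The note above about using logs matters in practice: multiplying hundreds of small conditional probabilities underflows to zero in floating point, while adding their logarithms stays well-behaved. A quick illustration:
# Multiplying 1000 probabilities of 0.01 underflows float64 to exactly 0.0,
# but the equivalent sum of logs is a perfectly ordinary number.
probs = np.full(1000, 0.01)
print(np.prod(probs))         # 0.0
print(np.sum(np.log(probs)))  # roughly -4605.2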
Initialize an instance of the model and fit it with our training data and corresponding labels.
nbClassifier = NaiveBayesClassifier()
nbClassifier.fit(training, trainingLabels)
def accuracy(predictions, actual):
    return sum(predictions == actual) / len(actual)
Let's take our model for a spin, using both the training set and the testing set. You may notice discrepancies in accuracy between training and testing - why is that?
train_confidence_scores, trainingPredictions = nbClassifier.predict(training)
test_confidence_scores, testingPredictions = nbClassifier.predict(testing)
print("Training accuracy:", accuracy(trainingPredictions, trainingLabels))
print("Testing accuracy:", accuracy(testingPredictions, testingLabels))
data = {'Actual': testingLabels,
        'Predicted': testingPredictions}
df = pd.DataFrame(data, columns=['Actual', 'Predicted'])
confusion_matrix = pd.crosstab(df['Actual'], df['Predicted'], rownames=['Actual'], colnames=['Predicted'])
ax = sns.heatmap(confusion_matrix, annot=True, cmap="YlGnBu")
ax.set_ylim(2.0, 0)
plt.title('Confusion Matrix of Testing')
plt.show()
target_names = ['negative', 'positive']
print(classification_report(testingLabels, testingPredictions, target_names=target_names))
Now we want to make an interactive confusion matrix so we can see precisely which results are accurately classified and which are misclassified, as well as the confidence with which the model classified each result.
# Work with the model results to create a JSON dump of the data for future use
output_filename = "predict.json"
data = []
for i in range(len(testingPredictions)):
    data.append({
        'index': i,
        'true_label': int(testingLabels[i]),
        'predicted_label': int(testingPredictions[i]),
        'confidence_score': test_confidence_scores[i],
        'text': testingSentences[i]
    })
with open(output_filename, 'w') as outfile:
    json.dump(data, outfile, indent=4, sort_keys=False)
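As a quick sanity check (optional, and not part of the pipeline itself), we can read the file back and inspect one record:
with open(output_filename) as f:
    records = json.load(f)
pprint(records[0])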
json_filepath = "\"" + output_filename + "\""  # Quote the filename for embedding in the JavaScript template
HTML('<script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>')