
NLP Sarcasm Detection

Intro

The headline you see above comes from a satirical publication called The Onion. It's an example of the kind of sarcastic headline we will encounter in our quest to categorize whether or not a headline is sarcastic. The goal of this blog is to walk through fundamental techniques that will aid in tackling the vast majority of natural language processing (NLP) problems, be they multiclass classification, regression, or even clustering (the data transformations will still be relevant). Since the field of NLP is wide, we won't be able to cover every single technique or variant, but after reading this post, you should have a good idea of how to approach the problem and which algorithms exist for you to explore.

To start with, our data is approximately 29k rows of headlines, each labeled according to whether or not the headline is sarcastic. The class split between sarcastic and not-sarcastic is approximately 44%-56%, so it is balanced enough for us to justify using accuracy as a proxy for model performance. Head over to Kaggle to download the data, or grab both the notebook and the data from my GitHub.

(h/t to Michael Burnam-Fink)

Pre-processing

Let's pre-process this data and load it into memory before we dive into modeling. Below, we reorganize the data (originally in JSON format) into a 2D numpy array with two columns: the headline text, and an indicator telling us whether the headline is sarcastic (a string '0'/'1').

import json
import numpy as np

fp = 'Sarcasm_Headlines_Dataset.json'

def load_data(fp):
    # Return a 2D numpy array: first column is the headline,
    # second column is the 0/1 indicator for is_sarcastic
    with open(fp, 'r') as f:
        data = f.readlines()
        data = [json.loads(line) for line in data]
        return np.array([[row['headline'], row['is_sarcastic']] for row in data])

data = load_data(fp)
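As a quick sanity check (a hypothetical snippet, not part of the original notebook), we can confirm the array's shape and peek at one row before moving on:

# Hypothetical sanity check on the loaded array
print(data.shape)   # (number_of_headlines, 2)
print(data[0])      # [headline text, '0' or '1' sarcasm label]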

Pipeline and Modeling

In NLP, there are a few fundamental steps we need to take to transform raw text into a machine-parseable format for modeling. Let's start by thinking about the simplest way we can represent a sentence to a computer. If we assign each word a numerical index and simply count the occurrence of words in a sentence (in our case, in a single headline), then we can use those counts as input to see if there's a pattern in the count distribution of certain words between sarcastic and not-sarcastic headlines. This is what sklearn's CountVectorizer does. For example:

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
X = count.fit_transform(['One sheep, two sheep.', 'Three sheep, or more!'])
print(X)
print(count.vocabulary_)
print(X.toarray())

#  (0, 5)	1
#  (0, 3)	2
#  (0, 1)	1
#  (1, 0)	1
#  (1, 2)	1
#  (1, 4)	1
#  (1, 3)	1
#
#{'one': 1, 'sheep': 3, 'two': 5, 'three': 4, 'or': 2, 'more': 0}
#
#[[0 1 0 2 0 1]
# [1 0 1 1 1 0]]

You can see that count vectorizing the two sample sentences above yields a matrix. Printing the string representation of our fit_transformed text yields a set of tuples, each with a count next to it. Think of each tuple as a (row, column) index into the matrix, with the corresponding count filling that location. The rows of the matrix represent individual samples (headline 1, headline 2, etc.). The columns represent the unique set of words that appear across all of the samples we fit to the count vectorizer. When we print out the actual matrix, we see that each row is filled with the respective word counts for that sample. For example, the first sentence has 1 count for the word 'one', 2 counts for 'sheep', and 1 count for 'two', which turns out to be [0 1 0 2 0 1] in row-matrix representation (take a look at the dictionary to see the column-to-vocabulary mapping).

Notice that sklearn's implementation automatically lowercases everything and omits punctuation marks (a detail we have to take care of ourselves in a different modeling approach later). Furthermore, we can set an option to omit English stopwords. Stopwords are common words like 'is', 'the', 'a', etc., that carry extremely little information; in other words, they don't help us with our task. The last thing to note is that we can use the raw counts directly, or normalize them based on how frequently words appear across documents.
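For instance, dropping English stopwords is a single argument to the vectorizer. A minimal sketch reusing the sheep sentences from above (not part of the original post):

# Same two sentences, this time filtered through sklearn's built-in English stopword list
count_sw = CountVectorizer(stop_words='english')
X_sw = count_sw.fit_transform(['One sheep, two sheep.', 'Three sheep, or more!'])
print(count_sw.vocabulary_)  # stopwords such as 'or' and 'more' no longer appear in the vocabulary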

We could stop there and simply use those counts to model, but I want to introduce a slightly more interesting transformation: TF-IDF, which stands for term frequency-inverse document frequency. The idea is that if a word appears very frequently within a document (here, a headline), it might be an important term we should pay attention to; but if that word also appears across lots of different documents, it is probably not useful for discriminating between documents. TF-IDF is thus a weighting factor. There are various implementations and weighting schemes for both the TF and IDF portions; check here to learn more.
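To make the weighting concrete, here is a small sketch (not in the original post) that runs sklearn's TfidfTransformer over the count matrix X from the sheep example; since 'sheep' appears in both documents, it receives the lowest IDF weight:

from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the count matrix X produced by the CountVectorizer example above
tfidf = TfidfTransformer(use_idf=True)
X_tfidf = tfidf.fit_transform(X)

print(tfidf.idf_)                   # per-column IDF weights; 'sheep' (in both docs) gets the lowest
print(X_tfidf.toarray().round(2))   # rows are L2-normalized TF-IDF vectors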

Finally, we can put these two steps together into an sklearn pipeline to create our baseline model using Bernoulli naive Bayes. Naive Bayes is so called because it naively assumes independence between our features (here, the features are words). Despite that assumption, we'll see that it does a really great job in this circumstance:

Note: We could have combined the count vectorizer and TF-IDF transformer calls into a single step using TfidfVectorizer from sklearn. However, I wanted to emphasize the relationship between these two concepts, so I chose to separate them.
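For reference, that combined step would look something like the sketch below; it is equivalent to chaining CountVectorizer and TfidfTransformer:

from sklearn.feature_extraction.text import TfidfVectorizer

# One-step alternative to CountVectorizer followed by TfidfTransformer
tfidf_vect = TfidfVectorizer(stop_words='english', use_idf=True)
X_combined = tfidf_vect.fit_transform(['One sheep, two sheep.', 'Three sheep, or more!'])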

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB

class bernoulli_model():
    def __init__(self, data):
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(data[:,0], data[:,1], test_size=0.30, random_state=20)
        self.sarcasm_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                                ('tfidf', TfidfTransformer(use_idf=True)),
                                ('clf', BernoulliNB())])

    def model_results(self):
        sarcasm_clf = self.sarcasm_clf.fit(self.X_train, self.y_train)
        predictions = sarcasm_clf.predict(self.X_test)
        print('Accuracy is:', np.mean(predictions == self.y_test))
        print('Positive class ratio is:', np.mean(self.y_test == '1'))

    def model_GS(self):
        parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1,3)],
                        'tfidf__use_idf': (True, False),
                        'clf__alpha': (1.0, .1, .01, )}

        gs_clf = GridSearchCV(self.sarcasm_clf, parameters, n_jobs=-1, cv=5)
        gs_clf = gs_clf.fit(self.X_train, self.y_train)
        print('GS best score:', gs_clf.best_score_)
        print('GS best params:', gs_clf.best_params_)

m1_bernoulli = bernoulli_model(data)
m1_bernoulli.model_results()
m1_bernoulli.model_GS()

# Accuracy is: 0.7984525146636715
# Positive class ratio is: 0.44128291526269814
# GS best score: 0.7968014548566539
# GS best params: {'clf__alpha': 1.0, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
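As a reference point for these accuracy numbers, here is the majority-class baseline computed directly from the test labels (a quick sketch, not in the original notebook):

# Majority-class baseline: accuracy from always predicting the more common label
y_test = m1_bernoulli.y_test
baseline_acc = max(np.mean(y_test == '1'), np.mean(y_test == '0'))
print('Majority-class baseline accuracy:', baseline_acc)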

What I've done here is split the dataset into a 70%-30% train-test split. Furthermore, I've used the training set in the latter portion of the code to perform a grid search, tuning the hyperparameters of our Bernoulli model (with that 70% being further split to do 5-fold cross-validation within the grid search). With the optimal parameters (pretty much the default options), we receive an accuracy score of 79.8% on our test set. Since our data has a 44%-56% class split, if we just guessed the negative class all of the time, we'd get an accuracy of 56%, so our model is a vast improvement. How does that compare to a logistic regression model?

from sklearn.linear_model import LogisticRegression

class logistic():
    def __init__(self, data):
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(data[:,0], data[:,1], test_size=0.30, random_state=20)
        self.sarcasm_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                                ('tfidf', TfidfTransformer(use_idf=True)),
                                ('clf', LogisticRegression(solver='lbfgs'))])

    def model_results(self):
        sarcasm_clf = self.sarcasm_clf.fit(self.X_train, self.y_train)
        predictions = sarcasm_clf.predict(self.X_test)
        print('Accuracy is:', np.mean(predictions == self.y_test))
        print('Positive class ratio is:', np.mean(self.y_test == '1'))

m2_logistic = logistic(data)
m2_logistic.model_results()

# Accuracy is: 0.7855984025957818
# Positive class ratio is: 0.44128291526269814

We put our data through the same transformation pipeline to feed into a logistic regression model, yielding a 78.6% accuracy rate. Both Bernoulli naive Bayes and logistic regression are great baseline models because they are easy and quick to implement. Since we are working with language, the recurrent neural network (RNN) architecture naturally comes to mind as a deep learning model that can capture dependencies in sentence structure that models like naive Bayes and logistic regression do not. There are two particular architectures we will play around with that address two pernicious problems of RNNs: vanishing and exploding gradients. Both long short-term memory (LSTM) and gated recurrent unit (GRU) networks are well suited to sequential data like text or time series because they can capture long-term dependencies while mitigating the vanishing/exploding gradient problem.

(sourced from https://isaacchanghau.github.io/post/lstm-gru-formula/)

LSTM Model

To implement an LSTM model, we will use Keras, a high-level deep learning library built on top of TensorFlow. The first step, like before, is transforming text into a machine-parseable format. Many of the aforementioned strategies are relevant: lowercasing, removing punctuation, removing stopwords, and sometimes even stemming and lemmatization. I've chosen to omit stemming because we will later use pre-trained embeddings, which do not require it (the embedding corpus already contains grammatical variants of the same word).
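For context on what we're skipping: stemming and lemmatization collapse grammatical variants of a word into a common form. A quick illustration with NLTK (not part of the modeling code below; the lemmatizer requires the wordnet data, e.g. nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically; lemmatization maps to a dictionary form
print(stemmer.stem('running'), stemmer.stem('studies'))                           # run studi
print(lemmatizer.lemmatize('running', pos='v'), lemmatizer.lemmatize('studies'))  # run study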

Unlike before, instead of transforming our text into word counts, we will maintain the ordering of the words and turn each word into a feature vector. What is a feature vector, you ask? Imagine a system in which each word has two features associated with it: a 'royalty' measure and a 'gender' measure. A word like 'man' might have a royalty measure of 0.2 and a gender measure of -1, so its 2D feature vector can be represented as [0.2, -1]. A word like 'woman' might have a royalty measure of 0.2 and a gender measure of 1, represented as the vector [0.2, 1]. When we compare these words, the 2D vector representation tells us that they differ along one of these dimensions (here, the gender dimension). In contrast, a word like 'queen' might have a vector representation of [0.9, 0.9], which, compared with 'woman', differs primarily in the 'royalty' dimension. You can see the advantages of learning such a feature embedding. How about instead of two features, we use 50? Or 100? Or 300? What is optimal? How about instead of 'royalty' and 'gender', we pick other features, or better (as it is done in practice), we let an algorithm choose these latent features (at the cost of no longer having clear insight into the meaning of each feature dimension)? In our first deep model, the matrix that maps each word into a 50-dimensional feature space is called the embedding matrix. Here we will learn it from scratch.
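As a toy illustration of comparing such vectors, using the made-up 2D 'royalty'/'gender' values from the paragraph above (these are not real embeddings):

# Toy 2D 'embeddings': [royalty, gender] values from the example above
man, woman, queen = np.array([0.2, -1.0]), np.array([0.2, 1.0]), np.array([0.9, 0.9])

def cosine_sim(a, b):
    # Cosine similarity: close to 1 means the vectors point the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print('man vs woman  :', round(cosine_sim(man, woman), 2))    # far apart: they differ on the gender axis
print('woman vs queen:', round(cosine_sim(woman, queen), 2))  # closer: they differ mainly on royalty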

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # requires nltk data: 'punkt' and 'stopwords'
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

class LSTM_model():
    def __init__(self, data):
        self.X_train, self.X_test, self.y_train, self.y_test = \
                    train_test_split(data[:,0], data[:,1], test_size=0.30, random_state=20)
        self.y_train, self.y_test = self.y_train.astype(int), self.y_test.astype(int)
        self.model = None

    def run_model(self):
        def clean_sen(sen):
            tokens = word_tokenize(sen)
            tokens = [w.lower() for w in tokens]

            table = str.maketrans('', '', string.punctuation)
            stripped = [w.translate(table) for w in tokens]
            words = [word for word in stripped if word.isalpha()]

            stop_words = set(stopwords.words('english'))
            words = [w for w in words if not w in stop_words]

            return ' '.join(words)

        self.X_train = list(map(lambda x: clean_sen(x), self.X_train))
        self.X_test = list(map(lambda x: clean_sen(x), self.X_test))

        all_data = self.X_train + self.X_test
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(all_data)

        # Max num of words in headline
        max_len = max([len(s.split()) for s in all_data])
        vocab_size = len(tokenizer.word_index) + 1

        X_train_tokens = tokenizer.texts_to_sequences(self.X_train)
        X_test_tokens = tokenizer.texts_to_sequences(self.X_test)

        X_train_pad = pad_sequences(X_train_tokens, maxlen=max_len)
        X_test_pad = pad_sequences(X_test_tokens, maxlen=max_len)

        EMBEDDING_DIMS = 50

        self.model = Sequential()
        self.model.add(Embedding(vocab_size, EMBEDDING_DIMS, input_length=max_len))
        self.model.add(LSTM(units=20, dropout=0.2, recurrent_dropout=0.2))
        self.model.add(Dense(1, activation='sigmoid'))
        self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.model.fit(X_train_pad, self.y_train, batch_size=128, epochs=3, validation_split=0.2)
        print("Evaluating model against test set:")
        print(self.model.evaluate(X_test_pad, self.y_test))
        print(self.model.metrics_names)

m3_LSTM = LSTM_model(data)
m3_LSTM.run_model()

# Evaluating model against test set:
# 8013/8013 [==============================] - 4s 457us/step
# [0.47915095404770497, 0.8003244727466878]
# ['loss', 'acc']

One note: because the text sequences vary in length, I've chosen to pad every sequence to the maximum length that appears in the entire dataset. Sometimes this is a good choice; other times we need to set a maximum length and truncate, otherwise our input sequences would be too long. In this network, I've created a 20-unit LSTM layer on top of an embedding layer (the number of units is a hyperparameter we can tune later). Notice that the LSTM has dropout options set to 20%, which helps regularize the weights of the network (also a hyperparameter). I've run 3 epochs to prevent overfitting. Evaluating this network against the test set, the performance is marginally better than the previous models at 80% accuracy.
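To see what the padding step does, here is a tiny sketch with made-up token-id sequences; Keras's pad_sequences pads on the left with zeros by default:

# Toy token-id sequences of different lengths, padded to a common length of 5
toy_sequences = [[4, 7], [3, 9, 12, 5]]
print(pad_sequences(toy_sequences, maxlen=5))
# [[ 0  0  0  4  7]
#  [ 0  3  9 12  5]]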

Stacked GRU with Pre-trained Embedding Model

Previously we asked about varying the number of features in our embedding matrix: why not 100, or 300? Does it matter if the word embeddings are trained on Wikipedia data versus Twitter data? Does it matter if the embedding was trained with a GloVe or a Word2Vec implementation? Trial and error. Sometimes the gains across these variations will be marginal, sometimes they might be more drastic for your particular task, and sometimes you don't care to waste time loading such huge embedding vectors into RAM. I'll leave this up to the reader to experiment with. Here, I've chosen to use the 'glove-wiki-gigaword-100' embedding. The network is composed of two stacked CuDNN-accelerated GRU layers, which run substantially faster on a GPU than the standard GRU layer.
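If you do want to experiment with other pre-trained embeddings, gensim's downloader exposes them by name. A small sketch of loading and inspecting one (the alternatives in the comment are other models available in gensim-data):

import gensim.downloader as api

# Alternatives include e.g. 'glove-twitter-100' or 'word2vec-google-news-300'
wv = api.load('glove-wiki-gigaword-100')

print(wv['sheep'].shape)                 # each word maps to a 100-dimensional vector
print(wv.most_similar('sheep', topn=3))  # nearest neighbors in the embedding space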

import gensim.downloader as api
from keras.initializers import Constant
from keras.layers import CuDNNGRU

class GRU_model():
    def __init__(self, data):
        self.X_train, self.X_test, self.y_train, self.y_test = \
                    train_test_split(data[:,0], data[:,1], test_size=0.30, random_state=20)
        self.y_train, self.y_test = self.y_train.astype(int), self.y_test.astype(int)
        self.model = None

    def run_model(self):
        def clean_sen(sen):
            tokens = word_tokenize(sen)
            tokens = [w.lower() for w in tokens]

            table = str.maketrans('', '', string.punctuation)
            stripped = [w.translate(table) for w in tokens]
            words = [word for word in stripped if word.isalpha()]

            stop_words = set(stopwords.words('english'))
            words = [w for w in words if not w in stop_words]

            return ' '.join(words)

        self.X_train = list(map(lambda x: clean_sen(x), self.X_train))
        self.X_test = list(map(lambda x: clean_sen(x), self.X_test))

        all_data = self.X_train + self.X_test
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(all_data)

        # Max num of words in headline
        max_len = max([len(s.split()) for s in all_data])
        vocab_size = len(tokenizer.word_index) + 1

        X_train_tokens = tokenizer.texts_to_sequences(self.X_train)
        X_test_tokens = tokenizer.texts_to_sequences(self.X_test)

        X_train_pad = pad_sequences(X_train_tokens, maxlen=max_len)
        X_test_pad = pad_sequences(X_test_tokens, maxlen=max_len)

        pretrained_embedding = api.load("glove-wiki-gigaword-100")
        word_index = tokenizer.word_index
        num_words = len(word_index) + 1

        embedding_matrix = np.zeros((num_words, 100))

        for word, index in word_index.items():
            if word not in pretrained_embedding:
                continue
            embedding_vector = pretrained_embedding[word]
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector

        # create model
        self.model = Sequential()
        embedding_layer = Embedding(num_words,
                                    100, 
                                    embeddings_initializer=Constant(embedding_matrix),
                                    input_length=max_len,
                                    trainable=True)

        self.model.add(embedding_layer)
        self.model.add(CuDNNGRU(units=20, return_sequences=True))
        self.model.add(CuDNNGRU(units=20))
        self.model.add(Dense(1, activation='sigmoid'))

        self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.model.fit(X_train_pad, self.y_train, validation_split=0.2, epochs=3)
        print("Evaluating model against test set:")
        print(self.model.evaluate(X_test_pad, self.y_test))
        print(self.model.metrics_names)

m4_GRU = GRU_model(data)
m4_GRU.run_model()

# Evaluating model against test set:
# 8013/8013 [==============================] - 1s 103us/step
# [0.47489634850610435, 0.8109322351551258]
# ['loss', 'acc']

Aha! We've received the highest accuracy yet at 81.1%, better than any previous model! WOOOWEEE, that's a whopping 1.3% increase over our baseline naive Bayes model! Worthwhile? Hmmm....

There are many other variations I will leave to the reader to experiment with, like bidirectional RNNs. The highest-accuracy model might not be the best model; depending on your use case, inference speed may matter more, especially in large-scale production scenarios. If you have any questions or comments, email me! Look out for a future post on NLP clustering and recommenders!