Simple sentiment analysis: design of experiment

Averrous Saloom
4 min read · Sep 6, 2023



Sentiment analysis is a deep topic. I read a couple of pages of the first chapter of Bing Liu’s book, which defines an opinion as a quintuple: the target entity, the aspect of that entity being evaluated, the sentiment, the opinion holder, and the time the opinion was expressed.

Off-topic for a moment! Defining an abstract concept as a tuple of critical components is a great idea. I have seen something like this in algorithm definitions (greedy algorithms, for example), but seeing components compose an abstract linguistic concept is new to me.

The ultimate goal of sentiment analysis is to fill in this tuple. We can interchange the target feature: we can describe the sentiment of a posted opinion, predict the time of the argument from the sentiment and the holder, or even predict the opinion holder from the sentiment and the aspect of the opinion target.

In the end, this problem is a simple classification task. But unlike rows of weather data used to predict whether it will rain tomorrow, text needs to be transformed into something more structured. We need preprocessing and feature extraction before we run our classification algorithms.

App Goal

While it is tempting to do something more astonishing, like labeling emotions, I want to start with a smaller task. Predicting the star rating a customer gave from their review text has a smaller target class set than labeling emotions. I hope my estimation is correct this time; it has been wrong for most of my life.

So, an input field will be provided, and the user can enter their text. The frontend app will hit the inference endpoint every time the text changes, which constrains the upper bound on inference time to under 200 ms.

Design of Experiment

A standard scheme to do this is presented in this diagram:

[Diagram from the “IF4072 Natural Language Processing” slides]

Let’s plan every process here.

Data Sources

I want this simple sentiment analysis app to handle both Indonesian and English input, so I need a data set for each language.

English data set:
https://www.kaggle.com/datasets/bittlingmayer/amazonreviews

The data set uses the fastText format, which I have never used before.
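Each line in that format is a label tag followed by the raw review text; in this data set, __label__1 marks 1- and 2-star reviews and __label__2 marks 4- and 5-star reviews. A minimal parsing sketch, assuming the archive has been extracted to a file named train.ft.txt (treat the file name and the sample line in the comment as assumptions on my part):

```python
# Read fastText-format reviews: each line looks roughly like
#   __label__2 Great value for the price: this drive has worked fine so far...
def read_fasttext(path):
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.partition(" ")  # split at the first space
            labels.append(label.replace("__label__", ""))
            texts.append(text.strip())
    return labels, texts

labels, texts = read_fasttext("train.ft.txt")
```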

Indonesian data set:
https://www.kaggle.com/datasets/grikomsn/lazada-indonesian-reviews

This Lazada data source covers only a limited set of review categories:

beli-harddisk-eksternal
beli-laptop
beli-smart-tv
jual-flash-drives
shop-televisi-digital

— edit on 8 September ‘23

A massive issue with this Indonesian data set is that neither NLTK nor spaCy provides an advanced pre-trained model for the language. Due to time constraints, I dropped the dual-language specification and will support English only.

Preprocessing

Let’s assume I have cleaned the data set above and turned it into a simple data frame consisting of the review text and the rating, so I can focus on preprocessing the text.

First, tokenization. It is a simple string-splitting procedure. Also, the data set comes from reviews, in which I assume no weird unrelated tokens like “@george1232112” exist. Removing stop words is probably a good idea.
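A minimal sketch of this step with NLTK (the sample review is made up):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer model, fetched once
nltk.download("stopwords")  # stop word lists, fetched once

review = "This hard drive failed after two weeks and I would not buy it again."
tokens = word_tokenize(review.lower())
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)
```

One thing to verify in the experiment: NLTK’s English stop word list includes negations like “not”, which may carry exactly the sentiment signal we want to keep.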

Second, we want to extract meaning rather than do simple information retrieval, so instead of stemming, we will do lemmatization. Some concerns we need to take care of:

  • slang terms: how the library processes them and, if the library does not lemmatize them, how we will handle them

Experimenting is the right way to settle this. There are a lot of choices to play with: finding the most frequent slang words, rewriting them, and removing the rest; removing all slang words altogether; or mapping every slang word to the same placeholder token. A minimal sketch of the first option follows.
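Here is that sketch, using NLTK’s WordNetLemmatizer plus a hand-made slang dictionary (the dictionary entries are made-up illustrations):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lemma database, fetched once

SLANG = {"luv": "love", "gr8": "great"}  # hypothetical hand-made slang table

lemmatizer = WordNetLemmatizer()
tokens = ["luv", "drives", "gr8", "prices"]
normalized = [SLANG.get(t, t) for t in tokens]          # rewrite known slang
lemmas = [lemmatizer.lemmatize(t) for t in normalized]  # noun lemmas by default
print(lemmas)  # ['love', 'drive', 'great', 'price']
```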

Feature Extraction

  • Bag of words
  • SVD
  • word2vec

After preprocessing the reviews, we move to a more mathematical space: feature extraction. For those who don’t know, feature extraction turns the set of words in each document into a row of numeric features.

In a supervised learning setup, the target feature is the rating, the associated emotion, or any other valid label. Then, commonly, there is one feature per unique token: if there are 11 unique tokens in your documents, there are 11 features; if there are 11,111, there are 11,111 features. We call these feature rows a bag of words.

The value of each column is an experiment-wise decision. We can use raw counts, or TF-IDF (term frequency-inverse document frequency) weights.
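A minimal sketch of both choices with scikit-learn, on made-up toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "great laptop great price",
    "terrible laptop terrible battery",
]

counts = CountVectorizer().fit_transform(docs)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF weights

print(counts.shape)  # (2 documents, 5 unique tokens)
print(tfidf.toarray())
```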

SVD reduces the dimensionality of the bag of words. Training with 11 features is fine, but when 11,111 unique tokens arrive, the curse of dimensionality follows.
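A sketch with scikit-learn’s TruncatedSVD, which works directly on the sparse bag-of-words matrix; the number of components here is a toy placeholder, and a real corpus would keep a few hundred:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great laptop great price",
    "terrible laptop terrible battery",
    "decent tv decent price",
]

tfidf = TfidfVectorizer().fit_transform(docs)       # (3 docs, 7 tokens)
svd = TruncatedSVD(n_components=2, random_state=0)  # 7 columns down to 2
reduced = svd.fit_transform(tfidf)
print(reduced.shape)  # (3, 2)
```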

Classifications

  • Multinomial Naive Bayes
  • Nearest Neighbor
  • Etc.
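A sketch of the first candidate with scikit-learn, on made-up training data where the labels stand in for star ratings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great laptop", "terrible battery", "great price", "terrible screen"]
train_stars = [5, 1, 5, 1]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_stars)
print(model.predict(["great battery"]))  # [5]
```

One interaction to keep in mind: Multinomial Naive Bayes requires non-negative features, and SVD output contains negative values, so the SVD variant would have to pair with Nearest Neighbor or another classifier.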

Metrics

  • F1
  • Accuracy
  • Precision
  • Recall
  • ROC
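All of these are available in scikit-learn. A sketch with placeholder binary predictions; for the multi-class star-rating case, precision, recall, and F1 need an averaging mode such as average="macro":

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0]             # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0]             # placeholder model predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.3]  # predicted probabilities, for ROC AUC

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_score))
```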
