Predicting Positive vs Negative Movie Reviews

This is a step-by-step walkthrough of sentiment analysis on IMDb movie reviews. The goal is to develop a model that accurately predicts whether a review is positive or negative by uncovering the language patterns that drive sentiment. This is a Text Classification task.

The Data

The dataset consists of 50,000 film reviews scraped from IMDb, each labelled as either positive or negative. It can be found on Kaggle.


Data Processing

We cleaned the text by:

  • Removing special characters, URLs, and HTML tags
  • Converting text to lowercase
  • Removing stopwords using NLTK
  • Tokenizing text into unigrams, bigrams, and trigrams
  • Adding START and STOP markers to preserve context

These steps reduced noise and made the linguistic structure suitable for modeling.
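The cleaning steps above can be sketched roughly as follows. This is an illustrative version only: it uses a tiny hardcoded stopword list in place of NLTK's full English list, and the regexes and `START`/`STOP` marker strings are assumptions about the actual implementation.

```python
import re

# Small illustrative stopword list; the project used NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "of", "and", "to", "it"}

def clean(text):
    """Apply the cleaning steps described above to one review."""
    text = re.sub(r"<.*?>", " ", text)        # strip HTML tags
    text = re.sub(r"http\S+", " ", text)      # strip URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # strip special characters
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return ["START"] + tokens + ["STOP"]      # markers to preserve context

def ngrams(tokens, n):
    """Slide a window of size n over the token list (n=1, 2, 3 here)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean("This movie was <br /> absolutely brilliant!")
print(tokens)  # ['START', 'movie', 'absolutely', 'brilliant', 'STOP']
print(ngrams(tokens, 2))
```
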


Exploratory Data Analysis

Before modeling, we visualized word frequency patterns.
Positive reviews frequently used terms like “excellent” and “fun,” while negative reviews often used “worst” and “poor.”
These clear distinctions highlighted the potential of text-based sentiment modeling.
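The frequency comparison behind those observations can be reproduced in miniature with a simple counter; the toy reviews below stand in for the labelled IMDb data.

```python
from collections import Counter

# Toy examples standing in for the labelled IMDb reviews.
positive = ["an excellent fun ride", "excellent acting and fun plot"]
negative = ["the worst film ever", "poor script and worst pacing"]

def top_words(reviews, k=3):
    """Count word occurrences across a set of reviews."""
    counts = Counter(word for review in reviews for word in review.split())
    return counts.most_common(k)

print(top_words(positive))  # 'excellent' and 'fun' dominate
print(top_words(negative))  # 'worst' and 'poor' dominate
```
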

[Figure: word frequency comparison for positive vs negative reviews]


Methodology

To capture relationships in text, we first transform it into embeddings. Embeddings are numerical representations of words; words with similar meanings have similar representations.
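The "similar meaning, similar representation" idea can be made concrete with cosine similarity. The 3-dimensional vectors below are hypothetical for illustration; real GloVe or BERT vectors have hundreds of dimensions learned from large corpora.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for illustration only.
emb = {
    "excellent": np.array([0.9, 0.8, 0.1]),
    "great":     np.array([0.85, 0.75, 0.15]),
    "terrible":  np.array([-0.8, -0.7, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1 for aligned vectors, -1 for opposite ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar sentiment get similar vectors, so similarity is high.
print(cosine(emb["excellent"], emb["great"]))     # close to 1
print(cosine(emb["excellent"], emb["terrible"]))  # negative
```
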

There are different types of text embedding methods, some more complex than others. The schematic below captures the three general categories.

[Figure: schematic of the three general categories of text embedding methods]

We evaluated the following:

  • Bag of Words (BoW)
  • TF-IDF
  • GloVe
  • BERT embeddings

Each embedding was used across several machine learning models: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, a Deep Neural Network (DNN), and a fine-tuned BERT Transformer model.
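The two count-based encodings differ only in weighting, which scikit-learn makes easy to compare side by side. A minimal sketch on toy reviews (the real features were built from the cleaned 50,000-review corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["excellent fun movie", "worst movie ever", "fun excellent acting"]

# Bag of Words: raw term counts per review.
bow = CountVectorizer()
X_bow = bow.fit_transform(reviews)

# TF-IDF: counts reweighted so terms common to every review count for less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews)

print(sorted(bow.vocabulary_))      # shared vocabulary across both encodings
print(X_bow.toarray())              # integer counts
print(X_tfidf.toarray().round(2))   # reweighted, L2-normalized values
```
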


Logistic Regression

The TF-IDF Logistic Regression model performed best, achieving 89.83% accuracy and AUC = 0.96, followed closely by BoW (89.4% accuracy).
Regularization experiments showed that L2 penalties (Ridge regression) were consistently optimal, suggesting that allowing all features to contribute led to more accurate predictions.
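A minimal sketch of the winning combination, TF-IDF features feeding an L2-regularized logistic regression, using toy labelled reviews in place of the real training set (the `C=1.0` strength is an assumption, not the tuned value):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled reviews; the project trained on the 50,000 IMDb reviews.
reviews = ["excellent fun film", "truly excellent acting",
           "worst film ever", "poor and dull plot"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# L2 (ridge) regularization shrinks all coefficients without zeroing any out,
# matching the finding that letting every feature contribute helped accuracy.
model = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(penalty="l2", C=1.0))
model.fit(reviews, labels)

print(model.predict(["an excellent film"]))  # expect positive (1)
```
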


K-Nearest Neighbors

KNN models were optimized through grid search for distance metrics and weighting schemes.
The best-performing configuration used TF-IDF embeddings with distance-based weighting, achieving 79.92% accuracy and AUC = 0.88.
Performance decreased significantly for GloVe embeddings (54.66% accuracy), indicating poorer feature representation.
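The grid search over distance metrics and weighting schemes might look roughly like this sketch; the candidate values and `n_neighbors=3` are assumptions standing in for the project's actual search space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in data; the project searched over the real TF-IDF features.
reviews = ["excellent fun film", "great excellent acting", "fun and excellent",
           "worst film ever", "poor dull plot", "worst poor pacing"]
labels = [1, 1, 1, 0, 0, 0]
X = TfidfVectorizer().fit_transform(reviews)

# Search the same axes as in the write-up: distance metric and weighting.
grid = GridSearchCV(
    KNeighborsClassifier(n_neighbors=3),
    param_grid={"weights": ["uniform", "distance"],
                "metric": ["euclidean", "cosine"]},
    cv=3,
)
grid.fit(X, labels)
print(grid.best_params_)
```
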


Random Forest

Random Forest models captured complex word interactions.
With TF-IDF embeddings, accuracy reached 84.02%, while BERT embeddings followed closely at 83.62%.
BoW and GloVe lagged behind due to their limited ability to represent nuanced sentiment context.
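A minimal sketch of the Random Forest on TF-IDF features; the toy data and `n_estimators=100` are illustrative assumptions, not the tuned configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data for illustration; actual training used the full review set.
reviews = ["excellent fun film", "great excellent acting", "fun and great",
           "worst film ever", "poor dull plot", "worst poor pacing"]
labels = [1, 1, 1, 0, 0, 0]

X = TfidfVectorizer().fit_transform(reviews).toarray()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Each tree splits on different word features, letting the ensemble capture
# word interactions that a single linear model can miss.
print(forest.score(X, labels))  # training accuracy on the toy data
```
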


Deep Neural Network (DNN)

A 3-layer DNN trained with TF-IDF embeddings achieved 87.23% accuracy and AUC = 0.95, outperforming simpler models but not BERT.
The model used ReLU activations in its hidden layers and dropout regularization to prevent overfitting.
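The mechanics of that forward pass can be sketched in plain numpy. The layer widths and dropout rate below are hypothetical; the point is only how ReLU and (inverted) dropout combine in a 3-layer network over TF-IDF-style input vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def dropout(x, rate, training=True):
    """Inverted dropout: randomly zero units and rescale the survivors."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Hypothetical layer sizes for illustration; the real model's widths differ.
x = rng.standard_normal((4, 300))            # batch of 4 TF-IDF-like vectors
W1, W2, W3 = (rng.standard_normal(s) * 0.1
              for s in [(300, 128), (128, 64), (64, 1)])

h1 = dropout(relu(x @ W1), rate=0.5)         # hidden layer 1
h2 = dropout(relu(h1 @ W2), rate=0.5)        # hidden layer 2
logits = h2 @ W3                             # output layer
probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid for binary sentiment
print(probs.shape)  # (4, 1)
```
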


Fine-tuned BERT Transformer

The pre-trained BERT model achieved the highest overall performance, with 91.74% accuracy and AUC = 0.97.
Fine-tuning for two epochs with a learning rate of 2e-5 and batch size of 16 yielded highly contextual sentiment understanding.
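With the stated hyperparameters, the fine-tuning configuration might look roughly like this sketch using the Hugging Face `transformers` Trainer. The `bert-base-uncased` checkpoint and the `output_dir` name are assumptions (the write-up only says "pre-trained BERT"), and dataset preparation is omitted.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint; the write-up only says "pre-trained BERT".
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Hyperparameters reported above: 2 epochs, learning rate 2e-5, batch size 16.
args = TrainingArguments(
    output_dir="bert-imdb",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
)

# train_ds / eval_ds would be tokenized IMDb splits (preparation omitted).
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds)
# trainer.train()
```
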


Results Summary

Model                 Embedding   Accuracy   AUC
Logistic Regression   TF-IDF      89.83%     0.96
KNN                   TF-IDF      79.92%     0.88
Random Forest         TF-IDF      84.02%     0.91
DNN                   TF-IDF      87.23%     0.95
BERT Transformer      BERT        91.74%     0.97

TF-IDF consistently yielded the most reliable results across traditional models, while BERT surpassed them with its ability to capture deeper contextual meaning.


Conclusion & Future Work

This analysis demonstrates that while models such as Logistic Regression and the DNN perform well with TF-IDF features, contextual embeddings like BERT significantly enhance accuracy on nuanced text.

Future improvements may include:

  • Comparing performance with XLNet or GPT-based models
  • Data augmentation via synonym replacement or paraphrasing
  • Combining ensemble models to enhance robustness

Overall, the project highlights how effective feature representation is central to sentiment prediction performance.


Repository: GitHub Project Link
Dataset: IMDb 50K Movie Reviews