RNN Text Classification (TensorFlow 2, Low‑Level)

This project tackles binary sentiment analysis on the IMDB movie reviews dataset using a from‑scratch RNN built with low‑level TensorFlow operations. It emphasizes understanding the recurrent update equations, padding/truncation strategies, vocabulary limits, and practical training challenges for sequence data.

Purpose

  • Load IMDB via tf.keras.datasets.imdb (50k reviews: 25k train / 25k test, binary labels); see the loading sketch after this list.
  • Inspect indexed sequences; optionally reconstruct text via get_word_index().
  • Represent words as one‑hot vectors (sequence length × vocabulary size) and discuss why raw integer indices are unsuitable as direct inputs (one‑hot sketch below).
  • Handle variable sequence lengths with padding (and explore pre vs post padding) and truncation (e.g., to 200 tokens); see the padding sketch below.
  • Limit vocabulary size (e.g., keep top 20k, map rare words to an UNK token) to reduce memory and improve learning.
  • Implement an RNN from scratch (no Keras RNNCell or high‑level RNN layers); see the sketches after this list:
    • Loop over time steps, updating the hidden state from the previous state and the current input (tf.matmul plus a nonlinearity).
    • Use a many‑to‑one setup (classify from the final step's output) or aggregate across steps.
    • Train with tf.GradientTape (BPTT) and Keras optimizers/losses/metrics.
  • Compare output formulations (sketched below):
    • 2‑unit logits + softmax + sparse categorical cross‑entropy vs 1‑unit sigmoid + binary cross‑entropy.
  • Explore training issues: slow starts, vanishing gradients, initialization, learning rate, and using all time steps (averaging states/logits).
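
The sketches below illustrate these steps under assumed settings (a 20k vocabulary cap, 200-token sequences, hidden size 64); names such as VOCAB_SIZE, rnn_forward, and train_step are illustrative, not the notebook's exact code. First, loading and decoding (the index offset of 3 and the reserved indices 0–2 follow the tf.keras.datasets.imdb convention):

import tensorflow as tf

VOCAB_SIZE = 20_000  # keep the 20k most frequent words; rarer words map to the OOV index

# Indices 0-2 are reserved (padding, start-of-sequence, out-of-vocabulary),
# so real words start at index 3.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)

word_index = tf.keras.datasets.imdb.get_word_index()
index_to_word = {i + 3: w for w, i in word_index.items()}
index_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})

print(" ".join(index_to_word.get(i, "<unk>") for i in x_train[0]))
print("label:", y_train[0])  # 1 = positive, 0 = negative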
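
Padding and truncation can be handled with pad_sequences; maxlen=200, pre-padding, and post-truncation are one reasonable configuration, not the only one:

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 200  # cap every review at 200 tokens

# Pre-padding keeps the real tokens next to the final time step, which matters
# when only the last hidden state feeds the classifier.
x_train_padded = pad_sequences(x_train, maxlen=MAX_LEN, padding="pre", truncating="post")
x_test_padded = pad_sequences(x_test, maxlen=MAX_LEN, padding="pre", truncating="post")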
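
One-hot encoding is best done per batch with tf.one_hot, since materialising the whole training set as one-hot vectors (25k × 200 × 20k floats) would be far too large to keep in memory:

# (batch, MAX_LEN) integer indices -> (batch, MAX_LEN, VOCAB_SIZE) float one-hot vectors.
x_batch = tf.one_hot(x_train_padded[:32], depth=VOCAB_SIZE)
print(x_batch.shape)  # (32, 200, 20000)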
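
The recurrence itself is a plain loop over time steps built from tf.matmul and tanh, ending in a many-to-one classification head; a sketch with an assumed hidden size of 64:

HIDDEN = 64

# Parameters of the recurrence h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h).
W_xh = tf.Variable(tf.random.normal([VOCAB_SIZE, HIDDEN], stddev=0.01))
W_hh = tf.Variable(tf.random.normal([HIDDEN, HIDDEN], stddev=0.01))
b_h = tf.Variable(tf.zeros([HIDDEN]))
W_hy = tf.Variable(tf.random.normal([HIDDEN, 2], stddev=0.01))
b_y = tf.Variable(tf.zeros([2]))

def rnn_forward(x):                          # x: (batch, MAX_LEN, VOCAB_SIZE)
    batch = tf.shape(x)[0]
    h = tf.zeros([batch, HIDDEN])            # initial hidden state
    for t in range(MAX_LEN):                 # explicit loop over time steps
        h = tf.tanh(tf.matmul(x[:, t, :], W_xh) + tf.matmul(h, W_hh) + b_h)
    return tf.matmul(h, W_hy) + b_y          # many-to-one: logits from the final state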
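
One training step with tf.GradientTape; the gradient flows back through every unrolled time step (BPTT). The Adam learning rate here is just a starting point:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
params = [W_xh, W_hh, b_h, W_hy, b_y]

def train_step(indices, labels):             # indices: (batch, MAX_LEN) ints, labels: (batch,) 0/1
    x = tf.one_hot(indices, depth=VOCAB_SIZE)
    with tf.GradientTape() as tape:
        logits = rnn_forward(x)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, params)      # backpropagation through all MAX_LEN steps
    optimizer.apply_gradients(zip(grads, params))
    return loss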
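
The two output formulations differ only in the classification head and loss; a sketch assuming h_final holds the (batch, HIDDEN) state of the last time step and labels are 0/1 integers:

# Option A: 2-unit logits + softmax, trained with sparse categorical cross-entropy.
logits_2 = tf.matmul(h_final, W_hy) + b_y                        # shape (batch, 2)
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(labels, logits_2)

# Option B: a single logit + sigmoid, trained with binary cross-entropy.
W_hy1 = tf.Variable(tf.random.normal([HIDDEN, 1], stddev=0.01))
b_y1 = tf.Variable(tf.zeros([1]))
logit_1 = tf.matmul(h_final, W_hy1) + b_y1                       # shape (batch, 1)
loss_b = tf.keras.losses.BinaryCrossentropy(from_logits=True)(
    tf.reshape(tf.cast(labels, tf.float32), [-1, 1]), logit_1)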

Questions to Explore

  • Why is padding every review to the global maximum length wasteful? What smarter batching/padding schemes exist?
  • Truncating long sequences vs. removing them entirely: what are the trade-offs?
  • What alternatives to one-hot inputs (e.g., learned embeddings; see the sketch below) avoid such huge vectors?
  • Pre vs. post padding: why does the choice matter when only the last time-step output feeds the classifier?
  • Masking padded steps: how can state updates be skipped for padding within a batch (sketched below)?
  • Ways to leverage all time steps (averaging logits/states/probabilities; sketched below): what are the pros and cons?
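
On the embedding question: a trainable lookup table replaces the one-hot step entirely (EMBED_DIM is an arbitrary choice here; W_xh would then have shape (EMBED_DIM, HIDDEN)):

EMBED_DIM = 64
embedding = tf.Variable(tf.random.uniform([VOCAB_SIZE, EMBED_DIM], -0.05, 0.05))

# (batch, MAX_LEN) indices -> (batch, MAX_LEN, EMBED_DIM) dense vectors,
# avoiding the 20k-wide one-hot representation.
x_embedded = tf.nn.embedding_lookup(embedding, x_train_padded[:32])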
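
On the masking question: one option is to freeze the hidden state wherever the current token is the padding index 0, reusing the parameters from the forward-pass sketch above:

def rnn_forward_masked(indices):                  # indices: (batch, MAX_LEN) ints
    x = tf.one_hot(indices, depth=VOCAB_SIZE)
    mask = tf.cast(indices != 0, tf.float32)      # 1.0 for real tokens, 0.0 for padding
    h = tf.zeros([tf.shape(indices)[0], HIDDEN])
    for t in range(MAX_LEN):
        h_new = tf.tanh(tf.matmul(x[:, t, :], W_xh) + tf.matmul(h, W_hh) + b_h)
        m = mask[:, t:t + 1]                      # (batch, 1), broadcasts over HIDDEN
        h = m * h_new + (1.0 - m) * h             # keep the old state on padded steps
    return tf.matmul(h, W_hy) + b_y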
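
On using all time steps: one option averages the per-step logits instead of reading only the final state:

def rnn_forward_mean_logits(x):                   # x: (batch, MAX_LEN, VOCAB_SIZE)
    h = tf.zeros([tf.shape(x)[0], HIDDEN])
    step_logits = []
    for t in range(MAX_LEN):
        h = tf.tanh(tf.matmul(x[:, t, :], W_xh) + tf.matmul(h, W_hh) + b_h)
        step_logits.append(tf.matmul(h, W_hy) + b_y)
    # Average logits over time; padded steps would ideally be excluded via the mask above.
    return tf.reduce_mean(tf.stack(step_logits, axis=1), axis=1)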

Run locally

python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification.ipynb

Notes

  • Consider sequence length caps (e.g., 200) and vocabulary limits (e.g., 20k) for speed and stability.
  • You may use Keras optimizers/losses/metrics, but the RNN recurrence itself is implemented with low‑level ops.
