A retrieval-augmented generation (RAG) approach for visual storytelling that generates coherent narratives from image sequences by retrieving similar examples from a training corpus.
This project tackles the visual storytelling task by using a two-level RAG system:
- Image-level RAG: Retrieves the most similar training image (by CLIP embedding similarity) and uses its description as a few-shot example for generating descriptions
- Story-level RAG: Retrieves the most similar training story (by text embedding similarity) to guide the final narrative generation
The approach ensures visual grounding through example-based description generation and maintains narrative coherence through story-level retrieval.
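Conceptually, the pipeline reduces to the loop below. This is a minimal, hypothetical sketch: the function name and the helper callables (`describe`, `embed_text`, `write_story`) are illustrative stand-ins, and the repository's actual entry point is `vgsg_rag()` (see Key Functions below).

```python
# Minimal sketch of the two-level RAG loop. All helper callables are
# illustrative stand-ins, not functions from this repository.
import numpy as np

def two_level_rag(images, sim_rows, train_descs, train_story_embs,
                  describe, embed_text, write_story):
    # Image-level RAG: for each image, retrieve the nearest training
    # image and use its description as a few-shot example.
    descs = []
    for img, row in zip(images, sim_rows):
        j = int(np.argmax(row))                      # nearest training image
        descs.append(describe(img, example=train_descs[j]))
    # Story-level RAG: retrieve the training story nearest to the
    # concatenated descriptions and use it to guide final generation.
    joined = " ".join(descs)
    sims = train_story_embs @ embed_text(joined)     # cosine would normalize first
    k = int(np.argmax(sims))
    return write_story(joined, example_index=k)
```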
The complete pipeline consists of four main stages:
**Stage 1: Feature Extraction** (`load_features.ipynb`)

Purpose: Load dataset and extract visual/textual features
Process:
- Load the `tonyhong/vwp` visual storytelling dataset (train/val/test splits)
- Extract CLIP/SWIN-base embeddings for all images
- Tokenize and embed story text using T5 tokenizer (padded to 768 tokens)
- Create mappings: image_id → embedding, image_id → descriptions, scene_id → images
- Save preprocessed features for efficient retrieval
Outputs:
- `outputs/train_features.json`: Training set features
- `outputs/val_features.json`: Validation set features
- `outputs/test_features.json`: Test set features
- `outputs/train_data.json`: Training data with text and mappings
- `outputs/val_data.json`: Validation data with text and mappings
- `outputs/test_data.json`: Test data with text and mappings
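For reference, here is a minimal sketch of what the embedding step can look like with `transformers`, assuming a CLIP ViT-B/32 checkpoint, `t5-base`, and PIL image inputs. The repository also mentions SWIN-base; the exact checkpoints and dataset field names are assumptions.

```python
# Sketch of Stage 1 embedding extraction; checkpoint names are
# assumptions, not necessarily what the notebook uses.
import torch
from transformers import CLIPModel, CLIPProcessor, T5Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t5_tok = T5Tokenizer.from_pretrained("t5-base")

@torch.no_grad()
def embed_image(pil_image):
    """CLIP image embedding for one PIL image."""
    inputs = proc(images=pil_image, return_tensors="pt")
    return clip.get_image_features(**inputs).squeeze(0)  # [512] vector

def embed_story(text):
    """T5 token IDs padded/truncated to 768, as described above."""
    return t5_tok(text, padding="max_length", max_length=768,
                  truncation=True).input_ids
```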
**Stage 2: Similarity Computation** (`load_features.ipynb`)

Purpose: Pre-compute similarity matrices for retrieval
Process:
- Compute cosine similarity between each val/test image embedding and all training image embeddings
- Results in two similarity matrices:
  - `val_train_sim`: [1759 × 16494] matrix
  - `test_train_sim`: [1604 × 16494] matrix
- Each entry `sim[i][j]` represents the similarity between target image `i` and training image `j`
Outputs:
- `outputs/val_train_sim.json`: Validation-to-training similarity matrix
- `outputs/test_train_sim.json`: Test-to-training similarity matrix
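A minimal NumPy sketch of what `similarity_matrix()` computes (the notebook's actual implementation may differ):

```python
import numpy as np

def similarity_matrix(vec_li_1, vec_li_2):
    """Cosine similarity between every pair across two embedding lists."""
    a = np.asarray(vec_li_1, dtype=np.float32)        # [n1, d]
    b = np.asarray(vec_li_2, dtype=np.float32)        # [n2, d]
    a /= np.linalg.norm(a, axis=1, keepdims=True)     # L2-normalize rows
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                                    # sim[i][j], shape [n1, n2]
```

For the validation split this produces the [1759 × 16494] `val_train_sim` matrix listed above.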
**Stage 3: Description Generation** (`similarity_rag.ipynb`)

Purpose: Generate descriptions for each image using retrieved examples
Process: For each target image in a sequence:
- Look up the most similar training image using the pre-computed similarity matrix
- Retrieve the training image's ground-truth descriptions
- Call GPT-4o/GPT-4o-mini with:
  - System prompt: "Write a short description (~20 words) based on the given example"
  - Few-shot example: similar training image + its description(s)
  - Query: target image to describe
- Generate ~20-word description for the target image
Key Function: `generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)`
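A hedged sketch of this call using the OpenAI Python SDK; the exact prompt wording and message layout in the repository may differ:

```python
def generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp):
    """Few-shot description: show the retrieved example image and its
    description, then ask for a ~20-word description of the target."""
    response = client.chat.completions.create(
        model=model,  # "gpt-4o" or "gpt-4o-mini"
        messages=[
            {"role": "system",
             "content": "Write a short description (~20 words) based on the given example."},
            {"role": "user", "content": [
                {"type": "text", "text": f"Example image and its description: {desc_exp}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_exp_base64}"}},
                {"type": "text", "text": "Now describe this image in the same style:"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content
```

The retrieval step itself is a row-wise argmax over the pre-computed matrix, e.g. `j = int(np.argmax(val_train_sim[i]))` to find the training image most similar to target image `i`.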
**Stage 4: Story Generation** (`similarity_rag.ipynb`)

Purpose: Generate a coherent story from the image descriptions
Process:
- Concatenate all generated image descriptions into a single string
- Encode the concatenated descriptions using T5 tokenizer (padded to 768 tokens)
- Compute cosine similarity with all training story embeddings
- Retrieve the most similar training story
- Call GPT-4o/GPT-4o-mini with:
  - Prompt: "Combine these descriptions into a coherent story (~100 words) based on this example"
  - Descriptions: the generated image descriptions
  - Example: the retrieved similar training story
- Generate final ~100-word story
Key Function: `generate_story(client, model, descriptions, story_example)`
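A sketch of the retrieval-plus-generation step under the same caveats: the prompt text is an assumption, and `retrieve_story_example()` with its `train_story_embs`/`train_stories` inputs is an illustrative helper that mirrors the procedure described above (T5 token IDs padded to 768, then cosine similarity).

```python
import numpy as np

def retrieve_story_example(tokenizer, descriptions, train_story_embs, train_stories):
    """Embed concatenated descriptions, return the nearest training story."""
    q = np.asarray(tokenizer(descriptions, padding="max_length",
                             max_length=768, truncation=True).input_ids,
                   dtype=np.float32)
    a = train_story_embs / np.linalg.norm(train_story_embs, axis=1, keepdims=True)
    sims = a @ (q / np.linalg.norm(q))               # cosine similarity per story
    return train_stories[int(np.argmax(sims))]

def generate_story(client, model, descriptions, story_example):
    """Combine descriptions into a ~100-word story guided by the example."""
    prompt = ("Combine these descriptions into a coherent story (~100 words) "
              "based on this example.\n\n"
              f"Descriptions: {descriptions}\n\nExample story: {story_example}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```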
Outputs:
- `results/val_result.csv`: Validation set generated stories
- `results/test_result.csv`: Test set generated stories
- `logs/val_gen_story.log`: Validation generation logs
- `logs/test_gen_story.log`: Test generation logs
```
vgsg_rag/
├── README.md # This file
├── LICENSE # Project license
├── .gitignore # Git ignore patterns
│
├── expr/ # Jupyter notebooks for experimentation
│ ├── load_features.ipynb # [Stage 1 & 2] Feature extraction & similarity computation
│ ├── similarity_rag.ipynb # [Stage 3 & 4] Main generation pipeline (with visualizations)
│ └── similarity_rag_console.ipynb # [Stage 3 & 4] Console version (no visualizations)
│
├── scripts/ # Python scripts for production use
│ ├── load_features.py # Python script version of load_features.ipynb
│ ├── similarity_rag.py # Python script version of similarity_rag.ipynb
│ └── similarity_rag_console.py # Python script version of similarity_rag_console.ipynb
│
├── outputs/ # Preprocessed features and similarity matrices
│ ├── train_features.json # Training features (16,494 images, 11,773 stories)
│ ├── val_features.json # Validation features (1,759 images, 849 stories)
│ ├── test_features.json # Test features (1,604 images, 586 stories)
│ ├── train_data.json # Training text data and mappings
│ ├── val_data.json # Validation text data and mappings
│ ├── test_data.json # Test text data and mappings
│ ├── val_train_sim.json # Val-to-train similarity matrix [1759×16494]
│ └── test_train_sim.json # Test-to-train similarity matrix [1604×16494]
│
├── results/ # Generated stories
│ ├── val_result.csv # Validation generated stories
│ ├── test_result.csv # Test generated stories
│ ├── val_gen_story_li_0_494.json # Validation stories (batch 1)
│ ├── val_gen_story_li_495_.json # Validation stories (batch 2)
│ ├── test_gen_story_li_0_122.json # Test stories (batch 1)
│ └── test_gen_story_li_123_.json # Test stories (batch 2)
│
└── logs/ # Generation logs with intermediate outputs
├── val_gen_story.log # Validation generation details
    └── test_gen_story.log # Test generation details
```
| File | Location | Pipeline Stages | Key Functions | Description |
|---|---|---|---|---|
| `load_features.ipynb` | `expr/` | Stage 1-2 | `load_features()`, `similarity_matrix()` | Loads dataset, extracts CLIP/SWIN embeddings, computes similarity matrices |
| `load_features.py` | `scripts/` | Stage 1-2 | Same as above | Command-line version of the load_features notebook |
| `similarity_rag.ipynb` | `expr/` | Stage 3-4 | `generate_img_desc_rag()`, `generate_story()`, `vgsg_rag()` | Main pipeline with visualizations for debugging |
| `similarity_rag.py` | `scripts/` | Stage 3-4 | Same as above | Command-line version with visualizations |
| `similarity_rag_console.ipynb` | `expr/` | Stage 3-4 | Same as above | Streamlined notebook without visualization overhead |
| `similarity_rag_console.py` | `scripts/` | Stage 3-4 | Same as above | Command-line version optimized for production |
Note:
- Jupyter notebooks (`.ipynb`) are in the `expr/` directory for interactive experimentation
- Python scripts (`.py`) are in the `scripts/` directory for command-line execution
- All scripts contain identical logic to their corresponding notebooks
- Scripts use relative paths (`../outputs/`, `../results/`, etc.) to access data directories
| Directory | Contents | Purpose |
|---|---|---|
| `outputs/` | Features & similarity matrices | Stores preprocessed data to avoid recomputation |
| `results/` | Generated stories (CSV & JSON) | Final outputs for evaluation |
| `logs/` | Generation logs | Debugging info: similar images, descriptions, examples |
Data flow through the pipeline:

```
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Feature Extraction (load_features.ipynb) │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Load Dataset│ → │Extract CLIP │ → │ Save to │ │
│ │ (VWP) │ │ Embeddings │ │ outputs/ │ │
│ └─────────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Similarity Computation (load_features.ipynb) │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Compute │ → │ Cosine │ → │ Save to │ │
│ │ Similarities│ │ Similarity │ │ outputs/ │ │
│ └─────────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Description Generation (similarity_rag.ipynb) │
│ │
│ For each image in sequence: │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Retrieve │ → │ Few-shot │ → │ Generate │ │
│ │ Similar Image│ │ Prompting │ │ Description │ │
│ └──────────────┘ └──────────────┘ └─────────────┘ │
│ ↓ ↓ ↓ │
│ Similarity Matrix Similar Image GPT-4o/mini │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 4: Story Generation (similarity_rag.ipynb) │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Concatenate │ → │ Retrieve │ → │ Generate │ │
│ │ Descriptions │ │Similar Story │ │ Story │ │
│ └──────────────┘ └──────────────┘ └─────────────┘ │
│ ↓ ↓ ↓ │
│ T5 Embeddings Cosine Similarity GPT-4o/mini │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────┐
│ results/*.csv │
└─────────────────┘
```
Key functions:

```python
# In load_features.ipynb

load_features(data)
# Extracts image/scene lists, embeddings, and creates ID mappings

preprocess_labelled_data(data)
# Tokenizes stories and creates image-sentence mappings

preprocess_test_data(data)
# Processes test data (no story labels)

similarity_matrix(vec_li_1, vec_li_2)
# Computes cosine similarity matrix between two lists of embeddings
```

```python
# In similarity_rag.ipynb

generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)
# Generates image description using retrieved example
# - img_base64: Target image to describe
# - img_exp_base64: Retrieved similar image (example)
# - desc_exp: Description(s) of the similar image

generate_story(client, model, descriptions, story_example)
# Generates final story from image descriptions
# - descriptions: Concatenated image descriptions
# - story_example: Retrieved similar story

vgsg_rag(tokenizer, tgt_data, tgt_index_img_dic, tgt_src_sim, ...)
# Main pipeline orchestrator: runs full RAG process for all sequences
```

Install dependencies:

```bash
pip install tqdm numpy pandas matplotlib openai datasets transformers torch pillow
```

Required API:
- OpenAI API key (for GPT-4o or GPT-4o-mini)
Run Stages 1-2 (feature extraction):

**Option A: Using Jupyter Notebook**

```bash
cd expr
jupyter notebook load_features.ipynb
# Run all cells to generate files in ../outputs/ directory
```

**Option B: Using Python Script**

```bash
cd scripts
python load_features.py
# Generates files in ../outputs/ directory
```

Run Stages 3-4 (description and story generation):

**Option A: Using Jupyter Notebook (with visualizations)**

```bash
cd expr
jupyter notebook similarity_rag.ipynb
# Set your OpenAI API key in the notebook
# Run cells to generate stories with visual debugging
# Results saved to ../results/ directory
```

**Option B: Using Python Script**

```bash
cd scripts
# Edit the script to set your API key
python similarity_rag.py
# Or use the console version (same functionality)
python similarity_rag_console.py
# Results saved to ../results/ directory
```

To inspect the generated stories:

```python
import pandas as pd

results = pd.read_csv('results/val_result.csv')
print(results[['story_id', 'generated_story']].head())
```

Note:

- All Jupyter notebooks are in `expr/` and should be run from that directory
- All Python scripts are in `scripts/` and should be run from that directory
- Both use relative paths (`../outputs/`, `../results/`, `../logs/`, `../images/`) to access shared data directories
- This organization separates experimental notebooks from production scripts
| Split | Stories | Images | Avg Images/Story |
|---|---|---|---|
| Train | 11,773 | 16,494 | ~1.4 |
| Val | 849 | 1,759 | ~2.1 |
| Test | 586 | 1,604 | ~2.7 |
- Visual Grounding: Uses actual image content (via CLIP embeddings) to retrieve relevant examples
- Style Consistency: Few-shot examples help maintain consistent description style
- Narrative Coherence: Story-level retrieval provides structural templates
- Scalability: Pre-computed similarities enable fast retrieval
- Modularity: Two-stage design allows independent optimization of descriptions and stories
If you use this code, please cite:
```bibtex
@misc{vgsg_rag,
  title={Visual Grounded Story Generation with Retrieval-Augmented Generation},
  author={Ruitao Feng},
  year={2024}
}
```

See the LICENSE file for details.