A retrieval-augmented generation (RAG) approach for visual storytelling that generates coherent narratives from image sequences by retrieving similar examples from a training corpus.
This project tackles the visual storytelling task by using a two-level RAG system:
- Image-level RAG: Retrieves the most similar training image (by CLIP embedding similarity) and uses its description as a few-shot example for generating descriptions
- Story-level RAG: Retrieves the most similar training story (by text embedding similarity) to guide the final narrative generation
The approach ensures visual grounding through example-based description generation and maintains narrative coherence through story-level retrieval.
The complete pipeline consists of four main stages:
Purpose: Load dataset and extract visual/textual features
Process:
- Load the tonyhong/vwpvisual storytelling dataset (train/val/test splits)
- Extract CLIP/SWIN-base embeddings for all images
- Tokenize and embed story text using T5 tokenizer (padded to 768 tokens)
- Create mappings: image_id → embedding, image_id → descriptions, scene_id → images
- Save preprocessed features for efficient retrieval
Outputs:
- outputs/train_features.json: Training set features
- outputs/val_features.json: Validation set features
- outputs/test_features.json: Test set features
- outputs/train_data.json: Training data with text and mappings
- outputs/val_data.json: Validation data with text and mappings
- outputs/test_data.json: Test data with text and mappings
Purpose: Pre-compute similarity matrices for retrieval
Process:
- Compute cosine similarity between each val/test image embedding and all training image embeddings
- Results in two similarity matrices:
- val_train_sim: [1759 × 16494] matrix
- test_train_sim: [1604 × 16494] matrix
 
- Each entry sim[i][j] represents the similarity between target image i and training image j
Outputs:
- outputs/val_train_sim.json: Validation-to-training similarity matrix
- outputs/test_train_sim.json: Test-to-training similarity matrix
Purpose: Generate descriptions for each image using retrieved examples
Process: For each target image in a sequence:
- Look up the most similar training image using the pre-computed similarity matrix
- Retrieve the training image's ground-truth descriptions
- Call GPT-4o/GPT-4o-mini with:
- System prompt: "Write a short description (~20 words) based on the given example"
- Few-shot example: Similar training image + its description(s)
- Query: Target image to describe
 
- Generate ~20-word description for the target image
Key Function: generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)
Purpose: Generate coherent story from image descriptions
Process:
- Concatenate all generated image descriptions into a single string
- Encode the concatenated descriptions using T5 tokenizer (padded to 768 tokens)
- Compute cosine similarity with all training story embeddings
- Retrieve the most similar training story
- Call GPT-4o/GPT-4o-mini with:
- Prompt: "Combine these descriptions into a coherent story (~100 words) based on this example"
- Descriptions: Generated image descriptions
- Example: Retrieved similar training story
 
- Generate final ~100-word story
Key Function: generate_story(client, model, descriptions, story_example)
Outputs:
- results/val_result.csv: Validation set generated stories
- results/test_result.csv: Test set generated stories
- logs/val_gen_story.log: Validation generation logs
- logs/test_gen_story.log: Test generation logs
vgsg_rag/
├── README.md                           # This file
├── LICENSE                             # Project license
├── .gitignore                          # Git ignore patterns
│
├── expr/                               # Jupyter notebooks for experimentation
│   ├── load_features.ipynb             # [Stage 1 & 2] Feature extraction & similarity computation
│   ├── similarity_rag.ipynb            # [Stage 3 & 4] Main generation pipeline (with visualizations)
│   └── similarity_rag_console.ipynb    # [Stage 3 & 4] Console version (no visualizations)
│
├── scripts/                            # Python scripts for production use
│   ├── load_features.py                # Python script version of load_features.ipynb
│   ├── similarity_rag.py               # Python script version of similarity_rag.ipynb
│   └── similarity_rag_console.py       # Python script version of similarity_rag_console.ipynb
│
├── outputs/                            # Preprocessed features and similarity matrices
│   ├── train_features.json             # Training features (16,494 images, 11,773 stories)
│   ├── val_features.json               # Validation features (1,759 images, 849 stories)
│   ├── test_features.json              # Test features (1,604 images, 586 stories)
│   ├── train_data.json                 # Training text data and mappings
│   ├── val_data.json                   # Validation text data and mappings
│   ├── test_data.json                  # Test text data and mappings
│   ├── val_train_sim.json              # Val-to-train similarity matrix [1759×16494]
│   └── test_train_sim.json             # Test-to-train similarity matrix [1604×16494]
│
├── results/                            # Generated stories
│   ├── val_result.csv                  # Validation generated stories
│   ├── test_result.csv                 # Test generated stories
│   ├── val_gen_story_li_0_494.json     # Validation stories (batch 1)
│   ├── val_gen_story_li_495_.json      # Validation stories (batch 2)
│   ├── test_gen_story_li_0_122.json    # Test stories (batch 1)
│   └── test_gen_story_li_123_.json     # Test stories (batch 2)
│
└── logs/                               # Generation logs with intermediate outputs
    ├── val_gen_story.log               # Validation generation details
    └── test_gen_story.log              # Test generation details
| File | Location | Pipeline Stages | Key Functions | Description | 
|---|---|---|---|---|
| load_features.ipynb | expr/ | Stage 1-2 | load_features(),similarity_matrix() | Loads dataset, extracts CLIP/SWIN embeddings, computes similarity matrices | 
| load_features.py | scripts/ | Stage 1-2 | Same as above | Command-line version of load_features notebook | 
| similarity_rag.ipynb | expr/ | Stage 3-4 | generate_img_desc_rag(),generate_story(),vgsg_rag() | Main pipeline with visualizations for debugging | 
| similarity_rag.py | scripts/ | Stage 3-4 | Same as above | Command-line version with visualizations | 
| similarity_rag_console.ipynb | expr/ | Stage 3-4 | Same as above | Streamlined notebook without visualization overhead | 
| similarity_rag_console.py | scripts/ | Stage 3-4 | Same as above | Command-line version optimized for production | 
Note:
- Jupyter notebooks (.ipynb) are in theexpr/directory for interactive experimentation
- Python scripts (.py) are in thescripts/directory for command-line execution
- All scripts contain identical logic to their corresponding notebooks
- Scripts use relative paths (../outputs/,../results/, etc.) to access data directories
| Directory | Contents | Purpose | 
|---|---|---|
| outputs/ | Features & similarity matrices | Stores preprocessed data to avoid recomputation | 
| results/ | Generated stories (CSV & JSON) | Final outputs for evaluation | 
| logs/ | Generation logs | Debugging info: similar images, descriptions, examples | 
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Feature Extraction (load_features.ipynb)          │
│ ┌─────────────┐    ┌──────────────┐    ┌─────────────┐   │
│ │ Load Dataset│ → │Extract CLIP  │ → │  Save to    │   │
│ │  (VWP)      │    │ Embeddings   │    │ outputs/    │   │
│ └─────────────┘    └──────────────┘    └─────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Similarity Computation (load_features.ipynb)      │
│ ┌─────────────┐    ┌──────────────┐    ┌─────────────┐   │
│ │ Compute     │ → │  Cosine      │ → │  Save to    │   │
│ │ Similarities│    │  Similarity  │    │ outputs/    │   │
│ └─────────────┘    └──────────────┘    └─────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Description Generation (similarity_rag.ipynb)     │
│                                                             │
│  For each image in sequence:                               │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────┐ │
│  │ Retrieve     │ → │  Few-shot    │ → │  Generate   │ │
│  │ Similar Image│    │  Prompting   │    │ Description │ │
│  └──────────────┘    └──────────────┘    └─────────────┘ │
│         ↓                    ↓                    ↓        │
│  Similarity Matrix    Similar Image      GPT-4o/mini      │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 4: Story Generation (similarity_rag.ipynb)           │
│ ┌──────────────┐    ┌──────────────┐    ┌─────────────┐  │
│ │ Concatenate  │ → │  Retrieve    │ → │  Generate   │  │
│ │ Descriptions │    │Similar Story │    │   Story     │  │
│ └──────────────┘    └──────────────┘    └─────────────┘  │
│         ↓                    ↓                    ↓        │
│  T5 Embeddings      Cosine Similarity    GPT-4o/mini      │
└─────────────────────────────────────────────────────────────┘
                            ↓
                  ┌─────────────────┐
                  │ results/*.csv   │
                  └─────────────────┘
# In load_features.ipynb
load_features(data)
# Extracts image/scene lists, embeddings, and creates ID mappings
preprocess_labelled_data(data)  
# Tokenizes stories and creates image-sentence mappings
preprocess_test_data(data)
# Processes test data (no story labels)# In load_features.ipynb
similarity_matrix(vec_li_1, vec_li_2)
# Computes cosine similarity matrix between two lists of embeddings# In similarity_rag.ipynb
generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)
# Generates image description using retrieved example
# - img_base64: Target image to describe
# - img_exp_base64: Retrieved similar image (example)
# - desc_exp: Description(s) of the similar image
generate_story(client, model, descriptions, story_example)
# Generates final story from image descriptions
# - descriptions: Concatenated image descriptions
# - story_example: Retrieved similar story
vgsg_rag(tokenizer, tgt_data, tgt_index_img_dic, tgt_src_sim, ...)
# Main pipeline orchestrator: runs full RAG process for all sequencespip install tqdm numpy pandas matplotlib openai datasets transformers torch pillowRequired API:
- OpenAI API key (for GPT-4o or GPT-4o-mini)
Option A: Using Jupyter Notebook
cd expr
jupyter notebook load_features.ipynb
# Run all cells to generate files in ../outputs/ directoryOption B: Using Python Script
cd scripts
python load_features.py
# Generates files in ../outputs/ directoryOption A: Using Jupyter Notebook (with visualizations)
cd expr
jupyter notebook similarity_rag.ipynb
# Set your OpenAI API key in the notebook
# Run cells to generate stories with visual debugging
# Results saved to ../results/ directoryOption B: Using Python Script
cd scripts
# Edit the script to set your API key
python similarity_rag.py
# Or use the console version (same functionality)
python similarity_rag_console.py
# Results saved to ../results/ directoryimport pandas as pd
results = pd.read_csv('results/val_result.csv')
print(results[['story_id', 'generated_story']].head())- All Jupyter notebooks are in expr/and should be run from that directory
- All Python scripts are in scripts/and should be run from that directory
- Both use relative paths (../outputs/,../results/,../logs/,../images/) to access shared data directories
- This organization separates experimental notebooks from production scripts
| Split | Stories | Images | Avg Images/Story | 
|---|---|---|---|
| Train | 11,773 | 16,494 | ~1.4 | 
| Val | 849 | 1,759 | ~2.1 | 
| Test | 586 | 1,604 | ~2.7 | 
- Visual Grounding: Uses actual image content (via CLIP embeddings) to retrieve relevant examples
- Style Consistency: Few-shot examples help maintain consistent description style
- Narrative Coherence: Story-level retrieval provides structural templates
- Scalability: Pre-computed similarities enable fast retrieval
- Modularity: Two-stage design allows independent optimization of descriptions and stories
If you use this code, please cite:
@misc{vgsg_rag,
  title={Visual Grounded Story Generation with Retrieval-Augmented Generation},
  author={Ruitao Feng},
  year={2024}
}See LICENSE file for details.