Visual Grounded Story Generation with RAG

A retrieval-augmented generation (RAG) approach for visual storytelling that generates coherent narratives from image sequences by retrieving similar examples from a training corpus.

Overview

This project tackles the visual storytelling task by using a two-level RAG system:

  1. Image-level RAG: Retrieves the most similar training image (by CLIP embedding similarity) and uses its description as a few-shot example for generating descriptions
  2. Story-level RAG: Retrieves the most similar training story (by text embedding similarity) to guide the final narrative generation

The approach ensures visual grounding through example-based description generation and maintains narrative coherence through story-level retrieval.

Pipeline

The complete pipeline consists of four main stages:

Stage 1: Feature Extraction & Preprocessing

Purpose: Load dataset and extract visual/textual features

Process:

  • Load the tonyhong/vwp visual storytelling dataset (train/val/test splits)
  • Extract CLIP/SWIN-base embeddings for all images
  • Tokenize and embed story text using T5 tokenizer (padded to 768 tokens)
  • Create mappings: image_id → embedding, image_id → descriptions, scene_id → images
  • Save preprocessed features for efficient retrieval
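
The image-embedding step above can be sketched roughly as follows, assuming the Hugging Face openai/clip-vit-base-patch32 checkpoint (the exact checkpoint, batching, and normalization used in load_features may differ):

# Sketch: embed PIL images with CLIP via Hugging Face transformers (assumed checkpoint)
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(pil_images):
    inputs = processor(images=pil_images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)        # shape [N, 512]
    # L2-normalize so dot products later equal cosine similarity
    return (features / features.norm(dim=-1, keepdim=True)).numpy()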

Outputs:

  • outputs/train_features.json: Training set features
  • outputs/val_features.json: Validation set features
  • outputs/test_features.json: Test set features
  • outputs/train_data.json: Training data with text and mappings
  • outputs/val_data.json: Validation data with text and mappings
  • outputs/test_data.json: Test data with text and mappings

Stage 2: Similarity Computation

Purpose: Pre-compute similarity matrices for retrieval

Process:

  • Compute cosine similarity between each val/test image embedding and all training image embeddings
  • Results in two similarity matrices:
    • val_train_sim: [1759 × 16494] matrix
    • test_train_sim: [1604 × 16494] matrix
  • Each entry sim[i][j] represents the similarity between target image i and training image j
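
Such a matrix can be computed with NumPy as in the sketch below (the repository's similarity_matrix() may be implemented differently):

# Sketch: cosine similarity between every target embedding and every training embedding
import numpy as np

def similarity_matrix(tgt_vecs, train_vecs):
    tgt = np.asarray(tgt_vecs, dtype=np.float32)      # [num_targets, dim]
    train = np.asarray(train_vecs, dtype=np.float32)  # [num_train, dim]
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    train /= np.linalg.norm(train, axis=1, keepdims=True)
    return tgt @ train.T                              # [num_targets, num_train]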

Outputs:

  • outputs/val_train_sim.json: Validation-to-training similarity matrix
  • outputs/test_train_sim.json: Test-to-training similarity matrix

Stage 3: Image Description Generation (RAG Stage 1)

Purpose: Generate descriptions for each image using retrieved examples

Process: For each target image in a sequence:

  1. Look up the most similar training image using the pre-computed similarity matrix
  2. Retrieve the training image's ground-truth descriptions
  3. Call GPT-4o/GPT-4o-mini with:
    • System prompt: "Write a short description (~20 words) based on the given example"
    • Few-shot example: Similar training image + its description(s)
    • Query: Target image to describe
  4. Generate ~20-word description for the target image

Key Function: generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)
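
A hedged sketch of what such a few-shot, image-grounded call can look like with the OpenAI chat completions API is shown below. The prompt wording and message layout are illustrative, not the repository's exact generate_img_desc_rag implementation:

# Sketch: describe a target image using one retrieved (image, description) example
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_with_example(model, img_base64, img_exp_base64, desc_exp):
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o" or "gpt-4o-mini"
        messages=[
            {"role": "system",
             "content": "Write a short description (~20 words) of the second image, "
                        "following the style of the example description of the first image."},
            {"role": "user", "content": [
                {"type": "text", "text": f"Example description: {desc_exp}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_exp_base64}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content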


Stage 4: Story Generation (RAG Stage 2)

Purpose: Generate coherent story from image descriptions

Process:

  1. Concatenate all generated image descriptions into a single string
  2. Encode the concatenated descriptions using T5 tokenizer (padded to 768 tokens)
  3. Compute cosine similarity with all training story embeddings
  4. Retrieve the most similar training story
  5. Call GPT-4o/GPT-4o-mini with:
    • Prompt: "Combine these descriptions into a coherent story (~100 words) based on this example"
    • Descriptions: Generated image descriptions
    • Example: Retrieved similar training story
  6. Generate final ~100-word story

Key Function: generate_story(client, model, descriptions, story_example)
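
A sketch of the story-level retrieval and generation step is given below. It assumes the concatenated descriptions are compared against pre-computed training story vectors via cosine similarity, using a T5 tokenizer padded to 768 tokens as described above; the checkpoint name, the retrieve_story_example helper, and the prompt wording are illustrative, while generate_story follows the documented signature:

# Sketch: retrieve the most similar training story, then generate the final narrative
import numpy as np
from openai import OpenAI
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed checkpoint

def retrieve_story_example(descriptions, train_story_vecs, train_stories):
    # Encode the concatenated descriptions as a T5 token-ID vector padded to 768 positions
    ids = tokenizer(descriptions, padding="max_length", truncation=True,
                    max_length=768)["input_ids"]
    query = np.asarray(ids, dtype=np.float32)
    train = np.asarray(train_story_vecs, dtype=np.float32)
    sims = (train @ query) / (np.linalg.norm(train, axis=1) * np.linalg.norm(query) + 1e-8)
    return train_stories[int(np.argmax(sims))]

def generate_story(client, model, descriptions, story_example):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Combine these descriptions into a coherent story (~100 words), "
                              "using the example story as a stylistic guide.\n\n"
                              f"Descriptions: {descriptions}\n\nExample story: {story_example}"}],
    )
    return response.choices[0].message.content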

Outputs:

  • results/val_result.csv: Validation set generated stories
  • results/test_result.csv: Test set generated stories
  • logs/val_gen_story.log: Validation generation logs
  • logs/test_gen_story.log: Test generation logs

Project Structure

vgsg_rag/
├── README.md                           # This file
├── LICENSE                             # Project license
├── .gitignore                          # Git ignore patterns
│
├── expr/                               # Jupyter notebooks for experimentation
│   ├── load_features.ipynb             # [Stage 1 & 2] Feature extraction & similarity computation
│   ├── similarity_rag.ipynb            # [Stage 3 & 4] Main generation pipeline (with visualizations)
│   └── similarity_rag_console.ipynb    # [Stage 3 & 4] Console version (no visualizations)
│
├── scripts/                            # Python scripts for production use
│   ├── load_features.py                # Python script version of load_features.ipynb
│   ├── similarity_rag.py               # Python script version of similarity_rag.ipynb
│   └── similarity_rag_console.py       # Python script version of similarity_rag_console.ipynb
│
├── outputs/                            # Preprocessed features and similarity matrices
│   ├── train_features.json             # Training features (16,494 images, 11,773 stories)
│   ├── val_features.json               # Validation features (1,759 images, 849 stories)
│   ├── test_features.json              # Test features (1,604 images, 586 stories)
│   ├── train_data.json                 # Training text data and mappings
│   ├── val_data.json                   # Validation text data and mappings
│   ├── test_data.json                  # Test text data and mappings
│   ├── val_train_sim.json              # Val-to-train similarity matrix [1759×16494]
│   └── test_train_sim.json             # Test-to-train similarity matrix [1604×16494]
│
├── results/                            # Generated stories
│   ├── val_result.csv                  # Validation generated stories
│   ├── test_result.csv                 # Test generated stories
│   ├── val_gen_story_li_0_494.json     # Validation stories (batch 1)
│   ├── val_gen_story_li_495_.json      # Validation stories (batch 2)
│   ├── test_gen_story_li_0_122.json    # Test stories (batch 1)
│   └── test_gen_story_li_123_.json     # Test stories (batch 2)
│
└── logs/                               # Generation logs with intermediate outputs
    ├── val_gen_story.log               # Validation generation details
    └── test_gen_story.log              # Test generation details

Key Files and Their Roles

Core Notebooks & Scripts

  • load_features.ipynb (expr/, Stages 1-2): load_features(), similarity_matrix(). Loads the dataset, extracts CLIP/SWIN embeddings, and computes the similarity matrices.
  • load_features.py (scripts/, Stages 1-2): same functions as above. Command-line version of load_features.ipynb.
  • similarity_rag.ipynb (expr/, Stages 3-4): generate_img_desc_rag(), generate_story(), vgsg_rag(). Main pipeline with visualizations for debugging.
  • similarity_rag.py (scripts/, Stages 3-4): same functions as above. Command-line version with visualizations.
  • similarity_rag_console.ipynb (expr/, Stages 3-4): same functions as above. Streamlined notebook without visualization overhead.
  • similarity_rag_console.py (scripts/, Stages 3-4): same functions as above. Command-line version optimized for production.

Note:

  • Jupyter notebooks (.ipynb) are in the expr/ directory for interactive experimentation
  • Python scripts (.py) are in the scripts/ directory for command-line execution
  • All scripts contain identical logic to their corresponding notebooks
  • Scripts use relative paths (../outputs/, ../results/, etc.) to access data directories

Data Files

  • outputs/: features and similarity matrices; stores preprocessed data to avoid recomputation
  • results/: generated stories (CSV and JSON); final outputs for evaluation
  • logs/: generation logs; debugging info such as retrieved similar images, descriptions, and examples

Pipeline Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Feature Extraction (load_features.ipynb)          │
│ ┌─────────────┐    ┌──────────────┐    ┌─────────────┐   │
│ │ Load Dataset│ → │Extract CLIP  │ → │  Save to    │   │
│ │  (VWP)      │    │ Embeddings   │    │ outputs/    │   │
│ └─────────────┘    └──────────────┘    └─────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Similarity Computation (load_features.ipynb)      │
│ ┌─────────────┐    ┌──────────────┐    ┌─────────────┐   │
│ │ Compute     │ → │  Cosine      │ → │  Save to    │   │
│ │ Similarities│    │  Similarity  │    │ outputs/    │   │
│ └─────────────┘    └──────────────┘    └─────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Description Generation (similarity_rag.ipynb)     │
│                                                             │
│  For each image in sequence:                               │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────┐ │
│  │ Retrieve     │ → │  Few-shot    │ → │  Generate   │ │
│  │ Similar Image│    │  Prompting   │    │ Description │ │
│  └──────────────┘    └──────────────┘    └─────────────┘ │
│         ↓                    ↓                    ↓        │
│  Similarity Matrix    Similar Image      GPT-4o/mini      │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 4: Story Generation (similarity_rag.ipynb)           │
│ ┌──────────────┐    ┌──────────────┐    ┌─────────────┐  │
│ │ Concatenate  │ → │  Retrieve    │ → │  Generate   │  │
│ │ Descriptions │    │Similar Story │    │   Story     │  │
│ └──────────────┘    └──────────────┘    └─────────────┘  │
│         ↓                    ↓                    ↓        │
│  T5 Embeddings      Cosine Similarity    GPT-4o/mini      │
└─────────────────────────────────────────────────────────────┘
                            ↓
                  ┌─────────────────┐
                  │ results/*.csv   │
                  └─────────────────┘

Key Functions Reference

Feature Extraction & Preprocessing

# In load_features.ipynb
load_features(data)
# Extracts image/scene lists, embeddings, and creates ID mappings

preprocess_labelled_data(data)  
# Tokenizes stories and creates image-sentence mappings

preprocess_test_data(data)
# Processes test data (no story labels)

Similarity Computation

# In load_features.ipynb
similarity_matrix(vec_li_1, vec_li_2)
# Computes cosine similarity matrix between two lists of embeddings

Generation Functions

# In similarity_rag.ipynb
generate_img_desc_rag(client, model, img_base64, img_exp_base64, desc_exp)
# Generates image description using retrieved example
# - img_base64: Target image to describe
# - img_exp_base64: Retrieved similar image (example)
# - desc_exp: Description(s) of the similar image

generate_story(client, model, descriptions, story_example)
# Generates final story from image descriptions
# - descriptions: Concatenated image descriptions
# - story_example: Retrieved similar story

vgsg_rag(tokenizer, tgt_data, tgt_index_img_dic, tgt_src_sim, ...)
# Main pipeline orchestrator: runs full RAG process for all sequences
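
To show how these functions fit together, here is a simplified, hypothetical version of the orchestration loop. The name run_sequence and the parameters sim_rows, train_images_base64, train_descs, train_story_vecs, and train_stories are illustrative, as is the retrieve_story_example helper sketched earlier; generate_img_desc_rag and generate_story follow the signatures documented above, and the real vgsg_rag() additionally handles batching, logging, and saving intermediate results:

# Simplified, hypothetical orchestration loop (not the exact vgsg_rag implementation)
import numpy as np

def run_sequence(client, model, images_base64, sim_rows,
                 train_images_base64, train_descs, train_story_vecs, train_stories):
    descriptions = []
    for img, sims in zip(images_base64, sim_rows):
        j = int(np.argmax(sims))                      # most similar training image
        desc = generate_img_desc_rag(client, model, img,
                                     train_images_base64[j], train_descs[j])
        descriptions.append(desc)
    joined = " ".join(descriptions)
    example = retrieve_story_example(joined, train_story_vecs, train_stories)
    return generate_story(client, model, joined, example)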

Requirements

pip install tqdm numpy pandas matplotlib openai datasets transformers torch pillow

Required API:

  • OpenAI API key (for GPT-4o or GPT-4o-mini)
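
One way to provide the key is through the environment, as in the minimal sketch below (the notebooks and scripts here expect you to set the key as described in the Usage section):

# Sketch: create an OpenAI client with the key read from the environment
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])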

Usage

1. Feature Extraction (One-time setup)

Option A: Using Jupyter Notebook

cd expr
jupyter notebook load_features.ipynb
# Run all cells to generate files in ../outputs/ directory

Option B: Using Python Script

cd scripts
python load_features.py
# Generates files in ../outputs/ directory

2. Generate Stories

Option A: Using Jupyter Notebook (with visualizations)

cd expr
jupyter notebook similarity_rag.ipynb
# Set your OpenAI API key in the notebook
# Run cells to generate stories with visual debugging
# Results saved to ../results/ directory

Option B: Using Python Script

cd scripts
# Edit the script to set your API key
python similarity_rag.py
# Or use the console version (same functionality)
python similarity_rag_console.py
# Results saved to ../results/ directory

3. View Results

import pandas as pd
results = pd.read_csv('results/val_result.csv')
print(results[['story_id', 'generated_story']].head())

Directory Structure Notes

  • All Jupyter notebooks are in expr/ and should be run from that directory
  • All Python scripts are in scripts/ and should be run from that directory
  • Both use relative paths (../outputs/, ../results/, ../logs/, ../images/) to access shared data directories
  • This organization separates experimental notebooks from production scripts

Dataset Statistics

Split   Stories   Images   Avg Images/Story
Train    11,773   16,494   ~1.4
Val         849    1,759   ~2.1
Test        586    1,604   ~2.7

Method Advantages

  1. Visual Grounding: Uses actual image content (via CLIP embeddings) to retrieve relevant examples
  2. Style Consistency: Few-shot examples help maintain consistent description style
  3. Narrative Coherence: Story-level retrieval provides structural templates
  4. Scalability: Pre-computed similarities enable fast retrieval
  5. Modularity: Two-stage design allows independent optimization of descriptions and stories

Citation

If you use this code, please cite:

@misc{vgsg_rag,
  title={Visual Grounded Story Generation with Retrieval-Augmented Generation},
  author={Ruitao Feng},
  year={2024}
}

License

See LICENSE file for details.
