This repository provides a reproducible pipeline for downloading, parsing, and publishing up-to-date vector database (ChromaDB) dumps of the complete Star Trek Memory Alpha wiki. These vector DB artifacts are intended for use in downstream projects, such as search, RAG, or LLM applications.
- Automated Data Pipeline: Download, extract, and process the latest Memory Alpha XML dump
- ChromaDB Vector Database: Converts all articles into a persistent ChromaDB vector DB
- Compressed Artifact: Publishes a compressed, ready-to-use DB for easy distribution
- CI/CD Workflows: GitHub Actions for validation and release artifact publishing
- Containerized: All steps run in Docker or Dev Container for reproducibility
The easiest way to use the Memory Alpha vector database is to download the latest release artifact:
1. Go to the [Releases page](https://github.com/aniongithub/memoryalpha_chromadb/releases)
2. Download `enmemoryalpha_db.tar.gz`
3. Extract it:

   ```bash
   tar xzf enmemoryalpha_db.tar.gz
   # or: 7z x enmemoryalpha_db.tar.gz
   ```

4. Use the extracted `enmemoryalpha_db/` directory in your own ChromaDB-powered project.
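If you'd rather script this step, here's a minimal sketch using the GitHub releases API and the Python standard library. It assumes the artifact is attached to the latest release under the name `enmemoryalpha_db.tar.gz`, which is how the release workflow below publishes it:

```python
import json
import tarfile
import urllib.request

# Look up the latest release of this repository via the GitHub API
api_url = "https://api.github.com/repos/aniongithub/memoryalpha_chromadb/releases/latest"
with urllib.request.urlopen(api_url) as resp:
    release = json.load(resp)

# Find the compressed database asset and download it
asset = next(a for a in release["assets"] if a["name"] == "enmemoryalpha_db.tar.gz")
urllib.request.urlretrieve(asset["browser_download_url"], "enmemoryalpha_db.tar.gz")

# Extract the ChromaDB directory into the current working directory
with tarfile.open("enmemoryalpha_db.tar.gz", "r:gz") as tar:
    tar.extractall(".")
```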
Here's a minimal example of how to load the DB and perform a cosine similarity search:
```python
import sys

# ChromaDB needs a newer SQLite than many systems ship with;
# swap in pysqlite3 before chromadb is imported.
import pysqlite3
sys.modules["sqlite3"] = pysqlite3

import chromadb
from chromadb.config import Settings

# Open the extracted database and its article collection
client = chromadb.PersistentClient(path="enmemoryalpha_db", settings=Settings(allow_reset=True))
collection = client.get_or_create_collection("memoryalpha")

# Example query
query = "Who is Captain Picard?"
results = collection.query(query_texts=[query], n_results=3)

for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}:\nTitle: {results['metadatas'][0][i]['title']}\nContent: {doc[:300]}\n---")
```
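You can also ask ChromaDB to return distances alongside each match, which is handy for thresholding weak results. A small sketch (the `include` flag is standard ChromaDB; `title` is the only metadata field the example above relies on):

```python
# Request distances alongside documents and metadata;
# a smaller distance means a closer semantic match.
results = collection.query(
    query_texts=["Who is Captain Picard?"],
    n_results=3,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"{meta['title']} (distance: {dist:.4f})")
```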
To build the database yourself, you'll need:

- Docker
- VS Code with Dev Containers extension (optional)
```bash
git clone https://github.com/aniongithub/memoryalpha_chromadb.git
cd memoryalpha_chromadb
```

Open in VS Code and reopen in the container if desired.
```bash
# This will download, extract, vectorize, and compress the Memory Alpha database
./data-pipeline-docker.sh
```

The result will be a compressed ChromaDB artifact at `data/enmemoryalpha_db.tar.gz`.
You can now use `data/enmemoryalpha_db.tar.gz` in your own projects. Decompress and mount it as needed for downstream applications.
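For example, a downstream RAG application might stitch the retrieved passages into an LLM prompt. A minimal sketch, reusing the `collection` from the query example above (the question and prompt template are illustrative, and the actual LLM call is left to you):

```python
# Retrieve context for a question and assemble a simple RAG prompt
question = "What ships did Captain Picard command?"
results = collection.query(query_texts=[question], n_results=3)

context = "\n\n".join(
    f"[{meta['title']}]\n{doc}"
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
)
# Pass `prompt` to the LLM of your choice.
```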
```
memoryalpha_chromadb/
├── pipeline/                            # Data processing pipeline scripts
│   ├── 00-download-memory-alpha         # Download Memory Alpha dump
│   ├── 10-extract-memoryalpha-data      # Parse and create ChromaDB
│   ├── 20-compress-memoryalpha-db       # Compress database
│   └── pipeline.Dockerfile              # Pipeline container
├── data/                                # Data directory (gitignored)
│   ├── enmemoryalpha_pages_current.xml  # Raw Memory Alpha dump
│   ├── enmemoryalpha_db/                # ChromaDB database
│   └── enmemoryalpha_db.tar.gz          # Compressed database
├── data-pipeline-docker.sh              # Pipeline execution script
├── .github/workflows/                   # CI/CD workflows
└── README.md                            # This file
```
- Pull Request to main: Runs the pipeline as a CI check (no artifact published)
- Release Published: Runs the pipeline and uploads the compressed DB as a release asset
See `.github/workflows/` for details.
This project is licensed under the MIT License.
- Memory Alpha - The Star Trek wiki providing the comprehensive database
- Wikia/Fandom - Hosting the Memory Alpha XML dumps
- ChromaDB - Vector database for semantic search
Live long and prosper! 🖖