Index your entire codebase so that coding agents can search it semantically instead of by keyword alone. This basic setup combines pgvector, sentence-transformers, and the MCP Python SDK and is meant as a quick start.
- Semantic codebase search using vector embeddings
- Git-tracked files only with smart filtering
- Support for all text-based languages and file types
- MCP server for AI assistant integration
- Content hashing to skip unchanged files during re-indexing
- Intelligent chunking with overlapping sections
- Fast vector search with cosine similarity
- Docker and Docker Compose
- Python 3.11+
# 1. Start database and MCP server
docker-compose up -d --build
# 2. Install dependencies
pip install -r requirements.txt
# 3. Index codebase (one-time, 5-10 minutes)
python codebase_indexer.py

# Interactive search
python search_codebase.py
# Direct queries
python search_codebase.py "Angular component" -n 3
python search_codebase.py "HTTP service API" -n 5
python search_codebase.py "authentication guard" -n 2

The system includes an MCP (Model Context Protocol) server for AI assistant integration:
docker-compose up -d --build

Available MCP Tools:
- search_codebase(query, limit) - Semantic codebase search
- search_by_file_type(extension, query, limit) - Search specific file types
- get_codebase_stats() - Get indexing statistics
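
MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests. The sketch below builds the request body a client might send for `search_codebase`. The argument names `query` and `limit` come from the tool signatures listed above; their types (string and integer) and the helper name `build_tool_call` are assumptions for illustration. A real client also performs the MCP initialization handshake first, which the MCP Python SDK handles for you.

```python
import json


def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body for an MCP `tools/call` invocation.

    `build_tool_call` is a hypothetical helper, not part of this project.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Example: ask for the three best matches for a query.
payload = build_tool_call(
    "search_codebase",
    {"query": "authentication guard", "limit": 3},  # assumed argument types
)
body = json.dumps(payload)  # what would be POSTed to the MCP endpoint
```

In practice an AI assistant generates these calls itself once the server is registered in its client configuration.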
Client Configuration:
{
"mcpServers": {
"codebase-search": {
"type": "http",
"url": "http://localhost:8111/mcp"
}
}
}

Example queries:
"authentication service" → Find auth-related services
"HTTP interceptor" → Find HTTP interceptors
"Angular component routing" → Find routing components
"database migration" → Find migration scripts
"service injection" → Find dependency injection
"guard implementation" → Find route guards
"error handling" → Find error handling code
"form validation" → Find form validation logic
"about this app" → Find the about this app pages
"search functionality" → Find search implementations
"user authentication" → Find user auth workflows
- Git-tracked files only: Uses git ls-files to discover files
- Text files only: Automatically detects and skips binary files
- All languages: TypeScript, JavaScript, HTML, CSS, JSON, Markdown, etc.
- Smart filtering: Excludes images, fonts, archives automatically
- Respects .gitignore: Only indexes what git tracks
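
The discovery and filtering rules above can be sketched as follows. This is a minimal illustration, not the code in codebase_indexer.py: the function names, the extension list, and the NUL-byte heuristic for detecting binary files are all assumptions.

```python
import subprocess
from pathlib import Path

# Hypothetical skip list; the real indexer may use a different set.
SKIPPED_EXTENSIONS = {".png", ".jpg", ".gif", ".ico", ".woff", ".woff2", ".zip", ".gz"}


def list_tracked_files(repo_root: str = ".") -> list[str]:
    """Return git-tracked paths relative to repo_root, using `git ls-files -z`."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],  # -z separates paths with NUL bytes
        cwd=repo_root,
        capture_output=True,
        check=True,
    )
    return [p for p in result.stdout.decode("utf-8").split("\0") if p]


def looks_binary(sample: bytes) -> bool:
    """Assumed heuristic: treat content containing a NUL byte as binary."""
    return b"\0" in sample


def is_indexable(path: str, sample: bytes) -> bool:
    """Skip known non-text extensions first, then fall back to the content check."""
    if Path(path).suffix.lower() in SKIPPED_EXTENSIONS:
        return False
    return not looks_binary(sample)
```

Because `git ls-files` only reports tracked files, anything excluded by .gitignore never reaches the filter in the first place.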
Makes use of the pgvector extension for vector search.
CREATE TABLE codebase_chunks
(
--- embedding_dim=384
embedding halfvec(384)
);

The embedding model is configurable via EMBEDDING_MODEL (default: all-MiniLM-L6-v2, which produces 384-dimensional embeddings).
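
For intuition about the ranking, pgvector's `<=>` operator returns cosine distance, which is 1 minus cosine similarity. The pure-Python sketch below reproduces that ranking in memory. It only illustrates the math: the project's actual SQL query is not shown in this README, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5):
    """Rank (chunk_id, vector) pairs by similarity to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the database, the equivalent ordering would come from sorting by `embedding <=> query_vector` in ascending order, since a smaller distance means a closer match.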
DB connection and model configuration are stored in .env.
DB_HOST=localhost
DB_PORT=5432
POSTGRES_DB=codebase-search
POSTGRES_USER=dev
POSTGRES_PASSWORD=dev
# Model Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
#EMBEDDING_MODEL=nomic-ai/nomic-embed-code
# Chunking Configuration
TOKENIZER_ENCODING=cl100k_base
MAX_TOKENS_PER_CHUNK=500
CHUNK_OVERLAP=50
# Where to find the codebase on the host machine
HOST_PROJECT_ROOT=.

- Auto-detection: Embedding dimensions are automatically detected from the selected model
- Chunking: Token limits and overlap are configurable via environment variables
- Tokenizer: Uses OpenAI's cl100k_base tokenizer by default (works well across models)
- Model flexibility: Change EMBEDDING_MODEL to switch between different Sentence Transformer models
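
As a rough illustration of overlapping chunking, the sketch below splits a token sequence into windows of at most MAX_TOKENS_PER_CHUNK tokens, where each window starts CHUNK_OVERLAP tokens before the previous one ended. The function names are hypothetical and the indexer's real chunking logic may differ. In a real run the tokens could come from `tiktoken.get_encoding("cl100k_base").encode(text)`.

```python
import os


def load_chunk_settings(env=None) -> tuple[int, int]:
    """Read chunking settings, falling back to the defaults shown in .env."""
    env = os.environ if env is None else env
    max_tokens = int(env.get("MAX_TOKENS_PER_CHUNK", "500"))
    overlap = int(env.get("CHUNK_OVERLAP", "50"))
    return max_tokens, overlap


def chunk_tokens(tokens: list, max_tokens: int = 500, overlap: int = 50) -> list[list]:
    """Split tokens into windows of up to max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With the default settings, a 1,200-token file yields three chunks starting at tokens 0, 450, and 900, so content near a chunk boundary appears in both neighboring chunks.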
After code changes:
python codebase_indexer.py

The system uses content hashing to skip unchanged files for efficiency.
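
One way such a skip check can work is sketched below. It assumes SHA-256 as the hash and a simple path-to-hash mapping as the stored state. Neither is confirmed by this README, and the function names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(current: dict[str, bytes], stored_hashes: dict[str, str]) -> list[str]:
    """Return paths whose content is new or differs from the stored hash."""
    return sorted(
        path
        for path, data in current.items()
        if stored_hashes.get(path) != content_hash(data)
    )


# Example: a.ts is unchanged and b.ts is new, so only b.ts needs embedding.
stored = {"a.ts": content_hash(b"export const a = 1;")}
current = {"a.ts": b"export const a = 1;", "b.ts": b"export const b = 2;"}
pending = files_to_reindex(current, stored)
```

Since embedding is the expensive step, comparing hashes first keeps re-indexing time proportional to the number of changed files.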
The MCP server enables AI assistants to:
- Code Analysis: Understand existing code patterns and architectures
- Bug Investigation: Find relevant code sections for debugging
- Feature Development: Locate similar implementations for reference
- Code Review: Search for best practices and consistency patterns
- Documentation: Find examples and usage patterns
- Refactoring: Identify code that needs to be updated together
Connection Errors: Make sure the containers are running. To rebuild and recreate them, run docker compose up -d --build --force-recreate
- codebase_indexer.py - The main indexing script
- search_codebase.py - Command-line search interface
- mcp_server.py - MCP server for AI assistant integration
- Dockerfile - Container image for the MCP server
- docker-compose.yml - PostgreSQL database with pgvector
- requirements.txt - Python dependencies
- .env - Database configuration
- AGENTS.md - Instructions for AI assistant integration