kunalkumar168/PDFCrawler
💬 PDFCrawler (RAG + Ollama + Gradio)

This project is a Retrieval-Augmented Generation (RAG) chatbot that lets users ask questions about their PDF documents. It leverages:

  • 🧠 Ollama LLMs (e.g., Mistral)
  • 📚 Sentence Transformers for embeddings
  • 🔍 FAISS for vector similarity search
  • 🖥️ Gradio for a web-based chatbot interface

Pipeline

(Pipeline diagram: PDFs → chunking → embeddings → FAISS vector store → retrieval → Ollama LLM → Gradio UI)

The pipeline parses PDFs, splits them into chunks, and stores their embeddings in a FAISS vector database. At query time, the chatbot retrieves the most relevant document sections via embedding similarity and generates an answer with a locally hosted language model served by Ollama (e.g., Mistral). A Gradio-based web interface lets users interact with the chatbot, which returns the answer, a validation check against an expected answer (if provided), and source information.
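The retrieval step above can be sketched in miniature. This is an illustration only: it substitutes a toy bag-of-words embedding and a brute-force cosine-similarity search for the real Sentence Transformers + FAISS stack, but the shape of the logic is the same.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for Sentence Transformers)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (what FAISS does at scale)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Paris is the capital of France.",
    "The Odyssey was written by Homer.",
]
print(retrieve("Who wrote The Odyssey?", chunks))
```

In the actual pipeline, the retrieved chunks are then passed to the Ollama-served LLM as context for answer generation.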


📁 Project Structure

├── chatbot.py # RAG chatbot logic using Ollama + FAISS
├── vectordb_setup.py # Creates vector DB from PDF files
├── ChatBotGUI.py # Gradio web interface
├── create_relevance.py # Builds relevance lists for context and predicted/expected answers
├── files/ # Place your PDF files here
├── query.json # (Optional) Ground-truth Q&A for validation
├── vectordb/ # Saved FAISS vector DB (auto-generated)
├── requirements.txt # Python dependencies
└── setup.sh # Virtual environment + dependency setup

⚙️ Setup Instructions

1. Clone the Repository

git clone https://github.com/your-username/PDFCrawler.git
cd PDFCrawler

2. Run the Setup Script

bash setup.sh

This will:

  • Create a Python virtual environment
  • Install all dependencies listed in requirements.txt
  • Install and configure Ollama (if not already installed)

📦 How to Use

Step 1: Add PDFs

Place all your .pdf files inside the files/ folder.
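For reference, the PDF discovery step can be sketched as below. The files/ path comes from the project structure above; the actual loading logic lives in vectordb_setup.py and may differ.

```python
from pathlib import Path

def find_pdfs(folder="files"):
    """Collect all PDFs from the input folder (a sketch of what
    vectordb_setup.py presumably does when loading documents)."""
    p = Path(folder)
    return sorted(p.glob("*.pdf")) if p.is_dir() else []

print(f"Found {len(find_pdfs())} PDF(s) in files/")
```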

Step 2: Create the Vector Database

python3 vectordb_setup.py

This will:

  • Load and split PDFs
  • Generate embeddings using Sentence Transformers
  • Save vectors into a local FAISS vector store under vectordb/
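The "load and split" step can be illustrated with a minimal fixed-size splitter. The chunk size and overlap here are illustrative defaults, not the project's actual settings; the real splitter lives in vectordb_setup.py.

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks. Overlap keeps context
    that straddles a chunk boundary retrievable from either side."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 300  # stand-in for text extracted from a PDF
pieces = split_text(doc)
print(len(pieces), len(pieces[0]))
```

Each chunk is then embedded and indexed, so retrieval later operates on passages rather than whole documents.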

Step 3: Start the Chat Interface

python3 live-chat.py

This launches a Gradio web app for you to interact with your PDF-aware chatbot.

🧪 Optional: Add Expected Answers

If you'd like to validate chatbot responses, you can include a query.json file in this format:

{
  "queries": [
    {
      "question": "What is the capital of France?",
      "answer": "Paris"
    },
    {
      "question": "Who wrote The Odyssey?",
      "answer": "Homer"
    }
  ]
}

The chatbot will compare its response to the expected answer and show ✅/❌ match results.
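A simple way to picture that validation step is a loose containment check, sketched below. The exact comparison the project performs (in create_relevance.py or the GUI) may be different; this is just an assumed minimal version.

```python
import json

def matches(predicted, expected):
    """Loose check: does the expected answer appear in the model's response?
    (An assumed stand-in for the project's actual comparison logic.)"""
    return expected.strip().lower() in predicted.strip().lower()

# Same shape as query.json above.
queries = json.loads("""
{
  "queries": [
    {"question": "What is the capital of France?", "answer": "Paris"}
  ]
}
""")["queries"]

response = "The capital of France is Paris."
for q in queries:
    mark = "match" if matches(response, q["answer"]) else "no match"
    print(mark, "-", q["question"])
```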

🔍 Models Used

| Component    | Model Name                             |
|--------------|----------------------------------------|
| Embeddings   | sentence-transformers/all-MiniLM-L6-v2 |
| LLM (Ollama) | mistral                                |

You can change these in live-chat.py or chatbot.py.

🙌 Acknowledgements

  • LangChain
  • FAISS
  • HuggingFace Sentence Transformers
  • Ollama
  • Gradio

🚀 Future Improvements

  • Support for multi-modal PDFs (images + text)
  • Session memory for follow-up questions
  • Dockerized deployment and Hugging Face Spaces version
