This repository hosts the backend for the ContentLab - Automated Content Analysis & Suggestion Engine service. The service uses FastAPI to deliver content suggestions and related capabilities, including semantic search, automated content aggregation, and AI-generated insights. It uses MongoDB for scalable storage and vector search, with scheduled jobs for continuous data ingestion and analysis. Designed for content creators and analysts, it transforms raw news and social media data into actionable, searchable intelligence.
This backend is designed as a microservice, focusing on automated content analysis, aggregation, and AI-powered suggestions. Each component is modular, enabling independent scaling, maintenance, and integration with other services.
- **FastAPI Application:** The main entry point (`backend/main.py`) serves as the API gateway for all backend operations. It exposes REST endpoints for content retrieval, semantic search, user management, and AI-driven insights. FastAPI ensures high performance, automatic documentation, and easy integration with other microservices.
- **Database Layer:** The MongoDB connector (`backend/db/mdb.py`) abstracts all database interactions. It manages connections, queries, and CRUD operations for collections such as `news`, `reddit_posts`, `suggestions`, `drafts`, and `userProfiles`. This layer ensures data consistency and scalability, supporting high-throughput ingestion and retrieval.
- **API Routes:** Modular routers in `backend/routers/` separate concerns for different functionalities. Each router handles specific endpoints, making the codebase maintainable and extensible. This structure supports microservice best practices by isolating business logic.
- **Background Processing:** Automated data collection is handled by schedulers and scrapers. These run as background jobs, ingesting news and social media data at scheduled intervals. This ensures the microservice remains up to date with the latest content without manual intervention.
- **AI/ML Components:** Integration with AWS Bedrock enables advanced AI capabilities, such as semantic embeddings and content analysis. These components process raw data into actionable intelligence, supporting features like semantic search and automated suggestions.
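To make the wiring concrete, here is a minimal sketch of how such an entry point could assemble the routers. The router module names mirror `backend/routers/`, but the import paths, prefixes, and health endpoint are illustrative assumptions rather than the project's actual code:

```python
# Hypothetical sketch of a main.py-style entry point (illustrative only).
from fastapi import FastAPI

# Assumed import path; the real project keeps these under backend/routers/.
from routers import content, drafts, services

app = FastAPI(title="ContentLab Suggestion Engine")

# Each functional area lives in its own router, keeping concerns isolated.
app.include_router(content.router, prefix="/content", tags=["content"])
app.include_router(drafts.router, prefix="/drafts", tags=["drafts"])
app.include_router(services.router, prefix="/services", tags=["services"])

@app.get("/health")
def health() -> dict:
    """Simple liveness probe, useful for container orchestration."""
    return {"status": "ok"}
```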
The Suggestion Engine serves as a smart topic suggestion tool that analyzes user queries, gathers relevant data from news and social media, and presents actionable insights. Here’s a complete overview of the workflow from query to response:
- **Data Ingestion:** Scraped data from news and Reddit is ingested and stored in MongoDB collections (`news`, `reddit_posts`).
- **Embedding Generation:** A scheduled embedding process converts this content into semantic vectors using Cohere via AWS Bedrock.
- **Vector Indexing & Semantic Search:** MongoDB Atlas Vector Search enables fast, semantic similarity search on these vectors. The MongoDB Aggregation Framework is used during semantic search and result aggregation for efficient and flexible data retrieval.
- **Tavily Search Agent Integration:** The workflow leverages the Tavily Search Agent (visible on the drafts page in the frontend) to improve the search capabilities and provide more accurate, context-aware suggestions.
- **Topic Suggestion Generation:** Topic suggestions are generated based on the most relevant, semantically matched content. These suggestions are stored in the `suggestions` collection (referred to as Topic Suggestion in our codebase).
- **Insight Generation:** The engine analyses the aggregated data, identifies trends, and generates actionable topic suggestions using LLMs (e.g., Claude 3 Haiku via AWS Bedrock).
- **Response Presentation:** Actionable insights and suggested topics are displayed to the user in the frontend, enabling informed decision-making for content creation.
In summary:
The Suggestion Engine combines user input, real-time data aggregation, semantic search, and advanced agents like Tavily to deliver relevant, actionable topic suggestions directly in the user interface.
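To illustrate the semantic search step, the sketch below runs a MongoDB Atlas `$vectorSearch` aggregation against the `news` collection. The index name (`news_vector_index`) and the `embedding` field path are assumptions for illustration; only the collection name and the 1024-dimensional Cohere embeddings come from this README.

```python
# Minimal sketch of the semantic search step, assuming an Atlas Vector Search
# index named "news_vector_index" over an "embedding" field (both hypothetical).
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
news = client[os.environ.get("DATABASE_NAME", "contentlab")]["news"]

def semantic_search(query_vector: list[float], limit: int = 5) -> list[dict]:
    """Return the news documents most similar to the query embedding."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "news_vector_index",   # assumed index name
                "path": "embedding",            # assumed field holding the 1024-dim vector
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": limit,
            }
        },
        {
            "$project": {
                "title": 1,
                "url": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(news.aggregate(pipeline))
```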
Example documents for all collections are stored in JSON format under `backend/db/collections/`.
- `news`: Normalized scraped news articles plus generated embedding vectors for semantic search.
- `reddit_posts`: Subreddit posts (and optionally comments) enriched with metadata and embeddings for cross-source semantic matching.
- `suggestions`: AI-generated topic suggestions derived from aggregated news + Reddit relevance and LLM analysis.
- `drafts`: User-authored draft content (HTML / rich text) that exists only pre-publication.
- `userProfiles`: Per-user profile & preference data guiding filtering and tuning.
- `preview`: Once the user clicks the Publish button, drafts are stored in this collection as a clean version (with no HTML tags) and automatically connected to IST Media, where the article is published. You can explore that project as well.
- **Document Model:** MongoDB's flexible document model (using BSON/JSON) is well suited to storing scraped data, which often varies in structure and fields. Each scraped item can be stored as a document, allowing easy updates and schema evolution. The flexible schema reduces development time by allowing you to change data structures on the fly, which is crucial when dealing with varying scraped content formats.
- **Efficient Storage of Scraped Media Data:** Scraped media suggestions are stored as documents in a collection, enabling fast inserts and queries while supporting rich, nested data (metadata, tags, source info). MongoDB can deliver strong read and write performance compared to traditional relational databases, which is essential for handling large volumes of scraped content efficiently.
- **Daily Scheduler Integration:** As the scheduler runs daily, new suggestions are added or updated in the collection. MongoDB's atomic operations and upserts make it easy to update existing documents or add new ones without complex migrations (see the upsert sketch after this list).
- **Atlas Vector Search for Semantic Discovery:** MongoDB Atlas provides integrated vector search, enabling semantic search over scraped content. Users can find relevant topic suggestions based on meaning, not just keywords. By embedding textual data and storing vectors in MongoDB, you enable instant retrieval of semantically relevant content, which is ideal for intelligent media recommendations.
- **Schema Flexibility for Evolving Data:** MongoDB's schemaless nature allows easy adaptation to changes in data structure as scraping sources evolve, without downtime or complex migrations.
- **Easy:** MongoDB's document model naturally fits with object-oriented programming, utilizing BSON documents that closely resemble JSON. This design simplifies the management of complex data structures such as user accounts, allowing developers to build features like account creation, retrieval, and updates with greater ease.
- **Fast:** Following the principle of "Data that is accessed together should be stored together," MongoDB enhances query performance. This approach ensures that related data, like user and account information, can be quickly retrieved, optimizing the speed of operations such as account look-ups or status checks, which is crucial in services demanding real-time access to operational data.
- **Flexible:** MongoDB's schema flexibility allows account models to evolve with changing business requirements. This adaptability lets services update account structures or add features without expensive and disruptive schema migrations, thus avoiding the costly downtime often associated with structural changes.
- **Versatile:** The document model in MongoDB effectively handles a wide variety of data types, such as strings, numbers, booleans, arrays, objects, and even vectors. This versatility empowers applications to manage diverse account-related data, facilitating comprehensive solutions that integrate user, account, and transactional data seamlessly.
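As a concrete example of the upsert pattern mentioned in the Daily Scheduler Integration point, the sketch below inserts or refreshes a suggestion document in one atomic call. The match key (`topic`) and timestamp fields are hypothetical and only illustrate the idea:

```python
# Minimal sketch of the daily upsert pattern; field names are hypothetical.
import os
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
suggestions = client[os.environ.get("DATABASE_NAME", "contentlab")]["suggestions"]

def upsert_suggestion(topic: str, payload: dict) -> None:
    """Insert a new suggestion or refresh an existing one atomically."""
    suggestions.update_one(
        {"topic": topic},  # hypothetical unique key per suggestion
        {
            "$set": {**payload, "generated_at": datetime.now(timezone.utc)},
            "$setOnInsert": {"created_at": datetime.now(timezone.utc)},
        },
        upsert=True,
    )
```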
- pymongo for MongoDB connectivity and operations.
- boto3 for AWS SDK integration and Bedrock API access.
- botocore for low-level AWS service operations.
- cohere (invoked via Bedrock) for generating vector embeddings.
- requests for HTTP requests and API calls.
- beautifulsoup4 for HTML parsing and web scraping.
- langchain-community for additional LangChain tools and integrations.
- langchain-tavily for Tavily search integration with LangChain.
- scheduler for job scheduling and automated tasks.
- pytz for timezone handling in scheduled operations.
- python-dotenv for environment variable management.
- Claude 3 Haiku for text generation and content analysis through AWS Bedrock.
- Cohere Embed English v3 for generating 1024-dimensional vector embeddings for semantic search.
- Tavily Search API for primary topic research and content discovery.
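For orientation, here is a hedged sketch of how Claude 3 Haiku can be called through the Bedrock runtime with boto3. The prompt and region are placeholders, and the model ID should be checked against the models enabled in your Bedrock account:

```python
# Minimal sketch of invoking Claude 3 Haiku via AWS Bedrock with boto3.
import json
import os

import boto3

bedrock = boto3.client(
    "bedrock-runtime", region_name=os.environ.get("AWS_REGION", "us-east-1")
)

def summarize(text: str) -> str:
    """Ask Claude 3 Haiku for a short summary of the given text."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": f"Summarize the following content:\n\n{text}"}
        ],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # verify in your account
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```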
- **Content Router (`content.py`):** Exposes endpoints for fetching content suggestions, news articles, Reddit posts, and user profiles. It acts as the main interface for clients to access aggregated and analyzed content.
- **Drafts Router (`drafts.py`):** Provides CRUD operations for user-generated drafts. Supports user-specific access control, allowing secure creation, editing, and deletion of draft documents.
- **Services Router (`services.py`):** Offers advanced service endpoints, such as topic research and AI-powered content analysis. Enables integration with external tools and APIs for enhanced functionality.
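To illustrate the router pattern, here is a minimal sketch in the style of a drafts router. The paths, payload model, and in-memory store are hypothetical and stand in for the project's actual `drafts.py` and MongoDB layer:

```python
# Hypothetical sketch of a modular router in the style of backend/routers/drafts.py.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter()

class DraftIn(BaseModel):
    user_id: str
    title: str
    body_html: str

# In-memory stand-in for the MongoDB "drafts" collection (illustrative only).
_FAKE_DB: dict[str, DraftIn] = {}

@router.post("/")
def create_draft(draft: DraftIn) -> dict:
    """Create a draft owned by the requesting user."""
    draft_id = str(len(_FAKE_DB) + 1)
    _FAKE_DB[draft_id] = draft
    return {"id": draft_id}

@router.get("/{draft_id}")
def get_draft(draft_id: str) -> DraftIn:
    """Fetch a single draft, returning 404 if it does not exist."""
    if draft_id not in _FAKE_DB:
        raise HTTPException(status_code=404, detail="Draft not found")
    return _FAKE_DB[draft_id]
```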
- **News Scraper (`scrapers/news_scraper.py`):** Continuously collects articles from multiple news sources and categories. The scraper normalizes and stores articles in the database, preparing them for downstream processing and analysis.
- **Reddit Scraper (`scrapers/social_listening.py`):** Monitors and scrapes posts from configured subreddits using the PRAW library. Extracted posts are mapped to relevant topics and stored for semantic analysis and search.
- **Embedding Processor (`embeddings/process_embeddings.py`):** Transforms ingested news and Reddit content into high-dimensional vector embeddings using the Cohere Embed English model via AWS Bedrock. These embeddings power semantic search and similarity matching across the microservice.
- **Content Analyzer (`bedrock/llm_output.py`):** Utilizes Anthropic Claude models through AWS Bedrock to analyze and summarize content. Generates structured insights (topics, keywords, descriptions, labels) in JSON format, supporting automated suggestions and research.
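As a rough sketch of the embedding step, the snippet below calls the Cohere Embed English v3 model through Bedrock. The `input_type` choice is an illustrative default and not necessarily what `process_embeddings.py` uses:

```python
# Minimal sketch of generating Cohere embeddings via AWS Bedrock.
import json
import os

import boto3

bedrock = boto3.client(
    "bedrock-runtime", region_name=os.environ.get("AWS_REGION", "us-east-1")
)

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return 1024-dimensional embeddings for a batch of documents."""
    response = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({
            "texts": texts,
            "input_type": "search_document",  # use "search_query" when embedding queries
        }),
    )
    result = json.loads(response["body"].read())
    return result["embeddings"]
```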
Automated daily jobs ensure the microservice remains current and efficient:
- News scraping at 04:00 UTC: Ingests the latest news articles from configured sources.
- Reddit scraping at 04:15 UTC: Collects new Reddit posts and comments for analysis.
- Embedding processing at 04:30 UTC: Generates semantic embeddings for all newly ingested content.
- Content suggestion generation at 04:45 UTC: Creates AI-powered suggestions based on analyzed data.
- Cleanup tasks: Regularly prunes collections to maintain size limits and optimize performance.
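For reference, here is a minimal sketch of how such a daily schedule could be wired up, assuming the `schedule` package (version 1.1.0 or newer, with pytz for the UTC times). The job functions are placeholders, and the repository's actual scheduler module may be organized differently:

```python
# Hypothetical sketch of the daily job schedule, assuming the `schedule` package.
import time

import schedule

def scrape_news() -> None: ...           # placeholder for the news scraper job
def scrape_reddit() -> None: ...         # placeholder for the Reddit scraper job
def process_embeddings() -> None: ...    # placeholder for the embedding job
def generate_suggestions() -> None: ...  # placeholder for the suggestion job

schedule.every().day.at("04:00", "UTC").do(scrape_news)
schedule.every().day.at("04:15", "UTC").do(scrape_reddit)
schedule.every().day.at("04:30", "UTC").do(process_embeddings)
schedule.every().day.at("04:45", "UTC").do(generate_suggestions)

if __name__ == "__main__":
    while True:
        schedule.run_pending()
        time.sleep(30)
```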
This microservice architecture enables scalable, automated, and intelligent content analysis, making it ideal for integration into larger platforms or as a standalone backend for content-driven applications.
Before you begin, ensure you have met the following requirements:
- MongoDB Atlas account - Register Here
- Python 3.10 or higher (but less than 3.11)
- Poetry - Install Here
- AWS CLI configured with appropriate credentials - Installation Guide
- AWS Account with Bedrock access enabled - Sign up Here
- Reddit Developer Account for API access - Apply Here
- NewsAPI Account for news aggregation - Get API Key
- Tavily Search API account - Register Here
- Docker (optional, for containerized deployment) - Install Here
- **Fork the Repository:** Visit the GitHub repository page and click the Fork button in the top right corner to create your own copy of the repository under your GitHub account.
- **Clone Your Fork:** Open your terminal and run:
  git clone https://github.com/<your-username>/ist-media-internship-be.git
  cd ist-media-internship-be
- **(Optional) Set Up Upstream Remote:** To keep your fork up to date with the original repository, add the upstream remote:
  git remote add upstream https://github.com/<original-owner>/ist-media-internship-be.git
- **Create a MongoDB Atlas Database**
  - Log in to MongoDB Atlas and create a new project and cluster.
  - Create a database named `contentlab`.
  - Create the following collections if they do not already exist:
    - `news` (for storing scraped news articles with embeddings)
    - `reddit_posts` (for storing Reddit posts and comments with embeddings)
    - `suggestions` (for storing AI-generated topic suggestions)
    - `drafts` (for storing user-created draft documents)
    - `userProfiles` (for storing user profile information and preferences)
    - `preview` (once the user clicks the Publish button, drafts are stored here as a clean version with no HTML tags and automatically connected to IST Media, where the article is published; you can explore that project as well)
- **Create a MongoDB User**
  - Follow MongoDB's guide to create a user with readWrite access to the `contentlab` database.
- **Configure Environment Variables**
  - Copy the `.env.example` file to `.env` in the `/backend` directory (or create a new `.env` file).
  - Fill in the required environment variables as described later in this README.
- **Create the Vector Search Indexes**
  - Create the vector search index for the `news` and `reddit_posts` collections. You can do this using the MongoDB Atlas UI or by running the Python script `_vector_search_idx_creator.py` located in the `backend/` directory. Make sure to parametrize the script accordingly for each collection (a programmatic alternative is sketched below).
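If you prefer creating the indexes programmatically rather than through the Atlas UI or the provided script, here is a rough sketch using pymongo's search-index helpers (requires pymongo 4.7+ and an Atlas cluster). The index names and the `embedding` field path are assumptions and should be matched to what `_vector_search_idx_creator.py` actually configures:

```python
# Hedged sketch of creating Atlas Vector Search indexes programmatically.
import os

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGODB_URI"])
db = client[os.environ.get("DATABASE_NAME", "contentlab")]

for collection_name in ("news", "reddit_posts"):
    index = SearchIndexModel(
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",      # assumed embedding field name
                    "numDimensions": 1024,    # Cohere Embed English v3 dimension
                    "similarity": "cosine",
                }
            ]
        },
        name=f"{collection_name}_vector_index",  # assumed index name
        type="vectorSearch",
    )
    db[collection_name].create_search_index(model=index)
```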
**Important:** Create a `.env` file in the `/backend` directory with the following content:
MONGODB_URI=your_mongodb_uri
DATABASE_NAME=dbname
APP_NAME=appname
NEWS_COLLECTION=news
REDDIT_COLLECTION=reddit_posts
SUGGESTION_COLLECTION=suggestions
USER_PROFILES_COLLECTION=userProfiles
DRAFTS_COLLECTION=drafts
PREVIEW_COLLECTION=preview
AWS_REGION=us-east-1
NEWSAPI_KEY=your_newsapi_key
TAVILY_API_KEYS=your_tavily_key1,your_tavily_key2
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_SECRET=your_reddit_secret
REDDIT_USER_AGENT=your_reddit_user_agent
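As a quick illustration of how these variables are typically consumed, here is a minimal sketch using python-dotenv and pymongo; the repository's actual configuration loading may differ:

```python
# Minimal sketch of loading the .env configuration, assuming python-dotenv.
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up backend/.env when run from the backend/ directory

client = MongoClient(os.environ["MONGODB_URI"], appname=os.environ.get("APP_NAME"))
db = client[os.environ["DATABASE_NAME"]]
news = db[os.environ.get("NEWS_COLLECTION", "news")]
```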
- Open a terminal in the project root directory.
- Run the following commands:
  make poetry_start
  make poetry_install
- Verify that the `.venv` folder has been generated within the `/backend` directory.
To start the backend service, run:
poetry run uvicorn main:app --host 0.0.0.0 --port 8000
The default port is `8000`; modify the `--port` flag if needed.
Run the following command in the root directory:
make build
To remove the container and image:
make clean
You can access the API documentation by visiting the following URL:
http://localhost:<PORT_NUMBER>/docs
E.g. http://localhost:8000/docs
**Note:** Make sure to replace `<PORT_NUMBER>` with the port number you are using and ensure the backend is running.
**Important:** Check that you have created a `.env` file that contains the required environment variables.