AI scrape reviews
This project is a FastAPI-based web scraper designed to extract and format reviews from a given webpage. Using Selenium for web scraping and OpenAI's API for processing, it outputs structured JSON data.
- Deep Content Extraction: To reach content that is rendered dynamically, the project uses Selenium with the Edge WebDriver. The browser is configured to resemble a typical desktop environment to reduce the risk of being blocked, and the page is scrolled so that all dynamic content is loaded.
- Content Cleaning: Irrelevant elements such as headers, footers, scripts, and styles are removed to focus on the essential content. This reduces noise and improves data quality.
- Content Conversion for LLMs: Large Language Models (LLMs) work better with plain text than with HTML or CSS, so the HTML content is converted to Markdown, which is more readable and easier for LLMs to process.
- Chunking for Efficiency: To stay within the LLM's context length limit, the Markdown content is divided into chunks of 6000 characters. Each chunk is small enough to be processed on its own, and together the chunks cover the full content.
- Model Utilization: The gpt-4o-mini model is used to extract structured review data from the text. This choice balances performance against resource constraints, providing accurate extraction while keeping the computational load manageable.
This approach ensures efficient data extraction and processing, leveraging Selenium for dynamic content handling and LLMs for intelligent information retrieval.
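The cleaning and chunking steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's actual code: the names `strip_noise` and `chunk_text` are hypothetical, and plain-text extraction stands in for the HTML-to-Markdown conversion, which is not shown. Only the list of removed elements and the 6000-character chunk size come from the description above.

```python
from html.parser import HTMLParser

# Elements treated as noise, per the "Content Cleaning" step above.
NOISE_TAGS = {"header", "footer", "script", "style"}
CHUNK_SIZE = 6000  # characters per chunk, as stated above


class _TextExtractor(HTMLParser):
    """Collects visible text while skipping everything inside noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a noise element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def strip_noise(html: str) -> str:
    """Return the text of `html` with noise elements removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split `text` into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]
```

A fixed-size split can cut a review in half at a chunk boundary. Splitting on paragraph breaks or overlapping adjacent chunks are common ways to soften that, at the cost of extra complexity.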
- Web Scraping: Utilizes Selenium to fetch and render HTML content from a specified webpage.
- Review Processing: Uses OpenAI's API to extract and format review information.
- RESTful API: Exposes a single endpoint to retrieve review data in JSON format.
- Python 3.10+
- Selenium
- FastAPI
- Uvicorn
- OpenAI API Key or Azure OpenAI key
- Microsoft Edge WebDriver
- Clone the Repository:
  git clone https://github.com/sourabhkv/LLM_review_API.git
  cd LLM_review_API
- Install Dependencies:
  pip install -r requirements.txt
- Set Up Microsoft Edge WebDriver:
  - Download the WebDriver from the official site and place it in the project directory.
  - Ensure the path is correctly set in the code (default: ./msedgedriver.exe).
- Configure Environment Variables:
  - Set OPENAI_ENDPOINT, OPENAI_API_KEY, DEPLOYMENT_NAME, and API_VERSION with your OpenAI credentials.
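As a rough sketch of how an application could read these four variables and fail early when one is missing: the function name `load_settings` and the returned dictionary are assumptions made for illustration, and the project's own code may load its configuration differently.

```python
import os
from typing import Mapping

# The four variable names listed in the configuration step above.
REQUIRED_VARS = ("OPENAI_ENDPOINT", "OPENAI_API_KEY", "DEPLOYMENT_NAME", "API_VERSION")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, or raise if any is missing or empty.

    Hypothetical helper; `env` is a parameter so the check can be tested
    without touching the real process environment.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Checking at startup means a missing key surfaces as one clear error message instead of a failure on the first API request.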
- Run the FastAPI Server:
  uvicorn app:app --reload
- Access the API:
  - Endpoint: /api/reviews
  - Method: GET
  - Query Parameter: page (URL of the webpage to scrape)
  - Example: http://localhost:8000/api/reviews?page=https://example.com
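Because the value of `page` is itself a URL, characters such as `?` and `&` inside it can be misread as part of the API's own query string unless they are percent-encoded. A small sketch of building the request URL safely follows; `build_reviews_url` is a hypothetical helper name, and the base URL is the local development address from the example above.

```python
from urllib.parse import urlencode


def build_reviews_url(base_url: str, page: str) -> str:
    """Build the GET URL for /api/reviews with `page` percent-encoded."""
    query = urlencode({"page": page})  # encodes ':', '/', '?', '&', '='
    return base_url.rstrip("/") + "/api/reviews?" + query


print(build_reviews_url("http://localhost:8000", "https://example.com"))
# http://localhost:8000/api/reviews?page=https%3A%2F%2Fexample.com
```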
- Example Response:
  {
    "reviews_count": 100,
    "reviews": [
      {
        "title": "Review Title",
        "body": "Review body text",
        "rating": 5,
        "reviewer": "Reviewer Name"
      },
      ...
    ]
  }
- See the results.json file for a complete sample output. It was generated from https://amzn.in/d/5GN09lj
- Returns an HTTP 500 error if an exception occurs during scraping or processing.
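On the client side, the response shape and the 500 error described above could be handled along these lines. This is a sketch under assumptions: the `Review` dataclass and `parse_reviews_response` are hypothetical names, the field names are taken from the example response, and the function takes a status code and body so it works with any HTTP client.

```python
import json
from dataclasses import dataclass


@dataclass
class Review:
    """One review, with the fields shown in the example response."""
    title: str
    body: str
    rating: int
    reviewer: str


def parse_reviews_response(status: int, body: str) -> list[Review]:
    """Turn an HTTP status and JSON body into Review objects, or raise."""
    if status == 500:
        # The API reports scraping or processing failures as HTTP 500.
        raise RuntimeError("server reported a scraping or processing failure")
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    payload = json.loads(body)
    return [
        Review(
            title=item["title"],
            body=item["body"],
            rating=item["rating"],
            reviewer=item["reviewer"],
        )
        for item in payload["reviews"]
    ]
```

Since the fields are produced by an LLM, a real client may also want to tolerate missing keys or non-integer ratings rather than assume every item is complete.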
This project is licensed under the MIT License.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
For any questions or support, please contact sourabhkv at [email protected].