LLM_review_API

AI scrape reviews

Review Scraper API

This project is a FastAPI-based web scraper designed to extract and format reviews from a given webpage. Using Selenium for web scraping and OpenAI's API for processing, it outputs structured JSON data.

Solution Approach

- Deep Content Extraction: To reach content that is rendered dynamically, the project uses Selenium with the Edge WebDriver. The browser configuration mimics a typical desktop environment to reduce the risk of being blocked as an automated client. The page is scrolled so that lazily loaded content has a chance to load before the HTML is captured.
- Content Cleaning: Irrelevant elements such as headers, footers, scripts, and styles are removed so that only the essential content remains. This reduces noise and improves data quality.
- Content Conversion for LLMs: Large Language Models (LLMs) generally handle plain text better than raw HTML or CSS. The HTML content is therefore converted to Markdown, which is more readable and easier for LLMs to process.
- Chunking for Efficiency: To avoid exceeding the context length limit of the LLM, the Markdown content is divided into chunks of 6000 characters, so each request stays at a manageable size.
- Model Utilization: The gpt-4o-mini model is used to extract structured review data from the text. It was chosen as a balance between extraction quality and resource cost.
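The content cleaning step can be illustrated with a minimal sketch. It uses only the standard library `html.parser` module as a stand-in, since this README does not say which HTML library the project uses. The names `VisibleTextExtractor`, `SKIPPED_TAGS`, and `extract_visible_text` are hypothetical, and the sketch produces plain text rather than Markdown.

```python
from html.parser import HTMLParser

# Tags whose content is treated as noise (assumption: mirrors the list in this README).
SKIPPED_TAGS = {"header", "footer", "script", "style"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def extract_visible_text(html: str) -> str:
    """Return the text of `html` with header, footer, script, and style content removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<header>Menu</header><p>Great phone</p><script>var x = 1;</script>"
    print(extract_visible_text(page))
```

A real page would usually need a more tolerant parser, because unclosed tags would make this sketch skip more content than intended.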

This approach ensures efficient data extraction and processing, leveraging Selenium for dynamic content handling and LLMs for intelligent information retrieval.
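The chunking step described above can be sketched as a fixed-size split. This is an illustration under the assumption that chunks are cut purely by character count; the function name `chunk_text` is hypothetical and the project's actual splitting logic may differ.

```python
def chunk_text(text: str, chunk_size: int = 6000) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    markdown = "review text " * 1500  # stand-in for converted Markdown content
    chunks = chunk_text(markdown)
    print([len(chunk) for chunk in chunks])
```

A split by character count can cut a single review across two chunks, so a variant that breaks on paragraph boundaries may be worth considering if reviews appear truncated.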

Features

- Web Scraping: Utilizes Selenium to fetch and render HTML content from a specified webpage.
- Review Processing: Uses OpenAI's API to extract and format review information.
- RESTful API: Exposes a single endpoint to retrieve review data in JSON format.

Prerequisites

- Python 3.10+
- Selenium
- FastAPI
- Uvicorn
- OpenAI API Key or Azure OpenAI key
- Microsoft Edge WebDriver

Installation

Clone the Repository:

git clone https://github.com/sourabhkv/LLM_review_API.git
cd LLM_review_API

Install Dependencies:
```
pip install -r requirements.txt
```
Set Up Microsoft Edge WebDriver:
- Download the WebDriver from the official site and place it in the project directory.
- Ensure the path is correctly set in the code (default: ./msedgedriver.exe).
Configure Environment Variables:
- Set OPENAI_ENDPOINT, OPENAI_API_KEY, DEPLOYMENT_NAME, and API_VERSION to match your OpenAI or Azure OpenAI credentials.
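
As a quick check that the four variables are set before starting the server, a small helper along these lines could be used. It is a sketch only: `load_settings` and `REQUIRED_SETTINGS` are hypothetical names, and how app.py actually reads its configuration is not shown in this README.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the installation step above.
REQUIRED_SETTINGS = ("OPENAI_ENDPOINT", "OPENAI_API_KEY", "DEPLOYMENT_NAME", "API_VERSION")


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` with no arguments reads from the process environment; passing a dictionary makes the check easy to exercise in isolation.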

Usage

Run the FastAPI Server:
```
uvicorn app:app --reload
```
Access the API:
- Endpoint: /api/reviews
- Method: GET
- Query Parameter: page (URL of the webpage to scrape)
- Example: http://localhost:8000/api/reviews?page=https://example.com
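
When the target page URL contains its own query string, it should be percent-encoded before being passed as the page parameter, otherwise its `&` and `?` characters can be misread as part of the API request. A minimal sketch of building the request URL with the standard library follows; the helper name `build_reviews_url` and the default base URL are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_reviews_url(page_url: str, base_url: str = "http://localhost:8000") -> str:
    """Build the /api/reviews request URL with the page parameter percent-encoded."""
    query = urlencode({"page": page_url})
    return f"{base_url.rstrip('/')}/api/reviews?{query}"


if __name__ == "__main__":
    print(build_reviews_url("https://example.com/item?id=1&ref=x"))
```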

Response Format

{
  "reviews_count": 100,
  "reviews": [
    {
      "title": "Review Title",
      "body": "Review body text",
      "rating": 5,
      "reviewer": "Reviewer Name"
    },
    ...
  ]
}

See the results.json file for a sample response. It was generated from https://amzn.in/d/5GN09lj.
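
A client consuming this response could map it onto typed objects. The sketch below assumes the response matches the shape shown above; `Review` and `parse_reviews` are hypothetical names and are not part of the project.

```python
import json
from dataclasses import dataclass


@dataclass
class Review:
    title: str
    body: str
    rating: int
    reviewer: str


def parse_reviews(payload: str) -> list:
    """Parse a JSON response body into a list of Review objects."""
    data = json.loads(payload)
    return [
        Review(
            title=item["title"],
            body=item["body"],
            rating=item["rating"],
            reviewer=item["reviewer"],
        )
        for item in data["reviews"]
    ]
```

Fields are read by name rather than unpacked wholesale, so an unexpected extra key in a review does not break parsing, while a missing key raises a `KeyError`.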


Error Handling

The API returns an HTTP 500 error if an exception occurs during scraping or processing.

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

Contact

For any questions or support, please contact sourabhkv at [email protected].
