As written on the website, Rotten Tomatoes and the Tomatometer score are the world's most trusted recommendation resources for quality entertainment. As the leading online aggregator of movie and TV show reviews from critics, they provide fans with a comprehensive guide to what's Fresh – and what's Rotten – in theaters and at home. The website provides general information on movies and TV shows, along with reviews containing the Tomatometer score and the audience score. The Tomatometer score is based on the opinions of hundreds of film and television critics and is a trusted measurement of critical recommendation for millions of fans, while the audience score represents the percentage of users who have rated a movie or TV show positively.
The author decided to scrape the most popular TV shows on this website because of the increase in entertainment consumption since the COVID-19 outbreak, especially in TV shows and series. A number of companies in the entertainment industry took this chance and released many new TV shows and series. Since the beginning of 2022, however, people have been going back to their normal routines, which has left them with less leisure time. Considering that condition, the author decided to do this project to help people decide what to watch, based on the most popular TV shows and series at the time, by providing information on each show along with its ratings.
The DBMS used to store the web-scraping results in this project is MongoDB. The author chose this DBMS for its high performance and flexibility. On top of that, it is also compatible with the `.json` format used when exporting the scraped data. Furthermore, MongoDB offers MongoDB Atlas, a cloud database service that simplifies the process of making a cluster in the cloud and is relatively safer.
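As a minimal sketch of how the exported `.json` file maps onto MongoDB, the snippet below loads the export and shows (in comments) how it could be inserted into an Atlas cluster. The connection string, database name, and collection name are placeholders, not the project's real values:

```python
import json

def load_shows(path):
    """Load the scraper's exported .json file (a JSON array of show documents)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Importing into MongoDB Atlas could then look like this (requires pymongo;
# the URI, database, and collection names below are placeholders):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
# client["rottentomatoes"]["tvshows"].insert_many(load_shows("tvshows.json"))
```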
These are the Python libraries and tools required to run the scraper program.

- **Jupyter Notebook**: Used to make the code easier to write and maintain. The scraper file is stored in `.ipynb` format.
- **BeautifulSoup**: Since the main language used in this project is Python, this library is used as the main library to scrape the contents of the website. Its syntax is fairly simple, easy to understand, and easy to use.
- **lxml**: Used as the HTML parser in this project. It is relatively faster than the HTML parser provided by Python because it is written in C.
- **Requests**: Used to access websites and request objects from them.
- **Selenium**: The Rotten Tomatoes website uses "load more" pagination and prevents the user from directly accessing anything beyond page 5, so this library is used to open the Chrome WebDriver. On top of that, it is also used to click the "Load More" button to reveal more pages to scrape.
- **time**: To avoid the website's anti-scraping mechanism, and to keep from overloading the server, the program uses the `time.sleep()` method to pause for a few seconds. The time library comes preinstalled with Python.
- **json**: Used to dump the scraped data to a `.json` file. The json library comes preinstalled with Python.
- **os**: Used to join the path and the file name when exporting the `.json` file. The os library comes preinstalled with Python.
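The export step with `json` and `os` can be sketched as below; the output directory, file name, and placeholder data are illustrative, not the project's actual values:

```python
import json
import os

# Placeholder data standing in for the scraped results.
shows = [{"title": "Example Show", "average_tomatometer": 90}]

# Join the output directory and file name, then dump the data as JSON.
out_path = os.path.join("output", "tvshows.json")  # hypothetical path and name
os.makedirs("output", exist_ok=True)
with open(out_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII characters (e.g. in titles) as-is.
    json.dump(shows, f, indent=4, ensure_ascii=False)
```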
To install all these libraries, open the directory where `libs.txt` is located in Command Prompt or Terminal and simply type `pip install -r libs.txt`.
- **Chrome WebDriver**: This tool is used with Selenium to access the desired pages of the web. If you don't have this tool on your device yet, you can download it here.
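Once a page's HTML is in hand, extracting fields with BeautifulSoup looks roughly like this. The markup and class names below are made up for illustration; the real Rotten Tomatoes HTML must be inspected in the browser, and the project uses the faster `lxml` parser rather than the built-in one shown here:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one show card on the browse page.
html = """
<div class="show-card">
  <a class="title">Example Show</a>
  <span class="tomatometer">95%</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # the project uses "lxml" instead
title = soup.find("a", class_="title").get_text(strip=True)
score = int(soup.find("span", class_="tomatometer").get_text(strip=True).rstrip("%"))
```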
- Make sure you already have all the required libraries and tools installed on your device. Also, make sure you have a stable internet connection before running the code, to prevent runtime errors (RTE).
- Clone this repository to your local directory.
- Change the path of the Chrome WebDriver according to the local directory on your device.
- Change the path and the name of the exported `.json` file to your liking.
- Open `scraper.ipynb` in Jupyter Notebook or any IDE that you may have.
- Run all the code cells.
The scraped data will be stored in a `.json` file with the structure written below.
```
{
    _id: {
        $oid (string) : _id is set as the default primary key in MongoDB and is automatically generated when exported from MongoDB
    }
    title (string)                : title of the series/TV show
    airing (string)               : airing years of the series (as a whole)
    synopsis (string)             : synopsis of the series/TV show
    average_tomatometer (int)     : average Tomatometer score of the whole series/TV show (in percent)
    average_audience-score (int)  : average audience score of the whole series/TV show (in percent)
    tv_network (string)           : TV network where the series/TV show can be watched
    premiere_date (string)        : premiere date of the whole series/TV show (in format yyyy-mm-dd)
    genre (string)                : genre of the series/TV show
    main_casts [(string)]         : names of the main casts of the whole series/TV show
    num_of_seasons (int)          : number of seasons the series/TV show has
    seasons_info : [
        {
            season_title (string) : title of the season
            airing_year (int)     : airing year of the season
            episodes (int)        : number of episodes in the season
            tomatometer (int)     : Tomatometer score of the season (in percent)
            audience_score (int)  : audience score of the season (in percent)
        }
    ]
}
```
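As an illustration, one document following this structure might look like the sketch below. All values are made-up placeholders, not real scraped data, and `_id` is omitted because MongoDB generates it automatically:

```python
import json

# Made-up placeholder document following the structure above (not real data).
example_show = {
    "title": "Example Show",
    "airing": "2019-2021",
    "synopsis": "A placeholder synopsis.",
    "average_tomatometer": 90,
    "average_audience-score": 85,
    "tv_network": "Example Network",
    "premiere_date": "2019-01-01",
    "genre": "Drama",
    "main_casts": ["Actor One", "Actor Two"],
    "num_of_seasons": 1,
    "seasons_info": [
        {
            "season_title": "Season 1",
            "airing_year": 2019,
            "episodes": 10,
            "tomatometer": 92,
            "audience_score": 88,
        }
    ],
}
print(json.dumps(example_show, indent=4))
```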
The following is the ERD of the database used to store the scraped data, with `_id` as the primary key.
The author made a simple API to access the online database. The API itself is capable of Insert and Read operations and is deployed at the URL below.
https://rottentomatoes-tvshows.herokuapp.com/
The API is written in JavaScript using NodeJS. These are some libraries and tools used to create the API. If you don't have NodeJS installed on your device yet, you can download it here.
You can see all the libraries used in the `package.json` inside the API folder.
- **body-parser**: This library is used to parse `req.body` in order to do the `POST` operation.
- **dotenv**: This library is used to read the `.env` file so that the `MONGO_URI`, which includes the username and password of the database, is not leaked to the public.
- **Express**: This library is used to simplify the process of building the web application used by the API.
- **Mongoose**: This library is used to create the schema and the model of the data posted to the web. It is also used to translate between objects in code and their representation in MongoDB.
- **nodemon**: This library is used to simplify the process of starting the API during development, as it wraps the Node app, watches the file system, and automatically restarts the program if any change is made.
- **Postman / Thunder Client**: These tools are used to test and use the API by sending requests. `GET` is used for Get All or Get by ID requests; `POST` is used for Insert requests.
- Open Postman or Thunder Client in VS Code.
- Copy the URL below.

  https://rottentomatoes-tvshows.herokuapp.com/tvshows

- Send requests with `GET` and `POST`:
  - **Get All**: send a `GET` request to the URL above.
  - **Get by ID**: add `/<id of the tv show>` to the URL above and send a `GET` request.
  - **Insert**: add `/post` to the URL above, type the JSON format of the data into the Body of the request, then send a `POST` request.
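The same Get All, Get by ID, and Insert requests can also be composed outside Postman. A minimal Python sketch using only the standard library is shown below; the request objects are built but not actually sent here:

```python
import json
import urllib.request

BASE_URL = "https://rottentomatoes-tvshows.herokuapp.com/tvshows"

def get_all_request():
    """Get All: a GET request to the base URL."""
    return urllib.request.Request(BASE_URL, method="GET")

def get_by_id_request(show_id):
    """Get by ID: append /<id of the tv show> to the base URL."""
    return urllib.request.Request(f"{BASE_URL}/{show_id}", method="GET")

def insert_request(document):
    """Insert: a POST request to /post with the JSON document as the body."""
    return urllib.request.Request(
        f"{BASE_URL}/post",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending a request would look like this (not executed here):
# with urllib.request.urlopen(get_all_request()) as resp:
#     shows = json.load(resp)
```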
- PyPI
- Selenium
- BeautifulSoup
- MongoDB
- Chrome WebDriver
- Mongoose API
- Express
- Web scraping: Web Scraping with Python - Beautiful Soup Crash Course
- Python & JSON:
  - Python JSON dump() and dumps() for JSON Encoding
  - Python Encode Unicode and non-ASCII characters as-is into JSON
- Selenium: Web Scraping with Selenium in Python
- REST API:
  - Build A Restful Api With Node.js Express & MongoDB | Rest Api Tutorial
  - Create a complete REST API with Node, Express and MongoDB | Deploy on Heroku
- Stack Overflow
- Geeks For Geeks
The data visualization of this database is made using MongoDB Charts, as it connects directly to the MongoDB database. The full dashboard can be accessed through TV Shows Dashboard.
Gresya Angelina Eunike Leman (18220104)
Information System and Technology
Institut Teknologi Bandung