As written on the website, Rotten Tomatoes and the Tomatometer score are the world's most trusted recommendation resources for quality entertainment. As the leading online aggregator of movie and TV show reviews from critics, they provide fans with a comprehensive guide to what's Fresh – and what's Rotten – in theaters and at home. The website provides general information on movies and TV shows, along with reviews containing the Tomatometer score and the audience score. The Tomatometer score is based on the opinions of hundreds of film and television critics and is a trusted measurement of critical recommendation for millions of fans, while the audience score represents the percentage of users who have rated a movie or TV show positively.
The author decided to scrape the most popular TV shows on this website because of the increase in entertainment consumption since the COVID-19 outbreak, especially in TV shows and series. A number of companies in the entertainment industry took this chance and released many new TV shows and series. Since the beginning of 2022, however, people have been going back to their normal routines, which has left them with less leisure time. Considering that condition, the author decided to do this project to help people decide what to watch, based on the most popular TV shows and series at the time, by providing information on each show along with its ratings.
The DBMS used to store the web-scraping results in this project is MongoDB. The author chose this DBMS for its high performance and flexibility. On top of that, it is also compatible with the `.json` format used when exporting the scraped data. Furthermore, MongoDB offers MongoDB Atlas, a cloud database service that simplifies the process of making a cluster in the cloud and is relatively safer.
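As a minimal sketch of how the exported `.json` file maps onto MongoDB, the snippet below loads the export and shows (in comments) how it could be inserted into an Atlas cluster. The connection string, database name, and collection name are placeholders, not the project's real values:

```python
import json

def load_shows(path):
    """Load the scraper's exported .json file (a JSON array of show documents)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Importing into MongoDB Atlas could then look like this (requires pymongo;
# the URI, database, and collection names below are placeholders):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
# client["rottentomatoes"]["tvshows"].insert_many(load_shows("tvshows.json"))
```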
These are the Python libraries and tools required to run the scraper program.

- **Jupyter Notebook**: Used to make the code easier to write and maintain. The scraper file is stored in `.ipynb` format.
- **BeautifulSoup**: Since the main language used in this project is Python, this library is used as the main library to scrape the contents of the website. Its syntax is fairly simple, easy to understand, and easy to use.
- **lxml**: Used as the HTML parser in this project. It is relatively faster than the HTML parser provided by Python because it is written in C.
- **Requests**: Used to access websites and request objects from them.
- **Selenium**: The Rotten Tomatoes website uses "load more" pagination and prevents the user from directly accessing anything beyond page 5, so this library is used to open the Chrome WebDriver. On top of that, it is also used to click the "Load More" button to reveal more pages to scrape.
- **time**: To avoid the website's anti-scraping mechanism, and to keep from overloading the server, the program uses the `time.sleep()` method to pause for a few seconds. The time library comes preinstalled with Python.
- **json**: Used to dump the scraped data to a `.json` file. The json library comes preinstalled with Python.
- **os**: Used to join the path and the file name when exporting the `.json` file. The os library comes preinstalled with Python.
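The export step with `json` and `os` can be sketched as below; the output directory, file name, and placeholder data are illustrative, not the project's actual values:

```python
import json
import os

# Placeholder data standing in for the scraped results.
shows = [{"title": "Example Show", "average_tomatometer": 90}]

# Join the output directory and file name, then dump the data as JSON.
out_path = os.path.join("output", "tvshows.json")  # hypothetical path and name
os.makedirs("output", exist_ok=True)
with open(out_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII characters (e.g. in titles) as-is.
    json.dump(shows, f, indent=4, ensure_ascii=False)
```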
To install all these libraries, open the directory where `libs.txt` is located in Command Prompt or Terminal and simply type `pip install -r libs.txt`.
- **Chrome WebDriver**: This tool is used with Selenium to access the desired pages of the web. If you don't have this tool on your device yet, you can download it here.
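Once a page's HTML is in hand, extracting fields with BeautifulSoup looks roughly like this. The markup and class names below are made up for illustration; the real Rotten Tomatoes HTML must be inspected in the browser, and the project uses the faster `lxml` parser rather than the built-in one shown here:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one show card on the browse page.
html = """
<div class="show-card">
  <a class="title">Example Show</a>
  <span class="tomatometer">95%</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # the project uses "lxml" instead
title = soup.find("a", class_="title").get_text(strip=True)
score = int(soup.find("span", class_="tomatometer").get_text(strip=True).rstrip("%"))
```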
- Make sure you already have all the required libraries and tools installed on your device. Also, make sure you have a stable internet connection before running the code, to prevent runtime errors (RTE).
- Clone this repository to your local directory.
- Change the path of the Chrome WebDriver according to the local directory on your device.
- Change the path and the name of the exported `.json` file to your liking.
- Open `scraper.ipynb` in Jupyter Notebook or any IDE that you may have.
- Run all the code cells.
The scraped data will be stored in a `.json` file with the structure written below.
```
{
    _id: {
        $oid (string) : _id is set as the default primary key in MongoDB and is automatically generated when exported from MongoDB
    }
    title (string)                : title of the series/TV show
    airing (string)               : airing years of the series (as a whole)
    synopsis (string)             : synopsis of the series/TV show
    average_tomatometer (int)     : average Tomatometer score of the whole series/TV show (in percent)
    average_audience-score (int)  : average audience score of the whole series/TV show (in percent)
    tv_network (string)           : TV network where the series/TV show can be watched
    premiere_date (string)        : premiere date of the whole series/TV show (in format yyyy-mm-dd)
    genre (string)                : genre of the series/TV show
    main_casts [(string)]         : names of the main casts of the whole series/TV show
    num_of_seasons (int)          : number of seasons the series/TV show has
    seasons_info : [
        {
            season_title (string) : title of the season
            airing_year (int)     : airing year of the season
            episodes (int)        : number of episodes in the season
            tomatometer (int)     : Tomatometer score of the season (in percent)
            audience_score (int)  : audience score of the season (in percent)
        }
    ]
}
```
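As an illustration, one document following this structure might look like the sketch below. All values are made-up placeholders, not real scraped data, and `_id` is omitted because MongoDB generates it automatically:

```python
import json

# Made-up placeholder document following the structure above (not real data).
example_show = {
    "title": "Example Show",
    "airing": "2019-2021",
    "synopsis": "A placeholder synopsis.",
    "average_tomatometer": 90,
    "average_audience-score": 85,
    "tv_network": "Example Network",
    "premiere_date": "2019-01-01",
    "genre": "Drama",
    "main_casts": ["Actor One", "Actor Two"],
    "num_of_seasons": 1,
    "seasons_info": [
        {
            "season_title": "Season 1",
            "airing_year": 2019,
            "episodes": 10,
            "tomatometer": 92,
            "audience_score": 88,
        }
    ],
}
print(json.dumps(example_show, indent=4))
```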
The following is the ERD of the database used to store the scraped data, with `_id` as the primary key.
The author made a simple API to access the online database. The API itself is capable of Insert and Read operations and is deployed at the URL below.
https://rottentomatoes-tvshows.herokuapp.com/
The API is written in JavaScript using NodeJS. These are some libraries and tools used to create the API. If you don't have NodeJS installed on your device yet, you can download it here.
You can see all the libraries used in the `package.json` inside the API folder.
- **body-parser**: This library is used to parse `req.body` in order to do the `POST` operation.
- **dotenv**: This library is used to read the `.env` file so that the `MONGO_URI`, which includes the username and password of the database, is not leaked to the public.
- **Express**: This library is used to simplify the process of building the web application used by the API.
- **Mongoose**: This library is used to create the schema and the model of the data posted to the web. It is also used to translate between objects in code and their representation in MongoDB.
- **nodemon**: This library is used to simplify the process of starting the API during development, as it wraps the Node app, watches the file system, and automatically restarts the program if any change is made.
- **Postman / Thunder Client**: These tools are used to test and use the API by sending requests. `GET` is used for Get All or Get by ID requests; `POST` is used for Insert requests.
- Open Postman or Thunder Client in VS Code.
- Copy the URL below.

  https://rottentomatoes-tvshows.herokuapp.com/tvshows

- Send requests with `GET` and `POST`:
  - **Get All**: send a `GET` request to the URL above.
  - **Get by ID**: add `/<id of the tv show>` to the URL above and send a `GET` request.
  - **Insert**: add `/post` to the URL above, type the JSON format of the data into the Body of the request, then send a `POST` request.
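The same Get All, Get by ID, and Insert requests can also be composed outside Postman. A minimal Python sketch using only the standard library is shown below; the request objects are built but not actually sent here:

```python
import json
import urllib.request

BASE_URL = "https://rottentomatoes-tvshows.herokuapp.com/tvshows"

def get_all_request():
    """Get All: a GET request to the base URL."""
    return urllib.request.Request(BASE_URL, method="GET")

def get_by_id_request(show_id):
    """Get by ID: append /<id of the tv show> to the base URL."""
    return urllib.request.Request(f"{BASE_URL}/{show_id}", method="GET")

def insert_request(document):
    """Insert: a POST request to /post with the JSON document as the body."""
    return urllib.request.Request(
        f"{BASE_URL}/post",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending a request would look like this (not executed here):
# with urllib.request.urlopen(get_all_request()) as resp:
#     shows = json.load(resp)
```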
- PyPI
- Selenium
- BeautifulSoup
- MongoDB
- Chrome WebDriver
- Mongoose API
- Express
- Web scraping: Web Scraping with Python - Beautiful Soup Crash Course
- Python & JSON:
  - Python JSON dump() and dumps() for JSON Encoding
  - Python Encode Unicode and non-ASCII characters as-is into JSON
- Selenium: Web Scraping with Selenium in Python
- REST API:
  - Build A Restful Api With Node.js Express & MongoDB | Rest Api Tutorial
  - Create a complete REST API with Node, Express and MongoDB | Deploy on Heroku
- Stack Overflow
- Geeks For Geeks
The data visualization of this database is made using MongoDB Charts, as it connects directly to the MongoDB database. The full dashboard can be accessed through TV Shows Dashboard.
Gresya Angelina Eunike Leman (18220104)
Information System and Technology
Institut Teknologi Bandung