This repo provides an example notebook showcasing the Common Crawl datasets on AWS. It has been tested on Amazon SageMaker and Jupyter Notebook, but is designed to run seamlessly in any Python notebook environment that supports standard Jupyter .ipynb execution.
All Common Crawl datasets are hosted in Amazon S3 buckets and are
publicly available over both the S3 and HTTPS protocols. To read more
about the data and how to use it, visit
https://commoncrawl.org/get-started. For the command-line version of
this notebook, see https://github.com/commoncrawl/whirlwind-python/.
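Because every object is reachable over both protocols, a small helper can map a crawl object key to its two public URLs. This is a minimal sketch assuming the public `commoncrawl` S3 bucket and the `data.commoncrawl.org` HTTPS gateway described on the Get Started page; the example key is illustrative.

```python
# Build the two public URLs for a Common Crawl object key.
# Assumes the public "commoncrawl" S3 bucket and the
# https://data.commoncrawl.org/ HTTPS gateway.

def cc_urls(key: str) -> dict:
    """Return the s3:// URI and HTTPS URL for a Common Crawl key."""
    key = key.lstrip("/")  # keys are relative paths within the bucket
    return {
        "s3": f"s3://commoncrawl/{key}",
        "https": f"https://data.commoncrawl.org/{key}",
    }

# Example: the list of WARC paths for one crawl (illustrative key)
urls = cc_urls("crawl-data/CC-MAIN-2023-50/warc.paths.gz")
print(urls["s3"])
print(urls["https"])
```

Either URL can then be handed to an S3 client or a plain HTTPS downloader, whichever the environment supports.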
NB: The current version of the warcio library (v1.7.5) does not support direct S3 access. In the meantime, this notebook downloads example filesets from the Common Crawl GitHub repository and works on them locally.
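The downloaded filesets are ordinary WARC files, the format warcio parses. As a stdlib-only illustration of the record layout (not a replacement for warcio), the sketch below builds a tiny in-memory WARC/1.0 record, gzip-compresses it the way crawl files are stored, then reads the version line, header block, and payload back out.

```python
import gzip
import io

# A tiny in-memory WARC record, per the WARC/1.0 spec:
# version line, named header fields, a blank line, then the payload.
payload = b"hello from common crawl"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    + f"Content-Length: {len(payload)}\r\n\r\n".encode()
    + payload + b"\r\n\r\n"
)

def read_record(stream):
    """Parse one WARC record: version line, headers dict, payload bytes."""
    version = stream.readline().strip().decode()
    headers = {}
    for line in iter(stream.readline, b"\r\n"):  # stop at the blank line
        name, _, value = line.decode().partition(":")
        headers[name.strip()] = value.strip()
    body = stream.read(int(headers["Content-Length"]))
    return version, headers, body

# Crawl WARC files are gzip-compressed, typically one member per record.
compressed = gzip.compress(record)
version, headers, body = read_record(io.BytesIO(gzip.decompress(compressed)))
print(version, headers["WARC-Type"], body.decode())
```

In the notebook itself, warcio's `ArchiveIterator` does this parsing (and multi-record gzip handling) for you over the locally downloaded files.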