This repo provides an example notebook showcasing the Common Crawl datasets on AWS. It has been tested on Amazon SageMaker and Jupyter Notebook, but is designed to run seamlessly in any Python notebook environment that supports standard Jupyter .ipynb execution.
All Common Crawl datasets are hosted in Amazon S3 buckets and are
publicly available over both the S3 and HTTPS protocols. To read more
about the data and how to use it, visit
https://commoncrawl.org/get-started. For the command-line version of
this notebook, see https://github.com/commoncrawl/whirlwind-python/.
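Because every object is reachable over both protocols, a small helper can map a crawl object key to its two public URLs. This is a minimal sketch assuming the public `commoncrawl` S3 bucket and the `data.commoncrawl.org` HTTPS gateway described on the Get Started page; the example key is illustrative.

```python
# Build the two public URLs for a Common Crawl object key.
# Assumes the public "commoncrawl" S3 bucket and the
# https://data.commoncrawl.org/ HTTPS gateway.

def cc_urls(key: str) -> dict:
    """Return the s3:// URI and HTTPS URL for a Common Crawl key."""
    key = key.lstrip("/")  # keys are relative paths within the bucket
    return {
        "s3": f"s3://commoncrawl/{key}",
        "https": f"https://data.commoncrawl.org/{key}",
    }

# Example: the list of WARC paths for one crawl (illustrative key)
urls = cc_urls("crawl-data/CC-MAIN-2023-50/warc.paths.gz")
print(urls["s3"])
print(urls["https"])
```

Either URL can then be handed to an S3 client or a plain HTTPS downloader, whichever the environment supports.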
NB: The current version of the warcio library (v1.7.5) does not support direct S3 access. In the meantime, this notebook downloads example filesets from the Common Crawl GitHub repository and works on them locally.
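The downloaded filesets are ordinary WARC files, the format warcio parses. As a stdlib-only illustration of the record layout (not a replacement for warcio), the sketch below builds a tiny in-memory WARC/1.0 record, gzip-compresses it the way crawl files are stored, then reads the version line, header block, and payload back out.

```python
import gzip
import io

# A tiny in-memory WARC record, per the WARC/1.0 spec:
# version line, named header fields, a blank line, then the payload.
payload = b"hello from common crawl"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    + f"Content-Length: {len(payload)}\r\n\r\n".encode()
    + payload + b"\r\n\r\n"
)

def read_record(stream):
    """Parse one WARC record: version line, headers dict, payload bytes."""
    version = stream.readline().strip().decode()
    headers = {}
    for line in iter(stream.readline, b"\r\n"):  # stop at the blank line
        name, _, value = line.decode().partition(":")
        headers[name.strip()] = value.strip()
    body = stream.read(int(headers["Content-Length"]))
    return version, headers, body

# Crawl WARC files are gzip-compressed, typically one member per record.
compressed = gzip.compress(record)
version, headers, body = read_record(io.BytesIO(gzip.decompress(compressed)))
print(version, headers["WARC-Type"], body.decode())
```

In the notebook itself, warcio's `ArchiveIterator` does this parsing (and multi-record gzip handling) for you over the locally downloaded files.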