commoncrawl/whirlwind-python-notebook

A notebook whirlwind tour of Common Crawl's datasets on AWS

This repo provides an example notebook showcasing the Common Crawl datasets on AWS. It has been tested on Amazon SageMaker and Jupyter Notebook, and is designed to run in any Python notebook environment that supports standard Jupyter .ipynb execution.

All Common Crawl datasets are hosted in AWS S3 buckets and are publicly available over both the S3 and HTTPS protocols. To read more about the data and how to use it, visit https://commoncrawl.org/get-started. For the command-line version of this notebook, see https://github.com/commoncrawl/whirlwind-python/
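Because the same objects are reachable over both protocols, it can help to convert between the two URL forms. The following is a minimal sketch, not code from the notebook. It assumes the public bucket is named `commoncrawl` and that the HTTPS endpoint is `data.commoncrawl.org`; neither is stated in this README, so check them against the get-started page. The helper names and the object path used in the usage note are hypothetical.

```python
from urllib.parse import urlparse

# Assumptions (not stated in this README): bucket name and HTTPS host.
S3_BUCKET = "commoncrawl"
HTTPS_HOST = "data.commoncrawl.org"


def s3_to_https(s3_url: str) -> str:
    """Map an s3:// URL in the assumed bucket to its HTTPS equivalent."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or parsed.netloc != S3_BUCKET:
        raise ValueError(f"not an s3://{S3_BUCKET}/ URL: {s3_url!r}")
    # parsed.path keeps its leading slash, so no separator is added here.
    return f"https://{HTTPS_HOST}{parsed.path}"


def https_to_s3(https_url: str) -> str:
    """Map an HTTPS URL on the assumed host back to its s3:// form."""
    parsed = urlparse(https_url)
    if parsed.scheme != "https" or parsed.netloc != HTTPS_HOST:
        raise ValueError(f"not an https://{HTTPS_HOST}/ URL: {https_url!r}")
    return f"s3://{S3_BUCKET}{parsed.path}"
```

For example, `s3_to_https("s3://commoncrawl/crawl-data/EXAMPLE/warc.paths.gz")` returns the corresponding `https://data.commoncrawl.org/...` URL; the `EXAMPLE` path segment is a placeholder, not a real crawl ID.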

NB: The current version of the warcio library (v1.7.5) does not support direct S3 access. In the meantime, this notebook downloads example filesets from the Common Crawl GitHub repository and works on them locally.

About

A Jupyter notebook illustrating the basics of Common Crawl's datasets.
