Merged
5 changes: 3 additions & 2 deletions metaclip/README.md
@@ -1,7 +1,7 @@
# MetaCLIP

-This is a minimal demo/skeleton code of CLIP curation, please check Algorithm 1 in MetaCLIP paper.
-**This is not the production pipeline used to collect data for paper**.
+This is a minimal demo/skeleton code of CLIP curation, please check Algorithm 1 in [MetaCLIP paper](https://arxiv.org/pdf/2309.16671.pdf).
+**This is not the pipeline used to collect data in paper**.

## Part 1 Sub-string matching

@@ -37,6 +37,7 @@ Want a distributed system to parse the full CC and download a dataset? consider

## Part 2 Balancing (expected after image downloading/NSFW/dedup)


```bash
mkdir -p data/CC/balanced
python metaclip/balancing.py data/CC/matched data/CC/balanced 20000 # the magic 20k !
File renamed without changes.