Commit b2d310b

Damian Stewart committed

address comments
1 parent 432473c commit b2d310b

2 files changed: +11 -10 lines changed

Makefile

Lines changed: 3 additions & 3 deletions
@@ -36,13 +36,13 @@ extract:
 	@echo "hint: python -m json.tool extraction.json"

 cdx_toolkit:
-	@echo lookup captures for the given url in the commoncrawl cdx index for CC-MAIN-2024-22, returning only the first match
-	cdxt --limit 1 --crawl CC-MAIN-2024-22 iter an.wikipedia.org/wiki/Escopete
+	@echo demonstrate that we have this entry in the index
+	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
 	@echo
 	@echo cleanup previous work
 	rm -f TEST-000000.extracted.warc.gz
 	@echo retrieve the content from the commoncrawl s3 bucket
-	cdxt --limit 1 --crawl CC-MAIN-2024-22 warc an.wikipedia.org/wiki/Escopete
+	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
 	@echo
 	@echo index this new warc
 	cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
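The `cdxt` commands above are a thin CLI over cdx_toolkit's Python API. For orientation, here is a minimal sketch of the same pinpoint lookup in Python; `CDXFetcher` and `iter` are part of cdx_toolkit's documented interface, while the `from_ts`/`to` keyword names are assumed to mirror the `--from`/`--to` CLI flags and are worth verifying against the installed version.

```
import cdx_toolkit

# Query the Common Crawl CDX index, as `cdxt ... iter` does.
cdx = cdx_toolkit.CDXFetcher(source='cc')

# Collapse the timestamp range to a single value to pinpoint one capture.
# NOTE: the from_ts/to keyword names are assumed to mirror --from/--to.
for obj in cdx.iter('an.wikipedia.org/wiki/Escopete',
                    from_ts='20240518015810', to='20240518015810'):
    print(obj['status'], obj['timestamp'], obj['url'])
```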

README.md

Lines changed: 8 additions & 7 deletions
@@ -350,14 +350,14 @@ The output looks like this:
 <summary>Click to view output</summary>

 ```
-lookup captures for the given url in the commoncrawl cdx index for CC-MAIN-2024-22, returning only the first match
-cdxt --limit 1 --crawl CC-MAIN-2024-22 iter an.wikipedia.org/wiki/Escopete
+demonstrate that we have this entry in the index
+cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
 status 200, timestamp 20240518015810, url https://an.wikipedia.org/wiki/Escopete

 cleanup previous work
 rm -f TEST-000000.extracted.warc.gz
 retrieve the content from the commoncrawl s3 bucket
-cdxt --limit 1 --crawl CC-MAIN-2024-22 warc an.wikipedia.org/wiki/Escopete
+cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete

 index this new warc
 cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
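As an aside, the freshly extracted `TEST-000000.extracted.warc.gz` can be inspected record by record with warcio, the WARC library from the same Webrecorder ecosystem as `cdxj-indexer`. A minimal sketch, assuming the file exists in the working directory:

```
from warcio.archiveiterator import ArchiveIterator

# Walk the records in the extracted WARC: we expect a warcinfo record
# followed by the single response record we requested.
with open('TEST-000000.extracted.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        print(record.rec_type,
              record.rec_headers.get_header('WARC-Target-URI'))
```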
@@ -377,14 +377,15 @@ There's a lot going on here so let's unpack it a little.

 #### Check that the crawl has a record for the page we are interested in

-We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the crawl identifier `CC-MAIN-2024-22`. The result of this tells us that the crawl successfully fetched this page at timestamp `20240518015810`.
-* You can try removing the `--limit 1` flag and/or replacing `--crawl CC-MAIN-2024-22` with `--cc`, which will return more results reflecting more times when this URL was crawled.
-* You can also use `--from <timestamp>` and `--to <timestamp>` to restrict the time range when the URL was crawled. This can even be used to pinpoint an exact record: for example, `--from 20240518015810 --to 20240518015810` will only ever return the record that we've been looking at elsewhere in this tutorial.
+We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the timestamp range `--from 20240518015810 --to 20240518015810`. The result of this tells us that the crawl successfully fetched this page at timestamp `20240518015810`.
+* Captures are named by the surtkey and the time.
+* Instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.
+* You can pass `--limit <N>` to limit the number of results returned; in this case, because we have restricted the timestamp range to a single value, we only expect one result.
 * URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.

 #### Retrieve the fetched content as WARC

-Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL and crawl identifier. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
+Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
 * If you dig into cdx_toolkit's code, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
 * By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
 * Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
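The first new bullet in the hunk above notes that captures are named by surtkey and time. To see what a SURT key actually looks like, the standalone `surt` package (the same canonicalization used across the CDX ecosystem) can be queried directly; a quick sketch:

```
from surt import surt

# SURT-form key under which this page is filed in the index;
# expect something like 'org,wikipedia,an)/wiki/escopete'.
print(surt('https://an.wikipedia.org/wiki/Escopete'))
```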
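The bullet about offset and length describes a retrieval trick worth spelling out: the CDX index query returns the WARC filename, byte offset, and record length for each capture, and because Common Crawl WARC members are individually gzipped, a single HTTP range request against the public endpoint that fronts the S3 bucket yields one complete, decompressible record. A minimal sketch of the pattern with `requests`; the filename, offset, and length here are placeholders standing in for whatever the index query actually returned:

```
import gzip
import requests

# Placeholder values: substitute the filename/offset/length fields
# returned by the CDX index query for the capture you want.
filename = 'crawl-data/CC-MAIN-2024-22/segments/EXAMPLE/warc/EXAMPLE.warc.gz'
offset, length = 123456789, 12345

# Each WARC member is its own gzip stream, so bytes
# offset..offset+length-1 form a complete, decompressible record.
resp = requests.get(
    'https://data.commoncrawl.org/' + filename,
    headers={'Range': 'bytes={}-{}'.format(offset, offset + length - 1)},
)
record = gzip.decompress(resp.content)
print(record[:200])
```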
