Migrate files database to S3 #377

Conversation
There's no reason for the world to be able to read our objects. We read them using the API keys, so we still have access.
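For illustration, a rough sketch of a private upload, assuming the 0.x-era rusoto_s3 API (the `upload_private` helper is hypothetical, not the PR's code). The point is simply that no `acl` field is set on the request, so the object stays private and is only readable through our API credentials:

```rust
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Hypothetical helper: upload one file without a public ACL.
fn upload_private(
    client: &S3Client,
    bucket: &str,
    path: &str,
    content: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    let request = PutObjectRequest {
        bucket: bucket.to_owned(),
        key: path.to_owned(),
        body: Some(content.into()),
        // Deliberately no `acl: Some("public-read".to_owned())` here,
        // so the object defaults to private.
        ..Default::default()
    };
    client.put_object(request).sync()?; // `.sync()` blocks on rusoto's future
    Ok(())
}
```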
src/db/file.rs
Outdated
We use a single-threaded runtime here because something in docs.rs (possibly the postgres crate) uses a Cell, which can't be shared or sent between threads. I don't think it matters much -- it may slow us down a bit, but that seems fine.
Since the futures are probably going to be mostly IO-bound anyway, using the single-threaded runtime seems fine to me too.
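As a loose sketch of what the single-threaded setup looks like, assuming the tokio 0.1 / futures 0.1 API in use at the time (the `run_uploads` wrapper and its `upload_all` argument are illustrative, not the PR's actual code):

```rust
use futures::Future;
use tokio::runtime::current_thread::Runtime;

// Drive a futures-0.1 future to completion on the current thread, so the
// future does not have to be Send -- which matters because of the Cell
// mentioned above.
fn run_uploads<F>(upload_all: F) -> Result<F::Item, F::Error>
where
    F: Future,
{
    let mut runtime = Runtime::new().expect("failed to create single-threaded runtime");
    runtime.block_on(upload_all)
}
```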
src/db/file.rs
Outdated
In local testing I've confirmed that if anything goes wrong with the migration we will indeed abort the transaction here, and nothing will get written to the database -- i.e., we don't accidentally mark files as "in-s3" without actually uploading them. It also means each batch is all-or-nothing, but since we're not doing more than 5000 files at once that's fine too.
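A minimal sketch of that all-or-nothing shape, assuming the synchronous `postgres` crate's 0.15-era API (`upload_to_s3` is a hypothetical stand-in; the real code drives S3 futures as discussed above):

```rust
use postgres::Connection;

// Hypothetical stand-in for the real S3 upload.
fn upload_to_s3(_path: &str, _content: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!()
}

// Migrate one batch inside a single transaction: if any upload or update
// fails, the early return drops the transaction uncommitted, so no rows are
// marked "in-s3" unless every file in the batch actually reached S3.
fn migrate_batch(conn: &Connection) -> Result<usize, Box<dyn std::error::Error>> {
    let trans = conn.transaction()?;
    let rows = trans.query(
        "SELECT path, content FROM files WHERE content != E'in-s3' LIMIT 5000",
        &[],
    )?;
    for row in rows.iter() {
        let path: String = row.get(0);
        let content: Vec<u8> = row.get(1);
        upload_to_s3(&path, &content)?;
        trans.execute("UPDATE files SET content = E'in-s3' WHERE path = $1", &[&path])?;
    }
    trans.commit()?; // only reached when every upload and update succeeded
    Ok(rows.len())
}
```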
src/bin/cratesfyi.rs
Outdated
This 5000 value is arbitrarily chosen -- but increasing it much more seems like a bad idea, as it means we're touching that many more rows and attempting to do that much more at the same time. This takes ~5 seconds to run once on my machine (with minio, as I've said, not S3), so running it in a loop won't have too high overhead from process spawning.
Is the plan to run the loop in bash, instead? How would we know that it finished?
Very loosely, `select count(*) from files where files.content != E'in-s3'` should return 0 once everything has been migrated. I plan to monitor it fairly closely while I'm around for the first few hours and, if you'd like, I can kill it when I'm not around. I can also add the looping logic directly in here, though. Up to you.
If you don't need the latest patch updates in your docs.rs code, you can try changing the Cargo.toml versions of each to just be… Alternately, if you want to also include the patch update to futures/tokio, I was able to minimize the impact to the lockfile by only updating with the following commands… This creates a coherent dependency tree and a completable build, while keeping the diff to +108/-78.
This will pull 5,000 files (essentially randomly chosen) out of the DB and upload them to the configured S3 instance.
Force-pushed from aa9c40a to 16e6017.
Ah, that makes sense. I've replaced the dependencies on tokio/futures with just…
QuietMisdreavus
left a comment
Seems good to me. I think I would want to keep this around long-term, to allow development servers and independent instances to be able to migrate themselves as well. (I'm also tempted to ask for the reverse operation, but for now having only database->s3 is fine.)
FWIW, the only reason I wanted to revert was the large Cargo.lock diff -- since this no longer includes it, I have no problems keeping it around.
I've pushed up a new commit (the third one) which encodes the loop that'll be needed directly into the code. This makes the whole process much more streamlined (no need for bash, etc.). Each batch should still take around 5 seconds, and since we're authenticating I'm not too worried about rate limits or other AWS-imposed limitations. The loop ended up pretty simple and shouldn't add overhead: we just check the number of rows returned by the query (it can never be more than the limit, but still).
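Roughly, the loop has this shape (illustrative names, building on the `migrate_batch` sketch above; not the PR's exact code):

```rust
const LIMIT: usize = 5000;

// Keep migrating batches until one comes back smaller than the limit, which
// means there are no rows left to move, and print a running total as we go.
fn migrate_all(conn: &postgres::Connection) -> Result<(), Box<dyn std::error::Error>> {
    let mut total = 0;
    loop {
        let uploaded = migrate_batch(conn)?;
        total += uploaded;
        println!("uploaded {} files so far", total);
        if uploaded < LIMIT {
            break;
        }
    }
    Ok(())
}
```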
src/bin/cratesfyi.rs
Outdated
The expect message didn't get updated. It should probably be something like "Failed to upload batch to S3".
Ah, yes, indeed. Fixed now -- I also added a running total so there's a relatively easy way to tell how many files we've uploaded so far without running relatively slow queries that scan the whole files table.
This loop will halt once all files are uploaded to S3, avoiding the need to write something in bash / externally.
Force-pushed from 9c402d2 to 1dfc81c.
QuietMisdreavus
left a comment
I think this is good. Thanks so much!
A question about deployment: do you think this is feasible to do while also processing new crates? Or should we lock the database while this is happening? At first blush it seems like it should be fine - if postgres can handle this transaction happening in parallel with other writes to the files table, then we're good, since we're not going to be competing for the same rows. But I'm not sure if there are other factors that would cause problems.
I can't think of any reason this can't run in parallel, but if we do see issues (or high server load, or whatever), we can lock the DB for, say, 5 minutes at a time, during which we should upload around 300,000 files -- and we can do so in bursts over time if needed.
That seems fine to me. Let's get this rolling!
This adds the migrate command to start moving files out of the DB and into S3.
The Cargo.lock update is unfortunate, but I ran into trouble with `rand` when adding tokio, and this fixed it -- I can try to reduce it down but I suspect that'll be rather painful, so I'd rather not (this is just the effect of `cargo update` after the changes in Cargo.toml). This PR can be reverted once we're done with the migration if desired.
Based on local testing, albeit with minio and not real S3, this should take around 80 hours. It might take longer -- we'll be talking over the network, not to a local server -- or maybe S3 is faster to upload to than minio; I'm not sure.
r? @QuietMisdreavus