
Conversation

@Mark-Simulacrum
Member

This adds the migrate command to start moving files out of the DB and into S3.

The Cargo.lock update is unfortunate, but I ran into trouble with rand when adding tokio, and this fixed it -- I can try to reduce it down but I suspect that'll be rather painful, so I'd rather not (this is just the effects of cargo update after the changes in Cargo.toml).

This PR can be reverted once we're done with the migration if desired.

Based on local testing, albeit with minio and not real S3, this should take around 80 hours. It might take longer -- we'll be talking over the network and not to a local server -- or maybe S3 is faster to upload to than minio; I'm not sure.

r? @QuietMisdreavus

There's no reason for the world to be able to read our objects. We read
them using the API keys, so we still have access.
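
For context, here's a hedged sketch of what a private upload might look like, assuming the S3 calls go through the rusoto_s3 crate (circa 0.40); the bucket name and the upload_private helper are placeholders rather than anything from this PR:

```rust
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Leaving `acl` unset keeps the object private (S3's default), so it is
// only readable through our API keys rather than by the world.
fn upload_private(client: &S3Client, key: &str, content: Vec<u8>) {
    let request = PutObjectRequest {
        bucket: "docs-rs-files".to_string(), // placeholder bucket name
        key: key.to_string(),
        body: Some(content.into()),
        ..Default::default()
    };
    client
        .put_object(request)
        .sync()
        .expect("failed to upload file to S3");
}
```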
src/db/file.rs Outdated
Member Author

We use a single-threaded runtime here because something in docs.rs (I think, maybe postgres or something) is using a Cell which can't be shared/sent between threads. I don't think it matters much -- maybe slows us down, but that seems fine.

Contributor

Since the futures are probably going to be mostly IO-bound anyway, using the single-threaded runtime seems fine to me too.
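
For illustration, a minimal sketch (not this PR's actual code) of driving a batch of upload futures on tokio 0.1's single-threaded current_thread runtime; the upload_batch helper and its signature are assumptions:

```rust
use futures::Future;

// Run a batch of upload futures to completion on the calling thread.
// Because nothing is handed to another thread, the futures (and whatever
// non-Send state they capture, e.g. something holding a Cell) don't need
// to be Send.
fn upload_batch<F>(uploads: Vec<F>) -> Result<(), F::Error>
where
    F: Future<Item = ()>,
{
    let mut runtime = tokio::runtime::current_thread::Runtime::new()
        .expect("failed to create single-threaded runtime");
    runtime
        .block_on(futures::future::join_all(uploads))
        .map(|_| ())
}
```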

src/db/file.rs Outdated
Member Author

In local testing I've confirmed that if anything goes wrong with the migration we will indeed abort the transaction here and nothing will get written to the database, i.e., we don't accidentally overwrite some of the files with "in-s3" despite not uploading them as such. It also means each batch is all-or-nothing, but we're not doing more than 5000 at once so that's also fine.
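
Roughly, the shape of that all-or-nothing batch as a hedged sketch using the synchronous postgres crate's transaction API; the column names, the failure error type, and the upload closure are assumptions:

```rust
use postgres::Connection;

// Migrate up to 5000 files inside one transaction: if any upload (or any
// UPDATE) fails, we return early, the transaction is dropped without
// commit(), and postgres rolls the whole batch back.
fn migrate_batch<F>(conn: &Connection, mut upload: F) -> Result<usize, failure::Error>
where
    F: FnMut(&str, &[u8]) -> Result<(), failure::Error>,
{
    let trans = conn.transaction()?;
    let rows = trans.query(
        "SELECT path, content FROM files \
         WHERE content != E'in-s3' LIMIT 5000",
        &[],
    )?;

    for row in &rows {
        let path: String = row.get(0);
        let content: Vec<u8> = row.get(1);
        upload(&path, &content)?;
        trans.execute(
            "UPDATE files SET content = E'in-s3' WHERE path = $1",
            &[&path],
        )?;
    }

    // Only now do the "in-s3" markers become visible to anyone else.
    trans.commit()?;
    Ok(rows.len())
}
```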

Member Author

This 5000 value is arbitrarily chosen -- but increasing it much more seems like a bad idea, as it means we're touching that many more rows and attempting to do that much more at the same time. This takes ~5 seconds to run once on my machine (with minio, as I've said, not S3), so running it in a loop won't have too high overhead from process spawning.

Contributor

Is the plan to run the loop in bash, instead? How would we know that it finished?

Member Author

Very loosely, select count(*) from files where files.content != E'in-s3' should return 0. I plan to monitor it somewhat closely while I'm around for the first few hours and, if you'd like, I can kill it when I'm not around. I can add the looping logic directly in here, though. Up to you.

@QuietMisdreavus
Contributor

> The Cargo.lock update is unfortunate, but I ran into trouble with rand when adding tokio, and this fixed it -- I can try to reduce it down but I suspect that'll be rather painful, so I'd rather not (this is just the effects of cargo update after the changes in Cargo.toml).

If you don't need the latest patch updates in your docs.rs code, you can try changing the Cargo.toml versions of each to just be "0.1", which will happily pull versions of futures/tokio that are already in our lockfile. I don't want to commit a blanket cargo update because that will also update cargo, which is a git dependency.

Alternatively, if you want to also include the patch update to futures/tokio, I was able to minimize the impact to the lockfile by updating with only the following commands:

cargo update -p futures -p tokio # this gets them to the patch versions in Cargo.toml, since they're currently the latest
cargo update -p rand_core:0.3.0 # this solves the incoherent rand impls
cargo update -p tokio-uds # this solves a different build error in tokio from its patch update

This creates a coherent dependency tree and a completable build, while keeping the diff to +108/-78.

This will pull 5,000 files (essentially randomly chosen) out of the DB
and upload them to the configured S3 instance.
@Mark-Simulacrum
Member Author

Ah, that makes sense. I've replaced the dependencies on tokio/futures with just "0.1" so that we don't need a big Cargo.lock update.

@QuietMisdreavus (Contributor) left a comment

Seems good to me. I think I would want to keep this around long-term, to allow development servers and independent instances to be able to migrate themselves as well. (I'm also tempted to ask for the reverse operation, but for now having only database->s3 is fine.)

@Mark-Simulacrum
Member Author

> Seems good to me. I think I would want to keep this around long-term, to allow development servers and independent instances to be able to migrate themselves as well. (I'm also tempted to ask for the reverse operation, but for now having only database->s3 is fine.)

FWIW, the only reason I wanted to revert was the large Cargo.lock diff -- since this no longer includes it, I have no problems keeping it around.

@Mark-Simulacrum
Member Author

I've pushed up a new commit (the third one) which encodes the loop that'll be needed directly into the code. This makes the whole process much more streamlined (no need for bash, etc.). Each batch should still take around 5 seconds, and since we're authenticating and such I'm not too worried about rate limits or other AWS-imposed limitations.

The loop ended up pretty simple and should not add overhead, as we simply check the number of rows returned by the query (which can never be more than the limit, but still).
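
As a sketch of that halting condition (the migrate_batch closure and the failure error type are assumptions; it stands in for one run of the batch described above and reports how many rows it touched):

```rust
// Keep running batches until one comes back smaller than the limit,
// which means the files table has no rows left to migrate.
fn migrate_all<F>(mut migrate_batch: F) -> Result<(), failure::Error>
where
    F: FnMut() -> Result<usize, failure::Error>, // rows touched by one batch
{
    let limit = 5000;
    loop {
        // A batch can never touch more rows than the limit; a short batch
        // means everything has been uploaded.
        if migrate_batch()? < limit {
            return Ok(());
        }
    }
}
```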

Contributor

The expect message didn't get updated. It should probably be something like "Failed to upload batch to S3".

Member Author

Ah, yes, indeed. Fixed now -- I also added a running total so there's an easy way to tell how many files we've uploaded so far without running relatively slow queries that scan the whole files table.

This loop will halt when all files are uploaded to S3, avoiding the need to
write something in bash / externally.
@QuietMisdreavus (Contributor) left a comment

I think this is good. Thanks so much!

A question about deployment: Do you think this is feasible to do while also processing new crates? Or should we lock the database while this is happening? At first blush it seems like it should be fine -- if postgres can handle this transaction happening in parallel with other writes to the files table, then we're good, since we're not going to be competing for the same rows. But I'm not sure if there are other factors that would cause problems.

@Mark-Simulacrum
Member Author

I can't think of any reason this can't run in parallel, but if we do see issues (or high server load, or whatever), we can lock the DB for, say, 5 minutes at a time, during which we should upload around 300,000 files (at ~5,000 files per ~5-second batch, that's roughly 1,000 files per second) -- and we can do so in bursts over time if needed.

@QuietMisdreavus
Contributor

That seems fine to me. Let's get this rolling!

@QuietMisdreavus QuietMisdreavus merged commit 635e91e into rust-lang:master Jul 16, 2019
@Mark-Simulacrum Mark-Simulacrum deleted the s3-migrate branch July 16, 2019 22:54