Migrate files database to S3 #377

Conversation
There's no reason for the world to be able to read our objects. We read them using the API keys, so we still have access.
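For illustration, a rough sketch of a private upload, assuming the 0.x-era rusoto_s3 API (the `upload_private` helper is hypothetical, not the PR's code). The point is simply that no `acl` field is set on the request, so the object stays private and is only readable through our API credentials:

```rust
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Hypothetical helper: upload one file without a public ACL.
fn upload_private(
    client: &S3Client,
    bucket: &str,
    path: &str,
    content: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    let request = PutObjectRequest {
        bucket: bucket.to_owned(),
        key: path.to_owned(),
        body: Some(content.into()),
        // Deliberately no `acl: Some("public-read".to_owned())` here,
        // so the object defaults to private.
        ..Default::default()
    };
    client.put_object(request).sync()?; // `.sync()` blocks on rusoto's future
    Ok(())
}
```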
src/db/file.rs
Outdated
We use a single-threaded runtime here because something in docs.rs (possibly the postgres crate) uses a Cell, which can't be shared or sent between threads. I don't think it matters much -- it may slow us down a bit, but that seems fine.
Since the futures are probably going to be mostly IO-bound anyway, using the single-threaded runtime seems fine to me too.
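As a loose sketch of what the single-threaded setup looks like, assuming the tokio 0.1 / futures 0.1 API in use at the time (the `run_uploads` wrapper and its `upload_all` argument are illustrative, not the PR's actual code):

```rust
use futures::Future;
use tokio::runtime::current_thread::Runtime;

// Drive a futures-0.1 future to completion on the current thread, so the
// future does not have to be Send -- which matters because of the Cell
// mentioned above.
fn run_uploads<F>(upload_all: F) -> Result<F::Item, F::Error>
where
    F: Future,
{
    let mut runtime = Runtime::new().expect("failed to create single-threaded runtime");
    runtime.block_on(upload_all)
}
```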
src/db/file.rs
Outdated
In local testing I've confirmed that if anything goes wrong with the migration we will indeed abort the transaction here, and nothing will get written to the database -- i.e., we don't accidentally mark files as "in-s3" without actually uploading them. It also means each batch is all-or-nothing, but since we're not doing more than 5000 files at once that's fine too.
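A minimal sketch of that all-or-nothing shape, assuming the synchronous `postgres` crate's 0.15-era API (`upload_to_s3` is a hypothetical stand-in; the real code drives S3 futures as discussed above):

```rust
use postgres::Connection;

// Hypothetical stand-in for the real S3 upload.
fn upload_to_s3(_path: &str, _content: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    unimplemented!()
}

// Migrate one batch inside a single transaction: if any upload or update
// fails, the early return drops the transaction uncommitted, so no rows are
// marked "in-s3" unless every file in the batch actually reached S3.
fn migrate_batch(conn: &Connection) -> Result<usize, Box<dyn std::error::Error>> {
    let trans = conn.transaction()?;
    let rows = trans.query(
        "SELECT path, content FROM files WHERE content != E'in-s3' LIMIT 5000",
        &[],
    )?;
    for row in rows.iter() {
        let path: String = row.get(0);
        let content: Vec<u8> = row.get(1);
        upload_to_s3(&path, &content)?;
        trans.execute("UPDATE files SET content = E'in-s3' WHERE path = $1", &[&path])?;
    }
    trans.commit()?; // only reached when every upload and update succeeded
    Ok(rows.len())
}
```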
src/bin/cratesfyi.rs
Outdated
This 5000 value is arbitrarily chosen -- but increasing it much more seems like a bad idea, as it means we're touching that many more rows and attempting to do that much more at the same time. This takes ~5 seconds to run once on my machine (with minio, as I've said, not S3), so running it in a loop won't have too high overhead from process spawning.
Is the plan to run the loop in bash, instead? How would we know that it finished?
Very loosely, `select count(*) from files where files.content != E'in-s3'` should return 0 once everything has been migrated. I plan to monitor it fairly closely while I'm around for the first few hours and, if you'd like, I can kill it when I'm not around. I can also add the looping logic directly in here, though. Up to you.
If you don't need the latest patch updates in your docs.rs code, you can try changing the Cargo.toml versions of each to just be… Alternately, if you want to also include the patch update to futures/tokio, I was able to minimize the impact to the lockfile by only updating with the following commands… This creates a coherent dependency tree and a completable build, while keeping the diff to +108/-78.
This will pull 5,000 files (essentially randomly chosen) out of the DB and upload them to the configured S3 instance.
Force-pushed from aa9c40a to 16e6017.
Ah, that makes sense. I've replaced the dependencies on tokio/futures with just…
QuietMisdreavus
left a comment
Seems good to me. I think I would want to keep this around long-term, to allow development servers and independent instances to be able to migrate themselves as well. (I'm also tempted to ask for the reverse operation, but for now having only database->s3 is fine.)
FWIW, the only reason I wanted to revert was the large Cargo.lock diff -- since this no longer includes it, I have no problems keeping it around.
I've pushed up a new commit (the third one) which encodes the loop that'll be needed directly into the code. This makes the whole process much more streamlined (no need for bash, etc.). Each batch should still take around 5 seconds, and since we're authenticating I'm not too worried about rate limits or other AWS-imposed limitations. The loop ended up pretty simple and shouldn't add overhead: we just check the number of rows returned by the query (it can never be more than the limit, but still).
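Roughly, the loop has this shape (illustrative names, building on the `migrate_batch` sketch above; not the PR's exact code):

```rust
const LIMIT: usize = 5000;

// Keep migrating batches until one comes back smaller than the limit, which
// means there are no rows left to move, and print a running total as we go.
fn migrate_all(conn: &postgres::Connection) -> Result<(), Box<dyn std::error::Error>> {
    let mut total = 0;
    loop {
        let uploaded = migrate_batch(conn)?;
        total += uploaded;
        println!("uploaded {} files so far", total);
        if uploaded < LIMIT {
            break;
        }
    }
    Ok(())
}
```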
src/bin/cratesfyi.rs
Outdated
The expect message didn't get updated. It should probably be something like "Failed to upload batch to S3".
Ah, yes, indeed. Fixed now -- I also added a running total so there's a relatively easy way to tell how many files we've uploaded so far without running relatively slow queries that scan the whole files table.
This loop will halt once all files are uploaded to S3, avoiding the need to write something in bash / externally.
Force-pushed from 9c402d2 to 1dfc81c.
QuietMisdreavus
left a comment
I think this is good. Thanks so much!
A question about deployment: do you think this is feasible to do while also processing new crates? Or should we lock the database while this is happening? At first blush it seems like it should be fine - if postgres can handle this transaction happening in parallel with other writes to the files table, then we're good, since we're not going to be competing for the same rows. But I'm not sure if there are other factors that would cause problems.
I can't think of any reason this can't run in parallel, but if we do see issues (or high server load, or whatever), we can lock the DB for, say, 5 minutes at a time, during which we should upload around 300,000 files -- and we can do so in bursts over time if needed.
That seems fine to me. Let's get this rolling!
This adds the migrate command to start moving files out of the DB and into S3.
The Cargo.lock update is unfortunate, but I ran into trouble with `rand` when adding tokio, and this fixed it -- I can try to reduce it down but I suspect that'll be rather painful, so I'd rather not (this is just the effect of `cargo update` after the changes in Cargo.toml). This PR can be reverted once we're done with the migration if desired.
Based on local testing, albeit with minio and not real S3, this should take around 80 hours. It might take longer -- we'll be talking over the network, not to a local server -- or maybe S3 is faster to upload to than minio; I'm not sure.
r? @QuietMisdreavus