[ETCM-540] Improve peer discovery algorithm #903
KonradStaniec
left a comment
Did you try running a Mantis node with these changes to see how fast it traverses the peers it gets from discovery?
I ran it to check that nothing was broken, but I was not sure how to see whether the speed improves, since the change only helps in a specific situation: when previously blacklisted nodes are tried instead of new ones. For that, the network has to be large and contain a large number of blacklisted nodes. Edit: I'm measuring the time to start syncing at the moment.
Time to start syncing is a pretty tricky measure, as other factors also affect it. Maybe the best metric would be the number of peers we tried to connect to after some time (maybe as a percentage of the nodes provided by discovery)? (Just thinking out loud here, as I am also not sure 😅)
Indeed, after running it multiple times I saw that the time to sync differs drastically. I ended up looking at the ratio of connected nodes to discovered ones :) Running on the develop branch, the percentage of connected nodes after 10min is 0.273%.
KonradStaniec
left a comment
It would be nice to check what the ratio looks like after 1h, i.e. when we have acquired more peers, but if it helps even a little bit then LGTM!
When I was thinking about this problem, my initial solution was to change our interaction with discovery from: ask discovery for all found peers every 1 minute, and try to connect to X peers
to something like: ask discovery for all found peers, try to connect to all of them (by calling them in batches, as it is now), and ask discovery for all found peers again only after we are sure we have checked all peers from the previous batch.
But maybe this is something for another PR.
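The alternative interaction described above can be sketched roughly as follows. This is only an illustration in Python, not Mantis code: `connect_in_rounds`, `fetch_discovered`, `try_connect`, `batch_size` and `max_rounds` are hypothetical names, and connections are attempted synchronously here, whereas a real node would dial each batch concurrently.

```python
def connect_in_rounds(fetch_discovered, try_connect, batch_size, max_rounds):
    """Ask discovery again only after every peer of the previous snapshot was tried.

    fetch_discovered: hypothetical callable returning the peers found so far.
    try_connect: hypothetical callable returning True when a connection succeeds.
    """
    connected = []
    for _ in range(max_rounds):
        snapshot = list(fetch_discovered())
        # Walk the whole snapshot in batches before fetching a new one.
        for start in range(0, len(snapshot), batch_size):
            batch = snapshot[start:start + batch_size]
            connected.extend(peer for peer in batch if try_connect(peer))
    return connected
```

The open question raised in the thread still applies to this shape: the longer a snapshot takes to drain, the more outdated its remaining entries may be.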
Just checked the ratio after running for 1h: this difference could also be caused by the increased blacklist time; as seen above, the number of blacklisted nodes with the new configuration is 65% higher.
@KonradStaniec I like your idea of changing the interaction with discovery as well. There is no need to ask for new nodes while we haven't processed the previously discovered ones. The question in this case would be how fast the discovered batch is processed and whether it leaves us with outdated info. I think a follow-up task could be created, WDYT @dzajkowski?
Description
With each round of scanning for new nodes to connect to, we only take N nodes from that pool and try to connect to them. This process takes a lot of time, as not all peers are easy to connect to, and if no nodes were suitable we have to wait for another round of scanning. It is then possible that the N nodes we take from the new scan include nodes that were tried previously without a successful connection. So instead of retrying those nodes, we first want to go through the whole list of discovered peers.
Proposed Solution
This PR introduces caching of the tried nodes in a Map whose size is bounded by the size of the blacklist. So we first try to connect to new nodes, i.e. those that are not cached in that map.
If the scanning round didn't find enough nodes for us to be picky, we try to connect to all of them regardless of whether they were tried before.
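A minimal sketch of this selection logic, written in Python for illustration and not taken from the PR's code: `TriedNodesCache`, `select_nodes_to_connect`, `max_size` and `wanted` are hypothetical names. It assumes that the oldest cache entry is evicted when the bound is reached and that previously tried nodes are used to fill up the selection when there are too few untried ones.

```python
from collections import OrderedDict


class TriedNodesCache:
    """Remembers recently tried node ids, evicting the oldest when full."""

    def __init__(self, max_size):
        self.max_size = max_size  # e.g. the size of the blacklist
        self._entries = OrderedDict()

    def add(self, node_id):
        # Re-adding a node moves it to the newest position.
        self._entries.pop(node_id, None)
        self._entries[node_id] = True
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the oldest entry

    def __contains__(self, node_id):
        return node_id in self._entries


def select_nodes_to_connect(discovered, tried, wanted):
    """Pick up to `wanted` nodes, preferring ones that were not tried recently."""
    if len(discovered) <= wanted:
        # Not enough nodes to be picky: take all of them, tried or not.
        selected = list(discovered)
    else:
        fresh = [node for node in discovered if node not in tried]
        stale = [node for node in discovered if node in tried]
        selected = (fresh + stale)[:wanted]
    for node in selected:
        tried.add(node)
    return selected
```

With this shape, two consecutive rounds over the same discovery result pick different nodes, so the whole list is traversed before any node is retried.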
The parameters that control blacklisting were reviewed: long blacklisting (in case of a wrong protocol, an incompatible network, or a timeout during connection) was increased from 30min to 10h, as it's unlikely that nodes from another network or with a different protocol would suddenly change within a 30min timeframe.