Skip to content

Conversation

@jaymode
Copy link
Member

@jaymode jaymode commented Jun 27, 2017

This commit fixes a race condition in the node supplier used by the RemoteClusterConnection. The
node supplier stores an iterator over a set backed by a ConcurrentHashMap, but the get operation
of the supplier uses multiple methods of the iterator and is suceptible to a race between the
calls to hasNext() and next(). The test in this commit fails under the old implementation with a
NoSuchElementException. This commit adds a wrapper object over a set and a list, with all methods
being synchronized to avoid races. Additionally, iterators are no longer used and replaced with a
counter and index based access of an array list to maintain the round robin aspect of the previous
node supplier implementation.

This commit fixes a race condition in the node supplier used by the RemoteClusterConnection. The
node supplier stores an iterator over a set backed by a ConcurrentHashMap, but the get operation
of the supplier uses multiple methods of the iterator and is suceptible to a race between the
calls to hasNext() and next(). The test in this commit fails under the old implementation with a
NoSuchElementException. This commit adds a wrapper object over a set and a list, with all methods
being synchronized to avoid races. Additionally, iterators are no longer used and replaced with a
counter and index based access of an array list to maintain the round robin aspect of the previous
node supplier implementation.
@jaymode
Copy link
Member Author

jaymode commented Jun 27, 2017

@pickypg observed the following errors that were caused by this race condition:

java.util.NoSuchElementException: null
        at java.util.concurrent.ConcurrentHashMap$KeyIterator.next(ConcurrentHashMap.java:3416) ~[?:1.8.0_111]
        at org.elasticsearch.transport.RemoteClusterConnection$1.get(RemoteClusterConnection.java:127) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.RemoteClusterConnection$1.get(RemoteClusterConnection.java:117) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.RemoteClusterConnection.fetchShardsInternal(RemoteClusterConnection.java:183) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.RemoteClusterConnection.fetchSearchShards(RemoteClusterConnection.java:165) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.transport.RemoteClusterService.collectSearchShards(RemoteClusterService.java:225) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:190) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:66) ~[elasticsearch-5.5.0.jar:5.5.0]

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a comment, thanks for the fix @jaymode

// nodes, this class uses a counter and retrieves by index from the list. The arraylist enables us to do this in O(1).
private final Set<DiscoveryNode> nodeSet = new HashSet<>();
private final List<DiscoveryNode> nodeList = new ArrayList<>();
private final AtomicInteger counter = new AtomicInteger(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a simple integer now?

// this classes uses both a set and a list to support faster operations. Insertion and contains for this class are O(1) thanks
// to the use of the set and removal is O(n) due to the arraylist. In order to support a round-robin scheme through the connected
// nodes, this class uses a counter and retrieves by index from the list. The arraylist enables us to do this in O(1).
private final Set<DiscoveryNode> nodeSet = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like to have 2 data structures that we have to keep in sync. if you really wanna optimize for removal / insertion we can just use an array or an array list and sort it? I doubt we should to be honest. Maybe we just go with a list and linear scan or we use a set only and the same iterator approach and with the synchronization in place we can just set the iterator to null if we changed the set? that might be the easiest

@jaymode jaymode requested a review from s1monw June 28, 2017 14:40
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jaymode jaymode merged commit 64d11b8 into elastic:master Jun 28, 2017
@jaymode jaymode deleted the remote_connect_node_supplier branch June 28, 2017 21:50
jaymode added a commit that referenced this pull request Jun 28, 2017
This commit fixes a race condition in the node supplier used by the RemoteClusterConnection. The
node supplier stores an iterator over a set backed by a ConcurrentHashMap, but the get operation
of the supplier uses multiple methods of the iterator and is suceptible to a race between the
calls to hasNext() and next(). The test in this commit fails under the old implementation with a
NoSuchElementException. This commit adds a wrapper object over a set and a iterator, with all methods
being synchronized to avoid races. Modifications to the set result in the iterator being set to null
and the next retrieval creates a new iterator.
jaymode added a commit that referenced this pull request Jun 28, 2017
This commit fixes a race condition in the node supplier used by the RemoteClusterConnection. The
node supplier stores an iterator over a set backed by a ConcurrentHashMap, but the get operation
of the supplier uses multiple methods of the iterator and is suceptible to a race between the
calls to hasNext() and next(). The test in this commit fails under the old implementation with a
NoSuchElementException. This commit adds a wrapper object over a set and a iterator, with all methods
being synchronized to avoid races. Modifications to the set result in the iterator being set to null
and the next retrieval creates a new iterator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants