Wait replica respond before load it in safe_load() #229

avtikhon · 2020-11-03T11:05:07Z

Check data exists before load in safe_load()

lua_eval() routine is used for evaluating lua command on a given
instance. Before run the command it reconnects to the needed instance
and runs it there. After the command run safe_load() is called to parse
the result. Found that some times safe_load() routine may fail with
AttributeError internal exception. It happens, because result value
from running command on the instance was not checked that it was not
empty. To fix it was added waiting loop for replica respond before
load it in safe_load() routine.

Check the following output from gitlab-ci job [1]:

      DEBUG: sending command: test_run:wait_fullmesh(SERVERS)
      DEBUG: test-run received command: config engine
      DEBUG: test-run's response for [config engine]
       | !!python/unicode 'engine': !!python/unicode 'memtx'
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | - id: null
       |   lsn: -1
       |   ro: true
       |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | - id: null
       |   lsn: -1
       |   ro: true
       |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
    
      TarantoolInpector.handle() received the following error:
      Traceback (most recent call last):
        File "test-run/lib/inspector.py", line 100, in handle
          result = self.parser.parse_preprocessor(line)
        File "test-run/lib/preprocessor.py", line 87, in parse_preprocessor
          return self.lua_eval(name, expr[1:-1])
        File "test-run/lib/preprocessor.py", line 404, in lua_eval
          result = yaml.safe_load(result)
        File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 162, in safe_load
          return load(stream, SafeLoader)
        File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 112, in load
          loader = Loader(stream)
        File "/usr/local/lib/python2.7/site-packages/yaml/loader.py", line 34, in __init__
          Reader.__init__(self, stream)
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 87, in __init__
          self.determine_encoding()
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 126, in determine_encoding
          self.update_raw()
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 183, in update_raw
          data = self.stream.read(size)
      AttributeError: 'NoneType' object has no attribute 'read'
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | error: AttributeError("'NoneType' object has no attribute 'read'",)
       | ...
      Kill all servers ...
    
      [Instance "autobootstrap1" returns with non-zero exit code: 1]

It is seen here that wait_fullmesh() routine waits for the server in
loop and after successfull 2 checks it fails on the 3rd one.

after fix it works like this [2] (file: log/096_replication.log, line: 29111):

[2020-11-30 13:02:56.799806] DEBUG: test-run received command: eval replica_uuid_ro1 "return box.info.server"
[2020-11-30 13:02:56.802305] DEBUG: test-run's response for [eval replica_uuid_ro1 "return box.info.server"]
[2020-11-30 13:02:56.802305]  | - id: 1
[2020-11-30 13:02:56.802305]  |   lsn: 8
[2020-11-30 13:02:56.802305]  |   ro: false
[2020-11-30 13:02:56.802305]  |   uuid: 9cdee878-98e1-4a6f-af4d-fd19d999aa2d
[2020-11-30 13:02:56.802305]  | ...
[2020-11-30 13:02:56.803032] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.804264] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.804264]  | - null
[2020-11-30 13:02:56.804264]  | ...
[2020-11-30 13:02:56.863545] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.865426] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.865426]  | - null
[2020-11-30 13:02:56.865426]  | ...
[2020-11-30 13:02:56.896547] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.898011] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.898011]  | - null
[2020-11-30 13:02:56.898011]  | ...
[2020-11-30 13:02:56.958887] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.960591] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.960591]  | - null
[2020-11-30 13:02:56.960591]  | ...
[2020-11-30 13:02:57.050987] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.052878] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.052878]  | - null
[2020-11-30 13:02:57.052878]  | ...
[2020-11-30 13:02:57.107417] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.109231] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.109231]  | - null
[2020-11-30 13:02:57.109231]  | ...
[2020-11-30 13:02:57.188574] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.190406] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.190406]  | - null
[2020-11-30 13:02:57.190406]  | ...
[2020-11-30 13:02:57.261315] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.265057] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.265057]  | - id: 1
[2020-11-30 13:02:57.265057]  |   lsn: 10
[2020-11-30 13:02:57.265057]  |   upstream:
[2020-11-30 13:02:57.265057]  |     idle: 0.012409999966621
[2020-11-30 13:02:57.265057]  |     lag: 0.00017714500427246
[2020-11-30 13:02:57.265057]  |     peer: cluster@unix/:/private/tmp/tnt/096_replication/replica_uuid_ro1.sock
[2020-11-30 13:02:57.265057]  |     status: follow
[2020-11-30 13:02:57.265057]  |   uuid: 9cdee878-98e1-4a6f-af4d-fd19d999aa2d
[2020-11-30 13:02:57.265057]  | ...

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/878962286
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/879929201/artifacts/download

Closes tarantool/tarantool#5572

Totktonada · 2020-11-21T00:38:57Z

lib/preprocessor.py

+        try:
+            result = yaml.safe_load(result)
+        except AttributeError:
+            result = []


TarantoolAsyncConnection returns None at socket.error. There are several comments here:

How can I reproduce the situation?

Please, don't use try-except, where 'foo' in my_dict or x is None may be used. The exception handling is not cheap thing.

As in PR Fix server_stop() routine on broken connections #228 and PR Fix lua_eval() routine on broken connections #230, we should not hide the error and report an empty result to the caller.

There is nothing called kill_all_servers(), please, change or clarify (it is about the commit message).

Totktonada · 2020-11-27T21:07:33Z

It is unclear how to reproduce the situation. There are also unresolved comments regarding the implementation.

I'll close the PR so. If the problem persists, feel free to reopen the PR with description how to reproduce the issue (a mangled tarantool implementation / a mangled test are okay) and with an implementation updated according to the comments above. Or file an issue with a reproducer.

avtikhon · 2020-11-29T16:32:22Z

Please check the new way to fix the issue. The issue is flaky and it can't be reproduce too easy, so only checking the logs in past currently available. I hope to find the way to reproduce it and working on it.

Totktonada · 2020-11-29T23:23:21Z

Pushed-pushed the old state (due to isaacs/github#361), reopened, force-pushed the new state back.

Totktonada · 2020-11-30T03:32:14Z

I looked at the changes. You again try to propose a 'fix' without any reproducer, even synthetic one. So I just have no any information, based on which I can say that the change is okay or not okay.

lua_eval() routine is used for evaluating lua command on a given instance. Before run the command it reconnects to the needed instance and runs it there. After the command run safe_load() is called to parse the result. Found that some times safe_load() routine may fail with AttributeError internal exception. It happens, because result value from running command on the instance was not checked that it was not empty. To fix it was added waiting loop for replica respond before load it in safe_load() routine. Check the following output from gitlab-ci job [1]: DEBUG: sending command: test_run:wait_fullmesh(SERVERS) DEBUG: test-run received command: config engine DEBUG: test-run's response for [config engine] | !!python/unicode 'engine': !!python/unicode 'memtx' | ... DEBUG: test-run received command: eval autobootstrap1 "return box.info.server" DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"] | - id: null | lsn: -1 | ro: true | uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1 | ... DEBUG: test-run received command: eval autobootstrap1 "return box.info.server" DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"] | - id: null | lsn: -1 | ro: true | uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1 | ... DEBUG: test-run received command: eval autobootstrap1 "return box.info.server" TarantoolInpector.handle() received the following error: Traceback (most recent call last): File "test-run/lib/inspector.py", line 100, in handle result = self.parser.parse_preprocessor(line) File "test-run/lib/preprocessor.py", line 87, in parse_preprocessor return self.lua_eval(name, expr[1:-1]) File "test-run/lib/preprocessor.py", line 404, in lua_eval result = yaml.safe_load(result) File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 162, in safe_load return load(stream, SafeLoader) File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 112, in load loader = Loader(stream) File "/usr/local/lib/python2.7/site-packages/yaml/loader.py", line 34, in __init__ Reader.__init__(self, stream) File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 87, in __init__ self.determine_encoding() File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 126, in determine_encoding self.update_raw() File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 183, in update_raw data = self.stream.read(size) AttributeError: 'NoneType' object has no attribute 'read' DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"] | error: AttributeError("'NoneType' object has no attribute 'read'",) | ... Kill all servers ... [Instance "autobootstrap1" returns with non-zero exit code: 1] It is seen here that wait_fullmesh() routine waits for the server in loop and after successfull 2 checks it fails on the 3rd one. [1] - https://gitlab.com/tarantool/tarantool/-/jobs/878962286 Closes tarantool/tarantool#5572

Totktonada · 2021-03-03T01:30:03Z

No responses from the author of the PR for more than three months. And we still need a reproducer, even if it'll be probabilistic. The point regarding error hiding is actual: we'll not ever know that something goes wrong. Aside of this, the new patch makes it unclear, whether test-run may cycle for the infinite time (what if an instance is crashed?).

I would not object against some kind of good error reporting that will help us to spot a rare problem (collect and report all helpful information and dump it to the terminal). But attempts to blindly fix such problems is not a way to go. We'll just got some other problems like worker hangs.

I'll close the PR as stale. Feel free to reopen, if you'll get some kind of reproducer.

avtikhon requested a review from Totktonada November 3, 2020 13:02

Totktonada reviewed Nov 21, 2020

View reviewed changes

Totktonada mentioned this pull request Nov 21, 2020

Stop replica with SIGKILL if SIGTERM failed #231

Closed

avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from fade4c4 to bc80d8a Compare November 21, 2020 05:23

Totktonada closed this Nov 27, 2020

Totktonada deleted the avtikhon/kill_all_servers_exception branch November 27, 2020 21:07

avtikhon self-assigned this Nov 27, 2020

avtikhon changed the title ~~Fix lua_eval() routine on AttributeError exception~~ Fix fail on lua_eval() routine at wait_fullmesh() call on checking replicas missing information Nov 28, 2020

Totktonada restored the avtikhon/kill_all_servers_exception branch November 29, 2020 02:26

avtikhon changed the title ~~Fix fail on lua_eval() routine at wait_fullmesh() call on checking replicas missing information~~ Check data exists before load in safe_load() Nov 29, 2020

Totktonada reopened this Nov 29, 2020

Totktonada force-pushed the avtikhon/kill_all_servers_exception branch from bc80d8a to 3a8774f Compare November 29, 2020 23:22

avtikhon changed the title ~~Check data exists before load in safe_load()~~ Wait replica respond before load it in safe_load() Nov 30, 2020

avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from 3a8774f to df32290 Compare November 30, 2020 06:23

avtikhon added teamQ labels Nov 30, 2020

avtikhon force-pushed the avtikhon/kill_all_servers_exception branch 2 times, most recently from 51182fe to f5686da Compare December 6, 2020 11:24

avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from f5686da to b0a2ead Compare January 18, 2021 12:44

Totktonada closed this Mar 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Wait replica respond before load it in safe_load() #229

Wait replica respond before load it in safe_load() #229

Uh oh!

avtikhon commented Nov 3, 2020 •

edited

Loading

Uh oh!

Totktonada Nov 21, 2020

Uh oh!

Totktonada commented Nov 27, 2020

Uh oh!

avtikhon commented Nov 29, 2020

Uh oh!

Totktonada commented Nov 29, 2020

Uh oh!

Totktonada commented Nov 30, 2020

Uh oh!

Totktonada commented Mar 3, 2021

Uh oh!

Uh oh!

Wait replica respond before load it in safe_load() #229

Wait replica respond before load it in safe_load() #229

Uh oh!

Conversation

avtikhon commented Nov 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Totktonada Nov 21, 2020

Choose a reason for hiding this comment

Uh oh!

Totktonada commented Nov 27, 2020

Uh oh!

avtikhon commented Nov 29, 2020

Uh oh!

Totktonada commented Nov 29, 2020

Uh oh!

Totktonada commented Nov 30, 2020

Uh oh!

Totktonada commented Mar 3, 2021

Uh oh!

Uh oh!

avtikhon commented Nov 3, 2020 •

edited

Loading