Skip to content

Conversation

avtikhon
Copy link
Contributor

@avtikhon avtikhon commented Nov 3, 2020

Check data exists before load in safe_load()

lua_eval() routine is used for evaluating lua command on a given
instance. Before run the command it reconnects to the needed instance
and runs it there. After the command run safe_load() is called to parse
the result. Found that some times safe_load() routine may fail with
AttributeError internal exception. It happens, because result value
from running command on the instance was not checked that it was not
empty. To fix it was added waiting loop for replica respond before
load it in safe_load() routine.

Check the following output from gitlab-ci job [1]:
      DEBUG: sending command: test_run:wait_fullmesh(SERVERS)
      DEBUG: test-run received command: config engine
      DEBUG: test-run's response for [config engine]
       | !!python/unicode 'engine': !!python/unicode 'memtx'
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | - id: null
       |   lsn: -1
       |   ro: true
       |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | - id: null
       |   lsn: -1
       |   ro: true
       |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
       | ...
      DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
    
      TarantoolInpector.handle() received the following error:
      Traceback (most recent call last):
        File "test-run/lib/inspector.py", line 100, in handle
          result = self.parser.parse_preprocessor(line)
        File "test-run/lib/preprocessor.py", line 87, in parse_preprocessor
          return self.lua_eval(name, expr[1:-1])
        File "test-run/lib/preprocessor.py", line 404, in lua_eval
          result = yaml.safe_load(result)
        File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 162, in safe_load
          return load(stream, SafeLoader)
        File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 112, in load
          loader = Loader(stream)
        File "/usr/local/lib/python2.7/site-packages/yaml/loader.py", line 34, in __init__
          Reader.__init__(self, stream)
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 87, in __init__
          self.determine_encoding()
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 126, in determine_encoding
          self.update_raw()
        File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 183, in update_raw
          data = self.stream.read(size)
      AttributeError: 'NoneType' object has no attribute 'read'
      DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
       | error: AttributeError("'NoneType' object has no attribute 'read'",)
       | ...
      Kill all servers ...
    
      [Instance "autobootstrap1" returns with non-zero exit code: 1]
It is seen here that wait_fullmesh() routine waits for the server in
loop and after successfull 2 checks it fails on the 3rd one.

after fix it works like this [2] (file: log/096_replication.log, line: 29111):

[2020-11-30 13:02:56.799806] DEBUG: test-run received command: eval replica_uuid_ro1 "return box.info.server"
[2020-11-30 13:02:56.802305] DEBUG: test-run's response for [eval replica_uuid_ro1 "return box.info.server"]
[2020-11-30 13:02:56.802305]  | - id: 1
[2020-11-30 13:02:56.802305]  |   lsn: 8
[2020-11-30 13:02:56.802305]  |   ro: false
[2020-11-30 13:02:56.802305]  |   uuid: 9cdee878-98e1-4a6f-af4d-fd19d999aa2d
[2020-11-30 13:02:56.802305]  | ...
[2020-11-30 13:02:56.803032] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.804264] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.804264]  | - null
[2020-11-30 13:02:56.804264]  | ...
[2020-11-30 13:02:56.863545] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.865426] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.865426]  | - null
[2020-11-30 13:02:56.865426]  | ...
[2020-11-30 13:02:56.896547] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.898011] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.898011]  | - null
[2020-11-30 13:02:56.898011]  | ...
[2020-11-30 13:02:56.958887] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:56.960591] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:56.960591]  | - null
[2020-11-30 13:02:56.960591]  | ...
[2020-11-30 13:02:57.050987] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.052878] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.052878]  | - null
[2020-11-30 13:02:57.052878]  | ...
[2020-11-30 13:02:57.107417] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.109231] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.109231]  | - null
[2020-11-30 13:02:57.109231]  | ...
[2020-11-30 13:02:57.188574] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.190406] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.190406]  | - null
[2020-11-30 13:02:57.190406]  | ...
[2020-11-30 13:02:57.261315] DEBUG: test-run received command: eval replica_uuid_ro2 "box.info.replication[1]"
[2020-11-30 13:02:57.265057] DEBUG: test-run's response for [eval replica_uuid_ro2 "box.info.replication[1]"]
[2020-11-30 13:02:57.265057]  | - id: 1
[2020-11-30 13:02:57.265057]  |   lsn: 10
[2020-11-30 13:02:57.265057]  |   upstream:
[2020-11-30 13:02:57.265057]  |     idle: 0.012409999966621
[2020-11-30 13:02:57.265057]  |     lag: 0.00017714500427246
[2020-11-30 13:02:57.265057]  |     peer: cluster@unix/:/private/tmp/tnt/096_replication/replica_uuid_ro1.sock
[2020-11-30 13:02:57.265057]  |     status: follow
[2020-11-30 13:02:57.265057]  |   uuid: 9cdee878-98e1-4a6f-af4d-fd19d999aa2d
[2020-11-30 13:02:57.265057]  | ...

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/878962286
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/879929201/artifacts/download

Closes tarantool/tarantool#5572

@avtikhon avtikhon requested a review from Totktonada November 3, 2020 13:02
try:
result = yaml.safe_load(result)
except AttributeError:
result = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TarantoolAsyncConnection returns None at socket.error. There are several comments here:

  1. How can I reproduce the situation?
  2. Please, don't use try-except, where 'foo' in my_dict or x is None may be used. The exception handling is not cheap thing.
  3. As in PR Fix server_stop() routine on broken connections #228 and PR Fix lua_eval() routine on broken connections #230, we should not hide the error and report an empty result to the caller.
  4. There is nothing called kill_all_servers(), please, change or clarify (it is about the commit message).

@avtikhon avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from fade4c4 to bc80d8a Compare November 21, 2020 05:23
@Totktonada
Copy link
Member

It is unclear how to reproduce the situation. There are also unresolved comments regarding the implementation.

I'll close the PR so. If the problem persists, feel free to reopen the PR with description how to reproduce the issue (a mangled tarantool implementation / a mangled test are okay) and with an implementation updated according to the comments above. Or file an issue with a reproducer.

@Totktonada Totktonada closed this Nov 27, 2020
@Totktonada Totktonada deleted the avtikhon/kill_all_servers_exception branch November 27, 2020 21:07
@avtikhon avtikhon self-assigned this Nov 27, 2020
@avtikhon avtikhon changed the title Fix lua_eval() routine on AttributeError exception Fix fail on lua_eval() routine at wait_fullmesh() call on checking replicas missing information Nov 28, 2020
@Totktonada Totktonada restored the avtikhon/kill_all_servers_exception branch November 29, 2020 02:26
@avtikhon avtikhon changed the title Fix fail on lua_eval() routine at wait_fullmesh() call on checking replicas missing information Check data exists before load in safe_load() Nov 29, 2020
@avtikhon
Copy link
Contributor Author

Please check the new way to fix the issue. The issue is flaky and it can't be reproduce too easy, so only checking the logs in past currently available. I hope to find the way to reproduce it and working on it.

@Totktonada Totktonada reopened this Nov 29, 2020
@Totktonada Totktonada force-pushed the avtikhon/kill_all_servers_exception branch from bc80d8a to 3a8774f Compare November 29, 2020 23:22
@Totktonada
Copy link
Member

Pushed-pushed the old state (due to isaacs/github#361), reopened, force-pushed the new state back.

@Totktonada
Copy link
Member

I looked at the changes. You again try to propose a 'fix' without any reproducer, even synthetic one. So I just have no any information, based on which I can say that the change is okay or not okay.

@avtikhon avtikhon changed the title Check data exists before load in safe_load() Wait replica respond before load it in safe_load() Nov 30, 2020
@avtikhon avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from 3a8774f to df32290 Compare November 30, 2020 06:23
@avtikhon avtikhon force-pushed the avtikhon/kill_all_servers_exception branch 2 times, most recently from 51182fe to f5686da Compare December 6, 2020 11:24
lua_eval() routine is used for evaluating lua command on a given
instance. Before run the command it reconnects to the needed instance
and runs it there. After the command run safe_load() is called to parse
the result. Found that some times safe_load() routine may fail with
AttributeError internal exception. It happens, because result value
from running command on the instance was not checked that it was not
empty. To fix it was added waiting loop for replica respond before
load it in safe_load() routine.

Check the following output from gitlab-ci job [1]:

  DEBUG: sending command: test_run:wait_fullmesh(SERVERS)
  DEBUG: test-run received command: config engine
  DEBUG: test-run's response for [config engine]
   | !!python/unicode 'engine': !!python/unicode 'memtx'
   | ...
  DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
  DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
   | - id: null
   |   lsn: -1
   |   ro: true
   |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
   | ...
  DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"
  DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
   | - id: null
   |   lsn: -1
   |   ro: true
   |   uuid: e84d86bd-d2fc-43c1-b832-8146d1f02cd1
   | ...
  DEBUG: test-run received command: eval autobootstrap1 "return box.info.server"

  TarantoolInpector.handle() received the following error:
  Traceback (most recent call last):
    File "test-run/lib/inspector.py", line 100, in handle
      result = self.parser.parse_preprocessor(line)
    File "test-run/lib/preprocessor.py", line 87, in parse_preprocessor
      return self.lua_eval(name, expr[1:-1])
    File "test-run/lib/preprocessor.py", line 404, in lua_eval
      result = yaml.safe_load(result)
    File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 162, in safe_load
      return load(stream, SafeLoader)
    File "/usr/local/lib/python2.7/site-packages/yaml/__init__.py", line 112, in load
      loader = Loader(stream)
    File "/usr/local/lib/python2.7/site-packages/yaml/loader.py", line 34, in __init__
      Reader.__init__(self, stream)
    File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 87, in __init__
      self.determine_encoding()
    File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 126, in determine_encoding
      self.update_raw()
    File "/usr/local/lib/python2.7/site-packages/yaml/reader.py", line 183, in update_raw
      data = self.stream.read(size)
  AttributeError: 'NoneType' object has no attribute 'read'
  DEBUG: test-run's response for [eval autobootstrap1 "return box.info.server"]
   | error: AttributeError("'NoneType' object has no attribute 'read'",)
   | ...
  Kill all servers ...

  [Instance "autobootstrap1" returns with non-zero exit code: 1]

It is seen here that wait_fullmesh() routine waits for the server in
loop and after successfull 2 checks it fails on the 3rd one.

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/878962286

Closes tarantool/tarantool#5572
@avtikhon avtikhon force-pushed the avtikhon/kill_all_servers_exception branch from f5686da to b0a2ead Compare January 18, 2021 12:44
@Totktonada
Copy link
Member

No responses from the author of the PR for more than three months. And we still need a reproducer, even if it'll be probabilistic. The point regarding error hiding is actual: we'll not ever know that something goes wrong. Aside of this, the new patch makes it unclear, whether test-run may cycle for the infinite time (what if an instance is crashed?).

I would not object against some kind of good error reporting that will help us to spot a rare problem (collect and report all helpful information and dump it to the terminal). But attempts to blindly fix such problems is not a way to go. We'll just got some other problems like worker hangs.

I'll close the PR as stale. Feel free to reopen, if you'll get some kind of reproducer.

@Totktonada Totktonada closed this Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants