Add signal argument to server restart routine #236


Closed

Conversation

avtikhon
Contributor

@avtikhon avtikhon commented Nov 11, 2020

Found that the stop and restart routines should have the ability to set
the signal used by the restart routine, to fix issues like:

Test hung! Result content mismatch:
--- replication-py/init_storage.result	Wed Aug 26 06:06:15 2020
+++ /tmp/tnt/101_replication-py/init_storage.result	Wed Nov 11
10:17:50 2020
@@ -130,5 +130,3 @@
 -------------------------------------------------------------
 waiting reconnect on JOIN...
 ok
-waiting reconnect on SUBSCRIBE...
-ok

@avtikhon avtikhon requested a review from Totktonada November 11, 2020 21:03
@avtikhon avtikhon self-assigned this Nov 11, 2020
@avtikhon avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch from d344a7b to 591b576 Compare November 21, 2020 05:10
@avtikhon avtikhon force-pushed the avtikhon/add-signal-to-restart branch from e40cba4 to 60e9fce Compare November 21, 2020 05:12
Totktonada and others added 24 commits November 26, 2020 21:04
I looked over the places where LuaPreprocessorException is raised, and
they do not look like show-stoppers. Those situations involve incorrect
commands (say, a parsing error) or commands evaluated in an incorrect
state (e.g. an attempt to start a server that is not created yet). We
can return an error to the test, report a notice to the terminal, and
continue the test. The decision whether such cases should fail the test
is now up to the test author.

The new format of displaying the notice on the terminal shifts the
perception of the error from 'looks like a testing system problem' to
'the test performs unexpected commands'.

Before the change, such situations were reported to the terminal this
way:

 | [001] TarantoolInpector.handle() received the following error:
 | [001] <..a python stack trace and an error message..>
 | [001] LuaPreprocessorException: lua preprocessor error: "Can't stop nonexistent server 'foo'"

And the test stops.

After the change, it looks like this:

 | [001] * [QA Notice]
 | [001] *
 | [001] * Can't stop nonexistent server 'foo'
 | [001] *

And the test receives a Lua error. Usually this leads to a test
failure; however, it is now possible to catch the error with pcall()
and handle it.

The idea appeared while I was looking into [1]. The test creates a
full-mesh replicaset, spins in a loop stopping and starting instances,
and then drops the replicaset. There are several situations that
correctly lead to a failure of this particular test, but may be
considered appropriate in other tests.

1. The replicaset instances are numbered from 1 to 3, but when the test
   is unable to determine an instance to stop within 60 seconds (the
   default test_run:wait_cond() timeout), it tries to stop the 0th
   instance. That is an attempt to stop a nonexistent instance.
2. Similar to problem 1, but with an attempt to evaluate a command on a
   nonexistent instance.
3. An error may occur between stopping and starting an instance. In
   this case the loop exits, and test_run:drop_cluster() is unable to
   stop all instances in the replicaset.

The first problem is now reported as it should be (with the QA Notice,
as shown above). The second will be reported this way after the next
commit. The third requires ignoring 'stop server <...>' failures in
test_run:drop_cluster() and will be handled in a separate commit within
this patchset.

While I'm here, I changed several related things:

* Color 'info' messages yellow. It affects several rarely shown
  existing messages, which looks okay.
* Removed the 'lua preprocessor error' prefix from the error message:
  it does not add any information and so looks useless.
* Fixed conversion of the error to string: repr() -> str().
* Show an unhandled Lua error from a test-run command evaluation
  without the path to the test_run.lua file: it is useless and looks
  ugly (in particular, a colon between the path and the error message
  forces the yaml emitter to wrap the string into quotes and so
  duplicate all quotes inside).

[1]: https://github.com/tarantool/tarantool/issues/5395
Handled the following cases:

* test_run:cmd('stop server foo') is called against a stopped server.
* test_run:eval('foo', '<lua expression>') is called against a
  nonexistent server.

Found while looking into [1]. See the previous commit for a somewhat
broader description and reasoning.

[1]: https://github.com/tarantool/tarantool/issues/5395
Since the method is used as cleanup after a test / a test case, it does
not look worthwhile to fail the test when one of the instances is
already stopped. However, test-run's QA notice will be printed to the
terminal in this case: it is possible that the situation is unexpected
and some subtle problem exists. It is worth letting a developer know
about it.

Found while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
We already had log messages for creating, starting, and stopping a
server, but without the response. Now all commands are traced, so those
messages became redundant and were removed.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
In the next commit, I'll add tracing of commands that are sent to
tarantool and their responses. In order to trace `\set language <lang>`
too, it is convenient to have code that sends a command and reads a
response in one place.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
The commands and responses are written into test-run's log file, which
is named like test/var/log/001_replication.log (where '001' is the
worker number and 'replication' is the test suite name).

A command is logged exactly as it is sent to tarantool: without
newlines in the middle[^1]. It'll look a bit weird for multiline
commands, but it may ease debugging if a problem lies in the way
test-run formats a command before sending it, or somewhere around the
console module's command processing.

[^1]: The last newline is stripped from logs for readability.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
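
A minimal sketch of the logging convention described above, assuming a
helper named `log_command` (the name and the file handling are
illustrative, not test-run's actual code):

```python
import io


def log_command(logfile, command):
    # Log the command exactly as it was sent to tarantool; strip only
    # the trailing newline, for readability (per the footnote above).
    if command.endswith('\n'):
        command = command[:-1]
    logfile.write(command + '\n')


# A multiline command stays on one logical log entry: only the final
# newline is removed, any mid-string newlines are kept as sent.
buf = io.StringIO()
log_command(buf, 'box.cfg{}\n')
```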
I plan to reuse the function outside of the OutputWatcher class in the
next commit.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
A time difference between events may be important for debugging a test.

For example, we start a tarantool instance at time X, and at time Y it
says 'entering the event loop' in the log and reports 'orphan' status
in box.info.status. If the wait is unexpectedly long, we can make a
guess about what is going wrong: say, compare (Y-X) against the
box.cfg.replication_sync_timeout value.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
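
Prepending a timestamp to each entry could be sketched like this; the
exact format is an assumption, not necessarily what test-run emits:

```python
import datetime


def with_timestamp(message):
    # Prepend a wall-clock timestamp with millisecond precision, so the
    # Y-X difference between two entries can be read off the log.
    ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    return '[{}] {}'.format(ts, message)
```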
test-run outputs some lines in several chunks, e.g.
"engine/tuple.test.lua", then "memtx", then "[ pass ]". When test-run
prints to the terminal, it accumulates the chunks and writes them at
once so as not to mix output from different workers. Logs cannot be
buffered this way: otherwise we could miss information that is valuable
for debugging.

After the previous commit ('logging: prepend each log entry with a
timestamp') we write each such chunk on its own line, and each looks as
if it were a full line. In order to distinguish a chunk from a full
line, add `<no newline>` after a chunk that has no newline at the end.

The difference does not look important, but it should make it easier to
understand the relation between the strings shown on the terminal and
the ones written to the test-run log file.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
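
The marker convention can be sketched as a one-liner (the function name
is illustrative):

```python
def render_chunk(chunk):
    # A chunk that already ends with a newline is a full line; anything
    # else gets an explicit marker so it is not mistaken for one.
    if chunk.endswith('\n'):
        return chunk
    return chunk + ' <no newline>\n'
```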
When the option is passed, test-run prints debugging messages on the
terminal in addition to writing them to the log file.

It may be convenient for debugging a test.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
All log entries in a test-run log file are prepended with a timestamp,
because this information may be useful for debugging a test[^1]. Let's
also show timestamps for debug messages on the terminal when the
--debug option is passed.

[^1]: See commit 'logging: prepend each log entry with a timestamp' for
      details.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
Added more information about installing, starting, and stopping
instances: the instance name, process details (PID, cmdline, custom
process title), loading status, and the box.info.status value after
loading.

Revisited existing messages to keep them formatted in a similar way. It
should make reading a log file easier.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
The idea of this timer is to stop testing if nothing is written to the
terminal for a long time (120 seconds by default).

Before this commit, the timer was reset on any event: say, when an
entry was written to a test-run log file. That is counter-intuitive and
sometimes may look like the timer just does not work.

Let's consider the following snippet of test code:

 | -- Wait until <cmd> on instance 'foo' will return
 | -- an expected value.
 | local cmd = <...>
 | local exp = <...>
 | while true do                                \
 |     local res = test_run:eval('foo', cmd)[1] \
 |     if res == exp then                       \
 |         break                                \
 |     end                                      \
 | end

After commit 'logging: trace test-run commands and responses' we log
test-run commands, so this snippet would effectively disable the
no-output timer. If something goes wrong and the <cmd> command never
returns the expected result, the loop will spin infinitely.

After this commit, logging events will not reset the timer, so testing
will finish eventually (after 120 seconds without output to the
terminal).

However, if --debug is passed, the logging events are printed on the
terminal and do reset the timer. That looks logical, because the timer
is about printed messages, but it means the behaviour of the testing
system may depend on whether the --debug option is passed. I don't know
which behaviour would be better, so I implemented the most intuitive
one.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
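
The reset policy above can be sketched as a small watchdog; the class
and method names are illustrative, not test-run's real API:

```python
import time


class NoOutputWatchdog:
    """No-output timer: reset only by terminal output, never by
    log-file writes."""

    def __init__(self, timeout=120):
        self.timeout = timeout
        self.last_output = time.monotonic()

    def on_terminal_output(self):
        # Only output to the terminal resets the timer.
        self.last_output = time.monotonic()

    def on_log_write(self):
        # Log-file writes intentionally do NOT reset the timer, so a
        # busy loop that only produces log traffic still times out.
        pass

    def expired(self):
        return time.monotonic() - self.last_output > self.timeout
```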
Drop the unnecessary test_debug() function: it is not used anywhere in
Python tests and was used only to implement the property.

In one of the following commits, regarding the --snapshot and
--disable-schema-upgrade options, we'll need to determine whether a
given tarantool build contains debugging functions (error injections)
before creating any TarantoolServer instance.

Part of #240
This utility function will be used in one of the following commits
regarding --snapshot and --disable-schema-upgrade options.

Co-authored-by: Alexander Turenko <[email protected]>

Part of #240
The '--snapshot' option specifies a path to a snapshot that will be
loaded into Tarantool before testing.

The '--disable-schema-upgrade' option allows skipping execution of the
Tarantool upgrade script using the error injection mechanism.

Closes #240
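
The option names come from the commit message; how they might be
declared is sketched below with argparse (the parser setup itself is an
assumption about the implementation):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--snapshot', metavar='PATH', default=None,
                    help='path to a snapshot loaded before testing')
parser.add_argument('--disable-schema-upgrade', action='store_true',
                    help='skip the upgrade script via error injection')

# Example invocation with both options set.
args = parser.parse_args(['--snapshot', '/tmp/snap/00000.snap',
                          '--disable-schema-upgrade'])
```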
- add trailing newlines to error messages
- remove extra whitespace in error messages
- break the loop on the first option match
- use the proper dict name to get options

Closes #241
When there is no output from workers for a long time (10 seconds by
default, or 60 seconds when the --long argument is passed), test-run
prints a warning and shows the number of lines in the temporary result
file. It is useful for understanding on which statement a test hangs.

I reproduced the problem by mangling tarantool to ignore the SIGTERM
and SIGINT signals and running a simple 'tarantool = core' test. The
test successfully passes, but the worker gets stuck waiting for the
tarantool server to stop.

This particular case should be resolved by PR #186, but only because
the timeout for stopping the server is less than the warning delay.
This assumption looks fragile, especially if we want to make some of
those timeouts / delays configurable. Let's handle the situation when
the file does not exist.

Found while looking into
https://github.com/tarantool/tarantool/issues/5573

Fixes #245
Now we have only one place where a [QA Notice] message is printed. But
we'll have more, and it is convenient to have one function for this
purpose.

The function should be used in places where we observe incorrect
behaviour of tarantool or a test, but want to lower it to a warning and
proceed. Of course, there should be a solid reason to do so.

Part of #157
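
A sketch of such a helper, producing the banner format shown earlier in
the patchset (the function name and exact layout are assumptions; the
worker prefix like `[001]` is added elsewhere):

```python
def qa_notice(*lines):
    # Frame the message in the '* [QA Notice]' banner shown on the
    # terminal when tarantool or a test misbehaves but the situation is
    # deliberately lowered to a warning.
    res = ['* [QA Notice]', '*']
    for line in lines:
        res.append('* ' + line)
    res.append('*')
    return '\n'.join(res)


banner = qa_notice("Can't stop nonexistent server 'foo'")
```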
If a tarantool instance does not handle SIGTERM correctly, it is a bug
in tarantool. A testing system should reveal problems rather than hide
them. However, this patch does exactly that, and I'll explain why.

Many of our tests perform instance starting / stopping: they are
written to spot various problems, but not this one. If everything
except hanging at stop is okay, there is no reason to fail the test. We
print a notice to the terminal in this case and proceed.

In the future, when all problems of this kind are resolved, we can
interpret the presence of such notices as an error. However, right now
it would not add anything to the quality of our testing.

We had a problem of this kind ([1]) in the past and now have another
one ([2]). Until it is resolved, it is worth handling the situation on
the testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of #157

Co-authored-by: Alexander Turenko <[email protected]>
test-run copies the .tarantoolctl file from the test/ directory to a
worker's working directory (e.g. test/var/001_app). It would be quite
intuitive if tarantoolctl used this copy when operating on tarantool
instances of the worker. Surprisingly, it is not so.

tarantoolctl checks whether the file exists inside the directory
pointed to by the PWD environment variable, not the real current
working directory. test-run did not set PWD prior to this commit, yet
everything worked. Why? Because test-run.py is always executed from the
test/ directory, which contains the same .tarantoolctl file. Even the
`make test` / `make test-force` targets issue commands like `cd
<...>/test && <...>/test-run.py <...>` (where `cd` sets PWD).

The time has come. Let's set PWD and use the
test/var/001_app/.tarantoolctl configuration for instances of the
001_app worker.

Side note: it would be good to fix tarantoolctl too, so that it uses
the real cwd, not PWD.

This inconsistency was spotted when we attempted to remove the
test/.tarantoolctl config from the main tarantool repository and move
it to the test-run repository, while keeping the copy in a worker
directory.

As a side effect, test-run can now be invoked without manually changing
the current directory to test/. So, where we used to run a command like
this (example for an in-source build):

 | (cd test && ./test-run.py)

The following command can be used:

 | ./test/test-run.py

Nice!

Sadly, the trick `Popen(<...>, env=dict(os.environ, PWD=self.vardir))`
does not work for us, because we use os.putenv() in test-run and in
Python tests. os.putenv() does not update os.environ, so we would feed
a stale environment to the subprocess. Because of this, we just set PWD
before the Popen() call and restore it afterwards.

Part of tarantool/tarantool#5504
Part of #78

Co-authored-by: Alexander V. Tikhonov <[email protected]>
Moved .tarantoolctl to test-run tool submodule repository as:

  <tarantool repository>/test-run/.tarantoolctl

Also kept backward compatibility with the old location:

  <tarantool repository>/test/.tarantoolctl

which is still checked as the primary place.

Needed for tarantool/tarantool#5504
Part of #78
Found that tests may fail due to a hang in the seek_wait() loop while
starting and stopping instances. It happened because instances were not
synced within the no-output timeout (120 seconds by default), while
replication sync takes up to 300 seconds by default. To fix it,
replication_sync_timeout should be decreased to a value lower than
'no-output-timeout', so I decided to set it to 100 seconds.

The issue looked like this in tests:

  --- replication/gh-5140-qsync-casc-rollback.result    Mon Oct 19
  17:29:46 2020
  +++ /rw_bins/test/var/070_replication/gh-5140-qsync-casc-rollback.result
  Mon Oct 19 17:31:35 2020
  @@ -169,56 +169,3 @@
   -- all the records are replayed one be one without yields for WAL writes, and
   -- nothing should change.
   test_run:cmd('restart server default')
  - |
  -test_run:cmd('restart server replica')
  - | ---
  - | - true
  - | ...
  -

Part of tarantool/tarantool#5504
@Totktonada
Member

I guess it was resolved, without changing the Python tests, by PR #186.
Right?

@avtikhon avtikhon force-pushed the avtikhon/add-signal-to-restart branch from 60e9fce to a0b89ce Compare December 6, 2020 10:52
@avtikhon
Contributor Author

avtikhon commented Dec 6, 2020

Right, fixed.

@avtikhon avtikhon closed this Dec 6, 2020
@avtikhon avtikhon deleted the avtikhon/add-signal-to-restart branch December 6, 2020 11:03