Add signal argument to server restart routine #236


Closed

Conversation

avtikhon
Contributor

@avtikhon avtikhon commented Nov 11, 2020

Found that the stop and restart routines should have the ability to set
the signal used by the restart routine, to fix issues like:

Test hung! Result content mismatch:
--- replication-py/init_storage.result	Wed Aug 26 06:06:15 2020
+++ /tmp/tnt/101_replication-py/init_storage.result	Wed Nov 11
10:17:50 2020
@@ -130,5 +130,3 @@
 -------------------------------------------------------------
 waiting reconnect on JOIN...
 ok
-waiting reconnect on SUBSCRIBE...
-ok

@avtikhon avtikhon requested a review from Totktonada November 11, 2020 21:03
@avtikhon avtikhon self-assigned this Nov 11, 2020
@avtikhon avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch from d344a7b to 591b576 Compare November 21, 2020 05:10
@avtikhon avtikhon force-pushed the avtikhon/add-signal-to-restart branch from e40cba4 to 60e9fce Compare November 21, 2020 05:12
Totktonada and others added 24 commits November 26, 2020 21:04
I looked over the places where LuaPreprocessorException is raised, and
they do not look like show-stoppers. Those situations involve incorrect
commands (say, a parsing error) or commands evaluated in an incorrect
state (e.g. an attempt to start a server that is not created yet). We
can return an error to the test, report a notice to the terminal, and
continue the test. The decision whether such cases should fail the test
is now up to the test author.

The new format of displaying the notice on the terminal shifts the
perception of the error from 'looks like a testing system problem' to
'the test performs unexpected commands'.

Before the change, such situations were reported to the terminal this
way:

 | [001] TarantoolInpector.handle() received the following error:
 | [001] <..a python stack trace and an error message..>
 | [001] LuaPreprocessorException: lua preprocessor error: "Can't stop nonexistent server 'foo'"

And the test stops.

After the change, it looks like this:

 | [001] * [QA Notice]
 | [001] *
 | [001] * Can't stop nonexistent server 'foo'
 | [001] *

And the test receives a Lua error. Usually this leads to a test
failure; however, it is now possible to catch the error with pcall()
and handle it.

The idea appeared while I was looking into [1]. The test creates a
full-mesh replicaset, spins in a loop stopping and starting instances,
and then drops the replicaset. There are several situations that
correctly lead to a failure of this particular test, but may be
considered appropriate in other tests.

1. The replicaset instances are numbered from 1 to 3, but when the test
   is unable to determine an instance to stop within 60 seconds (the
   default test_run:wait_cond() timeout), it tries to stop the 0th
   instance. That is an attempt to stop a nonexistent instance.
2. Similar to problem 1, but with an attempt to evaluate a command on a
   nonexistent instance.
3. An error may occur between stopping and starting an instance. In
   this case the loop exits, and test_run:drop_cluster() is unable to
   stop all instances in the replicaset.

The first problem is now reported as it should be (with the QA Notice,
as shown above). The second will be reported this way after the next
commit. The third requires ignoring 'stop server <...>' failures in
test_run:drop_cluster() and will be handled in a separate commit within
this patchset.

While I'm here, I changed several related things:

* Color 'info' messages yellow. It affects several rarely shown
  existing messages, which looks okay.
* Removed the 'lua preprocessor error' prefix from the error message:
  it does not add any information and so looks useless.
* Fixed conversion of the error to string: repr() -> str().
* Show an unhandled Lua error from a test-run command evaluation
  without the path to the test_run.lua file: it is useless and looks
  ugly (in particular, a colon between the path and the error message
  forces the yaml emitter to wrap the string into quotes and so
  duplicate all quotes inside).

[1]: https://github.com/tarantool/tarantool/issues/5395
Handled the following cases:

* test_run:cmd('stop server foo') is called against a stopped server.
* test_run:eval('foo', '<lua expression>') is called against a
  nonexistent server.

Found while looking into [1]. See the previous commit for a somewhat
broader description and reasoning.

[1]: https://github.com/tarantool/tarantool/issues/5395
Since the method is used as cleanup after a test / a test case, it does
not look worthwhile to fail the test when one of the instances is
already stopped. However, test-run's QA notice will be printed to the
terminal in this case: it is possible that the situation is unexpected
and some subtle problem exists. It is worth letting a developer know
about it.

Found while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
We already had log messages for creating, starting, and stopping a
server, but without the response. Now all commands are traced, so those
messages became redundant and were removed.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
In the next commit, I'll add tracing of commands that are sent to
tarantool and their responses. In order to trace `\set language <lang>`
too, it is convenient to have code that sends a command and reads a
response in one place.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
The commands and responses are written into test-run's log file, which
is named like test/var/log/001_replication.log (where '001' is the
worker number and 'replication' is the test suite name).

A command is logged exactly as it is sent to tarantool: without
newlines in the middle[^1]. It'll look a bit weird for multiline
commands, but it may ease debugging if a problem lies in the way
test-run formats a command before sending it, or somewhere around the
console module's command processing.

[^1]: The last newline is stripped from logs for readability.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
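
A minimal sketch of the logging convention described above, assuming a
helper named `log_command` (the name and the file handling are
illustrative, not test-run's actual code):

```python
import io


def log_command(logfile, command):
    # Log the command exactly as it was sent to tarantool; strip only
    # the trailing newline, for readability (per the footnote above).
    if command.endswith('\n'):
        command = command[:-1]
    logfile.write(command + '\n')


# A multiline command stays on one logical log entry: only the final
# newline is removed, any mid-string newlines are kept as sent.
buf = io.StringIO()
log_command(buf, 'box.cfg{}\n')
```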
I plan to reuse the function outside of the OutputWatcher class in the
next commit.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
A time difference between events may be important for debugging a test.

For example, we start a tarantool instance at time X, and at time Y it
says 'entering the event loop' in the log and reports 'orphan' status
in box.info.status. If the wait is unexpectedly long, we can make a
guess about what is going wrong: say, compare (Y-X) against the
box.cfg.replication_sync_timeout value.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
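
Prepending a timestamp to each entry could be sketched like this; the
exact format is an assumption, not necessarily what test-run emits:

```python
import datetime


def with_timestamp(message):
    # Prepend a wall-clock timestamp with millisecond precision, so the
    # Y-X difference between two entries can be read off the log.
    ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    return '[{}] {}'.format(ts, message)
```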
test-run outputs some lines in several chunks, e.g.
"engine/tuple.test.lua", then "memtx", then "[ pass ]". When test-run
prints to the terminal, it accumulates the chunks and writes them at
once so as not to mix output from different workers. Logs cannot be
buffered this way: otherwise we could miss information that is valuable
for debugging.

After the previous commit ('logging: prepend each log entry with a
timestamp') we write each such chunk on its own line, and each looks as
if it were a full line. In order to distinguish a chunk from a full
line, add `<no newline>` after a chunk that has no newline at the end.

The difference does not look important, but it should make it easier to
understand the relation between the strings shown on the terminal and
the ones written to the test-run log file.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
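
The marker convention can be sketched as a one-liner (the function name
is illustrative):

```python
def render_chunk(chunk):
    # A chunk that already ends with a newline is a full line; anything
    # else gets an explicit marker so it is not mistaken for one.
    if chunk.endswith('\n'):
        return chunk
    return chunk + ' <no newline>\n'
```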
When the option is passed, test-run prints debugging messages on the
terminal in addition to writing them to the log file.

It may be convenient for debugging a test.

The idea appeared while looking into [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
All log entries in a test-run log file are prepended with a timestamp,
because this information may be useful for debugging a test[^1]. Let's
also show timestamps for debug messages on the terminal when the
--debug option is passed.

[^1]: See commit 'logging: prepend each log entry with a timestamp' for
      details.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
Added more information about installing, starting, and stopping
instances: the instance name, process details (PID, cmdline, custom
process title), loading status, and the box.info.status value after
loading.

Revisited existing messages to keep them formatted in a similar way. It
should make reading a log file easier.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
The idea of this timer is to stop testing if nothing is written to the
terminal for a long time (120 seconds by default).

Before this commit, the timer was reset on any event: say, when an
entry was written to a test-run log file. That is counter-intuitive and
sometimes may look like the timer just does not work.

Let's consider the following snippet of test code:

 | -- Wait until <cmd> on instance 'foo' will return
 | -- an expected value.
 | local cmd = <...>
 | local exp = <...>
 | while true do                                \
 |     local res = test_run:eval('foo', cmd)[1] \
 |     if res == exp then                       \
 |         break                                \
 |     end                                      \
 | end

After commit 'logging: trace test-run commands and responses' we log
test-run commands, so this snippet would effectively disable the
no-output timer. If something goes wrong and the <cmd> command never
returns the expected result, the loop will spin infinitely.

After this commit, logging events will not reset the timer, so testing
will finish eventually (after 120 seconds without output to the
terminal).

However, if --debug is passed, the logging events are printed on the
terminal and do reset the timer. That looks logical, because the timer
is about printed messages, but it means the behaviour of the testing
system may depend on whether the --debug option is passed. I don't know
which behaviour would be better, so I implemented the most intuitive
one.

Part of activities around [1].

[1]: https://github.com/tarantool/tarantool/issues/5395
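
The reset policy above can be sketched as a small watchdog; the class
and method names are illustrative, not test-run's real API:

```python
import time


class NoOutputWatchdog:
    """No-output timer: reset only by terminal output, never by
    log-file writes."""

    def __init__(self, timeout=120):
        self.timeout = timeout
        self.last_output = time.monotonic()

    def on_terminal_output(self):
        # Only output to the terminal resets the timer.
        self.last_output = time.monotonic()

    def on_log_write(self):
        # Log-file writes intentionally do NOT reset the timer, so a
        # busy loop that only produces log traffic still times out.
        pass

    def expired(self):
        return time.monotonic() - self.last_output > self.timeout
```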
Drop the unnecessary test_debug() function: it is not used anywhere in
Python tests and was used only to implement the property.

In one of the following commits, regarding the --snapshot and
--disable-schema-upgrade options, we'll need to determine whether a
given tarantool build contains debugging functions (error injections)
before creating any TarantoolServer instance.

Part of #240
This utility function will be used in one of the following commits
regarding --snapshot and --disable-schema-upgrade options.

Co-authored-by: Alexander Turenko <[email protected]>

Part of #240
The '--snapshot' option specifies a path to a snapshot that will be
loaded into Tarantool before testing.

The '--disable-schema-upgrade' option allows skipping execution of the
Tarantool upgrade script using the error injection mechanism.

Closes #240
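
The option names come from the commit message; how they might be
declared is sketched below with argparse (the parser setup itself is an
assumption about the implementation):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--snapshot', metavar='PATH', default=None,
                    help='path to a snapshot loaded before testing')
parser.add_argument('--disable-schema-upgrade', action='store_true',
                    help='skip the upgrade script via error injection')

# Example invocation with both options set.
args = parser.parse_args(['--snapshot', '/tmp/snap/00000.snap',
                          '--disable-schema-upgrade'])
```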
- add trailing newlines to error messages
- remove extra whitespace in error messages
- break the loop on the first option match
- use the proper dict name to get options

Closes #241
When there is no output from workers for a long time (10 seconds by
default, or 60 seconds when the --long argument is passed), test-run
prints a warning and shows the number of lines in the temporary result
file. It is useful for understanding on which statement a test hangs.

I reproduced the problem by mangling tarantool to ignore the SIGTERM
and SIGINT signals and running a simple 'tarantool = core' test. The
test successfully passes, but the worker gets stuck waiting for the
tarantool server to stop.

This particular case should be resolved by PR #186, but only because
the timeout for stopping the server is less than the warning delay.
This assumption looks fragile, especially if we want to make some of
those timeouts / delays configurable. Let's handle the situation when
the file does not exist.

Found while looking into
https://github.com/tarantool/tarantool/issues/5573

Fixes #245
Now we have only one place where a [QA Notice] message is printed. But
we'll have more, and it is convenient to have one function for this
purpose.

The function should be used in places where we observe incorrect
behaviour of tarantool or a test, but want to lower it to a warning and
proceed. Of course, there should be a solid reason to do so.

Part of #157
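
A sketch of such a helper, producing the banner format shown earlier in
the patchset (the function name and exact layout are assumptions; the
worker prefix like `[001]` is added elsewhere):

```python
def qa_notice(*lines):
    # Frame the message in the '* [QA Notice]' banner shown on the
    # terminal when tarantool or a test misbehaves but the situation is
    # deliberately lowered to a warning.
    res = ['* [QA Notice]', '*']
    for line in lines:
        res.append('* ' + line)
    res.append('*')
    return '\n'.join(res)


banner = qa_notice("Can't stop nonexistent server 'foo'")
```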
If a tarantool instance does not handle SIGTERM correctly, it is a bug
in tarantool. A testing system should reveal problems rather than hide
them. However, this patch does exactly that, and I'll explain why.

Many of our tests perform instance starting / stopping: they are
written to spot various problems, but not this one. If everything
except hanging at stop is okay, there is no reason to fail the test. We
print a notice to the terminal in this case and proceed.

In the future, when all problems of this kind are resolved, we can
interpret the presence of such notices as an error. However, right now
it would not add anything to the quality of our testing.

We had a problem of this kind ([1]) in the past and now have another
one ([2]). Until it is resolved, it is worth handling the situation on
the testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of #157

Co-authored-by: Alexander Turenko <[email protected]>
test-run copies the .tarantoolctl file from the test/ directory to a
worker's working directory (e.g. test/var/001_app). It would be quite
intuitive if tarantoolctl used this copy when operating on tarantool
instances of the worker. Surprisingly, it is not so.

tarantoolctl checks whether the file exists inside the directory
pointed to by the PWD environment variable, not the real current
working directory. test-run did not set PWD prior to this commit, yet
everything worked. Why? Because test-run.py is always executed from the
test/ directory, which contains the same .tarantoolctl file. Even the
`make test` / `make test-force` targets issue commands like `cd
<...>/test && <...>/test-run.py <...>` (where `cd` sets PWD).

The time has come. Let's set PWD and use the
test/var/001_app/.tarantoolctl configuration for instances of the
001_app worker.

Side note: it would be good to fix tarantoolctl too, so that it uses
the real cwd, not PWD.

This inconsistency was spotted when we attempted to remove the
test/.tarantoolctl config from the main tarantool repository and move
it to the test-run repository, while keeping the copy in a worker
directory.

As a side effect, test-run can now be invoked without manually changing
the current directory to test/. So, where we used to run a command like
this (example for an in-source build):

 | (cd test && ./test-run.py)

The following command can be used:

 | ./test/test-run.py

Nice!

Sadly, the trick `Popen(<...>, env=dict(os.environ, PWD=self.vardir))`
does not work for us, because we use os.putenv() in test-run and in
Python tests. os.putenv() does not update os.environ, so we would feed
a stale environment to the subprocess. Because of this, we just set PWD
before the Popen() call and restore it afterwards.

Part of tarantool/tarantool#5504
Part of #78

Co-authored-by: Alexander V. Tikhonov <[email protected]>
Moved .tarantoolctl to test-run tool submodule repository as:

  <tarantool repository>/test-run/.tarantoolctl

Also kept backward compatibility with the old location:

  <tarantool repository>/test/.tarantoolctl

which is still checked as the primary place.

Needed for tarantool/tarantool#5504
Part of #78
Found that tests may fail due to a hang in the seek_wait() loop while
starting and stopping instances. It happened because instances were not
synced within the no-output timeout (120 seconds by default), while
replication sync takes up to 300 seconds by default. To fix it,
replication_sync_timeout should be decreased to a value lower than
'no-output-timeout', so I decided to set it to 100 seconds.

The issue looked like this in tests:

  --- replication/gh-5140-qsync-casc-rollback.result    Mon Oct 19
  17:29:46 2020
  +++ /rw_bins/test/var/070_replication/gh-5140-qsync-casc-rollback.result
  Mon Oct 19 17:31:35 2020
  @@ -169,56 +169,3 @@
   -- all the records are replayed one be one without yields for WAL writes, and
   -- nothing should change.
   test_run:cmd('restart server default')
  - |
  -test_run:cmd('restart server replica')
  - | ---
  - | - true
  - | ...
  -

Part of tarantool/tarantool#5504
@Totktonada
Member

I guess it was resolved, without changing the Python tests, by PR #186.
Right?

@avtikhon avtikhon force-pushed the avtikhon/add-signal-to-restart branch from 60e9fce to a0b89ce Compare December 6, 2020 10:52
@avtikhon
Contributor Author

avtikhon commented Dec 6, 2020

Right, fixed.

@avtikhon avtikhon closed this Dec 6, 2020
@avtikhon avtikhon deleted the avtikhon/add-signal-to-restart branch December 6, 2020 11:03