Add signal argument to server restart routine #236
Closed
avtikhon wants to merge 25 commits into avtikhon/gh-5510-restart-sigkill from avtikhon/add-signal-to-restart
Conversation
Force-pushed from d344a7b to 591b576.
Force-pushed from e40cba4 to 60e9fce.
I looked over the places where LuaPreprocessorException is raised, and they do not look like show-stoppers. Those situations are about incorrect commands (say, a parsing error) or commands evaluated in an incorrect state (e.g. an attempt to start a server that is not created yet). We can return an error to the test, report a notice to the terminal and continue the test. The decision whether such cases should fail the test is up to a test author now.

The new format of displaying the notice on the terminal shifts the understanding of the error from 'looks like a testing system problem' to 'the test performs unexpected commands'. Before the change such situations were reported to the terminal this way:

| [001] TarantoolInpector.handle() received the following error:
| [001] <..a python stack trace and an error message..>
| [001] LuaPreprocessorException: lua preprocessor error: "Can't stop nonexistent server 'foo'"

And the test stops. After the change, it looks so:

| [001] * [QA Notice]
| [001] *
| [001] * Can't stop nonexistent server 'foo'
| [001] *

And the test receives a Lua error. Usually it leads to a failure of the test, however now it is possible to catch the error with pcall() and handle it.

The idea appeared while I was looking into [1]. The test creates a full-mesh replicaset, spins in a cycle stopping and starting instances, and then drops the replicaset. There are several situations that correctly lead to failure of this particular test, but may be considered appropriate in other tests.

1. The replicaset instances are numbered from 1 to 3, but when the test is unable to determine an instance to stop during 60 seconds (the default test_run:wait_cond() timeout), it tries to stop the 0th instance. It is an attempt to stop a nonexistent instance.
2. Similar to problem 1, but with an attempt to evaluate a command on a nonexistent instance.
3. An error may occur between stopping and starting the instance. In this case the loop exits and test_run:drop_cluster() is unable to stop all instances in the replicaset.

The first problem is now reported as it should be (with the QA Notice, as shown above). The second will be reported this way after the next commit. The third requires ignoring 'stop server <...>' failures in test_run:drop_cluster() and will be handled in a separate commit within this patchset.

While I'm here, changed several related things:

* Color 'info' messages as yellow. It affects several rarely shown existing messages, which looks okay.
* Removed the 'lua preprocessor error' prefix from the error message: it does not give any new information and so looks useless.
* Fixed conversion of the error to string: repr() -> str().
* Show an unhandled Lua error from a test-run command evaluation without a path to the test_run.lua file: it is useless and looks ugly (in particular, a colon between the path and the error message obligates the yaml emitter to wrap the string into quotes and so duplicate all quotes inside).

[1]: https://github.com/tarantool/tarantool/issues/5395
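One of the listed changes, repr() -> str(), affects how the message is rendered: str() of an exception yields only the message, while repr() adds the class name and quoting. A quick illustration:

```python
e = Exception("Can't stop nonexistent server 'foo'")

# repr() wraps the message with the class name: Exception("Can't ...").
print(repr(e))

# str() yields the bare message, which reads better in a QA notice.
print(str(e))
```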
Handled the following cases:

* test_run:cmd('stop server foo') is called against a stopped server.
* test_run:eval('foo', '<lua expression>') is called against a nonexistent server.

Found while looking into [1]. See the previous commit for a bit broader description and reasoning.

[1]: https://github.com/tarantool/tarantool/issues/5395
Since the method is used as clean-up after a test / a test case, it does not look worthwhile to fail the test when one of the instances is already stopped. However, test-run's QA notice will be printed to the terminal in this case: it is possible that the situation is not expected and some subtle problem exists. It is worth letting a developer know about this. Found while looking into [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
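The described behaviour can be sketched roughly like this (with hypothetical helper names; the real test-run code differs): cleanup keeps going past an already-stopped instance and records a notice instead of failing.

```python
class LuaPreprocessorException(Exception):
    pass

def stop_server(name, running):
    # Hypothetical stand-in for 'stop server <name>'.
    if name not in running:
        raise LuaPreprocessorException(
            "Can't stop nonexistent server '%s'" % name)
    running.remove(name)

def drop_cluster(names, running, notices):
    # Continue cleanup even if some instance is already stopped,
    # but record a QA notice so the situation stays visible.
    for name in names:
        try:
            stop_server(name, running)
        except LuaPreprocessorException as exc:
            notices.append(str(exc))
```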
We already had log messages for creating, starting and stopping a server, but without a response. Now all commands are traced, so those messages become redundant and are removed. The idea appeared while looking into [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
In the next commit, I'll add tracing of commands that are sent to tarantool and their responses. In order to trace `\set language <lang>` too, it is convenient to have code that sends a command and reads a response in one place. Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
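A minimal sketch of the idea, with hypothetical names (FakeConnection, trace): one helper both sends the command and reads the response, so tracing added later covers every command, including `\set language <lang>`.

```python
class FakeConnection:
    """A stand-in for the console connection, for illustration only."""
    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)

    def recv(self):
        # Echo a canned response for the last command sent.
        return 'ok: ' + self.sent[-1]

trace = []

def execute(conn, command):
    # Single place that sends a command and reads the response:
    # a tracing hook here covers all commands uniformly.
    trace.append(('cmd', command))
    conn.send(command)
    response = conn.recv()
    trace.append(('ans', response))
    return response
```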
The commands and responses are written into test-run's log file, which is named like test/var/log/001_replication.log (where '001' is the number of the worker and 'replication' is the test suite name). A command is logged exactly as it is sent to tarantool: without newlines in the middle[^1]. It'll look a bit weird for multiline commands, but may ease debugging if a problem is in test-run's way of formatting a command before sending, or somewhere around processing commands by the console module. [^1]: The last newline is stripped from logs for readability. The idea appeared while looking into [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
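A rough sketch of the logging rule (hypothetical function name): the payload is written exactly as sent, with only a single trailing newline stripped for readability.

```python
def log_payload(kind, payload):
    # Write a command or response to the log exactly as it was sent,
    # stripping only the final newline for readability.
    if payload.endswith('\n'):
        payload = payload[:-1]
    return '[{}] {}'.format(kind, payload)
```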
I plan to reuse the function outside of the OutputWatcher class in the next commit. Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
A time difference between different events may be important for debugging a test. For example, we start a tarantool instance at time X, and at time Y it says 'entering the event loop' in the log and reports 'orphan' status in box.info.status. If the waiting is unexpectedly long, we can make a guess about what is going wrong: say, compare (Y-X) against the box.cfg.replication_sync_timeout value. The idea appeared while looking into [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
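A minimal sketch (not the actual test-run code) of prepending a timestamp to each log entry:

```python
import time

def log_entry(message):
    # Prepend a wall-clock timestamp so the time difference between
    # events (e.g. instance start vs. 'entering the event loop') is
    # visible when reading the log.
    ts = time.strftime('%Y-%m-%d %H:%M:%S')
    return '[{}] {}'.format(ts, message)
```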
test-run outputs some lines in several chunks, e.g. "engine/tuple.test.lua", then "memtx", then "[ pass ]". When test-run prints to the terminal, it accumulates the chunks and writes them at once so as not to mix output from different workers. Logs cannot be buffered this way: otherwise we could miss information that is valuable for debugging. After the previous commit ('logging: prepend each log entry with a timestamp') we write each such chunk on its own line, and they look as if each were a full line. In order to distinguish a chunk from a full line, add `<no newline>` after a chunk without a newline at the end. The difference does not look important, but it should make it easier to understand the relation between strings shown on the terminal and ones written to the test-run log file. Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
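The marking can be sketched like so (assuming a chunk is a string that may or may not end with a newline):

```python
def mark_chunk(chunk):
    # Distinguish a partial chunk from a full line in the log file.
    if chunk.endswith('\n'):
        return chunk[:-1]            # full line: drop the trailing newline
    return chunk + ' <no newline>'   # partial chunk: mark it explicitly
```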
When the option is passed, test-run prints debugging messages to the terminal in addition to writing them to the log file. It may be convenient for debugging a test. The idea appeared while looking into [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
All log entries are prepended with a timestamp in a test-run log file, because this information may be useful for debugging a test[^1]. Let's also show timestamps for debug messages on the terminal when the --debug option is passed. [^1]: See commit 'logging: prepend each log entry with a timestamp' for details. Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
Added more information about installing, starting and stopping instances: instance name, process details (PID, cmdline, custom process title), loading status, box.info.status value after loading. Revisited existing messages to keep them formatted in a similar way. It should ease reading of a log file. Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
The idea of this timer is to stop testing if nothing is written to the terminal for a long time (120 seconds by default). Before this commit, the timer was reset on any event: say, when an entry is written to a test-run log file. It is counter-intuitive and sometimes may look like the timer just does not work.

Let's consider the following snippet of test code:

| -- Wait until <cmd> on instance 'foo' will return
| -- an expected value.
| local cmd = <...>
| local exp = <...>
| while true do                                \
|     local res = test_run:eval('foo', cmd)[1] \
|     if res == exp then                       \
|         break                                \
|     end                                      \
| end

After commit 'logging: trace test-run commands and responses' we log test-run commands, and so the snippet would discard the no-output timer at all. If something goes wrong and the <cmd> command never returns the expected result, the loop will spin infinitely.

After this commit, the logging events will not reset the timer, and so the testing will be finished eventually (after 120 seconds without output to the terminal). However, if --debug is passed, the logging events will be printed to the terminal and will reset the timer. It looks logical, because the timer is about the printed messages, but it means that the behaviour of the testing system may depend on whether the --debug option is passed. I don't know which behaviour would be better, so implemented the most intuitive one.

Part of activities around [1]. [1]: https://github.com/tarantool/tarantool/issues/5395
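The reset policy described above can be sketched as follows (a simplified model with hypothetical names, not the actual test-run timer): only terminal output resets the inactivity timer; log-file writes do not, unless --debug echoes them to the terminal.

```python
import time

class NoOutputWatcher:
    def __init__(self, timeout=120):
        self.timeout = timeout
        self.last_output = time.monotonic()

    def on_terminal_output(self):
        # Only output actually printed on the terminal resets the timer.
        self.last_output = time.monotonic()

    def on_log_write(self, debug_mode):
        # A log-file write does not reset the timer by itself...
        if debug_mode:
            # ...but with --debug the entry is also printed on the
            # terminal, so it counts as terminal output.
            self.on_terminal_output()

    def expired(self):
        return time.monotonic() - self.last_output > self.timeout
```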
Drop unnecessary test_debug() function: it is not used anywhere in Python tests and was used only to implement the property. In one of the following commits regarding --snapshot and --disable-schema-upgrade options we'll need to determine whether a given tarantool build contains debugging functions (error injections) before creation of any TarantoolServer instance. Part of #240
This utility function will be used in one of the following commits regarding --snapshot and --disable-schema-upgrade options. Co-authored-by: Alexander Turenko <[email protected]> Part of #240
Option '--snapshot' specifies a path to a snapshot that will be loaded into Tarantool before testing. Option '--disable-schema-upgrade' allows skipping the execution of the Tarantool upgrade script using the error injection mechanism. Closes #240
- add trailing newlines to error message
- remove whitespace in error message
- break loop on first option match
- use proper dict name to get options

Closes #241
Found while looking into #245.
When there is no output from workers for a long time (10 seconds by default, or 60 seconds when the --long argument is passed), test-run prints a warning and shows the number of lines in the temporary result file. It is useful for understanding on which statement a test hangs. I reproduced the problem by mangling tarantool to ignore SIGTERM and SIGINT signals and running a simple 'tarantool = core' test. The test successfully passes, but the worker gets stuck waiting for the tarantool server to stop. This particular case should be resolved in PR #186, but only because the timeout for stopping the server is less than the warning delay. This assumption looks fragile, especially if we want to make some of those timeouts / delays configurable. Let's handle the situation when the file does not exist. Found while looking into https://github.com/tarantool/tarantool/issues/5573 Fixes #245
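The fix amounts to checking for the file's existence before counting its lines; a rough sketch (hypothetical function name):

```python
import os

def result_line_count(path):
    # Report the line count of the temporary result file, handling
    # the case when the file does not exist yet.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return sum(1 for _ in f)
```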
Now we have only one place where a [QA Notice] message is printed. But we'll have more, and it is convenient to have one function for this purpose. The function should be used in places where we observe incorrect behaviour of tarantool or a test, but want to lower it to a warning and proceed further. Of course, there should be a solid reason to do so. Part of #157
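A sketch of such a helper (hypothetical name), producing the '[QA Notice]' block in the format shown earlier in this thread:

```python
def qa_notice_text(text):
    # Build the '* [QA Notice]' block: a header, a blank spacer line,
    # the message lines, and a trailing spacer, each prefixed with '*'.
    lines = ['[QA Notice]', ''] + text.split('\n') + ['']
    return '\n'.join(('* ' + line).rstrip() for line in lines)

print(qa_notice_text("Can't stop nonexistent server 'foo'"))
```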
If a tarantool instance does not handle SIGTERM correctly, it is a bug in tarantool. A testing system should reveal problems rather than hide them. However, this patch does exactly that, and I'll explain why.

Many tests we have perform instance starting / stopping: they are written to spot different problems, but not this one. If everything else except hanging at stop is okay, there is no reason to fail the test. We print a notice to the terminal in this case and proceed further.

In the future, when all problems of this kind are resolved, we can interpret the presence of such notices as an error. However, now it would not add anything to the quality of our testing. We had a problem of this kind ([1]) in the past and now have another one ([2]). Until it is resolved, it is worth handling the situation on the testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of #157
Co-authored-by: Alexander Turenko <[email protected]>
test-run copies the .tarantoolctl file from a test/ directory to a worker's working directory (e.g. test/var/001_app). It would be quite intuitive if tarantoolctl used this copy when operating on tarantool instances of the worker. Surprisingly, it is not so. tarantoolctl checks whether the file exists inside the directory pointed to by the PWD environment variable, not the real current working directory.

test-run did not set PWD prior to this commit, however everything worked. Why? Because test-run.py is always executed from the test/ directory, which contains the same .tarantoolctl file. Even the `make test` / `make test-force` targets issue commands like `cd <...>/test && <...>/test-run.py <...>` (where `cd` sets PWD).

The time has come. Let's set PWD and use the test/var/001_app/.tarantoolctl configuration for instances of the 001_app worker.

Side note: it would be good to fix tarantoolctl too, to use the real cwd, not PWD. This inconsistency was spotted when we attempted to remove the test/.tarantoolctl config from the main tarantool repository and move it to the test-run repository, but keep copying it into a worker directory.

As a side effect, test-run now can be invoked without manually changing the current directory to test/. So, where we ran a command like this (example for an in-source build):

| (cd test && ./test-run.py)

The following command can be used:

| ./test/test-run.py

Nice!

Sadly, the trick `Popen(<...>, env=dict(os.environ, PWD=self.vardir))` does not work for us, because we use os.putenv() in test-run and Python tests. os.putenv() does not update os.environ, so we would feed a non-actual environment to the subprocess. Because of this, we just set PWD before and after the Popen() call.

Part of tarantool/tarantool#5504
Part of #78
Co-authored-by: Alexander V. Tikhonov <[email protected]>
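A sketch of the workaround under the assumptions above (hypothetical wrapper name; not the exact test-run code). Since os.putenv() does not update os.environ, building a copied env dict could drop variables; instead, PWD is set in the live environment around the Popen() call and restored afterwards.

```python
import os
import subprocess
import sys

def popen_with_pwd(cmd, workdir, **kwargs):
    # Temporarily set PWD in the live environment so the child
    # (e.g. tarantoolctl) sees the worker directory, then restore it.
    saved = os.environ.get('PWD')
    os.environ['PWD'] = workdir
    try:
        return subprocess.Popen(cmd, cwd=workdir, **kwargs)
    finally:
        if saved is None:
            del os.environ['PWD']
        else:
            os.environ['PWD'] = saved
```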
Moved .tarantoolctl to the test-run tool submodule repository as: <tarantool repository>/test-run/.tarantoolctl Also kept backward compatibility with the old path location: <tarantool repository>/test/.tarantoolctl using it as the primary place. Needed for tarantool/tarantool#5504 Part of #78
Found that tests may fail due to a hang in the seek_wait() loop on starting and stopping instances. It happened because instances were not synced within the default output timeout (120 seconds by default), while replication sync happens only every 300 seconds by default. To fix it, replication_sync_timeout should be decreased to some value lower than 'no-output-timeout', so it was decided to set it to 100 seconds.

The issue looked like this in tests:

--- replication/gh-5140-qsync-casc-rollback.result	Mon Oct 19 17:29:46 2020
+++ /rw_bins/test/var/070_replication/gh-5140-qsync-casc-rollback.result	Mon Oct 19 17:31:35 2020
@@ -169,56 +169,3 @@
 -- all the records are replayed one be one without yields for WAL writes, and
 -- nothing should change.
 test_run:cmd('restart server default')
- |
-test_run:cmd('restart server replica')
- |
---
- | - true
- | ...
-

Part of tarantool/tarantool#5504
I guess it was resolved without changing the Python tests by PR #186. Right?
Found that the stop and restart routines should have the ability to set a signal via the restart routine, to fix issues like:

Test hung! Result content mismatch:
--- replication-py/init_storage.result	Wed Aug 26 06:06:15 2020
+++ /tmp/tnt/101_replication-py/init_storage.result	Wed Nov 11 10:17:50 2020
@@ -130,5 +130,3 @@
 -------------------------------------------------------------
 waiting reconnect on JOIN...
 ok
-waiting reconnect on SUBSCRIBE...
-ok
Force-pushed from 60e9fce to a0b89ce.
Right, fixed.