Skip to content

Conversation

@sdimitro
Copy link
Contributor

= Motivation

The current parsing logic of sdb based on split() and the shlex library
has resulted in multiple workarounds in the implementation of commands
and the overall user-experience of the tool. In addition, its lack of
proper error-handling and reporting frequently result in the user not
knowing what is wrong with his input. Trying to fix the aforementioned
shortcomings in the existing logic has proven difficult as fixing one
problem brings up a new one.

= Patch

This patch replaces this code a simple hand-written parser that provides
the bare-minimum that we need and improved error-reporting. Unit tests
are also provided for the parser to test its behavior and also highlight
its behavior in extreme cases of input. This patch also does a first
pass in undoing most of the workarounds that we have in existing commands
due to the old parsing logic.

= Things To Note: Quoted Strings

Proper support for single and double quote strings is added with this
patch. Double-quote strings are allowed to escape a double-quote by
inserting a backslash before it. Single-quote strings can escape a
single quote the same way. E.g. the following examples are valid:

... | filter "obj.spa_name == 'rpool'" | ...
... | filter "obj.spa_name == \"rpool\"" | ...
... | filter 'obj.spa_name == "rpool"' | ...
... | filter 'obj.spa_name == \'rpool\'' | ...

The purpose of strings is solely to allow the ability to pass multiple
words separated by space as a single argument to commands. The filter
examples show above get the whole predicate passed in string form as a
single argument. The actual quotes of the string are not part of the
arguments passed to the command. This behavior was modelled after bash.

= Examples of new errors

// Before
sdb> echo 1 2 3 |
sdb: cannot recognize command:

// After
sdb: syntax error: freestanding pipe with no command
  echo 1 2 3 |
             ^
// Before
sdb> echo 1 2 3 | filter obj != 1
sdb: filter: invalid input: comparison operator is missing

// After
sdb> echo 1 2 3 | filter obj != 1
sdb: syntax error: predicates that use != as an operator should be quoted
  echo 1 2 3 | filter obj != 1
                          ^
// Before
sdb> echo 1 2 3 | filter "obj != 1
sdb encountered an internal error due to a bug. Here's the
information you need to file the bug:
----------------------------------------------------------
Target Info:
	ProgramFlags.IS_LIVE|IS_LINUX_KERNEL
	Platform(<Architecture.X86_64: 1>, <PlatformFlags.IS_LITTLE_ENDIAN|IS_64_BIT: 3>)

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sdb/internal/repl.py", line 107, in eval_cmd
    for obj in invoke(self.target, [], input_):
  File "/usr/lib/python3/dist-packages/sdb/pipeline.py", line 107, in invoke
    all_tokens = list(lexer)
  File "/usr/lib/python3.6/shlex.py", line 295, in __next__
    token = self.get_token()
  File "/usr/lib/python3.6/shlex.py", line 105, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.6/shlex.py", line 187, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation
----------------------------------------------------------
Link: https://github.com/delphix/sdb/issues/new

// After
sdb> echo 1 2 3 | filter "obj != 1
sdb: syntax error: unfinished string expression
  echo 1 2 3 | filter "obj != 1
                      ^
// Before
sdb> ! pwd
sdb: cannot recognize command:
sdb> ! echo hello!
Multiple ! not supported

// After
sdb> ! pwd
/export/home/delphix/sdb
sdb> ! echo hello!
hello!

@sdimitro sdimitro force-pushed the sdb_bssp branch 2 times, most recently from 521f3bb to 9e31f76 Compare April 27, 2020 23:03
@ahrens
Copy link
Contributor

ahrens commented Apr 28, 2020

With the example of filter "obj.spa_name == 'rpool'", I was wondering if you could do filter obj.spa_name=='rpool', without the quotes and without the whitespace. And then what about filter obj.spa_name!='rpool' (with the problematic !)

@sdimitro
Copy link
Contributor Author

@ahrens

For the filter obj.spa_name=='rpool', the arguments passed to filter with the current parsing logic are ['obj.spa_name==', 'rpool'] - there are currently 2 problems with this:
[1] The parser doesn't recognize comparison operators (e.g. ==, >, >=, etc..) as separate tokens. The only operators that it really recognizes is | and !. Thus, in the above example == is merged together in one token with obj.spa_name. This is easy to add in the current parsing code (just need to make sure that lookahead will be correct for cases where one adds spaces like 1 = = 0) so I can go ahead and add it. If users really want to have == as part of a token they can always quote it together with whatever they need.

[2] The other problem is more specific to filter in my opinion. The old parsing code would never unquote strings which means that the third argument will be passes with quotes 'rpool'. This may be ok for filter but comes as a surprise if you are coming from something like bash. The new parsing logic would unquote this token and pass rpool (quotes are gone) - as a result of that filter will blow up later as the comparison passed to eval would be obj.spa_name == rpool and rpool would be interpreted as an identifier (that doesn't exist) instead of a string 'rpool'. For this problem, I'm a bit conflicted but IMHO I believe that it is better to fix filter to handle that case itself, or just have the user add quotes around the whole expression like 'obj.spa_name=="rpool" than change the parsing behavior that would affect all commands - I can be convinced otherwise though.

For the filter obj.spa_name!='rpool' case we don't see any difference from what I mention above

sdb: syntax error: predicates that use != as an operator should be quoted
  spa | filter obj.spa_name!='rpool'
                           ^

(pardon the carrot pointer - in my machine it looks good but markdown on Github makes it look inaccurate). We are asked to quote the expression because the bang is interpreted as the operator for shell redirection and not what we expected it to be (inequality operator). If we really want to avoid quoting such expressions for filter and adding generic operators as I mention above in [1] I can add != as a formal inequality operator too - we just need to be sure that we won't hit any shell command that starts with = 😅

PS. Another way that we could do it for [2] would be to make it so double-quotes strings are always passed without their quotes but single-quoted strings are never unquoted. Kind of a random idea but it gives us the best of both worlds in terms of functionality but at the expense of the surprise of the user.

@ahrens
Copy link
Contributor

ahrens commented Apr 28, 2020

I don't have any strong feelings one way or the other, just interested in this corner case. It sounds like filter obj.spa_guid==12345 and filter obj.spa_guid!=12345 might work (without quotes)? I was mainly interested in how the ! gets parsed out - if we see it as starting a pipe-to-bash vs being part of the unquoted argument, here.

@codecov-io
Copy link

codecov-io commented Apr 30, 2020

Codecov Report

Merging #219 into master will increase coverage by 0.42%.
The diff coverage is 95.62%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #219      +/-   ##
==========================================
+ Coverage   86.75%   87.18%   +0.42%     
==========================================
  Files          59       60       +1     
  Lines        2401     2465      +64     
==========================================
+ Hits         2083     2149      +66     
+ Misses        318      316       -2     
Impacted Files Coverage Δ
sdb/commands/container_of.py 100.00% <ø> (ø)
sdb/commands/linux/slabs.py 95.00% <ø> (-0.13%) ⬇️
sdb/commands/spl/spl_kmem_caches.py 92.30% <ø> (+3.55%) ⬆️
sdb/commands/threads.py 100.00% <ø> (ø)
sdb/commands/zfs/dbuf.py 86.11% <ø> (ø)
sdb/commands/zfs/zfs_dbgmsg.py 90.00% <ø> (ø)
sdb/commands/zfs/vdev.py 89.28% <83.33%> (ø)
sdb/pipeline.py 82.35% <83.33%> (-3.02%) ⬇️
sdb/commands/zfs/spa.py 97.82% <85.71%> (ø)
sdb/commands/filter.py 94.64% <91.66%> (-1.66%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e1988e...3d76f36. Read the comment docs.

@sdimitro sdimitro force-pushed the sdb_bssp branch 2 times, most recently from 4d347df to e78dfb2 Compare April 30, 2020 03:23
@sdimitro
Copy link
Contributor Author

After considering the above situation with allowing things like >, <, !=, ==, >=, and <= to be formal operators like | and !, the member command would fails because arguments like vdev_ms[0]->ms_allocatable would break into multiple tokens (e.g. ['vdev_ms[0]-', '>', 'ms_allocatable']). This is not a big deal but it got me thinking a lot more on how much should the sdb parser care about such things and if it worths it.

Given that filter is the only command that would benefit from such change and the expense of all the other commands having to care about handling these new operators (or don't care and propagate the workaround to the user by forcing them to use strings for such arguments), I decided not to ahead with it. I did change the code slightly so if we change our minds in the future it is easier to add these formal operators.

As a result of this change, filter will soon start requiring that it is passed only one string argument which encloses the conditional expression (old : filter obj != 1 vs new: filter "obj != 1" - in the case of string comparison users can either escape the double quotes or use both single and double-quotes to make the distinction between the external and internal string, e.g. filter "obj.spa_name == \"rpool\"" and filter 'obj.spa_name == "rpool"').

# the arguments passed to the sdb from the shell. This is far from what
# we want.
#
# Solution 2 is dangerous as default arguments in Python are mutable(!)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable default arguments, how I loathe thee. The traditional alternative solution is to have the first two lines of the function be something like if args is None: args = [], but an if statement is necessary either way. IMO that solution is slightly cleaner, but it's not a big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look cleaner. Applied

token_list: List[str] = []
idx: Optional[int] = 0
while True:
idx = _next_non_whitespace(line, idx) # type: ignore[arg-type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a nit, just kinda surprised mypy doesn't realize that idx is never NONE here

# limitations under the License.
#
"""
This module contains the logic for the tokenization and parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking I am somewhat leery of using a hand-rolled parser instead of a generated one; I assume you looked at options for parser generators and found them lacking in some way? It might be good to document that somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to write the following block comment below the module docstring:

#
# Why Roll Our Own Parser?
#
# Our grammar in its current state could be implemented with shlex() that is
# part of the standard library if we applied some workarounds to it. That said
# the code wouldn't be clean, it would be hard to add new rules (workarounds
# on top of workaroudns) and providing helpful error messages would be hard.
#
# In terms of external parsing libraries, the following ones were considered:
#   * PLY (Python Lex-Yacc)
#   * SLY (Sly Lex-Yacc)
#   * Lark
#
# PLY attempts to model traditional Lex & Yacc and it does come with a lot of
# their baggage. There is a lot of global state, that we'd either need to
# recreate (e.g. regenerate the grammar) every time an SDB command is issued,
# or alternatively we'd need to keep track of a few global objects and reset
# their metadata in both success and error code paths. The latter is not that
# bad but it can be very invasive in parts of the code base where we really
# shouldn't care about parsing. In addition, error-handling isn't great and
# there is a lot of boilerplate and magic to it.
#
# SLY is an improved version of PLY that deals with most issues of global
# state and boilerplace code. The error-handling is still not optimal but a
# lot better, optimizing for common cases. SLY would provide a reasonable
# alternative implementation to our hand-written parser but it wasn't chosen
# mainly for one reason. It tries to optimize for traditional full-fledged
# languages which results in a few workarounds given SDB's simplistic but
# quirky command language.
#
# Lark is probably the best option compared to the above in terms of features,
# ergonomics like error-handling, and clean parser code. The only drawback of
# this library in the context of SDB is that it is hard to debug incorrect
# grammars - the grammar is generally one whole string and if it is wrong the
# resuting stack traces end up showing methods in the library, not in the
# code that the consumer of the library wrote (which is what would geenrally
# happen with SLY). This is not a big deal in general but for SDB we still
# haven't finalized all the command language features (i.e. subshells or
# defining alias commands in the runtime) and our grammar isn't stable yet.
#
# Our hand-written parser below has a small implementation (less than 100
# lines of code without the comments), provides friendly error messages,
# and it falls cleanly to our existing code. As SDB's command language
# grows and gets more stable it should be easy to replace the existing
# parser with a library like Lark.
#

Is this close to what you were looking for? If not let me know and we can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Thanks for giving such a thorough breakdown.

sdimitro added 2 commits May 12, 2020 17:32
= Motivation

The current parsing logic of sdb based on split() and the shlex library
has resulted in multiple workarounds in the implementation of commands
and the overall user-experience of the tool. In addition, its lack of
proper error-handling and reporting frequently result in the user not
knowing what is wrong with his input. Trying to fix the aforementioned
shortcomings in the existing logic has proven difficult as fixing one
problem brings up a new one.

= Patch

This patch replaces this code a simple hand-written parser that provides
the bare-minimum that we need and improved error-reporting. Unit tests
are also provided for the parser to test its behavior and also highlight
its behavior in extreme cases of input. This patch also does a first
pass in undoing most of the workarounds that we have in existing commands
due to the old parsing logic.

= Things To Note: Quoted Strings

Proper support for single and double quote strings is added with this
patch. Double-quote strings are allowed to escape a double-quote by
inserting a backslash before it. Single-quote strings can escape a
single quote the same way. E.g. the following examples are valid:
```
... | filter "obj.spa_name == 'rpool'" | ...
... | filter "obj.spa_name == \"rpool\"" | ...
... | filter 'obj.spa_name == "rpool"' | ...
... | filter 'obj.spa_name == \'rpool\'' | ...
```

The purpose of strings is solely to allow the ability to pass multiple
words separated by space as a single argument to commands. The `filter`
examples show above get the whole predicate passed in string form as a
single argument. The actual quotes of the string are not part of the
arguments passed to the command. This behavior was modelled after bash.

= Examples of new errors

```
// Before
sdb> echo 1 2 3 |
sdb: cannot recognize command:

// After
sdb: syntax error: freestanding pipe with no command
  echo 1 2 3 |
             ^
```

```
// Before
sdb> echo 1 2 3 | filter obj != 1
sdb: filter: invalid input: comparison operator is missing

// After
sdb> echo 1 2 3 | filter obj != 1
sdb: syntax error: predicates that use != as an operator should be quoted
  echo 1 2 3 | filter obj != 1
                          ^
```

```
// Before
sdb> echo 1 2 3 | filter "obj != 1
sdb encountered an internal error due to a bug. Here's the
information you need to file the bug:
----------------------------------------------------------
Target Info:
	ProgramFlags.IS_LIVE|IS_LINUX_KERNEL
	Platform(<Architecture.X86_64: 1>, <PlatformFlags.IS_LITTLE_ENDIAN|IS_64_BIT: 3>)

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sdb/internal/repl.py", line 107, in eval_cmd
    for obj in invoke(self.target, [], input_):
  File "/usr/lib/python3/dist-packages/sdb/pipeline.py", line 107, in invoke
    all_tokens = list(lexer)
  File "/usr/lib/python3.6/shlex.py", line 295, in __next__
    token = self.get_token()
  File "/usr/lib/python3.6/shlex.py", line 105, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.6/shlex.py", line 187, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation
----------------------------------------------------------
Link: https://github.com/delphix/sdb/issues/new

// After
sdb> echo 1 2 3 | filter "obj != 1
sdb: syntax error: unfinished string expression
  echo 1 2 3 | filter "obj != 1
                      ^
```

```
// Before
sdb> ! pwd
sdb: cannot recognize command:
sdb> ! echo hello!
Multiple ! not supported

// After
sdb> ! pwd
/export/home/delphix/sdb
sdb> ! echo hello!
hello!
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants