Introduce Basic Single Pass Parser #219
521f3bb to
9e31f76
Compare
|
With the example of |
|
For the [2] The other problem is more specific to For the (pardon the caret pointer - on my machine it looks good but markdown on GitHub makes it look inaccurate). We are asked to quote the expression because the bang is interpreted as the operator for shell redirection and not what we expected it to be (inequality operator). If we really want to avoid quoting such expressions for filter and adding generic operators as I mention above in [1] I can add PS. Another way that we could do it for [2] would be to make it so double-quoted strings are always passed without their quotes but single-quoted strings are never unquoted. Kind of a random idea but it gives us the best of both worlds in terms of functionality, at the expense of surprising the user. |
|
I don't have any strong feelings one way or the other, just interested in this corner case. It sounds like |
Codecov Report
@@ Coverage Diff @@
## master #219 +/- ##
==========================================
+ Coverage 86.75% 87.18% +0.42%
==========================================
Files 59 60 +1
Lines 2401 2465 +64
==========================================
+ Hits 2083 2149 +66
+ Misses 318 316 -2
|
4d347df to
e78dfb2
Compare
|
After considering the above situation with allowing things like Given that As a result of this change, |
# the arguments passed to the sdb from the shell. This is far from what
# we want.
#
# Solution 2 is dangerous as default arguments in Python are mutable(!)
Mutable default arguments, how I loathe thee. The traditional alternative solution is to have the first two lines of the function be something like `if args is None: args = []`, but an if statement is necessary either way. IMO that solution is slightly cleaner, but it's not a big deal.
It does look cleaner. Applied
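For readers less familiar with the pitfall discussed above, here is a minimal sketch of it and of the `None`-sentinel idiom. The function names `collect_shared` and `collect_fresh` are made up for illustration and are not taken from the patch.

```python
from typing import List, Optional


def collect_shared(token: str, tokens: List[str] = []) -> List[str]:
    # Pitfall: the default list is created once, when the function is
    # defined, so every call that omits `tokens` appends to the same object.
    tokens.append(token)
    return tokens


def collect_fresh(token: str, tokens: Optional[List[str]] = None) -> List[str]:
    # Idiom from the review comment: use None as the sentinel and build a
    # new list on each call that omits the argument.
    if tokens is None:
        tokens = []
    tokens.append(token)
    return tokens
```

Calling `collect_shared("a")` and then `collect_shared("b")` returns a list that still contains `"a"` the second time, while `collect_fresh` starts from an empty list on every call.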
token_list: List[str] = []
idx: Optional[int] = 0
while True:
    idx = _next_non_whitespace(line, idx)  # type: ignore[arg-type]
Not a nit, just kinda surprised mypy doesn't realize that idx is never None here.
# limitations under the License.
#
"""
This module contains the logic for the tokenization and parsing
Generally speaking I am somewhat leery of using a hand-rolled parser instead of a generated one; I assume you looked at options for parser generators and found them lacking in some way? It might be good to document that somewhere.
I decided to write the following block comment below the module docstring:
#
# Why Roll Our Own Parser?
#
# Our grammar in its current state could be implemented with shlex() that is
# part of the standard library if we applied some workarounds to it. That said
# the code wouldn't be clean, it would be hard to add new rules (workarounds
# on top of workarounds) and providing helpful error messages would be hard.
#
# In terms of external parsing libraries, the following ones were considered:
# * PLY (Python Lex-Yacc)
# * SLY (Sly Lex-Yacc)
# * Lark
#
# PLY attempts to model traditional Lex & Yacc and it does come with a lot of
# their baggage. There is a lot of global state, that we'd either need to
# recreate (e.g. regenerate the grammar) every time an SDB command is issued,
# or alternatively we'd need to keep track of a few global objects and reset
# their metadata in both success and error code paths. The latter is not that
# bad but it can be very invasive in parts of the code base where we really
# shouldn't care about parsing. In addition, error-handling isn't great and
# there is a lot of boilerplate and magic to it.
#
# SLY is an improved version of PLY that deals with most issues of global
# state and boilerplate code. The error-handling is still not optimal but a
# lot better, optimizing for common cases. SLY would provide a reasonable
# alternative implementation to our hand-written parser but it wasn't chosen
# mainly for one reason. It tries to optimize for traditional full-fledged
# languages which results in a few workarounds given SDB's simplistic but
# quirky command language.
#
# Lark is probably the best option compared to the above in terms of features,
# ergonomics like error-handling, and clean parser code. The only drawback of
# this library in the context of SDB is that it is hard to debug incorrect
# grammars - the grammar is generally one whole string and if it is wrong the
# resulting stack traces end up showing methods in the library, not in the
# code that the consumer of the library wrote (which is what would generally
# happen with SLY). This is not a big deal in general but for SDB we still
# haven't finalized all the command language features (i.e. subshells or
# defining alias commands in the runtime) and our grammar isn't stable yet.
#
# Our hand-written parser below has a small implementation (less than 100
# lines of code without the comments), provides friendly error messages,
# and it fits cleanly into our existing code. As SDB's command language
# grows and gets more stable it should be easy to replace the existing
# parser with a library like Lark.
#
Is this close to what you were looking for? If not let me know and we can change it.
Looks great to me! Thanks for giving such a thorough breakdown.
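To make the point about shlex and error messages concrete, here is a small sketch using only the standard library. The helper `try_shlex` is a hypothetical wrapper written for this illustration; it is not code from sdb.

```python
import shlex
from typing import List, Union


def try_shlex(line: str) -> Union[List[str], str]:
    """Return the tokens, or the error text if shlex rejects the line."""
    try:
        return shlex.split(line)
    except ValueError as err:
        return f"ValueError: {err}"


# Quotes are stripped and \" is honored inside a double-quoted string.
print(try_shlex('filter "obj.spa_name == \\"rpool\\""'))
# An unterminated string yields only a generic message, with no offset
# that could be used to point at the offending quote.
print(try_shlex('filter "obj != 1'))
```

The second call reports `No closing quotation`, the same message that appears in the traceback quoted in the PR description, and nothing in the exception says where in the line the problem starts.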
= Motivation
The current parsing logic of sdb, based on split() and the shlex library,
has resulted in multiple workarounds in the implementation of commands
and has hurt the overall user experience of the tool. In addition, its
lack of proper error handling and reporting frequently leaves the user
not knowing what is wrong with their input. Fixing these shortcomings in
the existing logic has proven difficult, as fixing one problem brings up
a new one.
= Patch
This patch replaces this code with a simple hand-written parser that
provides the bare minimum that we need, plus improved error reporting.
Unit tests are also provided for the parser, both to test its behavior
and to highlight how it handles extreme cases of input. This patch also
takes a first pass at undoing most of the workarounds that existing
commands carry because of the old parsing logic.
= Things To Note: Quoted Strings
Proper support for single- and double-quoted strings is added with this
patch. Double-quoted strings can escape a double quote by inserting a
backslash before it. Single-quoted strings can escape a single quote the
same way. E.g. the following examples are valid:
```
... | filter "obj.spa_name == 'rpool'" | ...
... | filter "obj.spa_name == \"rpool\"" | ...
... | filter 'obj.spa_name == "rpool"' | ...
... | filter 'obj.spa_name == \'rpool\'' | ...
```
The sole purpose of strings is to allow multiple words separated by
spaces to be passed as a single argument to commands. The `filter`
examples shown above get the whole predicate passed in string form as a
single argument. The actual quotes of the string are not part of the
arguments passed to the command. This behavior was modelled after bash.
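The quoting rules described above can be sketched roughly as follows. This is an illustration only, not the parser from this patch: `tokenize_args` and `SketchSyntaxError` are hypothetical names, the sketch handles the arguments of a single command (no pipes or `!`), and reporting the offset of the opening quote is an assumption.

```python
from typing import List


class SketchSyntaxError(Exception):
    """Hypothetical error type that records the offending offset."""

    def __init__(self, message: str, offset: int) -> None:
        super().__init__(message)
        self.offset = offset


def tokenize_args(line: str) -> List[str]:
    tokens: List[str] = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch in ('"', "'"):
            start, quote = i, ch
            i += 1
            chars: List[str] = []
            while i < n and line[i] != quote:
                # A backslash escapes only the active quote character.
                if line[i] == "\\" and i + 1 < n and line[i + 1] == quote:
                    i += 1
                chars.append(line[i])
                i += 1
            if i == n:
                raise SketchSyntaxError("unfinished string expression", start)
            i += 1  # skip the closing quote
            tokens.append("".join(chars))  # the quotes themselves are dropped
        else:
            # Bare word: runs until whitespace or the start of a string.
            start = i
            while i < n and not line[i].isspace() and line[i] not in "\"'":
                i += 1
            tokens.append(line[start:i])
    return tokens
```

With this sketch, all four `filter` examples above produce two tokens, `filter` and the predicate without its outer quotes. One simplification to be aware of: a bare word directly followed by a string is split into two tokens, whereas bash would join them.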
= Examples of new errors
```
// Before
sdb> echo 1 2 3 |
sdb: cannot recognize command:
// After
sdb: syntax error: freestanding pipe with no command
echo 1 2 3 |
^
```
```
// Before
sdb> echo 1 2 3 | filter obj != 1
sdb: filter: invalid input: comparison operator is missing
// After
sdb> echo 1 2 3 | filter obj != 1
sdb: syntax error: predicates that use != as an operator should be quoted
echo 1 2 3 | filter obj != 1
^
```
```
// Before
sdb> echo 1 2 3 | filter "obj != 1
sdb encountered an internal error due to a bug. Here's the
information you need to file the bug:
----------------------------------------------------------
Target Info:
ProgramFlags.IS_LIVE|IS_LINUX_KERNEL
Platform(<Architecture.X86_64: 1>, <PlatformFlags.IS_LITTLE_ENDIAN|IS_64_BIT: 3>)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sdb/internal/repl.py", line 107, in eval_cmd
for obj in invoke(self.target, [], input_):
File "/usr/lib/python3/dist-packages/sdb/pipeline.py", line 107, in invoke
all_tokens = list(lexer)
File "/usr/lib/python3.6/shlex.py", line 295, in __next__
token = self.get_token()
File "/usr/lib/python3.6/shlex.py", line 105, in get_token
raw = self.read_token()
File "/usr/lib/python3.6/shlex.py", line 187, in read_token
raise ValueError("No closing quotation")
ValueError: No closing quotation
----------------------------------------------------------
Link: https://github.com/delphix/sdb/issues/new
// After
sdb> echo 1 2 3 | filter "obj != 1
sdb: syntax error: unfinished string expression
echo 1 2 3 | filter "obj != 1
^
```
```
// Before
sdb> ! pwd
sdb: cannot recognize command:
sdb> ! echo hello!
Multiple ! not supported
// After
sdb> ! pwd
/export/home/delphix/sdb
sdb> ! echo hello!
hello!
```
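The caret-style messages in the examples above can be produced along these lines. This is a sketch under assumptions: `format_syntax_error` is a hypothetical helper, and the two-space indent is a guess, since the exact layout printed by sdb is not recoverable from the examples as rendered here.

```python
def format_syntax_error(line: str, offset: int, message: str) -> str:
    # Render the message, then the input line, then a caret placed under
    # the character at `offset`. Both lines share the same indent so the
    # caret lines up in a monospaced terminal.
    indent = "  "
    return "\n".join([
        f"sdb: syntax error: {message}",
        indent + line,
        indent + " " * offset + "^",
    ])


line = "echo 1 2 3 |"
print(format_syntax_error(line, line.rindex("|"),
                          "freestanding pipe with no command"))
```

This only lines up when the input contains no tabs and the output is shown in a fixed-width font, which would explain why the caret can look misplaced once the text goes through a markdown renderer.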