Skip to content

Conversation

@mjambon
Copy link
Contributor

@mjambon mjambon commented Aug 13, 2020

This adds partial (incorrect) support for optional chaining as discussed in #134.

It kind of works because I specified creative tokens ?.[ and ?.( in the grammar, and the test cases don't have whitespace after the ?.. The problem is that if I specify a proper ?. token in front of the [ of a subscript (or in front of the ( of call arguments), I obtain an ERROR node in the CST in place of the ?.. I don't know what to touch in terms of precedence or other tricks to make this work. I'll resume tomorrow.

The current implementation also allows some optional chains as left-hand side expressions, which is illegal, but I know how to fix it (requires creating separate rules optional_{member|subscript|call} as already suggested by @maxbrunsfeld in #134.

@mjambon mjambon changed the title WIP: Optional chaining Optional chaining Aug 18, 2020
@mjambon
Copy link
Contributor Author

mjambon commented Sep 9, 2020

Hey there, @maxbrunsfeld and team. I'm back with my attempts to add optional chaining to javascript (and typescript). With some trial and error, I managed to pass my tests for property access (a?.b) and array access (a?.[b]). However I didn't manage to make the function calls work (f?.(x)) without a runtime error.

Those are all small changes and I hope we can implement the optional function calls without changing too much but I'm not sure what to try next. I had a brief look at the official typescript parser but it's huge and handwritten in somewhat procedural style. My next best thought is to find another javascript parser that may be written in a more declarative style, and try to imitate what they. I'll take any suggestions.

@mjambon
Copy link
Contributor Author

mjambon commented Sep 11, 2020

My colleague @nbrahms added support for f?.(x) using a single token for ?.(. We hope it covers most cases in practice. We'll start using this internally and will report if we see any problem. Optional chaining is widely used in typescript (about 10% of the files we looked at), so these changes should really help.

Copy link
Contributor

@maxbrunsfeld maxbrunsfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good; thank you both for fixing this! I suggested a couple of small changes.

One slightly larger change I'd like to discuss: in member_expression and subscript_expression, the optional operator ?. is just added to the existing rule, instead of adding a new named rule (like optional_member_expression etc). But in the case of call_expression, you've added a separate named opt_arguments node.

If possible (and if it works for your use case), I'd like to handle it the same way across all three cases (member, subscript, and call).

What do you think about removing opt_arguments and just adding ?. as an optional token inside call expression? Or conversely, adding separate optional_* variations of subscript and member expression? I think I would prefer the first option (and it will result in a smaller binary size), but I am open to the second as well.

grammar.js Outdated
)),

opt_arguments: $ => prec(PREC.CALL, seq(
'?.(',
Copy link
Contributor

@maxbrunsfeld maxbrunsfeld Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split the ?. into a separate token? I think whitespace is most likely allowed between ?. and (.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @mjambon mentioned that but this led to parsing errors when applying it on a concrete file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly right. Using seq('?.', '(', ...) did not parse.

Copy link
Contributor

@maxbrunsfeld maxbrunsfeld Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What specifically was failing to parse? I'm trying this locally, with the following added unit test, and everything seems to parse ok:

============================================
Optional function calls
============================================

a[b]?.(c);
d.e?.(f);

---

(program
  (expression_statement
    (call_expression
      (subscript_expression (identifier) (identifier))
      (opt_arguments (identifier))))
  (expression_statement
    (call_expression
      (member_expression (identifier) (property_identifier))
      (opt_arguments (identifier)))))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my grammar diff:

--- a/grammar.js
+++ b/grammar.js
@@ -685,7 +685,8 @@ module.exports = grammar({
         $.super,
         alias($._reserved_identifier, $.identifier)
       )),
-      choice('[', seq('?.', '[')),
+      optional('?.'),
+      '[',
       field('index', $._expressions),
       ']'
     )),
@@ -952,7 +953,8 @@ module.exports = grammar({
     )),
 
     opt_arguments: $ => prec(PREC.CALL, seq(
-      '?.(',
+      '?.',
+      '(',
       commaSep(optional(choice($._expression, $.spread_element))),
       ')'
     )),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maxbrunsfeld

Using your grammar.js diff, and a test file of:

x?.(y);

x ?. (y);

x?.y;

x?.[0];

new function bob (a, b) {} ?.(1, 2);

I got:

➜  tree-sitter-javascript git:(optional-chains) ✗ npx tree-sitter parse opt.js
(program [0, 0] - [9, 0]
  (expression_statement [0, 0] - [0, 7]
    (call_expression [0, 0] - [0, 6]
      function: (identifier [0, 0] - [0, 1])
      (ERROR [0, 1] - [0, 3])
      arguments: (arguments [0, 3] - [0, 6]
        (identifier [0, 4] - [0, 5]))))
  (expression_statement [2, 0] - [2, 9]
    (call_expression [2, 0] - [2, 8]
      function: (identifier [2, 0] - [2, 1])
      (ERROR [2, 2] - [2, 4])
      arguments: (arguments [2, 5] - [2, 8]
        (identifier [2, 6] - [2, 7]))))
  (expression_statement [4, 0] - [4, 5]
    (member_expression [4, 0] - [4, 4]
      object: (identifier [4, 0] - [4, 1])
      property: (property_identifier [4, 3] - [4, 4])))
  (expression_statement [6, 0] - [6, 7]
    (subscript_expression [6, 0] - [6, 6]
      object: (identifier [6, 0] - [6, 1])
      index: (number [6, 4] - [6, 5])))
  (expression_statement [8, 0] - [8, 36]
    (call_expression [8, 0] - [8, 35]
      function: (new_expression [8, 0] - [8, 26]
        constructor: (function [8, 4] - [8, 26]
          name: (identifier [8, 13] - [8, 16])
          parameters: (formal_parameters [8, 17] - [8, 23]
            (identifier [8, 18] - [8, 19])
            (identifier [8, 21] - [8, 22]))
          body: (statement_block [8, 24] - [8, 26])))
      arguments: (opt_arguments [8, 27] - [8, 35]
        (number [8, 30] - [8, 31])
        (number [8, 33] - [8, 34])))))
opt.js	0 ms	(ERROR [0, 1] - [0, 3])```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that tree-sitter seems to be preferring the $.arguments rule instead of $.opt_arguments here. I played with various prec settings, but that didn't help me out any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxbrunsfeld now if you run the tests, you should see an ERROR node where the ?. is:

1 failure:

expected / actual

  1. Optional function calls:
    (program (expression_statement (call_expression (identifier) (ERROR) (arguments (identifier)))) (expression_statement (call_expression (subscript_expression (identifier) (identifier)) (opt_arguments (identifier)))) (expression_statement (call_expression (member_expression (identifier) (property_identifier)) (opt_arguments (identifier))))) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like @mjambon beat me to it, but you can reference semgrep@7633d91 for my test case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(thanks for your help here @maxbrunsfeld , definitely new to tree-sitter here and still learning the ropes!)

grammar.js Outdated
$.super,
alias($._reserved_identifier, $.identifier)
)),
choice('[', seq('?.', '[')),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be slightly clearer to write this is as:

optional('?.'),
'[',

instead of

choice('[', seq('?.', '[')),

@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Sep 14, 2020

Yeah, this ended up requiring a bit larger of a change than I expected.

The problem was some complexity that we added in #89, to deal with the inherent ambiguity between new_expression and call_expression when parsing things like new foo.bar(). I switched to a different solution for that ambiguity, which makes it much easier to add the ?. token.

My change (a PR targeted at this branch) is at semgrep#1.

Split `?.` and `(` into separate tokens by removing `new_expression` complexity
@maxbrunsfeld maxbrunsfeld merged commit 3032cf5 into tree-sitter:master Sep 15, 2020
@maxbrunsfeld
Copy link
Contributor

Thanks for your work on this @mjambon and @nbrahms!

@aryx
Copy link
Contributor

aryx commented Sep 15, 2020

Fantastic. Thx a lot Max. Is there any chance we can get a glimpse of your debugging process to fix this issue. How did you get the intuition the problem was related to #89 and why all those what-seems-unrelated changes.
I know there's an art in grammar engineering, just curious if you have any tips for us so we can fix those bugs by ourselves.

@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Sep 15, 2020

Yes, absolutely. This is a good example to share, because it was a bit tricky for me to get it working. I've pushed a branch to illustrate my thought process step by step: https://github.com/tree-sitter/tree-sitter-javascript/commits/optional-chain-debugging-process.

In each commit, I have a message that describes the problem. I'll reiterate some of this explanation here, discussing each commit in sequence. For some commits, I'll include screenshots of the "debug graphs" that tree-sitter parse and tree-sitter test can produce if you pass the -D flag (requires graphviz to be installed).

  1. 9ba9b40

    Here, I've tried to remove the opt_arguments rule, and just add an optional ?. token to call_expression. This generates the parser without error, but as @nbrahms pointed out, the following code does not parse correctly:

    foo ?. ()

    Here is a screenshot of the debug graph

    step-1-fail

    The problem is that, in a call_expression, the leading identifier node would need to be wrapped in an _expression, because of the structure of the call_expression rule. But the parser is not creating an _expression node here, because the member_expression and subscript_expression rules allow a plain identifier, and they have a non-zero precedence (unlike _expression). So the parser has chosen to consume the ?. token without wrapping foo in an expression, in order to allow parsing the higher-precedence member_expression/subscript_expression rules, sacrificing the ability to parse a call_expression containing a ?..

  2. 4909c4f

    This problem ☝️ lead me to try to simplify member_expression and subscript_expression to always parse their left hand side as an _expression. This is easier to understand, and optional chaining now parses correctly across the board 🎉 .

    The problem now is that this code now parses incorrectly:

    new Foo.Bar();

    Desired tree:

    (program [0, 0] - [1, 0]
      (expression_statement [0, 0] - [0, 13]
        (new_expression [0, 0] - [0, 13]
          constructor: (member_expression [0, 4] - [0, 11]
            object: (identifier [0, 4] - [0, 7])
            property: (property_identifier [0, 8] - [0, 11]))
          arguments: (arguments [0, 11] - [0, 13]))))
    

    Actual tree:

    (program [0, 0] - [1, 0]
      (expression_statement [0, 0] - [0, 13]
        (call_expression [0, 0] - [0, 13]
          function: (member_expression [0, 0] - [0, 11]
            object: (new_expression [0, 0] - [0, 7]
              constructor: (identifier [0, 4] - [0, 7]))
            property: (property_identifier [0, 8] - [0, 11]))
          arguments: (arguments [0, 11] - [0, 13]))))
    

    Here is the relevant part of the debug graph:

    step-3-fail

    The parser makes a wrong decision when it sees the . token: it creates a new_expression for new Foo. Two notes about this decision:

    1. This decision is allowed because new_expression still allows a plain identifier after the new; it is not expected to be wrapped in an expression. This is something we changed in Remove calls from new expressions #89.
    2. The parser chooses this decision because 'new_expression' has a non-zero precedence, whereas _expression itself has zero precedence (because zero is the default). So when the parser sees the '.' token in 'new Foo.Bar()', it needs to decide whether the 'Foo' is an 'expression' or part of a 'new_expression'. Because of the non-zero precedence on 'new_expression', it chooses the latter (which is wrong).
  3. 15d5e85

    Next, I made the same change to new_expression: just parse the constructor part as an _expression, instead of allowing a select subset of the possible children of _expression. When I made this change, I get a failure in tree-sitter generate:

    Unresolved conflict for symbol sequence:
    
      'new'  _expression  •  '?.'  …
    
    Possible interpretations:
    
      1:  'new'  (call_expression  _expression  •  '?.'  arguments)                    (precedence: 12)
      2:  'new'  (call_expression  _expression  •  '?.'  template_string)              (precedence: 12)
      3:  'new'  (member_expression  _expression  •  '?.'  identifier)                 (precedence: 14)
      4:  'new'  (subscript_expression  _expression  •  '?.'  '['  _expressions  ']')  (precedence: 14, associativity: Right)
      5:  (new_expression  'new'  _expression)  •  '?.'  …                             (precedence: 13, associativity: Right)
    
    Possible resolutions:
    
      1:  Specify a higher precedence in `call_expression` and `call_expression` and `member_expression` and `subscript_expression` than in the other rules.
      2:  Specify a higher precedence in `new_expression` than in the other rules.
      3:  Add a conflict for these rules: `call_expression`, `new_expression`, `member_expression`, `subscript_expression`
    

    Here is how I read this conflict (I also wrote this in the commit message):

    One one hand, the ?. token may be part of a member_expression or subscript_expression. In that case, we want to SHIFT the ?. token before creating the new_expression node, in order to group things
    like this:

    'new' ( expression '?.' identifier)
          |                           |
          ----- member_expression ----
    

    On the other hand, the ?. token may be part of a call_expression, in which case, the parser thinks we would want to create the new_expression first, so as to group things like this:

    ( 'new' expression ) '?.' arguments
    |                  |
    -- new_expression --
    

    This second conflicting interpretation arises because new_expression currently has higher precedence than call_expression - we set the precedences this way in PR Remove calls from new expressions #89.

  4. 1d05f78

    Next I avoided this conflict by reducing the precedence of new_expression to match call_expression. This eliminates the conflicting interpretation (5) in the conflict message above, because there is no longer any reason to wrap group new foo in a new_expression before consuming the ?. token.

    But this reveals a new conflict:

     Unresolved conflict for symbol sequence:
    
       'new'  _expression  •  '('  …
    
     Possible interpretations:
    
       1:  'new'  (call_expression  _expression  •  arguments)  (precedence: 12)
       2:  (new_expression  'new'  _expression  •  arguments)   (precedence: 12)
       3:  (new_expression  'new'  _expression)  •  '('  …      (precedence: 12)
    
     Possible resolutions:
    
       1:  Specify a higher precedence in `call_expression` and `new_expression` than in the other rules.
       2:  Specify a higher precedence in `new_expression` than in the other rules.
       3:  Specify a left or right associativity in `new_expression`
       4:  Add a conflict for these rules: `call_expression`, `new_expression`
    

    To me, this conflict captures the fundamental problem that originally motivated us to add all of the complexity that we have encountered so far. It arises from the fact that new_expression has an optional arguments node. Intuitively, the question is, when parsing this code:

    new Foo()

    is it parsed

    1. as (new_expression (identifier) (arguments))
    2. or as (call_expression (new_expression (identifier)) (arguments))

    We want it to choose the first interpretation.

  5. e877b75

    Here I decided that, rather than trying to encode ☝️ this logic using precedence (which is more efficient at runtime in some ways), I will handle this ambiguity using the GLR mechanism of trying both interpretations. I opted into this behavior explicitly by adding this entry to conflicts (as suggested by the conflict message above):

    [$.new_expression, $.call_expression]

    Then, to make sure that the parser would pick interpretation 1 above, I added a dynamic precedence to the arguments node inside of new_expression. This expresses the idea that the parser should prefer a tree that has this structure.

@nbrahms
Copy link
Contributor

nbrahms commented Sep 15, 2020

@maxbrunsfeld This is awesome! Thank you for providing so much detail.

@mjambon
Copy link
Contributor Author

mjambon commented Sep 15, 2020

Thank you @maxbrunsfeld, this is the best GitHub comment I have ever seen! It believe it will boost our abilities a lot.

@mjambon mjambon deleted the optional-chains branch September 16, 2020 23:45
mjambon pushed a commit to semgrep/tree-sitter-typescript that referenced this pull request Sep 18, 2020
javascript. See tree-sitter/tree-sitter-javascript#137 (comment)
for full details.
Some conflicts remain and should be solved in a later commit, since they
look different than the issue we had with javascript.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants