11Rust's lexical grammar is not context-free. Raw string literals are the source
22of the problem. Informally, a raw string literal is an ` r ` , followed by ` N `
33hashes (where N can be zero), a quote, any characters, then a quote followed
4- by ` N ` hashes. This grammar describes this as best possible:
4+ by ` N ` hashes. Critically, once inside the first pair of quotes,
5+ another quote cannot be followed by ` N ` consecutive hashes. e.g.
6+ ` r###""###"### ` is invalid.
7+
8+ This grammar describes this as best possible:
59
610 R -> 'r' S
711 S -> '"' B '"'
@@ -22,8 +26,39 @@ accepted as one by the above grammar, using the derivation:
2226(Where ` T : U ` means the rule ` T ` is applied, and ` U ` is the remainder of the
2327string.) The difficulty arises from the fact that it is fundamentally
2428context-sensitive. In particular, the context needed is the number of hashes.
25- I know of no way to resolve this, but also have not come up with a proof that
26- it is not context sensitive. Such a proof would probably use the pumping lemma
27- for context-free languages, but I (cmr) could not come up with a proof after
28- spending a few hours on it, and decided my time best spent elsewhere. Pull
29- request welcome!
29+
30+ To prove that Rust's string literals are not context-free, we will use
31+ the fact that context-free languages are closed under intersection with
32+ regular languages, and the
33+ [ pumping lemma for context-free languages] ( https://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages ) .
34+
35+ Consider the regular language ` R = r#+""#*"#+ ` . If Rust's raw string literals are
36+ context-free, then their intersection with ` R ` , ` R' ` , should also be context-free.
37+ Therefore, to prove that raw string literals are not context-free,
38+ it is sufficient to prove that ` R' ` is not context-free.
39+
40+ The language ` R' ` is ` {r#^n""#^m"#^n | m < n} ` .
41+
42+ Assume ` R' ` * is* context-free. Then ` R' ` has some pumping length ` p > 0 ` for which
43+ the pumping lemma applies. Consider the following string ` s ` in ` R' ` :
44+
45+ ` r#^p""#^{p-1}"#^p `
46+
47+ e.g. for ` p = 2 ` : ` s = r##""#"## `
48+
49+ Then ` s = uvwxy ` for some choice of ` uvwxy ` such that ` vx ` is non-empty,
50+ ` |vwx| < p+1 ` , and ` uv^iwx^iy ` is in ` R' ` for all ` i >= 0 ` .
51+
52+ Neither ` v ` nor ` x ` can contain a ` " ` or ` r ` , as the number of these characters
53+ in any string in ` R' ` is fixed. So ` v ` and ` x ` contain only hashes.
54+ Consequently, of the three sequences of hashes, ` v ` and ` x ` combined
55+ can only pump two of them.
56+ If we ever choose the central sequence of hashes, then one of the outer sequences
57+ will not grow when we pump, leading to an imbalance between the outer sequences.
58+ Therefore, we must pump both outer sequences of hashes. However,
59+ there are ` p+2 ` characters between these two sequences of hashes, and ` |vwx| ` must
60+ be less than ` p+1 ` . Therefore we have a contradiction, and ` R' ` must not be
61+ context-free.
62+
63+ Since ` R' ` is not context-free, it follows that the Rust's raw string literals
64+ must not be context-free.
0 commit comments