clojarse-js

# Syntax resources # - [the CCW ANTLR grammar](https://github.com/laurentpetit/ccw) - [the Clojure implementation](https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LispReader.java) # Structural parsing # ## Tokens ## Definitions: - macro character: one of ``` ";@^`~()[]{}\'%# ``` - terminating macro character: one of ``` ";@^`~()[]{}\ ``` ### Comment ### - open: `/(;|#!)/` - value: `/[^\n\r\f]*/` ### Whitespace ### - value: `/[ \t,\n\r\f]+/` Also, everything else that Java's `Character.isWhitespace` considers to be whitespace. See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(int). ### Number ### basically, if it starts with a digit, or the combination of +/- followed by a digit, it's a number. See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isDigit(int) for what is considered to be a digit. - sign: `/[-+]?/` - first: `/\d/` - rest: `(not1 ( whitespace | macro ) )(*)` ### Ident ### - first: `(not1 ( whitespace | macro ) ) | '%'` - rest: `(not1 ( whitespace | terminatingMacro ))(*)` Why does this include `%...`? Because: outside of a `#()` function, `%...` is just a normal ident. ### Character ### - open: `\\` - first: `.` - rest: `(not1 ( whitespace | terminatingMacro ) )(*)` ### String ### - open: `"` - body: `/([^\\"]|\\.)*/` -- `.` includes newlines - close: `"` This is only approximately correct. how could it go wrong? ### Regex ### - open: `#"` - body: `/([^\\"]|\\.)*/` -- `.` includes newlines - close: `"` ### Punctuation ### - `(` - `)` - `[` - `]` - `{` - `}` - `@` - `^` - `'` - ``` ` ``` - `~@` - `~` - #-dispatches - `#(` - `#{` - `#^` - `#'` - `#=` - `#_` - `#<` -- ??? unreadable reader ??? - error: `#` followed by anything else (except for `#!` and `#"`) ## Hierarchical forms ## Whitespace, comments and discard forms (`#_`) can appear in any amount between tokens. ### Discard ### - open: `#_` - value: `Form` ### List ### - open: `(` - body: `Form(*)` - close: `)` ### Vector ### - open: `[` - body: `Form(*)` - close: `]` ### Table ### - open: `{` - body: `Form(*)` - close: `}` ### Quote ### - open: `'` - value: `Form` ### Deref ### - open: `@` - value: `Form` ### Unquote ### - open: `~` - value: `Form` ### Unquote splicing ### - open: `~@` - value: `Form` ### Syntax quote ### - open: ``` ` ``` - value: `Form` ### Function ### - open: `#(` - body: `Form(*)` - close: `)` ### Set ### - open: `#{` - body: `Form(*)` - close: `}` ### Meta ### - open: `'^' | '#^'` - metadata: `Form` - value: `Form` ### Eval ### - open: `#=` - value: `Form` ### Var ### - open: `#'` - value: `Form` ### Unreadable ### - open: `#<` - value: ?????????? ### Other dispatch ### - open: `/#./` - value: ??????????? ### Form ### String | Number | Char | Ident | Regex | List | Vector | Set | Table | Function | Deref | Quote | Unquote | UnquoteSplicing | SyntaxQuote | Meta | Eval | Var Order in which they're tried does seem to be important for some cases, since a given input might match multiple patterns: - Number before Ident ### Clojure ### Form(*) # Token parsers # Goal of this phase: determine the internal structure of the number, ident, char, string, and regex tokens ## String ## Syntax - escape - open: `\` - error: next char matches `/[^btnfr\\"0-7u]/` - value - simple - `/[btnfr\\"]/` - octal - `/[0-7]{1,3}/` - stops when: 3 octal characters parsed, or whitespace hit, or macro character hit - error: digit is 8 or 9 - error: hasn't finished, but encounters character which is not whitespace, octal, or macro - unicode - `/u[0-9a-zA-Z]{4}/` - error: less than four hex characters found - `/[^\\"]/`: plain character (not escaped) - what about ?? unprintable chars (actual newline, etc.) ?? Notes - macro and whitespace characters have special meaning inside strings: they terminate octal and unicode escape sequences - octal and unicode escapes use Java's `Character.digit` and `Character.isDigit`, so they seem to work on other forms of digits, such as u+ff13 "\uＡＢＣＤ" is the 1 character string "ꯍ" // b/c each of ＡＢＣＤ is a digit according to Character.digit(ch, 16) ## Regex ## Syntax - real escape: `/\\[\\"]/` - fake escape: `/\\[^\\"]/` so-called because both characters get included in output Notes ## Number ## Syntax - ratio - sign: `/[-+]?/` - numerator: `/[0-9]+/` - slash: `/` - denominator: `/[0-9]+/` - float - sign: `/[-+]?/` - int: `/[0-9]+/` - decimal (optional) - dot: `.` - int: `/[0-9]*/` - exponent (optional) - e: `/[eE]/` - sign: `/[+-]?/` - power: `/[0-9]+/` - suffix - `/M?/` - integer - sign: `/[+-]?/` - body - base16 - `/0[xX]hex+/ - where `hex` is `/[0-9a-zA-Z]/` - base8 (not sure about this) - `/0[0-7]+/` - error: `08` - base(2-36) - `/[1-9][0-9]?[rR][0-9a-zA-Z]+/` - base10 - `/[1-9][0-9]*/` - bigint suffix: `/N?/` Notes - apparently, can't apply bigint suffix to base(2-36) ## Char ## - open: `\` - value - long escape - `newline` - `space` - `tab` - `backspace` - `formfeed` - `return` - unicode escape -- *not* identical to string's unicode escape - `XXXX` where X is a hex character - hex characters defined by Java's `Character.digit(<some_int>, 16)` - includes some surprises! - octal escape - `oX`, `oXX`, or `oXXX` where X is an octal character - octal characters defined by Java's `Character.digit(<som_int>, 8)` - includes surprises! - simple character (not escaped) - any character, including `n`, `u`, `\`, an actual tab, space, newline - what about unprintable characters? ## Ident ## Syntax - special errors - `::` anywhere but at the beginning - if it matches `/([:]?)([^\d/].*/)?(/|[^\d/][^/]*)/`, and: - `$2 =~ /:\/$/` -> error - `$3 =~ /:$/` -> error - value - reserved - `nil` - `true` - `false` - not reserved - type: starts with: - `::` -- auto keyword - `:` -- keyword - else -- symbol - namespace (optional) - `/[^/]+/` - `/` - name - `/.+/` - code used to verify against implementation: (fn [my-string] (let [f (juxt type namespace name)] (try (f (eval (read-string my-string))) (catch RuntimeException e (.getMessage e)))))