clojarse-js
concrete parsing of Clojure code
# Syntax resources #
- [the CCW ANTLR grammar](https://github.com/laurentpetit/ccw)
- [the Clojure implementation](https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LispReader.java)
# Structural parsing #
## Tokens ##
Definitions:
- macro character: one of ``` ";@^`~()[]{}\'%# ```
- terminating macro character: one of ``` ";@^`~()[]{}\ ```
### Comment ###
- open: `/(;|#!)/`
- value: `/[^\n\r\f]*/`
### Whitespace ###
- value: `/[ \t,\n\r\f]+/`
Also, everything else that Java's `Character.isWhitespace` considers to be whitespace.
See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(int).
### Number ###
Basically, a token is a number if it starts with a digit, or with `+` or `-`
followed by a digit.
See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isDigit(int)
for what is considered to be a digit.
- sign: `/[-+]?/`
- first: `/\d/`
- rest: `(not1 ( whitespace | macro ) )(*)`
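
The "starts like a number" rule can be sketched in JavaScript as below. This is only an illustration: `isDigit` and `startsNumber` are hypothetical names, and `\p{Nd}` (Unicode decimal digits) is assumed to be a close stand-in for Java's `Character.isDigit`.

```javascript
// Hypothetical helper: approximates Java's Character.isDigit with the
// Unicode "decimal digit number" category.
function isDigit(ch) {
  return ch !== undefined && /^\p{Nd}$/u.test(ch);
}

// Returns true if `text` begins like a Number token:
// a digit, or a sign immediately followed by a digit.
function startsNumber(text) {
  const [first, second] = Array.from(text); // split by code point
  if (isDigit(first)) return true;
  return (first === '+' || first === '-') && isDigit(second);
}
```

With this rule, `-1` starts a number but `-a` and a lone `+` do not, so they fall through to Ident.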
### Ident ###
- first: `(not1 ( whitespace | macro ) ) | '%'`
- rest: `(not1 ( whitespace | terminatingMacro ))(*)`
Why does this include `%...`?
Because: outside of a `#()` function, `%...` is just a normal ident.
### Character ###
- open: `\\`
- first: `.`
- rest: `(not1 ( whitespace | terminatingMacro ) )(*)`
### String ###
- open: `"`
- body: `/([^\\"]|\\.)*/` -- `.` includes newlines
- close: `"`
This is only approximately correct. How could it go wrong?
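
A minimal JavaScript sketch of the String token pattern, assuming a sticky regex is an acceptable way to match at a given offset. `matchString` is a hypothetical name. Note that this token-level match accepts any escaped character, including ones the String token parser later rejects.

```javascript
// open `"`, then any run of non-backslash/non-quote characters or
// backslash + any character (newlines included), then close `"`.
const STRING_TOKEN = /"(?:[^\\"]|\\[\s\S])*"/y;

// Returns the matched token text starting exactly at `pos`, or null.
function matchString(text, pos) {
  STRING_TOKEN.lastIndex = pos;
  const m = STRING_TOKEN.exec(text);
  return m ? m[0] : null;
}
```

An unterminated string such as `"abc` yields `null` here; a real tokenizer would probably want to report that as an error instead.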
### Regex ###
- open: `#"`
- body: `/([^\\"]|\\.)*/` -- `.` includes newlines
- close: `"`
### Punctuation ###
- `(`
- `)`
- `[`
- `]`
- `{`
- `}`
- `@`
- `^`
- `'`
- ``` ` ```
- `~@`
- `~`
- #-dispatches
  - `#(`
  - `#{`
  - `#^`
  - `#'`
  - `#=`
  - `#_`
  - `#<` -- ??? unreadable reader ???
  - error: `#` followed by anything else (except for `#!` and `#"`)
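
One way to match the punctuation tokens is to try the two-character ones first, so that `~@` wins over `~`. The sketch below is an assumption about how this could be done, not the project's actual code; `PUNCTUATION` and `matchPunctuation` are hypothetical names.

```javascript
// Two-character tokens come first so the longest match wins.
const PUNCTUATION = [
  '~@', '#(', '#{', '#^', "#'", '#=', '#_', '#<',
  '(', ')', '[', ']', '{', '}', '@', '^', "'", '`', '~',
];

// Returns the punctuation token at `pos`, or null if there is none.
// Throws for `#` followed by anything that is not a known dispatch,
// except `#!` (comment) and `#"` (regex), which other token rules handle.
function matchPunctuation(text, pos) {
  for (const p of PUNCTUATION) {
    if (text.startsWith(p, pos)) return p;
  }
  const next = text[pos + 1];
  if (text[pos] === '#' && next !== '!' && next !== '"') {
    throw new Error('invalid dispatch at offset ' + pos);
  }
  return null;
}
```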
## Hierarchical forms ##
Whitespace, comments and discard forms (`#_`) can appear in any amount
between tokens.
### Discard ###
- open: `#_`
- value: `Form`
### List ###
- open: `(`
- body: `Form(*)`
- close: `)`
### Vector ###
- open: `[`
- body: `Form(*)`
- close: `]`
### Table ###
- open: `{`
- body: `Form(*)`
- close: `}`
### Quote ###
- open: `'`
- value: `Form`
### Deref ###
- open: `@`
- value: `Form`
### Unquote ###
- open: `~`
- value: `Form`
### Unquote splicing ###
- open: `~@`
- value: `Form`
### Syntax quote ###
- open: ``` ` ```
- value: `Form`
### Function ###
- open: `#(`
- body: `Form(*)`
- close: `)`
### Set ###
- open: `#{`
- body: `Form(*)`
- close: `}`
### Meta ###
- open: `'^' | '#^'`
- metadata: `Form`
- value: `Form`
### Eval ###
- open: `#=`
- value: `Form`
### Var ###
- open: `#'`
- value: `Form`
### Unreadable ###
- open: `#<`
- value: ??????????
### Other dispatch ###
- open: `/#./`
- value: ???????????
### Form ###
String | Number | Char | Ident | Regex |
List | Vector | Set | Table | Function |
Deref | Quote | Unquote | UnquoteSplicing |
SyntaxQuote | Meta | Eval | Var
The order in which they're tried matters in some cases, since
a given input might match multiple patterns:
- Number before Ident
### Clojure ###
Form(*)
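
The hierarchical grammar above maps naturally onto a small recursive-descent parser. The JavaScript sketch below is an illustration under simplifying assumptions: the tokenizer only knows punctuation and "atoms" (numbers and idents), it silently skips anything else (strings, regexes, chars, comments, `#<`), and all names (`TOKEN`, `parse`, the node shapes) are hypothetical.

```javascript
// Punctuation, or an atom: first char is not whitespace/macro,
// the rest is not whitespace/terminating macro.
const TOKEN = /~@|#[({^'=_]|[()\[\]{}@^'`~]|[^\s,";@^`~()\[\]{}\\'#][^\s,";@^`~()\[\]{}\\]*/g;

// open token -> [node type, close token]
const COLLECTIONS = new Map([
  ['(', ['list', ')']], ['[', ['vector', ']']], ['{', ['table', '}']],
  ['#(', ['function', ')']], ['#{', ['set', '}']],
]);
// open token -> node type, for forms that wrap exactly one Form
const PREFIXES = new Map([
  ["'", 'quote'], ['@', 'deref'], ['~', 'unquote'], ['~@', 'unquote-splicing'],
  ['`', 'syntax-quote'], ["#'", 'var'], ['#=', 'eval'],
]);

function parse(text) {
  const tokens = text.match(TOKEN) || [];
  let i = 0;

  // Discard: `#_` followed by one Form, allowed anywhere between tokens.
  function skipDiscards() {
    while (tokens[i] === '#_') { i += 1; form(); }
  }

  function form() {
    skipDiscards();
    if (i >= tokens.length) throw new Error('unexpected end of input');
    const t = tokens[i++];
    if (COLLECTIONS.has(t)) {
      const [type, close] = COLLECTIONS.get(t);
      return { type, body: body(close) };
    }
    if (PREFIXES.has(t)) return { type: PREFIXES.get(t), value: form() };
    if (t === '^' || t === '#^') {
      const metadata = form();
      return { type: 'meta', metadata, value: form() };
    }
    if (t === ')' || t === ']' || t === '}') throw new Error('unmatched ' + t);
    return { type: 'atom', text: t };
  }

  // Form(*) up to `close`; `close === null` means "until end of input".
  function body(close) {
    const forms = [];
    for (;;) {
      skipDiscards();
      if (i >= tokens.length) {
        if (close === null) return forms;
        throw new Error('missing ' + close);
      }
      if (tokens[i] === close) { i += 1; return forms; }
      forms.push(form());
    }
  }

  return body(null); // Clojure: Form(*)
}
```

For example, `parse("(a [b #_c d] 'e)")` produces one `list` node whose body holds an atom, a `vector` with two atoms (the discarded `c` is dropped), and a `quote` node.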
# Token parsers #
Goal of this phase: determine the internal structure of the number, ident,
char, string, and regex tokens.
## String ##
Syntax
- escape
  - open: `\`
  - error: next char matches `/[^btnfr\\"0-7u]/`
  - value
    - simple
      - `/[btnfr\\"]/`
    - octal
      - `/[0-7]{1,3}/`
      - stops when: 3 octal characters parsed, or whitespace hit,
        or macro character hit
      - error: digit is 8 or 9
      - error: hasn't finished, but encounters character which is not
        whitespace, octal, or macro
    - unicode
      - `/u[0-9a-fA-F]{4}/`
      - error: fewer than four hex characters found
- `/[^\\"]/`: plain character (not escaped)
  - what about ?? unprintable chars (actual newline, etc.) ??
Notes
- macro and whitespace characters have special meaning inside strings:
they terminate octal and unicode escape sequences
- octal and unicode escapes use Java's `Character.digit` and
  `Character.isDigit`, so they seem to work on other forms of digits,
  such as U+FF13 (fullwidth digit three)
- example: `"\uABCD"` is the 1 character string `"ꯍ"`, because each of
  `A`, `B`, `C`, `D` is a digit according to `Character.digit(ch, 16)`
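
The escape rules above can be sketched as a small decoder. This JavaScript version is an approximation under stated assumptions: it only accepts ASCII digits (so it does not reproduce the `Character.digit` surprises), and `decodeString`, `SIMPLE`, and `STOP` are hypothetical names.

```javascript
const SIMPLE = new Map([
  ['b', '\b'], ['t', '\t'], ['n', '\n'], ['f', '\f'], ['r', '\r'],
  ['\\', '\\'], ['"', '"'],
]);
// Whitespace or macro character: these may end a short octal escape.
const STOP = /[\s,";@^`~()\[\]{}\\'%#]/;

// Decodes the body of a string token (the text between the quotes).
function decodeString(body) {
  let out = '';
  for (let i = 0; i < body.length; i++) {
    const ch = body[i];
    if (ch !== '\\') { out += ch; continue; } // plain character
    const next = body[++i];
    if (SIMPLE.has(next)) {
      out += SIMPLE.get(next);
    } else if (next === 'u') {
      const hex = body.slice(i + 1, i + 5);
      if (!/^[0-9a-fA-F]{4}$/.test(hex)) throw new Error('invalid unicode escape');
      out += String.fromCharCode(parseInt(hex, 16));
      i += 4;
    } else if (/[0-7]/.test(next ?? '')) {
      let digits = next;
      while (digits.length < 3 && /[0-7]/.test(body[i + 1] ?? '')) digits += body[++i];
      const after = body[i + 1];
      // Fewer than 3 digits: the next character must be whitespace or macro.
      if (digits.length < 3 && after !== undefined && !STOP.test(after)) {
        throw new Error('invalid octal digit: ' + after);
      }
      const code = parseInt(digits, 8);
      if (code > 0o377) throw new Error('octal escape out of range');
      out += String.fromCharCode(code);
    } else {
      throw new Error('unsupported escape: \\' + next);
    }
  }
  return out;
}
```

Under these rules `\101` and `\u0041` both decode to `A`, while `\1a` is rejected because `a` cannot terminate an unfinished octal escape.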
## Regex ##
Syntax
- real escape: `/\\[\\"]/`
- fake escape: `/\\[^\\"]/`
so-called because both characters get included in output
Notes
## Number ##
Syntax
- ratio
  - sign: `/[-+]?/`
  - numerator: `/[0-9]+/`
  - slash: `/`
  - denominator: `/[0-9]+/`
- float
  - sign: `/[-+]?/`
  - int: `/[0-9]+/`
  - decimal (optional)
    - dot: `.`
    - int: `/[0-9]*/`
  - exponent (optional)
    - e: `/[eE]/`
    - sign: `/[+-]?/`
    - power: `/[0-9]+/`
  - suffix
    - `/M?/`
- integer
  - sign: `/[+-]?/`
  - body
    - base16
      - `/0[xX]hex+/`
      - where `hex` is `/[0-9a-fA-F]/`
    - base8 (not sure about this)
      - `/0[0-7]+/`
      - error: `08`
    - base(2-36)
      - `/[1-9][0-9]?[rR][0-9a-zA-Z]+/`
    - base10
      - `/[1-9][0-9]*/`
  - bigint suffix: `/N?/`
Notes
- apparently, can't apply bigint suffix to base(2-36)
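
The three number shapes can be tried in order with anchored regexes. The sketch below is an assumption-laden illustration, not the project's implementation: `classifyNumber` and the result fields are hypothetical, a lone `0` is treated as base 10, and a leading `0` followed by a non-octal digit (such as `08`) is rejected outright rather than falling through to float.

```javascript
// Groups: 1 sign, 2 hex, 3 octal, 4 radix, 5 radix digits,
//         6 decimal, 7 bad leading-zero form, 8 bigint suffix.
const INTEGER = /^([-+]?)(?:0[xX]([0-9a-fA-F]+)|0([0-7]+)|([1-9][0-9]?)[rR]([0-9a-zA-Z]+)|(0|[1-9][0-9]*)|(0[0-9]+))(N)?$/;
const FLOAT = /^([-+]?[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?)(M)?$/;
const RATIO = /^([-+]?)([0-9]+)\/([0-9]+)$/;

function classifyNumber(token) {
  let m = INTEGER.exec(token);
  if (m) {
    const [, sign, hex, octal, radix, radixDigits, decimal, bad, big] = m;
    if (bad !== undefined) throw new Error('invalid number: ' + token);
    const base = hex !== undefined ? 16
      : octal !== undefined ? 8
      : radix !== undefined ? Number(radix)
      : 10;
    const digits = hex ?? octal ?? radixDigits ?? decimal;
    // Every digit must be valid in the chosen base (this is also what
    // makes a trailing `N` on a radix number fail or count as a digit).
    const valid = base >= 2 && base <= 36 &&
      Array.from(digits).every(d => !Number.isNaN(parseInt(d, base)));
    if (!valid) throw new Error('invalid number: ' + token);
    return { type: 'integer', sign, base, digits, bigint: big === 'N' };
  }
  m = FLOAT.exec(token);
  if (m) return { type: 'float', text: m[1], bigdecimal: m[4] === 'M' };
  m = RATIO.exec(token);
  if (m) return { type: 'ratio', sign: m[1], numerator: m[2], denominator: m[3] };
  throw new Error('invalid number: ' + token);
}
```

Integer is tried before float because a plain `42` would otherwise also match the float pattern.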
## Char ##
- open: `\`
- value
  - long escape
    - `newline`
    - `space`
    - `tab`
    - `backspace`
    - `formfeed`
    - `return`
  - unicode escape -- *not* identical to string's unicode escape
    - `uXXXX` where X is a hex character
    - hex characters defined by Java's `Character.digit(<some_int>, 16)`
      - includes some surprises!
  - octal escape
    - `oX`, `oXX`, or `oXXX` where X is an octal character
    - octal characters defined by Java's `Character.digit(<some_int>, 8)`
      - includes surprises!
  - simple character (not escaped)
    - any character, including `n`, `u`, `\`, an actual tab, space, newline
    - what about unprintable characters?
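
A sketch of the Char rules in JavaScript, assuming the token text after the opening `\` has already been extracted. It is ASCII-only (no `Character.digit` surprises), it assumes octal escapes above `377` are out of range as they are for strings, and `decodeChar` and `NAMED` are hypothetical names.

```javascript
const NAMED = new Map([
  ['newline', '\n'], ['space', ' '], ['tab', '\t'],
  ['backspace', '\b'], ['formfeed', '\f'], ['return', '\r'],
]);

// `token` is the char token without its leading backslash.
function decodeChar(token) {
  // A single character is always itself, even `n`, `u`, or `o`.
  if (token.length === 1) return token;
  if (NAMED.has(token)) return NAMED.get(token);
  if (/^u[0-9a-fA-F]{4}$/.test(token)) {
    return String.fromCharCode(parseInt(token.slice(1), 16));
  }
  if (/^o[0-7]{1,3}$/.test(token)) {
    const code = parseInt(token.slice(1), 8);
    if (code > 0o377) throw new Error('octal escape out of range: \\' + token);
    return String.fromCharCode(code);
  }
  throw new Error('unsupported character: \\' + token);
}
```

The single-character check comes first, which is why `\n` is the letter `n` and not a newline.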
## Ident ##
Syntax
- special errors
  - `::` anywhere but at the beginning
  - if it matches `/([:]?)([^\d/].*/)?(/|[^\d/][^/]*)/`, and:
    - `$2 =~ /:\/$/` -> error
    - `$3 =~ /:$/` -> error
- value
  - reserved
    - `nil`
    - `true`
    - `false`
  - not reserved
    - type: starts with:
      - `::` -- auto keyword
      - `:` -- keyword
      - else -- symbol
    - namespace (optional)
      - `/[^/]+/`
      - `/`
    - name
      - `/.+/`
Code used to verify against implementation:

    (fn [my-string]
      (let [f (juxt type namespace name)]
        (try
          (f (eval (read-string my-string)))
          (catch RuntimeException e
            (.getMessage e)))))
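
The Ident rules can also be sketched on the JavaScript side. This is a hedged illustration that follows the syntax list in this section; `classifyIdent`, `RESERVED`, and the result shape are hypothetical, and treating anything that fails the pattern as an error is an assumption.

```javascript
const RESERVED = new Set(['nil', 'true', 'false']);
// Groups: 1 optional colon, 2 namespace part (with its slash), 3 name part.
const SYMBOL = /^(:?)([^\d\/].*\/)?(\/|[^\d\/][^\/]*)$/;

function classifyIdent(token) {
  if (RESERVED.has(token)) return { type: 'reserved', name: token };
  const m = SYMBOL.exec(token);
  if (!m) throw new Error('invalid token: ' + token);
  const [, , ns, name] = m;
  // Special errors: `::` after the start, namespace ending in `:/`,
  // or name ending in `:`.
  if (token.indexOf('::', 1) !== -1 ||
      (ns !== undefined && ns.endsWith(':/')) ||
      name.endsWith(':')) {
    throw new Error('invalid token: ' + token);
  }
  const type = token.startsWith('::') ? 'auto-keyword'
    : token.startsWith(':') ? 'keyword'
    : 'symbol';
  const rest = token.slice(type === 'auto-keyword' ? 2 : type === 'keyword' ? 1 : 0);
  // Namespace is everything before the first slash; `/` alone is a name.
  const slash = rest.indexOf('/');
  if (slash === -1 || rest === '/') return { type, namespace: null, name: rest };
  return { type, namespace: rest.slice(0, slash), name: rest.slice(slash + 1) };
}
```

Results from a sketch like this can be compared against what the Clojure snippet above reports for the same input string.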