Alternatives to sweet-expressions

by David A. Wheeler, 2006-06-07

I separately wrote a paper on how to make s-expressions more readable, including a proposal called sweet-expressions. While creating sweet-expressions, I came up with some alternatives. I don't like them as much, but I didn't want to lose the ideas, so here they are (below).

An alternative: Using explicit infix markers

The biggest area of concern, I suspect, is the infix operations swapping parameters. You can simply set the infix default to "off" above, and carry on. But what if you think everyone would always want to identify infix like that?

Well, if you prefer clearer identification of where the infix operations are occurring, that could be done instead. Here’s a version of the manual approach, or a printout warning that infix is occurring, using {...} around the infix list (without changing lists inside):

 defun factorial (n)
   if {n <= 1}
     {n * factorial{n - 1}}

Here’s an alternative with “?” indicating infix to follow (without changing lists inside):

 defun factorial (n)
   if (? n <= 1)
       ? n * factorial(? n - 1)

Here’s yet another, using #I(...):

 defun factorial (n)
   if #I(n <= 1)
     #I(n * factorial( #I(n - 1)))

But I suspect that having it done automatically, though it makes people nervous at first, is the better way to go. Simply have all printouts give normal s-expressions, or mark when infix is being used or must not be used, with the usual (...) lists in the structures where it cannot possibly be infix under any interpretation.

Other approaches

Below I’ve documented some of my earlier approaches to the problem. They’re more abandoned thinking, and probably have other problems not noted below. I include this so that the ideas won’t be lost.

My early attempts

My initial forays for creating an easier-to-read syntax travelled down the same road of most people. For example, I defined languages with infix operations (a+b), function names followed by parens (f(a,b)), and control structures with special rules (if, defun, etc.). I ended up re-learning the reasons these were not widely accepted by Lispers, though some past works made other mistakes that I could avoid.

One problem is that people sometimes add new parameters in the s-expression notation. The obvious way to devise a surface structure does not take this into account, and so some of these simple approaches have to keep re-modifying the grammar to handle this. And as a result, simple grammars tend to be unable to represent the full underlying language at any time. I think IACL2 ran into this very problem, for example; the infix notation is always inferior to the s-expressions, because it cannot access all the functionality the s-expression system can. That is not acceptable... and the problem is avoidable.

If constructs have a clear marking for the the end of the expression and permit extensions at the end, this problem mostly goes away. If you also have an escape hatch to s-expressions that works inside your other notation, the problem is solved. One solution I found was to use many different keywords map to the s-expression for “)”. You can use things like if.. then.. else.. fi, case.. esac, do.. od, def*.. end, begin.. end. (At first I thought it would be good to have () always map to nothing in the s-expression, but functions traditionally use () to enclose those parameters, and I believe they should use them and map to s-expression parentheses). This should make it easy to see the mapping to s-expressions, and since different keywords match different things it should be good at detecting mistakes. By ensuring that you allow “excess” parameters before the ending marker, you can also handle many syntax changes, e.g., “if a then b else c :no-exceptions fi”. But most importantly, there needs to be some sort of general-purpose escape mechanism that can handle arbitrary s-expression cases, at any point inside the other notation -- this eliminates the risk of not being flexible. For example, in the syntax below, {..} allows developers to create arbitrary s-expressions using thus not lose any functionality. You can quote these expressions by doing ’(...) around them. Taking this approach produces a very traditional surface syntax, yet one that can access the full power of the s-expression system underneath.

Here’s an example of what the rules might look like, taking this approach:
Infix StatementS-expression equivalent
(A)A ; now we can control precedence.
{ X , Y ... [,]}(X Y ...)
if X then Y else Z ... fi(if X Y Z ...)
if A then B elsif C then D else E fi(if A B (if C D E))
defun F (LIST) = ... end(defun F (LIST) ...)
def* ... end(def* ... ) ; Catch unknown def functions.
XX ; just copy if not special.

OP is a set of symbol characters matching RE [:+-\*/<>=&|]+. The “if” expression is slightly more complicated; it’s really:

if ::= if A then B <iftail> fi      => (if A B <iftail> )
iftail ::= elsif C then D <iftail>  => (if C D <iftail> )
iftail ::= else Z {, ZZ}*           => Z { , ZZ}*

The { ... } form is a statement block, which has a statement list inside. This is an escape clause -- if all else fails, you can use this to create any s-expression you need (a failure of IACL2). Presumably, the top level is a statement list.

For example, using this kind of approach, you could transform this Common Lisp definition:

 (defun factorial (n)
   (if (<= n 1)
       (* n (factorial (- n 1)))))

Into this kind of traditional infix notation:

 defun factorial(n) =
   if (n<=1) then 1
   else n*factorial(n-1)

By using only commas, semicolon can continue its old uses (e.g., for comments). One problem is that if you quote things and use comma to indicate “don’t quote here” (”lifting”), you’ll need to pick another character for one or the other uses (to keep them conflicting), or use a whitespace rule to disambiguate them.

I think many would like this look, if they had to read a lot of it. But this takes more work to close up than regular s-expressions (look at the end of the function factorial), and text editors can’t “bounce” between matching beginning and end markers. Also, in this approach there will also be a lot of rules, one for each construct that looks like a traditional language... and that can grow pretty quickly. IACL2 gets hairy because of this very problem... the list becomes disturbingly long.

Another problem is knowing “where the end is” for interactive use; if you enter “5” a simple expression might worry that “newline + 3” would follow, which is nasty. A trivial (and most user-friendly) approach is to presume that if the levels are all closed, and there’s a newline, you’re done. You could require that the user press something special (say a blank line, control-D, or ;;) to force evaluation. Or you could presume that the outermost level is the list inside a block, so the expression would end in a “,”. The same problem occurs for quoting; you need to quote by surrounding it with parens, as well as a preceding quote mark, to be clear.

Quoting gets funny; now you need to look to see if the first “atom” ends in “(” and if it does, move them out... so *( 3 4) is quoted as ’(* 3 4).

If you only look at the this form, and never at the s-expressions that result, this may not be a bad way to go. Dylan, Python, and others go this route. But if you have to look at lots of s-expressions, or handle a constantly-changing language, maybe this isn’t the best way. Perhaps there is another way, a road that is not as obvious and thus less travelled. So, let’s try to find some simple ideas that are general -- perhaps they will help.

I think there are two basic approaches:

  1. Keep as close to Lisp as you can
  2. Abandon Lisp notation.
Let’s start with the first, staying as close to Lisp as possible. We can use I-expressions to reduce the number of parentheses, add “name-prefixing” to make a more common function call notation, and some sort of infix approach. Or we can abandon Lisp s-expression-like notation altogether; I show one approach below, which I call ZY-expressions.

ZY-Expressions: Simple rules

I have another approach, where I abandon Lisp s-expression formatting entirely in parts. I don’t think these are needed, but since I thought about them, I thought I should document them. I call these ZY-expressions.

Of course, we can abandon trying to keep things looking just like s-expressions, and see if that helps. I would like a simple notation, and one where I can “bounce” between the beginning and end of an expression in my text editor (e.g., “%” in vim) -- since many text editors know that {} and () are matching pairs. Most of the other approaches I’ve seen (e.g., IACL2) that try to be “readable” yet to handle the full language they seem to explode into a long list of rules. And they tend to be fragile; they can’t handle changes to the syntax underneath. Indeed, the folks trying to develop M-expressions started, but never finished defining them or implementing them... they seemed to keep getting stuck on what the rules might be. Simple seems to be really necessary for a practical result.

How about this as a set of simple rules, and making these basically it:

Infix Statement       S-expression equivalent
a OP b                (OP x y)
FUNC(a,b..)           (FUNC a b ...) ; NO SPACE ALLOWED after FUNC.
(a)                   a    ; () for priority, not s-exp.
{a, b..}              (a b...)  ; force s-exp lists using {,,}.
a                     a  ; otherwise, things translate to themselves

We could add just a very few more rules, e.g., support leading “-” as well as infix “-” (it’s well-known how to do this). But if we keep things to a very short list of rules, and don’t try to redo all constructs, what have we gained?

In principle, this is actually not that different in appearance from some previous options. Most Algol-like alternatives of the past several decades start with the first two rules, including the original M-expressions (which used FUNCTION[x,y] instead).

But usually alternatives keep making MORE changes; the M-expression notation was never completed in part because they couldn’t decide where to stop. What happens if we stop here? Look at the gain: the transformation rules are VERY simple. The resulting notation is not identical to “traditional” notation, but it’s very close. In particular, it looks identical to spreadsheet user notation, and since many have used spreadsheets, it should be much easier to explain.

The same discussion earlier about infix operations still applies. I think any infix operation OP will need to have whitespace around it, to distinguish from names that have operations in their name. Repeating the same operation should translate into one s-expression, that is, the presentation (a OP b OP c ...) should translate into the s-expression (OP a b c ...). If you don’t want that, add parentheses, i.e., infix presentation (a OP (b OP c)) would become the s-expression (OP a (OP b c)).

Here’s the factorial program for Common Lisp, again, using this notation:

defun(factorial, {n},
   if(n <= 1,
      n * factorial(n - 1)))

Notice that the infix operators make this simpler, and we have fewer parentheses to close. The commas unambiguously separate parameters, eliminating problems in figuring out when infix operations end -- now prefix minus and infix minus are well-solved issues. The “)” characters unambigously end what they start, so it’s easy to see where things end. For the entire expression we do need to add (...), so we can tell where the entire expression ends, but only at the outmost level.

You now need to create a list of infix ops, their precedences, and their S-expression equivalents (possibly always the same name). You can do this indirectly by specifying a regular expression of them (e.g., [:+-\*/<>=&|]+) -- that’s enough for the 4 mathematical operators, the usual comparison operators, and typical representations for “and” and “or”. Perhaps “^” should be added (for “power”), though “**” has a distinguished history as this operator. The “?” could be added, but some systems have that in their function names; other possibilities include “!”, “%”, and “~”. Notice that this is more than enough for other operators like “->” (often “implies”), “<->” (often “iff”), “=>”, and so on. Note that this means that the underlying language should not define names that embed these characters, or, you’ll need to enforce rules that disambiguate between them. If you’re developing a new language (like BitC), then you can avoid defining function names with those characters, and then you don’t need the whitespace. But Common Lisp (for example) defines all sorts of functions with embedded “-”, so a disambiguation rule would be needed if this were used for Common Lisp. The obvious disambiguation rule is “must surround both sides of an infix operator with whitespace” .

How do you figure out when you’re done with a ZY-expression? After all, if “a OP b” is allowed, when you see “a” and whitespace you can’t know if you’re done! One way is to simply declare that infix operations (x OP y) cannot occur in an outermost expression... you have to surround them with something (e.g., parenthesis or make them a parameter). Thus, once you hit the final closing “)”, if the next character is whitespace, you’re done.

So how do you quote such ZY-expressions? In LISP, you quote s-expressions by adding ‘ in front (or using the quote macro), and you can do the same for ZY-expressions. You can quote any subexpression by doing ’( ZY-expression ), that is, adding one more parenthesis around the whole thing (which would handle potential infix operations). One unfortunate thing about this notation is that the comma is used up (which is useful inside quoted expressions to “unquote” it). Even for function calls you need to add the outer paren, because this form is legal: f(5) + 3

This can handle the extensions that Lisp-like languages often accrue. For example, it’s common for Lisp systems to “extend” the syntax because it’s easy, e.g., maybe “if” or “defun” has optional parameters afterwards. IACL2 had trouble with this; you had to keep fiddling with the grammar to handle new parameters, and thus it could not handle “advanced” uses. Not a problem here -- just add the parameter.

You could use “;” instead of “,”, or allow both. I hate to waste a character on options, so I would pick just one. Here, I use “,” since that is more common; I would then use the semicolon for something else (e.g., begin comment to end of line, like Lisp).

You could use different bracketing characters. The original M-expression notation used FUNCTION[..] instead of FUNCTION(..). I figure it’d be nice to leave [..] for other uses, and (..) are commonly used for this purpose. You could use {..} for overriding precedence, and restore (..) (other than FUNCTION() uses) for its use for S-expressions.

How do you handle computed functions? For many programs I think the best approach is to use a special function (like funcall, apply, or mapcar) that calls the function as a parameter, because this is often clearer to readers. So what you want is the s-expression (funcall (findfunction x) y z), which transforms to ZY-expression funcall(findfunction(x), y, z). But if you want to use a direct notation, just use {..} in this case... that’s simple! So s-expression ((whichfunction x) param1 param2) would be described in ZY-expression notation as {whichfunction(x), param1, param2} -- it looks weird, but it works and can handle the entire expressiveness of s-expressions. Here are some bad alternatives:

  1. Do not use parentheses to force computation of the function first, followed by its parameters. I.e., (whichfunction(x))(param1, param2). That can cause trouble with quoting (e.g., ‘); you can’t have a simple rule like “when you reach the first closing outermost paren you are done.”
  2. Adding an extra pair of parens around the ZY-expression enclosing the function and its parameters, when this occurs, doesn’t help if you’re quoting... because as soon as you break the expression apart, the same problem re-occurs. Originally I thought differently, because parens don’t add depth to the equivalent s-expression when they’re not performing function calls, they only control precedence. So I thought the s-expression ((whichfunction x) param1 param2) could become ZY-expression (whichfunction(x)(param1, param2)). That means that the expression (x) above needs to permit after it an optional “function call parameters” set. But since {..} works, and is clearer too, just use that.
So clearly ZY-expressions can be as expressive, even when handling computed functions. There is a risk of doing it wrong with computed functions, but the solution is to remember that you must add parens that surround both the computation of the function and its parameters. The extra syntax could be helpful in warning the reader that something different is happening.

I could add a “prettying” rule that if an atom began with a colon (i.e., it was a Common Lisp keyword), then it could optionally be followed by another expression. Then you could say “f(a, :option b)” . Obviously you can’t do this in a context where the whitespace might end the expression, so you might have to surround it with parentheses. This does cause complications, though. What is ‘f(a, :option ,b) - is that an “:option” followed by the optional expression “,b” (which happens to be lifted), or is “:option” followed by another parameter, whose value is b? In this case it does not matter, the s-expression would be the same... but then ‘f(a, :option ,b + 2) shows up. I suspect these can be dealt with, but now we have complications getting imposed on the parser. Further thinking in this area might be valuable. A trivial rule that would stop this is requiring that all commas used as parameter separators must follow the text of their expressions for the commas between parameters in function calls, so that but then ‘f(a, :option, b + 2) has 3 parameters, while ‘f(a, :option ,b) has 2 parameters... but now we are imposing addition rules for an odd case.

A nifty bonus: if you have a parser that was expecting only s-expressions, such as a current Lisp implementation or BitC’s current parser, a “{...}” pair could switch to this mode (I suggest that it just switch modes -- it would not create a new sub-expression just by changing modes). This means you could have a mixed parser, if you wanted to. Here’s the Common Lisp factorial, if {..} switched to ZY-expressions:

 (defun factorial (n)
   (if {n <= 1}
       {n * factorial(n - 1)}))

One quibble: if you use this interactively, how do you know when an expression is done? I.e., is “3 (newline) -4” the same as “3-4”? Here’s one answer: If you see a newline, and there is no open expression, you’re done. So “3 (newline) -4” is two expressions, “3” and “-4”. That may not be so easy with traditional parsers, which will have multiple options and no way to say “quit” easily. One approach for interactive use is to use a special symbol, like “;;”, to signal “I’m done, please evaluate”... though that is very hokey.

At first, I though quoting things, like ‘factorial(5), was no issue... just move on to the closing paren. But if you can also quote individual names (atoms) that won’t fly, becuase ‘a has no closing paren. But quoting isn’t much harder. A ZY-expression can be safely quoted by surrounding the whole expression with parens and then adding ‘ in front. The parens are only for precedence and won’t add to the equivalent s-expression. So ’(a) and ’(factorial(5)) should work. I thought this would also solve the interactive issue, by adding another parens around everything, but it doesn’t; (defun ...) could still be completed by another (..) after it, if we allow computed functions. Nothing is perfect, it seems. If you require the use of apply, etc. there’s no issue.

Basically, this means that you can switch to infix whenever it’s convenient to you. If you like this format, or it’s especially useful for what you are doing, you can just surround entire files with {..}. This has no ambiguity issues, because the closing “}” unambiguously stops everything.

The “:” could be an operator too, for typing.

Let’s see how this works elsewhere. Given this Lisp s-expression:

 (cond ((> temp 100) 'hot)
       ((> temp 60) 'okay)
       ((> temp 40) 'cool)
       (t 'cold))
We end up with:
 cond( {temp > 100, 'hot },
       {temp > 60, 'okay },
       {temp > 40, 'cool },
       {t, 'cold} )

This does hint that a possible “optimization” might be to omit the “,” after a closing “}”. It’s not clear that this is a good idea, though; regularity has its own rewards. So I’m not doing that.

One problem with this is that you need a way to identify the end of the entire expression. E.G., “5” and “5 + 4” are expressions, so given ‘5, you wouldn’t know if there’s more expression or not (same problem). So to quote a fragment, you need to have something ELSE indicate the end. E.G., enclose it in {...} (if the exterior is s-expression format, enclose it in (...) (if the exterior is in ZY-expression format), or something else (e.g., end-of-file). You can surround with parens to solve that But I think the idea of an intentionally simple transform STILL makes sense.

I’m naming this limited notation “ZY-expressions”, because all the good (simple) names for expressions seem to be taken.

Another approach: Restricted ZY-expressions (rigid parens around infix operators)

If you really want to keep the notation identical with program fragments (instead of adding enclosing parens at the outmost level), try this restricted version of the above -- always put parens around infix ops, and always have the opening paren IMMEDIATELY after a function name:

ZY-Expression             S-expression equivalent
(a OP b)           <=>   (OP x y)  ; require whitespace around OP in ZY-expr
FUNC(a,b..)        <=>   (FUNC a b ...) ; NO whitespace before open paren.
{a, b..}           <=>   (a b...)  ; force s-exp lists using {,,}.

Again, an infix operation OP will need to have whitespace around it, to distinguish from names that have operations in their name.

This is a much more rigid format, but it resolves the problem above. Given the Common Lisp:

 (defun factorial (n)
    (if (<= n 1)
	   (* n (factorial (- n 1)))))

You can rewrite it this way, like a spreadsheet:

defun(factorial, {n},
   if( (n <= 1),
      (n * factorial( (n - 1) ))))

This format is a lot more rigid, unfortunately; all the parentheses around the infix form make it harder to read. But it’s certainly easy to find the end of anything.

You might want to look at my paper High Assurance (for Security or Safety) and Free-Libre / Open Source Software (FLOSS)... with Lots on Formal Methods.

You can see David A. Wheeler’s home page,