Readable s-expressions and sweet-expressions: Getting the infix fix and fewer parentheses in Lisp-like languages

by David A. Wheeler, 2006-06-17 (revised 2008-09-04)

This page is obsolete; see http://readable.sourceforge.net instead.

Many people find Lisp s-expressions hard to read as a programming notation. This paper discusses various ways to extend/modify s-expressions so they can be more readable without losing their power (such as quasiquoting, macros, and easily-manipulated program fragments). The goal is a notation that can be trivially translated to and from traditional s-expression notation (both by computer and in people’s heads), and isn't dependent on a particular underlying semantic. The paper identifies and discusses three approaches that seem particularly promising: indentation, name-prefixing (so func(x y) is the same as (func x y)), and infix support.

It then defines a particular way of combining these approaches, called “sweet-expressions”, that can be viewed as an essentially backward-compatible extension of s-expressions. A sweet-expression reader can accept typical cleanly-formatted s-expressions without change, but it also supports various extensions (optional syntactic sugar) that make much clearer code possible. This is purely a matter of screen presentation; underlying systems can continue to use s-expressions, unchanged. For example, here’s a trivial Common Lisp program that takes advantage of sweet-expression’s formatting extensions (the Scheme version is similar):

 defun factorial (n)         ; Parameters can be indented, but need not be
   if (n <= 1)               ; Supports infix, prefix, & function <=(n 1)
       1                     ; This has no parameters, so it's an atom.
       n * factorial(n - 1)  ; Function(...) notation supported

Sweet-expressions add the following abilities:

  1. Indentation. Indentation may be used instead of parentheses to start and end expressions: any indented line is a parameter of its parent, later terms on a line are parameters of the first term, lists of lists are marked with GROUP, and a function call with 0 parameters is surrounded or followed by a pair of parentheses. A “(” disables indentation until its matching “)”. Blank lines at the beginning of a new expression are ignored. A term that begins at the left edge and is immediately followed by newline is immediately executed, to make interactive use pleasant.
  2. Name-prefixing. Terms of the form ‘NAME(x y...)’, with no whitespace before ‘(’, are interpreted as ‘(NAME x y...)’;. If its content is an infix expression, it's considered one parameter.
  3. Infix. Optionally, expressions are automatically interpreted as infix if their second parameter is an infix operator (by matching an “infix operator” pattern of symbols), the first parameter is not an infix operator, and it has at least three parameters. Otherwise, expressions are interpreted as normal “function first” prefix notation. Infix expressions must have an odd number of parameters with the even ones being binary infix operators. You must separate each infix operator with whitespace on both sides. You can chain the same infix operator, so (2 + 3 + 4) is fine; to mix infix operators, use parentheses. Thus "2 + (y * -(x)" is a valid expression, equivalent to (+ 2 (* y (- x))). Infix operators must match this pattern (and in Scheme cannot be =>):
        [+-\*/<>=&\|\p{Sm}]{1-4}|\:|\|\|
    

For more information, see my website at http://www.dwheeler.com/readable.

This paper describes the rationale behind the older sweet-expressions version 0.1; see Sweet-expressions: Version 0.2 for information on the changes made to sweet-expressions since this paper.

Introduction

S-expression notation is a very simple notation for programs, and programs in variants of the Lisp programming language have traditionally been written using s-expressions. In this notation, an operation and its parameters is surrounded by parentheses; the operation to be performed is identified first, and each parameter afterwards is separated by whitespace. So “2+3” is written as “(+ 2 3)”. As noted in the May 2006 Wikipedia, this syntax “is extremely regular, which facilitates manipulation by computer. The reliance on [s-]expressions gives the language great flexibility. Because Lisp functions are themselves written as lists, they can be processed exactly like data: allowing easy writing of programs which manipulate other programs (metaprogramming).” In short, s-expressions are a powerful and regular way to represent programs and other data.

I’ve written a lot of Lisp code, so I’ve learned to read s-expressions fairly well. (I wrote a lot of Lisp code in the late 1980s on a $120,000 system.) But I am never the only one who reads my programs -- I need to make sure others can read my programs too.

Here’s the problem: for most software developers, programs written solely using s-expressions are hard to read, and they will only voluntarily use programming languages that allow, at least optionally, a more common notation. This is particularly true for the usual infix operations (+, <, and so on). People who use Lisp-based languages all the time eventually learn, but not everyone wants to use them all the time, and even developers who are comfortable with programs in s-expression notation need to share their work with others. Wikipedia notes that “the heavy use of parentheses in S-expressions has been criticized -- some joke acronyms for Lisp are ‘Lots of Irritating Superfluous Parentheses’, ‘Let’s Insert Some Parentheses’, or ‘Long Irritating Series of Parentheses’ “.

Yes, I know the arguments. “S-expressions are powerful and regular” ! Of course they are. They are a wonderful intermediate representation for lots of things, in fact. But they are a terrible user interface, especially if you are trying to share your results with others. People today -- even most programmers -- want systems that are “easy to use”, and one of the best ways to make something easy to use is to make it familiar. Most software developers have been trained for many years to use traditional infix mathematical notation, and S-expression notation fails to use it. Programming is often not a solo effort; development today is practically always a group effort, and readability for a large diverse group matters.

All the loud statements about the power of S-expressions cannot compete with time looking at real programs. Most software developers laugh at languages that are “obviously weak” (to them) because their default parser cannot handle <, *, and - in conventional ways. Below is a trivial Common Lisp program to compute a factorial - now ask, “is this really the most readable format possible for other readers?”

 (defun factorial (n)
   (if (<= n 1)
       1
       (* n (factorial (- n 1)))))

Maybe you think so now, but I use this trivial program as an example to show some alternatives that I think many people would prefer. For example, here’s the same program, but exploiting the abilities of sweet-expressions; a reader for sweet-expressions could read both the previous expression and this one:

 defun factorial (n)
   if (n <= 1)
       1
       n * factorial(n - 1)

The 1993 paper “The Evolution of Lisp” by Guy L. Steele, Jr. and Richard P. Gabriel section 3.5.1 discusses “Algol-style Syntax”, give a history, and slyly makes fun of efforts to try to use anything other than S-expressions: “Algol-style syntax makes programs look less like the data structures used to represent them. In a culture where the ability to manipulate representations of programs is a central paradigm, a notation that distances the appearance of a program from the appearance of its representation as data is not likely to be warmly received... it is always easy for the novice to experiment with alternative notations. Therefore we expect future generations of Lisp programmers to continue to reinvent Algol-style for Lisp, over and over and over again, and we are equally confident that they will continue, after an initial period of infatuation, to reject it. (Perhaps this process should be regarded as a rite of passage for Lisp hackers.)” Paul Graham posts a longer version of this quote.

Perhaps. But it’s worth noting that thousands of languages have been invented over the years, but almost none decide to use S-expressions as their surface expressions. Smalltalk and Python took many ideas from Lisp (see Norvig’s “Python for Lisp programmers” for more)... but not its surface syntax. Languages like ML and Haskell have strong academic communities... and do not use S-expressions for their surface syntax either. Logo was devised to have Lisp’s power, but intentionally chose to not use its syntax. Dylan actually switched from S-expressions to a more traditional syntax, when they wanted “normal” people to use it. Yacas is intentionally Lisp-like underneath, but completely abandoned Lisp's surface syntax for normal user interaction (as have essentially all computer algebra systems, even though many have a Lisp underneath).

The fact that so many people are compelled to find a “fix” to Lisp syntax indicates to me that there is a problem... the fact that so many efforts fail suggests that it is a hard problem. Norvig’s Lisp retrospective found that even in 2002, when comparing Lisp to Python and Java, Lisp was far faster and more extensible, and had many powerful properties... but it was gaining few users (it was mostly stagnant). I argue that one key reason is the syntax; Lisp's syntax looks genuinely hostile to most “normal programmers”. Instead of Lisp-like languages gaining converts who adjust to prefix syntax, many people are ignoring or abandoning Lisp-like languages to use languages designed to be readable by humans. John Foderaro correctly said, “Lisp is a programmable programming language”, but people expect standard notation to be already available. Obviously it’s easy to create a some kinds of front-end syntax for Lisp-like systems, but they generally don’t catch on -- in part because they usually don’t provide the full power of the system underneath (e.g., you often can’t do backquoting with comma-lifting, or insert new parameters in a control structure, or handle macros well). I prefer to use the best tool for the job; when it's not Lisp, don't use Lisp. But some jobs are naturals for Lisp-based langauges, yet their poor readability seriously interferes with that. I think there are ways to retain the power of Lisp-based languages, while improving their readability and retaining their flexibility.

Oh, and let’s debunk one claim here. Steele and Gabriel claim one problem with Algol-like syntax is that there are “not enough symbols”. Yet even if that were true, by combining characters lots of symbols can be created without resorting to fancy characters. Indeed, <= is a common combined character for “less than or equal to”; nobody has trouble understanding that.

So below, I discuss a lot of past and current work to improve the syntax of Lisp-like languages. I then focus on a few especially promising areas that can make Lisp more readable while still accepting normal s-expression notation and having only minimal changes to it. These include:

  1. Indentation for meaning (like Python and Haskell); there is already work on this, particularly with I-Expressions.
  2. Name-prefixing approach, so f(x) and (f x) mean the same thing. Norvig earlier experimented with this.
  3. Infix approaches - how in the world can we rationally add infix support? There are lots of options, which I explore.
After looking at many of the possibilities, I then present “sweet-expressions”, which I believe is an approach that resolves these problems in a reasonable way. I then end with conclusions. In an appendix I document BitC a little further, since it started the whole thing for me, and in a separate paper on alternatives for s-expressions I document some of the alternatives I tried before I came up with s-expressions.

Goals

My goals were originally for the BitC project, but now I think there might be ways to work more widely for any language that uses Lisp-like notations. Here were my goals for a programming notation for Lisp-like systems:

  1. Readable. It should be more “readable” to the uninitiated, in particular, it should look like more traditional notation. For example, ideal format would support infix notation for operations that normally use infix (+, -, <=, etc.), support having function names before parentheses, and not require as many parentheses.
  2. Mappable. There needs to be an obvious mapping to and from current s-expression notation, which must work for both data and code. The key advantage of Lisp-like languages is that you can easily manipulate programs as data and vice-versa; that must remain for any modified syntax. Otherwise, there’s no point; there are lots of other very good programming languages that support infix and other nice notations.
  3. General/Standardizable. It should be very general and standardizable across all systems that accept s-expressions (Lisp's original syntax); nobody wants to relearn syntax everywhere. It should be useful in Common Lisp, Scheme, Emacs Lisp, BitC, ACL2, DSSSL, AutoLISP (built into AutoCAD), ISLISP (standardized by ISO) BRL, and so on. A thread about guile noted this very need.
  4. Easily implemented (relatively speaking). It must not require tens of thousands of lines of code to do. However, if it takes a little extra code to produce nice results, that is fine; better to do things well once. It need not be easily implementable via a few tweaks of an existing reader, though that’d be nice. Yes, rewriting a reader is a pain, but those only have to be written once per implementation. Note that even among implementation of a particular language there is often much variance, so the format needs to be simple enough to support many implementations.
  5. Quote well. In particular, for both forward quote (’) and backquote/quasiquote (‘), it must be easy to find the end of the quote, and for backquote/quasiquote, it would be very desirable to support initial comma (,) and friends to locally reverse the quoting (the whole point of quasiquoting).
  6. Backward-compatible. Ideally, it should be able to read regular s-expressions (at least normally-seen formats of them) as well as the extensions. I’m willing to give a little on this one where necessary. Note: Version 0.2 of sweet-expressions is not perfectly backward-compatible, but most typical s-expressions as people actually use them are also valid sweet-expressions.
  7. Work with macros. There will probably be some tweaks for indenting and infix notation (especially since infix reorders things!), but macro processing should continue to work in most cases.

My notion is that the underlying s-expression system would not change... instead, the system would support a reader that takes an extended notation and converts it into s-expressions. A printer could then redisplay s-expressions later in the traditional notation, or same kind of notation used to input it. It would also be nice if the notation was not confusing or led to likely errors.

I’ve written this document to put at least a few ideas down in writing. In the long term, I suspect what needs to happen is for there to be some sort of “neutral” forum where ideas can be discussed, and code shared. For any syntax to be widely used, there must be trustworthy, widely-usable implementations. I think at least one FLOSS implementation with a generous license that permits use by proprietary programs and Free-libre / open source software (FLOSS) programs. Note that the LGPL doesn’t work as intended with most Lisp implementations; Franz has created the Lisp LGPL (LLGPL) which is a specific clarification of the LGPL for use with Lisp. Lisp code is typically licensed under the LLGPL instead of the LGPL, since the LLGPL clarifies some otherwise sticky issues and ambiguities in the LGPL. Note that any FLOSS software should be GPL-compatible, since there is so much GPL’ed code. The implementations would need to be widely portable and modular, so that they can be widely depended on.

Scheme and Common Lisp aren’t really compatible at all. Translators like scm2cl (Scheme to Common Lisp translator) could help initially, certainly to get started (suggesting that it'd be best to start with Scheme, and then transition at least an initial version to Common Lisp). But in the end, specialized implementations well-tuned to different environment will be necessary for an improved syntax to really work.

There are lots of Lisp code repositories and other Lisp-related sites, including Common-Lisp.net, Lisp.org / Association of Lisp users (ALU), CLiki (Common Lisp Wiki), and schemers.org.

Past Work

Here are some past efforts to try to make s-expressions more readable.

Special syntax for various constructs

Most readability efforts focus on creating special syntax for every language constructs; these often end up unused (because they cannot keep modifying the grammar to match the underlying system), or end up creating a completely new language less suitable for self-analysis of program fragments.

“The Evolution of Lisp” lists many efforts to create "Algol-like notations for Lisp", which generally included infix notations and attempts to be more “readable”. Originally Lisp was supposed to be written in M-expressions, which were more traditional in format. Function calls were written as F[x;y...] instead of (F x y...), for example. One problem is that they kept adding new syntax, and never found a good “final” M-expression format, so it was never implemented... and it just receded into the never-finished future. Many other notations were developed, including those of LISP 2. These generally had if ... then ... else and other more traditional naming conventions.

For example, RLisp (used by Reduce, among others) is a Lisp with an infix notation. A yacc grammar for RLisp by A. C. Norman (2002) is available. Norman notes that the grammar is “ambiguous or delicate in several areas”:

  1. It has the standard “dangling else” problem.
  2. If R is a word tagged as RLIS, then R takes as its operands a whole bunch of things linked by commas. At present I have this grammar ambiguous on R1 a, b, c, R2 d, e, f; where R2 could (as far as the grammar is concerned) be being given one, two or three arguments. This problem arises if the operands of R may themselves end in an R. This is harded to avoid than I at first thought - one might well want conditionals in the are list of an R, but then R1 a, IF x THEN R2 b, c; comes and bites. I guess this is a “dangling comma” problem. The above two problems are resolved by the parser genarator favouring shift over reduce in the ambiguous cases.
  3. “IN”, “ON” are both keywords, as used in for each x in y do ... and words with the RLISTAT property. This is sordid! Similarly “END” has a dual use. This is coped with by making special provision in the grammar for these cases.

One trouble of many of these notations is that it becomes difficult to see where the end of an expression is (e.g., there might be no way to indicate the “end” of the if statement); this can creates ambiguities and makes it harder to easily match the infix notation and s-expression if you need to. And, if you want to describe arbitrary Lisp s-expressions, not having an end-marker means that you may not able to access some of the capabilities of the underlying s-expressions (though that may not be an issue for all uses).

Using a hierarchy of Domain Specific Languages in complex software systems design by V. S. Lugovsky discusses using Lisp to create domain-specific languages, including syntactic transformations.

The ACL2 language is Common Lisp-based, but it has a separate front-end for a more traditional interface including an infix processor; it is named IACL2. This is not often used, for several reasons:

  1. ACL2 is only defined in s-expression form, so you cannot read any of the documentation or examples without knowing s-expression form anyway.
  2. IACL2 is not as portable as ACL2, nor as supported.
  3. IACL2 does not support all the capabilities as ACL2 -- and who wants to use a tool that is known to not work when you most need it?
IACL2 has a “dangling else”, so it is not always trivial to see where something ends.

Logo is basically Lisp with an infix and more readable syntax. Instead of “(”...”)”, Logo uses “[”...”]”. Normally, all commands begin with the name of the function, just like Lisp, and Logo even has text names for math functions: sum, product, difference, and quotient. More interestingly, infix is also available... by using symbols, Logo automatically uses the infix forms instead. Logo knows the number of parameters for each function, so once the number of parameters is provided you can just provide another function call on the same line without any marking of the end of the call (see the “rt” call below). This is not as flexible as s-expressions, where you can always add another parameter. Here’s a sample Logo program:

to spiral :size
   if  :size > 30 [stop] ; a condition stop
   fd :size rt 15        ; many lines of action
   spiral :size *1.02    ; the tailend recursive call
end

Dylan is another Lisp with more conventional notation, including an infix format. Here’s a Lisp-to-Dylan translator, exploiting the Common Lisp pretty-printer. D-Expressions: Lisp Power, Dylan Style even shows it is possible to combine infix forms with Lisp’s abilities to manipulate programs. Dylan is a little wordy, in part because of namespace problems (types are in the same namespace, which is often not what you want), but it’s easy to read. It ends blocks with “end blockname”, e.g., “if (a) b else c end if”; a little long but clear. Here’s a simple example from the page Procedural Dylan:

define method distance (x1 :: <real>, y1 :: <real>, x2 :: <real>, y2 :: <real>)
  => distance :: <real>;
   sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1))
end method distance;

There is also CGOL (an Algol-like language that compiles into Common Lisp); this was developed originally by Vaughn Pratt, and written by a UC Berkeley graduate student, Tom Phelps. This program is a Common Lisp implementation of CGOL that is translated into Lisp before execution. I do not know what its license is. You can look more generally at the CMU archive, though beware of licensing problems.

Clike (see also Clike on Sourceforge) is a compiler which converts code written in a simple C-like language to Scheme. Pico is an educational Scheme with C-ish syntax.

TwinLisp is a "new way of programming in Common Lisp"; it creates a new reader with a more conventional syntax, which it then translates to Common Lisp and executes. It predefines precedence - which unfortunately means that it is tied to a particular interpretation of all operations. It also defines special control structures; thus, like most past efforts to create "readable Lisp", it ends up sacrificing generality of the resulting format (the reader is rigged to a specific set of operators with special syntax - so creating new operators is not simply a matter of defining the meaning of the operator).

Lisp resources gives pointers for where to go for more information. For extensions specifically focused on indentation, see the text below on indentation; for extensions specifically focused on infix notation, see the section on infix.

Skill

Skill from Cadence is a proprietary Lisp-based extension language. By examining that paper, this text, and this document, a hint of its syntax can be inferred. Skill supports name-prefixing, where FUNC(x y) is treated as meaning (FUNC x y). Skill also supports infix processing, automatically translating infix operators into an internal prefix format. In Skill, it appears that all of the prefix functions have alphabetic names, e.g., "(plus 3 4)" adds 3 to 4. This means that infix operators can be unambiguously detected and used without surrounding whitespace; thus "3+4" is automatically translated to "(plus 3 4)". Skill does not consider indentation significant.

I point out Skill specially because it does shows that it's possible to have a very general syntax that is easier to use.

Ioke

Ioke is a language developed by Ola Bini, a core JRuby developer. An interview with Ola Bini about Ioke mentions homoiconicity as important.

Arc and its influence: Syntax as Abbreviation

Paul Graham began devising a Lisp variant named Arc, and promulgated the term Syntax as abbreviation for his approach. Here’s what he said:

Instead, he wants Arc to have a syntax, because it makes programs shorter (a big win), but have additional notation as simply an abbreviation for the longer (still valid) syntax. He also would like to show structure by indentation instead of parentheses (which would become optional where no ambiguity results). His ideas sparked many others to re-investigate s-expression syntax, generally thinking in terms of “syntax as abbreviation”.

Arc has inspired many people to re-evaluate Lisp syntax, and see if there are ways it can be made more readable. Some comments on Arc made some interesting points:

Kragen Sitaker’s email “two-dimensional syntax for Lisp” summarizes some ideas from Paul Graham and Arc, but the author seems to add his own points too. He says:

A conversation about Arc and related ideas is posted on reddit.

BitC version 0.9 uses this “syntax as abbrevation” approach. You can indicate that “Value” has type “Type” using the quasi-function “the”, e.g., as (the Type Value), but the preferred form is “value : type”. It does not go far right now, though; + is still prefix (not infix).

Infpre is a Common Lisp infix utility (LGPL license).

Quack is yet another Lisp derivative. This adds two syntactic items. First, "postfix colon syntax" - colon with indentation as a shortcut for parentheses. Second, "infix colon syntax" - a colon without spaces around it are a shortcut for function/macro call, right-associative, so cdr:cdr:obj is (cdr (cdr obj))). He doesn't know if they're workable, and I don't find the examples convincing. Indentation has promise, but the colon ends up in a bad place and it's completely unnecessary. And while an infix operator like the infix colon is useful, making that the only infix operator is rather weak; where's 3 + 4?

Lisp sans parentheses hell discusses some similar ideas:

1. We use () to indicate grouping. ((((a)))) will mean the same thing as (a).
2. We use right associative infix operator notation -
        a * b + c  translates to (* a (+ b c)) in lisp.
3. We use indentation (tab-size = 4) to convey grouping -
        say hello          becomes  (say hello (to lisp))
                to lisp
4. We use square brackets notation to encode lists - for instance -
           [a,b,c] means     (a b c)
    The items in a list can also be separated by line breaks. So
     [  one
        two            means     (one two three)
        three ]
    Since indentation is significant,
    [
        one
            two                    means ((one two) three)
        three
    ]

There are a few special operators that help reduce the need for
parentheses and aid readability -
1.      a : b  translates to  (a . b) and due to right associativity,
        a : b : c translates to (a b . c)
2.     a :: b translates to (a b)
3.     a := b translates to (define a b)

Here are the cases where you won’t need parentheses -
1.     Your expression starts on its own line. In this case,
       the first term is the head term and the rest of the terms
       on the same line are tail terms.
2.    Your expression is the second argument to an infix operator.
      For example -
        mid :: quotient (low + high) 2
            is equivalent to
        mid :: (quotient (low + high) 2)

Exceptions -
1.    Use func() notation to indicate that you mean (func) (a function
call) and not just the value of func.

Limitations of current implementation -
1.    Back-quoting not supported.
2.    The use of comma separator outside of list expressions is not
defined and handled well enough.

Enjoy! ... and, for the record, I’ll stick to the parentheses,
thank you :)
This aids readability, but note the inability to handle all of Lisp's capabilities (like back-quoting), and that the creator won't use it (clearly a sign that it's not good enough).

So now, let’s turn to three areas that have great promise, and then see if we can combine them together.

Indentation (such as I-Expressions)

Lisp programs are normally presentated using indentation; the Common Lisp standard even includes a pretty printer! Experienced Lisp programmers eventually stop seeing the parentheses, and see the structure of a program instead as suggested by indentation. So why not use indentation to identify structure, since people do anyway, and eliminate many unnecessary parentheses?

Other languages, including Python, Haskell, Occam, Icon, use indentation to indicate structure, so this is a proven idea. Other recently-developed languages like Cobra (a variant of Python with strong compile-time typechecking) have decided to use indentation too, so clearly indentation-sensitive languages are considered useful by many. In praise of mandatory indentation notes that it can be helpful to have mandatory indentation.

Past work on indenting to represent s-expressions

Paul Graham (developer of Arc) is known to be an advocate of indentation for this purpose. As I noted above, Kragen Sitaker’s notes on Graham and Arc discusses how indentation can really help (in this notation, functions with no parameters need to be surrounded by parentheses, to distinguish them from atoms - “oh well” ). Graham's RTML is implemented using Lisp, but uses indentation instead of parentheses to define structure. RTML is a proprietary programming language used by Yahoo!’s Yahoo! Store and Yahoo! Site hosting products, though Yahoo are transitioning away from it. Paul Graham’s comments about the RTML language design and this introduction to RTML by Yahoo.

Darius Bacon's ”indent” file, includes his own implementation of a Python/Haskell-like syntax for Scheme using indentation in place of parentheses, and in that file he also includes Paul D. Fernhout's implementation of an indentation approach. Bacon's syntax for indenting uses colons in a way that is limiting (it interferes with other uses of the colon in various Lisp-like languages). I have not had a chance to examine Paul D. Fernhout's yet. (It also includes an I-expression implementation.) All of the files are released under the MIT/X license. (Darius Bacon also created mlish, an infix syntax front end listed earlier). Lispin discusses a way to get S-expressions with indentation.

I-expressions

I-expressions are an alternative method for presenting s-expressions (either program or data), using indentation. They are defined in SRFI ("surfie") 49; this has final status, making I-expressions a quasi-official part of Scheme. I-expressions have no special cases for semantic constructs of the language. SRFI 49 includes a sample implementation with an MIT-style license (based on the Sugar project).

Here’s an example, quoted from the Scheme Requests for Implementation (SFRI) number 49 (this example uses Scheme’s “define” function, not Common Lisp’s “defun” function):

 define
   fac x
   if
     = x 0
     1
     * x
       fac
         - x 1

I-expressions can include traditional s-expression representation too; here's an example (using Scheme):

  define (fac x)
    if (= x 0) 1
      * x
        fac (- x 1)
As you can probably guess, both of these are equivalent to this traditional s-expression representation:
 (define (fac x)
   (if (= x 0)
     1
     (* x (fac (- x 1)))))

The keyword “group” is used to begin a list of lists. (One minor drawback: it's more difficult to use a function named "group", so it's best to simply not create such a function.) You can drop back several indentation levels without dropping them all:

  let
    group
      foo
        + 1 2
       bar
        + 3 4
     + foo bar

The SFRI permits both tabs and space characters. I-expressions can also be quoted (including being quasi-quoted); see the proposal for information. Note this means that ’ followed by whitespace is now the beginning of a quote, because the initial whitespace is significant, and newlines become important.

In any indentation system for s-expressions, you need to be able to express s-expressions such as (f a), (g (h 1)), and (j (k)). In the first case, the first parameter is simple symbol, in the second case, the first parameter is a call with at least one parameter, and in the third case, the call to k has no parameters at all. I-expressions represent 0-parameter function calls by having the parameter surrounded with (...) (or having () as a name-ender). Let’s see what the indentation rules by themselves look like:

; Here is (f a):
  f
    a
; or
  f a

; Here is (g (h 1)):
  g
    h
      1
; or
  g
    h 1
; or
  g (h 1)

; Here is (j (k)), note the treatment of 0-parameter calls:
  j
    (k)
; or
  j (k)

The idea is simple, but it can be hard to reason about how it works -- so here is one way to think about it (if you're using this to write a program). On any line, the first term is the function to be called; the rest of the items on the line, and each indented line below, are its parameters. If a line has no other parameters (on the rest of the line or indented), then it's an atom; otherwise, it's treated as one expression and parentheses surround that line and its indents (if any). The “group” term creates an extra surrounding (...) in the resulting s-expression, so you can create lists of lists.

So how do you present computing a function and then calling it? Personally, I’d just switch to traditional s-expression notation in this case. But the I-expression representation actually isn’t bad; just use a “group” followed by how to compute the function to be called:

; ((getfunction x) a b) can be represented as:
   group
     getfunction
       x
     a
     b
; or:
   group (getfunction x)
     a
     b

The SFRI supplies Guile code (not fully portable Scheme code) to implement I-expressions (e.g., it uses define-public and not define). However, it should be easy to port.

In the I-expression sample implementation a “(” disables I-expression processing until its matching “)”, and this is an intended part of the definition of I-expressions. This has all sorts of wonderful side-effects:

  1. I-expressions parsing becomes very safe to use with existing code - pre-existing oddly-indented code will almost certainly start each expression with an opening parenthesis, disabling indentation processing.
  2. It supports dealing with text that is very close to running off the right-hand side; just use parentheses to disable indentation processing. Python does the same thing.
Disabling indentation this way doesn’t interfere with the use of parentheses to invoke function calls with 0 parameters - such calls don’t take that much room on a line!

There are a few special cases where the reader with I-expressions enabled parsed a file differently. When two top-level s-expressions follow each other, either (1) on the same line or (2) on consecutive lines with the second expression indented, the lines will be combined by an I-expression reader. The first seems unlikely to me; the second is a little unfortunate. What's worse, though, is that in a read-eval-print loop, users have to enter return twice to see their results, and it's easy to end up trying to evalute an empty list. We'll discuss those below in possible extensions to I-expressions.

The mailing list discussion of I-expressions includes a posting of an I-expression pretty-printer; the author says "I just couldn't get to sleep, so I wrote one". Be warned: I believe this pretty-printer has a bug, because it does not handle (f (g)) correctly. The first pretty-printer I wrote for I-expressions in Common Lisp had the same error, too, so this is a common mistake when doing this task - be warned. The expression (f (g)) should print like this:

  f
   (g)

I believe that users who use indenting expressions should only use space characters, and never tabs, to indent. The R5RS Scheme specification doesn't even officially permit tab characters (though implementations generally do). The real problem is that tabs produce a varying number of spaces on real systems; experience with Python suggests that using tabs can cause a lot of problems. Most of today’s text editors can be configured to turn tabs into spaces automatically.

Possible extensions to I-expressions

I-expressions and similar indentation approaches are a very nice approach. But there are several ways it could be extended or modified, now that I've had a chance to experiment with them.

Immediate execution if begin on left-hand-edge

If you use indented I-expressions interactively (in a read-eval-print loop), you have to enter an extra a blank line (an extra "enter") before the expression is evaluated. That's because the system needs to know when an expression has ended; the system believes you might indent the next line and add more expressions. This is fine for multi-line expressions, but it's very annoying for simple single-line expressions. If you do a lot of simple one-line commands this becomes very annoying. And this is likely; few people type in long, multi-line definitions into a read-eval loop, because they would stick those into a file instead!

So how can we make indenting easy to use at a command line? We could create a special command for use at the beginning of a line, such as "! ", that means "at the end of the line, you're done" . Or we could create a special indicator that means "execute now" . But these don't seem any easier than entering a blank line, and they are yet something else to remember.

One simple modification could be to add a special case, based on whether or not text was entered in column 1. There are several obvious variants:

  1. If leftmost edge has a "(" , and the entire expression is immediately followed by a newline, then the expression is complete. This has lots of problems; this doesn't handle evaluating atoms, such as running "x" to see what x contains. So I'll reject that.
  2. If text begins at the leftmost edge (no leading horizontal space), and the following term is completely read, followed immediately by a newline, then the expression is complete and the read function returns. Thus, if these are entered at the leftmost edge, they will immediately return:
      x
      (load "file")
      load("file")
      (3 + 4)
    
    However, the following will not (they will wait for a blank line) if entered on the leftmost edge:
      load "file"
      define x
      3 + 4
    
    The system will wait for a blank line in these cases, because the first term is followed by a space, not a newline... so it is waiting to see if you'll add more parameters by indenting them. The 'define x' is an example of good news (probably), and the 'load' command is an example of the bad news. Another problem is that the rule is a little complicated to explain. The good news is that 'define x' on the left edge will not trigger an immediate return. With these semantics, if you want to enter an indented expression, you'll need to indent the first line or include at least one parameter (expression) on the same line. If you indent at all on the command line, or enter one or more terms on one line, you must type an extra blank line to execute the sequence; this seems reasonable.
  3. If text begins at the leftmost edge (no leading horizontal space), and there are no unmatched parentheses, it is immediately completed at newline. In this approach, you must indent the first line to have an indented expression. This means that if these were entered on the left edge, they would be executed immediately:
      load "file"
      3 + 4
      define x
    

    The good news with this approach is that 'load "file"' on the left edge works as a user might expect - it will trigger an immediate return. The same is true for 3 + 4 ... suddenly an interactive Lisp interpreter makes a plausible calculator!

    The bad news is that 'define x' on the left edge followed by return will also trigger an immediate return of the read expression, which is almost certainly not what was intended. That is not such a big deal in an interactive command loop, since the user could try again. But more seriously, this would be a likely thing to do in a file, and it could cause very hard-to-detect bugs.

    How can we counter the risks of not noting the change in semantics for text on the left edge? It might be possible to have a special loop for file-reading that detected the case of "unindented expression followed immediately by indented expression", and trigger a warning or error. Another variant would be to have two slightly different modes for read (or two different starting functions): one intended for interactive use, where unindented lines are ended on newline where possible, and one intended for file reading (where there is no such distinction). This does complicate reasoning and implementation, and it's quite normal to cut-and-paste into an interactive session.

    One promising approach to countering the problem is to configure other tools to help detect such problems. Text editors, for example, could be colorize such constructs. Thus, if a left edge that would run immediately is followed by an indented line, it would be labelled with red lines. And special tools could be designed to detect this as well.

    One good point is that the rule is simple: "if you want to indent, you must start by indenting the first line". When you indent, you enter a blank line or another expression at the same indentation level. Frankly, that's much simpler than the rules above.

  4. If text begins at the leftmost edge (no leading horizontal space), and there is exactly one term on the line or it is a legal infix expression, then the expression is complete and the read function returns. Thus, if these are entered at the leftmost edge, they will immediately return:
      x
      (load "file")
      load("file")
      (3 + 4)
      3 + 4
      3 + (4 * 2)
    
    And these will not:
      define x
      load "file"
      3 + 4 * 2
      3 + 4 +
    

For sweet-expressions, for a long time I keep going back and forth between option 2 and option 3. Option 3 presents the risk of subtle bugs, and that is of great concern. However option 3 is a joy to work with interactively. I then said, "This is a hard choice to make, so I plan to experiment. Expressions like load("filename") and (3 + 4) seem particularly easy to explain, and do not have the drawbacks of this alternative, so they suggest using alternative 2." It was only in November 2007 that I finally realized that what I wanted was option 4.

Note that when activated these also partly eliminate one of the minor incompatibilities of I-expressions with traditional s-expressions. If there are two separate outermost s-expressions on consecutive lines, with the second expression indented from the first, as long as the first expression was not indented (a likely case) the lines will no longer be combined by the I-expression reader. Of course, this could be misleading, because the second line might appear to be combined with the first; this is a trade-off with no perfect answer.

Ignoring excess blank lines

Excess blank lines get interpreted oddly, as '(), which the underlying system may then try to run (Scheme in particular complains about them). It can also cause confusion if interpretation gets "out of sync".

A plausible modification would be to ignore any additional blank lines before a (next) expression. In short, a blank line shouldn't really return '() to the system; if they want that, users should need to enter that directly. If there blank lines at the end of the file, the system should probably return the end-of-file marker (if it supports that) once it has consumed the final blank lines trying to read the next expression.

An alternative would be to ignore any additional blank lines after an expression, but you don't want to do that. Then you need to treat the beginning of a file specially (assuming that there was a beginning of file).

Also, should horizontal spaces followed by newline be treated the same as a blank line? Users cannot easily see the difference, especially on typical printouts. That isn't clear, so I leave that as a question for now.

Indentation, a wrap-up

I-expressions and their extensions are a real improvement. But they are not enough; let's keep looking.

Name-prefixing: Parentheses can be prefixed by a function name

Lisp’s standard notation is different from “normal” notation in that the parentheses precede the function name, rather than follow it. Jorgen ‘forcer’ Schaefer argues that this is a more serious problem than the lack of infix notation; on July 2000 he said “I think most people would like Scheme a lot better if they could say lambda (expression) ... instead of (lambda (expression) ...”

Peter Norvig had some interesting ideas, as noted earlier. Let’s look at one of his rules, the rule that says “if a function name ends with an open parentheses, move it inside the list (when converting to an s-expression)”. This means that “(fact x)” and “fact(x)” will mean the same thing.

Obviously, this is trivial to parse. We don’t lose any power, because this is completely optional -- we only use it when we want to, and we can switch back to the traditional s-expression notation if we want to. It’s trivially quoted.. if you quote a symbol followed by “(”, just keep going until its matching “)” -- essentially the same rule as before! Technically, this is a change from some official Lisp s-expression notations and implementations. For example, entering “a(b)” into CLisp (a Common Lisp implementation) is the same as “a (b)” -- its parser tries to return the value of a, followed by running the function b. But it’s not clear it’s a big change in practice; commonly accepted style always separates parameters (including the first function call name) with whitespace. So normally, what follows a function call’s name is whitespace or “)”, and this is enforced by pretty-printers. Thus, many large existing Lisp programs could go through this kind of parsing without resulting in a change in meaning!

Does this help? Let’s rewrite the CL factorial example, but not make the infix operations do this:

 defun(factorial (n)
   if((<= n 1)
       1
       (* n factorial((- n 1)))))

It looks slightly more familiar, but not that much. Let’s try again, but move the infix ops out too; the result is actually not bad:

 defun(factorial (n)
   if(<=(n 1)
       1
       *(n factorial(-(n 1)))))

If we really wanted it to look conventional, we could use wordy names instead of symbols that are traditionally infix; that isn’t too horrible for + (use “sum”), but that is rather wordy for others (I don’t like it):

 defun(factorial (n)
   if(lessequal(n 1)
       1
       product(n factorial(subtract(n 1)))))

Note that as far as I can tell, you do not need any whitespace after the opening parentheses, because atoms (including function names) cannot have parentheses in their name in s-expressions. A variant of this idea would be to require whitespace after the opening paren if name-prefixing is used, but I see no need for that.

Skill from Cadence, a proprietary Lisp-based extension language, also supports name-prefixing.

Mathematica does something similar. As noted in How Purely Nested Notation Limits The Language's Utility, its FullForm notation can be transformed into Lisp using simple rules, starting with transforming f[a,b] into (f a b). But it isn't designed to be backward-compatible with existing Lisp notations; its use of "," to separate arguments would cause confusion since "," is also used in macro handling.

Name-prefixing can be combined with indentation (e.g., I-expressions). Let’s presume that if you indent further and begin the expression with a parentheses, or by a function name, no space, and a parenthesis, that parenthesis and its mate are silently ignored (they don’t add yet another subexpression in the s-expression). Furthermore, let’s presume that if you have a name-ender format (function name followed immediately by open paren), it does not switch to some “no more I-expression” mode. Then you could do this (using English names for operators):

 defun factorial (n)
   if lessequal(n 1)
       1
       product n factorial(subtract(n 1))
Or this (using traditional symbols for operators):
 defun factorial (n)
   if <=(n 1)
       1
       *(n factorial(-(n 1)))

It turns out that name-prefixing works well with indentation. Here are some examples:

; Here is (g (h 1)) <=> g(h(1))
  g
    h(1)
; or
  g h(1)

; Here is (g (h)) <=> g(h())
  g
    h()
; or
  g h()

; Here is (g (a) (b 1) c): <=> g(a() b(1) c)
  g
    a()
    b(1)
    c
; or
  g a()
    b(1)
    c

; Here is (g a (b 1) c()) <=> g(a b(1) c())
  g
    a
    b(1)
    c()
; or
  g a b(1)
    c()

This does introduce the risk of someone inserting a space between the function name and the opening “(”. But whitespace is already significant as a parameter separator, so this is consistent with how the system works anyway... this is not really a change at all.

I think this is slightly better for untrained eyes. It’s hard to argue that this is a major improvement, at least by itself, but the more I look at it the more I like it. What’s more, the pain is tiny, and that’s a good thing. This is a lower pain, lower gain approach, and it can be combined with SWP-expressions, which I’ll describe in a moment.

The article Improving lisp syntax is harder than it looks discusses name-prefixing, but I think it makes a number of errors. It first claims that this would be hard to integrate with macros - this actually isn't true if it's built into the reader (and the macros aren't doing reading themselves). If the reader transforms a(x) into (a x), then when the macro has a chance to run all it sees is (a x) - exactly what it was expecting to see. He also says that "Another disadvantage to this change of syntax is that it makes functional programming much more odd looking. Lets say you have a list containing functions and you want to call the first one. In Scheme you write ((car lst) params) and in Common Lisp (funcall (car lst) params). However in our new syntax it looks like: car(lst)(params) and funcall(car(lst) (params)). Neither of these is very elegant, and it only gets worse if that call in turn returns a function, which would look like: car(lst)(params)(params2) and funcall(funcall(car(lst) (params)) (params2))." But I find this remarkably elegant, and better than the traditional notation - to do functional programming, just cuddle up the parentheses. It's much easier to understand sequential parentheses compared to a deeply nested list.

Infix

Again, let’s continue the thought that we want to support a syntax that is maximally Lisp-like, generally accepting existing expressions. The biggest issue in making Lisp s-expressions easier to read are in whether or not to provide infix support, and if so, how.

The first question is, should you support infix at all? Because if we do, and we presume that the underlying s-expressions always have the operation first, this means that we are changing the external presentation of (some) data. Note that we are only changing the presentation for entering programs and program-like data, and possibly changing how they are externally displayed; the traditional s-expression would still be used internally. Thus, if “{3 + 4}” is interpreted as an infix expression, it will quietly be transformed to the s-expression “(+ 3 4)”, so the “car” (head) of “{3 + 4}” would be “+”. The ll1-discuss mailing list includes discussions on the issues of infix.

The reality is that nearly everyone prefers infix notation; people will specifically avoid Lisp-based systems solely because they lack infix support built-in. Even Paul Graham, a well-known Lisp advocate, admits that "Sometimes infix syntax is easier to read. This is especially true for math expressions. I've used Lisp my whole programming life and I still don't find prefix math expressions natural." Paul Prescod remarked, “[Regarding] infix versus prefix... I have more faith that you could convince the world to use esperanto than prefix notation.” Nearly all developers prefer to read infix for many operations. I believe Lisp-based systems have often been specifically ignored even where they were generally the best tool for the job, solely because there was no built-in support for infix operations. After all, if language creators can’t be bothered to support the standard notation for mathematical operations, then clearly it isn’t very powerful (as far as they are concerned). So let’s see some ways we can support infix, yet with minimal changes to s-expression notation.

If we are willing to support infix, there are many options. The big issues are:

  1. How do you determine when to use infix notation? Is this manually specified (if so, how, and how deep does it go?) - or is it automatic (if so, how, and how do you override it?).
  2. What are the legal infix operators?
  3. What are the semantics? (Do you do something trivial, like swap the first and second parameters, allow many parameters for the same operator, or go all the way to full infix with precedence?)
For all of these, you also need to determine how to display them too.

There are some code samples that might be used as a starting point for implementations; after going over the issues, I identify some of them.

Nathan Baum (of Bournemouth, United Kingdom) posted on 4/14/2006 8:19:16 PM the idea of using [...] to surround infix notation (I’m sure he’s neither the first nor last). He said, “Lisp actually has a few syntactic shortcuts of its own. Essentially anything that isn’t part of a name is a character which triggers a specialised reader function. Strings, for example, don’t need to be built in to the Lisp core: a reader can be associated with the double quote character, and that’ll return a string object. This also means you can redefine it yourself to get shell/Perl/Ruby-style interpolated strings, for example. Even lists themselves are read using a reader function triggered by [parentheses]. One simple shortcut is to have ‘normal’ Lisp expressions delimited by [parentheses], and infix expressions delimited by brackets...

  (print (+ x (- (/ y z) a)))
  ; or
  (print [x + [[y / x] - a]] ”.

How do you determine when to use infix notation?

There are several major options:

Of course, if the notation permits mixed infix and prefix notations, you can easily have a “manual” notation that just means “turn on autodetection all the way down from here” and end up having a combination of properties. Alan Manuel K. Gloria’s “infix notation macro” is a good example of a mixture - in this implementation, there must be an initial marker (nfx ...), but it then descends all the way down to all subexpressions. As a result, you could use the marker at a very high level, and have a result similar to automatic detection. A similar result could be had by having a function like “(enable-infix)” or “(sugar)” which you could insert at the beginning of a file as a top-level command (this could also enable indenting and the name-prefixing notation), which would then enable the infix operations (even at the topmost level) until specifically disabled. Using such a command would mean that everything below, for the rest of the file, would use automatic infix detection... but since there would be a specific a command to enable it first, it would be compatible with existing code (which didn’t enable infix). A command-line flag, or invoking a command with a different name (possibly via a different file extension), could enable automatic detection of infix notation.

What are the legal infix operators?

If detection of infix operators is automatic, now you need to determine the legal pattern of infix operators, since they will trigger the infix interpretation. Even if it’s manually triggered, if you permit arbitrary expressions, or want error-checking of the operators (”you said this was infix but there were none”) you’ll need to detect infix operators.

There are several options:

If we use a pattern, what is the pattern? Patterns are often expressed as regular expressions (REs), and an obvious RE for this purpose is: [+-\*/<>=]+. Basically, it has to be a set of one or more special characters. (One alternative, which might be nice, would be that it has to start with a special character, but after that, other characters are allowed; be wary that +h1 is an infix operator but +1 is not). This is enough to cover the 4 arithmetic expressions (+, -, *, and /), and all the usual comparisons (<, <=, =, >=, >). It’s also enough to cover other operations that people like to have as infix, such as “**” (exponentiate), “->” (implies), “<->” (if and only if), and so on. You can probably add “&”, which is enough to represent common representations of “and”. But there are some issues, especially with the characters not noted here:

If the printing routines normally convert to infix, then you need to not have a pattern that is likely to be engaged unintentionally... which are good reasons to not have the terms “and” and “or” be infix operators, and certainly a good reason to stay away from anything other than an isolated “:”. You can probably limit the regular expression length, say up to 3-6 characters, to help limit unintentional conversions as well. I would not include “and” and “or”, instead add & and | to the infix operators (and say that | has to be escaped), and limit it so it must be 1 through 4 characters in length. I would accept “:” alone (so it can be used for infix type declarations), but nothing else with colons.

These considerations suggest that “=>” be quietly prevented from being an infix opreator, and then use this regular expression for infix operators;

    [+-\*/<>=&\|\p{Sm}]{1-4}|\:|\|\|
where “\|” means match the “|” character, and \p{Sm} means “match mathematical symbols, Unicode range U+2200 through U+22FF”.

As a related matter, you need to decide how to separate infix operators from non-infix operators. In many other language families, “x-y” would be parsed as x minus y. Unfortunately, in many Lisp-like languages, names often include typical infix operator characters such as “-”. Both Common Lisp and Scheme would be unusable without the - symbol, because they have many functions with “-” in the name (in addition, Scheme has a convention where “->” embedded in the name indicates a conversion). The simplest and most obvious approach is to require that any infix operator be surrounded by whitespace; this is very easy for humans to remember, and is consistent with normal s-expression syntax anyway. You could require escaping such characters as part of a name, e.g. |simple-string-p| or simple\-string\-p, but this is both ugly and unnecessarily incompatible with common practice. If you’re devising a new language, you could simply forbid such characters in regular symbol names, and then you could automatically detect all the infix operators without requiring that they be surrounded by whitespace. However, for general-purpose parsing, sweet-expressions will require that infix operators be surrounded by whitespace (in cases where function names don't have such symbols, this requirement could be relaxed).

What are the semantics of the infix operations?

Once we detect that we are using infix notation, what are the semantics of the “infix” notation?

  1. Swap first two parameters. An extremely easy-to-implement and general approach is to just swap the first two parameters. A good idea for this case would be to require exactly 3 parameters -- that way, we won’t accidentally screw things up, because when limited to exactly 3 parameters, there’s no difference between swapping and the normal interpretation of infix. This is trivial to implement, and yet it’s enough to implement basic infix operators in a way that looks nicer. This means that the presented expression is very similar to the actual s-expression, which has its advantages. If you want to add a fancier infix format later, this approach is a reasonable stepping-stone towards that.
  2. Allow chaining (duplication) of identical operator. This extends the previous option, but you can have an odd number of parameters, and the even ones much match. Thus, (3 + 4 + 5 + 6) becomes (+ 3 4 5 6). Trivial to implement (just swap the first two parameters and ensure the rest match). This approach makes it possible to fully use the capabilities of underlying functions that allow multiple parameters, and that's a good thing. Interestingly enough, Common Lisp’s definition of comparison operators would make expressions like (x <= y < z) work as it does in mathematics!

    Note that this does not support precedence, so you have to surround different operators with parenthesis or indents, e.g., (3 + (5 * 6)). In one view, that's a disadvantage - you still have to indicate some information that would be automatic in other languages. On the other hand, this makes when there are new lists explicit, and that has some pretty big advantages. It also sidesteps the problems of defining precedence (either fixed or a way to add them). Those are pretty big advantages.

    There are many options for determining when things aren't correct, and then deciding if that is an error or merely an indicator that infix was not intended. Note that this only supports an odd number of parameters, and obviously there shouldn't be an infix operator presented as the first parameter in the text. Doing otherwise could be an error, or could be an indicator that it should not be treated as infix. For the moment, I'll plan to consider both as an indicator that infix was not intended. This only permits all the operators to be identical - if they're different, but all infix operators, then the writer probably intended some sort of precedence (and so an error should be reported). Otherwise, infix was probably not intended, so again, I'd probably interpret that as "not infix".

  3. Full infix of binary operations, with precedence rules. Allow infix and add precedence rules (* before +, etc.). This would mean that “(3 + 4 * 5 ** 2)” would quietly transform into the s-expression “(+ 3 (* 4 (** 5 2)))”. In cases where it is important that you be able to clearly see the mapping between the surface presentation and the underlying s-expression, just don’t use precedence - instead, parenthesize (so those who do not need to see that representation don’t have to). This is trivial to implement in Lisp... the problem is what to do about the precedence of unknown operators. You’d need to either not support precedence of them, have a default for unknown operators, or have a way to set precedence values (and make sure the settings are read before processing the code). Setting precedence is easy enough, but its disadvantage is that it invites trouble - you need to make sure that the rules are set before the parameters are swapped, and mistakes could result in hard to find errors. Left association could probably be safely assumed for more than one instance of the same unknown operator. Alternatives are giving unknown operators a specific precedence relative to other operators... or simply requiring that other operator’s precedence be made clear with parentheses (or their equivalent). Predefining precedence for common functions, and requiring statements about the others, isn’t a bad thing; people usually don’t want to have varying precedence tables (it’s confusing), and they also don’t want to memorize large tables. Again, I suggest making duplicate operators (3 + 4 + 5) turn into one s-expression (+ 3 4 5); have the user override if they want something different.

    Here, we don’t allow prefix/suffix operators, such as “- x”, or suffix operators, like “x !”. There is a trade-off here; prefix/suffix don’t work well with automatic detection or trying to also read traditional s-expressions correctly - it would be all too easy to misread such expressions. Using name-prefix notation for them, such as -(x), is a very reasonable alternative.

  4. Full infix of binary and non-binary operations, with precedence rules. This is like the above, only now we also allow prefix operators, such as as “- x”, or suffix operators, like “x !”. Allowing them adds a little extra flexibility, and might make sense if the specific expression is always manually marked as being infix. But it has many problems, as noted earlier.

Precedence rules

If you have precedence rules, what are they, and are they controllable? It's not hard to add a command that sets precedence rules, but there's always the problem that you risk not getting that called at the "right time" (before reading). Worse, if there are global precedence rules that can be changed, setting them in one environment may manipulate another (unintended) environment, so now we have to have a parameter for (read), and that may be hard to pass down. This is particularly a problem for Lisp-like systems, where you may have multiple levels and meta-levels that you might not want to interact. Another big challenge is that if different developers set precedences, it's harder to combine their code. Different operations may have different meanings in different circumstances, so the lack of control over precedence can be a problem too.

A chain of the same operation can normally be combined into a single operation with all those parameters, e.g., (a + b + c) should become (+ a b c). This adds capability, without any problems or loss.

Everyone agrees that * and / should have the same precedence, both of which have greater precedence than the equal-precedence binary + and -, and that all of these are left-to-right. So in theory that, at least, could be implemented. Originally I thought that it may be best to just implement those universal rules, and normally use parentheses most everywhere else for controlling infix precedence (or at least discourage lots of precedence-setting). Chaining is fine, but as discussed below, precedence causes a lot of problems in many use cases, so in the end I decided against them for sweet-expressions.

Printing expressions that might be infix-able

So far, I’ve primarily discussed how to read in expressions, but expressions need to be printed too. A likely answer is “just print s-expressions as usual” by default. That means that the output would be “(+ 3 4)”, even if the input was “(3 + 4)”. For debugging code, this is probably a Good Thing. There might be value in presenting expressions back with some infix operations “re-inserted”, at least as some sort of “pretty-printing” function or option. It might be wise to identify which lists are being interpreted as infix in these cases (e.g., by surrounding them with {...}). I think there is no reason to try to redo precedence when re-displaying; showing how the lists are actually stored is very clear and eliminates questions about precedence. Thus, if “(3 + 4 * 5)” is input, the resulting s-expression would be “(+ 3 (* 4 5))”, and the infix-printer might show this as “{3 + {4 * 5}}”.

Combining infix with Name-prefixing

There is a subtlety when combining name-prefixing with automatic infix: I think people expect an infix expression to be considered a single expression, even though at first it appears to be a multi-parameter list.

For example, most people would accept this syntax for calling function f with two parameters, x and y:

  f(x y)

However, if infix is normally done automatically, they would expect this to be computed with f given one parameter, not three:

  f(x + y)

Note that this is actually a very nice thing; it means that in many cases, name-prefixing causes the number of parentheses that need writing to reduce. So the original Lisp:

  (f (+ x y))
becomes:
  f(x + y)

Infix control

There is likely to be a need to control infix processing. These should be a standard way of saying “disable infix” and “enable infix”. You'd like to be able to enable or disable infix at only one level, in particular, so that you could leave infix on as the default and yet disable it in one particular expression. It would be especially useful to have an “enable infix for this one level of expression” operation that must receive an expression that “looks like an infix expression” or it is in error.

The following text are some very early ideas on the topic; I include them here, but it's all very early, and at least some is likely to be very wrong. Still, you may find useful thoughts here.

One implementation issue is that macros work from outside in, and most macros will not expect special additional macros. In particular, it might be very useful to cuddle an infix operator with a macro that says do not use infix, like this (where "as" means "as-is"):

  (define as(+) ...)
However, the obvious solution will fail; typical implentations of "define" do not expect to be followed by another macro in the definition! This means that any such function reference will need to be removed at read time or very early on during eval time (e.g., by an outside cuddling nfx(...)), because other macros won't know what to do with this.

There should be a standard way of inserting at the beginning of a file or read-eval-print loop “please switch to and from special processing, and for setting some of its options (e.g., infix control). For debugging, you could just print s-expressions as now. However, a standard way to request printing these expressions that shows infix as such, but marking infix operators and non-infix, would be a good idea and each list which could be misinterpreted as an infix expression but is not has another marker. This implies a need for a marker that says “the following list is an infix expression and it’s an error if it is not”. If adding specific new infix operations is supported (such as “and” and “or”), a standard name and syntax for this add operation would be a good idea too. Ideally, the external interfaces for these operations be pseudo-standard across all Lisp-like notations. Bridging the gap between Common Lisp and Scheme might be challenging for a standard interface (in some cases string parameters might be needed; string syntax is now universal, but other syntax is not - e.g., the keyword syntax of Common Lisp is incompatible with the proposed keyword syntax for Scheme).

The notation for controlling all this is to be determined. Here’s a start. Originally, I looked at implementing another # character option. If you want to do that, you might look for sharpsign options that no one one currently uses, examining sources such as the Common Lisp Hyperspec (particularly the sharpsign section), the R5RS Scheme specification, the GNU Emacs Lisp Reference Manual, as well as less-common systems like NewLisp. Using #I for infix is tempting, but Scheme uses #i for inexact; it’s not clear if #I would get interpreted the same way on some systems, but it’s not worth risking and might be confusing to developers anyway. I especially considered using #/ to mean "infix". The #@ combination is used by Emacs Lisp, so we should avoid that. However, Scheme only supports one-character lookahead; if you want to be able to call the "ordinary" Scheme reader for other #-beginning constructs, there isn't an easy way to do that - because to read the character after #, you have to consume the #, making that more difficult to do. So I don't think beginning with # is a good idea.

I’ve looked for some possible notation, under the presumption that this is a future standard format common to Common Lisp and Scheme (so we need something unallocated by either).

A simple approach might this naming convention, with function-call-like macros:

Each of the above could either accept a single list (in which case it's the list that is being referred to), otherwise it is the expression itself as the rest of its parameters that is being referred to. Since any infix expression must have at least three parameters (2 if unary operators are used), this isn't ambiguous.

We could interpret unfx(...) around a second parameter to quietly disable infix processing. Alternatively, we could have a different function/macro name such as "as(...)". This would Treat the inside (function) as-is, so it won't be considered an infix operation (use this if an infix operator is the second parameter, but you don't want the expression to be considered infix). Here's an example:

  defun as(+) (left right) ...

One interesting challenge in doing this is recursion in the read function. I decided to implement much of the processing in the reader, and to have the processing go "as I go" in the reader, rather than have the reader automatically add nfx() in front of expressions and then have eval() fix it up by calling a macro. Since the reader is called on ordinary data in s-expression format, it is very inconvenient to automatically have the reader add nfx(), etc. calls! Yet if there is no outermost call to "fix up" the expression, then the outermost parameters would be defun, define, or other terms that would not handle these macro calls correctly. And you dare not remove things piecemeal - read can be called recursively, so if you call read, remove things partly, but then later all "fix up" all the way down again, you can "unfix" things. This is one reason to have a separate "as" function, so that these different forms can be differentiated.

Note that “(s (b))” is interpreted as you might expect during sweet-expression processing, but (s(b)) is interpreted as the s-expression ((s b)), not (s (b)). In practice this is not a problem - nobody likes the format (s(b)) for s-expressions, and not all s-expression processors would even accept them. More also needs to be done to describe their interaction with macros; this is a critical area and hard to get right, but since there is relatively little that is changing, this may not be so bad.

A simple program could transform an existing program to sweet-expression format, including its comments, and with any necessary markers to control infix interpretation. I expect that the need for controlling infix interpretation will be exceedingly rare, so the results should look very nice.

Infix implementations

Many, many people have implemented infix processors. Here are some, not including the many larger notational systems (like IACL2) that have an infix notation built into them:

  1. A simple Lisp program that converts infix to s-expression format is available as part of Peter Norvig’s book, “Artificial Intelligence: A Modern Approach.” This code’s license appears to be open source software. Besides the usual disclaimers, it says, “3. The origin of this software must not be misrepresented, either by explicit claim or by omission” and “4. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Altered versions may be distributed in packages under other licenses (such as the GNU license).”
  2. Alan Manuel K. Gloria’s “infix notation macro” is extremely promising. His approach is to create a macro “nfx”, and then put spaces around everything. From then on, everything inside transitively can use either infix or prefix notation. It detects which lists are infix by examining the second parameter, and determining if it is in a list of infix operators. At this time the license intends to allow arbitrary use, though not in a clear legal manner.
  3. Mark Kantrowitz wrote an infix reader macro that is typical of many such packages (dated Jan 18, 1995); his runs on Common Lisp. It is often mentioned, though its restrictive license makes it useless to many. It “allows the user to type arithmetic expressions in the traditional way (e.g., 1+2) when writing Lisp programs instead of using the normal Lisp syntax (e.g., (+ 1 2)). It is not intended to be a full replacement for the normal Lisp syntax. If you want a more complete alternate syntax for Lisp, get a copy Apple’s MLisp or Pratt’s CGOL. Although similar in concept to the Symbolics infix reader (#<DIAMOND>), no real effort has been made to ensure compatibility beyond coverage of at least the same set of basic arithmetic operators. There are several differences in the syntax beyond just the choice of #I as the macro character. (Our syntax is a little bit more C-like than the Symbolics macro in addition to some more subtle differences.) It is not open source software; derivatives are allowed, but no fees or compensation may be charged for “use, copies, distribution or access to this software”, which probably forbids Linux distributors from distributing it (it is in Debian’s “non-free” section for this reason). Here it is. Norvig extended it. With this, you can write #I(3+4). It supports all sorts of precedence rules, etc. However, its restrictive license makes it useless to most people; it cannot be distributed by many FLOSS vendors, it cannot be used in commercial settings, and since there is no hope of commercial support for it, its code is unusable to most people. However, it is a good idea that with a fresh re-implementation might be practical for some (after all, this package is a re-implementation itself).
  4. ”mlish” is an infix syntax frontend for Scheme (from the same person who made “indent” ).
  5. The Gambit Scheme implementation reader includes the SIX (Scheme Infix eXtension). “The backslash character is a delimiter that marks the beginning of a single datum expressed in the infix syntax ... the backslash character escapes the prefix syntax temporarily to use the infix syntax. For example a three element list could be written as ‘(X \Y Z)’, the elements X and Z are expressed using the normal prefix syntax and Y is expressed using the infix syntax. When the reader encounters an infix datum, it constructs a syntax tree for that particular datum. Each node of this tree is represented with a list whose first element is a symbol indicating the type of node. For example, ‘(six.identifier abc)’ is the representation of the infix identifier ‘abc’ and ‘(six.index (six.identifier abc) (six.identifier i))’ is the representation of the infix datum ‘abc[i];’.
  6. guile-arith is a module to demonstrate how to build a parser in the Guilde version of Scheme, but is very immature. Its web page says “this is not meant to be used.” It builds yacc/lex parsers into guile C modules. It has a demo, supporting notations like this:
     (define (square x) #[ x ^ 2 ])
     (square 2) ==> 4
     (define (cube x) #[ square(x) * x ])
     (cube 2) ==> 8
    
  7. Note that SIOD (Scheme in One Defun) also included an infix module - again, not something that is standard across implementations.

Putting it together: Sweet-expressions

Obviously, there are lots of different ways to try to make Lisp-like languages easier to read. So let’s combine them. In particular, I’ve identified a particular set of options that I think are tasty.

Sweet-expressions: A basic definition

I have developed a combination I call “sweet-expressions” that I hope will be a good solution. In this notation, normally-styled s-expressions will typically work “as normal”, but various extensions are added that make it possible to create easier-to-read programs:

  1. We’ll support indentations using augmented I-expressions. An indented parameter with no following parameters (on the same line or via indentation) is considered a symbol, so a function with no parameters must be invoked by surrounding it with “(...)” or terminating its name with “()”. We'll augment it by ignoring blank lines at the beginning of an expression, and by immediately executing a term that begins at the left edge and is immediately followed by newline (to make interactive use pleasant).
  2. By default, any expression is an infix expression if the first term (the function name) is not an infix operator, its second term is an infix operator, and it has at least three terms; otherwise it is a traditional prefix expression. Infix expressions must have an odd number of terms, and the even parameters must be the identical infix operators. So, (2 + 3 + 4) is okay, but (2 + 3 * 4) is not; you must use (2 + (3 * 4)). That way, you only need to look at the first two terms of any expression to see if something is infix or not. If all the (odd-numbered) infix operators are the same, they are chained into a single infix operator, so (2 + 3 + 4) becomes (+ 2 3 4). Note that in a name-prefix form, it's the parameters that are considered, so f(2 + 3) becomes (f (+ 2 3)). Lists where the even-numbered parameters are different are considered non-infix (at least if any of them aren't operators); that will mean that more "traditional" Lisp s-expressions will be read without meaning change, even if they have an operator in the second position. You must separate each infix operator with whitespace on both sides.

    Previously I planned to use built-in precedence rules if the infix operators differ, but after some experience I'm currently thinking that it'd be better to not support precedence at all. Precedence is nearly universal among programming languages; they're very useful, and only a few infix-supporting languages (such as Smalltalk) lack them. But in the typical use cases of a Lisp-like language, precedence has some significant downsides that are irrelevant in other languages. First, let's talk about a big advantage to not supporting precedence in sweet-expressions: It makes the creation of every new list obvious in the text. That's very valuable in a list processing language; the key advantage of list processing languages is that you can process programs like data, and data like programs, in a very fluid way, so having clear markers of new lists like parentheses and indentation is very valuable.

    What's surprising is that precedence turns out to have a host of problems in specific context, leading me to believe that it's a bad idea for this usage. Precedence is very useful in general (and it could be added for special cases), but adding precedence rules to a general-purpose list expression processor creates a slippery slope of complexity. You can trivially create commands that create new infix operators, and express their precedence and direction - no problem. But you're using a Lisp-like language, the use case tends to be different, with objects shifting between being data and being code. Often you're working at multiple levels and meta-levels of input - and it's not likely that you'll want their precedences to be the same. Yet trying to handle that turns out to require very complex management of the reader routine. Also, combining code from different sources can be a nightmare if they have different precedence values. In short, having precedences vary by meta-level creates complexities you just don't want to deal with. That suggests moving to a completely static, pre-canned set of static levels is required. Okay, but someone has to predefine that set. That turns out to be really hard; the meanings of various infix operators are not cast in stone, so it's difficult to preset the precedence levels meaningfully (since you don't know what the operators mean). The one obvious case is that * and / have higher precedence than binary + and - basically everywhere. That particular rule could be built-in, but if that's all you build in, it's not clear that it's worth the trouble. Especially since the idea that completely forbidding mixing the operators means that we make the mapping between lists and the textual representation much clearer. It's easier to add precedence later, if that turns out to be important after more experimentation. But after the current experimentation it appears that precedence simply isn't worth it in this case; it creates complexity, and hides where the lists begin/end. Precedence is great in many circumstances, but not in this one.

    Of course, you can write code in some Lisp dialect to implement a language that includes precedence. Many programs written in Lisp, including PVS and Maxima, do just that. But when you're implementing another language, you know what the operators are, and you're probably implementing other syntactic sugar too, so adding precedence is a non-problem. But sweet-expressions are intended to be a universal representation, just like S-expressions are, so their role is different... and in that different role, precedence causes problems that don't typically show up in other uses. Not supporting precedence turns out to be better in this different role.

    This approach supports binary infix operators, but doesn't include any special mechanism for unary minus. For unary minus, you must use the name-prefixed form “-(x)” or the prefix form “(- x)” . By only handling binary infix operators, it is much less likely for an expression to be “accidentally” interpreted as an infix expression when we did not mean it to. For example, you can even have quoted lists of infix operators, and freely mix s-expressions that begin with infix operations, without trouble. But there are a few cases where automatic infix interpretation while reading can be troublesome, so there will be ways to control it. Note that an expression is not infix if you do just about anything with the second parameter, or if you use an infix operator as the first operator. This means that:

      (2 + (3 * 5)) <==> (2 + (* 3 5)) <==> (+ 2 (* 3 5))
    
  3. We’ll allow the use of name-prefixing, so NAME(x y) is the same as (NAME x y). NAME must be followed immediately by a “(” without whitespace between them. For the infix rules purposes we’ll perform this transformation after doing infix expression handling, so if the cuddled expression is an infix expression, it's a single-parameter function with an expression in infix form (you can use parentheses to cause a multi-parameter interpretation). Here are some examples:
     factorial(z)         <==> (factorial z)
     foo(x y)             <==> (foo x y)
     bar(x y z)           <==> (bar x y z)
     factorial(x - 1)     <==> (factorial (x - 1)) <==> (factorial (- x 1))
     f(g(y) h(y) a)       <==> (f (g y) (h y) a)
     f((x + 2) y (z - 3)) <==> (f (+ x 2) y (- z 3))
     f((x + 2) * (y - 3)) <==> (f (* (+ x 2) (- y 3)))
    

I call this combination “sweet-expressions” - by adding syntactic sugar (which are essentially abbreviations), I hope to create a sweet result. In the few cases where the sweet-expression is awkward, just use the usual s-expression notation (with infix escapes if necessary) -- well-formatted s-expressions are still legal. The result supports quoting and backquoting, is very compatible with typical s-expressions, and should work well with most macros.

Controlling sweet-expression interpretation and next steps

There should be standard ways controlling this, particularly for infix interpretation. Here are some early ideas about this.

For the moment, I plan to just implement an "as(..)" expression that can be cuddled around the infix operator (the second parameter). This must be removed at read time or very early on during eval time, because other macros won't know what to do with this. In other words, to ensure that "(defun as(+) ...)" will work correctly, we need to process as(..) before we process "defun". (Alternatively, use name-prefixing: defun(+ ...) works as well). I plan to implement this at read time for now, so that all other macros will work perfectly normally.. by the time normal macros get the expressions, they only see standard s-expression format. (Instead of as(...), I could use some notation beginning with #, but this turns out to be awkward in Scheme (because handling # as a secondary reader is painful), and it also turns out to be awkward in Common Lisp (because functions have a different name space, so you have chained # operators that look confusing).

Controlling infix processing is probably the most troublesome issue. For new code, it's more convenient to presume infix unless disabled - you really only need the occasional "as". Yet for maximum compatibility with existing code, it's better to expressly turn it on and to have control over it, so you want default off, with various ways to enable it. I believe this is very solvable.

For now, I'll just implement the "as" version, which is enough to see how it works.

A related issue is defining what the infix operations are. It seems clear to me that = (equality), <=, +, and so on should map to their prefix versions. I also think that && should map to "and", and || should map to "or" -- these are common conventions in other languages, and thus should be easy to read. I think "**" should become "expt" (exponentiate). But there are many other prefix operations with text-based names that should be defined (by defining a function or macro) that aren't as obvious. All these mappings can be done using regular definitions, without anything special happening in the language.

First, assignment (setf in Common Lisp, set! in Scheme). Gloria suggests using "=" for setting values (setf in Common Lisp), but I think that's a bad idea. After all, "="... that's already the equality test, and having the same symbol mean different things is likely to be disastrous; C already has trouble simply because "=" and "==" look too similar. The phrase ":=" has its problems too; it'd be considered "=" in the keyword space (confusing!) in Common Lisp, and doesn't prefix well. A plausible alternative is "<-", so "x < 5" reads as "set x to the value 5". This also combines well with C-like operators, e.g., "+<-" could be "add to"... that means that "x +< 5" would be the same as "(setf (+ x 5))" in Common Lisp or "(set! x (+ x 5))" in Scheme.

There are lots of other potential operators. Gloria suggests "@" become prefix "aref" (which accesses array "left" at element "right"), "->" become "slot-value-q" (though this conflicts with using it as "implies"), "%" become "mod", "===" become eq, "==" become equal, "&" become logand, "|" become logor, and ":" become "values". I worry about using | and & because they're easily mistaken for the double-character kind. Indeed, Dennis Ritchie admits that the precedence values for bitwise and/or, vs. logical and/or, were a mistake - made worse because of their visual similarity. Languages should designed so that people will be less likely to make obvious mistakes, based on previous measured evidence that these mistakes occur... and this evidence certainly exists in the case of && vs. &.

What should ++ be... increment, string concatenation, something else? There should probably be a string concatenation infix operator, but I'm not sure what to name it.

I would use "->" for implies, and "<->" for iff (if and only if). One danger is that mixing "->" and "<-" in the same expression can be very confusing, but I don't see a simple alternative.

A "standard list" of common infix operations and their meaning would make it easier to read and understand such expressions. But this is difficult to do, and in case only makes sense once there's more experience with sweet-expressions.

What about writing a list back out? It'd be nice to have a written form that could always be read back in as a sweet-expression. An optional writer would be nice - one that added as(...) where needed, and which optionally showed infix as infix, or optionally used indenting. It could be made really obvious that it was the special writer by using name-prefixing where possible -- that would be a glaringly obvious difference from the traditional s-expression writer. So it would write stuff like:

 if((x = 5) display("Five") display("Not five"))
Alternatively, the writer could normally write without using name-prefixing:
 (if (x = 5) (display "Five") (display "Not five"))
It could also just write traditional s-expressions, but adding the as(...) in the few places where it's necessary; that'd be good for debugging:
 (if (= x 5) (display "Five") (display "Not five"))

Simple Example

Now, let’s look again at our sample Common Lisp program, using traditional s-expression notation:

 (defun factorial (n)
   (if (<= n 1)
       1
       (* n (factorial (- n 1)))))

The Scheme version is similar:

 (define (factorial n)
   (if (<= n 1)
       1
       (* n (factorial (- n 1)))))
The two samples above are also legal sweet-expressions. Why? Well, the initial “(” disables I-expression processing, so even if the formatting is ruined, we’re simply processing as usual. There’s no function name ajoined to the right with “(”; normal formatting rules wouldn’t even consider that. As is usual for Lisp code, “infix” operators are listed first (not second) so there’s no infix operator as a second parameter to cause a change.

But what if we take advantage of the new capabilities of sweet-expressions? A pretty-printer could even do this automatically, as a way to start.. and the results look fantastic. Here's the Common Lisp version:

 defun factorial (n)
   if (n <= 1)
       1
       n * factorial(n - 1)

And here's the Scheme version:

 define factorial(n)
   if (n <= 1)
       1
       n * factorial(n - 1)

What’s even more interesting, we haven’t lost any power. We can still use quotes, or quasiquotes with comma-lifting. We can still have functions that return functions, and so on. We can also mix up regular s-expressions as part of sweet-expressions, or even mix things up back and forth:

 defun factorial (n)
   if (n <= 1) 1 (* n (factorial (n - 1)))

Another example

Here's a trivial example derived from the Common Lisp Tutorial (by Gordon, Haible, and Taube):

(do ((x 1 (+ x 1))
       (y 1 (* y 2)))
      ((> x 5) y)
    (print y))

Again, the above is also a valid sweet-expression, but we can make it look much nicer instead:

 do
   group
     x 1 (x + 1)
     y 1 (y * 2)
   group
     x > 5
     y
   print y

Some patterns

More complex s-expression patterns turn out to have regular representations in sweet-expressions. For example, the typical structure:

 (cond ((condition1) (result1)) ((condition2) (result2)) ...
       (t (otherwiseresult)))
can be expressed the same way in sweet-expressions. But if you use I-expression indentation, they can become staggeringly more obvious to the uninitiated:
 cond
  group (condition1)
    result1
  group (condition2)
    result2
  group t
    otherwiseresult

Since the infix operations are “merged” if they are all the same, this works as it would in mathematics in a sweet-expression:

  if (0 <= x <= 10)
    (dostuff) ; x is between 0 and 10 inclusive.
    (dontstuff) ; x is not between 0 and 10 inclusive.

Discarded Variation: Initial "(" on block disabling more

In the definition of Sweet-expressions above, any "(" disables indentation processing, but name-prefix and infix process continue to work. This is mostly compatible with existing Lisp code, but not completely. Name-prefixing is unlikely to cause a problem in well-formatted code, but expressions that look like infix expressions can certainly occur, and that could be a problem. Basically, it'd be very nice to be able to disable infix processing in some reasonable way.

I thought of one variation, which I describe here.. but then explain its problems.

One solution is to modify sweet-expressions (as described above) so that, if the first non-whitespace character of a newly-read block is "(", more of sweet-expressions could be disabled. We could disable infix processing, at least. Of course, once you do that, why not just disable everything and process it as a traditional s-expression? The advantage of doing this is that you can call the "original" read routine, with all of its local extensions. The rules would then get a little more complex to explain, but you also get massive backwards compatibility. If the first non-whitespace character is a special character (mainly ";", "'", and ","), process them and try again. Otherwise, after than initial character normal sweet-expression processing occurs... so an embedded "(" in expressions like load("name") or if (n < 3) would work normally.

There are disadvantages to this approach, too. One sad thing is that parentheses-surrounded infix calculations would no longer work at the topmost level, e.g., "(3 + 4)" would work inside a block, but not as the topmost block. You could enter "3 + 4", but that would require two "Enters" if we keep the rule that "multiple terms on the initial line mean wait for more lines". The user could call using name-prefixing any function that returns its first parameter, e.g., if "calc" is a function that returns its first parameter, then "calc(3 + 4)" could do the calculation. An alternative for the "usual case" would be to add one more rule: if there are 3 or more terms on the first line, and it matches the infix pattern, then return immediately after the first line is entered - so "3 + 4" now works on the command line. I like that idea; it seems unlikely to happen by accident, and it's even simpler than "(3 + 4)". You'd still need a "calc()" function if on the initial line you have an infix operation whose first parameter is also infix, e.g., "(5 * 6) + 3" won't work - you'd have to use "calc(5 * 6) + 3". Such special-casing is not as "clean", unfortunately. The advantages in backwards-compatibility and ease-of-use are significant, though, so I'm leaning towards that change.

So here's how the rules would look:

  1. If the first non-whitespace character of a new block is "(", it's a traditional s-expression; it may be prefixed by arbitrary numbers of blank lines (which may include spaces or tabs), ";...." comments, quoting ('), quasiquoting (`), or comma-lifting (,). Any spaces or tabs after the expression will be consumed before returning, to increase compatibility with older systems (an s-expression, followed by a bare atom, will be processed like a traditional Lisp processor would process it).
  2. Otherwise, if (after skipping content-free lines) the line begins without any space or tab, and the line is either one complete term like load("...") or one complete infix expression like 5 + 6 + 7, it's considered a one-line sweet-expression; it's returned (and run) immediately at the end of the line. This special rule makes interactive use much more pleasant. Without this special rule, we have to enter extra blank lines all the time for sweet-expressions, even when it's "obvious" that a new line isn't needed. Note that more lines will always be requested if there are any unclosed parens.
  3. Otherwise, it's a multi-line sweet-expression. Basically, putting tabs/spaces at the beginning of a sweet-expression forces a "multiple lines" meaning. While there's an open paren, indenting is ignored. If there's not an open paren on the first line, the second line's tab/space indent sets the "smallest indent"; any the line with less indentation (including a blank line) ends the sweet-expression block.

    The blank-line rule is needed to make interactive processing pleasant. I think a space-and-tab only line should be considered the same as an immediate-return blank line; you could make lines with only spaces and tabs have a different meaning, but that's certain to cause mysterious problems (since they'd look the same). Lines which include only a ;-prefixed comment should be completely ignored, even if they have less indentation, so that they can be included without ending the block.

    Note: If the first line was indented, but the second line is not indented at all, then the previous line is processed and returned, with the second line unconsumed. This deals with weird formats like:

       hello
    (more stuff....)
    
    Such formats were probably intended to be interpreted as traditional s-expressions (they make no sense as multi-line sweet-expressions), so we'll interpret the first line as a traditional format to increase backwards compatibility.

The rules are complex, though. What's disturbing is that "(3 + 4) * 5" works completely differently from "3 + (4 * 5)"; that's the sort of inconsistency that makes a notation a burden instead of a pleasure. One complication is that the "read" function needs to return when it's read an expression, and ideally doesn't store internal state of whitespace alrady read. That means that recording the indentation of the initial block should be avoided, if it's reasonable to do so... otherwise we have to record that information.

In short, striving for backwards-compatibility, ease-of-use at the command line, ease-of-use for programming, readability, and full list processing without special-casing a lot of syntax involves tradeoffs... my goal is to find the "best" trade.

Variations: Using different grouping symbols

There are times when you don't want infix processing (or want to control it specially). Also, it's clearly possible that reading an existing s-expression, but using infix-interpretation, could occasionally silently change the meaning of that expression. And that is obviously of concern to some.

In discussions with Alan Manuel Gloria, another idea popped up: really using multiple different grouping/list markers. The original discussion noted using [] for grouping/list initialization instead of (). Basically, [...] would start a new list and turn off indentation, but continue name-prefix and infix processing. In contrast, () would at least disable infix processing.

Although the original discussions used [...], I tried out various experiments and think that {...} is a better choice. The problem is that [...] is used in mathematics, and many programming languages including Python and Haskell, to mean a literal list. But this is exactly what we don't mean - instead, we want a "grouping" symbol that means that we specially interpret what's inside (in particular, that we interpret infix operators). In contrast, all languages with C-like syntaxes (C, C++, Java, C#) have trained folks to read {..} as possibly meaning a block, and in mathematics {...} can denote sets (again, special interpretation used for what's inside).

So, for the moment, let's presume that unprefixed (...) will indicate "no infix" (since this is much more backwards-compatible with s-expressions). There are actually two alternative ways to interpret (), if it's to be distinct from {} grouping. One is to say that () disables only infix, but you still have name-prefixing, and you can switch back to infix inside by using {}. Another is that () switches to pure s-expression processing.

There's a big advantage to saying that () switches to pure s-expression processing (disabling both indentation and infix processing, as well as indentation). If you define it that way, the sweet-expression reader can immediately call the original built-in (read) routine instead. That's important - while implementing sweet-expressions, I've found that there's a lot of implementation-unique read processing that you have to try to simulate if you want ALL the functionality of the underlying system. (E.G., standard Scheme doesn't have |..| constants, but some implementations do, so to read stuff for their Scheme you have to implement these new oddities.) That makes it essentially impossible to write a portable implementation of sweet-expressions that retains the capabilities of the underlying system. As a side-benefit, most s-expression files would be just readable as-is... which I admit is NOT a 'most important' criteria, but it'd be nice to have.

On the other hand, sometimes you'd like to say "no infix or indent processing", but be able to use name-prefixing, and switch to infix inside it. Guess what - there's a symbol pair already with that meaning: [..]. So [f g(x) + h(x)] is interpreted as (f (g x) + (h x)), without any infix operator in sight. Let's say that [..] has this meaning as long as it's not name-prefixed; that way, we can later give expressions like name[i] another meaning. If we use [...] to mean "list, no infix" and {...} to mean "list, with infix", you can then write sweet-expressions without (...) at all. This means that we could then give the reader a choice of how to handle (...)... as grouping, like {}, or as no infix, as []. The latter is more backwards-compatible, and so I'd expect that to be the default; the former is much more similar to other languages.

I think it'd be nicer to start with "infix as the default" (maybe a callable parameter if you want something else). That's a reasonable default simply because MOST people are more comfortable with it. Since ANY open paren disables the infix until its matching paren, users that use ordinary s-expressions might not even notice sweet-expressions. It's not like normal Lispers type in "3 + 4" and expect it to work :-). So it won't harm those who want s-expression prefix, but it will help those who want infix. That might make sweet-expressions more likely to be accepted; you could enable them, and yet for most people they wouldn't even notice the difference (except for the addition of new abilities).

The rules are simple: When unprefixed, use {} for grouping, which disables only indentation; use [...] to disable indentation and infix; use () to disable everything (infix, name-prefix, and indentation). A prefixed {} or () is a function call, and only disables indentation. We'll reserve prefixed [...] for future use. This use of {} instead of () for grouping infix operators like {4 + {3 * 2}} is _slightly_ nonstandard, but not terribly so, and lets us be completely backwards-compatible with s-expressions. That's pretty compelling.

What about name-prefixed forms? Even if (f x) disables infix processing, that does not mean that f(x) must do the same. It's consistent in one sense if it does, but f{x} looks really funny while f(x) looks very conventional. I'm presuming, for the moment, that unprefixed (...) will really only be used for backwards-compatibility.

So now we're back to nice forms, without any calc()-like irregularities:

3 + 4
3 + {5 * 6}        ; just use {} instead of ().
{3 + 5} * 6        ; just use {} instead of (). No special cases.
{3 + 4} * f(x + y) ; Function calls work fine.
(+ 3 4) * f{x + y} ; this works too (!)
f{3 + 4}           ; Function calls support infix, is (f (+ 3 4))
f(3 + 4)           ; Proposal: prefixed () keeps infix, so it's (f (+ 3 4))
[1 + 2]            ; Here's a safe way to describe (1 + 2).
[f 3 + 4]          ; Here's (f 3 + 4).
(f 3 + 4)          ; Same thing.

There's another interesting benefit too. This would mean that sweet-expressions would be full of {...} when infix is used... which would make them even easier to tell apart from s-expressions.

Interestingly, this also makes it easy to use special "infix" macros (often written as "nfx"). Just tell users that "for advanced infix processing, use (nfx ....) and then follow these infix rules (which can then support precedence levels, defining infix operators, etc.)". So now you can have a choice: (1) a built-in "base" system doesn't need to be told about operators, precedence, or associativity, but you have to group all different operators using [] and you can only use punctuation as infix operators... and (2) advanced "nfx-like" functions, but one where you have to define all that stuff.

In theory, a program written using Scheme RSR6 might use [...] to mean the same as (..). But I bet almost no one actually DOES that.

This switch to using {...} everywhere for grouping, and (...) for straight s-expressions, is very tempting. The factorial example becomes:

 defun factorial {n}         ; Parameters can be indented, but need not be
   if {n <= 1}               ; Supports infix, prefix, & function <=(n 1)
       1                     ; This has no parameters, so it's an atom.
       n * factorial(n - 1)  ; Function(...) notation supported
                             ; Function{...} okay too.

It's easy to accidentally use parens when you meant to use {...}, particularly in expressions like {4 + {3 * 2}}. But it's still tempting to make this change, since this makes the entire notation even more backwards-compatible, as well as giving it an easy (and clear) way to disable infix.

Since this gives an easy way to disable infix, this does raise the possibility of allowing "and" and "or" as built-in infix operators. After all, now infix operators are easy to disable. On the one hand, it's horribly inconsistent to have "and" and "or" built-in as infix, since all other infix operators are punctuation. On the other hand, "and" and "or" are essentially supported by all major Lisps. Also, these two operators are often treated specially; provers often treat them specially, and some Lisps treat them as short-circuiting operations (so in some since they are already very special). A typical dilemma: You don't want many special cases, but if a special case really makes something easy to use, it may be worth it. And of course, once "and" and "or" are built-in, along with relational operators and + - * /, perhaps some built-in precedence does make sense.

In the end, the best approach may be to choose the "most easily changed" decision. If precedence is not supported at first, then adding it later is easy (since any written code won't use precedence). On the other hand, removing precedence (or fixing precedence levels) is hard/impossible. Similarly, if () disables sweet-expressions entirely, then code won't be written inside () that uses sweet-expressions... so it won't be a big deal to add that functionality later.

It's really too bad about (...). In any other language, (...) merely groups expressions, but if we're serious about backwards-compatibility, then an unprefixed (...) needs to turn off infix processing inside it. If you don't need backwards compatibility, accepting unprefixed (...) for simply grouping (without disabling infix) sure looks nicer. It's too bad that there is a big trade-off here, I'd like the benefits of both.

Issues with Sweet-expressions

Ideally, any data format would not be easily misread, and it would not be easy to make a small mistake with disastrous consequences. This is less likely if any minor change to an expression would make it illegal. Any format that can read both original S-expressions and a new format will have some issues in this regard, simply because there are many more expressions that were once illegal but are now legal. A particular problem is that in some cases, expressions could be interpreted as being infix but were not intended to be infix. However, I believe these are very rare. Conversely, expressions that are intended to be infix might not be (e.g., because of a failure to surround infix operators with spaces); I think that with experience this goes away. The indenting is a non-issue; since this is only meaningful outside of parentheses, any existing Lisp code will be inside parens, and indenting like this is common practice, there's no real issue. The prefixed names are unlikely to be in real code; there's a risk that people will accidentally put spaces between the name and the opening paren, but I think it's manageable.

If there's a really major risk that a misplaced space would immediately cause your deep space probe to crash, perhaps a Lisp-like notation isn't the best idea for your application. Even regular s-expressions have similar problems; a space anywhere in an s-expression separates parameters, and there's little compile-time checking of the kind expected for such circumstances. Lisp-like languages don't do a lot of compile-time checking in general (of the kind expected for an "any mistake is disastrous" application). Specialized notations that make it hard to make that kind of mistake, and easier to detect every mistake at compile time, might be better (Ada comes to mind). But in many circumstances, that's not the primary need; if you need to do a pile of list processing, including processing/analysis of language, a list processing language makes sense. For example, analysis of a critically-important program might be a perfectly fine use for sweet-expressions.

Another problem is when there are multiple ways to express something, it can be hard to choose between them. In particular, you can use name-prefixing or indentation to implement a function call, as well as traditional s-expressions - which do you choose? And when displaying a list, which do you choose? I think "name-prefixing for short expressions, and indentation when their parameter lists get long", is a reasonable rule of thumb when writing code where you intend to use it as a function call. If it's just a list of information, a traditional s-expression shows it clearly. I would probably use traditional s-expressions to write the list back out; that is clear, and can help people avoid misunderstandings and misuse.

Conclusions

There are lots of other options, but I only have so much time. In any case, I’ve investigated a lot of options, and I think I’ve found a plausible combination. What I have here is a paper design that looks like it should work.

I think the next step is to establish a project where people can discuss this, and fix it based on those comments. Then, after a period of discussion, create some FLOSS code to implement it that is widely-usable (say under the MIT, BSD-new, or LLGPL license). One of the first things to implement are pretty-printers for this format, so we can see what big programs look like in this format (and see its problems). Once the format is updated through experience coding it, I would want the resulting programs would be available to all. Multiple implementations will be necessary (for Scheme, Common Lisp, etc.) and they will need to support several different usage scenarios:

My current thought is to develop it in a subset of Scheme, which can then be easily translated to ACL2, Common Lisp, and so on (Scheme has far fewer operations, and is much stricter about types such as booleans, so it's easier to translate Scheme to something else instead of the other way around.)

Finally, after experimentation and testing in the real world, the final form of sweet-expressions could be set and released to the world.

I've created a SourceForge project named "readable" - the project's role is to discuss ways to make s-expressions more readable, create at least one (sweet-expressions), and promulgate FLOSS code to support readable representations of s-expressions. I could possibly create "sweetexpr" for specifically working on sweet-expressions, or work via the Scheme SFRI process.

Appendix: BitC

I recently encountered the new language BitC, a language devised to support low-level programs (e.g., operating system kernels) that are easy to mathematically prove correct. (I learned about this while writing my paper High Assurance (for Security or Safety) and Free-Libre / Open Source Software (FLOSS)... with Lots on Formal Methods.) BitC combines ideas from Scheme and ML, including an emphasis on functional programming and tail recursion, with low-level capabilities like those of C. In particular, it’s designed so that you do not need to do heap allocation (e.g., “malloc” or “new”) or garbage collection to create useful programs, a property that functional programming languages generally do not have. This is very interesting work, and BitC version 0.9+ uses an S-expression-like notation. The reason it does, intead of a more common notation, is because the authors want to make it easy to prove them correct. One way to prove them correct is to use reasoning where program fragments are repeatedly replaced by other fragments -- and since s-expressions are very regular, s-expressions simplify the process. They hope to reuse some of ACL2’s capabilities, which is written in Common Lisp and expects s-expressions as input.

The problem is that s-expressions are still not easy for humans to read, and BitC’s creators are well aware of that. Also, while Lisp programmers routinely “write programs to write programs”, here the application is different -- generally a theorem prover is doing the rewriting, and displaying LOTS of output from text generated from a program. This means humans will be doing lots of reading, possibly of long, complicated structures; what you want is an input and output that is simple for the human to read and unambiguous. What’s more, this is a new language, not Lisp, so trying to stay compatible with Lisp for compatibility’s sake is not necessary. Indeed, the BitC folks had already strayed from the strict s-expression camp; they had included a few abbreviations, like an infix “:” to indicate types (the abbreviations quietly transform to regular s-expressions). Keeping the power of s-expressions (e.g., the regularity of processing) would be advantage for BitC.

In an email I received from Jonathan Shapiro, he noted that “I share your concern about using an [s-expression-based] syntax [that users may find it hard to use]. The problem is that we have a bunch of very nasty experience in Isabelle/HOL, TWELF, and friends with the alternative...”, in part because “All known examples for quoting programs so that they can be manipulated as terms in a prover are a complete disaster if the [programming language] is infix.” He continued, “When we get around to dealing with the logic language, it will become clear that treating program fragments as terms is vital. In LISP this is quite natural: ’(eval 1). ... ACL2 exploits a devious pun between applicative functions and logic terms... [defining a function] simultaneously [admits] admitting a procedure and a [mathematical pure] function... Because [this pun] is maintained so successfully, godelization requires no explicit syntax... Our experience with parsing infix languages as terms has been that (a) it is truly dismal, and (b) it is incomprehensible to the developer... even the people at UT could not make IACL2 work well... One reason that we did NOT adopt the ‘{’ syntax for vectors was the anticipation that we might use it for infix terms in some future implementation. I’ve been very careful to preserve the possibility of an equivalent infix syntax. It doesn’t mean that we would have to drop the existing lisp-ish syntax.” He also doesn’t want to spend time devising syntax, since that is not the focus of his work.

Note that this application needs to be able to manipulate fragments of program code, repeatedly -- a traditional Lisp strength, and a traditional reason to use s-expressions. Yet, we want the s-expressions to be easier to read!

The power of metaprogramming is well-known, and there's a lot of literature about it. To cite just one example, Jurgen Jordanus Vinju's "Analysis and Transformation of Source Code by Parsing and Rewriting" discusses some of the things you can do when you use programs to analyze/modify programs. There are a legion of other references.

Whether or not the BitC folks pick up these ideas immediately, I think finding a way to make s-expressions more accessible is an interesting puzzle, one that would be great to solve. I suspect there are many people who might be interested in picking up an improvement, if it really was an improvement. The ACL2 folks have the same issues, as do many others, too. So it'd be great to solve it once, and then anyone who wants to use the solution can do so.

Appendix: Other Notes

Of course, making programming languages readable is much more general topic, and there's a lot of literature on that. The literature on parsing is also extensive. One paper on parsing that I should mention is Haskell, do you read me?: constructing and composing efficient top-down parsers at runtime, Proceedings of the first ACM SIGPLAN symposium on Haskell table of contents Victoria, BC, Canada, pp. 63-74, 2008. This is implemented by "ChristmasTree": "Changing Haskell's Read Implementation Such That by Manipulating Abstract Syntax Trees it Reads Expressions Efficiently".


You might want to look at my paper High Assurance (for Security or Safety) and Free-Libre / Open Source Software (FLOSS)... with Lots on Formal Methods.

You can see David A. Wheeler’s home page, http://www.dwheeler.com.