David A. Wheeler's Blog

Mon, 14 Oct 2013

Readable Lisp version 1.0.0 released!

Lisp-based languages have been around a long time. They have some interesting properties, especially when you want to write programs that analyze or manipulate programs. The problem with Lisp is that the traditional Lisp notation - s-expressions - is notoriously hard to read.

I think I have a solution to the problem. I looked at past (failed) solutions and found that they generally failed to be general or homoiconic. I then worked to find notations with these key properties. My solution is a set of notation tiers that make Lisp-based languages much more pleasant to work with. I’ve been working with many others to turn this idea of readable notations into a reality. If you’re interested, you can watch a short video or read our proposed solution.

The big news is that we have reached version 1.0.0 in the readable project. We now have an open source software (MIT license) implementation for both (guile) Scheme and Common Lisp, as well as a variety of support tools. The Scheme portion implements the SRFI-105 and SRFI-110 specs, which we wrote. One of the tools, unsweeten, makes it possible to process files in other Lisps as well.

So what do these tools do? Fundamentally, they implement the 3 notation tiers we’ve created: curly-infix-expressions, neoteric-expressions, and sweet-expressions. Sweet-expressions have the full set of capabilities.
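
To give a flavor of the first two tiers on their own, here is a small sketch (based on the examples below and on the SRFI-105/SRFI-110 specs; see those specs for the precise rules). Curly-infix lets you write ordinary infix inside {...}, and neoteric-expressions let you write f(x) for an ordinary call:

{n <= 1}                 ; curly-infix, reads as (<= n 1)
{2 + 3 + 4}              ; curly-infix, reads as (+ 2 3 4)
factorial(n)             ; neoteric, reads as (factorial n)
{n * factorial{n - 1}}   ; combined, reads as (* n (factorial (- n 1)))

Sweet-expressions then add the third step: indentation supplies the outer parentheses, as the example below shows.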

Here’s an example of (awkward) traditional s-expression format:

(define (factorial n)
  (if (<= n 1)
    1
    (* n (factorial (- n 1)))))

Here’s the same thing, expressed using sweet-expressions:

define factorial(n)
  if {n <= 1}
    1
    {n * factorial{n - 1}}

I even briefly mentioned sweet-expressions in my PhD dissertation “Fully Countering Trusting Trust through Diverse Double-Compiling” (see section A.3).

So if you are interested in how to make Lisp-based languages easier to read, watch our short video about the readable notations or download the current version of the readable project. We hope you enjoy them.

path: /misc | Current Weblog | permanent link to this entry

Tue, 06 Aug 2013

Don’t anthropomorphize computers, they hate that

A lot of people who program computers or live in the computing world ‐ including me ‐ talk about computer hardware and software as if they are people. Why is that? This is not as obvious as you’d think.

After all, if you read the literature about learning how to program, you’d think that programmers would never use anthropomorphic language. “Separating Programming Sheep from Non-Programming Goats” by Jeff Atwood discusses teaching programming and points to the intriguing paper “The camel has two humps” by Saeed Dehnadi and Richard Bornat. This paper reported experimental evidence on why some people can learn to program, while others struggle. Basically, to learn to program you must fully understand that computers mindlessly follow rules, and that computers just don’t act like humans. As their paper said, “Programs… are utterly meaningless. To write a computer program you have to come to terms with this, to accept that whatever you might want the program to mean, the machine will blindly follow its meaningless rules and come to some meaningless conclusion… the consistent group [of people] showed a pre-acceptance of this fact: they are capable of seeing mathematical calculation problems in terms of rules, and can follow those rules wheresoever they may lead. The inconsistent group, on the other hand, looks for meaning where it is not. The blank group knows that it is looking at meaninglessness, and refuses to deal with it. [The experimental results suggest] that it is extremely difficult to teach programming to the inconsistent and blank groups.” Later work by Saeed Dehnadi and sometimes others expands on this earlier work. The intermediate paper “Mental models, Consistency and Programming Aptitude” (2008) seemed to have refuted the idea that consistency (and ignoring meaning) was critical to programming, but the later “Meta-analysis of the effect of consistency on success in early learning of programming” (2009) added additional refinements and then re-confirmed this hypothesis. The reconfirmation involved a meta-analysis of six replications of an improved version of Dehnadi’s original experiment, and again showed that understanding that computers were mindlessly consistent was key in successfully learning to program.

So the good programmers know darn well that computers mindlessly follow rules. But many use anthropomorphic language anyway. Huh? Why is that?

Some do object to anthropomorphism, of course. Edsger Dijkstra certainly railed against anthropomorphizing computers. For example, in EWD854 (1983) he said, “I think anthropomorphism is the worst of all [analogies]. I have now seen programs ‘trying to do things’, ‘wanting to do things’, ‘believing things to be true’, ‘knowing things’ etc. Don’t be so naive as to believe that this use of language is harmless.” He believed that analogies (like these) led to a host of misunderstandings, and that those misunderstandings led to repeated multi-million-dollar failures. It is certainly true that misunderstandings can lead to catastrophe. But I think one reason Dijkstra railed particularly against anthropomorphism is that it is a widespread practice, even among those who do understand things ‐ and I see no evidence that anthropomorphism is going away.

The Jargon file specifically discusses anthropomorphization: “one rich source of jargon constructions is the hackish tendency to anthropomorphize hardware and software. English purists and academic computer scientists frequently look down on others for anthropomorphizing hardware and software, considering this sort of behavior to be characteristic of naive misunderstanding. But most hackers anthropomorphize freely, frequently describing program behavior in terms of wants and desires. Thus it is common to hear hardware or software talked about as though it has homunculi talking to each other inside it, with intentions and desires… As hackers are among the people who know best how these phenomena work, it seems odd that they would use language that seems to ascribe consciousness to them. The mind-set behind this tendency thus demands examination. The key to understanding this kind of usage is that it isn’t done in a naive way; hackers don’t personalize their stuff in the sense of feeling empathy with it, nor do they mystically believe that the things they work on every day are ‘alive’.”

Okay, so others have noticed this too. The Jargon file even proposes some possible reasons for anthropomorphizing computer hardware and software:

  1. It reflects a “mechanistic view of human behavior.” “In this view, people are biological machines - consciousness is an interesting and valuable epiphenomenon, but mind is implemented in machinery which is not fundamentally different in information-processing capacity from computers… Because hackers accept that a human machine can have intentions, it is therefore easy for them to ascribe consciousness and intention to other complex patterned systems such as computers.” But while the materialistic view of humans has respectable company, this “explanation” fails to explain why humans would use anthropomorphic terms about computer hardware and software, since they are manifestly not human. Indeed, as the Jargon file acknowledges, even hackers who have contrary religious views will use anthropomorphic terminology.
  2. It reflects “a blurring of the boundary between the programmer and his artifacts - the human qualities belong to the programmer and the code merely expresses these qualities as his/her proxy. On this view, a hacker saying a piece of code ‘got confused’ is really saying that he (or she) was confused about exactly what he wanted the computer to do, the code naturally incorporated this confusion, and the code expressed the programmer’s confusion when executed by crashing or otherwise misbehaving. Note that by displacing from “I got confused” to “It got confused”, the programmer is not avoiding responsibility, but rather getting some analytical distance in order to be able to consider the bug dispassionately.”
  3. “It has also been suggested that anthropomorphizing complex systems is actually an expression of humility, a way of acknowledging that simple rules we do understand (or that we invented) can lead to emergent behavioral complexities that we don’t completely understand.”

The Jargon file claims that “All three explanations accurately model hacker psychology, and should be considered complementary rather than competing.” I think the first “explanation” is completely unjustified. The second and third explanations do have some merit. However, I think there’s a simpler and more important reason: Language.

When we communicate with a human, we must use some language that will be more-or-less understood by the other human. Over the years people have developed a variety of human languages that do this pretty well (again, more-or-less). Human languages were not particularly designed to deal with computers, but languages have been honed over long periods of time to discuss human behaviors and their mental states (thoughts, beliefs, goals, and so on). The sentence “Sally says that Linda likes Tom, but Tom won’t talk to Linda” would be understood by any normal seven-year-old girl (well, assuming she speaks English).

I think a primary reason people use anthropomorphic terminology is that it’s much easier to communicate that way when discussing computer hardware and software using existing languages. Compare “the program got confused” with the overly long “the program executed a different path than the one expected by the program’s programmer”. Human languages have been honed to discuss human behaviors and mental states, so it is much easier to use languages this way. As long as both the sender and receiver of the message understand the message, the fact that the terminology is anthropomorphic is not a problem.

It’s true that anthropomorphic language can confuse some people. But the primary reason it confuses some people is that they still have trouble understanding that computers are mindless ‐ that computers simply do whatever their instructions tell them. Perhaps this is an innate weakness in some people, but I think that addressing this weakness head-on can help counter it. This is probably a good reason for ensuring that people learn a little programming as kids ‐ not because they will necessarily do it later, but because computers are so central to the modern world that people should have a basic understanding of them.

path: /misc | Current Weblog | permanent link to this entry

Sun, 10 Mar 2013

Readable Lisp: Sweet-expressions

I’ve used Lisp-based programming languages for decades, but while they have some nice properties, their traditional s-expression notation is not very readable. Even the original creator of Lisp did not particularly like its notation! However, this problem turns out to be surprisingly hard to solve.

After reviewing the many past failed efforts, I think I have figured out why they failed. Past solutions typically did not work because they failed to be general (the notation is independent of any underlying semantics) or homoiconic (the underlying data structure is clear from the syntax). Once I realized that, I devised (with a lot of help from others!) a new notation, called sweet-expressions (t-expressions), that is general and homoiconic. I think this creates a real solution for an old problem.

You can download and try out sweet-expressions as released by the Readable Lisp S-expressions Project by downloading our new version 0.7.0 release.

If you’re interested, please participate! In particular, please participate in the SRFI-110 sweet-expressions (t-expressions) mailing list. SRFIs let people write specifications for extensions to the Scheme programming language (a Lisp), and this SRFI lets people in the Scheme community discuss it.

The following shows an example of traditional (ugly) Lisp s-expressions, the same thing in sweet-expressions, and a short explanation of each line.

Traditional s-expressions:

(define (fibfast n)
  (if (< n 2)
    n
    (fibup n 2 1 0)))

Sweet-expressions (t-expressions), with the explanation of each line as a comment:

define fibfast(n)   ; Typical function notation
  if {n < 2}        ; Indentation, infix {...}
    n               ; Single expr = no new list
    fibup n 2 1 0   ; Simple function calls

path: /misc | Current Weblog | permanent link to this entry

Sun, 12 Aug 2012

Readable s-expressions for Lisp-based languages: Lots of progress!

Lots has been happening recently in my effort to make Lisp-based languages more readable. A lot of programming languages are Lisp-based, including Scheme, Common Lisp, emacs Lisp, Arc, Clojure, and so on. But many software developers reject these languages, at least in part because their basic notation (s-expressions) is very awkward.

The Readable Lisp s-expressions project has a set of potential solutions. We now have much more robust code (you can easily download, install, and use it, due to autoconfiscation), and we have a video that explains our solutions. The video on readable Lisp s-expressions is also available on Youtube.

We’re now at version 0.4. This version is very compatible with existing Lisp code; the notations are simply a set of additional abbreviations. There are three tiers: curly-infix expressions (which add infix), neoteric-expressions (which add a more conventional call format), and sweet-expressions (which deduce parentheses from indentation, reducing the number of required parentheses).

Here’s an example of (awkward) traditional s-expression format:

(define (factorial n)
  (if (<= n 1)
    1
    (* n (factorial (- n 1)))))

Here’s the same thing, expressed using sweet-expressions:

define factorial(n)
  if {n <= 1}
    1
    {n * factorial{n - 1}}

A sweet-expression reader could accept either format, actually, since these tiers are simply additional abbreviations and adjustments that you can make to an existing Lisp reader. If you’re interested, please go to the Readable Lisp s-expressions project web page for more information and an implementation - and please join us!

path: /misc | Current Weblog | permanent link to this entry

Fri, 20 Jan 2012

Website back up

This website (www.dwheeler.com) was down part of the day yesterday due to a mistake made by my web hosting company. Sorry about that. It’s back up, obviously.

For those who are curious what happened, here’s the scoop. My hosting provider (WebHostGiant) moved my site to a new improved computer. By itself, that’s great. That new computer has a different IP address (the old one was 207.55.250.19, the new one is 208.86.184.80). That’d be fine too, except they didn’t tell me that they were changing my site’s IP address, nor did they forward the old IP address. The mistake is that the web hosting company should have notified me of this change, ahead of time, but they failed to do so. As a result, I didn’t change my site’s DNS entries (which I control) to point to its new location; I didn’t even know that I should, or what the new values would be. My provider didn’t even warn me ahead of time that anything like this was going to happen… if they had, I could have at least changed the DNS timeouts so the changeover would have been quick.
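
(If you ever want to check this sort of thing yourself, here is a quick sketch using the standard dig tool; the second field of each answer line is the record’s remaining time-to-live in seconds, which is the value you would want lowered well before a planned move.)

 dig +noall +answer www.dwheeler.com   # show the A record, its current TTL, and the IP it points to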

Now to their credit, once I put in a trouble ticket (#350465), Alex Prokhorenko (of WebhostGIANT Support Services) responded promptly, and explained what happened so clearly that it was easy for me to fix things. I appreciate that they’re upgrading the server hardware, I understand that IP addresses sometimes must change, and I appreciate their low prices. In fact, I’ve been generally happy with them.

But if you’re a hosting provider, you need to tell the customer if some change you make will make your customer’s entire site unavailable without the customer taking some action! A simple email ahead-of-time would have eliminated the whole problem.

Grumble grumble.

I did post a rant against SOPA and PIPA the day before, but I’m quite confident that this outage was unrelated.

Anyway, I’m back up.

path: /misc | Current Weblog | permanent link to this entry

Mon, 04 Jul 2011

U.S. government must balance its budget

(This is a blog entry for U.S. citizens — everyone else can ignore it.)

We Americans must demand that the U.S. government work to balance its budget over time. The U.S. government has a massive annual deficit, resulting in a massive national debt that is growing beyond all reasonable bounds. For example, in just Fiscal Year (FY) 2010, about $3.4 trillion was spent, but only $2.1 trillion was received; that means that the U.S. government spent more than a trillion dollars more than it received. Every year that the government spends more than it receives it adds to the gross federal debt, which is now more than $13.6 trillion.

This is unsustainable. The fact that this is unsustainable is certainly not news. The U.S. Financial Condition and Fiscal Future Briefing (GAO, 2008) says, bluntly, that the “Current Fiscal Policy Is Unsustainable”. “The Moment of Truth: Report of the National Commission on Fiscal Responsibility and Reform” similarly says “Our nation is on an unsustainable fiscal path”. Many others have said the same. But even though it’s not news, it needs to be yelled from the rooftops.

The fundamental problem is that too many Americans — aka “we the people” — have not (so far) been willing to face this unpleasant fact. Fareed Zakaria nicely put this in February 21, 2010: “ … in one sense, Washington is delivering to the American people exactly what they seem to want. In poll after poll, we find that the public is generally opposed to any new taxes, but we also discover that the public will immediately punish anyone who proposes spending cuts in any middle class program which are the ones where the money is in the federal budget. Now, there is only one way to square this circle short of magic, and that is to borrow money, and that is what we have done for decades now at the local, state and federal level … The lesson of the polls in the recent elections is that politicians will succeed if they pander to this public schizophrenia. So, the next time you accuse Washington of being irresponsible, save some of that blame for yourself and your friends”.

But Americans must face the fact that we must balance the budget. And we must face it now. We must balance the budget the same way families balance their budgets — the government must raise income (taxes), lower expenditures (government spending), or both. Growth over time will not fix the problem.

How we reallocate income and outgo so that they match needs to be a political process. Working out compromises is what the political process is supposed to be all about; nobody gets everything they want, but eventually some sort of rough set of priorities must be worked out for the resources available. Compromise is not a dirty word to describe the job of politics; it is the job. In reality, I think we will need to both raise revenue and decrease spending. I think we must raise taxes to some small degree, but we can’t raise taxes on the lower or middle class much; they don’t have the money. Also, we will not be able to solve this by taxing the rich out of the country. Which means that we must cut spending somehow. Just cutting defense spending won’t work; defense is only 20% of the entire budget. In contrast, the so-called entitlements — mainly medicare, medicaid, and social security — are 43% of the government costs and rapidly growing in cost. I think we are going to have to lower entitlement spending; that is undesirable, but we can’t keep providing services we can’t pay for. The alternative is to dramatically increase taxes to pay for them, and I do not think that will work. Raising the age before Social Security benefits can normally be received is to me an obvious baby step, but again, that alone will not solve the problem. It’s clearly possible to hammer out approaches to make this work, as long as the various camps are willing to work out a compromise.

To get there, we need to specify and plan out the maximum debt that the U.S. will incur in each year, decreasing that each year (say, over a 10-year period). Then Congress (and the President) will need to work out, each year, how to meet that requirement. It doesn’t need to be any of the plans that have been put forward so far; there are lots of ways to do this. But unless we agree that we must live within our means, we will not be able to make the decisions necessary to do so. The U.S. is not a Greece, at least not now, but we must make decisions soon to prevent bad results. I am posting this on Independence Day; Americans have been willing to undergo lots of suffering to gain control over their destinies, and I think they are still able to do so today.

In the short term (say a year), I suspect we will need to focus on short-term recovery rather than balancing the budget. And we must not default. But we must set the plans in motion to stop the runaway deficit, and get that budget balanced. The only way to get there is for the citizenry to demand it stop, before far worse things happen.

path: /misc | Current Weblog | permanent link to this entry

Sun, 10 Apr 2011

Innovations update

I’ve made various updates to my list of The Most Important Software Innovations. I’ve added Distributed Version Control System (DVCS); these are all over now in the form of git, Mercurial (hg), Bazaar, Monotone, and so on, but these were influenced by the earlier BitKeeper, which was in turn influenced by the earlier Teamware (developed by Sun starting in 1991). As is often the case, “new” innovations are actually much older than people realize. I also added make, originally developed in 1977, and quicksort, developed in 1960-1961 by C.A.R. (Tony) Hoare. I’ve also improved lots of material that was already there, such as a better description of the history of the remote procedure call (RPC).

So please enjoy The Most Important Software Innovations!

path: /misc | Current Weblog | permanent link to this entry

Sun, 15 Aug 2010

Geek Video Franchises

I have a new web page about a silly game I call Geek Video Franchises. The goal of this game is to interconnect as many geek video franchises as possible via common actors. In this game, you’re only allowed to use video franchises that geeks tend to like.

For example: The Matrix connects to The Lord of the Rings via Hugo Weaving (Agent Smith/Elrond), which connects to Indiana Jones via John Rhys-Davies (Gimli/Sallah), which connects to Star Wars via Harrison Ford (Indiana Jones/Han Solo). The Lord of the Rings directly connects to Star Wars via Christopher Lee (Saruman/Count Dooku). Of course, Lord of the Rings also connects to X-men via Ian McKellen (Gandalf/Magneto), which connects to Star Trek via Patrick Stewart (Professor Xavier / Captain Jean-Luc Picard). Star Trek connects to Dr. Who via Simon Pegg (JJ Abrams’ Montgomery Scott/The Editor), which connects to Harry Potter via David Tennant (Dr. Who #10/Barty Crouch Jr.), which connects to Monty Python via John Cleese (Nearly Headless Nick/Lancelot, etc.).

So if you’re curious, check out Geek Video Franchises.

path: /misc | Current Weblog | permanent link to this entry

Sat, 03 Jul 2010

Opening files and URLs from the command line

Nearly all operating systems have a simple command to open up a file, directory, or URL from the command line. This is useful when you’re using the command line, e.g., xdg-open . will pop up a window in the current directory on most Unix/Linux systems. This capability is also handy when you’re writing a program, because these are easy to invoke from almost any language. You can then pass it a filename (to open that file using the default application for that file type), a directory name to start navigating in that directory (use “.” for the current directory), or a URL like “http://www.dwheeler.com” to open a browser at that URL.

Unfortunately, the command to do this is different on different platforms.

My new essay How to easily open files and URLs from the command line shows how to do this.

For example, on Unix/Linux systems, you should use xdg-open (not gnome-open or kde-open), because that opens the right application given the user’s current environment. On MacOS, the command is “open”. On Windows you should use start (not explorer, because invoking explorer directly will ignore the user’s default browser setting), while on Cygwin, the command is “cygstart”. More details are in the essay, including some gotchas and warnings.
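
To make that concrete, here is a small sketch of a shell helper for the Unix-like cases (the function name “openit” is just an illustration, not something from the essay); on native Windows you would simply use the built-in start command instead:

openit() {
  case "$(uname -s)" in
    Darwin)  open "$1" ;;       # MacOS
    CYGWIN*) cygstart "$1" ;;   # Cygwin
    *)       xdg-open "$1" ;;   # Linux and most other Unix-like systems
  esac
}

openit .                          # open the current directory
openit http://www.dwheeler.com    # open a URL in the default browser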

Anyway, take a look at: How to easily open files and URLs from the command line

path: /misc | Current Weblog | permanent link to this entry

Sun, 21 Mar 2010

Using Wikipedia for research

Some teachers seem to lose their minds when asked about Wikipedia, and make absurd rules like “I forbid students from using Wikipedia”. A 2008 article states that Wikipedia is the encyclopedia “that most universities forbid students to use”.

But the professors don’t need to be such Luddites; it turns out that college students tend to use Wikipedia quite appropriately. A research paper titled How today’s college students use Wikipedia for course-related research examines Wikipedia use among college students; it found that Wikipedia use was widespread, and that the primary reason they used Wikipedia was to obtain background information or a summary about a topic. Most respondents reported using Wikipedia at the beginning of the research process; very few used Wikipedia near or at the end. In focus group sessions, students described Wikipedia as “the very beginning of the very beginning for me” or “a .5 step in my research process”, and said that it helps primarily in the beginning because it provides a “simple narrative that gives you a grasp”. Another focus group participant called Wikipedia “my presearch tool”. Presearch, as the participant defined it, was “the stage of research where students initially figure out a topic, find out about it, and delineate it”.

Now, it’s perfectly reasonable to say that Wikipedia should not be cited as an original source; I have no trouble with professors making that rule. Wikipedia itself has a rule that Wikipedia does not publish original research or original thought. Indeed, the same is true for Encyclopedia Britannica or any other encyclopedia; encyclopedias are supposed to be summaries of knowledge gained elsewhere. You would expect that college work would normally not have many citations of any encyclopedia, be it Wikipedia or Encyclopedia Britannica, simply because encyclopedias are not original sources.

Rather than running in fear from new materials and technologies, teachers should be helping students understand how to use them appropriately, helping them consider the strengths and weaknesses of their information sources. Wikipedia should not be the end of any serious research, but it’s a reasonable place to start. You should supplement it with other material, for the simple reason that you should always examine multiple sources no matter where you start, but that doesn’t make Wikipedia less valuable. For younger students, there are reasonable concerns about inappropriate material (e.g., due to Wikipedia vandalism and because Wikipedia covers topics not appropriate for much younger readers), but the derivative “Wikipedia Selection for Schools” is a good solution for that problem. I’m delighted that so much information is available to people everywhere; we need to help people use these resources instead of ignoring them.

And speaking of which, if you like Wikipedia, please help! With a little effort, you can make it better for everyone. In particular, Wikipedia needs more video; please help the Video on Wikipedia folks get more videos on Wikipedia. This also helps the cause of open video, ensuring that the Internet continues to be open to innovation.

path: /misc | Current Weblog | permanent link to this entry

Sat, 06 Mar 2010

Robocopy

If you use Microsoft Windows (XP or some later version), and don’t have an allergic reaction to the command line, you should know about Robocopy. Robocopy (“robust file copy”) is a command-line program from Microsoft that copies collections of files from one place to another in an efficient way. Robocopy is included in Windows Vista, Windows 7, and Windows Server 2008. Windows XP and Windows Server 2003 users can download Robocopy for free from Microsoft as part of the Windows Server 2003 “Resource Kit Tools”.

Robocopy copies files, like the COPY command, but Robocopy will only copy a file if the source and destination have different time stamps or different file sizes. Robocopy is nowhere near as capable as the Unix/Linux “rsync” command, but for some tasks it suffices. Robocopy will not copy files that are currently open (by default it will repeatedly retry copying them), it can only do one-way mirroring (not bi-directional synchronization), it can only copy mounted filesystems, and it’s foolish about how it copies across a network (it copies the whole file, not just the changed parts). Anyway, you invoke it at the command line like this:

ROBOCOPY Source Destination OPTIONS

So, here’s an example of copying everything from “c:\data” to “u:\data”:

 robocopy c:\data u:\data /MIR /NDL /R:20

To do this on an automated schedule in Windows XP, put your commands into a text file with a name ending in “.bat” and select Control Panel-> Scheduled Tasks-> Add Scheduled Task. Select your text file to run, have it run “daily”. You would think that you can’t run it more than once a day this way, but that’s actually not true. Click on “Open advanced properties for this task when I click Finish” and then press Finish. Now select the “Schedule” tab. Set it to start at some time when you’re probably using the computer, click on “Advanced”, and set “repeat task” so it will run (say, every hour with a duration of 2 hours). Then click on “Show multiple schedules”, click “new”, and then select “At system startup”. Now it will make copies on startup AND every hour. You may want to go to the “Settings” tab and tweak it further. You can use Control Panel-> Scheduled tasks to change the schedule or other settings.
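
For example, the scheduled task might simply run a one-line batch file like this sketch (the file name mirror-data.bat is just an illustration; it reuses the sample command from above):

 rem mirror-data.bat -- one-way mirror of c:\data to u:\data
 robocopy c:\data u:\data /MIR /NDL /R:20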

A GUI for Robocopy is available. An alternative to Robocopy is SyncToy; SyncToy has a GUI, but Microsoft won’t support it, I’ve had reliability and speed problems with it, and SyncToy has a nasty bug in its “Echo” mode… so I don’t use it. I suspect the Windows Vista and Windows 7 synchronization tools might make Robocopy less useful, but I find that the Windows XP synchronization tools are terrible… making Robocopy a better approach. There are a boatload of applications out there that do one-way or two-way mirroring, including ports of rsync, but getting them installed in some security-conscious organizations can be difficult.

Of course, if you’re using Unix/Linux, then use rsync and be happy. Rsync usually comes with Unix/Linux, and rsync is leaps-and-bounds better than robocopy. But not everyone has that option.
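
For comparison, a rough rsync equivalent of the robocopy mirror example above would look something like this (a sketch; the paths are placeholders, and the trailing slash on the source tells rsync to copy the directory’s contents rather than the directory itself):

 rsync -a --delete /data/ /mnt/backup/data/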

path: /misc | Current Weblog | permanent link to this entry

Sat, 02 May 2009

Own your own site!

Geocities, a web hosting site sponsored by Yahoo, is shutting down. Which means that, barring lots of work by others, all of its information will be disappearing forever. Jason Scott is trying to coordinate efforts to archive GeoCities’ information, but it’s not easy. He estimates they’re archiving about 2 Gigabytes/hour, pulling in about 5 Geocities sites per second… and they don’t know if it’ll be enough. What’s more, the group has yet to figure out how to serve it: “It is more important to me to grab the data than to figure out how to serve it later…. I don’t see how the final collection won’t end up online, but how is elusive…”

This sort of thing happens all the time, sadly. Some company provides a free service for your site / blog / whatever… and so you take advantage of it. That’s fine, but if you care about your site, make sure you own your data sufficiently so that you can move somewhere else… because you may have to. Yahoo is a big, well-known company, who paid $3.5 billion for Geocities… and now it’s going away.

Please own your own site — both its domain name and its content — if it’s important to you. I’ve seen way too many people have trouble with their sites because they didn’t really own them. Too many scams are based on folks who “register” your domain for you, but actually register it in their own names… and then hold your site as a hostage. Similarly, many organizations provide wonderful software that is unique to their site for managing your data… but then you either can’t get your own data, or you can’t use your data because you can’t separately get and re-install the software to use it. Using open standards and/or open source software can help reduce vendor lock-in — that way, if the software vendor/website disappears or stops supporting the product/service, you can still use the software or a replacement for it. And of course, continuously back up your data offsite, so if the hosting service disappears without notice, you still have your data and you can get back on.

I practice what I preach. My personal site, www.dwheeler.com, has moved several times, without problems. I needed to switch my web hosting service (again) earlier in 2009, and it was essentially no problem. I just used “rsync” to copy the files to my new hosting service, changed the domain information so people would use the new hosting service instead, and I was up and running. I’ve switched web servers several times, but since I emphasize using ordinary standards like HTTP, HTML, and so on, I haven’t had any trouble. The key is to (1) own the domain name, and (2) make sure that you have your data (via backups) in a format that lets you switch to another provider or vendor. Do that, and you’ll save yourself a lot of agony later.

path: /misc | Current Weblog | permanent link to this entry

Thu, 08 Jan 2009

Updating cheap websites with rsync+ssh

I’ve figured out how to run and update cheap, simple websites using rsync and ssh and Linux. I thought I’d share that info here, in case you want to copy my approach.

My site (www.dwheeler.com) is an intentionally simple website. It’s simply a bunch of directories with static files; those files may contain Javascript and animated GIFs, but site visitors aren’t supposed to cause them to change. Programs to manage my site (other than the web server) are run before the files are sent to the server. Most of today’s sites can’t be run this way… but when you can do this, the site is much easier to secure and manage. It’s also really efficient (and thus fast). Even if you can’t run a whole site this way, if you can run a big part of it this way, you can save yourself a lot of security, management, and performance problems.

This means that I can make arbitrary changes to a local copy of the website, and then use rsync+ssh to upload just those changes. rsync is a wonderful program, originally created by Andrew Tridgell, that can copy a directory tree to and from remote directory trees, but only send the changes. The result is that rsync is a great bandwidth-saver.

This approach is easy to secure, too. Rsync uses ssh to create the connection, so people can’t normally snoop on the transfer, and redirecting DNS will be immediately noticed. If the website is compromised, just reset it and re-send a copy; as long as you retain a local copy, no data can be permanently lost. I’ve been doing this for years, and been happy with this approach.

On a full-capability hosting service, using rsync is easy. Just install rsync on the remote system (typically using yum or apt-get), and run:

 rsync -a LOCALDIR REMOTENAME@REMOTESITE:REMOTEDIR

Unfortunately, at least some of the cheap hosting services available today don’t make this quite so easy. The cheapest hosting services are “shared” sites that share resources between many users without using full operating system or hardware virtualization. I’ve been looking at a lot of the cheap Linux web hosting services such as WebhostGIANT, Hostmonster, Hostgator, and Bluehost. It appears that at least some of these hosting companies improve their security by greatly limiting the access granted to you via the ssh/shell interface. I know that WebhostGIANT is an example, but I believe there are many such examples. So, even if you have ssh access on a Linux system, you may only get a few commands you can run like “mv” and “cp” (and not “tar” or “rsync”). You could always ask the hosting company to install programs, but they’re often reluctant to add new ones. But… it turns out that you can use rsync and other such programs, without asking them to install anything, at least in some cases. I’m looking for new hosting providers, and realized (1) I can still use this approach without asking them to install anything, but (2) it requires some technical “magic” that others might not know. So, here’s how to do this, in case this information/example helps others.

Warning: Complicated technical info ahead.

I needed to install some executables, and rather than recompiling my own, I grabbed pre-compiled executables. To do this, I found out the Linux distribution used by the hosting service (in the case of WebhostGIANT, it’s CentOS 5, so all my examples will be RPM-based). On my local Fedora Linux machine I downloaded the DVD “.iso” image of that distro, and did a “loopback mount” as root so that I could directly view its contents:

 cd /var/www     # Or wherever you want to put the ISO.
 wget ...mirror location.../CentOS-5.2-i386-bin-DVD.iso
 mkdir /mnt/centos-5.2
 mount CentOS-5.2-i386-bin-DVD.iso /mnt/centos-5.2 -o loop
 # Get ready to extract some stuff from the ISO.
 cd
 mkdir mytemp
 cd mytemp

Now let’s say I want the program “nice”. On a CentOS or Fedora machine you can determine the package that “nice” is in using this command:

 rpm -qif `which nice`

This will show that nice is in the “coreutils” package. You can extract “nice” from its package by doing this:

 rpm2cpio /mnt/centos-5.2/CentOS/coreutils-5.97-14.el5.i386.rpm | \
   cpio --extract --make-directories

Now you can copy it to your remote site. Presuming that you want the program to go into the remote directory “/private/”, you can do this:

 scp -p ./usr/bin/nice MY_USERID@MY_REMOTE_SITE:/private/

Now you can run /private/nice, and it works as you’d expect. But what about rsync? Well, if you try to do the same thing with rsync and run it, it will complain with an error message saying that it can’t find another library (libpopt in this case). The issue is that cheap web hosting services often don’t provide a lot of libraries, and they won’t let you install new libraries in the “normal” places. Are we out of luck? Not at all! We could just recompile the program statically, so that the library is embedded in the file, but we don’t even have to do that. We just need to upload the needed library to a different place, and tell the remote site where to find the library. It turns out that the program “/lib/ld-linux.so” has an option called “--library-path” that is specially designed for this purpose. ld-linux.so is the loader (the “program for running programs”), which you don’t normally invoke directly, but if you need to add library paths, it’s a reasonable way to do it. (Another way is to use LD_LIBRARY_PATH, but that requires that the string be interpreted by a shell, which doesn’t always happen.) So, here’s what I did (more or less).

First, I extracted the rsync program and necessary library (popt) on the local system, and copied them to the remote system (to “/private”, again):

 rpm2cpio /mnt/centos-5.2/CentOS/rsync-2.6.8-3.1.i386.rpm | \
   cpio --extract --make-directories
 # rsync requires popt:
 rpm2cpio /mnt/centos-5.2/CentOS/popt-1.10.2-48.el5.i386.rpm | \
   cpio --extract --make-directories
 scp -p ./usr/bin/rsync ./usr/lib/libpopt.so.0.0.0 \
        MY_USERID@MY_REMOTE_SITE:/private/

Then, I logged into the remote system using ssh, and added symbolic links as required by the normal Unix/Linux library conventions:

 ssh MY_USERID@MY_REMOTE_SITE
 cd /private
 ln -s libpopt.so.0.0.0 libpopt.so 
 ln -s libpopt.so.0.0.0 libpopt.so.0

Now we’re ready to use rsync! The trick is to tell the local rsync where the remote rsync is, using “--rsync-path”. That option’s contents must invoke ld-linux.so to tell the remote system where the additional library path (for libpopt) is. So here’s an example, which copies files from the directory LOCAL_HTTPDIR to the directory REMOTE_HTTPDIR:

rsync -a \
 --rsync-path="/lib/ld-linux.so.2 --library-path /private /private/rsync" \
 LOCAL_HTTPDIR REMOTENAME@REMOTESITE:REMOTE_HTTPDIR

There are a few ways we can make this nicer for everyday production use. If the remote server is a cheap shared system, we want to go very easy on its CPU and bandwidth use (or we’ll get thrown off it!). The “nice” command (installed by the steps above) will reduce CPU use on the remote web server when running rsync. There are several rsync options that can help, too. The “--bwlimit=KBPS” option will limit the bandwidth used. The “--fuzzy” option will reduce bandwidth use if there’s a similar file already on the remote side. The “--delete” option is probably a good idea; this means that files deleted locally are also deleted remotely. I also suggest “--update” (this will avoid updating remote files if they have a newer timestamp) and “--progress” (so you can see what’s happening). Rsync is able to copy hard links (using “-H”), but that takes more CPU power; I suggest using symbolic links and then not invoking that option. You can enable compression too, but that’s a trade-off; compression will decrease bandwidth but increase CPU use. So our final command looks like this:

rsync -a --bwlimit=100 --fuzzy --delete --update --progress \
 --rsync-path="/private/nice /lib/ld-linux.so.2 --library-path /private /private/rsync" \
 LOCAL_HTTPDIR REMOTENAME@REMOTESITE:REMOTE_HTTPDIR

Voila! Store that script in some easily-run place. Now you can easily update your website locally and push it to the actual webserver, even on a cheap hosting service, with very little bandwidth and CPU use. That’s a win-win for everyone.

path: /misc | Current Weblog | permanent link to this entry

Mon, 19 May 2008

YEARFRAC Incompatibilities between Excel 2007 and OOXML (OXML)

In theory, the OOXML (OXML) specification is supposed to define what Excel 2007 reads and writes. In practice, it’s not true at all; the latest public drafts of OOXML are unable to represent many actual Excel 2007 files.

For example, at least 26 Excel financial functions depend on a parameter called “Basis”, which controls how the calendar is interpreted. The YEARFRAC function is a good example of this; it returns the fraction of years between two dates, given a “basis” for interpreting the calendar. Errors in these functions can have large financial consequences.
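
To make the role of “basis” concrete, here is a small illustration in Excel formula syntax (a sketch: the basis meanings listed are the ones commonly documented for Excel, and the results shown are simply what the straightforward day-count arithmetic gives). The commonly documented basis values are 0 = US (NASD) 30/360 (also the default when the argument is omitted), 1 = actual/actual, 2 = actual/360, 3 = actual/365, and 4 = European 30/360.

 =YEARFRAC(DATE(2007,1,1), DATE(2008,1,1), 3)   basis 3, actual/365: 365/365 = 1.0
 =YEARFRAC(DATE(2007,1,1), DATE(2008,1,1), 2)   basis 2, actual/360: 365/360, approximately 1.014

It is the common basis values (omitted, 0, 1, or 4), the 30/360 and actual/actual variants, where, as described below, Excel 2007’s actual behavior and OOXML’s definitions part ways.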

I’ve posted a new document, YEARFRAC Incompatibilities between Excel 2007 and OOXML (OXML), and the Definitions Actually Used by Excel 2007 ([OpenDocument version]), which shows that the definitions of OOXML and Excel 2007 aren’t the same at all. “This document identifies incompatibilities between the YEARFRAC function, as implemented by Microsoft Excel 2007, compared to how it is defined in the Office Open Extensible Mark-up Language (OOXML), final draft ISO/IEC 29500-1:2008(E) as of 2008-05-01 (aka OXML). It also identifies the apparent definitions used by Excel 2007 for YEARFRAC, which to the author’s knowledge have never been fully documented anywhere. They are not defined in the OOXML specification, because OOXML’s definitions are incompatible with the apparent definition used by Excel 2007.”

“This incompatibility means that, given OOXML’s current definition, OOXML cannot represent any Excel spreadsheet that uses financial functions using “basis” date calculations, such as YEARFRAC, if they use common “basis” values (omitted, 0, 1, or 4). Excel functions that depend upon “basis” date calculations include: ACCRINT, ACCRINTM, AMORDEGRC, AMORLINC, COUPDAYBS, COUPDAYS, COUPDAYSNC, COUPNCD, COUPNUM, COUPPCD, DISC, DURATION, INTRATE, MDURATION, ODDFPRICE, ODDFYIELD, ODDLPRICE, ODDLYIELD, PRICE, PRICEDISC, PRICEMAT, RECEIVED, YEARFRAC, YIELD, YIELDDISC, and YIELDMAT (26 functions).”

I have much more information about YEARFRAC if you want it.

path: /misc | Current Weblog | permanent link to this entry

Fri, 09 May 2008

Bilski: Information is physical!?

The US Court of Appeals for the Federal Circuit in Washington, DC just heard arguments in the Bilski case, where the appellant (Bilski) is arguing that a completely mental process should get a patent. The fact that this was even entertained demonstrates why the patent system has truly descended into new levels of madness. At least the PTO rejected the application; the problem is that the PTO now allows business method patents and software patents. Once they allowed them, there’s no rational way to say “stop! That’s ridiculous!” without being arbitrary.

Mr. David Hanson (Webb Law Firm) argued for the appellant (Bilski), and got peppered with questions. “Is a curve ball patentable?”, for example. At the end, he finally asked the court to think of “information as physical”; it is therefore tangible and can be transformed.

That is complete lunacy, and it clearly demonstrates why the patent office is in real trouble.

Information is not physical, it is fundamentally different, and that difference has been understood for centuries. If I give you my car, I no longer have that car. If I give you some information, I still have the information. That is a fundamental difference between information and physical objects, and always has been. The fact that Bilski’s lawyer can’t understand this difference shows why our patent office is so messed up.

This fundamental difference between information and physical objects was well-understood by the U.S. founding fathers. Here’s what Thomas Jefferson said: “That ideas should freely spread from one to another over the globe, for the moral and mutual instruction of man, and improvement of his condition, seems to have been peculiarly and benevolently designed by nature, when she made them, like fire, expansible over all space, without lessening their density at any point, and like the air in which we breathe, move, and have our physical being, incapable of confinement or exclusive appropriation. Inventions then cannot, in nature, be a subject of property.” Thomas Jefferson was a founder, and an inventor. No, they didn’t have computers then, but computers merely automate the processing of information; the essential difference between information and physical/tangible objects was quite clear then.

Our laws need to distinguish between information and physical objects, because they have fundamentally different characteristics.

Basically, by failing to understand the differences, the PTO let in software patents and business method patents, which have been grossly harmful to the United States.

Even if you thought they were merely “neutral”, that’s not enough. There’s a famous English speech about the trade-offs of copyright law, whose principles also apply here: “It is good that authors should be remunerated; and the least exceptionable way of remunerating them is by a monopoly. Yet monopoly is an evil. For the sake of the good we must submit to the evil; but the evil ought not to last a day longer than is necessary for the purpose of securing the good.” - Thomas Babington Macaulay, speech to the House of Commons, February 5, 1841.

I believe that software patents need to be abolished, pronto. As I’ve discussed elsewhere, software patents harm software innovation, not help it.

But here in the Bilski case we see why some people have managed to sneak software patents into the patent process. In short, too many people do not understand the fundamental differences between information and physical objects. People whose thinking is that fuzzy are easily duped. Though clearly many people aren’t as confused as Bilski’s lawyer, I think too many people in the patent process have become so confused about the difference between physical objects and information that they don’t understand why software patents are a serious problem. Patents should only apply to processes that directly change physical objects, and their scope should only cover the specifics of those changes. I add that latter part because yes, changing the number on a display does change something physical, but that is irrelevant. If you have a wholly new process for making displays (say, using a new chemical compound), that could be patentable, but changing a “5” to a “6” should not be patentable because “changing a 5 to a 6” is not fundamentally a change in nature. Taking something unpatentable and adding the phrase “doing it with a computer” should not change an unpatentable invention into a patentable one; the Supreme Court understood that, but the PTO still fails to understand that.

I think pharmaceutical companies are afraid of any patent reform laws, because they’re afraid that a change in the patent system might hurt them. But if the patent system isn’t fixed - by eliminating business method patents and software patents - the entire patent system might become too overwhelmed to function, and thus eventually scrapped. I don’t know if pharma patents are more help than hindrance; I’m not an expert in that area. But I make my living with software, and it’s obvious to me (and most other software practitioners) that software patents and business patents are becoming a massive drag on innovation. If we can’t fix the patent system, we’ll have to abolish the patent system completely. A lot of lawyers will be unhappy if the patent system is eliminated, but there are more non-lawyers than lawyers. If the pharma companies want to have a working patent system, then they’ll need to help rein in patents in other areas, or the whole system may collapse.

path: /misc | Current Weblog | permanent link to this entry

Fri, 21 Mar 2008

Microsoft Office XML (OOXML) massively defective

Robert Weir has been analyzing Microsoft’s Office XML spec (aka OOXML) to determine how defective it is, with disturbing results.

Most standards today are relatively small, build on other standards, and are developed publicly over time with lots of opportunity for correction. Not OOXML; Ecma submitted Office Open XML for “Fast Track” as a massive 6,045 page specification, developed in an absurdly rushed way, behind closed doors, using a process controlled by a single vendor. It’s huge primarily because it does everything in a non-standard way, instead of referring to other standards where practical as standards are supposed to do (e.g., for mathematical equations they created their own incompatible format instead of using the MathML standard). All by itself, its failure to build on other standards should have disqualified OOXML, but it was accepted for review anyway, and what happened next was predictable.

No one can seriously review such a massive document in a short time, though ISO tried; ISO’s process did find 3,522 defects. It’s not at all clear that the defects were fixed - there’s been no time to really check, because the process for reviewing the standard simply wasn’t designed to handle that many defects. But even if they were fixed - a doubtful claim - Robert Weir has asked another question, “did they find nearly all of the defects?”. The answer is: Almost all of the original defects remain. By sampling pages, he’s found error after error, none of which were found by the ISO process. The statistics from the sample are very clear: practically all serious errors have not been found. It’s true that good standards sometimes have a few errors left in them, after review, but this isn’t “just a few errors”; these clearly show that the specification is intensely defect-ridden. Less than 2% of the defects have been found, by the data we have so far, which suggests that there are over 172,000 important defects (49x3522) left to find. That’s ridiculous.

Want more evidence that it’s defect-ridden? Look at Inigo Surguy’s “Technical review of OOXML”, where he examines just the WordProcessingML section’s 2300 XML examples. He wrote code to check for well-formedness and validation errors, and found that more than 10% (about 300) were in error even given this trivial test. Conclusion? “While a certain number of errors is understandable in any large specification, the sheer volume of errors indicates that the specification has not been through a rigorous technical review before becoming an Ecma standard, and therefore may not be suitable for the fast-track process to becoming an ISO standard.” This did not include the other document sections, and this error rate is only a lower bound (XML could validate and still be in error). (He also confirmed that Word 2007 does not implement the extensibility requirements of the Ecma specification, so as a result it would be hard to “write an interoperable word processor with Word” using OOXML.)
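
(Checks of this kind are easy to reproduce; here is a sketch, assuming each XML example has been extracted into its own file and that the common xmllint tool is available. The file names below are placeholders, not files from Surguy’s review.)

 xmllint --noout example-0042.xml                                  # well-formedness check only
 xmllint --noout --schema wordprocessingml.xsd example-0042.xml    # also validate against a schema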

I think that all by itself, this vast number of errors in OOXML proves that the “Fast Track” process is completely inappropriate for OOXML. The “Fast Track” process was intended to be used when there was already a widely-implemented, industry-accepted standard that had already had its major problems addressed. That’s just not the case here.

These huge error rates were predictable, too. The committee for creating OOXML wasn’t even created until OpenDocument was complete, so they had to do a massive rush job to produce anything. ( Doug Mahugh admitted that “Microsoft… had to rush this standard through.”) They didn’t reuse existing mature standards, so they ended up creating much more work for themselves. Most developers (who could have helped find and fix the defects) stayed away from the Ecma process in the first place; its rules gave one vendor complete control over what was allowed, and there was already a vendor-independent standard in place, which gave most experts no reason to participate. The Ecma process was also almost entirely closed-door (OpenDocument’s mailing lists are public, in contrast), which predictably increased the error rate too.

The GNOME Foundation has been involved in OOXML’s development, and here’s what they say in the GNOME Foundation Annual Report 2007: “The GNOME Foundation’s involvement in ECMA TC45-M (OOXML) was the main discussion point during the last meeting…. [the] Foundation does not support this file format as the main format or as a standard…” I don’t think this is as widely touted as it should be. Here’s an organization directly involved in OOXML development, and it thinks OOXML should not be a standard at all.

India has already voted “no” to OOXML. I hope others do the same. Countries with the appropriate rights have until March 29 to decide. It’s quite plausible that the final vote will be “no”, and indeed, based on what’s published, it should be “no”. Open Malaysia reported on the March 2008 BRM meeting, for example. It reports that everybody “did their darnest to improve the spec… The final day was absolute mayhem. We had to submit decisions on over 500 items which we hadn’t [had] the time to review. All the important issues which have been worked on repeatedly happened to appear on this final day. So it was non-stop important matters… It was a failure of the Fast Track process, and Ecma for choosing it. It should have been obvious to the administrators that submitting a 6000+ page document which failed the contradiction period, the 5 month ballot vote and poor resolution dispositions, should be pulled from the process. It should have been blatantly obvious that if you force National Bodies to contribute in the BRM and end up not deliberating on over 80% of their concerns, you will make a lot of people very unhappy… judging from the reactions from the National Bodies who truly tried to contribute on a positive manner, without having their concerns heard let alone resolved, they leave the BRM with only one decision in their mind come March 29th. The Fast Tracking process is NOT suitable for ISO/IEC DIS 29500. It will fail yet again. And this time it will be final.”

In my opinion, the OOXML specification should not become an international standard, period. I think it clearly doesn’t meet the criteria for “fast track” - but more importantly, it doesn’t meet the needs for being a standard at all. It completely contradicts the goal of “One standard, one test - Accepted everywhere”, and it simply is not an open standard. I’ve blogged before that having multiple standards for office documents is a terrible idea. There’s nothing wrong with a vendor publishing their internal format; in fact, ISO’s “type 2 technical report” or “ISO agreement” are pre-existing mechanisms for documenting the format of a single vendor and product line specification. But when important data is going to be exchanged between parties, it should be exchanged using an open standard. We already have an open standard for office documents that was developed by consensus and implemented by multiple vendors: OpenDocument (ISO/IEC 26300). For more clarification about what an open standard is, or why OpenDocument is an open standard, see my essay “Is OpenDocument an Open Standard? Yes!” OpenDocument works very well; I use it often. In contrast, it seems clear that OOXML will never be a specification that everyone can fully implement. Its technical problems alone are serious, but even more importantly, the Software Freedom Law Center’s “Microsoft’s Open Specification Promise: No Assurance for GPL” makes it clear that OOXML cannot be legally implemented by anyone using any license. And this matters greatly.

Andy Updegrove calls for recognition of “Civil ICT Standards”, which I think helps put this technical stuff into a broader and more meaningful context. He notes that in our new “interconnected world, virtually every civic, commercial, and expressive human activity will be fully or partially exercisable only via the Internet, the Web and the applications that are resident on, or interface with, them. And in the third world, the ability to accelerate one’s progress to true equality of opportunity will be mightily dependent on whether one has the financial and other means to lay hold of this great equalizer… [and thus] public policy relating to information and communications technology (ICT) will become as important, if not more, than existing policies that relate to freedom of travel (often now being replaced by virtual experiences), freedom of speech (increasingly expressed on line), freedom of access (affordable broadband or otherwise), and freedom to create (open versus closed systems, the ability to create mashups under Creative Commons licenses, and so on)… This is where standards enter the picture, because standards are where policy and technology touch at the most intimate level. Much as a constitution establishes and balances the basic rights of an individual in civil society, standards codify the points where proprietary technologies touch each other, and where the passage of information is negotiated… what will life be like in the future if Civil ICT Rights are not recognized and protected, as paper and other fixed media disappear, as information becomes available exclusively on line, and as history itself becomes hostage to technology? I would submit that a vote to adopt OOXML would be a step away from, rather than a way to advance towards, a future in which Civil ICT Rights are guaranteed”.

Ms. Geraldine Fraser-Moleketi, Minister of Public Service and Administration, South Africa, gave an interesting presentation at the Idlelo African Conference on FOSS and the Digital Commons. She said, “The adoption of open standards by governments is a critical factor in building interoperable information systems which are open, accessible, fair and which reinforce democratic culture and good governance practices. In South Africa we have a guiding document produced by my department called the Minimum Interoperability Standards for Information Systems in Government (MIOS). The MIOS prescribes the use of open standards for all areas of information interoperability, including, notably, the use of the Open Document Format (ODF) for exchange of office documents… It is unfortunate that the leading vendor of office software, which enjoys considerable dominance in the market, chose not to participate and support ODF in its products, but rather to develop its own competing document standard which is now also awaiting judgement in the ISO process. If it is successful, it is difficult to see how consumers will benefit from these two overlapping ISO standards… The proliferation of multiple standards in this space is confusing and costly.” She also said, “One cannot be in Dakar without being painfully aware of the tragic history of the slave trade… As we find ourselves today in this new era of the globalised Knowledge Economy there are lessons we can and must draw from that earlier era. That a crime against humanity of such monstrous proportions was justified by the need to uphold the property rights of slave owners and traders should certainly make us more than a little cautious about what should and should not be considered suitable for protection as property.”

You can get more detail from the Groklaw ODF-MSOOXML main page, but I think the point is clear. The world doesn’t need the confusion of a specification controlled by a single vendor being labelled as an international standard. NoOOXML has a list of reasons to reject OOXML.

path: /misc | Current Weblog | permanent link to this entry

Thu, 06 Dec 2007

Readable s-expressions (sweet-expressions) draft 0.2 for Lisp-like languages

Back in 2006 I posted my basic ideas about “sweet-expressions”. Lisp-based programming languages normally represent programs as s-expressions, and though they are regular, most people find them hard to read. I hope to create an alternative to s-expressions that has their advantages but not their disadvantages. You can see more at my readable Lisp page. I’ve gotten lots of feedback, based both on my prototype of the idea and on the mailing list discussing it.

I’ve just posted a draft of version 0.2 of sweet-expressions. This takes that feedback into account; in particular, it’s now much more backwards-compatible. There’s still a big question about whether or not infix should be a default; see the page for more details.

Here are the improvements over version 0.1:

  1. This version is much more compatible with existing Lisp code. The big change is that an unprefixed “(” immediately calls the underlying s-expression reader. This way, people can quietly replace their readers with a sweet-reader, without harming most existing code. In fact, many implementations could quietly switch to a sweet-reader and users might not notice until they use the new features. Instead of using (…), it uses {…} and […] for grouping expressions without disabling sweet-expressions.
  2. It can work more cleanly with macros that provide infix precedence (for those who want precedence rules).
  3. It extends version 0.1’s “name-prefixing” into “term-prefixing”. This is not only more general, it also makes certain kinds of functional programming much more pleasant.
  4. It adds syntax for the common case of accessing maps (such as indexed or associative arrays) - now a[j] is translated into (bracketaccess a j).
  5. Infix default supports arbitrarily-spelled infix operators, and it automatically accepts “and” and “or”. (A few of these changes are illustrated in the short sketch after this list.)
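
To make a few of these changes concrete, here are some illustrative translations of my own (the names a, j, x, and y are just placeholders; the draft itself is authoritative on the details):

 (+ 1 2)      ; an unprefixed "(" is read as a plain s-expression: (+ 1 2)
 a[j]         ; is read as (bracketaccess a j)
 {x and y}    ; with the infix default, "and" is an infix operator: (and x y)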

Here’s an example of (ugly) s-expressions:

 (defun factorial (n)
   (if (<= n 1)
       1
       (* n (factorial (- n 1)))))

Here’s sweet-expressions version 0.1:

 defun factorial (n)
   if (n <= 1)
       1
       n * factorial(n - 1)

Here is sweet-expressions version 0.2 (draft), with infix default (it figures out when you have an infix operation from the spelling of the operation):

 defun factorial (n)
   if {n <= 1}
       1
       n * factorial(n - 1)

Here is sweet-expressions version 0.2 (draft), with infix non-default (you must surround every infix operator with {…}):

 defun factorial (n)
   if {n <= 1}
       1
       {n * factorial{n - 1}}

I’m still taking comments. If you’re interested, take a look at http://www.dwheeler.com/readable. And if you’re really interested, please join the readable-discuss mailing list.

path: /misc | Current Weblog | permanent link to this entry

Sun, 04 Nov 2007

Added “MapReduce” to the “Software Innovations” list

Ken Krugler’s recent blog said that my article The Most Important Software Innovations was “very good”, but he was surprised that I hadn’t included MapReduce as an important software innovation. Basically, MapReduce makes writing certain kinds of programs that process huge amounts of data, on vast distributed clusters, remarkably easy and efficient. (Wikipedia explains MapReduce, including links to alternative implementations like the open source Hadoop.)
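
To give a feel for the programming model - and nothing more; this is not Google’s implementation, Hadoop’s API, or anything distributed - here is a toy, purely sequential word-count sketch in Scheme. All of the helper names are my own:

 ; A toy, purely sequential illustration of the MapReduce programming model.
 ; The "map" step turns each document (here, a list of word symbols) into
 ; (word . 1) pairs; the "reduce" step sums the counts gathered per word.
 (define (map-step document)
   (map (lambda (word) (cons word 1)) document))

 (define (reduce-step word counts)
   (cons word (apply + counts)))

 ; Group (key . value) pairs into an alist of (key . list-of-values) -
 ; the "shuffle" that a real MapReduce framework does for you, in parallel.
 (define (group-by-key pairs)
   (let loop ((pairs pairs) (groups '()))
     (if (null? pairs)
         groups
         (let* ((key (caar pairs))
                (val (cdar pairs))
                (entry (assoc key groups)))
           (loop (cdr pairs)
                 (if entry
                     (begin (set-cdr! entry (cons val (cdr entry))) groups)
                     (cons (cons key (list val)) groups)))))))

 (define (word-count documents)
   (map (lambda (group) (reduce-step (car group) (cdr group)))
        (group-by-key (apply append (map map-step documents)))))

 ; (word-count '((to be or not to be) (to see or not)))
 ;   => ((see . 1) (not . 2) (or . 2) (be . 2) (to . 3))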

It’s not because I didn’t know about MapReduce; I read about it almost immediately after it was published. I thought it was very promising, and even forwarded the original paper to some co-workers. I think MapReduce is especially promising because, now that we have cheap commodity computers, having a way to easily exploit their capabilities is really valuable. But even with something this promising, I didn’t want to add it to my list of innovations right away - after all, maybe after a little while it would turn out to be not so helpful.

Currently, there aren’t many who have Google-sized clusters of computers available. But it’s clear that this approach is useful in many other circumstances as well. It’s new, but I think it’s stood the test of time enough that it’s a worthy addition… so I’ve just added it.

One interesting issue is that the MapReduce framework is itself built primarily on the “map” and “reduce” functions, which are far, far older. So, is MapReduce really a new idea, or is it just a high-quality implementation of an old idea? I’ll accept that it’s a new idea, but that can be difficult to judge. This judgment doesn’t really matter unless you think software patents are a good idea (since every software patent in theory prevents progress for 20 years). But I think it’s quite clear that software patents are a foolish idea, and it’s clear that others have come to the same conclusion. Eric S. Maskin, an economist who has long criticised the patenting of software, recently received the 2007 Nobel Prize for Economics. Here’s a nice quote: “… when patent protection was extended to software in the 1980s, […] standard arguments would predict that R&D intensity and productivity should have increased among patenting firms. Consistent with our model, however, these increases did not occur.” Someone who correctly predicted that software patents were harmful to innovation just received a Nobel prize. I hope to someday see people receive other prizes because they ended software patenting in the United States.

path: /misc | Current Weblog | permanent link to this entry

Mon, 22 Oct 2007

Donald Macleay

Don Macleay was my mentor and friend, and he just passed away (Oct. 15, 2007). So, this is a small blog entry in his honor.

Here’s what I said at his funeral: “In 1980, Don was the manager of a computer store. I was only 15, but he took a chance on employing me, and I’m grateful. He taught me much, in particular, showing by example that you could be in business (even as a salesman!) and be an honest person. He later moved to other companies, and I moved twice with him, because I found that good bosses were hard to find. Don was honest, reliable, a good friend, and an inspiration to me. I will miss him, and I look forward to seeing him again in heaven.” I should add that he spoke at my Eagle scout ceremony. Later on, when he moved out to the country, it was always a pleasure to visit him and his family.

Here’s a part of his biography, as printed in the funeral bulletin: “Born in Washington, D.C., on October 27, 1934, Donald Macleay was raised in Falls Church. He attained the rank of Eagle Scout and graduated at the top of the first class of St. Stephen’s School in 1952. In 1956, he graduated with a Bachelor of Arts in English from the Virginia Military Institute (VMI).

After serving as a Marine Corps officer, Donald Macleay spent many years in the business world before becoming a Parole Officer for the Department of Juvenile Justice in Stafford County. As well, in 1992, he was a candidate for the U.S. Congress as an Independent.”

The biography goes on to note that he “valued being a Christian, a husband, a father and grandfather, and a friend.” Much of his last years were spent helping troubled youth in his area (Fredericksburg, VA), and from all accounts he was extraordinarily successful at helping them and their families.

path: /misc | Current Weblog | permanent link to this entry

Thu, 11 Oct 2007

Readable s-expressions (sweet-expressions) for Lisp-like languages

Back in 2006 I posted my basic ideas about “sweet-expressions”. Here’s a basic recap, before I discuss what’s new. Lisp-based programming languages normally represent programs as s-expressions, where an operation and its parameters are surrounded by parentheses. The operation to be performed is identified first, and each parameter afterwards is separated by whitespace. So the traditional “2+3” is written as “(+ 2 3)” instead. This is regular, but most people find it hard to read. Here’s a longer example of an s-expression - notice the many parentheses and the lack of infix operations:

 (defun factorial (n)
   (if (<= n 1)
       1
       (* n (factorial (- n 1)))))

Lisp-based systems are very good at symbol manipulation tasks, including program analysis. But many software developers avoid Lisp-based languages, even in cases where they would be a good tool to use, because most developers find s-expressions really hard to read. I think I’ve found a better solution, which I call “sweet-expressions”. Here’s that same program written using sweet-expressions:

 defun factorial (n)         ; Parameters can be indented, but need not be
   if (n <= 1)               ; Supports infix, prefix, & function <=(n 1)
       1                     ; This has no parameters, so it's an atom.
       n * factorial(n - 1)  ; Function(...) notation supported

Sweet-expressions add the following abilities:

  1. Indentation. Indentation may be used instead of parentheses to start and end expressions: any indented line is a parameter of its parent, later terms on a line are parameters of the first term, lists of lists are marked with GROUP, and a function call with 0 parameters is surrounded or followed by a pair of parentheses [e.g., (pi) and pi()]. A “(” disables indentation until its matching “)”. Blank lines at the beginning of a new expression are ignored. A term that begins at the left edge and is immediately followed by newline is immediately executed, to make interactive use pleasant.
  2. Name-ending. Terms of the form ‘NAME(x y…)’, with no whitespace before ‘(’, are interpreted as ‘(NAME x y…)’. Parameters are space-separated inside. If the content is an infix expression, it’s considered one parameter instead (so f(2 + 3) computes 2 + 3 and passes its result, 5, to f). (A few example translations appear in the short sketch after this list.)
  3. Infix. Optionally, expressions are automatically interpreted as infix if their second parameter is an infix operator (by matching an “infix operator” pattern of symbols), the first parameter is not an infix operator, and it has at least three parameters. Otherwise, expressions are interpreted as normal “function first” prefix notation. To disable infix interpretation, surround the second parameter with as(…). Infix expressions must have an odd number of parameters with the even ones being the same binary infix operator. You must separate each infix operator with whitespace on both sides. Precedence is not supported; just use parens (a lot more about that in a moment). Use the “name-ending” form for unary operations, e.g., -(x) for “negate x”. Thus “2 + (y * -(x))” is a valid expression, equivalent to (+ 2 (* y (- x))). “(2 + 3 + 4)” is fine too. Infix operators must match this pattern (and in Scheme cannot be =>):
        [+-\*/<>=&\|\p{Sm}]{1-4}|\:
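
To make those rules concrete, here are a few example translations of my own (toy names; the demo reader and the longer paper are authoritative on the corner cases):

 foo bar            ; reads as (foo bar (baz qux)) - rule 1: the indented
   baz qux          ;   line is a parameter of its parent
 pi()               ; reads as (pi) - zero-parameter call (rule 1)
 f(2 + 3)           ; reads as (f (+ 2 3)), so f receives 5 (rule 2)
 2 + (y * -(x))     ; reads as (+ 2 (* y (- x))) (rule 3)
 (2 + 3 + 4)        ; reads as (+ 2 3 4) - chained infix (rule 3)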
    

For more information on sweet-expressions or on making s-expressions more readable in general, see my website page at http://www.dwheeler.com/readable. For example, I provide a demo sweet-expression reader in Scheme (under the MIT license), as well as an indenting pretty-printer in Common Lisp. In particular, you can see my lengthy paper about why sweet-expressions do what they do, and some plausible alternatives. You can also download some other implementation code. I’ve also set up a SourceForge project named “readable” to discuss options in making s-expressions more readable, and to distribute open source software to implement them (unimplemented ideas don’t go far!).

Okay, but all of that was true in 2006 - what’s new? What’s new is a change of heart about precedence. I’ve been occasionally trying to figure out how to “flesh out” sweet-expressions with operator precedence, and I just kept hitting one annoying complication after another. Operator precedence is nearly universal among programming languages; it’s very useful, and only a few infix-supporting languages (such as Smalltalk) lack it. “Everyone” knows that 2+3*4 is 14, not 20, because of years of training in math classes that you multiply before you add. Precedence is also pretty easy to code (it’s an old, solved problem). But I’ve discovered that in the typical use cases of a Lisp-like language’s expression reader, supporting precedence (in the general case) has some significant downsides that are irrelevant in other situations. That’s interesting, given how widespread precedence is elsewhere, so let’s see why.

First, let’s talk about a big advantage to not supporting precedence in sweet-expressions: It makes the creation of every new list obvious in the text. That’s very valuable in a list processing language; the key advantage of list processing languages is that you can process programs like data, and data like programs, in a very fluid way, so having clear markers of new lists using parentheses and indentation is very valuable.

Now let me note the downsides to supporting precedence in the specific cases of a Lisp-like language, which leads me to believe that it’s a bad idea for this particular use. Basically, adding precedence rules to a general-purpose list expression processor creates a slippery slope of complexity. There are two basic approaches to defining precedence: dynamic and static.

It’s easier to add precedence later, if that turns out to be important after more experimentation. But after the experimentation I’ve done so far, it appears that precedence simply isn’t worth it in this case. Precedence creates complexity here, and it hides where the lists begin and end. It’s not hard to work without it; you can even argue that (2 + (5 * 6)) is actually clearer than (2 + 5 * 6). Precedence is great in many circumstances - I’d hate to lose it in other languages - but in this particular set of use cases, it seems to hurt more than it helps.

Of course, you can write code in some Lisp dialect to implement a language that includes precedence. Many programs written in Lisp, including PVS and Maxima, do just that. But when you’re implementing another language, you know what the operators are, and you’re probably implementing other syntactic sugar too, so adding precedence is a non-problem. Also, if you’re really happy with s-expressions as they are, and just want precedence in a few places in your code, a simple macro to implement it (such as infpre) works very well. But sweet-expressions are intended to be a universal representation in Lisp-like languages, just like s-expressions are, so their role is different. In that role, precedence causes problems that don’t show up in most other uses. I think not supporting precedence turns out to be much better for this role.
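
For a sense of how little code such a macro needs, here is a minimal Scheme sketch of the general idea (my own toy, not the infpre library mentioned above). It deliberately has no precedence: arguments are combined strictly left to right, so you group explicitly when you want a different order.

 ; Minimal infix macro with no precedence: arguments are consumed strictly
 ; left to right, so (nfx 2 + 3 * 4) expands to (* (+ 2 3) 4) and yields 20.
 (define-syntax nfx
   (syntax-rules ()
     ((nfx x) x)
     ((nfx x op y rest ...) (nfx (op x y) rest ...))))

 ; (nfx 2 + 3 + 4)        => 9
 ; (nfx 2 + (nfx 3 * 4))  => 14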

Here are some more examples, this time in Scheme (another Lisp dialect). Each sweet-expression is immediately followed by the equivalent (ugly) s-expression:

 define factorial(n)
    if (n <= 1)
        1
        n * factorial(n - 1)
 (define (factorial n)
    (if (<= n 1)
        1
        (* n (factorial (- n 1)))))

 substring("Hello" (1 + 1) string-length("Hello"))
 (substring "Hello" (+ 1 1) (string-length "Hello"))

 define move-n-turn(angle)
    tortoise-move(100)
    tortoise-turn(angle)
 (define (move-n-turn angle)
    (tortoise-move 100)
    (tortoise-turn angle))

 if (0 <= 5 <= 10)
    display("True\n")
    display("Uh oh\n")
 (if (<= 0 5 10)
    (display "True\n")
    (display "Uh oh\n"))

 define int-products(x y)
   if (x = y)
     x
     x * int-products((x + 1) y)
 (define (int-products x y)
   (if (= x y)
       x
       (* x (int-products (+ x 1) y))))

 int-products(3 5)
 (int-products 3 5)

 (2 + 3 + (4 * 5) + 7.1)
 (+ 2 3 (* 4 5) 7.1)

 (2 + 3 + (4 / (5 * 6)))
 (+ 2 3 (/ 4 (* 5 6)))

 *(2 3 4 5)
 (* 2 3 4 5)

So I’ve modified my demo program so that it supports infix operator chaining, such as (2 + 3 + 4). Since I no longer need to implement precedence, the addition of chaining means that I now have a working model of the whole idea, ready for experimentation. My demo isn’t ready for “serious use” in development yet; it has several known bugs and weaknesses. But it’s good enough for experimentation, to see if the basic idea is sensible - and I think it is. You can actually sit down and play with it, and see if it has merit. There are still some whitespace rules I’d like to fiddle with, to make both long files and interactive use as comfortable as possible, but these are at the edges of the definition… not at its core.

I’m suggesting the use of && for “logical and”, and || for “logical or”. These are common symbols in other languages, and using the same symbols aids readability. Now, in Common Lisp and some Scheme implementations, || is “the symbol with 0-length name”. Oddly enough, this doesn’t seem to be a problem; Lisps can generally bind to the symbol with the 0-length name, and print it the same way, so it works perfectly well! In Scheme this is trivially done by running this:

 define(&& and)
 define(|| or)

Then you can do this:

 (#t && #t)
 if ((a > b) || ((a * 2) < (c + d + e))) ...

Instead of the hideous s-expressions:

 (and #t #t)
 (if (or (> a b) (< (* a 2) (+ c d e))) ...)
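
One caveat worth noting: in Schemes where “and” and “or” are syntactic keywords rather than procedures, a plain define like the one above may be rejected. In that case a tiny forwarding macro does the same job; here is a sketch, assuming (as discussed above) that your Scheme reads || as the zero-length symbol:

 ; Forwarding macros for Schemes where `and` and `or` cannot be used as
 ; ordinary values; they keep the usual short-circuit behavior.
 (define-syntax &&
   (syntax-rules ()
     ((&& e ...) (and e ...))))
 (define-syntax ||
   (syntax-rules ()
     ((|| e ...) (or e ...))))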

Here are some quotable quotes, by the way, showing that I’m not the only one who thinks there’s room for improvement:

Lisp-based languages are all over the place. There are a vast number of implementations of Common Lisp and Scheme. GNU guile is a Scheme implementation embedded in many other programs, for example. GNU emacs is a widely-used text editor, built on its own dialect of Lisp. AutoCAD has its own variant under the covers, too. Programs like PVS are implemented in Lisp, and interacting with them currently requires using s-expressions. It’d be great if all of these supported an alternative, simpler syntax. With sweet-expressions, typical s-expressions are legal too. So I think this is a widely-useful idea.

So if you’re interested, take a look at http://www.dwheeler.com/readable.

path: /misc | Current Weblog | permanent link to this entry