Curling Quotes in HTML and XML

David A. Wheeler

2025-10-09 (was 2002-12-10)

If you’re creating HTML or XML, write curling quotes directly as UTF-8-encoded characters.

Here’s a table showing what I mean:

Character                                                                             UTF-8
Left Double Quotation Mark (U+201C)                                                   “
Right Double Quotation Mark (U+201D)                                                  ”
Left Single Quotation Mark (U+2018)                                                   ‘
Right Single Quotation Mark (U+2019, including English possessives and contractions)  ’

History

I first wrote this page in 2002. At that time many incompatible character encodings were in wide use. A character encoding, if you do not know, is a system for converting human-readable text characters into numbers and back. Computers only work with bits, not characters, so there needs to be an agreement on how to encode characters so that they can be stored and displayed.
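To make the idea concrete, here is a minimal Python sketch (not part of the original 2002 page) that encodes one curling quote under a few of the encodings discussed on this page. The variable names are purely illustrative.

```python
# Encode the same character under several encodings to see how the bytes differ.
left_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

utf8_bytes = left_quote.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = left_quote.encode("cp1252")   # one byte in Windows-1252
print(utf8_bytes.hex(), cp1252_bytes.hex())

# Latin-1 (ISO-8859-1) has no curling quotes at all, so encoding fails.
try:
    left_quote.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding bytes with the wrong encoding yields garbage rather than an error.
misread = utf8_bytes.decode("cp1252")
print(repr(misread))
```

The last step shows why mixed encodings were painful: the bytes are valid in both encodings, so nothing fails, and the reader simply sees the wrong characters.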

The UTF-8 character encoding was created in 1992. However, by 2002, UTF-8 was used by less than 1% of all web (HTML) pages. HTML pages instead primarily used a hodgepodge of different and partly incompatible character encodings, including ASCII, Western European encodings like Latin-1, Shift JIS, Windows-1251, Windows-1252, and GB2312. The default character encoding for HTML 4.0 was ISO-8859-1, aka Latin-1, though an HTML page could specify something different, and many did. If you weren’t careful, it was often difficult to combine pages and fragments of pages because their character encodings were different.

Why didn’t everyone use UTF-8? Part of this came down to tooling. Many text editors and tools didn’t support UTF-8 natively in 2002. The first Linux distribution to use UTF-8 by default was Red Hat in 2002, and this new default was a big deal at the time. Debian’s default character encoding didn’t switch to UTF-8 until 2007. Even then, this was only the default; other character encodings could be used instead, and many were.

So this led to a common problem: how could you express curling quotes? These marks are also called “smart quotes,” “curly quotes,” “curled quotes,” “curling quotes,” or “curved quotes.” The good news was that most of these character encodings included ASCII as a subset, so you could use ASCII characters to specify the other characters that you wanted to display. This meant that the safest approach in 2002 was to use “decimal numeric character references” for curling single and double quote characters, since these could be expressed entirely in the widely supported ASCII subset. In other words, for left and right double quotation marks, the best approach at the time was to use &#8220; and &#8221;, and for left and right single quotation marks (and apostrophes), to use &#8216; and &#8217;. This approach complies with all international standards and worked essentially everywhere. It still works today, as I expected at the time.

Here’s a table showing what I recommended in 2002 given that historical circumstance:

To show                                                                       In HTML, SGML, or XML use  Displays on your system as
Left Double Quotation Mark                                                    &#8220;                    “
Right Double Quotation Mark                                                   &#8221;                    ”
Left Single Quotation Mark                                                    &#8216;                    ‘
Right Single Quotation Mark (including English possessives and contractions)  &#8217;                    ’
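As an illustration of the 2002-era approach in the table above, the following Python sketch converts curling quotes into decimal numeric character references and back. It relies on the standard library’s `xmlcharrefreplace` error handler and `html.unescape`; the sample sentence and variable names are my own assumptions, not something from the original page.

```python
import html

# Sample text containing all four curling quote characters.
text = "“It’s a ‘test’,” she said."

# Encoding to ASCII with xmlcharrefreplace turns every non-ASCII character
# into a decimal numeric character reference such as &#8220;.
ascii_html = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_html)

# html.unescape reverses the process, as a browser would when rendering.
round_trip = html.unescape(ascii_html)
print(round_trip == text)
```

The result is pure ASCII, which is why it survived being served or combined under almost any of the encodings in use at the time.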

What’s changed?

Thankfully, you no longer need to do things this way.

The UTF-8 character encoding is now broadly supported and used almost everywhere. Over time, practically all software tools in use have added support for UTF-8. The current version of HTML defaults to UTF-8, and almost all HTML pages are in UTF-8. XML supports UTF-8, too. In 2002 I noted that UTF-8 was an option, but I also noted that it was poorly supported at the time. I concluded in 2002, with regard to UTF-8, “for now, don’t do it.” I did make it clear that this was a “for now” statement. Thankfully, things have changed for the better since 2002.
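Here is a small sketch of what the current recommendation looks like in practice, written in Python for illustration. The page content and the `page` and `data` names are hypothetical; the point is only that the quotes are typed directly and the document is serialized as UTF-8, with a `<meta charset="utf-8">` declaration stating the encoding explicitly.

```python
# A minimal HTML page with curling quotes typed directly into the markup.
page = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Curling quotes</title>
</head>
<body>
<p>“It’s simple,” she said.</p>
</body>
</html>
"""

# Serialize as UTF-8; these are the bytes a server or file would hold.
data = page.encode("utf-8")

# The left double quotation mark appears as its three-byte UTF-8 sequence,
# and no numeric character references are needed anywhere.
print(b"\xe2\x80\x9c" in data, "&#" in page)
```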

Other Sources of Information

Markus Kuhn’s “ASCII and Unicode Quotation Marks” describes the general problem well.

Feel free to see my home page.