PEP: XXX Title: Include __cmp__ in Python 3000 Version: $Revision$ Last-Modified: 15-Jan-2008 Author: David A. Wheeler Status: Draft Type: Standards Track Content-Type: text/plain Created: 29-Oct-2007 Python-Version: 3.0 Post-History: Abstract Python has traditionally supported the comparison function __cmp__, but the current draft of Python 3.0 has removed support for __cmp__. This PEP proposes that __cmp__ be restored to Python 3.0+. Rationale The __cmp__ function makes it easy to define classes whose instances can be ordered, in a way that provides better performance, reduces the risk of error, and often involves simpler code than various alternatives. The __cmp__(x,y) function returns one of 3 values: -1 (xy). History There are many cases where objects are ordered and need to be compared, e.g., to determine if a < b. This includes sorting, where the comparison operation is embedded in the inner loop of the sort, so the comparison operation's performance can directly determine the performance of an entire program. Since many classes support ordering, making it easy to define and use comparisons is very valuable. Originally Python had only __cmp__(x,y) for comparisons, which returns negative if xy. This works well in nearly all cases, but it does not work when comparisons aren't symmetric (e.g., when x==y and x!=y can both produce False). Since IEEE floating point value comparisons aren't symmetric, Python grew specific comparison operators such as __lt__ (less-than) and __le__ (less-than-or-equal-to); specific comparison operators enable complete control over every comparison request. Python 3000 currently proposes supporting ONLY the specific comparison operators and losing __cmp__. This PEP argues that this is a mistake, and that __cmp__ should be kept in the language. Why losing __cmp__ is a problem Not having a __cmp__ operator can lead to gross inefficiencies. As noted by Guido van Rossum, "When implementing the '<' operator on lists or tuples, you really want to call the 'cmp' operator on the individual items, because otherwise (if all you have is == and <) the algorithm becomes something like 'compare for equality until you've found the first pair of items that are unequal; then compare those items again using < to decide the final outcome'. If you don't believe this, try to implement this operation using only == or < without comparing any two items more than once." Without a 'cmp' operator, doing comparisons becomes very inefficient when comparing large or complex data structures. In contrast, a single 'cmp' operation can be applied to each of the elements in turn and provide all the information required by a comparison. It would be possible to write a user-defined compare() function with cmp's semantics, and then write versions of __lt__, __le__, etc. to implement comparisons. However, this would add a good deal of boilerplate code and could reduce efficiency in comparisons - which may be used in the inner loops of programs. This "boilerplate" code could be stored in a mixin class, so at least it would not need to be rewritten, but using a mixin severely reduces efficiency (compared to having __cmp__ built into the language). One test of a mixin showed a test program taking 50% longer when __cmp__ is not built into the language. It would be possible to write implementations of __lt__, __le__, etc. for each class that are efficient, but this would require writing a great deal of "boilerplate" code for each of the various cases. All of which is completely unnecessary if __cmp__ remains built into the language. In some cases, it's important to do a three-way comparison of a "large" object (for example, a Decimal instance). By using cmp, you can store the result of cmp() and then do a separate three-way branch. Without a cmp operator, this cannot be done. And finally, losing __cmp__ appears to be a gratuitous incompatibility between Python 3000 and previous versions of Python. Python 3000 is intended to be a clean version of Python, and not to include changes simply for the sake of making changes. Specific comparison operators and their relationship with __cmp__ This proposal does not eliminate the specific comparison operators; here is why, and what their expected relationship would be. There are a few classes where comparisons are not symmetric, e.g., where a < b may produce the same result as a >= b, yet not be an exception. Thus, there is still a need for specialized comparison operators like __lt__() and __le__(). However, asymmetric classes are extremely rare. In most cases, (a < b) == not (a >= b). Since this is by far and away the common case, and there are many advantages to retaining __cmp__, __cmp__ should be retained. By default, the specific comparison operators __lt__ (etc.) would call __cmp__, and when an asymmetric type is to be defined, developers would redefine __lt__ (etc.). There are also performance reasons for having both specific comparison operators and __cmp__. It has been noted on the mailing list that in some cases, equality and inequality can be tested for in a much more efficient manner than general (rich) comparisons. But this optimization can be handled the same way: For efficiency, the __eq_ and __ne__ functions can be specially defined. On the other hand, when a function needs to know if two objects are less than, equal to, or greater than each others, it is likely to be more efficient to use __cmp__. Note that < and <= will continue to be overloaded as set inclusion operators. For users of relationship operators, the semantics are straightforward. The use of <, =, etc. become transformed into calls to __lt__(), __eq__(), etc. on the object. That object may have a specific definition for that, perhaps because it's asymmetric, or perhaps for performance reasons. If there's no redefinition, it will call __cmp__() which then returns a comparison, and it then uses that result to implement the specific comparison. A call to __cmp__() invokes __cmp__(), of course. Thus, if a class defines both __cmp__() and __lt__(), then "<" would invoke __lt__(); ">" will invoke __gt__(), but if this goes up to object, object would then invoke __cmp__() and translate that result to the True/False expected by the relationship operator. If __cmp__() is not defined for the object, then the usual response to the call of an undefined function apply. Classes where objects can be less than, equal-to, or greater than another object would define a __cmp__(x,y) function to return one of 3 values: -1 (xy). Users should only depend on <0, ==0, or >0, however. If only some comparisons and not others can be performed, do not define __cmp__ - define only those operators. For example, if class objects can be compared for equality and inequality, but NOT something else, they should NOT define __cmp__; instead, they should only define __eq__ and __ne__. It is suggested, but not enforced, that developers use __cmp__ to define comparisons for symmetric classes where there is no significant performance difference, and then allow Python to implement all other comparisons from that. For asymmetric classes, __lt__ (etc.) would be used. When implementing a subset of these comparisons, it is suggested that __lt__ and __eq__ be the basic operators to be defined - that way, programs can prefer to call those operators that don't add extra indirection. This means that __cmp__ is still "Pythonic" in terms of having "one way to do the right thing". In short, define __cmp__ for general symmetric comparisons, and define the specific comparison operators for asymmetric comparisons or efficiency. Then use __cmp__ if you need to know if something is less than, equal to, or greater than another; if you only need one relationship, use the specific relationship. When in doubt, use "<" and "=". Proposed Solution The proposed solution is to restore __cmp__, with exactly the same semantics as with Python 2.5. In short, a call for comparison such as __lt__ would search up the tree for a specific function; if one has not been defined (as an override), an attempt to call __cmp__ to would occur. Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: