git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

keying by identity in dict and set


Hi,

I have an application that would benefit from object instances
distinguished by identity being used in dict's and set's. To do this,
the __hash__ method must be overridden, the obvious return value being
the instance's id.

This works flawlessly in extensive tests on several platforms, and on
a couple of different Python versions and implementations.

The documentation seems to preclude a second requirement, however.

I also want to use the == operator on these objects to mean a natural
comparison of values, different from identity, so that two instances
comparing equivalent does not imply that they are identical.

But the documentation for __hash__ has:
    "The only required property is that objects which compare equal
    have the same hash value"

Yet it seems something more is going on, because as near as I can
tell, it just works, perfectly, every time, despite this requirement.
If the the __hash__ method returns the object's id, the __eq__ method
of the *never* called when the object is added to a dict.  (See
examples below.)

It would appear that if __hash__ returns the id, then that id is used
internally as the key, and since the id is by definition unique, no
key collision ever occurs -- at least in every Python implementation
I've tried. It also seems that, for a class instance obj,
    hash( hash( obj ) ) == hash( obj )
    hash( id( obj ) ) == id( obj )
These are very strong and useful properties.  Where are they documented?

It looks like the existing Python built-in containers do exactly what
I need, but the documentation suggests that they may not.

Taking the documentation at face value, I'm in the position of
choosing between abandoning the natural use of operator '==', or of
introducing an unfamiliar container implementation.  As my package is
intended for the use by other people, neither option is attractive.

Questions:

Is all this apparent behaviour documented somewhere that I have missed?

Is there some fatal reason that this approach must never be used,
besides the lack of documentary support for it?

Is it in fact safe to assume that __eq__ is never called under these
circumstances?

===============================================================
from __future__ import print_function
from sys import stdout

class A( object ):    # minimal hashable object
    def __init__( self, v1, v2 ):
        self.v = ( v1, v2 )

    def __eq__( self, b ):
        #return self.v[0] == b.v[0] and self.v[1] == b.v[1]
        raise Exception( "CALLED __eq__" )

    def __hash__( self ):    # instances distinguished by identity
        return id( self )

# It was suggested that set and dict internally use some representation
# of keys that contains less information than the id() value returned by
# __hash__, thus causing key collisions, which are resolved by calls to
# __eq__.
# In that case, one would expect __eq__ to be called eventually if enough
# objects were added to a set.  I don't see that, though.

NINSTANCES = 3000000    # play with this number -- carefully!
STATUS_INTERVAL = 100000

def test():
    """ hammer the set algorithms """
    s = set()
    instances = []
    for i in range( 0, NINSTANCES ):
        p = A( 1, 0 )
        s.add( p )
        instances.append( p )
        if not i % STATUS_INTERVAL:
            stdout.write( str( i // STATUS_INTERVAL ) + " " )
            stdout.flush()
    stdout.write( "\n" )

    print( "length of set", len( s ) )
    print( "number of instances", len( instances ) )

    for i in instances:
        if not i in s:
            print( "INSTANCE DROPPED OUT!" )
test()