git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

keying by identity in dict and set


Regarding my question
     "Is there some fatal reason that this approach must never be
used, besides the lack of documentary support for it?"
If finally dawned on me that there is a use-case for containers that
would preclude using object identity for keys.  That is, if the object
is to be serialized, or otherwise stored past the run-time of the
program.  Of course, all the identities (all the id() values) will be
meaningless once the current run ends.

That is not my use-case, however.  My containers are already stored by
completely different means.

As to my other questions, I am now fairly convinced that, at least in
CPython and PyPy, on both 32 bit and 64 bit machines, it is safe to
assume that __eq__() is never called if __hash__() returns id(), and
that this results in a perfect hash.  It works splendidly.  That leads
me to the problem of the documentation.

I hope to pursue a change in the documentation, at least to provide a
little more explanation of the reason for the requirements of
__hash__() and __eq__().   I would like to first discuss it with
somebody who understands the history and internals of Python, to make
sure I have all my facts straight.

Cheers!

On Sat, Oct 19, 2019 at 1:31 PM Steve White <stevan.white at gmail.com> wrote:
>
> Hi,
>
> I have an application that would benefit from object instances
> distinguished by identity being used in dict's and set's. To do this,
> the __hash__ method must be overridden, the obvious return value being
> the instance's id.
>
> This works flawlessly in extensive tests on several platforms, and on
> a couple of different Python versions and implementations.
>
> The documentation seems to preclude a second requirement, however.
>
> I also want to use the == operator on these objects to mean a natural
> comparison of values, different from identity, so that two instances
> comparing equivalent does not imply that they are identical.
>
> But the documentation for __hash__ has:
>     "The only required property is that objects which compare equal
>     have the same hash value"
>
> Yet it seems something more is going on, because as near as I can
> tell, it just works, perfectly, every time, despite this requirement.
> If the the __hash__ method returns the object's id, the __eq__ method
> of the *never* called when the object is added to a dict.  (See
> examples below.)
>
> It would appear that if __hash__ returns the id, then that id is used
> internally as the key, and since the id is by definition unique, no
> key collision ever occurs -- at least in every Python implementation
> I've tried. It also seems that, for a class instance obj,
>     hash( hash( obj ) ) == hash( obj )
>     hash( id( obj ) ) == id( obj )
> These are very strong and useful properties.  Where are they documented?
>
> It looks like the existing Python built-in containers do exactly what
> I need, but the documentation suggests that they may not.
>
> Taking the documentation at face value, I'm in the position of
> choosing between abandoning the natural use of operator '==', or of
> introducing an unfamiliar container implementation.  As my package is
> intended for the use by other people, neither option is attractive.
>
> Questions:
>
> Is all this apparent behaviour documented somewhere that I have missed?
>
> Is there some fatal reason that this approach must never be used,
> besides the lack of documentary support for it?
>
> Is it in fact safe to assume that __eq__ is never called under these
> circumstances?
>
> ===============================================================
> from __future__ import print_function
> from sys import stdout
>
> class A( object ):    # minimal hashable object
>     def __init__( self, v1, v2 ):
>         self.v = ( v1, v2 )
>
>     def __eq__( self, b ):
>         #return self.v[0] == b.v[0] and self.v[1] == b.v[1]
>         raise Exception( "CALLED __eq__" )
>
>     def __hash__( self ):    # instances distinguished by identity
>         return id( self )
>
> # It was suggested that set and dict internally use some representation
> # of keys that contains less information than the id() value returned by
> # __hash__, thus causing key collisions, which are resolved by calls to
> # __eq__.
> # In that case, one would expect __eq__ to be called eventually if enough
> # objects were added to a set.  I don't see that, though.
>
> NINSTANCES = 3000000    # play with this number -- carefully!
> STATUS_INTERVAL = 100000
>
> def test():
>     """ hammer the set algorithms """
>     s = set()
>     instances = []
>     for i in range( 0, NINSTANCES ):
>         p = A( 1, 0 )
>         s.add( p )
>         instances.append( p )
>         if not i % STATUS_INTERVAL:
>             stdout.write( str( i // STATUS_INTERVAL ) + " " )
>             stdout.flush()
>     stdout.write( "\n" )
>
>     print( "length of set", len( s ) )
>     print( "number of instances", len( instances ) )
>
>     for i in instances:
>         if not i in s:
>             print( "INSTANCE DROPPED OUT!" )
> test()