Bulletproof json.dump?


On Mon, Jul 6, 2020 at 6:37 AM Adam Funk <a24061 at ducksburg.com> wrote:

> Is there a "bulletproof" version of json.dump somewhere that will
> convert bytes to str, any other iterables to list, etc., so you can
> just get your data into a file & keep working?
>

Is the data only being read by python programs? If so, consider using
pickle: https://docs.python.org/3/library/pickle.html
Unlike json dumping, the goal of pickle is to represent objects as exactly
as possible and *not* to be interoperable with other languages.
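
A minimal sketch of the difference, assuming the data is only ever read back by Python (the sample dictionary is made up for illustration):

```python
import pickle

# Values that json would either reject or alter: an int key, a tuple, raw bytes.
data = {1: (1, 2), "blob": b"\xff\x00"}

# pickle restores the same types it was given.
restored = pickle.loads(pickle.dumps(data))

print(restored == data)             # compares equal to the original
print(type(restored[1]).__name__)   # the tuple is still a tuple
```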


If you're using json to pass data between python and some other language,
you don't want to silently convert bytes to strings.
If you have a bytestring of utf-8 data, you want to utf-8 decode it before
passing it to json.dumps.
Likewise, if you have latin-1 data, you want to latin-1 decode it.
There is no universal and correct bytes-to-string conversion.
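
To illustrate why the choice of codec has to be explicit, here is a small sketch in which the same bytes decode to two different strings. The sample value is invented for the example.

```python
import json

# The UTF-8 encoding of "cafe" with an acute accent on the e.
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# json.dumps refuses bytes outright rather than guessing an encoding.
try:
    json.dumps({"name": raw})
    rejected = False
except TypeError:
    rejected = True

# The caller has to pick the codec, and the choice changes the result.
as_utf8 = raw.decode("utf-8")      # 4 characters, the intended text
as_latin1 = raw.decode("latin-1")  # 5 characters, mojibake

utf8_json = json.dumps({"name": as_utf8})
latin1_json = json.dumps({"name": as_latin1})
print(utf8_json)
print(latin1_json)
```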

On Mon, Jul 6, 2020 at 9:45 AM Chris Angelico <rosuav at gmail.com> wrote:

> Maybe what we need is to fork out the default JSON encoder into two,
> or have a "strict=True" or "strict=False" flag. In non-strict mode,
> round-tripping is not guaranteed, and various types will be folded to
> each other - mainly, many built-in and stdlib types will be
> represented in strings. In strict mode, compliance with the RFC is
> ensured (so ValueError will be raised on inf/nan), and everything
> should round-trip safely.
>

Wouldn't it be reasonable to represent this as an encoder which is provided
by `json`? i.e.

    # UnsafeJSONEncoder is a proposed name; it does not exist in json today
    from json import dumps, UnsafeJSONEncoder
    ...
    dumps(foo, cls=UnsafeJSONEncoder)

Emphasizing the "Unsafe" part of this and introducing people to the idea of
setting an encoder also seems nice.
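
Something similar can already be written in user code by subclassing `json.JSONEncoder` and overriding `default()`. The sketch below is only an illustration: the class name `LenientJSONEncoder` is hypothetical, and the conversion rules (UTF-8 with replacement for bytes, sorted lists for sets, `list()` for other iterables) are assumptions, not anything the `json` module provides.

```python
import json


class LenientJSONEncoder(json.JSONEncoder):
    """Hypothetical lossy encoder; not part of the json module."""

    def default(self, o):
        # default() is only called for objects json cannot serialize natively.
        if isinstance(o, bytes):
            # Assumption: bytes hold UTF-8 text; undecodable bytes are replaced.
            return o.decode("utf-8", errors="replace")
        if isinstance(o, (set, frozenset)):
            # Sort by repr so the output order is stable.
            return sorted(o, key=repr)
        try:
            # Any other iterable is folded to a list.
            return list(o)
        except TypeError:
            # Not iterable: fall back to the standard TypeError.
            return super().default(o)


encoded = json.dumps(
    {"tags": {"b", "a"}, "blob": b"hi", "r": range(3)},
    cls=LenientJSONEncoder,
)
print(encoded)
```

Nothing here round-trips: loading the result gives back lists and strings, not sets, bytes, or ranges.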


On Mon, Jul 6, 2020 at 9:12 AM Chris Angelico <rosuav at gmail.com> wrote:

> On Mon, Jul 6, 2020 at 11:06 PM Jon Ribbens via Python-list
> <python-list at python.org> wrote:
> >
> > The 'json' module already fails to provide round-trip functionality:
> >
> >     >>> for data in ({True: 1}, {1: 2}, (1, 2)):
> >     ...     if json.loads(json.dumps(data)) != data:
> >     ...         print('oops', data, json.loads(json.dumps(data)))
> >     ...
> >     oops {True: 1} {'true': 1}
> >     oops {1: 2} {'1': 2}
> >     oops (1, 2) [1, 2]
>
> There's a fundamental limitation of JSON in that it requires string
> keys, so this is an obvious transformation. I suppose you could call
> that one a bug too, but it's very useful and not too dangerous. (And
> then there's the tuple-to-list transformation, which I think probably
> shouldn't happen, although I don't think that's likely to cause issues
> either.)


Ideally, all of these bits of support for non-JSON types should be opt-in,
not opt-out.
But it's not worth making a breaking change to the stdlib over this.

Especially for new programmers, the notion that
    deserialize(serialize(x)) != x
just seems like a recipe for subtle bugs.

You're never guaranteed that the deserialized object will match the
original, but shouldn't one of the goals of a de/serialization library be
to get it as close as is reasonable?


I've seen people do things which boil down to

    json.loads(x)["some_id"] == UUID(...)

plenty of times. It's obviously wrong and the fix is easy, but isn't making
the default json encoder less strict just encouraging this type of bug?

Comparing JSON data against non-JSON types is part of the same category of
errors: conflating JSON with dictionaries.
It's very easy for people to make this mistake, especially since JSON
syntax looks so much like Python dict syntax, so I don't think `json.dumps`
should be encouraging it.
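
The comparison bug described above can be reproduced in a few lines. The UUID value and the `some_id` payload are made up for the example.

```python
import json
from uuid import UUID

expected = UUID("12345678-1234-5678-1234-567812345678")

# On the wire the ID is just a string.
payload = json.dumps({"some_id": str(expected)})
loaded = json.loads(payload)

# Comparing the decoded str against a UUID is always False.
wrong = loaded["some_id"] == expected

# The fix: convert back to the intended type before comparing.
right = UUID(loaded["some_id"]) == expected

print(wrong, right)
```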

On Tue, Jul 7, 2020 at 6:52 AM Adam Funk <a24061 at ducksburg.com> wrote:

> Here's another "I'd expect to have to deal with this sort of thing in
> Java" example I just ran into:
>
> >>> r = requests.head(url, allow_redirects=True)
> >>> print(json.dumps(r.headers, indent=2))
> ...
> TypeError: Object of type CaseInsensitiveDict is not JSON serializable
> >>> print(json.dumps(dict(r.headers), indent=2))
> {
>   "Content-Type": "text/html; charset=utf-8",
>   "Server": "openresty",
> ...
> }
>

Why should the JSON encoder know about an arbitrary dict-like type?
It might implement Mapping, but there's no way for json.dumps to know that
in the general case (because not everything which implements Mapping
actually inherits from the Mapping ABC).
Converting it to a type which json.dumps understands is a reasonable
constraint.
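
As a rough illustration, the class below is dict-like without inheriting from `dict` or the `Mapping` ABC. `HeaderBag` is a hypothetical type invented for this sketch. `json.dumps` rejects it, while an explicit `dict()` conversion works because `dict()` only needs `keys()` and `__getitem__`.

```python
import json


class HeaderBag:
    """Hypothetical dict-like type that subclasses neither dict nor Mapping."""

    def __init__(self, items):
        self._items = dict(items)

    def keys(self):
        return self._items.keys()

    def __getitem__(self, key):
        return self._items[key]


headers = HeaderBag({"Server": "openresty"})

# json.dumps has no way to know this object is meant to be a mapping.
try:
    json.dumps(headers)
    failed = False
except TypeError:
    failed = True

# The caller states the intent explicitly by converting first.
encoded = json.dumps(dict(headers))
print(failed, encoded)
```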

Also, if your object is "case insensitive", wouldn't it be just as fair to
serialize it as
  { "CONTENT-TYPE": ... } or { "content-type": ... } or ...
?

`r.headers["content-type"]` presumably gets a hit.
`json.loads(json.dumps(dict(r.headers)))["content-type"]` will get a
KeyError.
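
A self-contained sketch of that behaviour follows. It uses a hypothetical stand-in class, not the real `requests` type, and assumes only that lookups ignore case while the original key spelling is kept.

```python
import json


class CaseInsensitiveHeaders:
    """Hypothetical stand-in for a case-insensitive header mapping."""

    def __init__(self, items):
        # Index by lowercased key, remembering the original spelling.
        self._store = {k.lower(): (k, v) for k, v in items.items()}

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def keys(self):
        return [original for original, _ in self._store.values()]


headers = CaseInsensitiveHeaders({"Content-Type": "text/html; charset=utf-8"})

# Lookup on the original object ignores case.
hit = headers["content-type"]

# After a JSON round trip the result is a plain, case-sensitive dict.
round_tripped = json.loads(json.dumps(dict(headers)))
try:
    round_tripped["content-type"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(hit, lookup_failed)
```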

This seems very much out of scope for the json package because it's not
clear what it's supposed to do with this type.
Libraries should ask users to specify what they mean and not make
potentially harmful assumptions.

Best,
-Stephen