PEP 680: "tomllib" Support for parsing TOML in the Standard Library

brettcannon · January 11, 2022, 10:18pm

Because when it comes to the stdlib, there is no “initial”, only “final”.

Please understand that any API that lands in the stdlib is extremely hard to change (for instance, simply ditching some attributes on modules is going to take me around 7 years to accomplish). Basic stdlib policy is you start small and grow, not make assumptions and hope for the best.

And when it comes to human-readable output (which JSON, CSV, pickle, and arguably XML are not), everyone has an opinion (see every single style guide and formatter as examples of that). So asking for write support is a big ask on the Python core team to support, deal with feature requests, etc. in order to either appease everyone or constantly reject people’s asks. It’s way easier to push that to the community to support or ask people to use f-strings for their e.g. TOML templates. Plus writing TOML is way easier than reading it.
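To illustrate the f-string approach mentioned above, here is a minimal sketch of filling a TOML template. The helper name `render_pyproject` and the sample values are hypothetical, and the sketch assumes the inserted values contain no quotes or backslashes, so no TOML escaping is attempted.

```python
def render_pyproject(name: str, version: str) -> str:
    """Fill a fixed TOML template.

    Assumes the values need no escaping (no quotes, backslashes or newlines).
    """
    return (
        "[project]\n"
        f'name = "{name}"\n'
        f'version = "{version}"\n'
    )


print(render_pyproject("example-pkg", "0.1.0"))
```

This only works because the template author controls the output style completely, which is exactly the set of choices a general write API would have to make for everyone.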

hukkinj1 · January 11, 2022, 10:23pm

Awesome, thanks!

The main motivation for this PEP, in my mind, is fixing the pyproject.toml-based Python bootstrapping nightmare, and also giving tools an easy way to read configuration from that same file.

The main motivation is not to maximize use cases, and I don’t think the stdlib should attempt to maximize use cases now that we have PyPI, packaging standards and the Internet. But that’s another discussion.

This is true. But this PEP doesn’t remove the option of adding write API in the future.

The idea is to mention tomli_w and other third-party TOML writers clearly in the docs.

The PEP tries to explain the degrees of freedom that we must take a stance on. Even simple things like the default output style (indentation width, single or double quotes etc.) should be considered features and changing them is a breaking change in the stdlib IMO.

hauntsaninja · January 11, 2022, 10:32pm

Thanks CAM! I’ll edit the top level post to include links to previous discussion.

On the name “toml”:

I think we can all agree that the “toml” name is better.

Breakage seems unavoidable as discussed here
The amount of breakage is discussed here

So it seems to me that the only productive discussion on this subject is determining how much breakage is acceptable, where I believe CPython has a high prior on the answer “zero breakage”. If there isn’t consensus or a ruling from the SC, my view is to stick with “tomllib”.

On adding a write API:

Discussed in the PEP here

A previous version of the PEP had some more detail on this, including listing reasons why a write API would be useful (nothing that CAM doesn’t mention) and a concrete discussion of the degrees of freedom in the design space. This can be found in Editing for the tomllib PEP by encukou · Pull Request #3 · hauntsaninja/peps · GitHub (see Appendix B + the section discussing write API). If people find it useful, I can restore that discussion.

I think people’s view of this depends on how much they weigh the core motivations of reading pyproject.toml vs that of using TOML more generally. It’s not the worst thing to push users to PyPI when they would be better served by more capable packages, but might stick to the standard library schelling point out of inertia (I’ve found several uses of “toml” that would likely be served better by “tomlkit”). Petr Viktorin made an interesting point on another thread that this might actually be a good reason to not use the name “toml”: to leave the more desirable name to the community.

As others have said, the current PEP doesn’t preclude the possibility of a write API in the future. The main reason to push for including a write API in this PEP, rather than later, is if we feel a) that it is likely that we want to adopt the “toml” name, b) we believe including a write API will effectively minimise the breakage of doing so. Since I’m inclined to think using the “toml” name is not an option, I am very happy to have discussion of a write API deferred.

With all that said, I would be curious if any CPython core developers feel strongly enough about this that they’d be willing to take on maintenance of a write API.

pf_moore · January 11, 2022, 10:49pm

Not me, certainly. Personally, I think of TOML as very much a human-editable format, whereas I see JSON, XML, CSV and similar as much less so. The fiddly details of formatting output in a write API are much more controversial in a format that’s human editable than in one that’s not. So I don’t think the design constraints for TOML are the same as those other formats. And as a result, I think it’s the right decision to not include a write API, at least in the initial version, and possibly never in the stdlib (because it’s too hard to change the stdlib).

hugovk · January 11, 2022, 11:21pm

It’s very popular indeed, 34m downloads last month, 52nd most popular on PyPI.

Let’s imagine we got control of the existing toml right now and released it today with a warning saying to use some other name we also release.

That gives everyone 10 months to heed the warning and upgrade before toml is released in CPython 3.11.

Compare, CPython requires a minimum of two releases of warnings before removals = 2-3 years, depending on the cadence. It’s often years longer, and yet there’s many packages which never upgrade until the last minute, or after the removal (and some never do, nose is a recent casualty of 3.10 yet many projects are still testing with it).

10 months feels much too short notice for such a popular package.

And realistically, the PEP 541 toml name transfer has another three weeks minimum on the clock plus backlog delays, putting it under 9 months to warn.

barry · January 12, 2022, 1:33am

encukou · January 12, 2022, 9:34am

The library needs a core dev maintainer. I volunteered to do that, in the sense that if something happens to @hukkinj1, I’ll reprioritize what I’m currently doing and take over tomllib.
I do not want to do that for a write API – as discussed above, that’s much harder to get right, and unlike load, I don’t think the community came up with a worthy API so far. toml and tomli-w and others are certainly useful, but IMO they’re in the stage of experimenting with choices and trade-offs. That’s fine for PyPI projects. (Nothing wrong with using toml if it works for you even though it’s not getting updates. And if a new version gets a redesign, you can pin the old one.)

As @hauntsaninja said before, I am happy that the name toml will be left to the community. There should be better libraries available on PyPI. The stdlib doesn’t need to cater to advanced use cases (or performance needs) here. And there are many examples of this already:

dataclasses is much less powerful than attrs: it’s up to you if you call it “focused” or “dumbed down”, but it’s designed that way deliberately.
urllib.request docs recommend using requests instead.
http.server docs warn against using it seriously.
re is strictly less powerful than regex.
and so on.

The tomllib proposal doesn’t seek to replace the other libraries. There’s still lots of room for them. Maybe one day, when writing TOML is boring enough, that will change.

Anyway. There’s one detail in the PEP that I’d do differently: I’d leave out the parse_float argument. TOML floats are IEEE 754 binary64 values, which are Python floats on basically all architectures that support binary64. Using Decimal for extra precision is not portable: for compatibility with other parsers, I believe it’s much better to always store money amounts as str than to tempt users into using parse_float=decimal.Decimal (and different extensions in different parsers).
(Relatedly, decimal.Decimal() happens to parse current TOML float syntax, but that’s not a given. Same for float(), which is the default value for parse_float, but it’s not documented as such because the TOML and Python syntaxes could drift apart in the future.)
But AFAIK, the plan is to extend the PEP with some quotes from users that need parse_float, and I’m ready to be convinced that practicality should beat purity here.
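For readers following along, here is a rough sketch of the two options being compared. It assumes the `loads(..., parse_float=...)` signature that Tomli provides and the PEP proposes, and falls back to the Tomli backport if `tomllib` is not importable; the key names `price_float` and `price_str` are made up for the example.

```python
from decimal import Decimal

try:
    import tomllib  # stdlib module proposed by this PEP
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API (assumed installed)

doc = """
price_float = 19.99
price_str = "19.99"
"""

# Default behaviour: TOML floats become Python floats.
default = tomllib.loads(doc)

# With parse_float, the callable receives the float's source text.
as_decimal = tomllib.loads(doc, parse_float=Decimal)

# Portable alternative: keep the amount as a TOML string and convert it
# in application code, independent of any parser-specific option.
from_str = Decimal(default["price_str"])

print(type(default["price_float"]).__name__)
print(as_decimal["price_float"] == from_str)
```

The string-based variant behaves the same in any TOML parser, while the `parse_float` variant depends on the parser offering such a hook.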

takluyver · January 12, 2022, 10:24am

I’d probably be in favour of adding some kind of write support one day, but not now. The benefits of having the reading part in the stdlib are greater, and there are far fewer design choices to make. Let’s not risk missing 3.11 by trying to expand the scope to writing.

On the naming: it’s a shame not to call it toml, but I think it’s more important to avoid breaking existing code. So from my perspective, the choice is between a) using another name, in which case tomllib is as good as any, and b) matching the API of the existing toml package (not bug for bug, but say 99% of usage should not break), and then going through a deprecation cycle for any unwanted functionality. The appendix of the PEP covers what this would involve - but I don’t think anyone is arguing for this.

hukkinj1 · January 12, 2022, 12:00pm

Yeah. The parse_float argument is in this PEP largely because it is part of Tomli API (the proposed implementation). The PEP discusses why leaving it out may be a bad thing, but I’ll try to expand a bit:

As mentioned, float precision is architecture dependent → precision may be lost on some architecture meaning we are unable to represent the original TOML float value. I think this should be at least as concerning as the portability issues with extra precision.
Assuming we are happy with the precision provided by the float type, there’s still the problem that the float type and its operations are unusable in many applications due to limitations like
```
>>> 0.1 * 3 == 0.3
False
```
admittedly we can get around this by implementing a function that walks the parsed document and converts floats with an operation something like
```
import decimal

def float_to_decimal(f):
    return decimal.Decimal(f"{f:.15g}")  # keep 15 significant digits
```
to get 15 significant numbers (and replace 15 with whatever number of decimals is guaranteed to be precise on the given platform). But that maybe makes more assumptions than at least I would like my mission critical software to make.
With the emergence of cryptocurrencies, we now have values like Ethereum amounts that have 18 decimals after the decimal point, and need to be perfectly precise.
This is very selfish, but anyway: parse_float will stay in Tomli, so removing it from the stdlib will cause a small divergence and extra maintenance effort (assuming Tomli is used as a backport).
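The precision points in the list above can be checked with the stdlib alone. This is only an illustration with a made-up amount (`amount_text`), not a claim about how any particular application stores values; the expected results in the comments assume IEEE 754 binary64 floats.

```python
from decimal import Decimal

# A value with 18 digits after the decimal point, as in the Ethereum example.
amount_text = "1.000000000000000001"

as_float = float(amount_text)
as_decimal = Decimal(amount_text)

# binary64 cannot represent the trailing digit, so the float collapses to 1.0.
print(as_float == 1.0)                        # True
# Decimal keeps every digit of the source text.
print(as_decimal == Decimal(1))               # False
# Decimal arithmetic also avoids the 0.1 * 3 surprise shown earlier.
print(Decimal("0.1") * 3 == Decimal("0.3"))   # True
```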

Some potential issues here:

TOML can be used as an interface between a non-tech-savvy person and an application/developer. It can be painful to explain why fractional numbers need quotes around them.
Typing of “str floats” needs to be reimplemented in application code.

Unfortunately, I don’t think I’ll have the energy to be able to provide the quotes requested, unless they appear in this thread of course

I don’t feel strongly about this. If there’s consensus that parse_float is better left out of the stdlib I can update the PEP and make the needed changes when integrating to the stdlib.

encukou · January 12, 2022, 12:17pm

Most of your arguments are against float (IEEE 754 binary64) itself. The limitations of floats are well-known, in Python and other languages, and aren’t limited to TOML.
TOML made the choice to use IEEE binary64. Despite all its shortcomings, it’s a useful implementation of floating-point numbers, and it’s nearly universally available in programming languages. Since TOML is designed for interoperability, binary64 makes sense.

The fact that some builds of CPython may not have binary64 floats is the only other reason to keep parse_float. But it’s not a very practical reason: those builds are extremely rare.

Without those quotes, I stay convinced that parse_float is second-guessing the TOML designers, and it’s not something the stdlib should do. (But it is of course perfectly fine for tomli to do it – I can imagine the “interface between a non-tech-savvy person and an application/developer” argument carries a lot more weight there.)

tiran · January 12, 2022, 1:05pm

OT: Do we actually support platforms without IEEE 754 semantics? Almost all our target platforms have hardware floats. Some platforms like very old Raspberry Pi have soft-float. Which platforms have neither hard-floats nor soft-floats?
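One way to check a given build is to inspect `sys.float_info`. The sketch below compares its fields against the binary64 parameters; the function name `looks_like_binary64` is hypothetical, and matching these parameters is a heuristic for the format, not a proof of full IEEE 754 semantics.

```python
import sys


def looks_like_binary64() -> bool:
    """Return True if the float type has binary64's radix, precision and exponent range."""
    info = sys.float_info
    return (
        info.radix == 2
        and info.mant_dig == 53     # 52 stored bits plus the implicit leading bit
        and info.max_exp == 1024
        and info.min_exp == -1021
    )


print(looks_like_binary64())
```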

pf_moore · January 12, 2022, 2:32pm

The TOML 1.0.0 spec says

Floats should be implemented as IEEE 754 binary64 values.

Note that it’s “should” rather than “must” - arguably consumers can choose whatever representation they want. This is similar to JSON, which doesn’t state what representation consumers should use for numbers, but suggests binary64.

The stdlib json library has a parse_float argument, which works exactly the same as the one in tomli. I feel that consistency, combined with the fact that the TOML spec says “should” rather than “must”, supports having the tomli parse_float argument in the stdlib version. (Even though I doubt that I’ll ever personally use it in real code).
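For comparison, this is what the existing `json` hook looks like in use. The document content is made up for the example.

```python
import json
from decimal import Decimal

text = '{"amount": 0.1}'

# Default: JSON numbers with a fractional part become Python floats.
default = json.loads(text)

# parse_float is called with the number's source text.
exact = json.loads(text, parse_float=Decimal)

print(type(default["amount"]).__name__)
print(exact["amount"] == Decimal("0.1"))
```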

CAM-Gerlach · January 12, 2022, 11:03pm

My thanks to @encukou , @brettcannon , @hukkinj1 , @hauntsaninja , and the others who took the time to detail the write support issue further. While I can still see some strong reasons to include write support in tomllib in the standard library, at least at some point, thanks to your helpful explanations and rationale I now have a better appreciation for why it is deferred, at least for now, especially in light of the urgent and increasing need for bootstrapping purposes that was a primary motivation for this PEP that we’re all aware of by now.

Might it be worth elaborating a bit in the PEP as to the counterpoints mentioned, particularly with regard to the DoF that even a “minimal” write implementation would have to deal with? It wouldn’t need to be anything as lengthy as the prior removed section; just a couple sentences tweaking the specific examples and clarifying how they apply to any write implementation would go a long way toward forestalling the type of questions I asked around that, given they were prompted by the existing ones seeming to be focused on features a minimal implementation would lack completely.

Also, for what it’s worth coming from a regular Python user, for the reasons others and myself have mentioned here and elsewhere, I am in favor of the tomllib name (and particularly not taking toml's), and I sympathize with @pf_moore 's position on parse_float above for consistency, even though I also certainly understand @encukou 's point of view on that.

barry · January 13, 2022, 4:36pm

I don’t necessarily disagree about deferring write support, but I wonder if that would hinder adoption by third party tools. I’m thinking about my experience with pdm add for example. That has to write out the pyproject.toml file. So if that’s the case, does adding a read-only tomllib change the calculation of tools like this as to whether they vendor TOML support or not?

abravalheri · January 13, 2022, 5:02pm

I remember seeing a conversation in the Packaging category about the possibility of allowing PKG-INFO/METADATA with a .toml extension. Of course that was by no means more than brainstorming, but I see the lack of write support in stdlib influencing decisions in such scenarios.

hukkinj1 · January 13, 2022, 5:04pm

I doubt that is the case.

The most important vendorers when it comes to bootstrapping are (if I’m not mistaken) build backends (flit_core, poetry_core, setuptools etc.), frontends (build) and installers (pip, installer). None of these needs write capability AFAIK.

When it comes to non-build related tools, out of the ones mentioned in the PEP (black, mypy, pytest, tox, pylint, isort, flake8) none needs to write TOML.
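The read-only use these tools have in mind looks roughly like the sketch below. It assumes the `loads` API from Tomli and the PEP, with a fallback import of the backport; the `[tool.example]` table, its `line-length` key and the helper `read_tool_config` are hypothetical.

```python
try:
    import tomllib  # stdlib module proposed by this PEP
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API (assumed installed)

PYPROJECT = """
[build-system]
requires = ["flit_core"]
build-backend = "flit_core.buildapi"

[tool.example]
line-length = 100
"""


def read_tool_config(text: str, tool_name: str) -> dict:
    """Return the [tool.<tool_name>] table, or an empty dict if it is absent."""
    data = tomllib.loads(text)
    return data.get("tool", {}).get(tool_name, {})


print(read_tool_config(PYPROJECT, "example"))
```

Nothing here needs to write TOML back out, which matches the point about build backends, frontends and installers.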

encukou · January 13, 2022, 5:23pm

Regarding parse_float, I fully trust @hukkinj1 to make the right decision. My opinion should be clear now, but I’ll sponsor and support the PEP either way.

OTOH, if anyone wants to add write support now, start by finding another PEP sponsor. As far as I’m concerned, it’s out of the scope of this PEP, and should be discussed separately. I’m not saying it would be bad to add it, just that I can’t commit to helping to add it. (And also, I think the discussion would delay this PEP far too much.)

CAM-Gerlach · January 13, 2022, 5:56pm

Good point. From skimming the code, it looks like PDM uses tomli for most of the read-only uses, and appears to only use tomlkit for cases where it needs to round-trip/preserve style, which would be much more complex and further from the existing scope to implement for this PEP than a simple writer like tomli-w (which would be the natural baseline to include). At present, they are both hard deps (along with many others, meaning bootstrapping would be pretty difficult anyway); I’m not sure how easy it would be for PDM or a downstream to factor tomlkit out/make it optional, though I presume in theory it wouldn’t be needed just for build, basic package management, etc.

In any case, given that, I’m not sure adding tomli-w-level write support would be of much benefit either way as opposed to a full style-preserving implementation which seems pretty out of scope here, how much in the way of changes to PDM would be needed to have one without the other, and how much difference it makes given how many other deps PDM has. However, given you’re a core dev and SC member, you could consider sponsoring one in the future to do so.

@frostming , any thoughts here?

Just to be clear, as I understand it, PDM is a backend, frontend and installer all in one, and sees substantial and increasing use (nearly 2k stars on GitHub and less than two years old). It has been one of the first adopters of several new PyPA standards. However, it doesn’t currently vendor, as explained above; given that and the large dep stack, I’m not sure how much difference write support (especially non-format-preserving) would really make, and it likely can at least somewhat be worked around, so I don’t see it as a blocker.

I’m not sure this would make much sense, since PKG-INFO metadata are primarily machine-readable, and certainly exclusively machine-writeable, files (unlike pyproject.toml), so I’m not sure it would make much sense to write them as TOML instead of the less verbose, widely understood, web-friendly and already-supported JSON (which was previously proposed in Core Metadata 2.0, IIRC, but never gained much real-world adoption). In any case, this would need to be standardized and implemented, so I’m not sure it makes sense to block including TOML read support on this.

For what it’s worth, as a strong initial advocate of including write support, I’ve come to more or less agree with this, especially for the case (PDM) mentioned above.

barry · January 13, 2022, 6:42pm

I’m not an SC member any more! But I still might be interested in sponsoring such a future PEP.

brettcannon · January 13, 2022, 8:17pm

That PDM case is also different as it’s updating a pre-existing file versus generating a fresh one. So it’s even more complicated than writing as you’re trying to probably match an arbitrary, pre-existing format of the file while only updating one line, not simply control of newlines and indent with some write API (tomlkit was created for handling the update case BTW).

I’m not aware of any discussion, but there’s a bigger chance of those files moving to JSON than TOML as they are not meant to be human-readable/edited like pyproject.toml and TOML in general is.