-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
include-package-data can't find it in namespaces #4193
base: main
Are you sure you want to change the base?
Conversation
I generated a special development release to demonstrate the issue: https://pypi.org/project/finitelycomputable-idtrust-common/24.0.dev1/#files The pyproject.toml uses this configuration. The sdist contains the needed template files, but they are not listed in SOURCES.txt, and the generated wheel does not include them. To include them it is not sufficient to uncomment the tool.setuptools.package-data table, you must also comment out the tool.setuptools.packages.find table.
Hi @thrasymache , thank you very much for the report. Before we consider removing this section of the docs, could you please elaborate more on the reproducer for the issue you are having. Ideally, we are looking for a simplified, minimal reproducer accompanied by a clear set of instructions of how you are building your package. A note about |
Hi @abravalheri, thank you for taking interest in my report. I am in the middle of a 6k+ line changeset my the source repo, involving three dozen distribution packages, and without the data files the test suite won't run, so I can be careful enough to The reproducer I have so far is a 3KB sdist. I can minimalize it by stripping the metadata from pyproject.toml, and truncating all the files down to one-liners. I can also make separate sdists for each of the variants rather than giving editing instructions. I am building using |
@abravalheri let me apologize for a confusion/mistake on my part. Early on in my fury last night I ran into the fact that I had an sdist which contained the template data files, but when a wheel was built from this sdist, the wheel did not have these data files. I could see that even explicitly pip installing from the sdist was building a wheel and that was why my site-packages did not have the template files. But I then assumed that this was normal, and spent the rest of my time just looking at what it took to get the template files into the wheel. I am now aware that this is not normal, and having more files in the sdist than the wheel built from that sdist is extra special breakage that isn't easy to trigger. In particular, the sdist that I uploaded isn't an example of this, and doesn't have the template files. For a moment I thought I had been totally wrong about this aspect of the issue or that I had at least destroyed the evidence with an actual working build, but I had not. I have an sdist and a wheel from the same I have strict instructions from my cardiologist not to dive deeply into the setuptools implementation, but I can see that this exchange is likely going to end up with changes outside of the documentation. But all of this is about the extra special breakage of an sdist not matching a wheel. The include-data-files being insufficient to include my data files remains an easily reproducible issue which I will make more minimal even if I can't get the sdist wheel mismatch to happen again. |
I have a minimal reproducer for the sdist not matching the wheel, and it is itself an sdist.
The built sdist will contain a templates directory with three files in it, but the wheel does not have these. I had an instructive failure to reproduce while making this: if you change the name of the distribution, even just by substituting some of the letters, then the sdist also does not contain the templates. |
Hi @thrasymache , thank you very much for the reproducer. Indeed after trying to run the reproducer I can see that the contents for the produced In fact the only "code files" that seem to be included in the sdist (other then the generated files) when you run things from scratch are:
Please see how I reached this conclusion below: # docker run --rm -it python:3.11-bookworm /bin/bash
cd /tmp
wget https://github.com/pypa/setuptools/files/14066818/finitelycomputable-idtrust-common-0.0.dev2.tar.gz
tar xzf finitelycomputable-idtrust-common-0.0.dev2.tar.gz
cd finitelycomputable-idtrust-common-0.0.dev2
# Important:
# Let's remove the *.egg-info folder and PKG-INFO
# to truly ensure our reproducer is starting from scratch and that
# it doesn't have any external factor influencing in the final result:
rm -rf *.egg-info PKG-INFO
python3 -m pip install -U pip build
python3 -m build
tar tf dist/finitelycomputable-idtrust-common-0.0.dev2.tar.gz
# finitelycomputable-idtrust-common-0.0.dev2/
# finitelycomputable-idtrust-common-0.0.dev2/PKG-INFO
# finitelycomputable-idtrust-common-0.0.dev2/README.rst
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/__init__.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/logic.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/strategies.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/PKG-INFO
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/SOURCES.txt
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/dependency_links.txt
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/top_level.txt
# finitelycomputable-idtrust-common-0.0.dev2/pyproject.toml
# finitelycomputable-idtrust-common-0.0.dev2/setup.cfg
unzip -Z1 dist/finitelycomputable_idtrust_common-0.0.dev2-py3-none-any.whl
# finitelycomputable/idtrust_common/__init__.py
# finitelycomputable/idtrust_common/logic.py
# finitelycomputable/idtrust_common/strategies.py
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/METADATA
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/WHEEL
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/top_level.txt
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/RECORD I don't know how you originated the To ensure all the files are contained in the In the same example above, once I add a rm -rf *.egg-info dist build
cat <<EOF > MANIFEST.in
global-include *.html
EOF
python3 -m build
tar tf dist/finitelycomputable-idtrust-common-0.0.dev2.tar.gz
# finitelycomputable-idtrust-common-0.0.dev2/
# finitelycomputable-idtrust-common-0.0.dev2/MANIFEST.in
# finitelycomputable-idtrust-common-0.0.dev2/PKG-INFO
# finitelycomputable-idtrust-common-0.0.dev2/README.rst
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/__init__.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/logic.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/strategies.py
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/templates/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/templates/exchange_form.html
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/templates/interaction.html
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable/idtrust_common/templates/interaction_begin.html
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/PKG-INFO
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/SOURCES.txt
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/dependency_links.txt
# finitelycomputable-idtrust-common-0.0.dev2/finitelycomputable_idtrust_common.egg-info/top_level.txt
# finitelycomputable-idtrust-common-0.0.dev2/pyproject.toml
# finitelycomputable-idtrust-common-0.0.dev2/setup.cfg
unzip -Z1 dist/finitelycomputable_idtrust_common-0.0.dev2-py3-none-any.whl
# finitelycomputable/idtrust_common/__init__.py
# finitelycomputable/idtrust_common/logic.py
# finitelycomputable/idtrust_common/strategies.py
# finitelycomputable/idtrust_common/templates/exchange_form.html
# finitelycomputable/idtrust_common/templates/interaction.html
# finitelycomputable/idtrust_common/templates/interaction_begin.html
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/METADATA
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/WHEEL
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/top_level.txt
# finitelycomputable_idtrust_common-0.0.dev2.dist-info/RECORD The example in the docs might be relevant for this issue:
In the summary we also highlight that:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think removing this from the documentation is the right solution for the problem presented.
I think we can add text regarding caching (e.g. mention that the user might want to remove the automatically generated *.egg-info
, build
, dist
folders when modifying the configuration files or doing structural changes in the directory/file structure) and/or the need of properly ensuring the sdist
contain all the relevant files, but I am against the complete removal of the information that is currently provided.
getting an sdist inconsistent with a wheel only occurs if you have an old egg-info directory with an old SOURCES.txt. I can only guess that it also requires the SOURCES.txt to get misidentified as up-to-date.
After removing the *.egg-info directory I also got an The documentation for converting
I don't think that |
I'm sorry for the confusion. I was just so baffled with the
Done. |
The user may want to remove the automatically generated PKG-INFO, \*.egg-info, | ||
build, and dist folders when modifying the configuration files or doing | ||
structural changes in the directory/file structure. Failure to do so may result | ||
in an sdist that contains datafiles which are not in a simultaneously built | ||
wheel, and which will not be installed even if directly installing the sdist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something along the lines of:
The user may want to remove the automatically generated PKG-INFO, \*.egg-info, | |
build, and dist folders when modifying the configuration files or doing | |
structural changes in the directory/file structure. Failure to do so may result | |
in an sdist that contains datafiles which are not in a simultaneously built | |
wheel, and which will not be installed even if directly installing the sdist. | |
.. note:: | |
Setuptools automatically creates a few directories to host build artefacts | |
and cache files, such as ``build``, ``dist``, ``*.egg-info``. | |
While cache is useful to speed up incremental builds, in some edge cases it might become stale. | |
If you feel that caching is causing problems to your build, specially after changes in | |
configuration or in the directory/file structure., consider removing | |
``build``, ``dist``, ``*.egg-info`` [#PKG-INFO] before rebuilding or | |
reinstalling your project. | |
.. [#PKG-INFO] | |
When working from an existing sdist (e.g. for patching), you might also consider removing | |
the `PKG-INFO` file to force its recreation. |
I suggest putting the PKG-INFO
part in a footnote because those are not day-to-day operations for a standard setuptools user. It is something very advanced.
Nevertheless, I think it would make more sense to add this content as a dedicated session in the document (with a heading title) more towards the end. The reason being that we should organise the information to have a nice reading flow. I think that the "attention scenarios" and workaround/debugging tips should come after the configuration methods are explained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that might make sense is to reference this content from the part of the document we mention MANIFEST.in
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way that you have written it makes sense. I will rework what I have to be more like that, and I agree that flow is important. I just have two issues here:
- A user who thinks they have a simple situation and wants to get back to something else won't read the whole page, they will get far enough to have something that might work and stop there. I think it is important to quickly warn users who have already built a package, that they should look at the note at the bottom. If it is a quick note it won't interfere with those who are configuring a brand new package who won't be affected by stale artefacts.
- I got to the point of having created and uploaded several of my own packages without really understanding the differences between an
sdist
and awheel
beyond naming and archive format. Specifically, I was convinced that if an editable install had the needed files that I could look in thesdist
to see if the files were there as well. I hadn't seen any of the documentation that would have said thatsdist
archives are expected to have files that will not be installed, and will not produce warnings or errors if there are files that are in thesdist
that are not used in building what does get installed. This page is one of the first ones that I ran into when I realized that I needed to get files that weren't Python source to be installed, and I suspect others at my level of understanding will end up here as well. I recognize that it does interrupt the flow to be talking about the difference between what gets packaged and what gets installed here, but there should at least be a link to somewhere that it is explained, preferably by someone who understands it better than I do.
@@ -407,49 +413,6 @@ which enables the ``data`` directory to be identified, and then, we separately s | |||
files for the root package ``mypkg``, and the namespace package ``data`` under the package | |||
``mypkg``. | |||
|
|||
With ``include_package_data`` the configuration is simpler: you simply need to enable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, I don't think this part should be removed.
Alternatively, instead of removing, we can drop the "simply" (because as you said it is not as simple as we first thought) and then add a reminder that points to the section of the document that we mention MANIFEST.in
and/or the caching note as proposed in the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry I am taking so long to provide a good set of reproducers. I finally did a tally and came up with 48 cases. I suspect that it is going to end up that some of the config that the comments say is optional is not optional in my situation. After I have iterated through all the cases I'll give you a summary, a handful of the most instructive reproducers and hopefully a smaller edit of this section of the document, plus mentioning the Manifest.in
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abravalheri After testing many of the configurations I am actually back to the original position, that this configuration does not permit you to include non-Python files in a distribution. I have only checked pyproject.toml configurations, so I do not know if the issue is specific to that code-path. I am including three reproducers each of which is using the configuration from the current documentation virtually verbatim, and that do not have any of the generated files in them. Two of them use a src parent directory, and one of them matches my own setup without a src parent directory.
find-inc-yes-src-ns.tar.gz
find-inc-yes-src.tar.gz
find-inc-yes-dot.tar.gz
tar tzf find-inc-yes-dot.tar.gz
python -m build
zipinfo dist/find_where-0.0.dev0-py3-none-any.whl
substitute the other filenames as appropriate in the tar command.
I tried adding the package-data configuration described above the section I suggest removing to show things working, with the result that the namespace-level HTML file is included but only the Python is included in the package. At this point I remembered that I had previously avoided including dashes in package directories because they are interpreted as subtraction in Python. The same configuration with the directory renamed to have an underscore rather than a dash results in all of the HTML files being included. The same directory name with the package-data configuration commented out again results in no HTML files being included. I do not know if we should also include a note about dashes in package directory names not supporting data files, or if there's a different way to configure the package data in that situation, or whether there's already a more central piece of documentation saying that it's a bad idea and will cause random things to break even if it sometimes works.
find_where-inc-yes-src.tar.gz
find_where-pd-inc-yes-src.tar.gz
find-where-pd-inc-yes-src.tar.gz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @thrasymache thank you very much for the in-depth investigation.
I just want to double check with you a couple of points before delving into the reproducers you provided:
-
When you say "this configuration does not permit you to include non-Python files in a distribution", I am assuming you are referring to
include-package-data
(whose default value isTrue
when it comes frompyproject.toml
). Please let me know if you are referring to thepackage-data
or other configuration field instead. -
I am assuming, you considered this passage in the docs (that I have previously highlighted in this conversation) when writing the reproducers:
[...] you can simply use the
include_package_data
keyword [...]
[...] files will be automatically installed with your package, provided:- These files are included via the MANIFEST.in file, [...]
- OR, they are being tracked by a revision control system such as Git, Mercurial or SVN, and you have configured an appropriate plugin such as setuptools-scm or setuptools-svn [...]
If that is not the case please let me know. (Also if none of these conditions is met, I agree 100% that the outcome is that the files will be not included at the end, but that does not mean that the docs are wrong, since they do mention these pre-conditions).
As I do not see anyMANIFEST.in
file in thetar.gz
you provided (orsetuptools-scm
inpyproject.toml
), I can only guess that there was a problem when creating thetar.gz
archives (?)
That might be the reason why you are seeing a weird behaviour in your results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we're getting to the heart of the matter. I think I can save you the time of delving into the reproducers: Yes, I am using the include-package-data
configuration, and none of the reproducers use either MANIFEST.in
nor a version control plugin, but there is still a glaring problem with the documentation as it stands: the passage that you are highlighting is in the include_package_data
section, and the edit that I started suggesting is in the Subdirectory for Data Files
section. Now that I have been looking at this document for a few weeks I can see how it's trying to work, but also how easy it is for someone looking at it the first time to read it the way that I did last month.
There are 4 sections before the Summary and Addenda. The first two specify independent ways to include data files which can also be used together. The third section (exclude_package_data
) only makes sense if being used with one or both of the first two, but that is also clear from its content. The fourth Subdirectory for Data Files
section, is actually describing the same configurations as the first two, but it is incomplete, in that it doesn't mention that include-package-data
requires the specific files to be specified elsewhere. This is something that is clear in the first section and in the Summary, though it isn't clear in the fourth section that you also need to read anything else. Specifically, it says:
With include_package_data the configuration is simpler: you simply need to enable scanning of namespace packages in the src directory and the rest is handled by Setuptools.
This gives the impression that the below configuration is all you need, since specifying the packages elsewhere in the same file is simpler than creating a new file to do it or installing a plugin. My guess is that the author of this section wasn't aware that he was depending upon the plugin for this configuration to work.
Anyway, I agree with the MANIFEST.in
the data files would be included by the config that I suggested removing (although it might be worth noting that in a pyproject.toml
and the right layout you can get away with omitting it). The conservative edit would be to simply mention the necessity of either the MANIFEST.in
or the VCS plugin here as it is elsewhere on the page. On the other hand, if you're up for it, I still think it's confusing to have this section here discussing two of the three configuration options in a place in the document where you would expect a fourth configuration option. I think it would make more sense if this section was at least below the Summary, and referred to the sections that give details on the two configuration options in addition or instead of explicit configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @thrasymache. That is a very good summary!
I am proposing an alternative in #4230. Please feel free to comment.
Hi @thrasymache, I merged #4230. Hopefully that makes things cleared. |
Hi @abravalheri, I've had a busy week. I do want to continue working on this, and will get to incorporating your suggestions in the next day or two. |
I generated a development release to demonstrate the issue: https://pypi.org/project/finitelycomputable-idtrust-common/24.0.dev1/#files
I was in this section of the documentation to try to get a templates data directory to be included in a package which is itself in the finitelycomputable namespace package.
The pyproject.toml uses the configuration described in the removed section. The sdist contains the needed template files, but they are not listed in SOURCES.txt, and the generated wheel does not include them. To include them it is not sufficient to uncomment the tool.setuptools.package-data table, you must also comment out the tool.setuptools.packages.find table. All of this revolves around SOURCES.txt, which is sometimes generated by the setuptools build backend. If you build it once in a way that adds the template files to SOURCES.txt, then even if you misconfigure the pyproject.toml, you are likely to get the template files anyway. I am sure I am not the first person to have been bitten by this, but I am even more certain that there are immovable obstacles preventing it from being better.
Summary of changes
Removing misleading documentation about include-package-data
Pull Request Checklist
newsfragments/
.(See documentation for details)