Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include-package-data can't find it in namespaces #4193

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 6 additions & 43 deletions docs/userguide/datafiles.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,12 @@ Setuptools focuses on this most common type of data files and offers three ways
of specifying which files should be included in your packages, as described in
the following sections.

The user may want to remove the automatically generated PKG-INFO, \*.egg-info,
build, and dist folders when modifying the configuration files or doing
structural changes in the directory/file structure. Failure to do so may result
in an sdist that contains datafiles which are not in a simultaneously built
wheel, and which will not be installed even if directly installing the sdist.
Comment on lines +15 to +19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something along the lines of:

Suggested change
The user may want to remove the automatically generated PKG-INFO, \*.egg-info,
build, and dist folders when modifying the configuration files or doing
structural changes in the directory/file structure. Failure to do so may result
in an sdist that contains datafiles which are not in a simultaneously built
wheel, and which will not be installed even if directly installing the sdist.
.. note::
Setuptools automatically creates a few directories to host build artefacts
and cache files, such as ``build``, ``dist``, ``*.egg-info``.
While cache is useful to speed up incremental builds, in some edge cases it might become stale.
If you feel that caching is causing problems to your build, specially after changes in
configuration or in the directory/file structure., consider removing
``build``, ``dist``, ``*.egg-info`` [#PKG-INFO] before rebuilding or
reinstalling your project.
.. [#PKG-INFO]
When working from an existing sdist (e.g. for patching), you might also consider removing
the `PKG-INFO` file to force its recreation.

I suggest putting the PKG-INFO part in a footnote because those are not day-to-day operations for a standard setuptools user. It is something very advanced.

Nevertheless, I think it would make more sense to add this content as a dedicated session in the document (with a heading title) more towards the end. The reason being that we should organise the information to have a nice reading flow. I think that the "attention scenarios" and workaround/debugging tips should come after the configuration methods are explained.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that might make sense is to reference this content from the part of the document we mention MANIFEST.in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that you have written it makes sense. I will rework what I have to be more like that, and I agree that flow is important. I just have two issues here:

  1. A user who thinks they have a simple situation and wants to get back to something else won't read the whole page, they will get far enough to have something that might work and stop there. I think it is important to quickly warn users who have already built a package, that they should look at the note at the bottom. If it is a quick note it won't interfere with those who are configuring a brand new package who won't be affected by stale artefacts.
  2. I got to the point of having created and uploaded several of my own packages without really understanding the differences between an sdist and a wheel beyond naming and archive format. Specifically, I was convinced that if an editable install had the needed files that I could look in the sdist to see if the files were there as well. I hadn't seen any of the documentation that would have said that sdist archives are expected to have files that will not be installed, and will not produce warnings or errors if there are files that are in the sdist that are not used in building what does get installed. This page is one of the first ones that I ran into when I realized that I needed to get files that weren't Python source to be installed, and I suspect others at my level of understanding will end up here as well. I recognize that it does interrupt the flow to be talking about the difference between what gets packaged and what gets installed here, but there should at least be a link to somewhere that it is explained, preferably by someone who understands it better than I do.


include_package_data
====================

Expand Down Expand Up @@ -407,49 +413,6 @@ which enables the ``data`` directory to be identified, and then, we separately s
files for the root package ``mypkg``, and the namespace package ``data`` under the package
``mypkg``.

With ``include_package_data`` the configuration is simpler: you simply need to enable
Copy link
Contributor

@abravalheri abravalheri Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, I don't think this part should be removed.

Alternatively, instead of removing, we can drop the "simply" (because as you said it is not as simple as we first thought) and then add a reminder that points to the section of the document that we mention MANIFEST.in and/or the caching note as proposed in the change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I am taking so long to provide a good set of reproducers. I finally did a tally and came up with 48 cases. I suspect that it is going to end up that some of the config that the comments say is optional is not optional in my situation. After I have iterated through all the cases I'll give you a summary, a handful of the most instructive reproducers and hopefully a smaller edit of this section of the document, plus mentioning the Manifest.in option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abravalheri After testing many of the configurations I am actually back to the original position, that this configuration does not permit you to include non-Python files in a distribution. I have only checked pyproject.toml configurations, so I do not know if the issue is specific to that code-path. I am including three reproducers each of which is using the configuration from the current documentation virtually verbatim, and that do not have any of the generated files in them. Two of them use a src parent directory, and one of them matches my own setup without a src parent directory.

find-inc-yes-src-ns.tar.gz
find-inc-yes-src.tar.gz
find-inc-yes-dot.tar.gz

tar tzf find-inc-yes-dot.tar.gz
python -m build
zipinfo dist/find_where-0.0.dev0-py3-none-any.whl

substitute the other filenames as appropriate in the tar command.

I tried adding the package-data configuration described above the section I suggest removing to show things working, with the result that the namespace-level HTML file is included but only the Python is included in the package. At this point I remembered that I had previously avoided including dashes in package directories because they are interpreted as subtraction in Python. The same configuration with the directory renamed to have an underscore rather than a dash results in all of the HTML files being included. The same directory name with the package-data configuration commented out again results in no HTML files being included. I do not know if we should also include a note about dashes in package directory names not supporting data files, or if there's a different way to configure the package data in that situation, or whether there's already a more central piece of documentation saying that it's a bad idea and will cause random things to break even if it sometimes works.

find_where-inc-yes-src.tar.gz
find_where-pd-inc-yes-src.tar.gz
find-where-pd-inc-yes-src.tar.gz

Copy link
Contributor

@abravalheri abravalheri Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thrasymache thank you very much for the in-depth investigation.
I just want to double check with you a couple of points before delving into the reproducers you provided:

  • When you say "this configuration does not permit you to include non-Python files in a distribution", I am assuming you are referring to include-package-data (whose default value is True when it comes from pyproject.toml). Please let me know if you are referring to the package-data or other configuration field instead.

  • I am assuming, you considered this passage in the docs (that I have previously highlighted in this conversation) when writing the reproducers:

    [...] you can simply use the include_package_data keyword [...]
    [...] files will be automatically installed with your package, provided:

    • These files are included via the MANIFEST.in file, [...]
    • OR, they are being tracked by a revision control system such as Git, Mercurial or SVN, and you have configured an appropriate plugin such as setuptools-scm or setuptools-svn [...]

    If that is not the case please let me know. (Also if none of these conditions is met, I agree 100% that the outcome is that the files will be not included at the end, but that does not mean that the docs are wrong, since they do mention these pre-conditions).
    As I do not see any MANIFEST.in file in the tar.gz you provided (or setuptools-scm in pyproject.toml), I can only guess that there was a problem when creating the tar.gz archives (?)
    That might be the reason why you are seeing a weird behaviour in your results.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we're getting to the heart of the matter. I think I can save you the time of delving into the reproducers: Yes, I am using the include-package-data configuration, and none of the reproducers use either MANIFEST.in nor a version control plugin, but there is still a glaring problem with the documentation as it stands: the passage that you are highlighting is in the include_package_data section, and the edit that I started suggesting is in the Subdirectory for Data Files section. Now that I have been looking at this document for a few weeks I can see how it's trying to work, but also how easy it is for someone looking at it the first time to read it the way that I did last month.

There are 4 sections before the Summary and Addenda. The first two specify independent ways to include data files which can also be used together. The third section (exclude_package_data) only makes sense if being used with one or both of the first two, but that is also clear from its content. The fourth Subdirectory for Data Files section, is actually describing the same configurations as the first two, but it is incomplete, in that it doesn't mention that include-package-data requires the specific files to be specified elsewhere. This is something that is clear in the first section and in the Summary, though it isn't clear in the fourth section that you also need to read anything else. Specifically, it says:

With include_package_data the configuration is simpler: you simply need to enable scanning of namespace packages in the src directory and the rest is handled by Setuptools.

This gives the impression that the below configuration is all you need, since specifying the packages elsewhere in the same file is simpler than creating a new file to do it or installing a plugin. My guess is that the author of this section wasn't aware that he was depending upon the plugin for this configuration to work.

Anyway, I agree with the MANIFEST.in the data files would be included by the config that I suggested removing (although it might be worth noting that in a pyproject.toml and the right layout you can get away with omitting it). The conservative edit would be to simply mention the necessity of either the MANIFEST.in or the VCS plugin here as it is elsewhere on the page. On the other hand, if you're up for it, I still think it's confusing to have this section here discussing two of the three configuration options in a place in the document where you would expect a fourth configuration option. I think it would make more sense if this section was at least below the Summary, and referred to the sections that give details on the two configuration options in addition or instead of explicit configs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @thrasymache. That is a very good summary!

I am proposing an alternative in #4230. Please feel free to comment.

scanning of namespace packages in the ``src`` directory and the rest is handled by Setuptools.

.. tab:: pyproject.toml

.. code-block:: toml

[tool.setuptools]
# ...
# By default, include-package-data is true in pyproject.toml, so you do
# NOT have to specify this line.
include-package-data = true

[tool.setuptools.packages.find]
# scanning for namespace packages is true by default in pyproject.toml, so
# you need NOT include the following line.
namespaces = true
where = ["src"]

.. tab:: setup.cfg

.. code-block:: ini

[options]
packages = find_namespace:
package_dir =
= src
include_package_data = True

[options.packages.find]
where = src

.. tab:: setup.py

.. code-block:: python

from setuptools import setup, find_namespace_packages
setup(
# ... ,
packages=find_namespace_packages(where="src"),
package_dir={"": "src"},
include_package_data=True,
)

Summary
=======
Expand Down