aiohttp 3.10 changed the server's handling of content-type/encoding
headers for compressed files. This affects our usage of
aiohttp.test_utils.TestServer: gzip-compressed files are no longer being
decompressed by default. This in itself is not a problem, but the
MIME type used ("application/gzip") did not match our client's list of
automatically handled MIME types.
Add it to the list to handle it in the same way as others.
This fixes test_cmd::yum with aiohttp>=3.10.
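As a rough illustration of the idea only (not the actual client code): a
client-side check that treats "application/gzip" like other gzip MIME types.
The names GZIP_MIME_TYPES and maybe_decompress are hypothetical, and
"application/x-gzip" is an assumed pre-existing entry; the real list is not
shown in this message.

```python
import gzip

# Assumed list of MIME types the client decompresses automatically.
# "application/x-gzip" is a guess at an existing entry;
# "application/gzip" is the type this change adds.
GZIP_MIME_TYPES = frozenset({"application/x-gzip", "application/gzip"})


def maybe_decompress(content_type: str, body: bytes) -> bytes:
    """Return the body, gunzipped if the MIME type is a known gzip type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in GZIP_MIME_TYPES:
        return gzip.decompress(body)
    return body
```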
Directories are generally expected to be listed first in directory
indexes. That was already working for yum and file repos, but wasn't the
case for kickstart repos due to their combination of different types of
content.
This commit applies a consistent sorting so that directories will always
come first, and entries will otherwise be sorted by name, for all repo
types.
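A minimal sketch of such an ordering, assuming a hypothetical Entry type
with "name" and "is_dir" fields (the library's real entry type may differ):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entry:
    # Hypothetical stand-in for an index entry.
    name: str
    is_dir: bool


def sort_entries(entries: Iterable[Entry]) -> List[Entry]:
    # False sorts before True, so "not is_dir" places directories first;
    # ties are then broken by name.
    return sorted(entries, key=lambda e: (not e.is_dir, e.name))
```

The same key function can be applied to every repo type, which is what
keeps the ordering consistent.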
The Fetcher type was designed to return a 'str'.
That wasn't a good idea because it implies that every fetched file must
be loaded into memory completely. On certain large yum repos,
decompressed primary XML can be hundreds of MB, and it's not appropriate
to require loading that all into memory at once.
Make it support a file-like object (stream of bytes). Since the SAX
XML parser supports reading from a stream, this makes it possible to
avoid loading everything into memory at once.
A test of the repo-autoindex CLI against
/content/dist/rhel/server/7/7Server/x86_64/os showed a major
improvement in memory usage:
- before: ~1200MiB
- after: ~80MiB
Note that achieving the full improvement requires any downstream users
of the library (e.g. exodus-gw) to update their Fetcher implementation
as well, to stop returning a 'str'.
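One way to accept both the old and new return types is sketched below.
The names FetcherResult and as_stream are illustrative assumptions, not
necessarily what the library uses; the point is only that a 'str' can
still be wrapped, while a stream is passed through untouched.

```python
import io
from typing import Awaitable, BinaryIO, Callable, Optional, Union

# Assumed shape of the type: a fetcher may return text (old behaviour),
# a stream of bytes (new behaviour), or None when nothing was found.
FetcherResult = Union[str, BinaryIO, None]
Fetcher = Callable[[str], Awaitable[FetcherResult]]


def as_stream(result: FetcherResult) -> Optional[BinaryIO]:
    """Normalize a fetcher result into a readable byte stream."""
    if result is None:
        return None
    if isinstance(result, str):
        # Legacy fetchers: the content is already fully in memory,
        # so wrapping it gives compatibility but no memory saving.
        return io.BytesIO(result.encode("utf-8"))
    # Already a stream; hand it to the parser without reading it.
    return result
```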
This library includes inline type hints, but per PEP 561 this must be
indicated by including a "py.typed" marker file, otherwise tools like
mypy will not make use of the type hints when checking downstream
projects.
3f478e76f7 added a "type: ignore" here due to a change in
typeshed. The commit message mentioned that the type hint may have been
wrong.
It looks like that was fixed in
https://github.com/python/typeshed/pull/9919/files,
so it's necessary to also remove the "type: ignore" now.
The following commit defined a return type hint for
getElementsByTagName:
3fc2f27990 (diff-f451f731d037ef9d79347194490b32ba613798ea7eaa2c160351a69625f05e08R150)
It defined the return type as a list of Node, while this code expects a
list of Element (Element is a subtype of Node).
Given that one would expect a getElements method to return
specifically *elements* and not other types of node, I think the
typeshed change may be incorrect, but it's hard to be sure since the
stdlib docs themselves are ambiguous.
Suppress it for now to unblock dependency updates.
AppStream kickstart repos were missing from the initial collection
of repos used to test the kickstart repo index functionality. AppStream
repos uniquely do not contain "checksums" sections in their treeinfo
files. So, when attempting to run repo-autoindex against an AppStream
kickstart repo, "KeyError: 'checksums'" was raised.
Now, when encountering an AppStream kickstart repo, repo-autoindex
does not attempt to parse the "checksums" section.
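A sketch of the guard, assuming treeinfo is parsed as an INI-style file
with configparser (the function name treeinfo_checksums is hypothetical):

```python
import configparser
from typing import Dict


def treeinfo_checksums(text: str) -> Dict[str, str]:
    """Return the "checksums" section of a treeinfo file, if any."""
    # Interpolation is disabled so that "%" in values is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Indexing parser["checksums"] directly raises KeyError when the
    # section is absent, so check for it first.
    if not parser.has_section("checksums"):
        return {}
    return dict(parser["checksums"])
```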
Due to the presence of a "repodata/repomd.xml" path in a kickstart
repo, repo-autoindex previously interpreted kickstart repos as yum
repos. As such, a kickstart repo's index would solely consist of two
directories: "Packages" and "repodata".
While a kickstart repo does contain a yum repo, kickstart repos also
contain two additional repo entry points: treeinfo and extra_files.json.
Each entry point references additional files that should be included
in a kickstart repo's index. These files were previously ignored.
Now, when repo-autoindex encounters a kickstart repo, repo-autoindex
produces a repo index that reflects the content referenced in all
three repo entry points (repomd.xml, treeinfo, extra_files.json).
Redo the parsing of packages from primary.xml to use SAX; previously it
was using pulldom. The motivation for the change is to reduce memory usage.
When parsing a larger yum repo such as the one contained within
rhel-8-for-ppc64le-appstream-kickstart__8_DOT_4, the observed memory
usage of the repo-autoindex command was:
- pulldom: ~700MB
- SAX: ~85MB
This does not affect the output of the indexing process, and is covered
by existing tests.
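For illustration, a simplified SAX handler in the same spirit is shown
below. It is not the handler from this commit: the class name and callback
are hypothetical, and it only reports the "href" of each package's
<location> element. Memory stays low because no DOM is built; each package
is handed to the callback as soon as it is seen.

```python
import io
import xml.sax
from typing import Callable


class LocationHandler(xml.sax.ContentHandler):
    """Report the location href of every <package> via a callback."""

    def __init__(self, on_location: Callable[[str], None]) -> None:
        super().__init__()
        self._on_location = on_location
        self._in_package = False

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so names arrive
        # exactly as written in the document.
        if name == "package":
            self._in_package = True
        elif name == "location" and self._in_package:
            self._on_location(attrs.get("href"))

    def endElement(self, name):
        if name == "package":
            self._in_package = False


# Tiny made-up document standing in for a real primary.xml.
SAMPLE = b"""<metadata>
  <package type="rpm"><name>a</name><location href="Packages/a.rpm"/></package>
  <package type="rpm"><name>b</name><location href="Packages/b.rpm"/></package>
</metadata>"""

hrefs = []
xml.sax.parse(io.BytesIO(SAMPLE), LocationHandler(hrefs.append))
```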
Ultimately, all errors are propagated in some way, but it's important to
differentiate between "the content was invalid" vs "failed to fetch the
content".
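A sketch of how the two cases could be kept apart. The exception and
function names are hypothetical, and treating ValueError as the signal
for invalid content is an assumption made for the example.

```python
class FetcherError(Exception):
    """The content could not be fetched."""


class ContentError(Exception):
    """The content was fetched but is not valid."""


def load_index(fetch, url, parse):
    # Any failure inside the caller-supplied fetcher is reported as a
    # fetch problem, with the original exception kept as the cause.
    try:
        raw = fetch(url)
    except Exception as exc:
        raise FetcherError(f"failed to fetch {url}") from exc

    # A parse failure means the content itself is the problem.
    try:
        return parse(raw)
    except ValueError as exc:
        raise ContentError(f"invalid content at {url}") from exc
```

Both errors still propagate to the caller, but the caller can now tell
whether retrying the fetch is worthwhile.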