Drop XML parsing, switch to using dnf excludepkgs (#5) #32

Merged
adamwill merged 1 commit from dnf-excludepkgs into main 2026-04-20 18:32:30 +00:00
Owner

Instead of painfully retrieving, decompressing, parsing, editing
and reconstructing the repodata, let's use dnf's excludepkgs
option to achieve the same thing. This is much simpler, faster,
and more robust.

To 'modify' the modified base repos, we use dnf repoquery to
list the packages they contain and their source RPM names, then
construct a table with the NEVRs to be removed for each repo.
Then when we do the repoclosure command, we pass
--setopt .excludepkgs= for each modified repo.
This tells dnf to act as if the specified NEVRs simply do not
exist in the specified repo, which is exactly what we want to
achieve.

Signed-off-by: Adam Williamson awilliam@redhat.com

Instead of painfully retrieving, decompressing, parsing, editing and reconstructing the repodata, let's use dnf's excludepkgs option to achieve the same thing. This is much simpler, faster, and more robust. To 'modify' the modified base repos, we use dnf repoquery to list the packages they contain and their source RPM names, then construct a table with the NEVRs to be removed for each repo. Then when we do the repoclosure command, we pass --setopt <repo>.excludepkgs=<nevrlist> for each modified repo. This tells dnf to act as if the specified NEVRs simply do not exist in the specified repo, which is exactly what we want to achieve. Signed-off-by: Adam Williamson <awilliam@redhat.com>
Drop XML parsing, switch to using dnf excludepkgs (#5)
Some checks failed
CI via Tox / tox (pull_request) Failing after 1m25s
AI Code Review / ai-review (pull_request_target) Successful in 50s
b8d1c41a08
Instead of painfully retrieving, decompressing, parsing, editing
and reconstructing the repodata, let's use dnf's excludepkgs
option to achieve the same thing. This is much simpler, faster,
and more robust.

To 'modify' the modified base repos, we use dnf repoquery to
list the packages they contain and their source RPM names, then
construct a table with the NEVRs to be removed for each repo.
Then when we do the repoclosure command, we pass
--setopt <repo>.excludepkgs=<nevrlist> for each modified repo.
This tells dnf to act as if the specified NEVRs simply do not
exist in the specified repo, which is exactly what we want to
achieve.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
Author
Owner

Wow, I should have tried this sooner. Would've saved me a week of messing about with XML parsers. OTOH, I know a lot more about writing a streaming XML token parser now.

The only thing that worries me here is potentially exceeding subprocess' or bash's or something's argument length limit for extremely big updates. I tried it with a 60 package update and it worked, but I'd be worried with those 400+ package KDE updates. I'll see if I can test this somehow before merging. If that's a problem we can probably switch to using config files.

Wow, I should have tried this sooner. Would've saved me a week of messing about with XML parsers. OTOH, I know a lot more about writing a streaming XML token parser now. The only thing that worries me here is potentially exceeding subprocess' or bash's or something's argument length limit for extremely big updates. I tried it with a 60 package update and it worked, but I'd be worried with those 400+ package KDE updates. I'll see if I can test this somehow before merging. If that's a problem we can probably switch to using config files.

AI Code Review

📋 MR Summary

Replaces manual XML repository metadata manipulation with DNF's native excludepkgs functionality for faster and more robust dependency testing.

  • Key Changes:
    • Removed custom XML decompression, parsing, and modification logic.
    • Eliminated lxml and compression-related dependencies.
    • Introduced dnf repoquery to identify package NEVRs based on source names.
    • Added --setopt=<repo>.excludepkgs=<nevrlist> to dynamically hide packages during repoclosure.
  • Impact: rmdepcheck.py (core logic), tests/test_rmdepcheck.py (test suite), Dependency management (pyproject.toml, tox.requires)
  • Risk Level: 🟡 Medium - While architecturally sound, the changes introduce risks regarding shell argument length limits (ARG_MAX) for massive exclusion lists and potential crashes from unvalidated subprocess stdout parsing.

Detailed Code Review

The architectural shift from manual XML parsing to utilizing native DNF capabilities is an excellent move for both performance and maintainability. It significantly reduces the tool's complexity and fragility.

However, there are two primary edge cases to address to ensure robustness. First, subprocess output parsing assumes a strict comma-separated format from dnf repoquery, which can fail if DNF outputs warnings or progress text to stdout. Second, as discussed, injecting potentially thousands of NEVRAs via command-line arguments risks exceeding OS ARG_MAX limits. Mitigating the ARG_MAX issue by dynamically generating a temporary .repo config file instead of relying heavily on command-line flags would guarantee scalability.

Lastly, there are a few residual artifacts from the previous implementation, such as unused directory parameters, that can be cleaned up.

📂 File Reviews

📄 `rmdepcheck.py` - Core execution script restructured to use DNF queries and options rather than XML manipulation.
  • Major [Bug]: Unvalidated parsing of dnf repoquery output. If DNF outputs any warnings, blank lines, or unexpected metadata to stdout, line.split(",") will throw a ValueError and crash the execution.
    • Suggestion: Add validation to ensure the line contains a comma before attempting to unpack it, and use maxsplit=1 for safety.
        for line in out.splitlines():
            if "," not in line:
                continue
            sname, nevr = line.split(",", 1)
            if sname in removes:
                repo_excludes.setdefault(mrepo, []).append(nevr)
  • Major [Architecture]: Risk of exceeding ARG_MAX limits. If excludes grows very large, passing it via --setopt in subprocess.run will throw OSError: [Errno 7] Argument list too long.

    • Suggestion: Instead of --repofrompath and --setopt, dynamically generate a temporary DNF configuration file containing the [repo] blocks and excludepkgs=... lists, then pass it to DNF using -c.
  • Minor [Style]: The mreposdir argument is still passed to get_modified_and_new_repoclosure but is no longer used anywhere inside the function body.

    • Suggestion: Remove the mreposdir parameter from the function signature and the corresponding call in main().
def get_modified_and_new_repoclosure(
    mrepos: Iterable[str],
    nmrepos: Iterable[str],
    nrepos: Iterable[str],
    removes: Iterable[str],
) -> tuple[str, str]:

Summary

  • Overall Assessment: The architectural pivot is solid, but major issues exist with unprotected string parsing and potential ARG_MAX limitations.
  • Priority Issues:
    • Safeguard dnf repoquery output parsing against unexpected output lines lacking commas.
    • Address the ARG_MAX vulnerability by switching from command-line --setopt to generated temporary configuration files if exclusion lists can grow arbitrarily large.
  • Minor Suggestions:
    • Remove the unused mreposdir parameter from get_modified_and_new_repoclosure.

🤖 AI Code Review | Generated with ai-code-review | Model: gemini-3.1-pro-preview

⚠️ AI-generated suggestions may be incorrect. Verify before applying. Not a replacement for human review.

## AI Code Review ### 📋 MR Summary Replaces manual XML repository metadata manipulation with DNF's native `excludepkgs` functionality for faster and more robust dependency testing. - **Key Changes:** - Removed custom XML decompression, parsing, and modification logic. - Eliminated `lxml` and compression-related dependencies. - Introduced `dnf repoquery` to identify package NEVRs based on source names. - Added `--setopt=<repo>.excludepkgs=<nevrlist>` to dynamically hide packages during `repoclosure`. - **Impact:** rmdepcheck.py (core logic), tests/test_rmdepcheck.py (test suite), Dependency management (pyproject.toml, tox.requires) - **Risk Level:** 🟡 Medium - While architecturally sound, the changes introduce risks regarding shell argument length limits (ARG_MAX) for massive exclusion lists and potential crashes from unvalidated subprocess stdout parsing. ### Detailed Code Review The architectural shift from manual XML parsing to utilizing native DNF capabilities is an excellent move for both performance and maintainability. It significantly reduces the tool's complexity and fragility. However, there are two primary edge cases to address to ensure robustness. First, subprocess output parsing assumes a strict comma-separated format from `dnf repoquery`, which can fail if DNF outputs warnings or progress text to `stdout`. Second, as discussed, injecting potentially thousands of NEVRAs via command-line arguments risks exceeding OS `ARG_MAX` limits. Mitigating the `ARG_MAX` issue by dynamically generating a temporary `.repo` config file instead of relying heavily on command-line flags would guarantee scalability. Lastly, there are a few residual artifacts from the previous implementation, such as unused directory parameters, that can be cleaned up. #### 📂 File Reviews <details> <summary><strong>📄 `rmdepcheck.py`</strong> - Core execution script restructured to use DNF queries and options rather than XML manipulation.</summary> - **Major** [Bug]: Unvalidated parsing of `dnf repoquery` output. If DNF outputs any warnings, blank lines, or unexpected metadata to `stdout`, `line.split(",")` will throw a `ValueError` and crash the execution. - *Suggestion:* Add validation to ensure the line contains a comma before attempting to unpack it, and use `maxsplit=1` for safety. ```` for line in out.splitlines(): if "," not in line: continue sname, nevr = line.split(",", 1) if sname in removes: repo_excludes.setdefault(mrepo, []).append(nevr) ```` - **Major** [Architecture]: Risk of exceeding `ARG_MAX` limits. If `excludes` grows very large, passing it via `--setopt` in `subprocess.run` will throw `OSError: [Errno 7] Argument list too long`. - *Suggestion:* Instead of `--repofrompath` and `--setopt`, dynamically generate a temporary DNF configuration file containing the `[repo]` blocks and `excludepkgs=...` lists, then pass it to DNF using `-c`. - **Minor** [Style]: The `mreposdir` argument is still passed to `get_modified_and_new_repoclosure` but is no longer used anywhere inside the function body. - *Suggestion:* Remove the `mreposdir` parameter from the function signature and the corresponding call in `main()`. ```` def get_modified_and_new_repoclosure( mrepos: Iterable[str], nmrepos: Iterable[str], nrepos: Iterable[str], removes: Iterable[str], ) -> tuple[str, str]: ```` </details> ### ✅ Summary - **Overall Assessment:** The architectural pivot is solid, but major issues exist with unprotected string parsing and potential ARG_MAX limitations. - **Priority Issues:** - Safeguard `dnf repoquery` output parsing against unexpected output lines lacking commas. - Address the `ARG_MAX` vulnerability by switching from command-line `--setopt` to generated temporary configuration files if exclusion lists can grow arbitrarily large. - **Minor Suggestions:** - Remove the unused `mreposdir` parameter from `get_modified_and_new_repoclosure`. --- 🤖 **AI Code Review** | Generated with [ai-code-review](https://gitlab.com/redhat/edge/ci-cd/ai-code-review) | **Model:** `gemini-3.1-pro-preview` ⚠️ *AI-generated suggestions may be incorrect. Verify before applying. Not a replacement for human review.*
adamwill force-pushed dnf-excludepkgs from b8d1c41a08
Some checks failed
CI via Tox / tox (pull_request) Failing after 1m25s
AI Code Review / ai-review (pull_request_target) Successful in 50s
to 3a96888ca9
All checks were successful
CI via Tox / tox (pull_request) Successful in 1m27s
2026-04-18 03:43:11 +00:00
Compare
adamwill force-pushed dnf-excludepkgs from 3a96888ca9
All checks were successful
CI via Tox / tox (pull_request) Successful in 1m27s
to db1414151c
All checks were successful
CI via Tox / tox (pull_request) Successful in 1m27s
AI Code Review / ai-review (pull_request_target) Successful in 40s
2026-04-18 03:46:21 +00:00
Compare
Author
Owner

So I tested and this handles a 224-package update with panache and aplomb. ARG_MAX seems to be 2097152 in typical Fedora environments, which really ought to be enough for any foreseeable update (famous last words, hah) - we're only putting package NEVRs in here, they're typically like 10-50 characters long, so if we save 97k characters for contingencies, we can have like 40,000 packages in there. So I think I'm inclined to go with this at least until it blows up. Switching to using config files is a bit of a faff.

So I tested and this handles a 224-package update with panache and aplomb. ARG_MAX seems to be 2097152 in typical Fedora environments, which really ought to be enough for any foreseeable update (famous last words, hah) - we're only putting package NEVRs in here, they're typically like 10-50 characters long, so if we save 97k characters for contingencies, we can have like 40,000 packages in there. So I think I'm inclined to go with this at least until it blows up. Switching to using config files is a bit of a faff.

AI Code Review

📋 MR Summary

This MR drastically simplifies dependency checking by replacing complex manual XML parsing with DNF's native excludepkgs option via repoquery and repoclosure.

  • Key Changes:
    • Removed manual XML metadata downloading, decompressing, parsing, and modification logic.
    • Eliminated standard external tool dependencies like curl, zstd, gzip, and lxml.
    • Introduced get_modified_and_new_repoclosure which relies on dnf repoquery to build an exclusion list and applies it using --setopt <repo>.excludepkgs=<nevrlist>.
  • Impact: rmdepcheck.py, tests/test_rmdepcheck.py, Packaging & Requirements (pyproject.toml, tox.requires)
  • Risk Level: 🟢 Low - The changes remove a highly complex, fragile XML parsing layer in favor of established DNF tooling, drastically lowering maintenance overhead. The primary edge case (ARG_MAX limits) is acknowledged and deemed acceptable for current scaling needs.

Detailed Code Review

Overall, this is an excellent refactor that greatly reduces technical debt by leveraging the native capabilities of dnf instead of trying to replicate repository modifications through raw XML mutation. The test suite has been properly trimmed down to match the simplified logic.

From a performance standpoint, there's a minor but standard algorithmic improvement possible. String concatenation in a loop (excludes += ...) has an O(N^2) complexity in Python. While the number of packages is typically small (e.g., 200-300), switching to an array append and .join() is a safer best practice to handle larger scale inputs effectively. Other than that, the new approach perfectly handles the intended design.

📂 File Reviews

📄 `rmdepcheck.py` - Replaced manual XML editing functionalities with sub-process calls leveraging DNF configuration parameters.
  • Minor [Performance]: String concatenation inside a loop has O(N^2) time complexity. It is recommended to append elements to a list and use ','.join() to generate the excludes string, which is O(N) and significantly more efficient if the exclusion list grows large.
    • Suggestion: Collect matching NEVRs in a list and join them after the loop.
        excludes_list = []
        args = queryargs + ["--repofrompath", f"{hash_repo(mrepo)},{mrepo}"]
        out = SUBPCAPTURE(args).stdout
        for line in out.splitlines():
            # sname, nevr
            elems = line.split(",")
            if len(elems) != 2:
                continue  # pragma: no cover
            if elems[0] in removes:
                excludes_list.append(elems[1])
        
        excludes = ",".join(excludes_list)

Summary

  • Overall Assessment: No critical issues identified. The transition to DNF options is robust, with a minor algorithmic optimization suggested for string joining.
  • Minor Suggestions:
    • Switch the string concatenation loop for excludes to list .append() and ','.join() for optimal O(N) performance.

🤖 AI Code Review | Generated with ai-code-review | Model: gemini-3.1-pro-preview

⚠️ AI-generated suggestions may be incorrect. Verify before applying. Not a replacement for human review.

## AI Code Review ### 📋 MR Summary This MR drastically simplifies dependency checking by replacing complex manual XML parsing with DNF's native `excludepkgs` option via `repoquery` and `repoclosure`. - **Key Changes:** - Removed manual XML metadata downloading, decompressing, parsing, and modification logic. - Eliminated standard external tool dependencies like `curl`, `zstd`, `gzip`, and `lxml`. - Introduced `get_modified_and_new_repoclosure` which relies on `dnf repoquery` to build an exclusion list and applies it using `--setopt <repo>.excludepkgs=<nevrlist>`. - **Impact:** rmdepcheck.py, tests/test_rmdepcheck.py, Packaging & Requirements (pyproject.toml, tox.requires) - **Risk Level:** 🟢 Low - The changes remove a highly complex, fragile XML parsing layer in favor of established DNF tooling, drastically lowering maintenance overhead. The primary edge case (ARG_MAX limits) is acknowledged and deemed acceptable for current scaling needs. ### Detailed Code Review Overall, this is an excellent refactor that greatly reduces technical debt by leveraging the native capabilities of `dnf` instead of trying to replicate repository modifications through raw XML mutation. The test suite has been properly trimmed down to match the simplified logic. From a performance standpoint, there's a minor but standard algorithmic improvement possible. String concatenation in a loop (`excludes += ...`) has an O(N^2) complexity in Python. While the number of packages is typically small (e.g., 200-300), switching to an array append and `.join()` is a safer best practice to handle larger scale inputs effectively. Other than that, the new approach perfectly handles the intended design. #### 📂 File Reviews <details> <summary><strong>📄 `rmdepcheck.py`</strong> - Replaced manual XML editing functionalities with sub-process calls leveraging DNF configuration parameters.</summary> - **Minor** [Performance]: String concatenation inside a loop has O(N^2) time complexity. It is recommended to append elements to a list and use `','.join()` to generate the `excludes` string, which is O(N) and significantly more efficient if the exclusion list grows large. - *Suggestion:* Collect matching NEVRs in a list and join them after the loop. ```` excludes_list = [] args = queryargs + ["--repofrompath", f"{hash_repo(mrepo)},{mrepo}"] out = SUBPCAPTURE(args).stdout for line in out.splitlines(): # sname, nevr elems = line.split(",") if len(elems) != 2: continue # pragma: no cover if elems[0] in removes: excludes_list.append(elems[1]) excludes = ",".join(excludes_list) ```` </details> ### ✅ Summary - **Overall Assessment:** No critical issues identified. The transition to DNF options is robust, with a minor algorithmic optimization suggested for string joining. - **Minor Suggestions:** - Switch the string concatenation loop for `excludes` to list `.append()` and `','.join()` for optimal O(N) performance. --- 🤖 **AI Code Review** | Generated with [ai-code-review](https://gitlab.com/redhat/edge/ci-cd/ai-code-review) | **Model:** `gemini-3.1-pro-preview` ⚠️ *AI-generated suggestions may be incorrect. Verify before applying. Not a replacement for human review.*
Author
Owner

switching to an array append and .join() is a safer best practice to handle larger scale inputs effectively

typing didn't like that, unless I use two different variable names for it, so I went with this. ehh. Fine, I'll just use two variables, I guess.

> switching to an array append and .join() is a safer best practice to handle larger scale inputs effectively typing didn't like that, unless I use two different variable names for it, so I went with this. ehh. Fine, I'll just use two variables, I guess.
Author
Owner

actually, no I won't, because that makes pylint complain about too many locals. We'll stick with this.

actually, no I won't, because that makes pylint complain about too many locals. We'll stick with this.
adamwill force-pushed dnf-excludepkgs from db1414151c
All checks were successful
CI via Tox / tox (pull_request) Successful in 1m27s
AI Code Review / ai-review (pull_request_target) Successful in 40s
to 251fd11d3b
All checks were successful
CI via Tox / tox (pull_request) Successful in 1m28s
2026-04-18 07:06:05 +00:00
Compare
Author
Owner

ARG_MAX seems to be 2097152 in typical Fedora environments

Well, StackOverflow notes that "The maximum size for a single string argument is limited to 131072", and that does seem to be the case in my environment. Still, that seems like enough. Still assuming 50 character NVRs, that's enough room for 2621 of them.

> ARG_MAX seems to be 2097152 in typical Fedora environments Well, [StackOverflow notes](https://stackoverflow.com/questions/29801975/why-is-the-subprocess-popen-argument-length-limit-smaller-than-what-the-os-repor/29802900#29802900) that "The maximum size for a single string argument is limited to 131072", and that does seem to be the case in my environment. Still, that seems like enough. Still assuming 50 character NVRs, that's enough room for 2621 of them.
Author
Owner

This looks to be working fine in testing on staging so far - gives all results the same as prod, including some correct failures. So let's go with it.

This looks to be working fine in testing on staging so far - gives all results the same as prod, including some correct failures. So let's go with it.
adamwill deleted branch dnf-excludepkgs 2026-04-20 18:32:31 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
quality/rmdepcheck!32
No description provided.