proxies-reverseproxy: set keepalive=on ttl=10 for koji #3173

Merged
kevin merged 1 commit from victorkoycheff/ansible:issue-12913-koji-502 into main 2026-03-11 20:36:50 +00:00
Contributor

This PR addresses the intermittent 502 Bad Gateway errors users experience during long-running Koji connections (e.g., watch-task, watch-logs, or chainbuild).

The Problem:
Currently, the Apache proxy has a KeepAliveTimeout of 15s and the Koji backend has a KeepAliveTimeout of 16s. However, because the two layers of keep-alive are untethered, a race condition occurs. mod_proxy occasionally attempts to reuse an idle connection from its backend pool at the exact millisecond the Koji backend is closing it, resulting in a 502.

The Fix:
By adding keepalive=on ttl=10 to the proxyopts for the Koji proxies, we tell Apache to stop reusing any backend connection that has been idle for more than 10 seconds (ttl=10); keepalive=on additionally enables TCP keep-alive probes on those connections. Because 10 seconds is well below the backend's 16-second KeepAliveTimeout, the proxy should no longer pick up a connection that the backend is about to close, which should close the race window on this hop.
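
The timing argument above can be sketched as a small toy model. This is only an illustration under simplifying assumptions (no network latency, no clock skew, one pooled connection); the function names are hypothetical and nothing here is taken from mod_proxy itself.

```python
from typing import Optional


def can_hit_race(idle_s: float, backend_timeout_s: float,
                 ttl_s: Optional[float] = None) -> bool:
    """True if the proxy would still reuse a pooled connection that the
    backend has already decided to close (simplified model)."""
    # Without a ttl, the model assumes the proxy always tries to reuse.
    proxy_would_reuse = ttl_s is None or idle_s < ttl_s
    # The backend drops the connection once it has been idle this long.
    backend_closing = idle_s >= backend_timeout_s
    return proxy_would_reuse and backend_closing


def safety_margin_s(backend_timeout_s: float, ttl_s: float) -> float:
    """Seconds between the proxy giving up on a connection and the backend
    closing it; a positive value means the two windows never overlap."""
    return backend_timeout_s - ttl_s
```

With the values from this PR, can_hit_race(16.0, 16.0) is true when no ttl is set, while can_hit_race(16.0, 16.0, ttl_s=10.0) is false, and safety_margin_s(16.0, 10.0) gives a 6-second margin.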

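Reference: Apache mod_proxy documentation (https://httpd.apache.org/docs/2.4/mod/mod_proxy.html)

Fixes infra/tickets#12913 (https://forge.fedoraproject.org/infra/tickets/issues/12913)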

Owner

I'm not fully convinced that this is the place where the keepalive is mismatched. It's a pretty complex path:

request -> proxy httpd -> anubis -> proxy httpd -> koji httpd

But it's possible at least. :)

Since we are in beta freeze this will need to wait until we are unfrozen, but we could try it then.

Or if we can duplicate it in staging we could try there.
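
One way to try duplicating it in staging is to send requests spaced close to the backend's keep-alive timeout and count the 502 replies. The sketch below is a starting point only: the function names are hypothetical, the target URL is left to the caller, and because each httpd child process keeps its own backend connection pool, a single sequential probe may never land on the stale connection.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout_s: float = 30.0) -> int:
    """Return the HTTP status of a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status is on the exception.
        return err.code


def count_502s(url, gap_s, attempts, fetch=fetch_status, sleep=time.sleep):
    """Issue `attempts` requests spaced `gap_s` seconds apart and return
    how many of them came back as 502."""
    bad = 0
    for i in range(attempts):
        if i:
            sleep(gap_s)  # let the pooled backend connection sit idle
        if fetch(url) == 502:
            bad += 1
    return bad
```

Running count_502s against a staging URL with gap_s just under 16 before and after the change would give a rough before/after comparison; the fetch and sleep parameters exist only so the loop can be exercised without network access.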

victorkoycheff force-pushed issue-12913-koji-502 from d1912a75f4 to 947cfd0b2c 2026-02-26 07:46:53 +00:00
Checks on d1912a75f4 (some failed): Linter / yamllint (pull_request) failing after 23s; Linter / ansible-lint (pull_request) failing after 26s.
Checks on 947cfd0b2c (some failed): Linter / yamllint (pull_request) failing after 29s; Linter / ansible-lint (pull_request) failing after 30s; AI Code Review / ai-review (pull_request_target) successful in 29s.
Author
Contributor

Yeah, it's definitely a pretty complex path. :)

My thinking was that the race condition happens right at that very last hop to the koji backend, so dropping the connection at 10s there might just do the trick... Empirically, I've seen this exact tweak clear up similar 502s where the backend and proxy timeouts were fighting each other.

I also just force-pushed a fix for a silly yaml syntax error that was failing the linter checks.

We can definitely just leave this until the freeze is lifted and try to duplicate it in staging then.


AI Code Review

📋 MR Summary

Adds proxy keepalive and TTL options to Koji reverse proxy configurations to resolve intermittent 502 Bad Gateway errors.

  • Key Changes:
    • Added proxyopts: "keepalive=on ttl=10" to the Koji production balancer configuration.
    • Added proxyopts: "keepalive=on ttl=10" to the Koji staging balancer configuration.
    • Fixed Ansible syntax indentation for the varnish_url variable.
  • Impact: proxies-reverseproxy.yml
  • Risk Level: 🟢 Low - The changes are straightforward Apache mod_proxy configuration updates with standard, documented values that prevent known race conditions without introducing new logical risks.

Detailed Code Review

The implementation aligns perfectly with the intent described in the PR and previous discussions. Setting keepalive=on ttl=10 is the standard and correct approach to prevent mod_proxy from reusing connections that the backend is about to close. The indentation fix for varnish_url is a nice minor correction.

📂 File Reviews

📄 `playbooks/include/proxies-reverseproxy.yml` - Updated proxy configuration to include keepalive options for Koji and fixed syntax indentation for variables.
  • Suggestion [Style]: The change from - varnish_url to varnish_url corrects the Ansible variable dictionary structure, which is good practice.

Summary

  • Overall Assessment: No critical issues identified. The fix effectively addresses the reported 502 errors.

🤖 AI Code Review | Generated with ai-code-review | Model: gemini-3.1-pro-preview

⚠️ AI-generated suggestions may be incorrect. Verify before applying. Not a replacement for human review.

Owner

OK, let's try (staging first)...

kevin merged commit 35a1b3223b into main 2026-03-11 20:36:50 +00:00