Avoid jobs getting stuck forever #9
We should implement a timeout when running cronjobs to avoid a situation like https://pagure.io/fedora-infrastructure/issue/9193
The job was stuck at
2020-07-29 09:00:22,025 review_stats.py INFO Quering Bugzilla for the blockers list.
thus preventing other cronjobs from running.
Is there a way to set a timeout in the cronjob ansible template or should we add a control at python script level?
I'm thinking that this may be a python-bugzilla issue as afaics it does not seem to use a timeout in the requests it makes, nor does it seem to allow setting one.
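Until the library offers a timeout, a script-level guard is one option. A minimal sketch, assuming the script runs on Linux and the blocking call happens in the main thread; `deadline` and `JobTimeout` are hypothetical names, not part of review_stats.py or python-bugzilla:

```python
import signal
import time
from contextlib import contextmanager


class JobTimeout(Exception):
    """Raised when the wrapped block runs longer than its deadline."""


@contextmanager
def deadline(seconds):
    """Abort the enclosed block after `seconds` (Unix only, main thread only)."""

    def _on_alarm(signum, frame):
        raise JobTimeout(f"job exceeded the {seconds}s deadline")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # ask the kernel to send SIGALRM after `seconds`
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the pending alarm, if any
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


if __name__ == "__main__":
    try:
        with deadline(1):
            time.sleep(5)  # stand-in for the blocking Bugzilla query
    except JobTimeout as exc:
        print(f"giving up: {exc}")
```

The alarm interrupts the blocked call, so the script can exit with an error instead of hanging and the next scheduled run gets a chance to start.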
Opened a ticket upstream: https://github.com/python-bugzilla/python-bugzilla/issues/131
Perhaps activeDeadlineSeconds could also be set in the cronjob?
Thanks both. I've opened a PR on ansible to set activeDeadlineSeconds on both review-stats cronjobs. The OpenShift manual isn't really clear (to me) about where to put the activeDeadlineSeconds entry: I assumed it has to go in the root of the job and not in the jobTemplate or in the pod definition.
Meanwhile, @kevin, is there anything to do manually to get the make-html-pages cronjob working again? Even after you manually stopped the stuck pod, OpenShift refuses to start it, with the message:
Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
Should I also update the PR to add startingDeadlineSeconds?
I got it to work by deleting the cronjob and re-creating it. ;(
Thanks!
I've opened a PR on ansible to add both activeDeadlineSeconds and startingDeadlineSeconds. Once approved, it should prevent this from happening again.
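For reference, a rough sketch of where the two fields sit in the upstream Kubernetes CronJob API: startingDeadlineSeconds belongs to the CronJob spec, while activeDeadlineSeconds belongs to the Job spec under jobTemplate. The manifest is written as a Python dict and printed as JSON; the name, schedule, image, and both values are illustrative assumptions, not taken from the actual ansible template.

```python
import json

# Illustrative values only: name, schedule, image and both deadlines are
# assumptions, not copied from the real template.
cronjob = {
    "apiVersion": "batch/v1",  # older clusters expose CronJob as batch/v1beta1
    "kind": "CronJob",
    "metadata": {"name": "make-html-pages"},
    "spec": {
        "schedule": "0 * * * *",
        # CronJob level: how late a missed run may still be started. When set,
        # missed runs are only counted within this window.
        "startingDeadlineSeconds": 300,
        "jobTemplate": {
            "spec": {
                # Job level: the job and its pods are terminated once the job
                # has been active this long.
                "activeDeadlineSeconds": 3600,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "review-stats",
                                "image": "example/review-stats:latest",
                            }
                        ],
                    }
                },
            }
        },
    },
}

print(json.dumps(cronjob, indent=2))
```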
Metadata Update from @mattia: