Avoid jobs getting stuck forever #9

Closed
opened 2020-08-04 16:23:25 +00:00 by mattia · 7 comments
Owner

We should implement a timeout when running cronjobs to avoid a situation like https://pagure.io/fedora-infrastructure/issue/9193
The job was stuck at
2020-07-29 09:00:22,025 review_stats.py INFO Quering Bugzilla for the blockers list.
thus preventing other cronjobs from running.

Is there a way to set a timeout in the cronjob Ansible template, or should we add a check at the Python script level?
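For the script-level option, here is a minimal sketch, assuming the job can be launched as a child process from a small wrapper. The name `run_with_timeout` and the limits used are hypothetical and are not taken from review_stats.

```python
import subprocess
import sys


def run_with_timeout(argv, limit_seconds):
    """Run a command; return its exit code, or None if it exceeded the limit."""
    try:
        completed = subprocess.run(argv, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a stuck job: sleeps longer than the hypothetical 1 second limit.
    stuck_job = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(stuck_job, limit_seconds=1))
```

A wrapper like this only bounds the whole run; it does not say which call was hanging, so a timeout on the network request itself would still be useful.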

I'm thinking that this may be a python-bugzilla issue as afaics it does not seem to use a timeout in the requests it makes, nor does it seem to allow setting one.

Opened a ticket upstream: https://github.com/python-bugzilla/python-bugzilla/issues/131
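
To illustrate why a missing timeout lets a job hang, here is a minimal sketch using only the Python standard library. It deliberately does not use python-bugzilla's API, since whether that library accepts a timeout is exactly the open question. `start_silent_server` and `read_reply` are hypothetical names; the "server" is a local socket that accepts connections but never answers.

```python
import socket


def start_silent_server():
    """Open a local listening socket that never answers (stand-in for a hung service)."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    return listener


def read_reply(host, port, timeout_seconds):
    """Send a request and return the reply, or None if none arrives in time."""
    try:
        # The timeout given here also applies to the later send and recv calls.
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.sendall(b"ping\n")
            return conn.recv(1024)
    except socket.timeout:
        return None


if __name__ == "__main__":
    server = start_silent_server()
    port = server.getsockname()[1]
    # With timeout_seconds=None this call would block indefinitely.
    print(read_reply("127.0.0.1", port, timeout_seconds=0.5))
    server.close()
```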
Owner

Perhaps activeDeadlineSeconds could also be set in the cronjob?

Author
Owner

Thanks, both. I've opened a PR on ansible to set activeDeadlineSeconds on both review-stats cronjobs. The OpenShift manual isn't really clear (to me) about where to put the activeDeadlineSeconds entry: I assumed it has to go at the root of the job and not in the jobTemplate or in the pod definition.

Meanwhile, @kevin, is there anything to do manually to get the make-html-pages cronjob working again? Even after you manually stopped the stuck pod, OpenShift refuses to start it with the message:
Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

Should I update the PR to add startingDeadlineSeconds as well?
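
A simplified model of the check behind that message may help explain why startingDeadlineSeconds matters. This is a sketch under stated assumptions, not the controller's actual code: it assumes a fixed-interval schedule (real cronjobs use cron expressions), and the hourly interval, the two-hour deadline and the timestamps are illustrative values. The limit of 100 comes from the error message itself. As I understand the upstream behaviour, setting the deadline shortens the window in which missed starts are counted.

```python
from datetime import datetime, timedelta

MAX_MISSED = 100  # threshold quoted in the error message


def missed_starts(last_run, now, interval, starting_deadline=None):
    """Count scheduled start times in (window_start, now] for a fixed-interval schedule."""
    window_start = last_run
    if starting_deadline is not None:
        # Only look back as far as the deadline allows.
        window_start = max(last_run, now - starting_deadline)
    # The schedule stays anchored to last_run even when the window is shortened.
    total = (now - last_run) // interval
    before_window = (window_start - last_run) // interval
    return total - before_window


def can_schedule(last_run, now, interval, starting_deadline=None):
    """Return True if the number of missed start times is within the limit."""
    return missed_starts(last_run, now, interval, starting_deadline) <= MAX_MISSED


if __name__ == "__main__":
    last_run = datetime(2020, 7, 29, 9, 0)  # illustrative: last successful start
    now = datetime(2020, 8, 4, 16, 0)       # illustrative: time of the check
    hourly = timedelta(hours=1)             # hypothetical schedule interval
    print(can_schedule(last_run, now, hourly))
    print(can_schedule(last_run, now, hourly, starting_deadline=timedelta(hours=2)))
```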

Owner

I got it to work by deleting the cronjob and re-creating it. ;(

Author
Owner

Thanks!
I've opened a PR on ansible to add both activeDeadlineSeconds and startingDeadlineSeconds; once approved, it should prevent this from happening again.
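
For reference, here is a sketch of where the two fields sit in the upstream Kubernetes CronJob API, written as a Python dict that mirrors the manifest structure. It is not the content of the PR: the name, schedule, image and deadline values are hypothetical. In that API, startingDeadlineSeconds belongs to the CronJob spec, while activeDeadlineSeconds belongs to the Job spec nested under jobTemplate (the pod spec has a field of the same name as well). `find_key_paths` is a hypothetical helper that prints the resulting paths.

```python
cronjob = {
    "apiVersion": "batch/v1",  # older clusters use batch/v1beta1
    "kind": "CronJob",
    "metadata": {"name": "example-cronjob"},  # hypothetical name
    "spec": {
        "schedule": "0 * * * *",          # hypothetical schedule
        "startingDeadlineSeconds": 300,   # CronJob level: how late a run may still start
        "jobTemplate": {
            "spec": {
                "activeDeadlineSeconds": 3600,  # Job level: how long a run may stay active
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {"name": "job", "image": "example/image:latest"}  # hypothetical
                        ],
                    }
                },
            }
        },
    },
}


def find_key_paths(node, key, prefix=""):
    """Return the dotted paths at which `key` occurs in nested dicts."""
    paths = []
    if isinstance(node, dict):
        for name, value in node.items():
            path = f"{prefix}.{name}" if prefix else name
            if name == key:
                paths.append(path)
            paths.extend(find_key_paths(value, key, path))
    return paths


if __name__ == "__main__":
    for field in ("startingDeadlineSeconds", "activeDeadlineSeconds"):
        print(field, find_key_paths(cronjob, field))
```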

Author
Owner

Metadata Update from @mattia:

  • Issue status updated to: Closed (was: Open)
Reference
apps/review_stats#9