httpd on pkgs01 regularly dumps core #12670
We discussed it in this morning’s (08:15 UTC) Infra/RelEng standup call, I think @zlopez mentioned it first: the httpd process on pkgs01.rdu3 regularly dumps core (every odd minute or three):
Here’s an exemplary one:
The SIGSEGV is delivered to the offending thread, whose backtrace always seems to have this structure:
pthread_cond_timedwait@@GLIBC_2.3.2 => take_gil => … => [Python Garbage Collector related] => … => PyImport_Cleanup => Py_EndInterpreter => … => wsgi_python_child_cleanup => apr_pool_destroy
I take it as: httpd winds down a worker, mod_wsgi cleans up the Python (sub-)interpreter, and garbage collection trips over something that doesn’t exist any longer.
I don’t know why this didn’t happen before the DC move, so I can only guess that maybe we had this problem before, a configuration hot-fix was applied which didn’t end up in Ansible, so the fix was lost in the move. Or it happened and we didn’t notice?
Anyway, let’s collect more info here.
I was curious and checked on pagure02:
Perhaps comparing the differences between the src.fp.o and pagure.io instances can give some insight into how to fix it.
Metadata Update from @kevin:
@nphilipp The src.fp.o didn't produce core dumps until I added CoreDumpDirectory /tmp to the httpd dist-git conf. But you can still see the errors being reported in the httpd error log.
It seems that pkgs01.iad2 dumped core before the DC move, we just didn’t notice:
I don’t find anything in the pagure02 logs, but notice that there are only pagure0[12].vpn.* directories (neither .iad2. nor .rdu3.).
Hmm, probably also because it doesn’t have CoreDumpDirectory /tmp.
@nphilipp Is this still something we need to resolve?
We ran across this as part of backlog refinement during today's infra weekly and apparently it's still happening. Does anybody know what could be the cause? Have you had a chance to look deeper @nphilipp?
Hi @nphilipp, @patrikp and @zlopez. I looked into the stack traces provided and they point to a mod_wsgi subinterpreter teardown crash.
I've submitted a PR to force Pagure into the global application group, which is the recommended fix to stop these specific core dumps: infra/ansible#3171
Also a link to the documentation from which I got the solution:
https://modwsgi.readthedocs.io/en/master/user-guides/application-issues.html#python-simplified-gil-state-api
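For reference, the mod_wsgi directive that places an application in the global application group is WSGIApplicationGroup with the value %{GLOBAL}. Below is a minimal sketch of how that might look in an httpd config. The process group name "pagure" is a hypothetical placeholder, and the actual change in infra/ansible#3171 may differ:

```apache
# Hypothetical excerpt; "pagure" is an assumed daemon process group name,
# not taken from the real dist-git / Pagure configuration.
WSGIDaemonProcess pagure
WSGIProcessGroup pagure

# Run the application in the main Python interpreter of each process
# instead of a named sub-interpreter.
WSGIApplicationGroup %{GLOBAL}
```

With the application in the main interpreter, worker shutdown should no longer go through the sub-interpreter teardown path (Py_EndInterpreter) seen in the backtrace, which is presumably why the crashes stop.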
I merged the pr and deployed it.
I will keep an eye on it today and see how it does.
Thanks for the pr @victorkoycheff !
Always a pleasure 🫡
Let's see
I've seen 0 coredumps so far, so I think this is solved. Thanks again!