Setup new gpu01.rdu3.fedoraproject.org machine #13124
Description of request
This is a new machine in the fedora-isolated VLAN (10.16.179.x) with some GPUs, to be used for testing, building, and other work that requires GPUs.
We need to:
Coordinate with Gordon Messmer on the os/provisioning, etc.
ok. Finally we have the drac license sorted out. :)
Next we need to see how we want to move forward... The machine has a Fedora 43 install on it right now, but of course we would prefer installing from a known kickstart and then configuring with ansible.
However, we may need to sort out if all the hardware is working and how we might want to configure it.
infra/ansible#3189 is a first cut at configuration for this machine.
ok, got the firmware on the machine mostly updated (but oddly it errors when trying to automatically get the updates). I got the machine installed and ansiblized. I set up a sysadmin-gpu group with @gordonmessmer in it with access. You should be able to set up ssh per https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/sshaccess/ and then ssh gpu01.rdu3.fedoraproject.org (which will jump via the bastion server) to the machine...
@gordonmessmer can you confirm you can get in and have at least for now what you need on the machine?
Either my local configuration is incorrect, or I do not have access. I've reviewed the ssh setup instructions and I think it's correct. I have an appropriate ssh config, known_hosts, and my key is present in my fas profile.
ssh gordonmessmer@gpu01.rdu3.fedoraproject.org -v does not successfully connect. I can see ssh reach the bastion, which has a matching ED25519-CERT host cert. ssh attempts key auth, and the server accepts the key, but the connection immediately drops. I've also tried manually running
ssh -v gordonmessmer@bastion.fedoraproject.org exec nc gpu01.rdu3.fedoraproject.org 22
which exits immediately with error code 255. I think I probably don't have access to the bastion.
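For anyone hitting the same symptom, one way to narrow it down is to test each hop separately. The sketch below only prints the two commands rather than running them. The helper name hop_checks is hypothetical, and it assumes the bastion allows running a trivial command (true) and that the local OpenSSH supports ProxyJump via -J; neither is confirmed in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: print one ssh command per hop so each can be tried by hand.
# Assumption: the bastion permits running "true"; this ticket does not confirm that.
hop_checks() {
    user=$1
    bastion=$2
    target=$3
    # Hop 1: authenticate to the bastion only.
    printf 'ssh -v %s@%s true\n' "$user" "$bastion"
    # Hop 2: reach the target through the bastion using ProxyJump (-J).
    printf 'ssh -v -J %s@%s %s@%s true\n' "$user" "$bastion" "$user" "$target"
}

hop_checks gordonmessmer bastion.fedoraproject.org gpu01.rdu3.fedoraproject.org
```

If the first command already fails after the key is accepted, the problem is more likely on the bastion side (for example group membership) than in the local ssh config.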
Ah, I think I see what's going on. I didn't add the new group to fedora-contributor...
Can you try again now?
Or wait, no, that wasn't it. New theory... I needed to run the playbook on bastion to pick up the new group.
Anyhow, try now?
Yes, I have a shell open now!
Do I have access to the ansible playbooks that manage the system? Can I open PRs to suggest changes?
great!
yes, they are in our ansible repo, and you can submit PRs. The playbook is at https://forge.fedoraproject.org/infra/ansible/src/branch/main/playbooks/groups/gpu.yml. You can make a roles/gpu/ role for items specific to the host.
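As a rough sketch of what starting such a role could look like, the snippet below creates the standard Ansible role layout (roles/gpu/tasks/main.yml) in a scratch directory. The function name make_gpu_role and the placeholder task are assumptions for illustration only; nothing here reflects the actual contents of the infra ansible repo or how gpu.yml includes roles.

```shell
#!/bin/sh
# Sketch: create a minimal roles/gpu/ skeleton under a given repo root.
# The task written below is a placeholder, not real configuration from the repo.
make_gpu_role() {
    repo_root=$1
    mkdir -p "$repo_root/roles/gpu/tasks"
    cat > "$repo_root/roles/gpu/tasks/main.yml" <<'EOF'
---
# Placeholder task; replace with real host specific configuration.
- name: Placeholder for gpu host specific setup
  ansible.builtin.debug:
    msg: "gpu role applied"
EOF
}

# Build the skeleton in a scratch directory instead of a real checkout.
scratch=$(mktemp -d)
make_gpu_role "$scratch"
ls "$scratch/roles/gpu/tasks"
```

In a real checkout the argument would be the root of the ansible repo, and the role would still need to be referenced from the playbook before it has any effect.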
Also, if you like we can set it up so you can run that playbook yourself after changes. (but doing so right now requires me to change batcave01, which is frozen).
I guess we can close this now? Feel free to open a new ticket if there are any further adjustments to make...
Thanks again for your patience. It's been a long road on this pesky machine. :)