Set up new gpu01.rdu3.fedoraproject.org machine #13124

Closed
opened 2026-02-06 19:37:26 +00:00 by kevin · 9 comments
Owner

Description of request

This is a new machine in the fedora-isolated vlan (10.16.179.x) with some gpus to be used for testing/building/etc things that require gpus.

We need to:

  • Sort out its drac license (I am working with IT on this now)
  • Sort out what install is on the machine now and determine if it needs reprovisioning or not.
  • Assign it a 10.16.179.x ip address in dns
  • Get it an external ip and nat in ssh to it
  • Add external ip to dns
  • Set up a group to allow ssh/sudo access to the machine.

Coordinate with Gordon Messmer on the os/provisioning, etc.

kevin self-assigned this 2026-02-26 22:52:03 +00:00
Author
Owner

ok. Finally we have the drac license sorted out. :)

Next we need to see how we want to move forward... The machine has a fedora 43 install on it right now, but of course we would prefer installing from a known kickstart and then configure with ansible.

However, we may need to sort out if all the hardware is working and how we might want to configure it.

Author
Owner

infra/ansible#3189 (https://forge.fedoraproject.org/infra/ansible/pulls/3189) is a first cut at configuration for this machine.
Author
Owner

ok, got the firmware on the machine mostly updated (but oddly it errors when trying to fetch the updates automatically). I got the machine installed and ansiblized. I set up a sysadmin-gpu group with @gordonmessmer in it with access. You should be able to set up ssh per https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/sshaccess/ and then ssh gpu01.rdu3.fedoraproject.org (which will jump via the bastion server) to the machine...
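
For anyone following along, here is a minimal sketch of the client-side config this implies. It assumes OpenSSH's ProxyJump directive handles the hop through bastion.fedoraproject.org, and FAS_USERNAME is a placeholder for your own account name. This is an illustration only; the linked sshaccess guide remains the authoritative version.

```
# Hypothetical ~/.ssh/config sketch (not copied from the sysadmin guide).
# Replace FAS_USERNAME with your own account name.
Host bastion.fedoraproject.org
    User FAS_USERNAME

Host gpu01.rdu3.fedoraproject.org
    User FAS_USERNAME
    # Reach the isolated-vlan host by hopping through the bastion first.
    ProxyJump bastion.fedoraproject.org
```

With something like this in place, a plain ssh gpu01.rdu3.fedoraproject.org should route through the bastion without extra flags.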

@gordonmessmer can you confirm you can get in and that you have what you need on the machine, at least for now?


Either my local configuration is incorrect, or I do not have access. I've reviewed the ssh setup instructions and I think it's correct. I have an appropriate ssh config, known_hosts, and my key is present in my fas profile.

`ssh gordonmessmer@gpu01.rdu3.fedoraproject.org -v` does not successfully connect. I can see ssh reach the bastion, which has a matching ED25519-CERT host cert. ssh attempts key auth, and the server accepts the key, but the connection immediately drops.

debug1: Offering public key: /home/gmessmer/.ssh/id_ed25519 ED25519 SHA256:4S5gjIIunWkOjFKuNQAAYpU7UlnaKVuzJ1QAY8enlFU agent
debug1: Server accepts key: /home/gmessmer/.ssh/id_ed25519 ED25519 SHA256:4S5gjIIunWkOjFKuNQAAYpU7UlnaKVuzJ1QAY8enlFU agent
Connection closed by 2620:52:6:1121:bead:cafe:feed:fed1 port 22
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

I've also tried manually running `ssh -v gordonmessmer@bastion.fedoraproject.org exec nc gpu01.rdu3.fedoraproject.org 22`, which exits immediately with error code 255.

I think I probably don't have access to the bastion.

Author
Owner

Ah, I think I see what's going on. I didn't add the new group to fedora-contributor...

Can you try again now?

Author
Owner

Or wait, no, that wasn't it. New theory... I needed to run the playbook on bastion to pick up the new group.
Anyhow, try now?


Yes, I have a shell open now!

Do I have access to the ansible playbooks that manage the system? Can I open PRs to suggest changes?

Author
Owner

great!

yes, they are in our ansible repo, and you can submit PRs. The playbook is at https://forge.fedoraproject.org/infra/ansible/src/branch/main/playbooks/groups/gpu.yml. You can make a roles/gpu/ role for specific items for the host.

Also, if you like, we can set it up so you can run that playbook yourself after changes (but doing so right now requires me to change batcave01, which is frozen).

Author
Owner

I guess we can close this now? Feel free to open a new ticket if there are any further adjustments to make...

Thanks again for your patience. It's been a long road on this pesky machine. :)

kevin closed this issue 2026-03-06 18:26:31 +00:00
Reference
infra/tickets#13124