docker-slim type of forgejo runner #568
4 participants
Depends on
#578 Add 'lightweight' option to forge runner
forge/forge
Reference
forge/forge#568
Summary
Provide a lightweight type of Forgejo runner
Details
The main issue here is to have a very minimal runner type that has almost no dependencies (only docker or nodejs) for workflows that are just wrapping around `testing-farm` and equivalent actions that do not consume much CPU and just do occasional curl operations.
One particular usage I noticed is that right now the workflows do not parallelize well, e.g. in this run using `docker` the jobs were basically sequential.
That sounds like a very good idea and would make using the testing-farm/tmt action significantly more interesting.
How small would the ideal image be?
node:20-alpine (~180MB) or better alpine:latest with ~7MB without node, curl or docker (to be added in the workflow)?
Minimum would be `node` because that is needed for forgejo actions right now. Docker on top would also be good, but if it complicates deployment and sharing, then I would rather it be `node`-only.
Indeed, we should set a high limit for parallel jobs for this runner (maybe with small CPU & memory limits) to be able to run a lot of testing farm jobs in parallel.
Ok, there is a `node:22-alpine` minimal image under label `docker-slim` available on your org.
Can you also enable it for `ci` so that I can also experiment with it?
`docker-slim` is there for `ci` org as well now.
@lenkaseg wrote in #568 (comment):
This doesn't seem to do what the main proposal here is aiming for: the jobs are still getting stuck trying to get a runner, see https://forge.fedoraproject.org/ci/_fedora-kiwi-descriptions/actions/runs/2/jobs/0/attempt/1. It seems some coordination is missing to allow these to run in a more "oversubscribed" way.
(1h and it did not start)
Yeah, I noticed when testing on staging that the runner definition was misformatted.
Ok, I fixed the multiple labels problem for staging, but there's something wrong with the runner configs on production. Unfortunately I don't have access to the place where the runners are defined and we're in the evening before a PTO.
I decided to create another runner for you manually, with the fixed labels, so you're not blocked on testing the composes. I see that it picks jobs correctly. Since it is a manual thing, it will be removed next time someone runs the runner automation, but hopefully you can do some testing before that happens.
Thanks. This appears to be working now in atomic-desktops/config#757 (runs in https://forge.fedoraproject.org/ci/_atomic-desktops-config/actions).
OK, this no longer works; I have a stalled job: https://forge.fedoraproject.org/atomic-desktops/config/actions/runs/78/jobs/0/attempt/1
I know, we did some maintenance and the labels got messed up again. I have a fix, but need someone to merge: infra/ansible#3377
Ok, fixed.
Is there something more to be done on this issue?
@lenkaseg wrote in #568 (comment):
Yes, adding more nodes or oversubscription for these tags. Jobs can end up in quite a long queue waiting for a node.
Currently you have two nodes for 'docker-slim' under the 'ci' org.
Pull request under way to provide you with capacity of 20 concurrent jobs on ci-1 runner: infra/ansible#3381/files
Is your request for the ci org only, or for atomic-desktops as well?
I was hoping that it can be more generic for other users who pop up with a similar use case. But it can be handled case by case as they request it. Could you maybe make an issue that I can link in https://forge.fedoraproject.org/ci/testing-farm/ so that it is easier to request similar machines when it comes to it?
But yeah, should be enabled for both of the orgs for now at least
Both of the orgs have docker-slim option and the capacity increase will happen when the PR is merged.
By 'make an issue' you mean document the runner options/flavours we offer? Like in a documentation or in a template?
We have an issue template for requesting runners: https://forge.fedoraproject.org/forge/forge/issues/new?template=.forgejo%2fissue_template%2fnew_runner.yml
@lenkaseg wrote in #568 (comment):
Template approach is good too, but adding a new type in the `Resource Requirements` or concurrency or such. Any approach that I can document on the repo there.
Cool, will try to do some stress-testing of it. Do you have an idea of what to expect if the concurrency is too high or the action is not optimized well enough?
I don't know where the issue is but the `docker-slim` jobs are still throttled in the atomic-desktops repo, i.e. I cannot get multiple parallel runs for different PRs.
I think there is some confusion about what we would like to do here.
We would like to have a runner with the "docker-slim" label (or maybe "testing-farm" would be more explicit) that is available to all orgs by default and that has a very high concurrency setting.
Reading https://code.forgejo.org/forgejo/runner/src/branch/main/internal/pkg/config/config.example.yaml, I'm afraid that we cannot set resource restrictions on jobs (see the GitLab runner for an example: https://docs.gitlab.com/runner/configuration/advanced-configuration/#the-runnersdocker-section), so unfortunately it would not be a good idea to make such a runner global.
From https://forge.fedoraproject.org/infra/ansible/src/branch/main/roles/openshift-apps/forgejo/runners/production/atomic-desktops-1.yml, I see a single runner has both labels set where instead it should be two distinct runners.
So what I think we can do instead:
I've made infra/ansible#3385
I've tried it on https://forge.fedoraproject.org/ci/_fedora-kiwi-descriptions/actions/runs/2/jobs/0/attempt/3, but I only got a concurrency of 3 there, and there are no other runners running in the whole org. It seems that `capacity` is not being picked up at all.
I see one problem, and that is the shared labels. I will change the labels of the ci-2 runner so it is not random which runner picks up a job, since they have different configs.
ci-1 is currently on capacity: 1 and ci-2 on capacity: 2. So 3 parallel jobs seems to be exactly correct.
I will fix the labels. For the concurrency, there will be delays, as we probably have to perform heavy load tests before we allow a capacity of more than 4. Sorry for that, we're still in the process of establishing the runner policy. I'll keep you updated about the progress and decisions.
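For readers following the arithmetic above, here is a minimal sketch of why 3 parallel jobs is the expected ceiling. It assumes that both ci-1 and ci-2 advertise the same `docker-slim` label and that the number of jobs that can run at once for a label is simply the sum of the capacities of the runners carrying it. The dictionary layout and the `max_parallel_jobs` helper are hypothetical and only for illustration; they are not part of the Forgejo runner.

```python
from collections import defaultdict

# Capacities as reported in the thread (ci-1: 1, ci-2: 2).
# Assumption: both runners share the "docker-slim" label.
# This data layout is hypothetical, not a Forgejo runner format.
runners = [
    {"name": "ci-1", "labels": {"docker-slim"}, "capacity": 1},
    {"name": "ci-2", "labels": {"docker-slim"}, "capacity": 2},
]


def max_parallel_jobs(runners):
    """Sum the capacity of every runner that advertises each label."""
    totals = defaultdict(int)
    for runner in runners:
        for label in runner["labels"]:
            totals[label] += runner["capacity"]
    return dict(totals)


print(max_parallel_jobs(runners))  # {'docker-slim': 3}
```

Under the same assumption, raising ci-1 to the requested capacity of 20 would give a ceiling of 22 for that label, which is why the observed limit only changes once the new capacity is actually applied.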
@siosm wrote in #568 (comment):
Yes, that is correct. The capacity of 20 has not been applied yet, so for the moment it stays on 1. Sorry for that. I'll get back once I have more info.
@siosm wrote in #568 (comment):
That sounds pretty reasonable to me.
About the limit, I think there's a way of specifying cpu and memory use per runner: https://code.forgejo.org/forgejo/runner/issues/551 Let's see if it works.
`atomic-desktops` and `ci` orgs now have a new runner available designed for the testing farm jobs, with the following specs:
`testing-farm`
The current policy is to provide no global runners, but there will be an option in the runner issue template to request the testing-farm runner.
Also, the `docker-slim` label was dropped from the `atomic-desktops-1` runner. @lecris I suppose you don't need the `docker-slim` label on a regular runner either now, since there's a dedicated testing-farm runner.
Note: regular runners (label docker, docker-slim) have been bumped to 4 concurrent jobs.
Please let me know if this works.
`testing-farm` runner #9
@lenkaseg wrote in #568 (comment):
Yes, would not be needed.
Ok, the concurrency seems to be working: https://forge.fedoraproject.org/ci/_fedora-kiwi-descriptions/actions/runs/3/jobs/0/attempt/1. We seem to be getting rate limited by Docker on that, but that is mainly an issue of how the action is implemented and the CI there running the setup step at the same time. I suspect it is a warm-up issue, will check on it later.
Still consistently rate limited. I re-ran one of the other jobs on the other runner and it goes through that step almost instantly, as if it has that image already cached. @lenkaseg any idea how this thing works? The issue is with the first image that it tries to pull, `node:22-alpine`.
I also do not know how the whole forgejo action for container actions works. It seems to cache it based on how fast it goes; the logs do not tell much about that.
I guess having a node installed is kinda essential, right?
In that case...checking how to cache the image or mirror it to quay, ...
Ok, try now please! I mirrored the image from docker hub to quay.
Awesome, works like a charm. The only thing left is to document this ci/testing-farm#9 which is waiting on the template update:
options:
- Default
- Large (more CPU, memory, and disk space)
Neat work folks! I was not following this, looks very nice. Do not forget to add `--skip-guest-setup` to get the VM up and running in 1.5 minutes for the best experience (if you do not need artifact installation and the Fedora CI environment). This will later become the default :)
Also, our bare metal hosts now scale a lot better; for that use `--hardware virtualization.is-virtualized=false`