
[EVENT] Openscapes, large single day event, March 14th #3743

Closed · 3 of 11 tasks
sgibson91 opened this issue Feb 27, 2024 · 12 comments
@sgibson91 (Member) commented Feb 27, 2024

The link to the Freshdesk ticket where this event was reported

https://2i2c.freshdesk.com/a/tickets/1344

The GitHub handle or name of the community representative

Brianna Lind, [email protected]

The date when the event will start

March 14, 2024

The date when the event will end

March 14, 2024

What hours of the day will participants be active? (e.g., 5am - 5pm US/Pacific)

12:00pm PST (https://arewemeetingyet.com/Los%20Angeles/2024-03-14/12:00)

Are we three weeks before the start date of the event?

  • Yes (it was when the ticket was opened, but not by the time I transferred it to an issue)
  • No

Number of attendees

230, expecting 16GB usage each, maybe more

Make sure to add the event into the calendar

  • Done

Does the hub already exist?

  • Yes
  • No

The URL of the hub that will be used for the event

https://openscapes.2i2c.cloud

Will this hub be decommissioned after the event is over?

  • Yes
  • No

Was all the info filled in above?

  • Yes
  • No

Quotas from the cloud provider are high-enough to handle expected usage?

  • Yes
  • No
@sgibson91 (Member, Author) commented:

From the freshdesk ticket:

  • Date: March 14, 2024
  • Time: 12:00 pm PST
  • Language: Python
  • No. participants: 230
  • Expected usage: 16 GB per participant, possibly slightly larger

Do you think there is anything specific that needs to be done to facilitate this? We'd also like to request that, during the workshop, files in the shared folder be made read-only on the hub.
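
(For reference, making a shared folder read-only is typically a storage setting in the hub's z2jh config. A minimal sketch of the kind of change involved, not Openscapes' actual config; the volume name and subPath below are assumptions:)

singleuser:
  storage:
    extraVolumeMounts:
      - name: home                      # assumes the shared dir lives on the home NFS volume
        mountPath: /home/jovyan/shared
        subPath: _shared                # assumed subPath for the shared dir
        readOnly: true                  # participants can read but not write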

@sgibson91 (Member, Author) commented Feb 27, 2024

Assignee can see if this guide is helpful: https://infrastructure.2i2c.org/howto/prepare-for-events/event-prep

@consideRatio consideRatio changed the title [EVENT] Openscapes [EVENT] Openscapes single day event, March 14th Feb 28, 2024
@consideRatio consideRatio changed the title [EVENT] Openscapes single day event, March 14th [EVENT] Openscapes, large single day event, March 14th Feb 28, 2024
@consideRatio (Contributor) commented:

I'm not completing this or assigning myself, but note:

  • Their resource allocation choices have requests == limits, so users don't share RAM (safe, but expensive).
  • The resource allocation configuration provides the following choices; none of them maps directly to "16 GB" or "slightly more":
                    mem_14_8:
                      display_name: 14.8 GB RAM, upto 3.7 CPUs
                      kubespawner_override:
                        mem_guarantee: 15941615616
                        mem_limit: 15941615616
                        cpu_guarantee: 1.875
                        cpu_limit: 3.75
                        node_selector:
                          node.kubernetes.io/instance-type: r5.xlarge
                    mem_29_7:
                      display_name: 29.7 GB RAM, upto 3.7 CPUs
                      kubespawner_override:
                        mem_guarantee: 31883231232
                        mem_limit: 31883231232
                        cpu_guarantee: 3.75
                        cpu_limit: 3.75
                        node_selector:
                          node.kubernetes.io/instance-type: r5.xlarge
    
  • The r5.xlarge is a 4 CPU / 32 GB node, so if they show up with 230 users, ~115 or ~230 nodes will start up, depending on which option is picked. This would be a terrible user experience, and could make us run into trouble with quotas etc. as well!

I think it's very relevant that they are put on larger nodes that can each house at least a few tens of users. I think using ~64 CPU / 512 GB nodes (r5.16xlarge) makes sense; each would house ~32 users requesting ~16 GB.
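
Spelling out that packing arithmetic (rough numbers, ignoring per-node system overhead):

$$
\left\lfloor \tfrac{32\,\mathrm{GB}}{16\,\mathrm{GB}} \right\rfloor = 2 \ \text{users per r5.xlarge} \Rightarrow \left\lceil \tfrac{230}{2} \right\rceil = 115 \ \text{nodes},
\qquad
\left\lfloor \tfrac{512\,\mathrm{GB}}{16\,\mathrm{GB}} \right\rfloor = 32 \ \text{users per r5.16xlarge} \Rightarrow \left\lceil \tfrac{230}{32} \right\rceil = 8 \ \text{nodes}
$$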

@sgibson91 sgibson91 self-assigned this Mar 7, 2024
@sgibson91 (Member, Author) commented:

According to Openscapes' eksctl config, they already have an r5.16xlarge nodepool available, so I think this is just about exposing it via the profile list:

{ instanceType: "r5.16xlarge" },

Doing a search for r5.16xlarge in the common config file does not yield any results.
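
For context, the actual config is templated, but in a plain eksctl ClusterConfig such a nodegroup would look roughly like the sketch below (cluster name, region, and size caps are illustrative, not Openscapes' real values):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: openscapeshub   # illustrative name
  region: us-west-2     # illustrative region
nodeGroups:
  - name: nb-r5-16xlarge
    instanceType: r5.16xlarge
    minSize: 0          # scaled up before the event, back down after
    maxSize: 16         # illustrative cap
    labels:
      hub.jupyter.org/node-purpose: user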

My plan for this event is therefore:

  • Add a profile list option exposing the r5.16xlarge node type with appropriate resource requests (see the sketch after this list)
  • Scale up the nodepool in the European morning on the day of the event (to at least one node)
  • Scale down the nodepool the day after the event
  • Optionally, remove the profile list option. I will check with the hub champion.
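
A minimal sketch of the kind of profile entry I mean, reusing the numbers from the existing mem_14_8 choice but pinned to the larger node type (the CPU limit here is a placeholder until properly generated):

mem_14_8:
  display_name: 14.8 GB RAM, upto 63 CPUs
  kubespawner_override:
    mem_guarantee: 15941615616
    mem_limit: 15941615616
    cpu_guarantee: 1.875
    cpu_limit: 63       # placeholder; to be generated for r5.16xlarge
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge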

@sgibson91 (Member, Author) commented Mar 11, 2024

Passing r5.16xlarge to the resource allocation script, and asking it to generate 10 examples, produces the following profile list options, which jump from 15.3 GB straight to 30.6 GB. So close! I'm also unsure what "upto X CPUs" means: will one user get, e.g., 63 CPUs?

mem_15_3:
  display_name: 15.3 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 16437845376
    mem_limit: 16437845376
    cpu_guarantee: 1.9875
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_30_6:
  display_name: 30.6 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 32875690752
    mem_limit: 32875690752
    cpu_guarantee: 3.975
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge

I think I need to adjust the strategy the script is using, but I don't know how to find out which options exist or what they do. There is not much info in either the help string or the documentation:

$ deployer generate resource-allocation choices --help
                                                                                                                                                                                    
 Usage: deployer generate resource-allocation choices [OPTIONS] INSTANCE_TYPE                                                                                                       
                                                                                                                                                                                    
 Generate a custom number of resource allocation choices for a certain instance type, depending on a certain chosen strategy.                                                       
                                                                                                                                                                                    
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    instance_type      TEXT  Instance type to generate Resource Allocation options for [default: None] [required]                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --num-allocations        INTEGER                         Number of choices to generate [default: 5]                                                                              │
│ --strategy               [proportional-memory-strategy]  Strategy to use for generating resource allocation choices choices                                                      │
│                                                          [default: ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY]                                                    │
│ --help                                                   Show this message and exit.                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
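
My reading of the generated numbers (an inference from the output above, not something the docs confirm) is that the proportional-memory strategy repeatedly halves the node's allocatable memory, so no choice can land exactly on 16 GB:

# Inferred pattern for r5.16xlarge (~489.6 GB allocatable):
# mem_guarantee ≈ allocatable / 2^k
# k=5 -> 15.3 GB  (the generated mem_15_3)
# k=4 -> 30.6 GB  (the generated mem_30_6)
# k=3 -> 61.2 GB  (next step up; still nothing near 16 GB)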

@yuvipanda (Member) commented:

@jmunroe and I are participating in an Openscapes organizing event today (NASA-Openscapes/2i2cAccessPolicies#7). I'll bring this up and try to set up a process for handling this.

@yuvipanda yuvipanda self-assigned this Mar 11, 2024
@yuvipanda (Member) commented Mar 11, 2024

We had a lot of good conversations at the meeting, and I'll be opening further issues with information. But as far as this workshop / issue is concerned: after those conversations, they're going to instruct users to use the existing 14.8 GB RAM profile. Their 16 GB was just an estimate, and we've identified 'how do you figure out how many resources you need?' as something that needs more guidance.

As for action items on this particular workshop for you to take, @sgibson91:

I don't think you need to pre-warm the cluster by increasing node pool sizes or similar. There's also the question of having someone monitor during the event (given time zones, that's me, haha); I'll figure out an answer to that.

There are additional action items for me here to change documentation and improve some of the process, but I believe this should unblock you.

@sgibson91 (Member, Author) commented Mar 12, 2024

I have opened the PR addressing the first point here: #3792

Do you think merging in the European morning on the day of the workshop is fine? Or shall I do it last thing tomorrow so folks have a day to test?

I assume this PR should be reverted after the workshop also.

@sgibson91 (Member, Author) commented Mar 12, 2024

The Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances quota is currently 1360 vCPUs, which is comfortably more than the expected 512 CPUs plus 20% headroom, so there's no action to take on the quota.

[Screenshot of the AWS Service Quotas console, 2024-03-12. Fields from L-R: Quota name, Applied account-level quota value, AWS default quota value, Adjustability.]
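
Spelling out the headroom check (assuming ~8 fully packed r5.16xlarge nodes at 64 vCPUs each):

$$
8 \times 64 = 512 \ \text{vCPUs}, \qquad 512 \times 1.2 = 614.4 < 1360
$$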

@consideRatio (Contributor) commented:

Do you think merging in the AM European time on the day of the workshop is fine? Or shall I do it last thing tomorrow so folk have a day to test?

I think merging it ahead of time is suitable, as we reduce the risk of last-minute issues and help people test, etc. It can make sense to merge it already; but if it were more than a full week ahead of time, that would perhaps have been too early.

@sgibson91 (Member, Author) commented Mar 15, 2024

I put up a PR to revert the one that moved all profiles to the r5.16xlarge machines: #3804. I also asked Brianna in the Freshdesk ticket to confirm that they want this reversion.

UPDATE: Brianna confirmed the reversion

@sgibson91 (Member, Author) commented:

Another event done!

@github-project-automation github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Mar 15, 2024