
[EVENT] Openscapes, large single day event, March 14th #3743

Closed · 3 of 11 tasks
sgibson91 opened this issue Feb 27, 2024 · 12 comments
@sgibson91 (Member) commented Feb 27, 2024

The link to the Freshdesk ticket where this event was reported

https://2i2c.freshdesk.com/a/tickets/1344

The GitHub handle or name of the community representative

Brianna Lind, [email protected]

The date when the event will start

March 14, 2024

The date when the event will end

March 14, 2024

What hours of the day will participants be active? (e.g., 5am - 5pm US/Pacific)

12:00pm PST (https://arewemeetingyet.com/Los%20Angeles/2024-03-14/12:00)

Are we three weeks before the start date of the event?

  • Yes (it was when the ticket was opened, but not by the time I transferred it to an issue)
  • No

Number of attendees

230, expecting 16GB usage each, maybe more

Make sure to add the event into the calendar

  • Done

Does the hub already exist?

  • Yes
  • No

The URL of the hub that will be used for the event

https://openscapes.2i2c.cloud

Will this hub be decommissioned after the event is over?

  • Yes
  • No

Was all the info filled in above?

  • Yes
  • No

Quotas from the cloud provider are high-enough to handle expected usage?

  • Yes
  • No
@sgibson91 (Member, Author) commented:

From the freshdesk ticket:

  • Date: March 14, 2024
  • Time: 12:00 pm PST
  • Language: Python
  • No. participants: 230
  • Expected usage: 16 GB per participant, possibly slightly larger

Do you think there is anything specific that needs to be done to facilitate this? We'd also like to request that, during the workshop, files in the shared folder be made read-only on the hub.
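
(For reference, making a shared folder read-only is typically a storage setting in the hub's z2jh config. A minimal sketch of the kind of change involved, not Openscapes' actual config; the volume name and subPath below are assumptions:)

singleuser:
  storage:
    extraVolumeMounts:
      - name: home                      # assumes the shared dir lives on the home NFS volume
        mountPath: /home/jovyan/shared
        subPath: _shared                # assumed subPath for the shared dir
        readOnly: true                  # participants can read but not write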

@sgibson91 (Member, Author) commented Feb 27, 2024

Assignee can see if this guide is helpful: https://infrastructure.2i2c.org/howto/prepare-for-events/event-prep

@consideRatio consideRatio changed the title [EVENT] Openscapes [EVENT] Openscapes single day event, March 14th Feb 28, 2024
@consideRatio consideRatio changed the title [EVENT] Openscapes single day event, March 14th [EVENT] Openscapes, large single day event, March 14th Feb 28, 2024
@consideRatio (Contributor) commented:

I'm not completing this or assigning myself, but note:

  • Their resource allocation choices have requests == limits, so users don't share RAM (safe, but expensive).
  • The resource allocation configuration provides the following choices; none of them maps directly to "16 GB" or "slightly more":
                    mem_14_8:
                      display_name: 14.8 GB RAM, upto 3.7 CPUs
                      kubespawner_override:
                        mem_guarantee: 15941615616
                        mem_limit: 15941615616
                        cpu_guarantee: 1.875
                        cpu_limit: 3.75
                        node_selector:
                          node.kubernetes.io/instance-type: r5.xlarge
                    mem_29_7:
                      display_name: 29.7 GB RAM, upto 3.7 CPUs
                      kubespawner_override:
                        mem_guarantee: 31883231232
                        mem_limit: 31883231232
                        cpu_guarantee: 3.75
                        cpu_limit: 3.75
                        node_selector:
                          node.kubernetes.io/instance-type: r5.xlarge
    
  • The r5.xlarge is a 4 CPU / 32 GB node, so if they show up with 230 users, ~115 or ~230 nodes will start up, depending on which option is picked. This would be a terrible user experience, and could make us run into trouble with quotas etc. as well!

I think it's very relevant that they are put on larger nodes that can each house at least a few tens of users. I think using ~64 CPU / 512 GB nodes (r5.16xlarge) makes sense; each would house ~32 users requesting ~16 GB.
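
Spelling out that packing arithmetic (rough numbers, ignoring per-node system overhead):

$$
\left\lfloor \tfrac{32\,\mathrm{GB}}{16\,\mathrm{GB}} \right\rfloor = 2 \ \text{users per r5.xlarge} \Rightarrow \left\lceil \tfrac{230}{2} \right\rceil = 115 \ \text{nodes},
\qquad
\left\lfloor \tfrac{512\,\mathrm{GB}}{16\,\mathrm{GB}} \right\rfloor = 32 \ \text{users per r5.16xlarge} \Rightarrow \left\lceil \tfrac{230}{32} \right\rceil = 8 \ \text{nodes}
$$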

@sgibson91 sgibson91 self-assigned this Mar 7, 2024
@sgibson91 (Member, Author) commented:

According to Openscapes' eksctl config, they already have an r5.16xlarge nodepool available, so I think this is just about exposing it via the profile list:

{ instanceType: "r5.16xlarge" },

Doing a search for r5.16xlarge in the common config file does not yield any results.
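
For context, the actual config is templated, but in a plain eksctl ClusterConfig such a nodegroup would look roughly like the sketch below (cluster name, region, and size caps are illustrative, not Openscapes' real values):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: openscapeshub   # illustrative name
  region: us-west-2     # illustrative region
nodeGroups:
  - name: nb-r5-16xlarge
    instanceType: r5.16xlarge
    minSize: 0          # scaled up before the event, back down after
    maxSize: 16         # illustrative cap
    labels:
      hub.jupyter.org/node-purpose: user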

My plan for this event is therefore:

  • Add a profile list option exposing the r5.16xlarge node type with appropriate resource requests (see the sketch after this list)
  • Scale up the nodepool in the European morning on the day of the event (to at least one node)
  • Scale down the nodepool the day after the event
  • Optionally, remove the profile list option. I will check with the hub champion.
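
A minimal sketch of the kind of profile entry I mean, reusing the numbers from the existing mem_14_8 choice but pinned to the larger node type (the CPU limit here is a placeholder until properly generated):

mem_14_8:
  display_name: 14.8 GB RAM, upto 63 CPUs
  kubespawner_override:
    mem_guarantee: 15941615616
    mem_limit: 15941615616
    cpu_guarantee: 1.875
    cpu_limit: 63       # placeholder; to be generated for r5.16xlarge
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge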

@sgibson91 (Member, Author) commented Mar 11, 2024

Passing r5.16xlarge to the resource allocation script, and asking it to generate 10 examples, produces the following profile list options, which jump from 15.3 GB straight to 30.6 GB. So close! I'm also unsure what "upto X CPUs" means: will one user get, e.g., 63 CPUs?

mem_15_3:
  display_name: 15.3 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 16437845376
    mem_limit: 16437845376
    cpu_guarantee: 1.9875
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_30_6:
  display_name: 30.6 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 32875690752
    mem_limit: 32875690752
    cpu_guarantee: 3.975
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge

I think I need to adjust the strategy the script is using, but I don't know how to find out which options exist or what they do. There is not much info in either the help string or the documentation:

$ deployer generate resource-allocation choices --help
                                                                                                                                                                                    
 Usage: deployer generate resource-allocation choices [OPTIONS] INSTANCE_TYPE                                                                                                       
                                                                                                                                                                                    
 Generate a custom number of resource allocation choices for a certain instance type, depending on a certain chosen strategy.                                                       
                                                                                                                                                                                    
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    instance_type      TEXT  Instance type to generate Resource Allocation options for [default: None] [required]                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --num-allocations        INTEGER                         Number of choices to generate [default: 5]                                                                              │
│ --strategy               [proportional-memory-strategy]  Strategy to use for generating resource allocation choices choices                                                      │
│                                                          [default: ResourceAllocationStrategies.PROPORTIONAL_MEMORY_STRATEGY]                                                    │
│ --help                                                   Show this message and exit.                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
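
My reading of the generated numbers (an inference from the output above, not something the docs confirm) is that the proportional-memory strategy repeatedly halves the node's allocatable memory, so no choice can land exactly on 16 GB:

# Inferred pattern for r5.16xlarge (~489.6 GB allocatable):
# mem_guarantee ≈ allocatable / 2^k
# k=5 -> 15.3 GB  (the generated mem_15_3)
# k=4 -> 30.6 GB  (the generated mem_30_6)
# k=3 -> 61.2 GB  (next step up; still nothing near 16 GB)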

@yuvipanda (Member) commented:

@jmunroe and I are participating in an Openscapes organizing event today (NASA-Openscapes/2i2cAccessPolicies#7). I'll bring this up and try to set up a process for handling this.

@yuvipanda yuvipanda self-assigned this Mar 11, 2024
@yuvipanda (Member) commented Mar 11, 2024

We had a lot of good conversations at the meeting, and I'll be opening further issues with information. But as far as this workshop / issue is concerned: after those conversations, they're going to instruct users to use the existing 14.8 GB RAM profile. Their 16 GB was just an estimate, and we've identified 'how do you figure out how many resources you need?' as something that needs more guidance.

As for action items on this particular workshop for you to take, @sgibson91:

I don't think you need to pre-warm the cluster by increasing node pool sizes or similar. There's also the question of having someone monitor during the event (given time zones, that's me, haha); I'll figure out an answer to that.

There are additional action items for me here to change documentation and improve some of the process, but I believe this should unblock you.

@sgibson91 (Member, Author) commented Mar 12, 2024

I have opened the PR addressing the first point here: #3792

Do you think merging in the European morning on the day of the workshop is fine? Or shall I do it last thing tomorrow so folks have a day to test?

I assume this PR should be reverted after the workshop also.

@sgibson91 (Member, Author) commented Mar 12, 2024

The Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances quota is currently 1360 vCPUs, which is comfortably more than the expected 512 CPUs plus 20% headroom, so there's no action to take on the quota.

[Screenshot of the AWS Service Quotas console, 2024-03-12. Fields from L-R: Quota name, Applied account-level quota value, AWS default quota value, Adjustability.]
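
Spelling out the headroom check (assuming ~8 fully packed r5.16xlarge nodes at 64 vCPUs each):

$$
8 \times 64 = 512 \ \text{vCPUs}, \qquad 512 \times 1.2 = 614.4 < 1360
$$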

@consideRatio (Contributor) commented:

Do you think merging in the AM European time on the day of the workshop is fine? Or shall I do it last thing tomorrow so folk have a day to test?

I think merging it ahead of time is suitable, as we reduce the risk of last-minute issues and help people test, etc. It can make sense to merge it already; but if it were more than a full week ahead of time, that would perhaps have been too early.

@sgibson91 (Member, Author) commented Mar 15, 2024

I put up a PR to revert the one that moved all profiles to the r5.16xlarge machines: #3804. I also asked Brianna in the Freshdesk ticket to confirm that they want this reversion.

UPDATE: Brianna confirmed the reversion

@sgibson91 (Member, Author) commented:

Another event done!

@github-project-automation github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Mar 15, 2024