+ The ALCF provides users with access to supercomputing resources that are significantly more powerful than systems typically used for open scientific research.
+
+ The ALCF is committed to providing training and outreach opportunities that prepare researchers to efficiently use its leadership computing systems, while also cultivating a diverse and skilled HPC workforce for the future.
+
+ The Argonne Leadership Computing Facility enables breakthroughs in science and engineering by providing supercomputing resources and expertise to the research community.
+
I am unable to sign in to the Accounts website. What do I do?
+
Only users with active ALCF accounts can sign in to the Account and Project Management website. If you have an active account, verify that you are using the correct ALCF username. Note that username is case-sensitive. If you forgot your username, contact support@alcf.anl.gov. For passcode token issues, please review the troubleshooting information on this page: Passcode Tokens.
If you never had an ALCF account, please apply for one here: https://my.alcf.anl.gov/accounts/#/accountRequest. Note that all ALCF accounts must be associated with a project with an active allocation.
+
How do I request a new project/allocation?
+
There are 3 allocation opportunities at ALCF. Please see How to Get an Allocation on how to get time on our systems.
+
Who do I contact if my Discretionary Project Allocation expires or if I need to request additional hours?
+
To request an extension of your existing discretionary allocation or to request additional hours, please email support@alcf.anl.gov with answers to the following or fill out the form at request an extension/additional hours:
- What have you accomplished with your original allocation?
  - Please include a brief description of any publications or major presentations that were (or will be) generated in full or in part because of this allocation.
- What will you do with the extra time?
- What are you requesting as your new expiration date?
- How many additional hours are you requesting?
+
How do I join a project?
+
To join a project, please go to https://my.alcf.anl.gov/, then click "join a project". Once there, scroll down to the project you want to join and click on it. At the bottom of the next page, please click on the "Request Membership" button. Once we receive approval from the PI regarding your membership request, we will provide you with access to the necessary resources.
+
How do I request a reservation?
+
Reservation requests must include information detailed here:
+
+
Machine Reservations: Please email the completed reservation request to support@alcf.anl.gov. We will contact you after your request is reviewed by our reservations committee.
+
+
How do I apply for a new account?
+
Note: All ALCF accounts must be associated with an allocated project. To apply, please visit https://my.alcf.anl.gov/accounts/#/accountRequest.

What do I do when I receive an email saying my account is about to expire?

Please forward your account expiry email to your Sponsor. As soon as we receive an approval email from your Sponsor, we'll proceed with your account renewal.
+
What do I do when I receive a warning that my 593 has expired / is about to expire?
+
If you are planning to extend this assignment/computer user account, please let us know, so a new 593 (Foreign Visit & Assignment Request form) will be filed for you using the information from before. In case any other documents are needed from your end, you'll be contacted as necessary. In order to allow sufficient time for an indices check, it is recommended that your response be submitted as soon as possible.
+
If you are not planning to extend your account, also let us know so that we may close out your records.
+
Please note: An account can be associated with a single token only (Mobile or Physical token). Please contact accounts@alcf.anl.gov to change your token preference.
+
Mobile Token
+
The SafeNet MobilePass+ Mobile Token allows access to ALCF systems. This security mobile token uses one-time passwords combined with your PIN for controlled access to the login systems. The mobile token uses an app on your Android, iPhone, or Windows mobile device that is keyed to your user account and for which you are responsible. Please safeguard your phone as you would your credit cards or house keys: Do not store your username, PIN, or other account-related records with the token. Sharing of mobile tokens is strictly forbidden. A mobile token can be associated with a single device only.
+
Step 1. Download the SafeNet MobilePass+ app for your device:
+
The SafeNet MobilePASS+ app turns your device into a two-factor authentication device, removing the need to carry an additional hardware token. As a SafeNet MobilePASS+ user, you can generate passcodes on your mobile device and use those passcodes to authenticate on ALCF computing resources. See supported OS and platforms for more information.
+
Step 2. Enroll your MobilePass+ mobile token:
+
After you’ve been provisioned a mobile token, you will receive a notification email with the subject line "ALCF Mobile Token Self-Enrollment" which you must access from the device on which you wish to install the token.
Click on the http:// link in the email. The SafeNet Authentication Service Self-Enrollment will open.
+
Click enroll your SafeNet MobilePass+ token.
+
When prompted to open in MobilePass+ tap Open.
+
You will now be prompted to enter a 6-digit numeric PIN.
+
Enter your PIN in the Token PIN field and repeat in the Confirm PIN field.
+
You will be taken to the Enrollment Complete screen to name the token.
+
Insert the desired name in the Token Name field or leave it as is. This name is not utilized by the server; it is for you only.
+
The newly enrolled SafeNet MobilePass+ token is now displayed in the SafeNet MobilePass+ app.
+
+
Manual Enrollment:
+
+
Copy the activation string from the SafeNet provision email.
+
Open the SafeNet MobilePass+ app and tap the manual option.
+
Paste the enrollment string into the field provided and tap the Enroll button.
+
You will now be prompted to enter a 6-digit numeric PIN.
+
Enter your PIN in the Token PIN field and repeat in the Confirm PIN field.
+
You will be taken to the Enrollment Complete screen to name the token.
+
Insert the desired name in the Token Name field or leave it as is. This name is not utilized by the server; it is for you only.
+
+
Logging in to an ALCF System using a Mobile Token
+
+
Open the MobilePASS+ app on your device. Then initiate an SSH session and type the following:
+
+
ssh <ALCF username>@<system_name>.alcf.anl.gov
+
+
+
+
When prompted for a password, click the SafeNet MobilePASS+ app on your phone. Click on the token name listed within the app, and enter your PIN.
+
+
+
The app will display your passcode immediately. Enter the passcode as the login password for the system within the SSH session. Please Note: You do NOT have to enter the PIN on the SSH screen when logging into a resource. This only needs to be done to access the passcode within the SafeNet MobilePASS+ app.
+
+
+
Each generated passcode is valid on the SafeNet MobilePass+ app window until your mobile device screen times out.
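For example, a login session with a mobile token might look like the following sketch (the username "jdoe" and the Polaris login node are placeholders; substitute your own ALCF username and target system):

```bash
# Connect to an ALCF login node (hypothetical username shown)
ssh jdoe@polaris.alcf.anl.gov
# At the password prompt, open the SafeNet MobilePASS+ app, tap your token,
# enter your PIN in the app, and type the displayed passcode at the prompt.
# Remember: the PIN is entered only in the app, never at the SSH prompt.
```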
+
+
+
Troubleshooting your Mobile Token
+
Case 1: Forgotten PIN: If you enter an invalid PIN for your mobile token, you will be asked to re-enter your PIN. After 6 failed attempts, your token will be deleted and you will need to call the ALCF help desk or send an email to ALCF support to have a new mobile token provisioned.
+
Case 2: Account Lockout: If you fail to enter the correct password 6 times, you will get a permission denied error on the SSH screen. Upon 4 more failed attempts, your IP will be blocked. You will need to call the ALCF help desk and submit a ticket to have the IP unblocked.
+
Case 3: PIN Change: While logged in to the mobile token, click on token settings then tap change PIN. Enter the current PIN followed by the new PIN and confirm.
+
Case 4: Re-Sync: If you are unable to log in to a resource after entering the correct PIN and passcode, your token may be out of sync with the server. Please email the ALCF Service Desk at accounts@alcf.anl.gov for assistance.
+
Case 5: New Mobile Device: If you have a new mobile device, please email the ALCF Service Desk at accounts@alcf.anl.gov to have a new mobile token provisioned.
+
Physical Token
+
The physical token allows access to the ALCF systems. This security token uses one-time passwords combined with your PIN for controlled access to the login systems. The physical token is a tracked asset for which you are responsible and is keyed to your use. Please safeguard your token as you would your credit cards or house keys: Do not store username, PIN, or other account-related records with the token. Sharing of tokens is strictly forbidden. Please do not mark on the token or alter it in any way.
+
Enabling Your ALCF Physical Token
+
Upon receipt of your CRYPTOCard token, contact accounts@alcf.anl.gov to verify your identity and activate the token. If this step is not performed, you will not be able to use the CRYPTOCard token to log in to ALCF resources.
+
ALCF Accounts Service Desk Info
Hours: Monday-Friday, 9 a.m. - 5 p.m. (Central time)
Email: accounts@alcf.anl.gov
+
Logging in to an ALCF System using a Physical Token
+
When the physical token is activated, an initial PIN will be provided. This is a four-digit number that you will prepend to the one-time password string generated by the token.
+
Upon INITIAL login (to one of the ALCF machines), a prompt to change the PIN will appear. PINs must be at least four digits long and contain only numbers.
+
+
+
Initiate an SSH session using:
+
ssh <ALCF username>@<system_name>.alcf.anl.gov
+
+
+
+
A password prompt will be received. At this point, push the button on the physical token once.
+
+
+
An eight-character, one-time password made up of letters and numbers will appear on the token’s display. This one-time password is case-sensitive.
+
+
+
Type your PIN followed immediately by the one-time password at the SSH password prompt.
+
+
+
For example, if your PIN is 1234 and you received the one-time password string ABCD9876, you would type 1234ABCD9876 at the password prompt.
+
Troubleshooting Your Physical Token
+
Case 1: It says "locked": The physical token may be locked due to too many failed attempts. Please contact the ALCF Help Desk to arrange the return of the locked token so a replacement can be sent.
+
Case 2: You have a PIN for your physical token: Once a PIN has been set for your physical token, you will need to prepend your PIN to the token password. Otherwise you will not be able to log in. If you do not remember your PIN, please email us so we can verify your identity and reset your Initial PIN.
+
Case 3: It does not say "locked" but still does not work: It is likely that your token has fallen out of sync with the server. If you have pushed the button on your physical token more than 10 times without successfully logging in, it will fail to authenticate because it has lost synchronization with the server. Please try connecting to Polaris first. If it still fails, please follow the re-sync instructions below.
+
Re-Sync Instructions
+
If you have pushed the button on your physical token more than 10
+times, it will fail to authenticate because it has lost synchronization
+with the server. You can re-synchronize your token using the following procedure:
+
1. Have your physical token ready.
+
+2. Obtain a challenge sequence:
+ - Initiate an SSH session to a host that allows token
+ authentication (such as polaris.alcf.anl.gov). At the password
+ prompt, just hit 'Enter'. This will cause the Cryptocard service
+ to produce a challenge string consisting of 8 numbers.
+
+3. Hold down the button on your token for a few seconds until the
+ display says "Init", then let go.
+
+4. The token will scroll through a series of menu options. When it
+ displays "ReSync", hit the button again.
+
+5. The display will say
+
+ Resync?0
+
+6. The number at the end will start cycling from 0 to 9, over and over.
+
+7. Look at the numbers in your challenge string. When the number
+ displayed on your token changes to the first number of the challenge
+ string, press the button. The display will now show this number, and
+ the second digit will start cycling.
+
+8. Enter each of the numbers from your challenge string in the same
+ manner, until the display on your token matches the entire challenge string.
+ Choose the "<" to backspace and re-enter the previous number if
+ necessary.
+
+9. Once you've entered all 8 digits, re-check to make sure they're
+ accurate. Then, while all 8 digits are displayed on the token, press
+ the button to generate a new password.
+
+10. Enter your PIN followed by the new password, and hit 'Enter'.
+ If successful, you will be logged in to the resource. You're now back
+ in sync with the authentication server.
+
+If you are unsuccessful, you will be presented with another challenge string.
+At this point, you may need to perform the re-sync instructions again.
+
+
If there are still problems after completing the re-synchronization procedures, please email us at accounts@alcf.anl.gov so we can run a test on the physical token to determine if it is defective.
+
If it is found to be defective we will promptly replace it. Physical tokens are the property of Argonne National Laboratory.
+
Please return them to us at:
+
ALCF Help Desk
+Argonne National Laboratory
+9700 S. Cass Ave.
+Bldg. 240, Rm. 2129
+Lemont, IL 60439
+
+
Resetting the Physical Token PIN
+
Please email us at support@alcf.anl.gov for PIN resets. Once your identity has been verified, we will provide you with a new PIN for your CRYPTOcard token.
+
Returning a Physical Token
+
If you no longer need your physical token, please return it to this address:
+
ALCF Help Desk
+Argonne National Laboratory
+9700 S. Cass Ave.
+Bldg. 240, Rm. 2129
+Lemont, IL 60439
+
+
All computing carried out on ALCF systems is associated with a user "account." This account is used to log on to the login servers and run jobs on the resources. If you have a user account, you have a login name that is recorded in the user database. This page describes what users need to know to manage their account details, including policies and procedures.
+
If you need an account, visit the Accounts and Project Management website: Request an account
+
If you want to learn how to get started, visit the Get Started Guide: Get Started Guide
+
Who Can Get an Account
+
Those who are interested in having an account on an ALCF resource must first request an allocation and provide a detailed description of the work, including computational requirements and the code's readiness for ALCF computing platforms. Another means of acquiring an allocation on an ALCF system is to be part of a project team that already has an active allocation. Once an allocation has been granted, new users should complete an account request. A project's Principal Investigator (PI) must sponsor these accounts; if the PI is the user, an ALCF staff member must serve as sponsor. Sponsors are asked annually to evaluate the accounts they have sponsored to determine whether or not these accounts should be kept active.
+
Account Abilities
+
A user with an active account can log in to the ALCF login servers (e.g., polaris.alcf.anl.gov). The account includes home directory space, from which files can be transferred via the login nodes and where development activities, such as editing and compiling, can take place.
+
Account States
+
Accounts are classified in one of the following categories:
+
+
Pending: An account that has been requested but has not yet been created.
+
Active: An account that can be used to interact with the ALCF Login Servers. This is the normal state for all accounts.
+
Inactive: An account that still exists on the system (that is, the account continues to be registered in the database and the user's files exist on disk) but the user cannot interact with the ALCF Login Servers. An account might be disabled due to misuse, security concerns, or because it is no longer allocated.
+
Deleted: An account that existed on the system and is thus in the records and backups, but whose user no longer has access to the systems or files on disk.
+
DD projects with a negative balance will not be able to run jobs until they have requested additional time, see Getting more time below.
+
INCITE and ALCC PIs are automatically emailed a summary of project usage. For DD projects, please email support@alcf.anl.gov.
+
+
Allocation Expiration
+
Projects and allocations at the ALCF are different. A particular project might have multiple allocations of time. For example, a discretionary project that has been approved three times will have three allocations (two of which have probably expired) but just one project. Projects do not expire; allocations do. If all allocations are expired, or have no hours left, jobs will not be able to run. Use the two bullets above (Checking for an active allocation and Determining the balance of an allocation) to determine active allocations.
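As a sketch of how to check this, assuming a project named MyProject with time on Polaris (both placeholders), the sbank commands described later on this page can be used to list the project's allocations and remaining balance:

```bash
# List allocations (with balances and expiration dates) for a project
# on a given resource; the names shown here are placeholders.
sbank-list-allocations -p MyProject -r polaris
```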
+
Getting More Time
+
To request an extension of your existing discretionary allocation or to request additional hours, please email support@alcf.anl.gov with answers to the following:
+
+
What have you accomplished with your original allocation?

Please include a brief description of any publications or major presentations that were (or will be) generated in full or in part because of this allocation.

What will you do with the extra time?

What are you requesting as your new expiration date?

How many additional hours are you requesting?
+
+
Sub-allocations
+
Suballocations let PIs control who on their team can run jobs, how much they are allowed to consume (allocation amount), and when they are allowed to run jobs (start and end dates).
+
Step 1: Create Suballocations (Project PI):
+
PI creates suballocations
+
sbank new sub <allocationid> -name <nameofsuballoc>
+
Tip: see sbank new suballocation -h for all the options.
+
Step 2: Manage Suballocations (Project PI)
+
PI adds users to suballocations
+
sbank e sub <projectname>::<nameofsuballoc> --add-user="<username1> <username2> ..."
+
PI can change the name of a suballocation
+
sbank e sub <suballocationID> --name=<new_name_of_suballocation>
+
By default, the primary suballocation (the default suballocation created when ALCF creates the allocation) is unrestricted, i.e., enabled for all project members. That means all project members can submit jobs against the primary suballocation by default. All other suballocations are restricted by default, and users have to be added to each of them.
+
To change the default for the primary suballocation to restrict usage, the PI must first edit the suballocation:
sbank e sub <primary suballocation id> --add-user="<username1> <username2> ..."
+
PI changes start and end dates for a suballocation:
+
sbank e sub <suballocationID> -S <start_date> -E <end_date>
+
PI moves hours from one suballocation to another:
+
sbank e sub <projectname>::<nameofsuballoc> --hours-to-move <hours> --to-suballocation <projectname>::<nameofsuballoc2>
+
Note: the hours to move must be less than or equal to the available balance of the suballocation <nameofsuballoc>
+
Tip: see sbank e suballocation -h for all the options
+
Step 3: Submit Jobs (Project team)
+
Submit jobs to a suballocation. Note that the user should be on the suballocation’s user list
+
E.g.: qsub -l select=10,walltime=30:00,filesystems=eagle:home -A <suballocationID> -q demand test.sh
+
Note: Once submanagement is enabled for a project allocation, all job submissions must specify the suballocationID
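Putting the steps above together, a hypothetical end-to-end flow might look like the following sketch (ProjectX, ml_team, the usernames, and the dates are placeholders; the flag spellings are taken from the steps above):

```bash
# Step 1: PI creates a suballocation under the project's allocation
sbank new sub <allocationid> -name ml_team

# Step 2: PI adds users and sets the usage window
sbank e sub ProjectX::ml_team --add-user="user1 user2"
sbank e sub ProjectX::ml_team -S 2024-01-01 -E 2024-06-30

# Step 3: a team member on the suballocation's user list submits a job against it
qsub -l select=10,walltime=30:00,filesystems=eagle:home -A <suballocationID> -q demand test.sh
```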
+
Useful commands:
List all suballocations for a project, showing the number of jobs run, charges, allocation balance, suballocation name, and list of users.
+
NOTE:
 1. The list of arguments is optional.
 2. You can also enter a list by using the -a option multiple times.
 3. Either way, both are optional, and you can get detailed allocation info using the option filters below.
+
OPTIONS
+
--version
+
show program's version number and exit
+
-h, --help
+
show this help message and exit
+
-a ALLOCATION_ID, --allocation-id=ALLOCATION_ID
+
filter on allocation id
+
-e EVENT_ID, --event-id=EVENT_ID
+
filter on event id
+
-f FIELD_INFO, --field-to-display=FIELD_INFO
+
FIELD_INFO is FIELD[:WIDTH]; for available fields enter -f? or -f "?"; to add fields enter -f "+ FIELD[:WIDTH] ..."
+
filter on name or id (DO NOT MIX); enter 'all' to get all; the wild card '*' is allowed, but only on names
+
-w "FIELD_INFO", --field-width
+
"FIELD_INFO" FIELD_INFO is :, for available fields enter -w? or -w "?"
+
-E END, --end=END
+
[OPER1][...[OPER2]], where the operators OPER1 and OPER2 can be one of the following: ge, gt, le, lt, eq or >=, >, <=, <, ==. Operator Defaults: OPER1 is 'lt' for single date entry, OPER1 and OPER2 are 'ge' and 'lt', respectively, for range date entry. Date Parsing Precedence: YEAR then MONTH then DAY, i.e., 121101 is parsed as YYMMDD, hence Nov. 1, 2012
+
-H, --human-readable
+
abbreviate numbers and use unit suffixes: K (thousands), M (millions), G (billions), T (trillions) ...
+
-S START, --start=START
+
[OPER1][...[OPER2]], where the operators OPER1 and OPER2 can be one of the following: ge, gt, le, lt, eq or >=, >,<=, <, == . Operator Defaults: OPER1 is 'ge' for single date entry, OPER1 and OPER2 are 'ge' and 'lt', respectively, for range date entry. Date Parsing Precedence: YEAR then MONTH then DAY, i.e., 121101 is parsed as YYMMDD, hence Nov. 1, 2012
+
--get-not-charged
+
only un-charged jobs
+
--history-date-range=END
+
[OPER1][...[OPER2]], where the operators OPER1 and OPER2 can be one of the following: ge, gt, le, lt, eq or >=, >, <=, <, ==. Operator Defaults: OPER1 is 'ge' for single date entry, OPER1 and OPER2 are 'ge' and 'lt', respectively, for range date entry. Date Parsing Precedence: YEAR then MONTH then DAY, i.e., 121101 is parsed as YYMMDD, hence Nov. 1, 2012
+
--last-updated=LAST_UPDATED_TIMESTAMP
+
[OPER1][...[OPER2]], where the operators OPER1 and OPER2 can be one of the following: ge, gt, le, lt, eq or >=, >, <=, <, ==. Operator Defaults: OPER1 is 'gt' for single date entry, OPER1 and OPER2 are 'ge' and 'lt', respectively, for range date entry. Date Parsing Precedence: YEAR then MONTH then DAY, i.e., 121101 is parsed as YYMMDD, hence Nov. 1, 2012
+
--no-commas
+
remove commas from comma separated thousands
+
--no-header
+
do not display the header
+
--no-history
+
do not show history information
+
--no-rows
+
do not display the row data
+
--no-sys-msg
+
do not display system message
+
--no-totals
+
do not display the totals
+
--queued=QUEUED_TIMESTAMP
+
[OPER1][...[OPER2]], where the operators OPER1 and OPER2 can be one of the following: ge, gt, le, lt, eq or >=, >, <=, <, ==. Operator Defaults: OPER1 is 'ge' for single date entry, OPER1 and OPER2 are 'ge' and 'lt', respectively, for range date entry. Date Parsing Precedence: YEAR then MONTH then DAY, i.e., 121101 is parsed as YYMMDD, hence Nov. 1, 2012
+
NOTE:
 1. The list of arguments is optional.
 2. You can also enter a list by using the -p option multiple times.
 3. Either way, both are optional, and you can get detailed project info using the option filters below.
+
OPTIONS
+
--version
+
show program's version number and exit
+
-h, --help
+
show this help message and exit
+
-a ALLOCATION_ID, --allocation-id=ALLOCATION_ID
+
filter on allocation id
+
-f FIELD_INFO, --field-to-display=FIELD_INFO
+
FIELD_INFO is FIELD[:WIDTH]; for available fields enter -f? or -f "?"; to add fields enter -f "+ FIELD[:WIDTH] ..."
+
NOTE:
 1. The list of arguments is optional.
 2. You can also enter a list by using the -t option multiple times.
 3. Either way, both are optional, and you can get detailed transaction info using the option filters below.
+
OPTIONS
+
--version
+
show program's version number and exit
+
-h, --help
+
show this help message and exit
+
-a ALLOCATION_ID, --allocation-id=ALLOCATION_ID
+
filter on allocation id
+
-c, --comment
+
display comment
+
-e EVENT_ID, --event-id=EVENT_ID
+
filter on event id
+
-f FIELD_INFO, --field-to-display=FIELD_INFO
+
FIELD_INFO is FIELD[:WIDTH]; for available fields enter -f? or -f "?"; to add fields enter -f "+ FIELD[:WIDTH] ..."
+
NOTE:
 1. Use -I to include inactive allocations.
 2. The list of arguments is optional.
 3. You can also enter a list by using the -u option multiple times.
 4. Either way, both are optional, and you can get detailed user info using the option filters below.
+
OPTIONS
+
--version
+
show program's version number and exit
+
-h, --help
+
show this help message and exit
+
-a ALLOCATION_ID, --allocation-id=ALLOCATION_ID
+
filter on allocation id
+
-f FIELD_INFO, --field-to-display=FIELD_INFO
+
FIELD_INFO is FIELD[:WIDTH]; for available fields enter -f? or -f "?"; to add fields enter -f "+ FIELD[:WIDTH] ..."
+
Below is a set of helpful commands to help you better manage the projects you have running at the ALCF.
+
View your project's allocations
+
Command: sbank-list-allocations
+
Use this command to list all of your active allocations for a specific project [Project-X]. This is useful when you need to provide this information in a report.
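For instance, a minimal sketch (ProjectX and theta are placeholders, and the "avail" field name is taken from the field examples later on this page) that lists the project's allocations and adds the available balance to the default columns:

```bash
# List active allocations for ProjectX on theta, adding the available-balance field
sbank-list-allocations -p ProjectX -r theta -f "+ avail"
```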
+
### List all charges for userx on theta on project ProjectX

```
> sbank-list-users -p ProjectX -r theta -u userx
 User             Jobs        Charged
 ---------------  ----------  ---------------
 userx                 1,814          9,884.5

Totals:
  Rows: 1
  Resources: theta
  Charged: 9,884.5 node hours
  Jobs : 1,814
```

### List charges for all users in ProjectX on Theta
This works for project leads (i.e., PIs, Co-PIs, Proxies), since they can see everything in their own projects.

sbank-list-users -p ProjectX -r theta
 User             Jobs        Charged
## View your project's jobs
+List jobs for user "userx" for jobs that started in the range 2016-02-15<= started < 2016-02-29 and add the transactions related to the job
+
+### **Command:** sbank-list-jobs
+
**Note:** The `transaction_ids_list` field (which lists the refund for the job) can be shortened all the way to "t" in the `-f "+ t"` option.
+
### List the nodes used, runtime and start timestamp for Cooley job 744160
**Note**: To display the date and time, we increased the number of characters of start_timestamp to 19.
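The exact command for this example was not preserved on this page; a plausible sketch, assuming sbank-list-jobs accepts a -j job-id filter and "runtime" as a field name (check sbank-list-jobs -h), would be:

```bash
# Show nodes used, runtime, and the full 19-character start timestamp for job 744160 on cooley
# (the -j filter and the "runtime" field name are assumptions; start_timestamp:19 follows the note above)
sbank-list-jobs -r cooley -j 744160 -f "nodes_used runtime start_timestamp:19"
```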
+
## View your project's transactions
+### **Command:** sbank-list-transactions
+
List transactions that were at or after 2016-02-29 for ProjectX, and add the fields job_duration, nodes_used, and hosts.
+
**Note**:
- job_duration, nodes_used, and hosts are shortened, but they are still uniquely identified
- hosts has a left-justified width of 20, specified as "h:-20"
+
+catapult~ > sbank-list-transactions -p ProjectX --at "ge 2016-02-29" -f "+ job_d nodes_u h:-20" -r theta
+ Id Resource Project Allocation At User Transaction Type Amount Jobid Job Duration Nodes Used Hosts
+
+
+
"detail" meta command displays information in a long format with history updates, where appropriate.
+
list meta command
+
"list" meta command displays information in a table format, but no history updates are displayed.
+
IMPORTANT NOTES
 1. All dates entered are interpreted as UTC.
 2. Non-admin users will only be able to see their own content (jobs, charges, etc.).
 3. Project admin users will be able to see all of the content for their projects.
 4. Staff admin users will be able to see all the content.
 5. --help and -h are the help options.
 6. For the -f option, fields can be given all at once or one at a time, e.g. > sbank-list-allocations -f "id p avail" or > sbank-list-allocations -f id -f p -f avail
 7. For -u, -p and -r, the use of the wild card "*" is allowed, but only on names, not ids:
+
+
+
+
Examples:
+
+
The following command will find allocations for users whose names start with "pers" and also users rojas and allcock. > sbank-list-allocation -u "pers* rojas allcock"
+
The following command will find allocations for projects that contain "ratio" in the name. > sbank-list-allocation -p *ratio*
+
The following command will find allocations for projects that end with "tion" in the name. > sbank-list-allocation -p *tion
+
The following command will find allocations for projects that start with "ab" and end with "ng" in the name. > sbank-list-allocation -p ab*ng
+
+
For -f option:
+This option is the display field option.
+
To get the available fields, enter -f? or -f "?". The default field columns will be displayed if no field option is specified.

To replace the current fields to display, enter:

> sbank-list-allocations ... -f "FIELD[:WIDTH] ... FIELD[:WIDTH]"

If you wish to add fields to the default fields, enter one + symbol anywhere in the quoted string:

> sbank-list-allocations ... -f "+ FIELD[:WIDTH] ... FIELD[:WIDTH]", only one + symbol is needed.
+
+
The fields will be displayed in table format, in the order entered on the command line. You can specify the field width, where WIDTH can be a positive or negative value: for left alignment use -, for right alignment use + or nothing.
+
For -w option:
+
FIELD:WIDTH, if the field is displayed it will change the width for the specified field.
+
NOTE: This will not add the field as in -f option, only change the width. To get available fields you can also use -w? or -w "?" as in -f option.
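For instance, a small sketch (the "id" field name is taken from the field examples above) that widens one displayed column without adding or removing fields:

```bash
# Widen the "id" column to 12 characters; other columns keep their defaults
sbank-list-allocations -w "id:12"
```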
+
For -S, -E, --created, --queued, --last-updated, --history-date-range options:
+
These are the date filter options. All dates are treated as UTC.
+
You can use any reasonable date string that resembles a date. Ambiguous dates will be parsed with the following precedence: **YEAR then MONTH then DAY**
+
For example, 10-11-12 or 101112 will be parsed as Oct. 11, 2012, not Nov. 12, 2010 or Nov. 10, 2012.
+
Or you can specify a single date as follows: "[OPER]UTC_DATE"

You can specify a date range as follows: "[OPER1]UTC_DATE1...[OPER2]UTC_DATE2"

where OPER can be one of the following operators: "==", ">=", "<=", ">", "<" or "eq", "ge", "le", "gt", "lt".
+
+
Note: The defaults for OPER, OPER1, and OPER2 vary by option; see the individual option descriptions above.
+
You can also use the following key letters "n", "t", "d", "w", "y" as follows:
+
KEY SYNTAX DEFINITIONS:
- n[ow]: now, where "now" is current-date current-time UTC
- t[oday]: today, where "today" is current-date 00:00:00 UTC
- [+/-]<number>d: the specified number of +/- days from "today" in UTC
- [+/-]<number>w: the specified number of +/- weeks from "today" in UTC
- [+/-]<number>y: the specified number of +/- years from "today" in UTC
+
+
For -T option:
+
Transaction type option. The following are the valid transaction types and their explanations:
- CHARGE: filter on job charges
- PULLBACK: filter on allocation pullbacks
- DEPOSIT: filter on allocation deposits
- REFUND: filter on job refunds
- VOID: filter on void transactions
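For example (ProjectX is a placeholder project name), a sketch that shows only refund transactions for a project:

```bash
# List only REFUND transactions for a project
sbank-list-transactions -p ProjectX -T REFUND
```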
+
INVOCATION
+
- sbank
- sbank-detail: sbank detail, sbank d
- sbank-detail-allocations: sbank detail allocations, sbank d a
- sbank-detail-jobs: sbank detail jobs, sbank d j
- sbank-detail-projects: sbank detail project, sbank d p
- sbank-detail-transactions: sbank detail transactions, sbank d t
- sbank-detail-users: sbank detail users, sbank d u
- sbank-list: sbank list, sbank l
- sbank-list-allocations: sbank list allocations, sbank l a
- sbank-list-jobs: sbank list jobs, sbank l j
- sbank-list-projects: sbank list projects, sbank l p
- sbank-list-transactions: sbank list transactions, sbank l t
- sbank-list-users: sbank list users, sbank l u
+
ENVIRONMENT VARIABLES
+
Command line default options: Define the following environment variables as you would in the command line. Once the environment variable is defined, it will be used as the default options and arguments for the specific command. Command line options will take precedence.
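For example, a sketch of setting default options for sbank-list-allocations in a bash shell (the resource and project names are placeholders):

```bash
# Options stored in the variable are applied by default;
# options given on the command line still take precedence.
export sbank_LIST_ALLOCATIONS_ARGS="-r polaris -p ProjectX"
sbank-list-allocations   # now behaves like: sbank-list-allocations -r polaris -p ProjectX
```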
+
sbank_DETAIL_ALLOCATIONS_ARGS
+
Default arguments and options for sbank-detail-allocations.
+
sbank_DETAIL_CATEGORIES_ARGS
+
Default arguments and options for sbank-detail-categories.
+
sbank_DETAIL_NAMES_ARGS
+
Default arguments and options for sbank-detail-names.
+
sbank_DETAIL_MESSAGES_ARGS
+
Default arguments and options for sbank-detail-messages.
+
sbank_DETAIL_JOBS_ARGS
+
Default arguments and options for sbank-detail-jobs.
+
sbank_DETAIL_PROJECTS_ARGS
+
Default arguments and options for sbank-detail-projects.
+
sbank_DETAIL_TRANSACTIONS_ARGS
+
Default arguments and options for sbank-detail-transactions.
+
sbank_DETAIL_USERS_ARGS
+
Default arguments and options for sbank-detail-users.
+
sbank_LIST_ALLOCATIONS_ARGS
+
Default arguments and options for sbank-list-allocations.
+
sbank_LIST_JOBS_ARGS
+
Default arguments and options for sbank-list-jobs.
+
sbank_LIST_PROJECTS_ARGS
+
Default arguments and options for sbank-list-projects.
+
sbank_LIST_TRANSACTIONS_ARGS
+
Default arguments and options for sbank-list-transactions.
+
sbank_LIST_USERS_ARGS
+
Default arguments and options for sbank-list-users.
Explanation: Fields will be displayed in order of appearance, where field1:-20 means 20 characters wide, left-aligned; field2:20 means 20 characters wide, right-aligned; and field3 uses the default width. Number fields default to right-aligned; text fields default to left-aligned.
+
Example 2: -S, -E, --created, --queued, --last-updated, --history-start, --history-end
+
Single date-string examples:

- sbank-list-allocations -S ">=Oct 11, 2014" : start dates that are >= "2014-10-11 00:00:00"
- sbank-list-allocations -S "<=2014-11-10" : start dates that are <= "2014-11-10 00:00:00"
- sbank-list-allocations -E "<20141110" : end dates that are < "2014-11-10 00:00:00"
- sbank-list-allocations -E "22:30:10" : end dates that are < " 22:30:10"
- sbank-list-allocations -S ">today" : start dates that are > " 00:00:00"
- sbank-list-allocations -E t : end dates that are < " 00:00:00"
- sbank-list-allocations -S gtnow : start dates that are > ""
- sbank-list-allocations -E len : end dates that are <= ""
- sbank-list-allocations -S "1d" : start dates that are >= "today +1 day"
- sbank-list-allocations -E "-2w" : end dates that are < "today -2 weeks"
- sbank-list-allocations -S ">=1y" : start dates that are >= "today +1 year"
- sbank-list-allocations -S ">2012" : start dates that are > "2012-- 00:00:00"
+
Researchers gain access to ALCF systems for computational science and engineering projects—typically with awards of millions of core-hours—through competitive, peer-reviewed allocation programs supported by the DOE and Argonne. Our peer-reviewed award programs consist of the INCITE, ALCC, and ADSP programs. More information about the programs, including dates for our CFPs, can be found on their web pages.
+
Director's Discretionary
+
Alternatively, ALCF offers a Director's Discretionary allocation award program for leadership computing preparation, INCITE and ALCC scaling, and application performance work aimed at maximizing scientific application efficiency and productivity on leadership computing platforms. See the Director's Discretionary (DD) Program page for more information.
+
Initializing Your Awarded Allocation
+
Projects with INCITE, ALCC, and ADSP awards will be contacted directly by the ALCF staff with information on creating accounts.
+
Director's Discretionary awards will receive information in the award confirmation email.
+
Allocation Resources
+
While requesting an allocation, users can choose from:
If you are a PI of a Director's Discretionary project that has an active allocation, you can request additional time or an extension using the allocation request form.
+
sbank is the accounting system used within the ALCF. It tracks project allocations, usage charges, and refunds. sbank allows queries about the balance and expiration of project allocations, and has replaced the outdated cbank accounting system.
+
The sbank accounting system helps users manage their allocations and usage per job. It gives the PIs the ability to monitor their allocation usage by user, job, and machine. It also allows the user to monitor their usage per allocation and provides insight on how many hours are left on the project.
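As a quick sketch (the project name is a placeholder), the long-form command and its meta-command shorthand from the INVOCATION list earlier on this page are equivalent ways to check what remains on a project:

```bash
# Both commands list the project's allocations and balances
sbank-list-allocations -p ProjectX
sbank l a -p ProjectX
```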
+
Getting Started with sbank
+
sbank Example Commands provides a set of example commands on how to use the most common commands.
+
sbank Man Pages
+
Use these sbank man pages to get information on how to use the commands.
+
The Argonne Leadership Computing Facility (ALCF) is required to report the progress and scientific accomplishments of all peer-reviewed projects.
+
PIs of INCITE, ALCC, and ADSP projects are required to complete quarterly reports and a final end-of-project (EOY/EOP) report.
+
Due dates
+
Due dates for the 2024 INCITE quarterly, EOY, and the EOP reports:
+
+
April 1, 2024 (CY2024 - Q1)
+
July 1, 2024 (CY2024 - Q2)
+
October 1, 2024 (CY2024 - Q3)
+
January 1, 2025 (CY2025 - EOY) or February 15, 2025 (entire allocation period - EOP)
+
+
Due dates for the 2023-2024 ALCC quarterly and the EOP reports:
+
+
October 1, 2023 (CY2023 - Q3)
+
January 1, 2024 (CY2024 - Q4)
+
April 1, 2024 (CY2024 - Q1)
+
August 15, 2024 (CY2024 - EOP)
+
+
Penalties
+
If a quarterly report is more than 30 days late:
+- The ability to submit jobs for the PI and users of the late project will be disabled.
+
If a quarterly report is more than 90 days late:
+- The PI and users of the late project will have their accounts disabled.
+
These penalties will be removed within three business days after the late quarterly or EOY report is submitted.
+
ALCC Specific Penalties:
+
A similar penalty will also be applied to new ALCC projects with the same PI or co-PIs that have failed to submit the EOP report for a previous ALCC project. If the EOP report is more than 15 days late:
+
+
The new ALCC project will be blocked. For a currently active ALCC project, the ability to submit jobs will be disabled for the project and all sub-projects. For a project that has not been created yet, the process for new project creation will be halted.
+
+
Appeals
+
A PI or user may appeal a project or account suspension to the ALCF Director by sending a request to support@alcf.anl.gov.
+
Report Templates
+
Templates for the quarterly and the EOY reports can be found at the links on the bottom of this page.
+
Please modify the filename to replace PINAME with the last name of the PI of the INCITE/ALCC project, ALLOCATION to INCITE/ALCC, and YEAR to the corresponding calendar year. For quarterly reports, please replace the X in the filename with the quarter number.
+
For example, for a project with PI 'Joe Smith' that is submitting the quarterly report for the first quarter in 2023-2024 cycle for ALCC, the filename will be Smith_ALCC_Q1.docx.
+
For an EOY report, replace YEARS with the years associated with your allocation. For example, an ALCC 2023-2024 project with PI 'Joe Smith' would have a filename of Smith_ALCC_2023-2024_EOY.docx.
+
The following guide is for PIs and Proxies to get insight into managing projects and teams for ALCF awards. Please submit questions or trouble tickets to support@alcf.anl.gov.
+
Get Started with ALCF’s Systems
+
To get started using our resources, please visit:
+Connect & Login
+
We also encourage you to take full advantage of ALCF's training programs and user services. Some useful introductory materials and videos are listed below:
Before your project begins, you will receive an email with the following project information:
+
+
Project Short Name: The assigned, shortened name for your project. This will be the name that you’ll use to access your project on the systems.
+
Project Proxies: Project members designated by PIs that are authorized to add or renew project members on your behalf.
+
Allocation System(s) and Allocation Amount: The approved system(s) and amount of your award in node hours.
+
Approved Quota: The approved amount of disk space for your project directory.
+
File System: The file system where your project directory will reside. For information on the Eagle file system, see Storage and Networking.
+
Assigned Catalyst: INCITE projects are assigned ALCF staff members who are available to assist the team throughout the duration of the INCITE allocation.
+
Allocation Start Date: The start date of your award.
If you have an active ALCF account: Submit a request to join the newly awarded project at https://my.alcf.anl.gov/.
+
Information for Foreign National Access
+
The U.S. Department of Energy has guidelines and requirements for foreign nationals who access its facilities and sites. This guidance is issued in DOE Order 142.3, which is part of Argonne's contract; therefore, all foreign nationals (non-U.S. Citizens) must obtain authorization prior to using ALCF resources.
+
If you are a foreign national and do not have current authorization credentials, you are required to submit an ANL-593 (Foreign National Access Request) form. It is critical that identity documentation requests sent by ALCF staff are completed as early as possible to facilitate timely processing of your account approval.
+
User Agreement for INCITE, ALCC, and ADSP
+
Note: This does not apply to Director's Discretionary awards.
+
Institution Master Agreement for INCITE, ALCC, and ADSP
+
If you are not an employee of Argonne National Laboratory, a user agreement must be signed by your home institution to perform research at Argonne’s user facilities. This policy applies to every member of the project team who will be conducting research on ALCF resources.
Note: This does not apply to Director's Discretionary awards.
+
Every project team member who requests an ALCF account must sign and return an acknowledgment form, stating that they agree to the terms in the user agreement.
As a PI, you can add members to your project. You can assign proxies who are project members authorized to add or renew project members on your behalf.
+
A project PI or proxy has the authority to:
+
+
Approve and renew accounts
+
Add and delete users to/from the project
+
Approve Foreign Assignment/Visit Request form renewals for project members who are foreign nationals
+
+
During your project setup, the ALCF Support Team will request the following information to establish your project members:
+
+
The names, email addresses, and/or ALCF usernames (if already existing) of up to two proxies and all project members.
+
+
About Project and UNIX Group Membership
+
All project members have the ability to run jobs against your allocation. There is no limit to the number of project members you may authorize.
Project members are automatically added to the project UNIX group, giving them the ability to write to the project directory and to access project data. When a project member is added to or removed from a project, this change is automatically reflected in the project UNIX group membership.
+
Adding Project Members
+
The PI or a proxy must approve each team member to access ALCF resources and run jobs on their project. PI/proxies can respond to emails from ALCF for account access approval with a "yes" or "no".
+
PI/proxies with active ALCF accounts can also approve new account requests, project membership requests, and account reactivation requests, and add existing active ALCF users to the project by logging into the ALCF Account and Project Management application.
+
Note: If PI/proxies need to request an ALCF account, see the section below for instructions on "how to apply" for an account.
+
Accounts and Access for your Project Members
+
All project members will need an ALCF user account to access project data and to run jobs on ALCF systems.
Members with ALCF accounts that are no longer active should submit a reactivation request here: https://my.alcf.anl.gov/accounts/#/accountReactivate. When prompted for project name, they should select your project short name.
+
Members with active ALCF accounts but have not been added to your project should submit a request to join your project by going to this page: https://my.alcf.anl.gov/. They should search for your project and click the "Request Membership" button for that project.
+
Moving Your Data
+
We encourage you to use Globus to move your project data to your ALCF project directory before your allocation begins. For details, see Using Globus.
+
Project Status Reports for INCITE, ALCC, and ADSP
+
Note: PIs who are awarded a Director's Discretionary allocation will not receive weekly project status reports.
+
Shortly after your allocation begins, we will begin sending you a weekly project status report via support@alcf.anl.gov to keep you informed of your award's progress.
+
Look for an email from us with the subject line: ALCF [ALLOCATION PROGRAM] Project Status Report for [PROJECT SHORT NAME]
+
Reporting Requirements for INCITE, ALCC, and ADSP
+
Note: PIs who are awarded Director's Discretionary allocations are not required to submit project reports.
+
If you received an INCITE, ALCC, or ADSP allocation award, quarterly reporting is required to keep DOE informed of progress related to your allocation.
+
The ALCF will send you a report template at the end of each quarter. Please complete the report promptly and submit it via email to support@alcf.anl.gov. For more information see the Quarterly Report webpage.
+
Policies
+
Pullback Policy
+
Please be aware that we will periodically monitor your project allocation, and could potentially adjust it, if a large portion of it goes unused. For details, see the Pullback Policy.
+
Allocation Overburn Policy
+
Please see this page for overburn/overuse eligibility for INCITE projects that have exhausted their allocation in the first 11 months of their allocation year: Allocation Overburn
+
Acknowledgment In Publications
+
Please follow the guidelines provided on the ALCF Acknowledgement Policy page to properly acknowledge the use of ALCF resources in all of your publications, both online and print.
+
Facility Policies
+
Facility policies have been established to provide consistent and reliable services. Please read about our [ALCF Facility Policies](../policies/facility-policies.md).
+
Useful Allocation and Quota Commands
+
We have an allocation management tool called sbank; below are a few helpful commands for checking allocations and quotas.
+
+
myprojectquotas: log into Polaris and type this command to view the project directory quotas for all your projects
+
myquota: log into Polaris and type this command to view your home directory quota
+
+
You can use the following command to check your project balance on Polaris:
+- sbank-list-allocations -p -r
We can also help resolve any issues or needs that may be delaying the start of your scientific campaign.
+- Are you in need of high-throughput software?
+- Are you having difficulty compiling your application?
+- Does your code have limited restart capabilities?
+
If your project allocation usage is being held back for reasons due to one of our systems, please contact us for assistance by emailing support@alcf.anl.gov.
+
The PI or proxy must approve each member of the team to gain access and to run project jobs on ALCF resources. If you have an active ALCF account, you can manage your project team by logging into the ALCF Account and Project Management website at https://my.alcf.anl.gov/
+
Project members will need to have an active ALCF user account to access project data and to run jobs on ALCF systems. See [Accounts and Access for your Project Members](https://docs.alcf.anl.gov/account-project-management/project-management/starting-alcf-award/#accounts-and-access-for-your-project-members) for information on how team members can get an account, reactivate an account, or request to join your project.
+
Accessing your project(s)
+
+
Log in at https://my.alcf.anl.gov/ using your credentials (ALCF username and Physical/Mobile token passcode one-time passcode).
+
You will see a list of projects for which you are the Principal Investigator (PI).
+
Click on the desired project to view information and management options for the selected project.
+
+
Modifying project information
+
Some project information cannot be modified, but as the PI, you can modify the following: project title, institutions, and associated funding.
+
Your project can be associated with multiple institutions, but you must specify a primary institution.
+
Managing project members with an Existing ALCF Account
+
+
You can manage the membership for your project by clicking on the desired project from the Project Management screen.
+
Add and/or remove proxies and team members by clicking on the red "Remove" button to the right of each member or clicking on "Add new user."
+
You can view account information for each user as it relates to the project:
+
Account Status
+
Project Role
+
Proxy Permissions
+
+
Membership Status
+
+
+
Proxies are individuals authorized to add or renew user accounts on behalf of the project PI. You can upgrade a user from member to proxy by clicking the "Proxy" radio button that corresponds to the desired member.
+
The modelzoo/modelzoo/transformers/pytorch/bert directory is a PyTorch implementation of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
+This BERT-large msl128 example uses a single sample dataset for both training and evaluation. See the README.md in the source directory for details on how to build a dataset from text input.
+First, source a Cerebras PyTorch virtual environment and make sure that the requirements are installed:
cd ~/R_2.3.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
+cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
+export MODEL_DIR=model_dir_bert_large_pytorch
+if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
+python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.3.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
+
+Note: the vocabulary file referenced in /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml is the same as the one at /home/$(whoami)/R_2.3.0/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt.
+
The last parts of the output should resemble the following; messages about CUDA can be ignored and are not shown.
Evolutionary Scale Modeling (ESM-2) is a transformer protein language model from the Meta Fundamental AI Research (FAIR) Protein Team.
+The Cerebras ESM-2 model implementation can be found at modelzoo/src/cerebras/modelzoo/models/nlp/esm2. Configs available are listed at https://github.com/Cerebras/modelzoo/tree/main/src/cerebras/modelzoo/models/nlp/esm2#configs-included-for-this-model. This example will use the Uniref 50 dataset, preprocessed at path /software/datasets/ESM-2/, to train a small 35M parameter model.
+
First, source a Cerebras PyTorch virtual environment and make sure that the requirements are installed:
+
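As a sketch only (the virtual environment location and the requirements file path are assumptions based on the R_2.3.0 layout used elsewhere on this page; adjust them to your own setup):

source ~/R_2.3.0/venv_cerebras_pt/bin/activate      # assumed venv location
cd ~/R_2.3.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2
pip install -r ~/R_2.3.0/modelzoo/requirements.txt  # assumed requirements path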
+Connection to one of the CS-2 cluster login nodes requires an MFA passcode for authentication - either an 8-digit passcode generated by an app on your mobile device (e.g. MobilePASS+) or a CRYPTOCard-generated passcode prefixed by a 4-digit pin. This is the same passcode used to authenticate into other ALCF systems, such as Polaris.
+In the examples below, replace ALCFUserID with your ALCF user id.
+To connect to a CS-2 login:
+
+
ssh to a desired login node:
+
ssh ALCFUserID@cer-login-01.ai.alcf.anl.gov
+
+ or
+
ssh ALCFUserID@cer-login-02.ai.alcf.anl.gov
+
+ or
+
ssh ALCFUserID@cer-login-03.ai.alcf.anl.gov
+
+
Alternatively, ssh randomly to one of the above three login nodes:
+
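The round-robin hostname described later in this document resolves randomly to one of the three login nodes:

ssh ALCFUserID@cerebras.ai.alcf.anl.gov   # resolves to cer-login-0[1-3].ai.alcf.anl.gov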
The CS-2 cluster has its own Kubernetes-based system for job submission and queuing.
+
Jobs are started automatically through the Python framework in modelzoo.common.pytorch.run_utils
+Continuous job status for a job is output to stdout/stderr; redirect the output, or consider using a persistent session started with screen or tmux (or both).
+
Jobs that have not yet completed can be listed as shown. Note: this command can take over a minute to complete.
+
(venv_cerebras_pt)$ csctl get jobs
+NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
+wsjob-thjj8zticwsylhppkbmjqe 13s 1s RUNNING cer-cs2-01 username name=unet_pt https://grafana.cerebras1.lab.alcf.anl.gov/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-thjj8zticwsylhppkbmjqe&from=1691705374000&to=now
+(venv_cerebras_pt)$
+
Jobs can be labeled in the command line that launches them, if they are written with Cerebras's Python framework for running appliance jobs, by adding a command line option of this form:
+
--job_labels labelname=labelvalue
+
+
Jobs can also be labeled after they have been started as shown:
+
(venv_cerebras_pt)$ csctl label job wsjob-ez6dyfronnsg2rz7f7fqw4 testlabel=test
+job/wsjob-ez6dyfronnsg2rz7f7fqw4 was patched
+(venv_cerebras_pt)$
+
+
Jobs with a particular label/label value can be listed as shown:
+
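The dedicated filter syntax is not reproduced here; a simple approach that relies only on the csctl get jobs listing shown above is to filter its output with grep:

(venv_cerebras_pt)$ csctl get jobs | grep testlabel=test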
See csctl -h for more options.
+Add -h to a command for help for that command, e.g. csctl get -h or csctl cancel -h.
+
$ csctl -h
+Cerebras cluster command line tool.
+
+Usage:
+ csctl [command]
+
+Available Commands:
+ cancel Cancel job
+ clear-worker-cache Clear the worker cache
+ config View csctl config files
+ get Get resources
+ label Label resources
+ log-export Gather and download logs.
+ types Display resource types
+
+Flags:
+ -d, --debug int higher debug values will display more fields in output objects
+ -h, --help help for csctl
+ --namespace string configure csctl to talk to different user namespaces
+ -v, --version version for csctl
+
+Use "csctl [command] --help" for more information about a command.
+
Cerebras documentation for porting code to run on a Cerebras CS-2 system:
+Ways to port your model
+
Grafana WsJob Dashboard for Cerebras jobs
+
A Grafana dashboard provides support for visualizing, querying, and exploring the CS-2 system's metrics, and enables access to system logs and traces.
+See the Cerebras documentation for the Job Information Dashboard
+
Here is a summary (tested on Ubuntu and macOS).
+
On your work machine with a web browser, e.g. your laptop,
+edit /etc/hosts, using your editor of choice
+
sudo nano /etc/hosts
+
+Add this line
+
127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov
+
+Save, and exit the editor
+
Download the Grafana certificate present on the Cerebras node at /opt/cerebras/certs/grafana_tls.crt to your local machine. To add this certificate to your browser keychain,
+
+
On Chrome, go to Settings->Privacy and security->Security->Manage device certificates
+
Select System under "System Keychains" on the left hand side of your screen. Also select the "Certificate" tab.
+
Drag and drop the downloaded certificate. Once it is added, it is visible as "lab.alcf.anl.gov"
+
+
Select the certificate, and ensure that the "Trust" section is set to "Always Trust"
+
+
+
On your work machine with a web browser, e.g. your laptop,
+tunnel the grafana https port on the cerebras grafana host through to localhost
+
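A minimal sketch of the tunnel, assuming port 8443 (the port added to job URLs below) and that the Grafana host is reachable from a CS-2 login node:

ssh -L 8443:grafana.cerebras1.lab.alcf.anl.gov:8443 ALCFUserID@cer-login-01.ai.alcf.anl.gov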
Point a browser at grafana. (Tested with Firefox and Chrome/Brave)
Open the browser to a job's Grafana URL shown in csctl get jobs, adding :8443 to the hostname.
+
Cerebras jobs are initiated and tracked automatically within the Python framework in modelzoo.common.pytorch.run_utils. This framework interacts with the Cerebras cluster management node.
+
Login nodes
+
Jobs are launched from login nodes.
+If you expect a loss of internet connection for any reason, for long-running jobs we suggest logging into a specific login node and using either screen or tmux to create persistent command line sessions. For details use:
+
man screen
+# or
+man tmux
+
+
Running jobs on the wafer
+
Follow these instructions to compile and train the fc_mnist PyTorch sample. This model is a couple of fully connected layers plus dropout and ReLU.
+
Cerebras virtual environments
+
First, make a virtual environment for Cerebras for PyTorch.
+See Customizing Environments for the procedures for making PyTorch virtual environments for Cerebras.
+If an environment is made in ~/R_2.3.0/, it would be activated as follows:
+
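For example, assuming the environment is named venv_cerebras_pt (matching the prompts shown elsewhere on this page) and lives under ~/R_2.3.0/:

source ~/R_2.3.0/venv_cerebras_pt/bin/activate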
The Cerebras CS-2 is a wafer-scale deep learning accelerator comprising 850,000 processing cores, each providing 48KB of dedicated SRAM memory for an on-chip total of 40GB and interconnected to optimize bandwidth and latency. Its software platform integrates the popular machine learning framework PyTorch.
+
The ALCF CS-2 systems are configured as a Cerebras Wafer-Scale Cluster, designed to support large-scale models (up to and well beyond 1 billion parameters) and large-scale inputs. The cluster contains two CS-2 systems and can distribute jobs across one or both CS-2 systems in a data-parallel framework. The supporting CPU cluster consists of MemoryX, SwarmX, management, and input worker nodes. The Cerebras Wafer-Scale cluster is run as an appliance: a user submits a job to the appliance, and the appliance manages preprocessing and streaming of the data, IO, and device orchestration within the appliance. It provides programming via PyTorch, with data-parallel distribution when using more than one CS-2. This installation supports both Pipelined execution for models up to 1 billion parameters and Weight Streaming execution for models up to and above 1 billion parameters.
+
+
+
+
+
The public Cerebras documentation is available here.
+
A typical Cerebras Wafer-Scale Cluster is shown in the figure.
+Users connect (ssh) to one of the three login nodes. Either ssh to cerebras.ai.alcf.anl.gov, which randomly resolves to one of cer-login-0[1-3].ai.alcf.anl.gov, or ssh to a specific node, cer-login-01.ai.alcf.anl.gov, cer-login-02.ai.alcf.anl.gov, cer-login-03.ai.alcf.anl.gov.
+The rest of the nodes in the cluster infrastructure are not directly accessible, except by admins.
+The trees /home, /projects, and /software are shared across all three login nodes, the relevant cluster infrastructure nodes, and all ALCF AI testbed platforms.
As indicated in the figures, the CS-2 nodes on the right are responsible only for running and accelerating the computations for training and predictions with the model. The other work, including compilation, is performed by input nodes, and by MemoryX nodes, which are used for weight storage and broadcast, and SwarmX nodes, which are used for gradient accumulation. Some model verification work can be done on login nodes.
+
Users have a home filesystem, /home, shared across the ALCF AI testbed systems, including the login and compute nodes. The default user quota is 1 TB of storage and 1,000,000 files. This space is backed up.
+
Project File System Space
+
The team project/campaign file system /projects is intended to facilitate project collaboration and is accessible to the team members of your project that have an ALCF account. Default group storage quota is 2 TB and 2,000,000 files. Please note that this space isn't backed up. Our policy is that data will be purged from disk 6 months after project completion.
+
Data Transfer
+
Users can transfer data to and from the AI testbed using Globus or tools such as scp or rsync.
+
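As a sketch (the login node, username, and project path are placeholders; substitute your own):

# copy a file into your project directory on the /projects filesystem
scp myfile ALCFUserID@gc-login-01.ai.alcf.anl.gov:/projects/<project name>/
# or synchronize a whole directory with rsync
rsync -av mydir/ ALCFUserID@gc-login-01.ai.alcf.anl.gov:/projects/<project name>/mydir/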
Using Globus
+
We have separate Globus endpoints for moving data to and from the /projects and /home filesystems.
+
+
Use alcf#ai_testbed_projects for the /projects file system
+
Use alcf#ai_testbed_home for the /home file system
+
+
Relevant information regarding using Globus can be found here
Please Note: The basic level of protection provided is UNIX file level permissions; it is the user's responsibility to ensure that file permissions and umasks are set to match their needs.
+
Note: The conversion of CosmicTagger (CT) to the various machines is meant as a tutorial on how to convert a model.
+
+
Cerebras CT
+
Cerebras cannot support CT and UNets in general as of 4/25/23.
+
Graphcore CT
+
CT currently runs on Graphcore, but it falls back to running on the CPU. A full port may require rewriting the model directly in Poplar or PopART; whether that effort is warranted is still to be decided.
When changing back to SDK 3.2, use virtual-environments.md from commit a4ce3b5598f4d6feee7ca58accde1a6a0ea84244 ("virtual-environments.md with 3.2 edits").
+
The ALCF AI Testbed houses some of the most advanced AI accelerators for scientific research.
+
The goal of the testbed is to enable explorations into next-generation machine learning applications and workloads, enabling the ALCF and its user community to help define the role of AI accelerators in scientific computing and how to best integrate such technologies with supercomputing resources.
+
The AI accelerators complement the ALCF's current and next-generation supercomputers to provide a state-of-the-art computing environment that supports pioneering research at the intersection of AI, big data, and high performance computing (HPC).
+
The platforms are equipped with architectural features that support AI and data-centric workloads, making them well suited for research tasks involving the growing deluge of scientific data produced by powerful tools, such as supercomputers, light sources, telescopes, particle accelerators, and sensors. In addition, the testbed will allow researchers to explore novel workflows that combine AI methods with simulation and experimental science to accelerate the pace of discovery.
+
How to Get Access
+
Researchers interested in using the AI Testbed’s Cerebras CS-2, SambaNova DataScale SN30, Graphcore Bow Pod64 and GroqRack platforms can now submit project proposals via the ALCF’s Director’s Discretionary program. Access to additional testbed resources, including Habana accelerators, will be announced at a later date.
Request a Director's Discretionary project on SambaNova/Cerebras/Graphcore/Groq.
+
+
+
Apply for an ALCF account after the project request is approved. Choose the SambaNova/Cerebras/Graphcore/Groq project that your PI has created at ALCF. If you have an active ALCF account, request to join the project after your project is approved.
+
+
+
Transfer data to ALCF using Globus after your account has been created.
+
a. The endpoint for your data in ALCF is alcf#ai_testbed_projects with the path to your project being /<project name>.
+
b. The endpoint for your home directory on the AI Testbeds in ALCF is alcf#ai_testbed_home.
+
+
+
Add/invite team members to your ALCF project on SambaNova/Cerebras/Graphcore/Groq.
+
+
+
How to Contribute to Documentation
+
The documentation is based on MkDocs and source files are
+on GitHub. You can contribute to the documentation by creating a pull request.
+
Graphcore provides examples of some well-known AI applications in their repository at https://github.com/graphcore/examples.git.
+Clone the examples repository to your personal directory structure:
+
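A minimal sketch of the clone step, assuming the clone is kept under ~/graphcore/ to match the paths used below:

mkdir -p ~/graphcore
cd ~/graphcore
git clone https://github.com/graphcore/examples.git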
cd ~/graphcore/examples/tutorials/simple_applications/tensorflow2/mnist/
+
+
Run MNIST - TensorFlow
+
Execute the command:
+
/opt/slurm/bin/srun --ipus=1 python mnist.py
+
+
Output
+
The expected output will resemble the following:
+
srun: job 10672 queued and waiting for resources
+srun: job 10672 has been allocated resources
+2023-08-22 23:35:02.925033: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.3.0 (de1f8de2a7) Poplar package: b67b751185
+2023-08-22 23:35:06.119772: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0
+2023-08-22 23:35:07.087287: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
+2023-08-22 23:35:07.351132: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
+2023-08-22T23:35:09.469066Z PL:POPOPS 3545299.3545299 W: createOutputForElementWiseOp 'while/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits/fusion.3/Op/Equal/Out' ({32,10}): No suitable input found, creating new variable with linear tile mapping
+2023-08-22 23:35:18.532415: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
+Epoch 1/4
+2000/2000 [==============================] - 13s 6ms/step - loss: 0.6220
+Epoch 2/4
+2000/2000 [==============================] - 1s 262us/step - loss: 0.3265
+Epoch 3/4
+2000/2000 [==============================] - 1s 273us/step - loss: 0.2781
+Epoch 4/4
+2000/2000 [==============================] - 1s 289us/step - loss: 0.2482
+
+
+
+
ResNet50
+
Activate PopTorch Environment
+
Create a fresh PopTorch environment poptorch33_resnet50_env as outlined in the virtual environment section, then activate it.
+
+To run 4 replicas (a total of 4 IPUs) of the ResNet50 model, make a script with the following contents, called poprun_unet.sh.
+This script tells poprun to use the partition ID of the partition created for the Slurm job used to run the script.
+
To run the GPT-2 PyTorch model, create a new PopTorch virtual environment poptorch33_gpt2 as described in the virtual environment section and activate it.
+The example runs a GPT-2 model that fits on 4 IPUs, as indicated by --ipus-per-replica. The --replication-factor option indicates how many times the model is replicated in a data-parallel manner (4 in this example); hence the total number of IPUs used is 16.
+
The effective global batch size in this example is (micro) batch size * gradient accumulation * replication factor = 1 x 2048 x 4 = 8192. The device-iterations setting determines the total number of samples loaded per training step: global batch size * device iterations = 8192 * 8 = 65536. To learn more about these parameters and about batching on IPUs in general, refer to IPU batching.
+
The above example runs with generated (synthetic) data. To use the same example with a real-world dataset, refer to data setup.
+
Connection to a Graphcore node is a two-step process.
+
The first step is to ssh from a local machine to the login node.
+
The second step is to log in to a Graphcore node from the login node.
+
+
Log in to Login Node
+
Log in to the Graphcore login node from your local machine using the command below. Use your ALCF account ID and the passcode generated by MobilePASS+.
+
+
Note: In the examples below, replace ALCFUserID with your ALCF user id.
+
+
ssh ALCFUserID@gc-login-01.ai.alcf.anl.gov
+# or
+ssh ALCFUserID@gc-login-02.ai.alcf.anl.gov
+
+
+
Note: Use the ssh "-v" option in order to debug any ssh problems.
+
+
Log in to a Graphcore Node
+
Once you are on the login node, ssh to one of the Graphcore nodes.
+
ssh gc-poplar-02.ai.alcf.anl.gov
+# or
+ssh gc-poplar-03.ai.alcf.anl.gov
+# or
+ssh gc-poplar-04.ai.alcf.anl.gov
+
+
+
Note: gc-poplar-01.ai.alcf.anl.gov is not accessible to users via ssh. However, its IPU resources are allocated by Slurm tasks.
+
ALCF's Graphcore POD64 system uses Slurm for job submission and queueing. Below are some of the important commands for using Slurm. For more information refer to Slurm Documentation.
+
+
NOTE: Jobs that require IPUs will fail unless launched with srun or sbatch.
+NOTE: There is a single Slurm scheduler for the Graphcore POD64.
+
+
SRun
+
The Slurm command srun can be used to run individual Python scripts (or other programs) in parallel with other scripts on a cluster managed by Slurm. An example of srun usage is shown below. Use the --ipus= option to specify the number of IPUs required for the run.
+
Example:
+
srun --ipus=1 python mnist_poptorch.py
+
+
SBatch
+
Alternatively, these jobs can be submitted to the Slurm workload manager through a batch script by using the sbatch command. To do this, create a bash script (submit-mnist-poptorch-job.sh here as an example) with the commands that you want to execute.
+
#!/bin/sh
+
+python mnist_poptorch.py
+
+
Then pass the bash script as an input to the sbatch command as shown below, requesting the number of IPUs required:
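A sketch of the submission, assuming the --ipus option shown for srun is accepted by sbatch as well and that the script needs a single IPU:

sbatch --ipus=1 submit-mnist-poptorch-job.sh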
+
+The IPUOF_VIPU_API_HOST environment variable can conflict with running PopTorch programs.
+The Graphcore nodes have a convenience script that temporarily sets this environment variable.
+
Note: Please be mindful of how you are using the system.
+For example, consider running larger jobs in the evening or on weekends.
+
+
Running any model or application includes graph compilation of the model, which is then deployed on the IPUs. Below is a description of training a neural network for classification on the MNIST dataset using PopTorch (a PyTorch framework optimized for the IPU).
All models are run using Slurm, with --ipus indicating how many IPUs need to be allocated for the model being run. This example uses a batch size of 8 and runs for 10 epochs. It also sets device iterations to 50, which is the number of iterations the device runs over the data before returning control to the user. The dataset used in the example comes from TorchVision, and the PopTorch dataloader is used to load the data required for the 50 device iterations from the host to the device in a single step.
+
The model used here is a simple CNN-based model with a classifier (softmax) output.
+A plain PyTorch model is translated to a PopTorch model using poptorch.Options().
+poptorch.trainingModel is the wrapping function applied to the PyTorch model. The first call to trainingModel compiles the model for the IPU; you can observe the compilation process in the output of the above command.
The artifacts from graph compilation are cached in the location set by POPTORCH_CACHE_DIR, where the .popef file corresponding to the model under consideration is stored.
+
Output
+
The expected output starts with dataset downloads, followed by a summary of the model, the progress bar of the compilation process, and the training progress bar.
+
The Graphcore Bow Pod64 system is the latest-generation AI accelerator from Graphcore. It is a one-rack system consisting of 64 Bow-class Intelligence Processing Units (IPUs) with a custom interconnect. The system provides an aggregate 22 petaflops of performance in half precision. It has a total of 57.6 GB of In-Processor-Memory and a total of 94,208 IPU cores. The system includes four servers for data processing.
The Graphcore software stack includes support for TensorFlow and PyTorch using the Poplar SDK. The Poplar SDK is the toolchain specifically designed for creating graph software for ML applications. It integrates with traditional ML frameworks like PyTorch and TensorFlow, allowing users to port their existing code to IPU-specific code. The various components of the Poplar SDK stack are shown in the figure. It includes the PopTorch framework, a wrapper over PyTorch optimized for the IPU hardware. It also lists the supported PopLibs libraries, which enable constructing graphs, defining tensor data, and controlling how code and data are mapped onto the IPU for execution.
+
mkdir ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -b 4096
# Accept default filename of id_rsa
# Enter passphrase (empty for no passphrase):
# Enter same passphrase again:
cat id_rsa.pub >> authorized_keys
+
Update ${HOME}/graphcore/examples/vision/cnns/pytorch/train/benchmarks.yml
+with your favorite editor to match benchmarks.yml.
+
configs.yml
+
Update ${HOME}/graphcore/examples/vision/cnns/pytorch/train/configs.yml
+with your favorite editor. At about line 30, change use_bbox_info: true to
+use_bbox_info: false.
+
Scale ResNet50
+
Scale and benchmark ResNet50.
+
+
Note: The number at the end of each line indicates the number of IPUs.
+
Note: Use screen because every run is long.
+
+
"PopRun exposes this control with the --process-placement flag and provides multiple pre-defined strategies. By default (and with --process-placement spreadnuma), PopRun is designed to be NUMA-aware. On each host, all the available NUMA nodes are divided among the instances. This means that each instance is bound to execute on and allocate memory from its assigned NUMA nodes, ensuring memory access locality. This strategy maximises memory bandwidth and is likely to yield optimal performance for most of the data loading workloads in machine learning." [Multi-Instance Multi-Host(https://docs.graphcore.ai/projects/poprun-user-guide/en/latest/launching.html#multi-instance-multi-host)
+
Setup
+
Move to the correct directory and establish the datasets directory.
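A sketch of these steps; the training directory matches the paths used above, while the datasets location and the DATASETS_DIR variable name are assumptions to be replaced by wherever you stage your data:

cd ${HOME}/graphcore/examples/vision/cnns/pytorch/train
mkdir -p ${HOME}/graphcore/datasets              # assumed staging location
export DATASETS_DIR=${HOME}/graphcore/datasets   # variable name is an assumption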
+
The intent of this page is to show conceptually how to convert a model to run on the Graphcore system.
+It is not necessary to convert CosmicTagger because it has already been converted and is
+located at CosmicTagger on the Graphcore branch.
+The original is located at CosmicTagger.
+
Run Model on CPU
+
The first step to converting a model is to verify that it runs on the CPU. This step has been verified for CosmicTagger.
+
Config.py
+
CosmicTagger can run on multiple machines. As such, it is necessary to specify the architecture
+that one is using. For example, CPU or GPU. The architecture is stored in the
+ComputeMode class.
+
Edit src/config/config.py. Add IPU to the ComputeMode class.
+
class ComputeMode(Enum):
    CPU = 0
    # ...
    IPU = 5
+
+
Trainer.py
+
Edit src/utils/torch/trainer.py.
+
Import PopTorch
+
PopTorch is Graphcore's extension of PyTorch.
+
Import poptorch at the top of the file.
+
import poptorch
+
+
Wrap Model
+
Wrap the model using poptorch.trainingModel() so that it may be run on IPUs for training.
+
Wrap the model using poptorch.inferenceModel() when not training.
+
Find the following code around line 90 in the init_network method.
+
# Foregoing any fusions as to not disturb the existing ingestion pipeline
if self.is_training() and self.args.mode.quantization_aware:
    self._raw_net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    self._net = torch.quantization.prepare_qat(self._raw_net)
else:
    self._net = self._raw_net
+
Putting the loss calculation in forward_pass() allows the loss computation to be performed on the IPUs.
+This will be faster because the data will not need to be transferred round-trip to the CPU.
The following code changes account for the loss function, i.e., self.loss_calculator, and the image labels, i.e., labels_image, being passed to the model's forward_pass method. Additionally, the calculated loss is returned from the forward_pass method.
Receive the extra loss variable from the forward_pass method.
+
Update the train_step method.
+
Original Training Step
+
with self.timing_context("forward"):
    if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
        with torch.cuda.amp.autocast():
            logits_image, labels_image = self.forward_pass(minibatch_data)
    else:
        logits_image, labels_image = self.forward_pass(minibatch_data)

verbose = False

# Compute the loss based on the logits
with self.timing_context("loss"):
    loss = self.loss_calculator(labels_image, logits_image)
+
+
Updated Training Step
+
The forward_pass() method was changed to return the extra variable loss in the previous section. It is now
+received conditionally when using an IPU(s).
+
In the with self.timing_context("loss"): section, only calculate loss if not using an IPU(s).
+
with self.timing_context("forward"):
    if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
        with torch.cuda.amp.autocast():
            logits_image, labels_image = self.forward_pass(minibatch_data)
    else:
        if self.args.run.compute_mode == ComputeMode.IPU:
            logits_image, labels_image, loss = self.forward_pass(minibatch_data)
        else:
            logits_image, labels_image = self.forward_pass(minibatch_data)

verbose = False

# Compute the loss based on the logits
with self.timing_context("loss"):
    if self.args.run.compute_mode == ComputeMode.IPU:
        loss = loss
    else:
        loss = self.loss_calculator(labels_image, logits_image)
+
+
Update Validation Step
+
Update the val_step method.
+
Original Validation Step Code
+
Find this code.
+
if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
    with torch.cuda.amp.autocast():
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)
else:
    logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

# Compute the loss based on the logits
loss = self.loss_calculator(labels_image, logits_image)
+
+
Updated Validation Step Code
+
Change the code to the following.
+
if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
    with torch.cuda.amp.autocast():
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

        # Compute the loss based on the logits
        loss = self.loss_calculator(labels_image, logits_image)
else:
    if self.args.run.compute_mode == ComputeMode.IPU:
        logits_image, labels_image, loss = self.forward_pass(minibatch_data, net=val_net)
    else:
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

        # Compute the loss based on the logits
        loss = self.loss_calculator(labels_image, logits_image)
+
+
UResNet2D Model
+
Update Model
+
The Graphcore system is more computationally efficient if the loss function is on the
+IPU. This is accomplished by using the loss function within the model's forward method.
+
Edit src/networks/torch/uresnet2D.py.
+
Update the Forward Declaration
+
Find the forward method.
+
def forward(self, input_tensor):
+
+
Update the argument list to include the loss function, i.e., loss_calculator
+and the image labels, i.e., labels_image.
+
The intent of this page is to show conceptually how to convert a Graphcore model to run on Distributed Data Parallel
+using PopDist.
+It is not necessary to convert CosmicTagger because it has already been converted and is
+located at CosmicTagger on the GraphcoreDDP branch.
+The original is located at CosmicTagger.
+
Run Model on CPU
+
The first step to converting a model is to verify that it runs on the CPU. This step has been verified for CosmicTagger.
+
Starter Code
+
You may use the code at CosmicTagger on the Graphcore branch.
+
Trainer.py
+
Edit src/utils/torch/trainer.py.
+
Import Poplar Packages
+
PopTorch is Graphcore's extension of PyTorch.
+
PopDist is Graphcore's distributed processing package.
+
Import poptorch and popdist at the top of the file.
poptorch.Options() returns an options object that gets passed to poptorch.trainingModel.
+The returned object is stored in opts in this example.
if self.args.run.compute_mode == ComputeMode.IPU:
    if popdist.isPopdistEnvSet():
        opts = popdist.poptorch.Options()
        # When using the dataloader with 'auto_distributed_partitioning=True'
        # and 'shuffle=True' we must set the random seed to ensure that tensors
        # are in the same order in all processes.
        opts.randomSeed(42)
        # Replication factor is already set via PopRun so
        # we ignore 'args.num_replicas'.
        logging.info(f"Num of local replicas: {popdist.getNumLocalReplicas()}")
    else:
        opts = poptorch.Options()
        opts.replicationFactor(self.args.num_replicas)

    if self.is_training():
        self._net = poptorch.trainingModel(self._net, opts, optimizer=torch.optim.SGD(self._net.parameters(), lr=1e-3))
    else:
        self._net = poptorch.inferenceModel(self._net)
+
This section describes how to generate the files that the Graph Analyser can analyze. The Graph Analyser uses report files generated during compilation and execution by the Poplar SDK.
+
IPU Memory Overhead
+
Because profiling adds extra memory requirements, a model with high memory consumption may run out of memory when profiling is enabled. Depending on the model, you can adjust its parameters to leave space for the instrumentation. For example, you can try decreasing the batch size. In TensorFlow BERT you can adjust the micro batch size.
+
Host Computing Overhead
+
It is essential that you also try to reduce the iterations on each run. For instance, by reducing the number of steps or the number of batches per step you can get a lighter execution profile. This will not only reduce the host computation overhead but will also speed up visualization in the Graph Analyser.
+
The Poplar SDK is installed on the Graphcore systems at /software/graphcore/poplar_sdk/. The default Poplar
+version (3.3.0) is enabled automatically upon logging into a Graphcore node.
+
Check if Poplar is set up correctly:
+
popc --version
+
+
One should see:
+
POPLAR version 3.3.0 (de1f8de2a7)
+clang version 16.0.0 (2fce0648f3c328b23a6cbc664fc0dd0630122212)
+
+
If the Poplar SDK is not enabled, it can be enabled by sourcing the SDK's enable scripts.
+
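For example (the exact directory names under /software/graphcore/poplar_sdk/3.3.0/ are assumptions; list that directory to confirm before sourcing):

source /software/graphcore/poplar_sdk/3.3.0/poplar-*/enable.sh   # Poplar tools and libraries
source /software/graphcore/poplar_sdk/3.3.0/popart-*/enable.sh   # PopART runtime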
To disable the current Poplar SDK, e.g., if you want to use a different Poplar SDK version, follow the steps below. (Otherwise, skip to the section Miscellaneous Environment Variables.)
+This example assumes that the currently installed SDK is 3.1.0 and you want to move to 3.3.0.
PopTorch is an extension of the PyTorch framework that is optimized for IPU-specific functionality. To activate the PopTorch environment, first create a virtual environment and activate it.
The Poplar SDK provides TensorFlow and Keras wheels built on TensorFlow 2.6 that include IPU-specific functionality and are optimized for AMD processors. They can be installed as follows.
+
Connection to a GroqRack node is a two-step process.
+
The first step is to ssh from a local machine to a login node.
+The second, optional step is to ssh from a login node to a GroqRack node. Jobs may also be started and tracked from login nodes.
+
+
Log in to a login node
+
Connect to a Groq login node, editing this command line to use your ALCF user ID. You will be prompted for a password; use the 8-digit code provided by MobilePASS+.
+
ssh ALCFUserID@groq.ai.alcf.anl.gov
+
+This randomly selects one of the login nodes, namely groq-login-01.ai.alcf.anl.gov or groq-login-02.ai.alcf.anl.gov. You can alternatively ssh to the specific login nodes directly.
+
Log in to a GroqRack node
+
Once you are on a login node, optionally ssh to one of the GroqRack nodes, which are numbered 1-9.
+
ssh groq-r01-gn-01.ai.alcf.anl.gov
+# or
+ssh groq-r01-gn-09.ai.alcf.anl.gov
+# or any node with hostname of form groq-r01-gn-0[1-9].ai.alcf.anl.gov
+
This section covers how to remotely use the GroqView profiler and visualizer tool.
+
GroqView sample
+
Groq compilation produces an accurate and detailed model of the performance of a model's execution on Groq cards. There is no need to run a model on Groq cards to use GroqView.
+The GroqView example adds the groqview=True parameter to the groqit call, then calls the groqview() method on the model returned by groqit.
+This is the relevant code when using GroqFlow: it tries to retrieve the compiled model from the cache, compiles the model on a cache miss, then calls groqview().
+From groqflow/examples/pytorch/groqview.py:
+
# Build model
+gmodel = groqit(pytorch_model, inputs, groqview=True)
+# Open GroqView
+gmodel.groqview()
+
+
Run the sample
+
On a Groq node, run the groqview.py sample (or any script that includes similar code). Note the port number chosen by GroqView.
+
conda activate groqflow
+cd ~/groqflow/examples/pytorch
+python groqview.py
# You will see something like the following.
+# The port number may be different.
+...
+Open your web browser:
+ http://localhost:8439
+
+
Forward the port to your machine with a browser
+
On your laptop or other machine with a display, set up a 2-hop ssh tunnel.
+Set $GN_HOSTNAME to the name of the host where the job is running.
+
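A minimal sketch of such a tunnel, assuming the port GroqView reported (8439 in the sample above; adjust if yours differs) and an example GroqRack node name:

export GN_HOSTNAME=groq-r01-gn-01.ai.alcf.anl.gov   # the node where your job is running
ssh -t -L 8439:localhost:8439 ALCFUserID@groq.ai.alcf.anl.gov \
    ssh -L 8439:localhost:8439 $GN_HOSTNAME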
Jobs are launched from any GroqRack node, or from login nodes.
If you expect a loss of an internet connection for any reason, for long-running jobs we suggest logging into a specific node and using either screen or tmux to create persistent command line sessions. For details, see the man pages for screen and tmux.
GroqFlow is the simplest way to port applications running inference to groq. The groqflow github repo includes many sample applications.
+See GroqFlow.
+
Clone the GroqFlow github repo
+
Clone the groqflow github repo and change current directory to the clone:
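A minimal sketch (the repository URL is assumed to be the public GroqFlow GitHub repository):

git clone https://github.com/groq/groqflow.git ~/groqflow
cd ~/groqflow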
+
Create a groqflow conda environment, and activate it.
+Follow the instructions in the Virtual Environments section.
+Note: Similar install instructions are in ~/groqflow/docs/install.md or GroqFlow™ Installation Guide
The conda environment should be reinstalled whenever new groqflow code is pulled from the groqflow github; with a groqflow conda environment activated, redo just the pip install steps.
+
Running a groqflow sample
+
Each groqflow sample directory in the ~/groqflow/proof_points tree has a README.md describing the sample and how to run it.
+
Optionally activate your GroqFlow conda environment
Create a script run_minilmv2.sh with the following contents. It assumes that conda was installed in the default location. The conda initialize section can also be copied from your .bashrc if the conda installer was allowed to add it.
+
#!/bin/bash
+# >>> conda initialize >>>
+# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(${HOME}'/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "${HOME}/miniconda3/etc/profile.d/conda.sh" ]; then
        . "${HOME}/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="${HOME}/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate groqflow
cd ~/groqflow/proof_points/natural_language_processing/minilm
pip install -r requirements.txt
python minilmv2.py
+
+
Then run the script as a batch job with PBS. This will reserve a full eight-card(chip) node.
+
qsub -l select=1,place=excl run_minilmv2.sh
+
+
Note: the number of chips used by a model can be found in the compile cache dir for the model after it is compiled.
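For example, a sketch assuming GroqFlow's default build cache under ~/.cache/groqflow and the minilmv2 sample (the exact file and key names may differ between GroqFlow versions):

grep num_chips_used ~/.cache/groqflow/minilmv2/minilmv2_state.yaml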
+
The groqflow proof_points models use 1, 2, or 4 chips.
+
If your ~/.bashrc initializes conda, an alternative to copying the conda initialization script into your execution scripts is to comment out this section in your ~/.bashrc, changing
+
# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac
+
+to
+
## If not running interactively, don't do anything
+#case $- in
+# *i*) ;;
+# *) return;;
+#esac
+
Job status can be checked with qstat:

$ qstat
+Job id Name User Time Use S Queue
+---------------- ---------------- ---------------- -------- - -----
+3084.groq-r01-co* run_minilmv2 user 0 R workq
+$
+
+
Output will by default go to two files with names like the following, where the suffix is the job id. One is the standard output for the job; the other is the standard error.
+
$ ls -la run_minilmv2.sh.*
+-rw------- 1 user users 448 Oct 16 18:40 run_minilmv2.sh.e3082
+-rw------- 1 user users 50473 Oct 16 18:42 run_minilmv2.sh.o3082
+
+
Run a sample using PBS in interactive mode
+
An alternative is to use an interactive PBS job. This may be useful when debugging new or changed code. Here is an example that starts a 24 hour interactive job. It reserves a full eight-card(chip) node.
+
qsub -I -V -l walltime=24:00:00 -l select=1,place=excl
+
Then activate your groqflow conda environment and run your Python scripts directly with the python command.
+
ALCF's Groq system consists of a single GroqRack™ compute cluster that provides an extensible accelerator network consisting of 9 GroqNode™ [ groq-r01-gn-01 through groq-r01-gn-09 ] nodes with a rotational multi-node network topology. Each of these GroqNodes consists of 8 GroqCard™ accelerators with integrated chip-to-chip connections in a dragonfly multi-chip topology.

The GroqCard™ accelerator is a dual-width, full-height, three-quarter length PCI-Express Gen4 x16 adapter that includes a single GroqChip™ processor with 230 MB of on-chip memory. Based on the proprietary Tensor Streaming Processor (TSP) architecture, the GroqChip processor is a low latency and high throughput single core SIMD compute engine capable of 750 TOPS (INT8) and 188 TFLOPS (FP16) @ 900 MHz that includes advanced vector and matrix mathematical acceleration units. The GroqChip processor is deterministic, providing predictable and repeatable performance.

The GroqWare suite SDK uses an API-based programming model and enables users to develop, compile, and run models on the GroqCard accelerator in a host server system. The SDK uses an ONNX/MLIR-enabled DAG compiler and consists of the Groq Compiler, the Groq API, and utility tools like the GroqView™ profiler and groq-runtime.
+
+
+
+
+
For more information refer to the following links:
+
rm Miniconda3-latest-Linux-x86_64.sh*
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
+# answer y/yes to all prompts
+# exit ssh session, then start a new ssh session
+exit
+
+
GroqFlow conda environment setup
+
Create and activate a groqflow conda environment
+
Create a groqflow conda environment and activate it
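A minimal sketch; the Python version here is an assumption, so use the version given in the GroqFlow installation guide:

conda create -n groqflow python=3.10 -y
conda activate groqflow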
+
Install groqflow into the groqflow conda environment
+
Execute the following commands to install groqflow into the activated groqflow conda environment
+
# Alter this if you have cloned groqflow to some other location.
cd ~/groqflow
if [ -d "groqflow.egg-info" ]; then rm -r groqflow.egg-info; fi
pip install --upgrade pip
pip list --format=freeze > frozen.txt
pip install -r frozen.txt -e .
pushd .
cd demo_helpers
if [ -d "groqflow_demo_helpers.egg-info" ]; then rm -r groqflow_demo_helpers.egg-info; fi
pip install -e .
popd
pip install soundfile
pip install datasets==2.21.0
+
+
Note: if you encounter problems trying to update an existing groqflow conda environment, consider removing the existing environment with the following command, and recreating it. Make sure you deactivate the environment before removing it.
+
conda remove --name groqflow --all -y
+
+
Use Groqflow
+
To use GroqFlow,
+
conda activate groqflow
+
+Note: Always use a personal conda environment when installing packages on groq nodes; otherwise they can get installed into ~/.local and can cause problems when your shared home directory is used on other systems. If you encounter mysterious package dependency/version issues, check your ~/.local/lib and ~/.local/bin for mistakenly installed packages.
+
Note: The conda environment should be reinstalled whenever new groqflow code is pulled from the groqflow github; with a groqflow conda environment activated, redo just the pip install steps, including the removal of the egg-info directories.
+
+
The SambaNova Model Zoo is SambaNova's new github repository for delivering RDU-compatible source code, including example applications for compiling and running models on SambaNova hardware.
+
In the ALCF SN30 cluster, the Model Zoo samples run inside of Singularity containers. The Singularity image includes support for compiling and running models.
+Note: your home directory is mounted by default in the singularity containers.
+
Starting a container:
+
Change directory to your Model Zoo clone, set an environment variable to the host's SambaNova runtime version, then start the container. This example binds a directory containing an OpenWebText dataset.
+
cd ~/sambanova/modelzoo
+export TARGET_SAMBAFLOW_VERSION=$((rpm -q sambanova-runtime 2>/dev/null || dpkg -s sambanova-runtime 2>/dev/null) | egrep -m 1 -o "[0-9]+\.[0-9]+\.[0-9]+")
+echo $TARGET_SAMBAFLOW_VERSION
+# should be of the form 1.19.1
./start_container.sh -b /data/ANL/openwebtext/hdf5/hdf5:/opt/datasets/openweb_hdf54096/ -b /software:/software /software/sambanova/singularity/images/llm-modelzoo/Modelzoo/ModelzooDevbox_1.sif
+
To list all running containers (while outside a container, e.g. a different SSH session):
+
$ singularity instance list
+INSTANCE NAME PID IP IMAGE
+devbox_arnoldw_1724873417 1649294 /software/sambanova/singularity/images/llm-modelzoo/Modelzoo/ModelzooDevbox_1.sif
+
+To re-enter an exited but still-running container (while outside a container):
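A sketch using the instance name from the listing above; singularity shell attaches a new shell to a running instance:

singularity shell instance://devbox_arnoldw_1724873417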
+
Optionally, download the Hugging Face model for Llama-2-7b
+
This model is also available in /software/models/Llama-2-7b-hf/.
First, create a Hugging Face account at https://huggingface.co/join if you do not already have one.
Go to meta-llama/Llama-2-7b-hf and accept the terms of use for Llama2 7B.
You will need to wait (minutes at least) until the request is processed.
+In your Hugging Face account settings, generate a user access token. A read-only token works. Record the token such that it can easily be copy-pasted in the future.
+
# if working in an environment (e.g. laptop) where git-lfs is not installed,
+# sudo apt install git-lfs
+git lfs install # Only needs to be done once
+cd ~/sambanova
+git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
+# Enter your HF user name and user access token (copy;paste) when prompted.
+
+
Text generation sample
+
Compile a text generation sample that uses the HF model
+
Compile a Llama-2-7b text generation sample (using the Hugging Face model). This will take about 20 minutes.
+
cd ~/sambanova
+# or ./Llama-2-7b-hf if downloaded
+python ./modelzoo/examples/nlp/text_generation/rdu_generate_text.py \
+command=compile \
+checkpoint.model_name_or_path=/software/models/Llama-2-7b-hf/ \
+samba_compile.output_folder=/home/$(whoami)/sambanova/out_generation \
++samba_compile.target_sambaflow_version=$TARGET_SAMBAFLOW_VERSION # =1.19.1
+
+
Note: each compile will add a new subdirectory to the output folder (/home/$(whoami)/sambanova/out_generation), containing compile artifacts. The folder can be deleted when testing is complete.
+
Run the text generation sample
+
Run the sample, using the .pef binary created by the compile.
+Note: The expression in the command line finds the most recent pef file.
+
+
+
cd ~/sambanova
+export PEF=$(find /home/$(whoami)/sambanova/out_generation -type f -name "*.pef" -printf "%T@ %p\n" | sort -n | tail -n1 | awk '{print $2}')
+# or ./Llama-2-7b-hf if downloaded
+python ./modelzoo/examples/nlp/text_generation/rdu_generate_text.py \
+ command=run \
+ checkpoint.model_name_or_path=/software/models/Llama-2-7b-hf/ \
+ samba_run.pef=${PEF}
+
+
+
+
The end of the console output should resemble the following:
+
Generating 32 tokens ...
+Decoding ...
+Completion:
+[', there was a little boy who lived in a small town.\nHe was a good boy, but sometimes he had a hard time following the rules.\n']
+
+latencies
+ time to first token 1.1981s
+ tokens, excluding first token 0.3330s
+ tokens, overall 0.3600s
+ Total Latency 1.5310s
+throughputs
+ tokens/second excluding first token 3.0032
+ tokens/second overall 2.7777
+Singularity>
+
+
Model Finetuning Sample
+
Fine-tune the Llama2 7B model using a chat dataset.
+
Data preparation
+
NOTE: These data preparation steps should be performed on a SambaNova node, and not in a singularity container.
+
Install the Generative Data Prep package in a virtualenv (a sketch is given below).
Make sure that you have git lfs installed, with git lfs install.
+
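A hedged sketch of the install, assuming SambaNova's public generative_data_prep repository and the gdp_venv location referenced in the conversion step below:

cd ~/sambanova
git clone https://github.com/sambanova/generative_data_prep
cd generative_data_prep
python -m venv gdp_venv
source gdp_venv/bin/activate
pip install .
deactivate
cd ~/sambanova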
cd ~/sambanova
+git clone https://huggingface.co/datasets/stingning/ultrachat
+
+
Convert the dataset to the .jsonl format
+
cd ~/sambanova
+source generative_data_prep/gdp_venv/bin/activate
+# This step makes a single jsonl file
+python ./modelzoo/examples/nlp/training/utils/convert_ultrachat.py -src ultrachat/ -dest ultrachat_processed.jsonl
+# get a small subset to keep the 1 epoch runtime down.
+mv ~/sambanova/ultrachat_processed.jsonl ~/sambanova/ultrachat_processed_full.jsonl
+head -1000 ~/sambanova/ultrachat_processed_full.jsonl > ~/sambanova/ultrachat_processed.jsonl
+# This step makes a directory of hdf5 files from the single jsonl file
+export TOKENIZER="./Llama-2-7b-hf"
+export MAX_SEQ_LENGTH=4096
python -m generative_data_prep pipeline --input_file_path=./ultrachat_processed.jsonl --output_path=./ultrachat_dialogue --pretrained_tokenizer=${TOKENIZER} --max_seq_length=${MAX_SEQ_LENGTH}
+deactivate
+
+
Compile a sample that finetunes the HF model
+
Start container
+
If you are not already in a Singularity container (with the pre-reqs installed),
+start a new Model Zoo Singularity container with
+
cd ~/sambanova/modelzoo
+export TARGET_SAMBAFLOW_VERSION=$((rpm -q sambanova-runtime 2>/dev/null || dpkg -s sambanova-runtime 2>/dev/null) | egrep -m 1 -o "[0-9]+\.[0-9]+\.[0-9]+")
+echo $TARGET_SAMBAFLOW_VERSION
+# should be of the form 1.19.1
+./start_container.sh -b /data/ANL/openwebtext/hdf5/hdf5:/opt/datasets/openweb_hdf54096/ -b /software:/software /software/sambanova/singularity/images/llm-modelzoo/Modelzoo/ModelzooDevbox_1.sif
+
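The compile command itself did not survive in this copy. A heavily hedged sketch follows, assuming rdu_train_llm.py accepts the same Hydra-style compile keys as the text generation example above (verify the parameter names against the script's config before use):

cd ~/sambanova
python ./modelzoo/examples/nlp/training/rdu_train_llm.py \
  command=compile \
  checkpoint.model_name_or_path=/software/models/Llama-2-7b-hf/ \
  model.max_seq_length=4096 \
  samba_compile.output_folder=/home/$(whoami)/sambanova/out_train \
  +samba_compile.target_sambaflow_version=$TARGET_SAMBAFLOW_VERSION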
Note: each compile will add a new subdirectory to the output folder (/home/$(whoami)/sambanova/out_train), containing compile artifacts. The folder can be deleted when testing is complete.
+
Run finetuning using generated pef file
+
This will run for 1 full epoch and takes 1 hour to execute, using a single RDU.
+It uses the config file modelzoo/examples/nlp/training/config/base_config_rdu.yaml
+
cd ~/sambanova
+export CHECKPOINT=/software/models/Llama-2-7b-hf/ # or ./Llama-2-7b-hf
+export MAX_SEQ_LENGTH=4096
+export DATASET=./ultrachat_dialogue; # or container path to dataset
+# Finds most recent pef file in tree
+export PEF=$(find /home/$(whoami)/sambanova/out_train -type f -name "*.pef" -printf "%T@ %p\n" | sort -n | tail -n1 | awk '{print $2}')
+python -u modelzoo/examples/nlp/training/rdu_train_llm.py \
+ command=run \
+ checkpoint.model_name_or_path=${CHECKPOINT} \
+ model.max_seq_length=${MAX_SEQ_LENGTH} \
+ samba_run.pef=${PEF} \
+ training.dataset=${DATASET}
+
+
The end of the console output should resemble the following if run for a full epoch:
+
Targeting samba-runtime v4.2.5. Samba is running with --target-runtime-version=1.3.10 on a system with installed runtime None.
+
+Log ID initialized to: [arnoldw][python][1003] at /var/log/sambaflow/runtime/sn.log
+Loading dataset for epoch 1...
+
+Number of epochs: 1
+Batch size: 8
+Number of batches (steps): 1,143
+
+Starting training for epoch 1...
+Epoch [1/1], Step [1/1143], Loss: 0.8184
+Epoch [1/1], Step [2/1143], Loss: 0.2452
+Epoch [1/1], Step [3/1143], Loss: 0.3727
+Epoch [1/1], Step [4/1143], Loss: 0.2945
+...
+Epoch [1/1], Step [1134/1143], Loss: 0.2529
+Epoch [1/1], Step [1135/1143], Loss: 0.2713
+Epoch [1/1], Step [1136/1143], Loss: 0.2669
+Epoch [1/1], Step [1137/1143], Loss: 0.2144
+Epoch [1/1], Step [1138/1143], Loss: 0.2129
+Epoch [1/1], Step [1139/1143], Loss: 0.2229
+Epoch [1/1], Step [1140/1143], Loss: 0.2263
+Epoch [1/1], Step [1141/1143], Loss: 0.2434
+Epoch [1/1], Step [1142/1143], Loss: 0.2131
+Epoch [1/1], Step [1143/1143], Loss: 0.1626
+Finished training.
+Saving checkpoint...
+Checkpoint saved at finetuned_model/
+Saving summary...
+Summary saved at finetuned_model/summary.txt
+Singularity>
+
+
In this section we will learn how to extend the UNet2d and Gpt1.5B application scripts introduced in Example Programs to compile and run multiple instances of the model in a data parallel fashion, across multiple tiles or across multiple nodes.
+
UNet2d
+
Set Up
+
Create the following directory and change to it if you have not already done so.
Create the files Unet2d.sh and unet_batch.sh in the current directory, copying in the contents of Unet2d.sh and unet_batch.sh using your favorite editor.
+
chmod +x Unet2d.sh
chmod +x unet_batch.sh
+
+
Compile and run
+
Run these commands for training (compile + train):
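The exact command lines were lost in this copy; a hedged sketch, assuming the argument order <image size> <batch size> <num of instances> <RunID> described in the list below:

./Unet2d.sh pcompile 256 256 8 unet_pcompile_256
./Unet2d.sh prun 256 256 8 unet_prun_256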
+The compile and run scripts have the following input arguments.
+
- image size: The images are square. Valid sizes include 256, 512, and 1024.
- Batch size: local batch size. The global batch size is local batch size * Num of instances.
- num of instances: Total number of instances of Unet2d run in data parallel framework.
- RunID: A unique Id for the compile or run process.
+
+
The script uses the arguments pcompile and prun for the data parallel compile and run.
The above commands display the file that contains the output for the execution of the above scripts, usually /data/ANL/results/<hostname>/<userId>/<RunID>/Unet2d.out
+
You can inspect the compile command, which contains the --data-parallel -ws 2 arguments, to ensure that the pef file is compatible with data parallel runs. The pef generated from the compilation process for the above compile command is placed under out/Unet2d/unet_train_256_256_NP_4 inside the current working directory.
Once the model is compiled, sbatch is used to launch the multiple instances. The below example shows that a total of 8 tasks or instances are launched over the host on which the script is launched.
The throughput is calculated by averaging the e2e samples_per_sec over the different instances.
+
inner train loop time : 36.314290046691895 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 563.9653143065
+inner train loop time : 33.36756229400635 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 613.7697389922524
+inner train loop time : 33.94625234603882 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 603.3066563941279
+inner train loop time : 32.309499979019165 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 633.8692958200872
+inner train loop time : 31.418426036834717 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 651.8467849404489
+inner train loop time : 28.164129495620728 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 727.1660927132315
+inner train loop time : 30.29698896408081 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 675.9747651583616
+inner train loop time : 25.332663536071777 for 10 epochs, number of global steps: 10, e2e samples_per_sec: 808.442427336472
+
Create and run Gpt1.5B_compile.sh and Gpt1.5B_run.sh
+
Create the files Gpt1.5B_compile.sh and Gpt1.5B_run.sh in the current directory.
Copy the contents of Gpt1.5B_compile.sh and Gpt1.5B_run.sh. Alternatively, the files can be accessed at /data/ANL/scripts/Gpt1.5B_compile.sh and /data/ANL/scripts/Gpt1.5B_run.sh on any of the compute nodes and can be copied over to the working directory.
+
Compile and Run
+
This script consists of commands to compile and run multiple instances of the Gpt1.5B model across multiple nodes. Run Gpt1.5B_compile.sh to compile and generate the pef file for the model; it in turn launches the Gpt1.5B_run.sh script to run multiple instances of the model over the different nodes.
You can see the log file path displayed on the screen as seen in the example below. You can use the tail command to check the progress of the run.
+
vsastry@sn30-r1-h1:~/nlp-multiNodetest$ ./Gpt1.5B_compile.sh
+Using /data/ANL/results/sn30-r1-h1/vsastry/041823.19/GPT1.5B.out for output
+
+
The artifacts of the compile process are produced under /data/scratch/<userId>.
+
Inspect the compile command in the script to see that it includes the additional arguments --data-parallel and -ws 2 to generate a pef that is compatible with data parallel runs.
Once the model is compiled, sbatch is used to launch the multiple instances across the nodes. In this example, a total of 32 tasks or instances are launched over 2 nodes, with each node having a maximum of 16 tasks. Slurm allocates any 2 of the available nodes.
The run command for each of these instances is present in the Gpt1.5B_run.sh script. You can inspect the command in the script to see that the --data-parallel --reduce-on-rdu arguments are present, ensuring that the model is run in a data parallel fashion and that the gradient accumulation takes place on the RDU.
The Slurm log associated with the JOBID (10191 in the example below) is located in the home directory. You can use the tail command to check the progress of the training.
+
vsastry@sn30-r1-h1:~$ tail -f ~/slurm-10191.out
+Using /data/ANL/results/sn30-r1-h1/vsastry/041823.03/Gpt1.5B.out for output
+
+
You can use the link to the tutorials on the SambaNova GitHub site or the examples on the compute node (as explained below).
+
+
Find the tutorials on the SambaNova GitHub site. If you use those instructions, ensure that you still use the steps for accessing the SN compute node, setting the required environment and compiling and running the applications as described in this documentation.
+
Use the examples of well-known simple AI applications under the path: /opt/sambaflow/apps/starters, on all SambaNova compute nodes, as discussed on this page.
Deactivate any active conda environment. If you have conda installed and a conda environment is active, you will see something like (base) at the beginning of the command prompt. If so, you will need to deactivate it with conda deactivate. Conda is not used on the SambaNova SN30 cluster.
+
LeNet
+
Change directory
+
cd ~/apps/starters/lenet
+
+
Common Arguments
+
Below are some of the common arguments used across most of the models in the example code.
+
Argument                      Default         Help
-b                            1               Batch size for training
-n, --num-iterations          100             Number of iterations to run the pef for
-e, --num-epochs              1               Number epochs for training
--log-path                    'checkpoints'   Log path
--num-workers                 0               Number of workers
--measure-train-performance   None            Measure training performance

LeNet Arguments
+
Argument         Default        Help
--lr             0.01           Learning rate for training
--momentum       0.0            Momentum value for training
--weight-decay   0.01           Weight decay for training
--data-path      './data'       Data path
--data-folder    'mnist_data'   Folder containing mnist data
+
Note: If you receive an "HTTP error" message on any of the following commands, run the command again. Such errors (e.g., 503) are commonly an intermittent failure to download a dataset.
+
+
Run these commands to compile and train the LeNet model:
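The commands were lost in this copy; a hedged sketch, using the common arguments listed above together with the --pef-name, --output-folder, and --pef options referenced elsewhere in this documentation:

srun python lenet.py compile -b=1 --pef-name="lenet" --output-folder="pef"
srun python lenet.py run --pef="pef/lenet/lenet.pef"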
The output, pef/logreg/output.log, will look something like this:
+
2023-03-08 21:18:25.168190: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
+To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
+2023-03-08 21:18:25.334389: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
+2023-03-08 21:18:25.334430: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
+2023-03-08 21:18:26.422458: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
+2023-03-08 21:18:26.422701: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
+2023-03-08 21:18:26.422709: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
+[Info][SAMBA]# Placing log files in /home/wilsonb/apps/starters/logreg/pef/logreg/logreg.samba.log
+[Info][MAC]# Placing log files in /home/wilsonb/apps/starters/logreg/pef/logreg/logreg.mac.log
+...
+
+Epoch [1/1], Step [10000/60000], Loss: 0.4642
+Epoch [1/1], Step [20000/60000], Loss: 0.4090
+Epoch [1/1], Step [30000/60000], Loss: 0.3863
+Epoch [1/1], Step [40000/60000], Loss: 0.3703
+Epoch [1/1], Step [50000/60000], Loss: 0.3633
+Epoch [1/1], Step [60000/60000], Loss: 0.3553
+Test Accuracy: 91.40 Loss: 0.3014
+2023-03-08T21:19:08 : [INFO][LIB][2688517]: sn_create_session: PEF File: pef/logreg/logreg.pef
+
+
UNet2D
+
The UNet application example is provided in the path /opt/sambaflow/apps/image/segmentation/. As with any other application, we first compile and then train the model using the compile and run arguments, respectively.
The scripts containing the compile and run commands for the UNet2D model can be accessed at Unet2d.sh or at /data/ANL/scripts/Unet2d.sh on any SN30 compute node.
+
Change directory and copy files.
+
mkdir -p ~/apps/image/unet
cd ~/apps/image/unet
+
+
Copy and paste the contents of Unet2d.sh to a file with the same name in the current directory using your favorite editor.
+
chmod +x Unet2d.sh
+
+
Run these commands for training (compile + train).
The compile and run arguments of the script can only be run with the number of instances equal to 1, indicating that this is a simple 4-tile run without the data parallel framework.
For an image size of 256x256 and batch size 256, when running just 1 instance, the commands are as follows.
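A hedged sketch, again assuming the argument order <image size> <batch size> <num of instances> <RunID>:

./Unet2d.sh compile 256 256 1 unet_single
./Unet2d.sh run 256 256 1 unet_single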
The above commands display the file that contains the output for the execution of the above scripts, usually /data/ANL/results/<hostname>/<userid>/<RunID>/Unet2d.out
+
If we inspect the compile and run commands for the UNet application provided in the script, we see that the application is compiled with --num-tiles 4, which means that the entire application fits on 4 tiles, or half of an RDU.
The pef generated from the compilation process of the above command is placed under out/Unet2d/unet_train_256_256_single_4 inside the current working directory.
The performance data is located at the bottom of the log file.
+
inner train loop time : 374.6789753437042 for 10 epochs, number of global steps: 130, e2e samples_per_sec: 88.82270474202953
+
+
Gpt 1.5B
+
The Gpt 1.5B application example is provided in the path /opt/sambaflow/apps/nlp/transformers_on_rdu/.
The scripts containing the compile and run commands for the Gpt1.5B model can be accessed at /data/ANL/scripts/Gpt1.5B_base_single_compile.sh and /data/ANL/scripts/Gpt1.5B_base_single_run.sh on any SN30 compute node. These scripts compile and run only 1 instance, and the model fits on 4 tiles or half of an RDU. The scripts are provided for reference.
The Gpt1.5B_base_single_compile.sh script will internally call Gpt1.5B_base_single_run.sh to perform the training. You can inspect the compile and run commands in the scripts to learn that this model trains with a batch size of 32 for 1 instance over 4 tiles. The human decision file and the compiler config file help to optimize the compute and memory resources specific to this Gpt 1.5B model run.
+
SambaNova SN30 can be accessed using your ALCF account. See Get Started
+to request an account and for additional information.
+
Setup
+
System View
+
Connection to a SambaNova node is a two-step process. The first step is to ssh to the login node.
+This step requires an MFA passcode for authentication - an
+eight-digit passcode generated by an app on your mobile device, e.g., MobilePASS+.
+The second step is to log in to a SambaNova node from the login node.
+
+
Log in to Login Node
+
Log in to the SambaNova login node from your local machine using the below command. This uses the MobilePASS+ token generated every time you log in to the system. This is the same passcode used to authenticate into other ALCF systems, such as Polaris.
+
In the examples below, replace ALCFUserID with your ALCF user id.
Note: Use the ssh "-v" option in order to debug any ssh problems.
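For example (the same login command is shown again later in this documentation):

ssh ALCFUserID@sambanova.alcf.anl.gov
# Enter the MobilePASS+ passcode when prompted for a password.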
+
+
Log in to a SambaNova Node
+
Once you are on the login node, a SambaNova node can be accessed using an alias, sn30-r[1-4]-h[1-2] where 'r' stands for the rack number, and 'h' stands for host. sn30-r1-h1 is the first host of the first rack.
+
The 8 nodes are aliased as : sn30-r1-h1 , sn30-r1-h2, sn30-r2-h1, sn30-r2-h2, sn30-r3-h1, sn30-r3-h2, sn30-r4-h1, sn30-r4-h2.
+
sn30-r1-h1 can be accessed as below.
+
ssh sn30-r1-h1
+
+
SDK setup
+
The required software environment (SambaFlow software stack and the associated environmental variables) for a SN30 node is set up automatically at login. This is unlike the SN10 where the environment had to be set up by each user.
+
SambaNova uses Slurm for job submission and queueing. Below are some of the important commands for using Slurm. For more information refer to Slurm Documentation.
+
+
Note: Run the Python scripts using 'srun' or 'sbatch', to ensure that concurrent jobs do not interfere with each other.
+
Note: There is just one scheduler for all of the SambaNova nodes.
+
+
SRun
+
The Slurm command srun can be used to run individual Python scripts in parallel with other scripts on a cluster managed by Slurm. Examples of srun usage are shown below.
+
Slurm will assign a nodelist/host to run a job if a host is not specified.
Alternatively, these jobs can be submitted to the Slurm workload manager through a batch script by using the sbatch command. To do this, create a bash script (submit-lenet-job.sh here as an example) with the commands that you want to execute.
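As an illustration (a sketch reusing the LeNet starter from the Example Programs section; the script name, flags, and log file name are illustrative):

# Run a single script directly under Slurm
srun python lenet.py compile -b=1 --pef-name="lenet" --output-folder="pef"

# Or put the commands in a batch script, submit-lenet-job.sh:
#   #!/bin/sh
#   python lenet.py compile -b=1 --pef-name="lenet" --output-folder="pef"
#   python lenet.py run --pef="pef/lenet/lenet.pef"
# and submit it with:
sbatch --output=lenet-job.log submit-lenet-job.sh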
+
To find the SDK version, run the following commands
+
+(venv)ALCFUserID@sn30-r1-h1:~$ python
+Python 3.7.6 (default, Feb 18 2020, 21:28:31)
+[GCC 9.3.0] on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import sambaflow
+>>> sambaflow.__version__
+'1.11.5'
+>>>
+
+
OMP_NUM_THREADS
+
The OMP_NUM_THREADS environment variable sets the number of threads to use for parallel regions.
+
The value of this environment variable must be a list of positive integer values. The values of the list set the number of threads to use for parallel regions at the corresponding nested levels.
+
For the SambaNova system, it is usually set to one.
+
export OMP_NUM_THREADS=16
+
+
Where is the Model?
+
Two copies of the model are maintained. One in host CPU memory and one in RDU
+memory. They do not interfere with each other unless you explicitly sync
+the model/parameter in between using:
+
SambaTensor.rdu() # Moves the CPU model to the RDU
+SambaTensor.cpu() # Moves the RDU model to the CPU
+
+
In order to run the model on the CPU, you can simply use the PyTorch model
+as if there is no RDU.
+In order to run the model on RDU, you would need to use session.run().
+
Useful Commands
+
SN Configuration
+
snconfig show Node static
+
+
The snconfig utility shows the static configuration of the system. The configuration for the first node is as follows:
+
======================================================
+======= NODE Info =======
+======================================================
+======= Static Info =======
+Timestamp: 2023-03-16 17:00:04
+Platform Name: DataScale SN30-8
+Node Name: NODE
+ Number of XRDUS: 4
+ XRDU Name: XRDU_0
+ Number of RDUS: 2
+ RDU name: RDU_0
+ Serial Number : 205057B469B35895
+ Number of TILES: 8
+ TILE Name: TILE_0
+ Serial Number : N/A
+ TILE Name: TILE_1
+ Serial Number : N/A
+
+
+...
+
+
+ Size : 128.0 GB
+ Serial Number : 1F5BC22
+ DDR CH Name: DDRCH_6
+ Number of DIMMS: 1
+ DIMM Name: DIMM_L0
+ Size : 128.0 GB
+ Serial Number : 1F5BC99
+ DDR CH Name: DDRCH_7
+ Number of DIMMS: 1
+ DIMM Name: DIMM_M0
+ Size : 128.0 GB
+ Serial Number : 1F5BB68
+ Total XRDU_3 memory size (GB): 2048.0
+
+
SambaNova Daemon Service
+
The following command checks if the SambaNova daemon service is running.
+
systemctl status snd
+
+
The output should look something like this:
+
● snd.service - SN Devices Service
+ Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
+ Drop-In: /etc/systemd/system/snd.service.d
+ └─override.conf
+ Active: active (running) since Fri 2023-01-27 04:03:14 UTC; 1 months 18 days ago
+ Main PID: 5635 (snd)
+ Tasks: 9 (limit: 629145)
+ Memory: 156.8M
+ CGroup: /system.slice/snd.service
+ └─5635 /opt/sambaflow/bin/snd
+
+Warning: some journal files were not opened due to insufficient permissions.
+
+
Tile status
+
sntilestat
watch sntilestat
+
+
The output shown below is when the system is completely idle.
+
Note: Please be mindful of how you are using the system.
For example, consider running larger jobs in the evening or on weekends.
+
Note: Please use only Slurm commands, i.e., srun and sbatch, to run your code.
+If you run your code directly using the 'python' command, it may cause conflicts
+on the system.
+
Note: If you have conda installed and a conda environment is active, you will see something like (base) at the beginning of the command prompt. If so, you will need to deactivate it with conda deactivate. Conda is not used on the SambaNova SN30 cluster.
+
+
Introduction
+
The SambaNova workflow includes the following main steps to run a model.
Example Programs lists the different example applications with corresponding commands for each of the above steps.
+
Compile
+
Compiles the model and generates a .pef file. This file contains
+information on how to reconfigure the hardware, and map the compute and
+memory resources required to run an application on RDUs.
+The pef files are by default saved in the 'out' directory; the
+SambaNova documentation advises saving pef files in separate
+directories with the '--output-folder' option.
+
It is necessary to re-compile only when the model changes, or parameters specific to the model graph change, including the batch size.
+
Compile times can be significant. Compiling the UNet sample, for example, takes 358 seconds when using images of size 32x32 pixels, and 1844 seconds for images of size 256x256.
+
The entire compile process is executed on the host and no RDUs are involved in the compile step.

Run

As part of this step, the model is trained on the RDUs by passing in the pef file and the training dataset. The location of the pef file generated in the compile step is passed as an argument to the run command. Below is an example of the run command that trains a LeNet model.
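(A hedged sketch, reusing the LeNet starter command shown earlier; the pef path is illustrative.)

srun python lenet.py run --pef="pef/lenet/lenet.pef"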
+
Test (Optional)
+
This command is used to run the model on both the host CPU and a SambaNova RDU. It compares the results from the CPU and RDU and will report if any discrepancies are found. Pass the pef file generated as part of the compile step as the input to this command.
+
This section covers how to use the SambaTune profiling performance tuning tool, and the SambaTune UI for viewing the results.
+
+
SambaTune uses a yaml file that describes how to profile an application.
+There are samples in /opt/sambaflow/sambatune/configs.
+This section shows how to run the simplest sample, a linear net.
+
First, ssh into one of the nodes in the SN30 cluster.
Next, start a Slurm interactive job reserving a full node (8 RDUs) for 8 hours (480 minutes), as sketched below:
+
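A sketch, assuming Slurm's generic-resource syntax with a gres named rdu (check the local gres name and count against the site's Slurm configuration):

srun --pty --gres=rdu:8 --time=480 bash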
Next, set an environment variable indicating where the profiling information should be stored:
+
export DUMP_ROOT=~/Sambatune
+
+
If running a large model, the profiling information can be hundreds of gigabytes or more, and DUMP_ROOT should be set to a location with more storage than your home directory (which has a quota), e.g., somewhere you have write access under /projects.
+
Optionally, examine the sample yaml file. You will see that it has 5 top-level sections: app:, model-args:, compile-args:, run-args:, env:
+
Next, run sambatune using a sample sambatune yaml configuration file. This sample command line requests profiling with the benchmark, instrument, and run modes.
+
$ sambatune --modes benchmark instrument run -- /opt/sambaflow/sambatune/configs/linear_net.yaml
+
+
This will take a while to run, particularly if the yaml for a larger model is used.
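When the run completes, start the SambaTune UI server on the node; a sketch, assuming the report directory was generated under the DUMP_ROOT set above (sambatune_ui prints a generated password on startup):

export ST_PORT=8576
sambatune_ui --directory $DUMP_ROOT/artifact_root/sambatune_gen --port $ST_PORT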
Copy the password shown (e.g. to your clipboard). The userid is always admin. The password is different for every sambatune_ui run.
+
In a fresh console on your working machine where you will run the browser, set up a two-hop ssh tunnel to the target node. Replace the ALCFUserID in the ssh command line with your ALCF userid.
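A sketch of such a tunnel, assuming ST_PORT=8576 and that sambatune_ui is running on sn30-r1-h1:

ssh -L 8576:localhost:8576 ALCFUserID@sambanova.alcf.anl.gov -t ssh -L 8576:localhost:8576 -N sn30-r1-h1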
+
Put localhost:8576 in the url bar of a Chrome-family browser. (Chrome, Brave, Vivaldi, Opera tested.)
+A login prompt for the sambatune ui should show.
+Enter admin and the password copied previously.
+You should now see the SambaTune UI.
+
If the browser does not show a login prompt, or if any previous step complains about a port conflict, try another value for ST_PORT on both the target node and for the ssh tunnel command, e.g. 8577.
+
See SambaNova's SambaTune documentation for more information about using SambaTune and the SambaTune UI.
+This section is a good starting point: Workflow overview
+
When finished:
+- Break the ssh tunnel with ctrl-c or equivalent.
+- Stop the sambatune_ui server on the target node with ctrl-c or equivalent.
+- Exit the interactive slurm job to release the reserved resources.
+
A disconnected job can be canceled by determining its job id with squeue -a and canceling the job with scancel <jobid>
+
The SambaNova DataScale SN30 system is architected around the next-generation Reconfigurable Dataflow Unit (RDU) processor for optimal dataflow processing and acceleration. The AI Testbed's SambaNova SN30 system consists of eight nodes in 4 full racks, each node featuring eight RDUs interconnected to enable model and data parallelism. SambaFlow, Sambanova's software stack, extracts, optimizes, and maps the dataflow graphs to the RDUs from standard machine learning frameworks like PyTorch.
+
Below are some of the links to SambaNova documentation.
+
Port forwarding is covered here. This is specifically for TensorBoard.
+
TensorBoard Port Forwarding
+
This section describes the steps to be followed to set up port forwarding for applications,
+like TensorBoard, which runs on the SambaNova system and binds to one or more ports.
+This example uses 6006 and 16006 as port numbers. Using port numbers other than these may
+avoid collisions with other users.
+
From Your Local Machine
+
Replace ALCFUserID with your ALCF User ID.
+
Run
+
# Forward a port number from sambanova.alcf.anl.gov to your local machine.
ssh -v -N -f -L localhost:16006:localhost:16006 ALCFUserID@sambanova.alcf.anl.gov
...
Password: <MobilePass+ code>

# Connect to sambanova.alcf.anl.gov
ssh ALCFUserID@sambanova.alcf.anl.gov
...
Password: <MobilePass+ code>
+
+
From sambanova.alcf.anl.gov
+
Below are the commands specific to sn30-r1-h1. You may replace sn30-r1-h1 with any other node when using the appropriate system.
+
Run
+
+
Note: The full name is sn30-r1-h1.ai.alcf.anl.gov and it may also be used.
+
+
# Forward the port.
ssh -N -f -L localhost:16006:localhost:6006 ALCFUserID@sn30-r1-h1
# Connect to the system.
ssh ALCFUserID@sn30-r1-h1
+
+
On sn30-r1-h1
+
Activate the venv appropriate to your project.
+
Navigate to the appropriate directory for your model.
+Launch your model using srun or sbatch.
The SambaNova system has a bash shell script to set up the required software environment. This sets up the SambaFlow software stack and the associated environment variables, and activates a pre-configured virtual environment.
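If TensorBoard is not already running as part of your job, a minimal sketch of starting it on the node, bound to the port being forwarded (the log directory is illustrative):

tensorboard --logdir ./logs --port 6006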
Then, navigate in your browser to, in this example, http://localhost:16006 on your local machine.
+
Notes
+
Explanation of ssh command:
+
-N : no remote commands
+
+-f : put ssh in the background
+
+-L <machine1>:<portA>:<machine2>:<portB> :
+
+The full command line will forward <machine2>:<portB> (remote scope) to <machine1>:<portA> (local scope)
+
+
The intent of this page is to show conceptually how to convert a model to run on the SambaNova system.
+It is not necessary to convert CosmicTagger because it has already been converted and is
+located at CosmicTagger on the SambaNova branch.
+The original is located at CosmicTagger.
+
Run Model on CPU
+
The first step to converting a model is to verify that it runs on the CPU. This step has been verified for CosmicTagger.
+
Config.py
+
CosmicTagger can run on multiple machines. As such, it is necessary to specify the architecture
+that one is using. For example, CPU or GPU. The architecture is stored in the
+ComputeMode class.
+
Edit src/config/config.py. Add RDU to the ComputeMode class.
+
class ComputeMode(Enum):
    CPU = 0
    # ...
    RDU = 6
+
+
Trainer.py
+
Edit src/utils/torch/trainer.py.
+
Import SambaNova Packages
+
Insert the imports at the top of the file.
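The import list itself was not carried over into this copy; a sketch of typical SambaFlow imports seen in SambaFlow sample applications (the exact set depends on the conversion):

import sambaflow.samba as samba
import sambaflow.samba.utils as utils
from sambaflow.samba.utils.argparser import parse_app_args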
+
SambaFlow is a complete software stack designed to take input from standard machine learning frameworks such as PyTorch and TensorFlow. SambaFlow automatically extracts, optimizes, and maps dataflow graphs onto RDUs.
Wrap the model using poptorch.trainingModel() so that it may be run on IPUs for training.
+
Wrap the model using poptorch.inferenceModel() when not training.
+
Find the following code around line 90 in the init_network method.
+
# Foregoing any fusions as to not disturb the existing ingestion pipeline
if self.is_training() and self.args.mode.quantization_aware:
    self._raw_net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    self._net = torch.quantization.prepare_qat(self._raw_net)
else:
    self._net = self._raw_net
+
Putting the loss calculation in forward_pass() allows the loss computation to be performed on the IPUs.
This will be faster because the data will not need to be transferred round-trip to the CPU.
The following code changes are to account for the loss function, i.e., self.loss_calculator, and the
+image labels, i.e., labels_image, to be passed to the model's forward_pass method. Additionally, the calculated
+loss is returned from the forward_pass method.
Receive the extra loss variable from the forward_pass method.
+
Update the train_step method.
+
Original Training Step
+
with self.timing_context("forward"):
    if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
        with torch.cuda.amp.autocast():
            logits_image, labels_image = self.forward_pass(minibatch_data)
    else:
        logits_image, labels_image = self.forward_pass(minibatch_data)

verbose = False

# Compute the loss based on the logits
with self.timing_context("loss"):
    loss = self.loss_calculator(labels_image, logits_image)
+
+
Updated Training Step
+
The forward_pass() method was changed to return the extra variable loss in the previous section. It is now
+received conditionally when using an IPU(s).
+
In the with self.timing_context("loss"): section, only calculate loss if not using an IPU(s).
+
with self.timing_context("forward"):
    if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
        with torch.cuda.amp.autocast():
            logits_image, labels_image = self.forward_pass(minibatch_data)
    else:
        if self.args.run.compute_mode == ComputeMode.IPU:
            logits_image, labels_image, loss = self.forward_pass(minibatch_data)
        else:
            logits_image, labels_image = self.forward_pass(minibatch_data)

verbose = False

# Compute the loss based on the logits
with self.timing_context("loss"):
    if self.args.run.compute_mode == ComputeMode.IPU:
        loss = loss
    else:
        loss = self.loss_calculator(labels_image, logits_image)
+
+
Update Validation Step
+
Update the val_step method.
+
Original Validation Step Code
+
Find this code.
+
if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
    with torch.cuda.amp.autocast():
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)
else:
    logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

# Compute the loss based on the logits
loss = self.loss_calculator(labels_image, logits_image)
+
+
Updated Validation Step Code
+
Change the code to the following.
+
if self.args.run.precision == Precision.mixed and self.args.run.compute_mode == ComputeMode.GPU:
    with torch.cuda.amp.autocast():
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

    # Compute the loss based on the logits
    loss = self.loss_calculator(labels_image, logits_image)
else:
    if self.args.run.compute_mode == ComputeMode.IPU:
        logits_image, labels_image, loss = self.forward_pass(minibatch_data, net=val_net)
    else:
        logits_image, labels_image = self.forward_pass(minibatch_data, net=val_net)

        # Compute the loss based on the logits
        loss = self.loss_calculator(labels_image, logits_image)
+
+
UResNet2D Model
+
Update Model
+
The Graphcore system is more computationally efficient if the loss function is on the
+IPU. This is accomplished by using the loss function within the model's forward method.
+
Edit src/networks/torch/uresnet2D.py.
+
Update the Forward Declaration
+
Find the forward method.
+
def forward(self, input_tensor):
+
+
Update the argument list to include the loss function, i.e., loss_calculator
+and the image labels, i.e., labels_image.
+
+
+
About SambaTune
+
SambaTune is a tool for profiling, debugging, and tuning the performance of applications
+running on SambaNova (SN) hardware.
+
The tool automates the collection of hardware performance counters, metrics aggregation,
+report generation, and visualization. It also automates benchmarking of the application
+to compute average throughput over a sufficient number of runs. The tool is designed to
+aid the user with performance bottleneck analysis and tuning.
+
SambaTune is currently used by SN engineers involved in performance tuning efforts.
+SambaTune is also planned for release to external customers to aid with performance
+bottleneck analysis and resolution.
+
Run SambaTune
+
ssh ALCFUserID@sambanova.alcf.anl.gov
+# Enter MobilePass+ pass code
+ssh sm-01
+
+
First, enter the virtual environment on sm-01 or sm-02:
+
source /opt/sambaflow/venv/bin/activate
+
+
Update path:
+
export PATH=/opt/sambaflow/bin:$PATH
+
+
Usage
+
usage: sambatune [-h] [--artifact-root ARTIFACT_ROOT] [--disable-override]
+ [--compile-only | -m MODES [MODES ...]] [--version]
+ config
+
+positional arguments:
+ config YAML file with model, compile, run configuration.
+
+optional arguments:
+ -h, --help show this help message and exit
+ --artifact-root ARTIFACT_ROOT
+ Custom location to save compile/run artifacts;
+ defaults to '$DUMP_ROOT/artifact_root' (default: None)
+ --disable-override Reuse the placement from the baseline compilation
+ (default: False)
+ --compile-only Run compilation of PEFs for selected modes only
+ (default: False)
+ -m MODES [MODES ...], --modes MODES [MODES ...]
+ Select modes to execute from ['benchmark',
+ 'instrument', 'run'] (default: ['benchmark'])
+ --version version of sambatune and sambaflow.
+
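For example, an invocation that runs all three modes over a configuration file (the YAML file name here is illustrative; substitute the configuration for your own application):
+
sambatune linear_net.yaml --artifact-root $(pwd)/artifact_root --modes benchmark instrument run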
+
Command Overview
+
By default, sambatune runs with only the benchmark mode enabled. Use the --modes flag to run
+modes individually or in any combination.
+Benchmark-Only:
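The following is an illustrative benchmark-only invocation; as above, the YAML file name is a placeholder for your own configuration:
+
sambatune linear_net.yaml --artifact-root $(pwd)/artifact_root --modes benchmark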
+
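When the profiling runs complete, the results can be viewed with the SambaTune UI. A minimal sketch of launching it on sm-01 or sm-02 is shown below; the report directory path and port number are illustrative. On startup, sambatune_ui prints a username and password to use in your browser:
+
# Replace /path/to/sambatune_gen with the report directory produced by your run.
+sambatune_ui --directory /path/to/sambatune_gen --port 8576
+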
NOTE: The password only works with this one instance of sambatune_ui. If you stop this instance of sambatune_ui and start another instance, it will have a new password.
+
NOTE: You will need to press Ctrl+C or use the kill command to stop sambatune_ui when you have finished.
+Not doing so will tie up the port.
+You can run ps -elf | grep sambatune_ui (or grep for the port you used) to find the running process.
+If you are not comfortable doing this, please ask for help.
+
Use Port-Forwarding
+
This section describes how to set up port-forwarding for applications,
+such as the SambaTune UI, that run on the SambaNova system and bind to one or more ports.
+This example uses 8576 and 18576 as port numbers. Choosing different port numbers may help you
+avoid collisions with other users.
+
From your local machine
+
This command sets up a port forward from the SambaNova login node to your local machine.
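+
A plausible form of the command, assuming the SambaTune UI is listening on port 8576 on sm-01 and that port 18576 is used on the local side (adjust hostnames and ports to your setup):
+
ssh -N -f -L localhost:18576:sm-01:8576 ALCFUserID@sambanova.alcf.anl.gov
+# Enter your MobilePass+ pass code when prompted.
+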
Then navigate your browser to http://localhost:18576 (in this example) on your local machine.
+
Log in with the username and password printed by sambatune_ui on sm-01 (or sm-02).
+
SSH Notes
+
Explanation of ssh command:
+
-N : no remote commands
+
+-f : put ssh in the background
+
+-L <machine1>:<portA>:<machine2>:<portB> :
+
+The full command line will forward <machine1>:<portA> (local scope) to <machine2>:<portB> (remote scope)
+
To install a different version of a package that is already installed in one's environment, one can use:
+
pip install --ignore-installed ...  # or -I
+
+
Pre-Built Sample Venv
+
Each of the sample or example applications provided by SambaNova has its own pre-built virtual environment that can be used directly. These environments are located in the /opt/sambaflow/apps/ directory tree, within each application's directory.
+
+
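As a hypothetical illustration of activating one of these environments (the exact subdirectory layout varies by application, so check the application's directory for the location of its venv):
+
# Hypothetical path; substitute the actual application directory and venv location.
+source /opt/sambaflow/apps/<application>/venv/bin/activate
+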
Note: Conda is not supported on the SambaNova system.