Commit

Merge branch 'argonne-lcf:main' into auroraMPICH_envVars

zippylab authored Mar 5, 2024
2 parents cb4c9a8 + b51f9aa commit 0d97df6
Showing 17 changed files with 345 additions and 36 deletions.
1 change: 1 addition & 0 deletions docs/ai-testbed/cerebras/example-programs.md
@@ -80,6 +80,7 @@ export MODEL_DIR=model_dir_bert_large_pytorch
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.0.3/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`.

The last part of the output should resemble the following; messages about CUDA, which should be ignored, are not shown.

3 changes: 3 additions & 0 deletions docs/ai-testbed/sambanova/example-programs.md
@@ -87,6 +87,7 @@ python lenet.py run --pef="pef/lenet/lenet.pef"
Then

```bash
mkdir -p pef/lenet
sbatch --output=pef/lenet/output.log submit-lenet-job.sh
```
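
The contents of `submit-lenet-job.sh` are not shown in this diff; a minimal sketch of such a submission script, assuming it simply wraps the run command shown above, might look like the following (the `submit-ffn_mnist-job.sh` and `submit-logreg-job.sh` scripts referenced below would follow the same pattern).

```bash
#!/bin/bash
# Hypothetical submit-lenet-job.sh: wraps the run command so that sbatch can
# schedule it; adjust the path to lenet.py for your copy of the examples.
python lenet.py run --pef="pef/lenet/lenet.pef"
```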

@@ -134,6 +135,7 @@ python ffn_mnist.py run -b 1 -p out/ffn_mnist/ffn_mnist.pef
```

```bash
mkdir -p pef/ffn_mnist
sbatch --output=pef/ffn_mnist/output.log submit-ffn_mnist-job.sh
```

@@ -190,6 +192,7 @@ python logreg.py run --pef="pef/logreg/logreg.pef"
Then

```bash
mkdir -p pef/logreg
sbatch --output=pef/logreg/output.log submit-logreg-job.sh
```

2 changes: 1 addition & 1 deletion docs/ai-testbed/sambanova/sambatune.md
@@ -25,7 +25,7 @@ export DUMP_ROOT=~/Sambatune
```

If running a large model, the profiling information can be hundreds of gigabytes or more, and the DUMP_ROOT should be set to some location with more storage than your home directory (which has a quota).<br>
E.g. somewhere that you have write access to in ```/srv/projects```
E.g. somewhere that you have write access to in ```/projects```
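
For example, a hypothetical setting (replace `MyProject` with the name of a project directory you can write to):

```bash
# Point DUMP_ROOT at project space, which has more room than the home-directory quota
export DUMP_ROOT=/projects/MyProject/$(whoami)/sambatune_dumps
```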

Optionally, examine the sample yaml file. You will see that it has 5 top-level sections: `app:`, `model-args:`, `compile-args:`, `run-args:`, `env:`


This file was deleted.

4 changes: 2 additions & 2 deletions docs/aurora/data-management/lustre/gecko.md
@@ -8,7 +8,7 @@ Currently, scp and SFTP are the only ways to transfer data to/from Aurora.

As an expedient for initiating ssh sessions to Aurora login nodes via the bastion indirect nodes, and to enable scp from remote (non-ALCF) hosts to Aurora login nodes, follow these steps:

1. Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on this page.
1. Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on [this page](https://help.cels.anl.gov/docs/linux/ssh/):
2. Add the lines listed below to your ~/.ssh/config file on the remote host. That is, you should do this on your laptop/desktop, from which you are initiating ssh login sessions to Aurora via bastion, and on other non-ALCF host systems from which you want to copy files to Aurora login nodes using scp.

```
@@ -51,4 +51,4 @@ knight@aurora-uan-0009:~> scp [email protected]:/grand/catalyst/proj-s
[Password:
knight@aurora-uan-0009:~> cat test.txt
from_polaris grand
```
```
1 change: 0 additions & 1 deletion docs/aurora/data-science/libraries/onednn.md

This file was deleted.

157 changes: 149 additions & 8 deletions docs/aurora/data-science/libraries/openvino.md
@@ -5,11 +5,11 @@ This page contains build and run instructions for Python and C/C++ examples, but



## Installing OpenVINO
## Installing the OpenVINO Python Runtime and CLI Tools
OpenVINO does not come with the default frameworks module on Aurora, but it can be installed manually within a virtual environment as shown below
```
module use /soft/modulefiles
module load frameworks/2023.10.15.001
module load frameworks/2023.12.15.001
python -m venv --clear /path/to/_ov_env --system-site-packages
source /path/to/_ov_env/bin/activate
pip install openvino==2023.2
@@ -22,7 +22,7 @@ Note that `/path/to/` can either be a user's home or project directory.
To use OpenVINO in the future, simply load the frameworks module and source the virtual environment.
```
module use /soft/modulefiles
module load frameworks/2023.10.15.001
module load frameworks/2023.12.15.001
source /path/to/_ov_env/bin/activate
```

@@ -89,28 +89,169 @@ Note that `benchmark_app` takes a number of additional configuration options as

## Inference with Python OpenVINO API

Inference can be performed by invoking the compiled model directly or by using the OpenVINO Runtime API explicitly.
Inference can be performed by invoking the compiled model directly or by using the OpenVINO Runtime API explicitly to create inference requests.

An example of performing direct inference with the compiled model is shown below.
This leads to compact code, but it performs a single synchronous inference request.
Future calls to the model will reuse the same inference request, and thus will experience less overhead.
Note that the output of the model is a numpy array.
```
import openvino as ov
import openvino.properties.hint as hints
import torch
core = ov.Core()
compiled_model = core.compile_model("resnet50.xml",device_name='GPU.0')
config = {hints.inference_precision: 'f32'}
compiled_model = core.compile_model("resnet50.xml",device_name='GPU.0', config=config)
input_data = torch.rand((1, 3, 224, 224))
results = compiled_model(input_data)[0]
```

The Runtime API can be called explicitly to have more control over the requests.
Note:

* The output of the direct call to the compiled model is a NumPy array
* By default, OpenVINO performs inference with FP16 precision on GPU, therefore the precision type must be specified as a hint during model compilation if FP32 or other precisions are desired.

Other than the direct call to the model, the Runtime API can be used to create inference requests and control their execution.
For this approach we refer the user to the OpenVINO [documentation page](https://docs.openvino.ai/2023.2/openvino_docs_OV_UG_Integrate_OV_with_your_application.html), which clearly outlines the steps involved.
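
As a brief illustration only, a minimal sketch of that explicit approach (assuming the same `resnet50.xml` model used above) might look like this:

```python
import numpy as np
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("resnet50.xml")
config = {hints.inference_precision: "f32"}  # request FP32, as in the example above
compiled_model = core.compile_model(model, device_name="GPU.0", config=config)

# Create an explicit inference request instead of calling the compiled model directly
infer_request = compiled_model.create_infer_request()
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Synchronous inference
infer_request.infer({0: input_data})
results = infer_request.get_output_tensor().data

# Asynchronous inference: start the request, overlap other work, then wait
infer_request.start_async({0: input_data})
infer_request.wait()
results = infer_request.get_output_tensor().data
```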



## Inference with C++ OpenVINO API
This feature is still under testing on Aurora.

Currently, the C++ OpenVINO API on Aurora is enabled through a pre-built set of libraries.
The environment is set up as follows, with `/path/to/openvino` being a placeholder for a directory of the user's choosing
```
module use /soft/modulefiles
module load spack-pe-gcc
module load cmake
export OV_PATH=/path/to/openvino
cp /home/balin/OpenVINO/SLES15.3/openvino-suse.tar.gz $OV_PATH
tar -xzvf $OV_PATH/openvino-suse.tar.gz -C $OV_PATH
source $OV_PATH/openvino/setupvars.sh
# Need to add a path to the libtbb.so.2 library needed by OpenVINO
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/soft/datascience/llm_ds/basekit_2023_0_25537/vtune/2023.0.0/lib64
export ONEAPI_DEVICE_SELECTOR=opencl:gpu
export ZE_AFFINITY_MASK=0.0
```

An example performing inference with the C++ OpenVINO API is shown below.
This simple program loads the ResNet50 model in OpenVINO IR format to the GPU (see instructions above on how to download and convert the model), creates an input vector and offloads it to the GPU with SYCL, and finally executes a single synchronous inference request on the GPU.

```
#include <iostream>
#include <cstdlib>
#include <cstring> // for strcmp
#include <vector>
#include "sycl/sycl.hpp"
#include "openvino/openvino.hpp"
#include "openvino/runtime/intel_gpu/ocl/ocl.hpp"
const int N_BATCH = 1;
const int N_CHANNELS = 3;
const int N_PIXELS = 224;
const int INPUTS_SIZE = N_BATCH*N_CHANNELS*N_PIXELS*N_PIXELS;
int main(int argc, const char* argv[])
{
// Print some information about OpenVINO and start the runtime
std::cout << "Running with " << ov::get_openvino_version() << "\n\n";
ov::Core core;
std::vector<std::string> availableDevices = core.get_available_devices();
char device_str[] = "GPU";
bool found_device = false;
for (auto&& device : availableDevices) {
if (strcmp(device.c_str(),device_str)==0) {
std::cout << "Found device " << device << " \n\n";
found_device = true;
}
}
if (not found_device) {
std::cout << "Input device not found \n";
std::cout << "Available devices are: \n";
for (auto&& device : availableDevices) {
std::cout << device << std::endl;
}
return -1;
}
// Load the model
std::shared_ptr<ov::Model> model = core.read_model("./resnet50.xml");
std::cout << "Loaded model \n\n";
// Create the input data on the host
std::vector<float> inputs(INPUTS_SIZE);
srand(12345);
for (int i=0; i<INPUTS_SIZE; i++) {
inputs[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
}
std::cout << "Generated input data on the host \n\n";
// Move input data to the device with SYCL
sycl::queue Q(sycl::gpu_selector_v, sycl::property::queue::in_order{}); // oneDNN needs in order queues
std::cout << "SYCL running on "
<< Q.get_device().get_info<sycl::info::device::name>()
<< "\n\n";
float *d_inputs = sycl::malloc_device<float>(INPUTS_SIZE, Q);
Q.memcpy((void *) d_inputs, (void *) inputs.data(), INPUTS_SIZE*sizeof(float));
Q.wait();
// Share the SYCL queue and context with the GPU plugin and compile the model
auto queue = sycl::get_native<sycl::backend::opencl>(Q);
auto remote_context = ov::intel_gpu::ocl::ClContext(core, queue);
auto compiled_model = core.compile_model(model, remote_context,
ov::hint::inference_precision("f32"));
// Convert input array to OpenVINO Tensor
ov::element::Type input_type = ov::element::f32;
ov::Shape input_shape = {N_BATCH, N_CHANNELS, N_PIXELS, N_PIXELS};
//ov::Tensor input_tensor = ov::Tensor(input_type, input_shape, d_inputs);
auto input_tensor = remote_context.create_tensor(input_type, input_shape, (void *) d_inputs);
// Run inference
ov::InferRequest infer_request = compiled_model.create_infer_request();
infer_request.set_input_tensor(input_tensor);
infer_request.infer();
std::cout << "Performed inference \n\n";
// Output the predicted Torch tensor
ov::Tensor output_tensor = infer_request.get_output_tensor();
std::cout << "Size of output tensor " << output_tensor.get_shape() << std::endl;
std::cout << "Predicted tensor is : \n";
for (int i=0; i<10; i++) {
std::cout << output_tensor.data<float>()[i] << "\n";
}
std::cout << "\n";
return 0;
}
```

To build the example program, use the `CMakeLists.txt` file below
```
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
project(inference_openvino_sycl_example)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
set(ov_link_libraries openvino::runtime)
add_executable(inference_openvino_sycl inference_openvino_sycl.cpp)
target_link_libraries(inference_openvino_sycl ${ov_link_libraries} -lOpenCL)
set_property(TARGET inference_openvino_sycl PROPERTY CXX_STANDARD 17)
```

and execute
```
cmake -DCMAKE_CXX_FLAGS="-std=c++17 -fsycl" ./
make
./inference_openvino_sycl
```

Note:

* OpenVINO does not currently support the Level Zero backend. OpenCL must be used instead, which can be set on Aurora with `export ONEAPI_DEVICE_SELECTOR=opencl:gpu`
* The [Remote Tensor API](https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API.html) must be used to share the SYCL OpenCL context with OpenVINO



27 changes: 26 additions & 1 deletion docs/aurora/getting-started-on-aurora.md
@@ -4,7 +4,32 @@

*** ACCESS IS CURRENTLY ENABLED FOR ESP and ECP TEAMS ONLY ***

The prerequisites for Sunspot are applicable to Aurora as well. See this [page](https://www.alcf.anl.gov/support-center/aurorasunspot/getting-started-sunspot#pre-req) for more information.

## How to Get Access to Aurora (for New Users)

### If You Already Have Access to Sunspot

If you already have access to Sunspot, all you need to do to gain access to Aurora is send an email to [email protected] requesting access to Aurora. In your email, include

* Your ALCF username
* Your institutional email address
* The ESP or ECP project in which you are a member

### For Aurora Early Science Program (ESP) Team Members

If you have never had access to Sunspot, here are the steps to gain access to Aurora:

1. Verify that your institution has signed a CNDA with Intel that covers you.
2. If you do not have an active ALCF account, request one using the [ALCF Account request webpage](https://accounts.alcf.anl.gov/#/accountRequest). When you come to the part about joining a project, request the `ProjectName_aesp_CNDA` project.
3. Acknowledge the Intel Terms of Use agreement (TOU) for the Aurora Software Development Kit (SDK) by submitting [this form](https://events.cels.anl.gov/event/147/surveys/7).

Getting a new ALCF account typically takes anywhere from a few days to a few weeks (processing new access for foreign nationals is what can take weeks). After you acknowledge the TOU, there is a manual step that typically takes a few days. You will receive an email notifying you when Aurora access is granted, including some getting started instructions.

### For Aurora Exascale Computing Project (ECP) Team Members

See this [page](https://www.alcf.anl.gov/support-center/aurorasunspot/getting-started-sunspot#pre-req) for instructions.

## Caveats About Using Aurora and Reporting Findings

NOTE: Sharing of any results from Aurora publicly no longer requires a review or approval from Intel. However, anyone publishing these results should include the following in their materials:

29 changes: 26 additions & 3 deletions docs/aurora/known-issues.md
@@ -24,13 +24,36 @@ The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues p

## Submitting Jobs

Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, then one thing to check is the `comment` field in the full job information for the job using the command `qstat -xf [JOBID]`.
Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, then one thing to check is the `comment` field in the full job information for the job using the command `qstat -xfw [JOBID] | grep comment`. Some example comments follow.

1. In the event that you find your job placed on hold, you may find the message `comment = job held, too many failed attempts to run`. This does not indicate a problem with your script, but indicates PBS made several attempts to find a set of nodes to run your job and was not able to. Users are encouraged to delete the held job and try resubmitting.
```
comment = Job held by [USER] on Tue Feb 6 05:20:00 2024 and terminated
```
The user has placed the job on hold; the user can `qrls` the job when ready for it to be queued again.


```
comment = Not Running: Queue not started. and terminated
```

The user has submitted to a queue that is not currently running; the user should `qmove` the job to an appropriate queue.

```
comment = job held, too many failed attempts to run
```

The job tried and failed to start. In this scenario, the user should find that their job was placed on hold. This does not indicate a problem with the user's job script, but indicates PBS made several attempts to find a set of nodes to run the job and was not able to. Users can `qdel` the job and resubmit, or `qrls` the job to try running it again.

```
comment = Not Running: Node is in an ineligible state: down and terminated
```

2. In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until checks pass.
An insufficient number of nodes are online and free for the job to start.

In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until checks pass.

To increase the chances that a large job does not terminate due to a node failure, you may choose to interactively route your MPI job around nodes that fail during your run. See this page on [Working Around Node Failures](https://docs.alcf.anl.gov/aurora/running-jobs-aurora/#working-around-node-failures) for more information.
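
For reference, the PBS commands mentioned above can be invoked as follows (shown with a hypothetical job ID `123456` and a placeholder queue name):

```bash
# Inspect why a job is held or not running
qstat -xfw 123456 | grep comment

# Release a user hold so the job becomes eligible to run again
qrls 123456

# Move the job to a different queue (replace <queue> with an appropriate queue)
qmove <queue> 123456

# Delete the job in order to resubmit it
qdel 123456
```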

## Other issues

* Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; waiting for I/O to complete is recommended, and I/O should complete without failure. Jobs requiring significant filesystem performance must be avoided at this time.
3 changes: 0 additions & 3 deletions docs/aurora/programming-models/compatibility-tool.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/aurora/programming-models/heterogeneous-models.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/aurora/programming-models/one-api.md

This file was deleted.
