Commit

Merge branch 'argonne-lcf:main' into auroraMPICH_envVars

zippylab authored Mar 5, 2024
2 parents cb4c9a8 + b51f9aa commit 0d97df6
Showing 17 changed files with 345 additions and 36 deletions.
1 change: 1 addition & 0 deletions docs/ai-testbed/cerebras/example-programs.md
@@ -80,6 +80,7 @@ export MODEL_DIR=model_dir_bert_large_pytorch
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.0.3/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`.

The last part of the output should resemble the following; messages about CUDA, which should be ignored, are not shown.

3 changes: 3 additions & 0 deletions docs/ai-testbed/sambanova/example-programs.md
@@ -87,6 +87,7 @@ python lenet.py run --pef="pef/lenet/lenet.pef"
Then

```bash
mkdir -p pef/lenet
sbatch --output=pef/lenet/output.log submit-lenet-job.sh
```
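
The contents of `submit-lenet-job.sh` are not shown in this diff; a minimal sketch of such a submission script, assuming it simply wraps the run command shown above, might look like the following (the `submit-ffn_mnist-job.sh` and `submit-logreg-job.sh` scripts referenced below would follow the same pattern).

```bash
#!/bin/bash
# Hypothetical submit-lenet-job.sh: wraps the run command so that sbatch can
# schedule it; adjust the path to lenet.py for your copy of the examples.
python lenet.py run --pef="pef/lenet/lenet.pef"
```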

@@ -134,6 +135,7 @@ python ffn_mnist.py run -b 1 -p out/ffn_mnist/ffn_mnist.pef
```

```bash
mkdir -p pef/ffn_mnist
sbatch --output=pef/ffn_mnist/output.log submit-ffn_mnist-job.sh
```

@@ -190,6 +192,7 @@ python logreg.py run --pef="pef/logreg/logreg.pef"
Then

```bash
mkdir -p pef/logreg
sbatch --output=pef/logreg/output.log submit-logreg-job.sh
```

2 changes: 1 addition & 1 deletion docs/ai-testbed/sambanova/sambatune.md
@@ -25,7 +25,7 @@ export DUMP_ROOT=~/Sambatune
```

If running a large model, the profiling information can be hundreds of gigabytes or more, and the DUMP_ROOT should be set to some location with more storage than your home directory (which has a quota).<br>
E.g. somewhere that you have write access to in ```/srv/projects```
E.g. somewhere that you have write access to in ```/projects```
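
For example, a hypothetical setting (replace `MyProject` with the name of a project directory you can write to):

```bash
# Point DUMP_ROOT at project space, which has more room than the home-directory quota
export DUMP_ROOT=/projects/MyProject/$(whoami)/sambatune_dumps
```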

Optionally, examine the sample yaml file. You will see that it has 5 top-level sections: `app:`, `model-args:`, `compile-args:`, `run-args:`, `env:`


This file was deleted.

4 changes: 2 additions & 2 deletions docs/aurora/data-management/lustre/gecko.md
@@ -8,7 +8,7 @@ Currently, scp and SFTP are the only ways to transfer data to/from Aurora.

As an expedient for initiating ssh sessions to Aurora login nodes via the bastion indirect nodes, and to enable scp from remote (non-ALCF) hosts to Aurora login nodes, follow these steps:

1. Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on this page.
1. Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on [this page](https://help.cels.anl.gov/docs/linux/ssh/):
2. Add the lines listed below to your ~/.ssh/config file on the remote host. That is, you should do this on your laptop/desktop, from which you are initiating ssh login sessions to Aurora via bastion, and on other non-ALCF host systems from which you want to copy files to Aurora login nodes using scp.

```
@@ -51,4 +51,4 @@ knight@aurora-uan-0009:~> scp [email protected]:/grand/catalyst/proj-s
[Password:
knight@aurora-uan-0009:~> cat test.txt
from_polaris grand
```
```
1 change: 0 additions & 1 deletion docs/aurora/data-science/libraries/onednn.md

This file was deleted.

157 changes: 149 additions & 8 deletions docs/aurora/data-science/libraries/openvino.md
@@ -5,11 +5,11 @@ This page contains build and run instructions for Python and C/C++ examples, but



## Installing OpenVINO
## Installing the OpenVINO Python Runtime and CLI Tools
OpenVINO does not come with the default frameworks module on Aurora, but it can be installed manually within a virtual environment as shown below
```
module use /soft/modulefiles
module load frameworks/2023.10.15.001
module load frameworks/2023.12.15.001
python -m venv --clear /path/to/_ov_env --system-site-packages
source /path/to/_ov_env/bin/activate
pip install openvino==2023.2
@@ -22,7 +22,7 @@ Note that `/path/to/` can either be a user's home or project directory.
To use OpenVINO in the future, simply load the frameworks module and source the virtual environment.
```
module use /soft/modulefiles
module load frameworks/2023.10.15.001
module load frameworks/2023.12.15.001
source /path/to/_ov_env/bin/activate
```

@@ -89,28 +89,169 @@ Note that `benchmark_app` takes a number of additional configuration options as

## Inference with Python OpenVINO API

Inference can be performed by invoking the compiled model directly or by using the OpenVINO Runtime API explicitly.
Inference can be performed by invoking the compiled model directly or by using the OpenVINO Runtime API explicitly to create inference requests.

An example of performing direct inference with the compiled model is shown below.
This leads to compact code, but it performs a single synchronous inference request.
Future calls to the model will reuse the same inference request, and thus will experience less overhead.
Note that the output of the model is a numpy array.
```
import openvino as ov
import openvino.properties.hint as hints
import torch
core = ov.Core()
compiled_model = core.compile_model("resnet50.xml",device_name='GPU.0')
config = {hints.inference_precision: 'f32'}
compiled_model = core.compile_model("resnet50.xml",device_name='GPU.0', config=config)
input_data = torch.rand((1, 3, 224, 224))
results = compiled_model(input_data)[0]
```

The Runtime API can be called explicitly to have more control over the requests.
Note:

* The output of the direct call to the compiled model is a NumPy array
* By default, OpenVINO performs inference with FP16 precision on GPU, therefore the precision type must be specified as a hint during model compilation if FP32 or other precisions are desired.

Other than the direct call to the model, the Runtime API can be used to create inference requests and control their execution.
For this approach we refer the user to the OpenVINO [documentation page](https://docs.openvino.ai/2023.2/openvino_docs_OV_UG_Integrate_OV_with_your_application.html), which clearly outlines the steps involved.
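
As a brief illustration only, a minimal sketch of that explicit approach (assuming the same `resnet50.xml` model used above) might look like this:

```python
import numpy as np
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("resnet50.xml")
config = {hints.inference_precision: "f32"}  # request FP32, as in the example above
compiled_model = core.compile_model(model, device_name="GPU.0", config=config)

# Create an explicit inference request instead of calling the compiled model directly
infer_request = compiled_model.create_infer_request()
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Synchronous inference
infer_request.infer({0: input_data})
results = infer_request.get_output_tensor().data

# Asynchronous inference: start the request, overlap other work, then wait
infer_request.start_async({0: input_data})
infer_request.wait()
results = infer_request.get_output_tensor().data
```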



## Inference with C++ OpenVINO API
This feature is still under testing on Aurora.

Currently, the C++ OpenVINO API on Aurora is enabled through a pre-built set of libraries.
The environment is set up as follows, with `/path/to/openvino` being a placeholder for a directory of the user's choosing
```
module use /soft/modulefiles
module load spack-pe-gcc
module load cmake
export OV_PATH=/path/to/openvino
cp /home/balin/OpenVINO/SLES15.3/openvino-suse.tar.gz $OV_PATH
tar -xzvf $OV_PATH/openvino-suse.tar.gz -C $OV_PATH
source $OV_PATH/openvino/setupvars.sh
# Need to add a path to the libtbb.so.2 library needed by OpenVINO
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/soft/datascience/llm_ds/basekit_2023_0_25537/vtune/2023.0.0/lib64
export ONEAPI_DEVICE_SELECTOR=opencl:gpu
export ZE_AFFINITY_MASK=0.0
```

An example performing inference with the C++ OpenVINO API is shown below.
This simple program loads the ResNet50 model in OpenVINO IR format to the GPU (see instructions above on how to download and convert the model), creates an input vector and offloads it to the GPU with SYCL, and finally executes a single synchronous inference request on the GPU.

```
#include <iostream>
#include <cstdlib>
#include <cstring> // for strcmp
#include <vector>
#include "sycl/sycl.hpp"
#include "openvino/openvino.hpp"
#include "openvino/runtime/intel_gpu/ocl/ocl.hpp"
const int N_BATCH = 1;
const int N_CHANNELS = 3;
const int N_PIXELS = 224;
const int INPUTS_SIZE = N_BATCH*N_CHANNELS*N_PIXELS*N_PIXELS;
int main(int argc, const char* argv[])
{
// Print some information about OpenVINO and start the runtime
std::cout << "Running with " << ov::get_openvino_version() << "\n\n";
ov::Core core;
std::vector<std::string> availableDevices = core.get_available_devices();
char device_str[] = "GPU";
bool found_device = false;
for (auto&& device : availableDevices) {
if (strcmp(device.c_str(),device_str)==0) {
std::cout << "Found device " << device << " \n\n";
found_device = true;
}
}
if (not found_device) {
std::cout << "Input device not found \n";
std::cout << "Available devices are: \n";
for (auto&& device : availableDevices) {
std::cout << device << std::endl;
}
return -1;
}
// Load the model
std::shared_ptr<ov::Model> model = core.read_model("./resnet50.xml");
std::cout << "Loaded model \n\n";
// Create the input data on the host
std::vector<float> inputs(INPUTS_SIZE);
srand(12345);
for (int i=0; i<INPUTS_SIZE; i++) {
inputs[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
}
std::cout << "Generated input data on the host \n\n";
// Move input data to the device with SYCL
sycl::queue Q(sycl::gpu_selector_v, sycl::property::queue::in_order{}); // oneDNN needs in order queues
std::cout << "SYCL running on "
<< Q.get_device().get_info<sycl::info::device::name>()
<< "\n\n";
float *d_inputs = sycl::malloc_device<float>(INPUTS_SIZE, Q);
Q.memcpy((void *) d_inputs, (void *) inputs.data(), INPUTS_SIZE*sizeof(float));
Q.wait();
// Share the SYCL queue and context with the GPU plugin and compile the model
auto queue = sycl::get_native<sycl::backend::opencl>(Q);
auto remote_context = ov::intel_gpu::ocl::ClContext(core, queue);
auto compiled_model = core.compile_model(model, remote_context,
ov::hint::inference_precision("f32"));
// Convert input array to OpenVINO Tensor
ov::element::Type input_type = ov::element::f32;
ov::Shape input_shape = {N_BATCH, N_CHANNELS, N_PIXELS, N_PIXELS};
//ov::Tensor input_tensor = ov::Tensor(input_type, input_shape, d_inputs);
auto input_tensor = remote_context.create_tensor(input_type, input_shape, (void *) d_inputs);
// Run inference
ov::InferRequest infer_request = compiled_model.create_infer_request();
infer_request.set_input_tensor(input_tensor);
infer_request.infer();
std::cout << "Performed inference \n\n";
// Output the predicted Torch tensor
ov::Tensor output_tensor = infer_request.get_output_tensor();
std::cout << "Size of output tensor " << output_tensor.get_shape() << std::endl;
std::cout << "Predicted tensor is : \n";
for (int i=0; i<10; i++) {
std::cout << output_tensor.data<float>()[i] << "\n";
}
std::cout << "\n";
return 0;
}
```

To build the example program, use the `CMakeLists.txt` file below
```
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
project(inference_openvino_sycl_example)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
set(ov_link_libraries openvino::runtime)
add_executable(inference_openvino_sycl inference_openvino_sycl.cpp)
target_link_libraries(inference_openvino_sycl ${ov_link_libraries} -lOpenCL)
set_property(TARGET inference_openvino_sycl PROPERTY CXX_STANDARD 17)
```

and execute
```
cmake -DCMAKE_CXX_FLAGS="-std=c++17 -fsycl" ./
make
./inference_openvino_sycl
```

Note:

* OpenVINO does not currently support the Level Zero backend. OpenCL must be used instead, which can be set on Aurora with `export ONEAPI_DEVICE_SELECTOR=opencl:gpu`
* The [Remote Tensor API](https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API.html) must be used to share the SYCL OpenCL context with OpenVINO



27 changes: 26 additions & 1 deletion docs/aurora/getting-started-on-aurora.md
@@ -4,7 +4,32 @@

*** ACCESS IS CURRENTLY ENABLED FOR ESP and ECP TEAMS ONLY ***

The prerequisites for Sunspot are applicable to Aurora as well. See this [page](https://www.alcf.anl.gov/support-center/aurorasunspot/getting-started-sunspot#pre-req) for more information.

## How to Get Access to Aurora (for New Users)

### If You Already Have Access to Sunspot

If you already have access to Sunspot, all you need to do to gain access to Aurora is send an email to [email protected] requesting access to Aurora. In your email, include

* Your ALCF username
* Your institutional email address
* The ESP or ECP project in which you are a member

### For Aurora Early Science Program (ESP) Team Members

If you have never had access to Sunspot, here are the steps to gain access to Aurora:

1. Verify that your institution has signed a CNDA with Intel that covers you.
2. If you do not have an active ALCF account, request one using the [ALCF Account request webpage](https://accounts.alcf.anl.gov/#/accountRequest). When you come to the part about joining a project, request the `ProjectName_aesp_CNDA` project.
3. Acknowledge the Intel Terms of Use agreement (TOU) for the Aurora Software Development Kit (SDK) by submitting [this form](https://events.cels.anl.gov/event/147/surveys/7).

Getting a new ALCF account typically takes anywhere from a few days to a few weeks (processing new access for foreign nationals is what can take weeks). After you acknowledge the TOU, there is a manual step that typically takes a few days. You will receive an email notifying you when Aurora access is granted, including some getting started instructions.

### For Aurora Exascale Computing Project (ECP) Team Members

See this [page](https://www.alcf.anl.gov/support-center/aurorasunspot/getting-started-sunspot#pre-req) for instructions.

## Caveats About Using Aurora and Reporting Findings

NOTE: Sharing of any results from Aurora publicly no longer requires a review or approval from Intel. However, anyone publishing these results should include the following in their materials:

29 changes: 26 additions & 3 deletions docs/aurora/known-issues.md
@@ -24,13 +24,36 @@ The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues p

## Submitting Jobs

Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, then one thing to check is the `comment` field in the full job information for the job using the command `qstat -xf [JOBID]`.
Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, then one thing to check is the `comment` field in the full job information for the job using the command `qstat -xfw [JOBID] | grep comment`. Some example comments follow.

1. In the event that you find your job placed on hold, you may find the message `comment = job held, too many failed attempts to run`. This does not indicate a problem with your script, but indicates PBS made several attempts to find a set of nodes to run your job and was not able to. Users are encouraged to delete the held job and try resubmitting.
```
comment = Job held by [USER] on Tue Feb 6 05:20:00 2024 and terminated
```
The user has placed the job on hold; the user can `qrls` the job when ready for it to be queued again.


```
comment = Not Running: Queue not started. and terminated
```

The user has submitted to a queue that is not currently running; the user should `qmove` the job to an appropriate queue.

```
comment = job held, too many failed attempts to run
```

The job tried and failed to start. In this scenario, the user should find that their job was placed on hold. This does not indicate a problem with the user's job script, but indicates PBS made several attempts to find a set of nodes to run the job and was not able to. Users can `qdel` the job and resubmit, or `qrls` the job to try running it again.

```
comment = Not Running: Node is in an ineligible state: down and terminated
```

2. In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until checks pass.
An insufficient number of nodes are online and free for the job to start.

In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until checks pass.

To increase the chances that a large job does not terminate due to a node failure, you may choose to interactively route your MPI job around nodes that fail during your run. See this page on [Working Around Node Failures](https://docs.alcf.anl.gov/aurora/running-jobs-aurora/#working-around-node-failures) for more information.
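
For reference, the PBS commands mentioned above can be invoked as follows (shown with a hypothetical job ID `123456` and a placeholder queue name):

```bash
# Inspect why a job is held or not running
qstat -xfw 123456 | grep comment

# Release a user hold so the job becomes eligible to run again
qrls 123456

# Move the job to a different queue (replace <queue> with an appropriate queue)
qmove <queue> 123456

# Delete the job in order to resubmit it
qdel 123456
```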

## Other issues

* Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; waiting for I/O to complete is recommended, and I/O should complete without failure. Jobs requiring significant filesystem performance must be avoided at this time.
3 changes: 0 additions & 3 deletions docs/aurora/programming-models/compatibility-tool.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/aurora/programming-models/heterogeneous-models.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/aurora/programming-models/one-api.md

This file was deleted.
