2022-11-09 10:18:31 Starting - Starting the training job...
+2022-11-09 10:18:54 Starting - Preparing the instances for training
+ProfilerReport-1667989110: InProgress
+......
+2022-11-09 10:19:54 Downloading - Downloading input data...
+2022-11-09 10:20:34 Training - Downloading the training image..................................
+==================================
+== Triton Inference Server Base ==
+==================================
+NVIDIA Release 22.08 (build 42766143)
+Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+This container image and its contents are governed by the NVIDIA Deep Learning Container License.
+By pulling and using the container, you accept the terms and conditions of this license:
+https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
+NOTE: CUDA Forward Compatibility mode ENABLED.
+ Using CUDA 11.7 driver version 515.65.01 with kernel driver version 510.47.03.
+ See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
+2022-11-09 10:27:03,405 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
+2022-11-09 10:27:03,438 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
+2022-11-09 10:27:03,473 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
+2022-11-09 10:27:03,485 sagemaker-training-toolkit INFO Invoking user script
+Training Env:
+{
+ "additional_framework_parameters": {},
+ "channel_input_dirs": {
+ "train": "/opt/ml/input/data/train",
+ "valid": "/opt/ml/input/data/valid"
+ },
+ "current_host": "algo-1",
+ "current_instance_group": "homogeneousCluster",
+ "current_instance_group_hosts": [
+ "algo-1"
+ ],
+ "current_instance_type": "ml.g4dn.xlarge",
+ "distribution_hosts": [],
+ "distribution_instance_groups": [],
+ "framework_module": null,
+ "hosts": [
+ "algo-1"
+ ],
+ "hyperparameters": {
+ "batch_size": 1024,
+ "epoch": 10
+ },
+ "input_config_dir": "/opt/ml/input/config",
+ "input_data_config": {
+ "train": {
+ "TrainingInputMode": "File",
+ "S3DistributionType": "FullyReplicated",
+ "RecordWrapperType": "None"
+ },
+ "valid": {
+ "TrainingInputMode": "File",
+ "S3DistributionType": "FullyReplicated",
+ "RecordWrapperType": "None"
+ }
+ },
+ "input_dir": "/opt/ml/input",
+ "instance_groups": [
+ "homogeneousCluster"
+ ],
+ "instance_groups_dict": {
+ "homogeneousCluster": {
+ "instance_group_name": "homogeneousCluster",
+ "instance_type": "ml.g4dn.xlarge",
+ "hosts": [
+ "algo-1"
+ ]
+ }
+ },
+ "is_hetero": false,
+ "is_master": true,
+ "is_modelparallel_enabled": null,
+ "is_smddpmprun_installed": false,
+ "job_name": "sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376",
+ "log_level": 20,
+ "master_hostname": "algo-1",
+ "model_dir": "/opt/ml/model",
+ "module_dir": "s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz",
+ "module_name": "train",
+ "network_interface_name": "eth0",
+ "num_cpus": 4,
+ "num_gpus": 1,
+ "num_neurons": 0,
+ "output_data_dir": "/opt/ml/output/data",
+ "output_dir": "/opt/ml/output",
+ "output_intermediate_dir": "/opt/ml/output/intermediate",
+ "resource_config": {
+ "current_host": "algo-1",
+ "current_instance_type": "ml.g4dn.xlarge",
+ "current_group_name": "homogeneousCluster",
+ "hosts": [
+ "algo-1"
+ ],
+ "instance_groups": [
+ {
+ "instance_group_name": "homogeneousCluster",
+ "instance_type": "ml.g4dn.xlarge",
+ "hosts": [
+ "algo-1"
+ ]
+ }
+ ],
+ "network_interface_name": "eth0"
+ },
+ "user_entry_point": "train.py"
+}
+Environment variables:
+SM_HOSTS=["algo-1"]
+SM_NETWORK_INTERFACE_NAME=eth0
+SM_HPS={"batch_size":1024,"epoch":10}
+SM_USER_ENTRY_POINT=train.py
+SM_FRAMEWORK_PARAMS={}
+SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
+SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"valid":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
+SM_OUTPUT_DATA_DIR=/opt/ml/output/data
+SM_CHANNELS=["train","valid"]
+SM_CURRENT_HOST=algo-1
+SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
+SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
+SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
+SM_INSTANCE_GROUPS=["homogeneousCluster"]
+SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
+SM_DISTRIBUTION_INSTANCE_GROUPS=[]
+SM_IS_HETERO=false
+SM_MODULE_NAME=train
+SM_LOG_LEVEL=20
+SM_FRAMEWORK_MODULE=
+SM_INPUT_DIR=/opt/ml/input
+SM_INPUT_CONFIG_DIR=/opt/ml/input/config
+SM_OUTPUT_DIR=/opt/ml/output
+SM_NUM_CPUS=4
+SM_NUM_GPUS=1
+SM_NUM_NEURONS=0
+SM_MODEL_DIR=/opt/ml/model
+SM_MODULE_DIR=s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz
+SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train","valid":"/opt/ml/input/data/valid"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":null,"hosts":["algo-1"],"hyperparameters":{"batch_size":1024,"epoch":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"valid":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"job_name":"sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
+SM_USER_ARGS=["--batch_size","1024","--epoch","10"]
+SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
+SM_CHANNEL_TRAIN=/opt/ml/input/data/train
+SM_CHANNEL_VALID=/opt/ml/input/data/valid
+SM_HP_BATCH_SIZE=1024
+SM_HP_EPOCH=10
+PYTHONPATH=/opt/ml/code:/usr/local/bin:/opt/tritonserver:/usr/local/lib/python3.8/dist-packages:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/usr/local/lib/python3.8/dist-packages/faiss-1.7.2-py3.8.egg:/usr/local/lib/python3.8/dist-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg:/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg:/usr/lib/python3/dist-packages
+Invoking script with the following command:
+/usr/bin/python3 train.py --batch_size 1024 --epoch 10
+2022-11-09 10:27:03,486 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
+
+2022-11-09 10:27:16 Training - Training image download completed. Training in progress.
+2022-11-09 10:27:08.761711: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
+2022-11-09 10:27:12.818302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:12.819693: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:12.819906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:12.894084: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
+To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
+2022-11-09 10:27:12.895367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:12.895631: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:12.895807: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:16.651703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:16.651981: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:16.652183: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2022-11-09 10:27:16.653025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10752 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
+Workflow saved to /tmp/tmp5fpdavsc/workflow.
+batch_size = 1024, epochs = 10
+Epoch 1/10
+684/684 - 14s - loss: 0.6932 - auc: 0.4998 - regularization_loss: 0.0000e+00 - val_loss: 0.6931 - val_auc: 0.5000 - val_regularization_loss: 0.0000e+00 - 14s/epoch - 20ms/step
+Epoch 2/10
+684/684 - 8s - loss: 0.6931 - auc: 0.5026 - regularization_loss: 0.0000e+00 - val_loss: 0.6932 - val_auc: 0.4990 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
+Epoch 3/10
+684/684 - 7s - loss: 0.6922 - auc: 0.5222 - regularization_loss: 0.0000e+00 - val_loss: 0.6941 - val_auc: 0.4989 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
+Epoch 4/10
+684/684 - 7s - loss: 0.6858 - auc: 0.5509 - regularization_loss: 0.0000e+00 - val_loss: 0.6991 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
+Epoch 5/10
+684/684 - 7s - loss: 0.6790 - auc: 0.5660 - regularization_loss: 0.0000e+00 - val_loss: 0.7052 - val_auc: 0.4993 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
+Epoch 6/10
+684/684 - 8s - loss: 0.6751 - auc: 0.5722 - regularization_loss: 0.0000e+00 - val_loss: 0.7096 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
+Epoch 7/10
+684/684 - 7s - loss: 0.6722 - auc: 0.5755 - regularization_loss: 0.0000e+00 - val_loss: 0.7184 - val_auc: 0.4991 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
+Epoch 8/10
+684/684 - 7s - loss: 0.6700 - auc: 0.5777 - regularization_loss: 0.0000e+00 - val_loss: 0.7289 - val_auc: 0.4990 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
+Epoch 9/10
+684/684 - 8s - loss: 0.6687 - auc: 0.5792 - regularization_loss: 0.0000e+00 - val_loss: 0.7404 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
+Epoch 10/10
+684/684 - 8s - loss: 0.6678 - auc: 0.5801 - regularization_loss: 0.0000e+00 - val_loss: 0.7393 - val_auc: 0.4988 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
+/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
+ warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
+/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
+ warnings.warn(
+/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
+ warnings.warn(
+WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
+INFO:__main__:Model saved to /tmp/tmp5fpdavsc/dlrm.
+Model saved to /tmp/tmp5fpdavsc/dlrm.
+WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
+/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
+ warnings.warn(
+/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
+ warnings.warn(
+WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
+WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
+WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
+Ensemble graph saved to /opt/ml/model.
+INFO:__main__:Ensemble graph saved to /opt/ml/model.
+2022-11-09 10:29:21,498 sagemaker-training-toolkit INFO Reporting training SUCCESS
+
+2022-11-09 10:29:41 Uploading - Uploading generated training model
+2022-11-09 10:29:41 Completed - Training job completed
+Training seconds: 589
+Billable seconds: 589
+