2022-11-09 10:18:31 Starting - Starting the training job...
-2022-11-09 10:18:54 Starting - Preparing the instances for training
-ProfilerReport-1667989110: InProgress
-......
-2022-11-09 10:19:54 Downloading - Downloading input data...
-2022-11-09 10:20:34 Training - Downloading the training image..................................
-==================================
-== Triton Inference Server Base ==
-==================================
-NVIDIA Release 22.08 (build 42766143)
-Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-This container image and its contents are governed by the NVIDIA Deep Learning Container License.
-By pulling and using the container, you accept the terms and conditions of this license:
-https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
-NOTE: CUDA Forward Compatibility mode ENABLED.
- Using CUDA 11.7 driver version 515.65.01 with kernel driver version 510.47.03.
- See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
-2022-11-09 10:27:03,405 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
-2022-11-09 10:27:03,438 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
-2022-11-09 10:27:03,473 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
-2022-11-09 10:27:03,485 sagemaker-training-toolkit INFO Invoking user script
-Training Env:
-{
- "additional_framework_parameters": {},
- "channel_input_dirs": {
- "train": "/opt/ml/input/data/train",
- "valid": "/opt/ml/input/data/valid"
- },
- "current_host": "algo-1",
- "current_instance_group": "homogeneousCluster",
- "current_instance_group_hosts": [
- "algo-1"
- ],
- "current_instance_type": "ml.g4dn.xlarge",
- "distribution_hosts": [],
- "distribution_instance_groups": [],
- "framework_module": null,
- "hosts": [
- "algo-1"
- ],
- "hyperparameters": {
- "batch_size": 1024,
- "epoch": 10
- },
- "input_config_dir": "/opt/ml/input/config",
- "input_data_config": {
- "train": {
- "TrainingInputMode": "File",
- "S3DistributionType": "FullyReplicated",
- "RecordWrapperType": "None"
- },
- "valid": {
- "TrainingInputMode": "File",
- "S3DistributionType": "FullyReplicated",
- "RecordWrapperType": "None"
- }
- },
- "input_dir": "/opt/ml/input",
- "instance_groups": [
- "homogeneousCluster"
- ],
- "instance_groups_dict": {
- "homogeneousCluster": {
- "instance_group_name": "homogeneousCluster",
- "instance_type": "ml.g4dn.xlarge",
- "hosts": [
- "algo-1"
- ]
- }
- },
- "is_hetero": false,
- "is_master": true,
- "is_modelparallel_enabled": null,
- "is_smddpmprun_installed": false,
- "job_name": "sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376",
- "log_level": 20,
- "master_hostname": "algo-1",
- "model_dir": "/opt/ml/model",
- "module_dir": "s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz",
- "module_name": "train",
- "network_interface_name": "eth0",
- "num_cpus": 4,
- "num_gpus": 1,
- "num_neurons": 0,
- "output_data_dir": "/opt/ml/output/data",
- "output_dir": "/opt/ml/output",
- "output_intermediate_dir": "/opt/ml/output/intermediate",
- "resource_config": {
- "current_host": "algo-1",
- "current_instance_type": "ml.g4dn.xlarge",
- "current_group_name": "homogeneousCluster",
- "hosts": [
- "algo-1"
- ],
- "instance_groups": [
- {
- "instance_group_name": "homogeneousCluster",
- "instance_type": "ml.g4dn.xlarge",
- "hosts": [
- "algo-1"
- ]
- }
- ],
- "network_interface_name": "eth0"
- },
- "user_entry_point": "train.py"
-}
-Environment variables:
-SM_HOSTS=["algo-1"]
-SM_NETWORK_INTERFACE_NAME=eth0
-SM_HPS={"batch_size":1024,"epoch":10}
-SM_USER_ENTRY_POINT=train.py
-SM_FRAMEWORK_PARAMS={}
-SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
-SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"valid":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
-SM_OUTPUT_DATA_DIR=/opt/ml/output/data
-SM_CHANNELS=["train","valid"]
-SM_CURRENT_HOST=algo-1
-SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
-SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
-SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
-SM_INSTANCE_GROUPS=["homogeneousCluster"]
-SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
-SM_DISTRIBUTION_INSTANCE_GROUPS=[]
-SM_IS_HETERO=false
-SM_MODULE_NAME=train
-SM_LOG_LEVEL=20
-SM_FRAMEWORK_MODULE=
-SM_INPUT_DIR=/opt/ml/input
-SM_INPUT_CONFIG_DIR=/opt/ml/input/config
-SM_OUTPUT_DIR=/opt/ml/output
-SM_NUM_CPUS=4
-SM_NUM_GPUS=1
-SM_NUM_NEURONS=0
-SM_MODEL_DIR=/opt/ml/model
-SM_MODULE_DIR=s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz
-SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train","valid":"/opt/ml/input/data/valid"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":null,"hosts":["algo-1"],"hyperparameters":{"batch_size":1024,"epoch":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"valid":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"job_name":"sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-843263297212/sagemaker-merlin-tensorflow-2022-11-09-10-18-29-376/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
-SM_USER_ARGS=["--batch_size","1024","--epoch","10"]
-SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
-SM_CHANNEL_TRAIN=/opt/ml/input/data/train
-SM_CHANNEL_VALID=/opt/ml/input/data/valid
-SM_HP_BATCH_SIZE=1024
-SM_HP_EPOCH=10
-PYTHONPATH=/opt/ml/code:/usr/local/bin:/opt/tritonserver:/usr/local/lib/python3.8/dist-packages:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/usr/local/lib/python3.8/dist-packages/faiss-1.7.2-py3.8.egg:/usr/local/lib/python3.8/dist-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg:/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg:/usr/lib/python3/dist-packages
-Invoking script with the following command:
-/usr/bin/python3 train.py --batch_size 1024 --epoch 10
-2022-11-09 10:27:03,486 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
-
-2022-11-09 10:27:16 Training - Training image download completed. Training in progress.
-2022-11-09 10:27:08.761711: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
-2022-11-09 10:27:12.818302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:12.819693: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:12.819906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:12.894084: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
-To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
-2022-11-09 10:27:12.895367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:12.895631: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:12.895807: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:16.651703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:16.651981: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:16.652183: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
-2022-11-09 10:27:16.653025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10752 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
-Workflow saved to /tmp/tmp5fpdavsc/workflow.
-batch_size = 1024, epochs = 10
-Epoch 1/10
-684/684 - 14s - loss: 0.6932 - auc: 0.4998 - regularization_loss: 0.0000e+00 - val_loss: 0.6931 - val_auc: 0.5000 - val_regularization_loss: 0.0000e+00 - 14s/epoch - 20ms/step
-Epoch 2/10
-684/684 - 8s - loss: 0.6931 - auc: 0.5026 - regularization_loss: 0.0000e+00 - val_loss: 0.6932 - val_auc: 0.4990 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
-Epoch 3/10
-684/684 - 7s - loss: 0.6922 - auc: 0.5222 - regularization_loss: 0.0000e+00 - val_loss: 0.6941 - val_auc: 0.4989 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
-Epoch 4/10
-684/684 - 7s - loss: 0.6858 - auc: 0.5509 - regularization_loss: 0.0000e+00 - val_loss: 0.6991 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
-Epoch 5/10
-684/684 - 7s - loss: 0.6790 - auc: 0.5660 - regularization_loss: 0.0000e+00 - val_loss: 0.7052 - val_auc: 0.4993 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
-Epoch 6/10
-684/684 - 8s - loss: 0.6751 - auc: 0.5722 - regularization_loss: 0.0000e+00 - val_loss: 0.7096 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
-Epoch 7/10
-684/684 - 7s - loss: 0.6722 - auc: 0.5755 - regularization_loss: 0.0000e+00 - val_loss: 0.7184 - val_auc: 0.4991 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
-Epoch 8/10
-684/684 - 7s - loss: 0.6700 - auc: 0.5777 - regularization_loss: 0.0000e+00 - val_loss: 0.7289 - val_auc: 0.4990 - val_regularization_loss: 0.0000e+00 - 7s/epoch - 11ms/step
-Epoch 9/10
-684/684 - 8s - loss: 0.6687 - auc: 0.5792 - regularization_loss: 0.0000e+00 - val_loss: 0.7404 - val_auc: 0.4994 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
-Epoch 10/10
-684/684 - 8s - loss: 0.6678 - auc: 0.5801 - regularization_loss: 0.0000e+00 - val_loss: 0.7393 - val_auc: 0.4988 - val_regularization_loss: 0.0000e+00 - 8s/epoch - 11ms/step
-/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
- warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
-/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
- warnings.warn(
-/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
- warnings.warn(
-WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
-INFO:__main__:Model saved to /tmp/tmp5fpdavsc/dlrm.
-Model saved to /tmp/tmp5fpdavsc/dlrm.
-WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
-/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
- warnings.warn(
-/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
- warnings.warn(
-WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
-WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
-WARNING:absl:Found untraced functions such as train_compute_metrics, model_context_layer_call_fn, model_context_layer_call_and_return_conditional_losses, output_layer_layer_call_fn, output_layer_layer_call_and_return_conditional_losses while saving (showing 5 of 97). These functions will not be directly callable after loading.
-Ensemble graph saved to /opt/ml/model.
-INFO:__main__:Ensemble graph saved to /opt/ml/model.
-2022-11-09 10:29:21,498 sagemaker-training-toolkit INFO Reporting training SUCCESS
-
-2022-11-09 10:29:41 Uploading - Uploading generated training model
-2022-11-09 10:29:41 Completed - Training job completed
-Training seconds: 589
-Billable seconds: 589
-