Skip to content
This repository has been archived by the owner on Jul 19, 2024. It is now read-only.

"MNIST handwritten digit classification using TensorFlow" failed due to "tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model" #7

Open
ggsato opened this issue May 20, 2018 · 2 comments

Comments

@ggsato
Copy link

ggsato commented May 20, 2018

Hi,

I tried to follow Quick Start to run a sample model service on my edge device, but failed as follows.

$ az ml service logs realtime -i my-mnist
2018-05-20 21:55:11,845 CRIT Supervisor running as root (no user in config file)
2018-05-20 21:55:11,847 INFO supervisord started with pid 1
2018-05-20 21:55:12,849 INFO spawned: 'rsyslog' with pid 9
2018-05-20 21:55:12,851 INFO spawned: 'program_exit' with pid 10
2018-05-20 21:55:12,853 INFO spawned: 'nginx' with pid 11
2018-05-20 21:55:12,854 INFO spawned: 'iot' with pid 12
2018-05-20 21:55:12,856 INFO spawned: 'gunicorn' with pid 13
2018-05-20 21:55:12,868 INFO success: iot entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-20 21:55:12,931 INFO exited: iot (exit status 1; expected)
2018-05-20 21:55:13,932 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-20 21:55:13,932 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-20 21:55:17,937 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2018-05-20T21:55:18.305553Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-20T21:55:18.307199Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (13)
2018-05-20T21:55:18.307539Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-20T21:55:18.308591Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-20T21:55:18.309772Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 31
Initializing logger
2018-05-20T21:55:18.773498Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-20T21:55:18.773791Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-20T21:55:18.773942Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-20T21:55:18.774278Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-20 21:55:22.053772: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053811: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053835: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053845: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:22.053855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20T21:55:22.976338Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-20T21:55:22.977815Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
  File "/var/azureml-app/aml_blueprint.py", line 109, in register
    main.init()
  File "/var/azureml-app/main.py", line 41, in init
    user_init_func()
  File "/var/azureml-app/webservice_driver.py", line 9, in init
    saver.restore(sess,tf.train.latest_checkpoint('./'))
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
    compat.as_bytes(filename), status)]
  File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-20T21:55:22.978319Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 31)
2018-05-20T21:55:23.243101Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-20T21:55:23.243485Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-20 21:55:23,271 INFO exited: gunicorn (exit status 3; not expected)
2018-05-20 21:55:24,274 INFO spawned: 'gunicorn' with pid 41
2018-05-20T21:55:29.717610Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-20T21:55:29.718823Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (41)
2018-05-20T21:55:29.719061Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-20T21:55:29.719820Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-20T21:55:29.720920Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 45
Initializing logger
2018-05-20T21:55:30.045352Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-20T21:55:30.045671Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-20T21:55:30.046003Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-20T21:55:30.046263Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-20 21:55:31.715341: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715390: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715402: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715412: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20 21:55:31.715422: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-20T21:55:32.620562Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-20T21:55:32.621796Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
  File "/var/azureml-app/aml_blueprint.py", line 109, in register
    main.init()
  File "/var/azureml-app/main.py", line 41, in init
    user_init_func()
  File "/var/azureml-app/webservice_driver.py", line 9, in init
    saver.restore(sess,tf.train.latest_checkpoint('./'))
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
    compat.as_bytes(filename), status)]
  File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-20T21:55:32.622017Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 45)
2018-05-20T21:55:32.873411Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-20T21:55:32.873754Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-20 21:55:32,905 INFO exited: gunicorn (exit status 3; not expected)
2018-05-20 21:55:33,906 INFO gave up: gunicorn entered FATAL state, too many start retries too quickly
2018-05-20 21:55:34,908 WARN program_exit: bad result line: 'Killing supervisor with this event: ver:3.0 server:supervisor serial:0 pool:program_exit poolserial:0 eventname:PROCESS_STATE_FATAL len:58'
2018-05-20 21:55:34,908 WARN program_exit: has entered the UNKNOWN state and will no longer receive events, this usually indicates the process violated the eventlistener protocol
2018-05-20 21:55:34,908 WARN received SIGQUIT indicating exit request
2018-05-20 21:55:34,908 INFO waiting for nginx, rsyslog, program_exit to die
2018-05-20 21:55:35,913 INFO stopped: nginx (exit status 0)
2018-05-20 21:55:35,914 INFO stopped: program_exit (terminated by SIGTERM)
2018-05-20 21:55:35,915 INFO stopped: rsyslog (exit status 0)

Received 64 lines of log

Is there anything missing that I should have done?

I am not sure if this has anything to do with this issue, but I choose the file name my_ConvNet_MNIST_model.data-00000-of-00001, as instructed in 2, which is different from my_ConvNet_MNIST_model.data-00001-of-00000 in the az ml service create command.

Thanks in advance.

@girishkumarbk
Copy link

I see the same issue as below. Can the author put some light on this issue ?

$ az ml service logs realtime -i mnist-model
2018-05-29 06:07:05,676 CRIT Supervisor running as root (no user in config file)
2018-05-29 06:07:05,679 INFO supervisord started with pid 1
2018-05-29 06:07:06,681 INFO spawned: 'rsyslog' with pid 9
2018-05-29 06:07:06,683 INFO spawned: 'program_exit' with pid 10
2018-05-29 06:07:06,685 INFO spawned: 'nginx' with pid 11
2018-05-29 06:07:06,686 INFO spawned: 'iot' with pid 12
2018-05-29 06:07:06,688 INFO spawned: 'gunicorn' with pid 13
2018-05-29 06:07:06,702 INFO success: iot entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-29 06:07:06,757 INFO exited: iot (exit status 1; expected)
2018-05-29 06:07:07,758 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-29 06:07:07,758 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-29 06:07:11,763 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2018-05-29T06:07:12.127151Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-29T06:07:12.128599Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (13)
2018-05-29T06:07:12.128965Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-29T06:07:12.129924Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-29T06:07:12.131108Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 31
Initializing logger
2018-05-29T06:07:12.609597Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-29T06:07:12.609886Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-29T06:07:12.610067Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-29T06:07:12.610244Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-29 06:07:16.064184: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:16.064237: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:16.064267: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:16.064299: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:16.064308: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29T06:07:16.941963Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-29T06:07:16.943163Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
File "/var/azureml-app/aml_blueprint.py", line 109, in register
main.init()
File "/var/azureml-app/main.py", line 41, in init
user_init_func()
File "/var/azureml-app/webservice_driver.py", line 9, in init
saver.restore(sess,tf.train.latest_checkpoint('./'))
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
compat.as_bytes(filename), status)]
File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in exit
next(self.gen)
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-29T06:07:16.943407Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 31)
2018-05-29T06:07:17.162821Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-29T06:07:17.164038Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-29 06:07:17,198 INFO exited: gunicorn (exit status 3; not expected)
2018-05-29 06:07:18,200 INFO spawned: 'gunicorn' with pid 41
2018-05-29T06:07:23.646741Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2018-05-29T06:07:23.647964Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (41)
2018-05-29T06:07:23.648163Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2018-05-29T06:07:23.649027Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2018-05-29T06:07:23.650233Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 45
Initializing logger
2018-05-29T06:07:23.962799Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2018-05-29T06:07:23.963089Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2018-05-29T06:07:23.963266Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2018-05-29T06:07:23.963389Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2018-05-29 06:07:25.553304: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:25.553345: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:25.553358: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:25.553368: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29 06:07:25.553378: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-05-29T06:07:26.462690Z, ERROR, 00000000-0000-0000-0000-000000000000, , User's init function failed
2018-05-29T06:07:26.464196Z, ERROR, 00000000-0000-0000-0000-000000000000, , Encountered Exception Traceback (most recent call last):
File "/var/azureml-app/aml_blueprint.py", line 109, in register
main.init()
File "/var/azureml-app/main.py", line 41, in init
user_init_func()
File "/var/azureml-app/webservice_driver.py", line 9, in init
saver.restore(sess,tf.train.latest_checkpoint('./'))
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1511, in latest_checkpoint
if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/lib/io/file_io.py", line 282, in get_matching_files
compat.as_bytes(filename), status)]
File "/home/mmlspark/lib/conda/lib/python3.5/contextlib.py", line 66, in exit
next(self.gen)
File "/home/mmlspark/lib/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: /tmp/model

2018-05-29T06:07:26.464671Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 45)
2018-05-29T06:07:26.666009Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2018-05-29T06:07:26.666482Z, INFO, 00000000-0000-0000-0000-000000000000, , Reason: Worker failed to boot.
2018-05-29 06:07:26,697 INFO exited: gunicorn (exit status 3; not expected)
2018-05-29 06:07:27,699 INFO gave up: gunicorn entered FATAL state, too many start retries too quickly
2018-05-29 06:07:28,701 WARN program_exit: bad result line: 'Killing supervisor with this event: ver:3.0 server:supervisor serial:0 pool:program_exit poolserial:0 eventname:PROCESS_STATE_FATAL len:58'
2018-05-29 06:07:28,701 WARN program_exit: has entered the UNKNOWN state and will no longer receive events, this usually indicates the process violated the eventlistener protocol
2018-05-29 06:07:28,701 WARN received SIGQUIT indicating exit request
2018-05-29 06:07:28,701 INFO waiting for nginx, rsyslog, program_exit to die
2018-05-29 06:07:29,706 INFO stopped: nginx (exit status 0)
2018-05-29 06:07:29,707 INFO stopped: program_exit (terminated by SIGTERM)
2018-05-29 06:07:29,708 INFO stopped: rsyslog (exit status 0)

Received 64 lines of log

@AdityaKurude
Copy link

Hi guys,
I faced the same issue while following the quick start tutorial. However, after debugging a little bit I resolved the issue.
Since the checkpoint file provided in the repo contain path "/tmp/model/my_ConvNet_MNIST_model", the service tried to look for the model in the same path. Just change the path as "./my_ConvNet_MNIST_model" and then create the service.

command : $ az ml service create realtime -m my_ConvNet_MNIST_model.meta -d my_ConvNet_MNIST_model.data-00001-of-00000 -d my_ConvNet_MNIST_model.index -d checkpoint -f webservice_driver.py -n [your service name] -r python -c conda_dependencies.yml

I think this will resolve your issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants