This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Dispatcher crash with TPE KeyError #5798

Open
Fripplebubby opened this issue Jul 3, 2024 · 3 comments

Comments

@Fripplebubby

Fripplebubby commented Jul 3, 2024

Describe the issue:
The dispatcher crashes for reasons I have not been able to identify, and when it does, my experiment stops running.

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Linux (Ubuntu 22.04)
  • Server OS (for remote mode only):
  • Python version: 3.10.14
  • PyTorch/TensorFlow version: 2.2.1+cu118 (PyTorch)
  • Is conda/virtualenv/venv used?: No (pyenv is used)
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
from nni.experiment import Experiment
experiment = Experiment('local')
experiment.config.trial_command = 'python model.py'
experiment.config.trial_code_directory = '.'
# search_space is the dict listed under "Search space" below
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'minimize'
experiment.config.max_trial_number = 1000
experiment.config.trial_concurrency = 1
experiment.run(8080)
  • Search space:
search_space = {
    "hidden_sizes": {
        "_type": "choice",
        "_value": [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]
    },
    "learning_rate": {
        "_type": "loguniform",
        "_value": [0.000001, 0.1]
    },
    "batch_size": {
        "_type": "choice",
        "_value": [32, 64, 128]
    },
    "num_epochs": {
        "_type": "randint",
        "_value": [100, 1000]
    },
    "dropout_prob": {
        "_type": "uniform",
        "_value": [0.0, 0.5]
    },
    "use_batch_norm": {
        "_type": "choice",
        "_value": [True, False]
    },
    "activation_fn": {
        "_type": "choice",
        "_value": ["relu", "leaky_relu", "sigmoid", "tanh", "elu", "selu"]
    },
    "patience": {
        "_type": "randint",
        "_value": [0, 10]
    }
}

Log message:

  • nnimanager.log:
    (relevant snippet)
[2024-07-03 17:10:21] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 46,
  hyperParameters: {
    value: '{"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2024-07-03 17:10:21] INFO (LocalV3.local) Created trial eDGmO
[2024-07-03 17:10:22] INFO (LocalV3.local) Trial parameter: eDGmO {"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 1, "value": "14.616238377757908"}'
}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 2, "value": "10.664397033219485"}'
}
  • dispatcher.log:
[2024-07-03 16:45:49] INFO (nni.tuner.tpe/MainThread) Using random seed 668056533
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'hidden_sizes': {'_type': 'choice', '_value': [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]}, 'learning_rate': {'_type': 'loguniform', '_value': [1e-06, 0.1]}, 'batch_size': {'_type': 'choice', '_value': [32, 64, 128]}, 'num_epochs': {'_type': 'randint', '_value': [100, 1000]}, 'dropout_prob': {'_type': 'uniform', '_value': [0, 0.5]}, 'use_batch_norm': {'_type': 'choice', '_value': [True, False]}, 'activation_fn': {'_type': 'choice', '_value': ['relu', 'leaky_relu', 'sigmoid', 'tanh', 'elu', 'selu']}, 'patience': {'_type': 'randint', '_value': [0, 10]}}
[2024-07-03 17:10:21] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 45
Traceback (most recent call last):
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 45
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
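
For anyone comparing their own logs against this one, here is a minimal sketch that pulls ERROR entries and their tracebacks out of a dispatcher.log. It assumes only the line shape visible in the excerpt above (`[timestamp] LEVEL (source) message`). The names `ENTRY`, `error_entries`, and `SAMPLE` are made up for this illustration and are not part of NNI. The sample is an abbreviated copy of the lines quoted above.

```python
import re

# Matches lines shaped like the dispatcher.log entries quoted above:
# [YYYY-MM-DD HH:MM:SS] LEVEL (source) message
# The source part may itself contain parentheses, hence the non-greedy group.
ENTRY = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+) \((.*?)\) (.*)$"
)


def error_entries(lines):
    """Group each ERROR entry with the continuation lines that follow it."""
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            timestamp, level, source, message = match.groups()
            current = None  # a new entry ends any traceback being collected
            if level == "ERROR":
                current = {
                    "time": timestamp,
                    "source": source,
                    "message": message,
                    "details": [],
                }
                entries.append(current)
        elif current is not None:
            # Lines without a timestamp prefix belong to the previous entry.
            current["details"].append(line)
    return entries


# Abbreviated copy of the log excerpt from this issue.
SAMPLE = [
    "[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started",
    "[2024-07-03 17:10:21] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 45",
    "Traceback (most recent call last):",
    "    params = self._running_params.pop(parameter_id)",
    "KeyError: 45",
    "[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...",
]

errors = error_entries(SAMPLE)
for entry in errors:
    print(entry["time"], entry["message"], entry["details"][-1])
```

To scan a real file, pass an open file object instead of `SAMPLE`, for example `error_entries(open(path))`, where `path` points at your own dispatcher.log.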

How to reproduce it?:

This is not a one-off: it happens occasionally across different experiments. I lowered the trial concurrency to 1 to try to avoid it, but the crash still occurs.

In this example, it was evidently trial 45 that triggered the crash. In the web UI, I can see that trial 45 succeeded and has a recorded metric value. Yet when TPE looks up the trial's parameters in receive_trial_result, the entry for parameter ID 45 is missing from _running_params.
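
The failing line in the traceback is a `dict.pop` with no default, which raises `KeyError` whenever the key is absent. The toy model below is only a sketch of that mechanism. `ToyTracker` and its methods are hypothetical stand-ins, not NNI code, and the parameter values are placeholders. It shows how the error appears if the entry was already removed before the result arrives, without claiming to identify which code path removed it in NNI.

```python
class ToyTracker:
    """Hypothetical stand-in for a tuner's bookkeeping; not NNI code."""

    def __init__(self):
        # Maps parameter_id -> parameters of trials still considered running.
        self.running = {}

    def start(self, parameter_id, params):
        self.running[parameter_id] = params

    def finish_strict(self, parameter_id):
        # Same pattern as the failing line in the traceback:
        # pop with no default raises KeyError if the id is absent.
        return self.running.pop(parameter_id)

    def finish_lenient(self, parameter_id):
        # pop with a default returns None instead of raising.
        return self.running.pop(parameter_id, None)


tracker = ToyTracker()
tracker.start(45, {"placeholder": 1})   # placeholder parameters

first = tracker.finish_strict(45)       # entry found and removed

try:
    tracker.finish_strict(45)           # entry is already gone
    outcome = "no error"
except KeyError as exc:
    outcome = f"KeyError: {exc}"

second = tracker.finish_lenient(45)     # absent id, but no exception

print(first, outcome, second)
```

In this sketch the second strict lookup fails in the same way as the dispatcher log, while the lenient variant returns `None`. Whether a lenient lookup would be the right fix for TPE is a separate question, since silently skipping a result would also drop that trial from the tuner's history.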

@sertreet

Plus one, me too.

@Lionelsy

Same problem while using the TPE.

@redLinmumu

Same question.
