grow-cluster / create-cluster shouldn't return until finished initializing #210

jvivian · 2016-07-20T18:43:34Z

There is a status endpoint for determining if it's finished initializing.

hannes-ucsc · 2016-07-20T19:35:50Z

Are you referring to the instance status checks? AFAIK, those aren't conclusive since they don't check on anything going on inside the instance. And how could they, unless there is code running inside the VM that reports the status to the host? AFAIK, there is no such code. Waiting for the status check to report "ok" only works by coincidence, because the checks typically take longer to initialize than the instance takes to boot up. It is not a reliable mechanism. If you put a sleep 3600 into the boot process, the status checks would still report "ok" after five minutes, I think.

CGCloud waits until the last cloud-init stage is finished, which happens late in the boot process, close to when rc.local is run. Even if we wait for the init process to enter the final run level, the daemons started by init will still go through initialization asynchronously. This includes, for example, the Mesos slave daemons registering with the master.

To deal with this asynchrony, the unit tests ask the master daemon (Spark or Mesos) to report the number of slaves. We could move that functionality into create-cluster. Then again, I wouldn't want it to get hung up on a few sticky instances when I'm creating a cluster of hundreds of instances. So that check would have to leave some wiggle room, requiring only, say, 90% of the instances to join.
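A rough sketch of what such a check with wiggle room could look like. This is not existing CGCloud code: `wait_for_slaves`, its parameters, and the `count_slaves` callback are hypothetical names, and the callback stands in for whatever query the master daemon (Spark or Mesos) actually supports. The 90% threshold and the timeout are illustrative defaults only.

```python
import time
from typing import Callable


def wait_for_slaves(count_slaves: Callable[[], int],
                    expected: int,
                    min_percent: int = 90,
                    timeout: float = 600.0,
                    poll_interval: float = 5.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until at least min_percent of the expected slaves have joined.

    count_slaves is a hypothetical callback that asks the master how many
    slaves are currently registered. Returns the number of registered slaves
    once the threshold is met, or raises TimeoutError if the deadline passes.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    required = (expected * min_percent + 99) // 100
    deadline = clock() + timeout
    while True:
        registered = count_slaves()
        if registered >= required:
            return registered
        if clock() >= deadline:
            raise TimeoutError(
                "only %d of %d slaves joined (needed %d)"
                % (registered, expected, required))
        sleep(poll_interval)
```

With `expected=100` and the default threshold, the call returns as soon as the master reports 90 or more slaves, so a handful of sticky instances would not block create-cluster. The timeout bounds the wait in case too few instances ever join.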

hannes-ucsc · 2016-07-20T22:44:26Z

ping @briandoconnor

hannes-ucsc · 2016-07-28T19:21:31Z

Just now, I've seen an instance that got stuck while booting, before even starting ssh, while the instance status check says it's ok. I take this as further evidence that the status checks are inconclusive as far as boot completion is concerned.

jvivian added the enhancement label Jul 20, 2016
