Are you referring to the instance status checks? AFAIK, those aren't conclusive because they don't check anything going on inside the instance. And how could they, unless there were code running inside the VM that reports its status to the host? AFAIK, there is no such code. Waiting for the status check to report "ok" only works coincidentally, because the checks typically take longer to initialize than the instance takes to boot. It is not a reliable mechanism. If you put a `sleep 3600` into the boot process, the status checks would still report "ok" after five minutes, I think.
CGCloud waits until the last cloud-init stage is finished, which happens late in the boot process, close to when rc.local is run. Even if we wait for the init process to enter the final run level, the daemons started by init will still go through initialization asynchronously. This includes, for example, the Mesos slave daemons registering with the master.
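For illustration, here is a rough sketch of one way to detect that the final cloud-init stage has completed. It relies on the `boot-finished` marker file that cloud-init writes under `/var/lib/cloud/instance/` at the end of its last stage. The function name and parameters are made up for this example, it has to run on the instance itself (for example over SSH), and it is not necessarily how CGCloud implements the wait.

```python
import os
import time

# Marker file that cloud-init creates once its final stage has completed.
BOOT_FINISHED = "/var/lib/cloud/instance/boot-finished"


def wait_for_cloud_init(marker=BOOT_FINISHED, timeout=300.0, interval=2.0,
                        sleep=time.sleep):
    """Poll for the cloud-init marker file (hypothetical helper).

    Returns True once the marker exists, or False if `timeout` seconds
    elapse first. Must be executed on the instance being checked.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(marker):
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
    return True
```

Note that a `True` result only says cloud-init is done. Daemons started by init may still be initializing at that point.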
To deal with this asynchrony, the unit tests ask the master daemon (Spark or Mesos) to report the number of slaves. We could move that functionality into create-cluster. Then again, I wouldn't want it to get hung up on a few sticky instances when I'm creating a cluster of hundreds of instances. So that check would have to leave some wiggle room, requiring only, say, 90% of the instances to join.
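A minimal sketch of what such a tolerant check might look like. Everything here is hypothetical: `get_slave_count` stands in for whatever call asks the Spark or Mesos master how many slaves have registered, and the 90% threshold, timeout, and polling interval are placeholder defaults rather than anything create-cluster does today.

```python
import time


def wait_for_quorum(get_slave_count, expected, min_percent=90,
                    timeout=600.0, interval=5.0, sleep=time.sleep):
    """Wait until at least `min_percent` of `expected` slaves have joined.

    `get_slave_count` is a caller-supplied function (an assumption of this
    sketch) returning the number of slaves currently known to the master.
    Returns the observed count, or raises RuntimeError on timeout.
    """
    # Integer ceiling of expected * min_percent / 100, avoiding float rounding.
    required = (expected * min_percent + 99) // 100
    deadline = time.monotonic() + timeout
    while True:
        count = get_slave_count()
        if count >= required:
            return count
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "only %d of %d slaves joined (needed %d)"
                % (count, expected, required))
        sleep(interval)
```

With `expected=200` and the default threshold, the call returns as soon as 180 slaves have registered, so a handful of stuck instances no longer blocks cluster creation.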
Just now, I saw an instance that got stuck while booting, before it had even started ssh, while the instance status check said it was ok. I take this as further evidence that the status checks are inconclusive as far as boot completion is concerned.
Suggestion from @briandoconnor
There is a status endpoint for determining if it's finished initializing.