Proper clock synchronization is required when running multiple Meteor servers/workers #260
btw, this is the db JSON of the job that cloned itself hundreds of times:

And this one clones itself tens of times:
Hi, obviously the behavior you are describing is not normal, and is not how the package typically performs. To get to the bottom of this, you should produce a simplified Meteor app that reproduces the issue with as little code as possible beyond what is needed to trigger it. Ideally a stripped-down app that does nothing but schedule and run such a job. If you can isolate this behavior from the rest of your application logic (and other unnecessary package dependencies), then you can commit that reproduction code to a repo on GitHub. If I can reproduce the issue in running code, then I will be able to very quickly diagnose what is happening. Trying to do it any other way will be a huge time sink for both of us. In my experience, weird problems like this are often the result of interactions with packages that monkey-patch core Meteor APIs in unknown ways. But it's also possible there is something in your code that I won't be able to see without you sharing a whole running app with me.
Got it, thanks for your answer. I'm currently stripping down the job to see when it stops showing this behaviour. I kind of suspect Meteor.defer; however, the other job that also clones itself does not use it. And I forgot to mention: I only have this issue with recurring jobs. One-off jobs are doing just fine. If I find something reproducible I will post it here.
@vsivsi question: We have 2 servers running. 1 server is running the job correctly (meaning, it runs it once); the other one is creating clones. There seems to be a time discrepancy between the 2 servers. The job is scheduled at 04:05 am, but it never runs at the scheduled time, and both servers pick up the job, with one of them cloning it. So I checked the server times: they were out of sync. I synced them, so both servers now run at exactly the same time. And... it seems to solve the issue! Plus the job now runs correctly on time. But this all looks very fragile to me. It's quite easy to have servers running out of sync.

Server 1:

Server 2:
Hi, as I'm sure you can appreciate, clock synchronization is very important to virtually all distributed applications. Given that your servers were more than 1 hour off from one another, I'm frankly surprised that far more serious things didn't break than what you've described in this issue. Most distributed databases, etc. don't work at all if server clocks are more than a few seconds apart.

In this case it seems that you had a single MongoDB instance, but multiple Meteor servers running against it. It may seem "fragile", but job-collection instances only communicate via the database they are attached to, and they may come and go freely, without any complicated synchronization beyond using NTP to keep the server clocks within some fraction of a second of one another. Given that running jobs on a schedule is an inherently time-based activity, having accurate clock settings seems like a very basic and obvious prerequisite to having such a system run reliably.

So, I'm not really sure what I can do to make this less fragile. I suppose a check could be run at startup (or periodically) to compare the Meteor server time to the MongoDB server time, and if they differ by more than some delta (say a few seconds) then throw a warning on the Meteor side. Even that is a bit difficult to measure because of network latency, etc. (precisely the problem that protocols like NTP address).

Anyway, thanks for reporting this. I'll think about it. At a minimum there should probably be a note in the documentation about supporting multiple servers and the importance of clock synchronization.
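A minimal sketch of such a startup check, assuming the Meteor server is allowed to run the admin serverStatus command (which reports the MongoDB server's localTime); the 5-second threshold and file name are arbitrary choices, not anything the package provides:

```js
// server/clockCheck.js -- illustrative sketch only, not part of job-collection.
import { Meteor } from 'meteor/meteor';
import { MongoInternals } from 'meteor/mongo';

const MAX_SKEW_MS = 5000; // assumed tolerance; tune to taste

Meteor.startup(async () => {
  const db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
  const before = Date.now();
  const status = await db.admin().serverStatus(); // includes a localTime Date
  const after = Date.now();
  // Use the round-trip midpoint to partially cancel network latency,
  // the same problem NTP solves far more rigorously.
  const localEstimate = (before + after) / 2;
  const skewMs = Math.abs(status.localTime.getTime() - localEstimate);
  if (skewMs > MAX_SKEW_MS) {
    console.warn(
      `Meteor and MongoDB clocks differ by ~${Math.round(skewMs)} ms; ` +
      'scheduled jobs may misbehave. Check NTP on all servers.'
    );
  }
});
```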
I understand 😄
Yes, that's right. We're using an mLab replica set for Mongo hosting, and 2 Meteor servers behind nginx. What I could imagine is that the server that picks up the job adds an id or something to the job, so that another worker never picks it up even if server times are out of sync (I think SteveJobs works like this: https://github.com/msavin/SteveJobs-meteor-jobs-queue). But honestly I don't know exactly how your package works, so this is just a long shot 😄

But even if that would work, I still don't understand why it starts cloning jobs like crazy. Why would a job clone itself? That is the part that mystifies me and feels fragile: if server 1 picks up the job 1 hour too early because it is out of sync, it could just run it 1 hour too early and complete it. Then server 2 would never pick it up. But what actually happens is that server 1 picks it up 1 hour too early, completes it, clones it hundreds of times, and completes all the clones. And then finally, server 2 also picks it up, but probably does not clone it because server 2's time > server 1's time.
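The locking idea described here amounts to an atomic claim. A generic sketch of that pattern (not job-collection's actual internals; jobs and myWorkerId are hypothetical names) might look like:

```js
// Generic "claim" pattern, NOT job-collection internals: atomically flip a
// due, waiting job to 'running' and stamp it with this worker's id, so no
// other worker can grab the same document even if its clock disagrees.
// Older Node Mongo drivers wrap the result in { value }.
const { value: claimed } = await jobs.rawCollection().findOneAndUpdate(
  { status: 'waiting', after: { $lte: new Date() } },
  { $set: { status: 'running', workerId: myWorkerId, startedAt: new Date() } },
  { returnDocument: 'after' } // the claimed document, or null if none was due
);
if (claimed) {
  // ...safe to run the job: no other worker holds this document...
}
```

Note that a claim like this only prevents two workers from running the same document; as the explanation below makes clear, the clones are brand-new documents created by the repeat logic, so it would not have prevented this issue.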
Hi, I think this behavior is pretty simple to explain in the context of how job-collection works... Take this example of how a repeating job will run with a single server/worker:

1. The job is submitted with a scheduled run time and sits in the collection as "waiting".
2. When the server's clock reaches that time, the job is promoted to "ready".
3. A worker picks up the job and runs it.
4. The worker completes the job.
5. Because the job repeats, a new copy is submitted, scheduled for the next occurrence.
Now let's add a second server with a clock that is one hour off to this scenario. We'll call these servers A (correct clock) and B (running one hour fast). At 04:05 by B's clock, while the true time is only 03:05, B promotes the job and a worker runs and completes it. The repeat logic then resubmits the job; if the next occurrence of "at 4:05 am" is computed against the correct clock, it is still today's 04:05 run, so from B's point of view the new copy is already overdue, and B promotes it again right away.
Voila. You will run as many jobs in that hour as the job runtime and server "promotion interval" allow. For jobs that are scheduled to "wait" some period of time, the code that handles this adds a few seconds of slop to ensure that a small clock offset will not cause the job to instantly repeat. But that doesn't help in the case of jobs scheduled to run at specific times, when the server clocks are off by more than the time to complete one whole cycle of: submit, promote, run, complete, resubmit. Hope this explanation helps...
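A toy Node script using later.js (the scheduling library job-collection builds on) can reproduce the arithmetic of that loop. It assumes the next occurrence is computed against a correct clock while a server running one hour fast does the promoting; the 30-second cycle time is made up:

```js
// Toy simulation of the cloning loop, not job-collection code.
const later = require('later');
later.date.UTC();

const sched = later.parse.text('at 4:05 am'); // same schedule as the report
const SKEW_MS = 60 * 60 * 1000; // the fast server's clock is +1 hour
const CYCLE_MS = 30 * 1000;     // assumed promote -> run -> complete -> resubmit time

let trueNow = Date.UTC(2019, 0, 1, 3, 5, 0); // real time 03:05; B's clock reads 04:05
let clones = 0;

// Next occurrence computed from the *correct* clock: still today at 04:05.
let nextRun = later.schedule(sched).next(1, new Date(trueNow));

while (trueNow + SKEW_MS >= nextRun.getTime()) {
  clones += 1;         // the fast server promotes the job and a worker runs it
  trueNow += CYCLE_MS; // one full cycle elapses in real time
  nextRun = later.schedule(sched).next(1, new Date(trueNow)); // resubmission
}
console.log(`ran ${clones} copies before real time reached the schedule`);
// ~one hour / 30 s per cycle => on the order of 120 "clones"
```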
I'm a long-time user of this package. But recently I ran into a problem where a single repeating job clones itself hundreds of times. This is the code:
It correctly creates a single job to run at 4:05am.
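A minimal sketch of what scheduling such a repeating job with job-collection typically looks like (collection name, job type, and payload here are hypothetical, not the reporter's actual code):

```js
// Hypothetical sketch, not the reporter's code: a job that repeats daily
// at 4:05 am, using job-collection's bundled later.js text parser.
import { JobCollection, Job } from 'meteor/vsivsi:job-collection';

const myJobs = new JobCollection('myJobQueue'); // collection name is an assumption

const job = new Job(myJobs, 'nightlyReport', { /* job payload */ });
job
  .repeat({ schedule: myJobs.later.parse.text('at 4:05 am') })
  .save();
```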
Then it starts running, and it creates hundreds of copies of itself. Each copy runs and gets status 'completed', and none of them fail (I checked the logs).
Here's the abbreviated code of the worker:
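A typical job-collection worker for such a job looks roughly like the following (again a hypothetical sketch reusing the names from the scheduling example above, not the reporter's abbreviated code):

```js
// Hypothetical worker sketch. processJobs polls the collection for ready
// jobs of this type and hands each one to the callback.
myJobs.processJobs('nightlyReport', { pollInterval: 5000 }, (job, callback) => {
  try {
    // ... do the actual nightly work here ...
    job.done(); // success; for a repeating job this triggers resubmission
  } catch (err) {
    job.fail('' + err); // record the failure on the job document
  }
  callback(); // tell the queue this worker slot is free again
});
```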
With another job I have exactly the same issue. It just clones itself about 40-50 times. I checked the Job collection, and the cloning is kind of random.
I'm literally out of ideas, starting to think about replacing this package :-(
Any idea what could result in jobs cloning themselves?