v4: Try to select a replica with available connections when calling DoSecondary #346
Hi @woodsbury, I agree that this is a problem worth solving, and thank you for digging into it and coming up with a solution. I do have a couple of comments about your implementation:

- A minor one: the
- Another minor comment: whatever solution gets introduced here I would also want to add to the Sentinel implementation, because it surely has the same problem.

The bigger concern of mine is what you've touched on already: that you're accessing the pool directly. As it stands, there are no components of radix which downcast to any of radix's private types; everything is done via the public interfaces. This allows components to be wrapped, or to have their implementations completely replaced by users. It's something which is promised by the documentation, and I don't want to undercut that promise by introducing special behavior for the radix types.

A different solution which comes to mind would be to have Cluster track the result of calling

I wouldn't blame you if you don't want to implement this, though; it could be a bit tricky. But maybe you have some other ideas/comments as well. Wdyt?
Yeah, I was worried that just accessing the length like that wasn't thread safe.

Having the cluster track which of its clients seem to be misbehaving might work. The main thought I have around that is that if the cluster needs to determine that all by itself, it would only ever be able to do so after the fact: it would have to first run a command against the client to detect whether it returned an error or took longer than usual to respond, and if it then decided a client was unhealthy, it wouldn't be able to work out when the client had recovered without just trying to use it again and hoping it didn't get another error. It seemed to me that the client itself might be in a better position to determine whether it's healthy, and therefore could react more quickly both in determining that it was having problems and in determining that it had recovered. Tracking it in the cluster for each of its clients also sounds like it could be a trickier implementation on top of that.

With the implementation I've got, the idea I had for removing the explicit reference to the pool type was to introduce a new, optional interface. Any client could then implement that interface to opt in to the behaviour, or ignore it to continue using the current behaviour. I've got an updated commit here with the concept: woodsbury@e3f42da

Like before, I'm not married to this specific implementation or naming or anything; I'm just presenting it here to hopefully make it clearer what I'm thinking. I made the interface private just because that was simpler, but it could just as easily be a new public interface if that were better. I can also see that having a way for a client to report whether or not it is healthy or ready might be useful outside of a cluster as well. Do you have any thoughts on that?
Continuing the testing we're doing on different failure states in a Redis cluster and how Radix responds to them - this time checking failures of master nodes - we saw a similar issue, but this time with the cluster resyncing its topology.

When a master node dies, sending a command to it will fail. We can detect that we've got an error back and call Sync to try to update the topology, but when the cluster is small there's a good chance that the node Radix chooses to call CLUSTER SLOTS on will be the same master that we've just detected as dead. Using the same logic as that previous commit, I was able to improve that scenario too: woodsbury@2e5f2c2

I've just duplicated the interface here because that was the easiest thing to do, not intending it to necessarily be a proper implementation. I just wanted to point out a related issue while we're discussing solutions.
Hey @woodsbury, the idea of an optional interface had occurred to me as well. There are two reasons I'm hesitant about it:
Which isn't to say I'm against the idea entirely, just that I'd like some more time to think on it. It would be feasible to write a
We had an issue recently where one of the replica nodes in our Redis cluster became unresponsive. Radix seemed to do a pretty good job of detecting that the connections it had open were failing and cleaning them out of the pool, but since Redis was still reporting the node in the output of CLUSTER SLOTS it understandably kept the pool around.
This led to an issue where all the commands we were running via DoSecondary on that shard slowed down considerably. We have two replicas for each shard in our cluster, and because the pool for the unresponsive node was still around Radix had a 50% chance of sending commands to it (which would inevitably fail after some timeout). We retry those failures, and we'd then have another 50% chance of getting the bad node again, etc.
The improvement we're suggesting is to update the code that selects which replica to use so that it prefers replicas with non-empty pools. I've created an implementation of this that, in some limited local testing, removed all the errors we were seeing beforehand by making Radix consistently choose the good node in this kind of scenario. If all nodes are bad (or simply fully loaded at that moment), it effectively falls back to the previous behaviour of selecting one at random.
The commit with the changes on it can be found here: woodsbury@62c0519
I wanted to present it here before creating a pull request in case there are any issues you can see with the general concept. The current changes access the connections inside a pool directly, as that was the easiest way to implement it, but there may be ways to clean that up.