The Load Balancer will declare a machine as failed, or dead, when any of the following events occurs:
-
The balancer cannot make a connection to the back-end in a reasonable time. This is often due to network connectivity problems, or because the network (or machine) is too busy to respond.
If the machine really is reachable, try increasing the connection timeout with the tunable tuning!max_connect_time.
-
The back-end server refuses connections from the balancer. This may happen if the programs on the back-end have broken, or failed to start up (perhaps the machine is rebooting).
-
The balancer manages to send a request to the back-end machine, but does not start receiving a response in a reasonable time. This is often seen if a service locks up, but still continues to accept new connections.
You may trigger this problem if you have complex scripts on your web server (e.g. big CGIs) that do not return any data at all until they have finished their work. This can inadvertently fool the balancer into thinking the back-end is malfunctioning (although if other requests are working, it will know that all is well).
You can increase this timeout with the tunable tuning!max_reply_time.
Note that the balancer requires more than one failed connection attempt before deciding that a machine or service is dead. The number of failed attempts is set with the tunable tuning!max_retries.
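The three failure conditions and the retry threshold can be illustrated with a short sketch. This is not the balancer's own code. The function probe, the class BackendState and the result strings are hypothetical names invented for this example. The parameter names merely echo the tunables above. The sketch also assumes that failures are counted consecutively and that any success resets the count, which the text above suggests but does not state.

```python
import socket


def probe(host, port, max_connect_time=2.0, max_reply_time=5.0,
          request=b"HEAD / HTTP/1.0\r\n\r\n"):
    """Classify one attempt against a back-end (hypothetical helper)."""
    try:
        # Condition 1: the connection must be established within max_connect_time.
        sock = socket.create_connection((host, port), timeout=max_connect_time)
    except ConnectionRefusedError:
        # Condition 2: the back-end actively refused the connection.
        return "refused"
    except socket.timeout:
        return "connect_timeout"
    except OSError:
        # Other network errors, e.g. no route to host.
        return "unreachable"

    with sock:
        # Condition 3: the first byte of the reply must arrive within max_reply_time.
        sock.settimeout(max_reply_time)
        try:
            sock.sendall(request)
            first_byte = sock.recv(1)
        except socket.timeout:
            return "reply_timeout"
        except OSError:
            return "reply_error"

    # An empty read means the peer closed the connection without replying.
    return "ok" if first_byte else "reply_error"


class BackendState:
    """Tracks consecutive failures for one back-end (assumed semantics)."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.failures = 0

    def record(self, result):
        """Record one probe result; return True once the back-end counts as dead."""
        if result == "ok":
            self.failures = 0  # assumption: a working request clears earlier failures
        else:
            self.failures += 1
        return self.failures >= self.max_retries
```

A monitoring loop would call probe repeatedly and pass each result to record. A slow script that sends nothing until it finishes shows up here as "reply_timeout", which is why raising max_reply_time (or having the script emit some output early) avoids a false failure.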