Right now we have a rather fragile design for the load monitoring. If anything unexpected happens, the loop that handles regular polling of an agent can break, and that agent will no longer get updated load information.
This gets worse if the agent is ever detected as down, as it will then be stuck in the down state forever. Agents also start off being considered down, so a failure in the very first poll makes that agent permanently unavailable.
The technical reason for the problem is that the response callback from GetLoad must always be called, as that is what schedules the next check. The current design wasn't built with the goal of making sure the callback is always called. So we need either a redesign, or some other way of making sure the scheduling happens.
To recover once this has happened, vsmserver must be restarted.
This should work better now. All unexpected errors from the load updater are now handled, and the callback is always called regardless.
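For illustration, the "callback is always called" guarantee can be sketched roughly like this. This is a minimal sketch and not the actual vsmserver code; `guarded_poll`, `get_load` and `on_response` are hypothetical names made up for the example.

```python
import logging

log = logging.getLogger("loadmonitor_sketch")


def guarded_poll(get_load, on_response):
    """Run one poll and call on_response exactly once, whatever happens.

    get_load and on_response are hypothetical stand-ins for the load
    updater and the response callback; they are not the real API.
    """
    load = None
    error = None
    try:
        load = get_load()
    except Exception as exc:
        # An unexpected error must not break the polling loop.
        log.exception("load update failed")
        error = exc
    # Reached on both success and failure, so whatever scheduling the
    # callback does for the next check always happens.
    on_response(load, error)
```

The point of the pattern is that the callback sits outside the `try` block, so no error raised by the updater can skip it.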
To reproduce: see bug 7531
Tester needs to make sure that we never stop trying to update the load for an agent, even if we got an error the last time.
Tested on RHEL8 server with nightly.
If loadbalancer.py/update_loadinfo() encounters an error early on (before we update our loadstatus), we will retry updating loadinfo immediately, which is probably what we want. However, this will spam the log heavily if the error is persistent.
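One possible way to avoid the spam would be to wait longer between retries while the error persists. The following is only a sketch of that idea and not what the code currently does; `next_poll_delay` and its default values are assumptions made up for the example.

```python
def next_poll_delay(failures, normal_interval=5.0, max_delay=60.0):
    """Return the number of seconds to wait before the next poll.

    failures is the number of consecutive failed polls (0 after a
    success). The interval and cap are illustrative values only.
    """
    if failures <= 0:
        return normal_interval
    # Double the wait for each consecutive failure, up to max_delay.
    return min(normal_interval * 2 ** failures, max_delay)
```

With these example values a persistent error would be retried, and logged, at most once a minute instead of continuously. The trade-off is that an agent that recovers is noticed later.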
(In reply to Niko Lehto from comment #3)
> If loadbalancer.py/update_loadinfo() encounters an error early on (before we
> update our loadstatus), we will retry updating loadinfo immediately, which
> is probably what we want. However, this will spam the log heavily if the
> error is persistent.
Re-tested with nightly build 6721.
Works well now! The polling continues even if crashes happen, and this does not spam the log.