Bug 7530 - Crash in GetLoad call causing agent to no longer be polled
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.13.0
Assignee: Frida Flodin
URL:
Keywords: nikle_tester, prosaic
Depends on:
Blocks:
 
Reported: 2020-07-09 10:26 CEST by Pierre Ossman
Modified: 2021-01-18 13:29 CET

Description Pierre Ossman cendio 2020-07-09 10:26:54 CEST
Right now we have a rather fragile design for the load monitoring. If anything unexpected happens, the loop that handles regular polling of an agent might break, and that agent will no longer get updated load information.

This gets worse if the agent is ever detected as down, as it will then get stuck in a downed state forever. And agents start off as being considered down, so a failure in the first ever poll makes that agent gone permanently.

The technical reason for the problem is that the response callback from GetLoad must always be called, as that is what schedules a new check. The current design wasn't built with the goal of making sure the callback is always called. So we need either a redesign or some other way of making sure the scheduling happens.

Once an agent has been hit by this, vsmserver must be restarted to get polling going again.
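
The failure mode described above can be illustrated with a small sketch. It is not taken from the ThinLinc source: FakeScheduler, FragilePoller, POLL_INTERVAL and broken_get_load are all hypothetical names, and the only assumption is the one stated in the description, namely that the response callback is the sole place where the next poll gets scheduled.

```python
# Hypothetical sketch; none of these names come from the ThinLinc source.
class FakeScheduler:
    """Collects scheduled calls so the example runs without an event loop."""

    def __init__(self):
        self.pending = []

    def call_later(self, delay, func, *args):
        self.pending.append((delay, func, args))


class FragilePoller:
    POLL_INTERVAL = 40  # seconds; arbitrary value for the sketch

    def __init__(self, scheduler, get_load):
        self.scheduler = scheduler
        self.get_load = get_load

    def poll(self, agent):
        # If get_load() raises, on_response() is never reached ...
        load = self.get_load(agent)
        self.on_response(agent, load)

    def on_response(self, agent, load):
        # ... and this is the only place where the next poll is scheduled.
        self.scheduler.call_later(self.POLL_INTERVAL, self.poll, agent)


def broken_get_load(agent):
    raise RuntimeError("unexpected error talking to " + agent)


scheduler = FakeScheduler()
poller = FragilePoller(scheduler, broken_get_load)
try:
    poller.poll("agent1")
except RuntimeError:
    pass  # a real server would log this somewhere and carry on

# Nothing is pending, so "agent1" is never polled again.
polls_left = len(scheduler.pending)
```

After the single failed poll nothing is left in the scheduler, which matches the "agent is no longer polled" symptom in the summary.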
Comment 2 Frida Flodin cendio 2021-01-14 16:32:40 CET
This should work better now. We now handle all unexpected errors from the load updater, and the callback is always called regardless.
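
A minimal sketch of what "the callback is always called" could look like, assuming a try/finally around the load update. This is an illustration only and not the actual change: update_load, get_load, on_response and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("loadbalancer_sketch")  # hypothetical logger name


def update_load(agent, get_load, on_response):
    """Always invoke on_response, even if get_load raises unexpectedly."""
    load = None
    try:
        load = get_load(agent)
    except Exception:
        # Catch everything unexpected and log it instead of letting it
        # escape and break the polling chain.
        log.exception("Unexpected error updating load for %s", agent)
    finally:
        # The callback is what schedules the next poll, so it must run
        # no matter what happened above. A load of None means "unknown".
        on_response(agent, load)


calls = []


def failing_get_load(agent):
    raise RuntimeError("boom")


update_load("agent1", failing_get_load, lambda a, l: calls.append((a, l)))
update_load("agent2", lambda a: 0.5, lambda a, l: calls.append((a, l)))
```

Both agents end up in calls, the failing one with a load of None, so the code that schedules the next check runs in either case.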

To reproduce: see bug 7531

The tester needs to make sure that we never stop trying to update the load for an agent, even if we got an error the last time.
Comment 3 Niko Lehto cendio 2021-01-15 14:46:30 CET
Tested on a RHEL 8 server with a nightly build.

If loadbalancer.py/update_loadinfo() encounters an error early on (before we update our loadstatus), we will retry updating the loadinfo immediately, which is probably what we want. However, this will spam the log heavily if the error is persistent.
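
One possible way to keep retrying without flooding the log is to wait before each retry and lengthen the wait while the error persists. The sketch below only illustrates that idea; it is not how vsmserver handles it, and retry_delay together with all three constants is a hypothetical name or value.

```python
# Hypothetical values; the real polling interval is not stated in this bug.
NORMAL_INTERVAL = 40.0   # seconds between polls when everything works
MIN_RETRY_DELAY = 5.0    # first retry after an error
MAX_RETRY_DELAY = 300.0  # never wait longer than this


def retry_delay(consecutive_errors):
    """Delay before the next poll, given how many polls in a row failed."""
    if consecutive_errors == 0:
        return NORMAL_INTERVAL
    # Double the delay for every further failure, up to a cap, so that a
    # persistent error produces a bounded number of log lines per hour.
    delay = MIN_RETRY_DELAY * 2 ** (consecutive_errors - 1)
    return min(delay, MAX_RETRY_DELAY)


delays = [retry_delay(n) for n in range(9)]
```

Because the delay is never zero, a persistent error results in at most one retry (and one log entry) per delay period, while the agent still keeps being polled.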
Comment 5 Frida Flodin cendio 2021-01-15 16:16:11 CET
(In reply to Niko Lehto from comment #3)
> If loadbalancer.py/update_loadinfo() encounters an error early on (Before we
> update our loadstatus) we will re-try to update loadinfo immediately which
> is probably as we want it to be. This will spam the log extremely if this is
> a persistent error.

Fixed now
Comment 6 Niko Lehto cendio 2021-01-18 13:29:46 CET
Re-tested with nightly build 6721.
Works well now! The polling continues even if crashes happen, and this does not spam the log.
