We've had quite a few problems with the number of session verification calls that are needed with the default 10-minute session_update_delay setting in clusters with lots of users.
Instead of verifying each session with a separate call to its agent, we could do better by adding an XMLRPC call/handler that checks all sessions on an agent at the same time.
This approach could cut the number of calls required during each session_update_delay interval from the number of sessions to the number of agents.
There is consensus on the basic design: a single job on the server iterates over the sessions in the session database, groups them per agent, and then asks each agent to verify its sessions.
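As a rough illustration of the grouping idea only, not the actual implementation: the session records (dicts with "id" and "agent" keys), the function names, and the verify_on_agent callback below are all hypothetical stand-ins; the real batched XMLRPC call and session database look different.

```python
from collections import defaultdict


def group_sessions_by_agent(sessions):
    """Group session ids by the agent that hosts them.

    `sessions` is assumed to be an iterable of dicts with "id" and
    "agent" keys (a hypothetical shape, not the real session database).
    """
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["agent"]].append(session["id"])
    return dict(grouped)


def verify_all_sessions(sessions, verify_on_agent):
    """Issue one verification call per agent instead of one per session.

    `verify_on_agent(agent, session_ids)` is a hypothetical stand-in for
    the batched call; it is assumed to return the set of session ids
    that the agent confirms are still alive.
    """
    results = {}
    for agent, session_ids in group_sessions_by_agent(sessions).items():
        alive = verify_on_agent(agent, session_ids)
        for session_id in session_ids:
            results[session_id] = session_id in alive
    return results
```

With this shape, the number of calls per round equals the number of distinct agents that have sessions, regardless of how many sessions each agent hosts.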
The test TestSessionOnRemovedAgent caught a change in behaviour in the new code; we now verify existing sessions right away after the server starts. Previously we would do so after a delay.
We need to decide whether we want to keep this behaviour. One possible problem is that we race with the agent during startup and might wrongly mark those sessions as unverified.
We've also failed to implement VerifySessionsCall.handle_known_errors(), which the test TestSessionOnDeadAgent has detected.
Works well. We've tested:
- Periodic check: alive, dead, timeout
- Reconnect: alive, dead
- Shadow: alive, dead
- HA: no scenario found (see bug 6146); we do have unit tests, though
- tlwebadm: alive, dead, connect/disconnect
I also checked socket usage, and it seems to be doing fine on a single port per agent (or fewer). I had a master/agent pair with 100 sessions on the agent, and only port 1023 was used on the master.