Reported in Issue 13605.
When both vsmserver and vsmagent are restarted simultaneously (according to reports, this happens on weekends on Ubuntu), vsmserver tries to verify sessions loaded from file against a vsmagent that hasn't started yet. Log extracts below.
2012-08-19 04:13:14 INFO vsmserver: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmserver: Terminating. Have a nice day!
2012-08-19 04:13:17 INFO vsmserver: VSM Server version 3.4.0post build 3568 started
2012-08-19 04:13:17 INFO vsmserver.license: Updating license data from disk to memory
2012-08-19 04:13:18 INFO vsmserver.license: License summary: 10 concurrent users. Hard limit of 11 concurrent users.
2012-08-19 04:13:18 INFO vsmserver.session: Loaded 2 sessions for 2 users from file
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session for moetiker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for moetiker on 127.0.0.1 have not been updated for 13474 seconds. Removing.
2012-08-19 04:13:18 WARNING vsmserver.loadinfo: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request for loadinfo. Marking as down.
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session for zaucker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for zaucker on 127.0.0.1 have not been updated for 13726 seconds. Removing.
2012-08-19 04:13:14 INFO vsmagent: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmagent: Terminating. Have a nice day!
2012-08-19 04:13:18 INFO vsmagent: VSM Agent version 3.4.0post build 3568 started
2012-08-19 04:13:18 INFO vsmagent: My public hostname is froburg.oetiker.ch
2012-08-19 04:13:19 WARNING vsmagent: Empty output from ssh-keyscan. Sshd not running?
2012-08-19 09:28:18 INFO vsmagent.session: Verified connectivity to newly started Xvnc for zaucker
2012-08-19 09:30:15 WARNING vsmagent.sessions: Broken session for user zaucker, tl-session process 30134 does not exist
2012-08-19 09:30:17 INFO vsmagent.session: Verified connectivity to newly started Xvnc for zaucker
I suppose vsmserver could mark the sessions as unreachable and give vsmagent a few more chances to respond before removing the session.
I've asked the reporter to try adding "sleep 10" before restarting vsmserver in the postrotate script for vsm-server. Waiting for feedback.
vsmserver tried to give the agent 3*10 minutes to recover from errors, but the timestamp it compared against was never written to disk unless a session was added, removed or changed. So on a server that rotated its logs in the middle of the night with no users active, all sessions would have timestamps that were hours old. When vsmserver was restarted, it made a single attempt to verify each session, and when vsmagent couldn't be reached it looked at the timestamp and discarded the session for being too old.
Instead of using a timestamp to check when the session was last updated, we're now using an error counter that's reset when vsmserver is restarted, giving vsmserver three tries to verify the session before it gives up.
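For illustration only, a minimal sketch of the counter approach described above. This is not the code committed for this bug; the names Session, SessionStore, record_verify_result and MAX_VERIFY_FAILURES are hypothetical, and resetting the counter when the agent answers is an assumption. The point is that the counter lives only in memory, so it starts at zero after every vsmserver restart no matter how old the session data on disk is.

```python
from dataclasses import dataclass

# Assumption: "three tries" means the session is dropped on the third
# consecutive failed verification.
MAX_VERIFY_FAILURES = 3


@dataclass
class Session:
    user: str
    agent: str
    # Held in memory only and never written to disk, so a restarted
    # server always begins counting from zero.
    verify_failures: int = 0


class SessionStore:
    def __init__(self, sessions):
        self.sessions = {s.user: s for s in sessions}

    def record_verify_result(self, user, reachable):
        """Update the failure counter; return True if the session was removed."""
        session = self.sessions[user]
        if reachable:
            # Assumption: a successful verification clears earlier failures.
            session.verify_failures = 0
            return False
        session.verify_failures += 1
        if session.verify_failures >= MAX_VERIFY_FAILURES:
            del self.sessions[user]
            return True
        return False
```

With this sketch, a session loaded from file survives two failed verification attempts against an agent that has not started yet and is only removed on the third, regardless of when it was last updated on disk.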
The tester should make sure that HA still works, and that the value of this counter is unique per vsmserver.
Committed in r27842, r27843, r27844.
Fixed an additional off-by-one problem found by autotests in r27857.
Verified functionality using build 4116. 30 minutes passed before vsmserver removed the user from the session logs.
The following still needs to be tested:
1. Test the cleanup functionality:
- stop and start the agent so that the session processes become children of the system.
- verify that the server can verify the session by pid
- kill the session process
- verify that the server removes the session within 10 minutes
2. Test HA functionality.
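Background for test case 1: once the agent has been restarted, the session processes are no longer its children, so it cannot wait() on them and has to check liveness by pid instead. How vsmagent actually performs this check is not shown in this report. The following is only a sketch of the common POSIX technique (sending signal 0), with pid_exists being a hypothetical helper name.

```python
import os


def pid_exists(pid):
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks only;
        # no signal is actually delivered to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with that pid.
        return False
    except PermissionError:
        # EPERM: the process exists but is owned by another user.
        return True
    return True
```

A check like this cannot tell whether the pid has been reused by an unrelated process, so a real implementation would likely also need to confirm that the pid still belongs to the session.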
(In reply to comment #8)
> The following still needs to be tested:
> 1. Test the cleanup functionality:
> - stop and start the agent so that the session processes become children of the system.
> - verify that the server can verify the session by pid
> - kill the session process
> - verify that the server removes the session within 10 minutes
Verified that a broken session got removed after 10 minutes when the tl-session process was killed.
Tested using 2 x SLED11 SP2, ThinLinc build 4116
A: Master service
B: Master and Agent services
> 2. Test HA functionality.
- Verified HA functionality, connection/reconnection through both masters etc.
- Killed tl-session and verified that the session was cleaned up and removed from
master A and B.
- Stopped vsmagent and restarted the vsmservers.
Verified that the session was cleaned up after 30 minutes. Master A deleted the
session and sent a delete notice to master B, which deleted it and stopped trying to
verify the session.
- With an active session, restarted vsmagent to make the session a child of the init process.
Killed the tl-session process and verified that the vsm servers cleaned up the
session from the session database.