www.cendio.com
Bug 4380 - restarting vsmserver and vsmagent at the same time causes sessions to be lost
: restarting vsmserver and vsmagent at the same time causes sessions to be lost
Status: CLOSED FIXED
: ThinLinc
VSM Server
: 3.4.0
: PC Unknown
: P2 Normal
: 4.1.1
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2012-08-22 09:25 by
Modified: 2013-10-24 10:50 (History)
Acceptance Criteria:


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From cendio 2012-08-22 09:25:05
Reported in Issue 13605.

When both vsmserver and vsmagent are restarted simultaneous (this happens
weekends on Ubuntu according to reports), vsmserver tries to verify sessions
loaded from file with a vsmagent that hasn't started yet. Log extracts below.

2012-08-19 04:13:14 INFO vsmserver: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmserver: Terminating. Have a nice day!
2012-08-19 04:13:17 INFO vsmserver: VSM Server version 3.4.0post build 3568
started
2012-08-19 04:13:17 INFO vsmserver.license: Updating license data from disk to
memory
2012-08-19 04:13:18 INFO vsmserver.license: License summary: 10 concurrent
users. Hard limit of 11 concurrent users.
2012-08-19 04:13:18 INFO vsmserver.session: Loaded 2 sessions for 2 users from
file
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused
(ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session
for moetiker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for moetiker on 127.0.0.1
have not been updated for 13474 seconds. Removing.
2012-08-19 04:13:18 WARNING vsmserver.loadinfo: Connection refused
(ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request for loadinfo.
Marking as down.
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused
(ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session
for zaucker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for zaucker on 127.0.0.1
have not been updated for 13726 seconds. Removing.

2012-08-19 04:13:14 INFO vsmagent: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmagent: Terminating. Have a nice day!
2012-08-19 04:13:18 INFO vsmagent: VSM Agent version 3.4.0post build 3568
started
2012-08-19 04:13:18 INFO vsmagent: My public hostname is froburg.oetiker.ch
2012-08-19 04:13:19 WARNING vsmagent: Empty output from ssh-keyscan. Sshd not
running?
2012-08-19 09:28:18 INFO vsmagent.session: Verified connectivity to newly
started Xvnc for zaucker
2012-08-19 09:30:15 WARNING vsmagent.sessions: Broken session for user zaucker,
tl-session process 30134 does not exist
2012-08-19 09:30:17 INFO vsmagent.session: Verified connectivity to newly
started Xvnc for zaucker

I suppose vsmserver could mark the sessions as unreachable and give vsmagent a
few more chances of responding before removing the session.
------- Comment #2 From cendio 2013-08-08 11:24:10 -------
Possible workaround:

I've asked the reporter to try to add "sleep 10" before restarting vsmserver in
the postrotate script for vsm-server. Waiting for feedback.
------- Comment #6 From cendio 2013-09-06 16:23:54 -------
vsmserver tried to give the agent 3*10 minutes to recover from errors, but the
timestamp it compared to was never written to disk unless a session was
added/removed/changed. So for a server that did logrotates in the middle of the
night with no users active, all sessions would have timestamps that were hours
old. When vsmserver was restarted it did one try to verify the session, and
when vsmagent couldn't be reached it looked at the timestamp and discarded the
session for being too old.

Instead of using a timestamp to check when the session was last updated, we're
now using an error counter that's reset when vsmserver is restarted, giving
vsmserver three tries to verify the session before it gives up. 

The tester should make sure that HA still works, and that the value of this
counter is unique per vsmserver.

Committed in r27842, r27843, r27844.
------- Comment #7 From cendio 2013-09-10 16:01:35 -------
Fixed an additional off-by-one problem found by autotests in r27857.
------- Comment #8 From cendio 2013-10-17 14:03:08 -------
Verified functionality using build 4116, 30 minutes passed before vsmserver
removed user from session logs.

Still needs to test following:

1. Test the cleanup functionality:
 - stop and start agent to make sessions processes childs of the system.
 - verify that server can verify session by pid
 - kill session process
 - verify that server removes session within 10 minutes

2. Test HA functionality.
------- Comment #9 From cendio 2013-10-17 14:35:12 -------
(In reply to comment #8)

> Still needs to test following:
> 
> 1. Test the cleanup functionality:
>  - stop and start agent to make sessions processes childs of the system.
>  - verify that server can verify session by pid
>  - kill session process
>  - verify that server removes session within 10 minutes
> 

Verified that a broken session got removed after 10 minutes when tl-session
process was killed.
------- Comment #10 From cendio 2013-10-18 09:32:56 -------
Tested using 2 x SLED11 Sp2, ThinLinc build 4116

A: Master service
B: Master and Agent services

> 2. Test HA functionality.

- Verified HA functionality, connection/reconnection through both masters etc.

- Killed tl-session and verified that session was cleaned up and removed from 
  master A and B.

- Stopped vsmagent and restarted vsmservers.
  Verified that session was cleaned up after 30 minutes. Master A deleted 
  session and sent delete notice to master B that deleteed and stopped trying
to
  verify the session.

- With a active session, restarting vsmagent to make it a child of init
process.
  Killed the tl-session process and verified that vsm servers cleaned up the 
  session from session database.