Bug 5476 - poor handling of session on downed agent
Summary: poor handling of session on downed agent
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.8.0
Assignee: Peter Åstrand
URL:
Keywords: relnotes, samuel_tester
: 5699 (view as bug list)
Depends on: 5268
Blocks:
  Show dependency treegraph
 
Reported: 2015-03-18 15:59 CET by Pierre Ossman
Modified: 2017-04-28 16:52 CEST (History)
3 users (show)

See Also:
Acceptance Criteria:


Attachments

Description Pierre Ossman cendio 2015-03-18 15:59:18 CET
Right now we have a rather poor user experience if a user has a session on an agent that isn't responding and the user tries to reconnect. What will normally happen is:

 - The client will request a list of sessions.
 - The master cannot verify the session, so it will be excluded from the list
 - The client thinks it has no sessions, so it will request a new one
 - The master will refuse since only one session is allowed per user.

At this point the users only option is to wait for the master to give up and drop the session. Not even "End existing session" works as it won't remove a session unless the agent is alive and responding.

If multiple sessions are allowed then you'll silently get a new session. Which isn't great either, but not as problematic as the default case.

It is not clear what to do here. We don't want to just allow a second session as that can create lots of other issues. We can provide a better error message to the user though. And we could consider letting kill_session work so that the user has a choice of when to give up on the unreachable session.


There is also the issue that the agent might start responding again after the master has given up. At this point you have a stray session that is unreachable.
Comment 1 Pierre Ossman cendio 2015-03-18 16:06:54 CET
Bug 5268 has a crude workaround for this. It simply allows a new session for a user if the agent for the existing session is marked as down. It also adds an attempt at cleaning up old sessions by killing "unknown" tl-session processes whenever verify_session is run on an agent.
Comment 2 Peter Åstrand cendio 2016-12-06 09:41:10 CET
A few notes after looking at this:

* A better error message is probably good. Currently, we return ERR_SESSION_LIMIT. We would need an error message that indicates this but also conveys that there's a session on a downed agent. Since old clients does not now about this new error message, we either need let them display "unknown error" or the existing ERR_SESSION_LIMIT message. 

* As it is not, there's a discrepancy between the server and client about how many sessions there are, in this situation. The server counts the number of sessions in the database. The client checks how many sessions it gets from the server, but since the server is filtering out the session that cannot be verified, the number will be different. 

One solution to the problem above would be to stop filtering out unreachable sessions; return it to the client but marked in a special way. This would solve the "number of sessions problem" and also make it possible to use End existing session. The problem is that we must then make sure that the client does not try to reconnect to such session. I'm not sure this is possible with the existing client code. Perhaps the client logic needs to be extended, and that the list of sessions is only filtered for old clients. More investigation is needed.
Comment 3 Pierre Ossman cendio 2016-12-12 14:03:15 CET
I've had a look at the client logic and it does look promising to introduce a new state.

An old client with the default configuration of 1 session per user will behave like this:

 a) It will try to reconnect to the session, and the server will respond with a new error code which the client will present as "Unknown error".

 b) The user will attempt to use "End existing session", and the client will send a "kill_session" as a result. At this point the server could abandon the unreachable session. After that the client will attempt to create a new session and the user is hopefully up and running again.

If multiple sessions are allowed:

 c) An auto client will consider unreachable sessions the same as connected, and in most cases request a new session.

 d) A non-auto client will pop up the selection dialog where the unreachable sessions will have a state of "unknown". The user can then kill them (abandon), or attempt a reconnect which will result in the "Unknown error" from a).



An updated client could present this slightly better by warning the user that sessions are only abandoned, not actually terminated. It could also refuse to connect to an unreachable session.
Comment 4 Pierre Ossman cendio 2016-12-12 14:07:40 CET
As for the server side of things, it could do the following:

 a) Stop automatically abandoning unreachable sessions, instead marking them in some way.

 b) Drop the unreachable sessions when kill_session is received

 c) Refuse connections to unreachable sessions

We might need to add some automatic abandon timeout to compensate for MaxDisconnectTime and such not working in these cases.


For extra bonus we could change b) to:

 b2) Put the abandoned session on a special "kill later" list and terminate them once we get in touch with the agent again. We could put a very long timeout on this to handle an agent that never comes back, e.g. a week.
Comment 10 Peter Åstrand cendio 2016-12-21 11:04:50 CET
For testing, check ctc/vsm/doc/multisession-reconnect.txt
Comment 21 Pierre Ossman cendio 2017-01-10 12:29:13 CET
Note that the HTML client will initially be unaffected by these things as it lacks both support for terminating sessions (bug 5049) and handling multiple sessions (bug 5060).
Comment 33 Peter Åstrand cendio 2017-01-30 13:20:00 CET
To tester: I recommend testing all "R" entries in the tables in vsm/doc/multisession-reconnect.txt.
Comment 34 Samuel Mannehed cendio 2017-02-01 15:51:46 CET
Seems to work well for the most parts. Tested using build 5360 of both the server and the client. I have had a cluster with two agent servers and tested combinations where I through using iptables or stopping the vsmagent service blocked the communication with the master.

I have tested different values of max sessions. I have tested both the ask policy and the normal one in the client settings.

An old client trying to reconnect to an unreachable session will report that ThinLinc login failed with an unknown error. Good enough.

However, in tlwebadm on the sessions page, it says:

> Connection Status	unknown

instead of unreachable. This should be fixed imo.
Comment 36 Pierre Ossman cendio 2017-02-02 10:47:30 CET
We'll keep the connection status, but added a more prominent warning for unreachable sessions. Also changed the label for terminating the session in that case.
Comment 37 Samuel Mannehed cendio 2017-02-03 09:28:28 CET
(In reply to comment #36)
> We'll keep the connection status, but added a more prominent warning for
> unreachable sessions. Also changed the label for terminating the session in
> that case.

Works well now.
Comment 38 Samuel Mannehed cendio 2017-02-03 09:30:23 CET
Trying to login using the HTML5 client, max 1 session allowed, with 1 unreachable session:

> 2017-02-03 09:18:39 INFO tlwebaccess[68321]: [::ffff:10.47.4.80] Successful authentication for user u'cendio'
> 2017-02-03 09:19:19 INFO tlwebaccess[68321]: [::ffff:10.47.4.80] User u'cendio' reconnecting to existing session
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80] ----------------------------------------
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80] Traceback (most recent call last):
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]   File "/opt/thinlinc/sbin/tlwebaccess", line 340, in post_or_get
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]     i1ii1iiI , Ii11I1 = getattr ( I1I1I , action ) ( o0o0o0oO0oOO , OO , i1ii1iiI )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 158, in do_POST
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]     self . _POST_METHODS . get ( page_name , self . error_404 ) ( query ) )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 326, in home
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]     return self . check_authenticated ( iI1ii1Ii , i1i , Oo000o , o00oOO0o )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 400, in check_authenticated
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]     return self . thinlinc_login ( username , iiIi1IIi1I , screen_size_array )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 570, in thinlinc_login
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80]     o0OoOo00o0o [ "result" ] [ "termserv_hostname" ] ,
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80] KeyError: 'result'
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80] ----------------------------------------
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:10.47.4.80] code 500, message Internal error on page '/main/'
> 2017-02-03 09:19:19 INFO tlwebaccess[68321]: [::ffff:10.47.4.80] 'POST /main/ HTTP/1.1' 500 -
Comment 43 Samuel Mannehed cendio 2017-02-03 13:19:04 CET
> Your session is currently unreachable. Please try again later, or contact your system administrator.

Nice, works well now. Tested with tl-4.7.0post_5363.r32167.jenkins516.
Comment 44 Pierre Ossman cendio 2017-03-14 10:25:06 CET
*** Bug 5699 has been marked as a duplicate of this bug. ***
Comment 45 Pierre Ossman cendio 2017-03-29 16:24:55 CEST
The documentation isn't properly updated. We have a chapter that describes the login handling in the client that needs several changes.
Comment 47 Pierre Ossman cendio 2017-03-30 15:31:04 CEST
There wasn't actually that much more needed since the new stuff is in new dialogs that only pop up when there's an issue. Still, a small section describing this new state was added.
Comment 48 Samuel Mannehed cendio 2017-04-06 10:11:03 CEST
Looks good.

Note You need to log in before you can comment on or make changes to this bug.