Bug 5476 - poor handling of session on downed agent
: poor handling of session on downed agent
: ThinLinc
VSM Server
: trunk
: PC Unknown
: P2 Normal
: 4.8.0
Assigned To:
: 5268
  Show dependency treegraph
Reported: 2015-03-18 15:59 by
Modified: 2017-04-28 16:52 (History)
Acceptance Criteria:



You need to log in before you can comment on or make changes to this bug.

Description From cendio 2015-03-18 15:59:18
Right now we have a rather poor user experience if a user has a session on an
agent that isn't responding and the user tries to reconnect. What will normally
happen is:

 - The client will request a list of sessions.
 - The master cannot verify the session, so it will be excluded from the list
 - The client thinks it has no sessions, so it will request a new one
 - The master will refuse since only one session is allowed per user.

At this point the users only option is to wait for the master to give up and
drop the session. Not even "End existing session" works as it won't remove a
session unless the agent is alive and responding.

If multiple sessions are allowed then you'll silently get a new session. Which
isn't great either, but not as problematic as the default case.

It is not clear what to do here. We don't want to just allow a second session
as that can create lots of other issues. We can provide a better error message
to the user though. And we could consider letting kill_session work so that the
user has a choice of when to give up on the unreachable session.

There is also the issue that the agent might start responding again after the
master has given up. At this point you have a stray session that is
------- Comment #1 From cendio 2015-03-18 16:06:54 -------
Bug 5268 has a crude workaround for this. It simply allows a new session for a
user if the agent for the existing session is marked as down. It also adds an
attempt at cleaning up old sessions by killing "unknown" tl-session processes
whenever verify_session is run on an agent.
------- Comment #2 From cendio 2016-12-06 09:41:10 -------
A few notes after looking at this:

* A better error message is probably good. Currently, we return
ERR_SESSION_LIMIT. We would need an error message that indicates this but also
conveys that there's a session on a downed agent. Since old clients does not
now about this new error message, we either need let them display "unknown
error" or the existing ERR_SESSION_LIMIT message. 

* As it is not, there's a discrepancy between the server and client about how
many sessions there are, in this situation. The server counts the number of
sessions in the database. The client checks how many sessions it gets from the
server, but since the server is filtering out the session that cannot be
verified, the number will be different. 

One solution to the problem above would be to stop filtering out unreachable
sessions; return it to the client but marked in a special way. This would solve
the "number of sessions problem" and also make it possible to use End existing
session. The problem is that we must then make sure that the client does not
try to reconnect to such session. I'm not sure this is possible with the
existing client code. Perhaps the client logic needs to be extended, and that
the list of sessions is only filtered for old clients. More investigation is
------- Comment #3 From cendio 2016-12-12 14:03:15 -------
I've had a look at the client logic and it does look promising to introduce a
new state.

An old client with the default configuration of 1 session per user will behave
like this:

 a) It will try to reconnect to the session, and the server will respond with a
new error code which the client will present as "Unknown error".

 b) The user will attempt to use "End existing session", and the client will
send a "kill_session" as a result. At this point the server could abandon the
unreachable session. After that the client will attempt to create a new session
and the user is hopefully up and running again.

If multiple sessions are allowed:

 c) An auto client will consider unreachable sessions the same as connected,
and in most cases request a new session.

 d) A non-auto client will pop up the selection dialog where the unreachable
sessions will have a state of "unknown". The user can then kill them (abandon),
or attempt a reconnect which will result in the "Unknown error" from a).

An updated client could present this slightly better by warning the user that
sessions are only abandoned, not actually terminated. It could also refuse to
connect to an unreachable session.
------- Comment #4 From cendio 2016-12-12 14:07:40 -------
As for the server side of things, it could do the following:

 a) Stop automatically abandoning unreachable sessions, instead marking them in
some way.

 b) Drop the unreachable sessions when kill_session is received

 c) Refuse connections to unreachable sessions

We might need to add some automatic abandon timeout to compensate for
MaxDisconnectTime and such not working in these cases.

For extra bonus we could change b) to:

 b2) Put the abandoned session on a special "kill later" list and terminate
them once we get in touch with the agent again. We could put a very long
timeout on this to handle an agent that never comes back, e.g. a week.
------- Comment #10 From cendio 2016-12-21 11:04:50 -------
For testing, check ctc/vsm/doc/multisession-reconnect.txt
------- Comment #21 From cendio 2017-01-10 12:29:13 -------
Note that the HTML client will initially be unaffected by these things as it
lacks both support for terminating sessions (bug 5049) and handling multiple
sessions (bug 5060).
------- Comment #33 From cendio 2017-01-30 13:20:00 -------
To tester: I recommend testing all "R" entries in the tables in
------- Comment #34 From cendio 2017-02-01 15:51:46 -------
Seems to work well for the most parts. Tested using build 5360 of both the
server and the client. I have had a cluster with two agent servers and tested
combinations where I through using iptables or stopping the vsmagent service
blocked the communication with the master.

I have tested different values of max sessions. I have tested both the ask
policy and the normal one in the client settings.

An old client trying to reconnect to an unreachable session will report that
ThinLinc login failed with an unknown error. Good enough.

However, in tlwebadm on the sessions page, it says:

> Connection Status	unknown

instead of unreachable. This should be fixed imo.
------- Comment #36 From cendio 2017-02-02 10:47:30 -------
We'll keep the connection status, but added a more prominent warning for
unreachable sessions. Also changed the label for terminating the session in
that case.
------- Comment #37 From cendio 2017-02-03 09:28:28 -------
(In reply to comment #36)
> We'll keep the connection status, but added a more prominent warning for
> unreachable sessions. Also changed the label for terminating the session in
> that case.

Works well now.
------- Comment #38 From cendio 2017-02-03 09:30:23 -------
Trying to login using the HTML5 client, max 1 session allowed, with 1
unreachable session:

> 2017-02-03 09:18:39 INFO tlwebaccess[68321]: [::ffff:] Successful authentication for user u'cendio'
> 2017-02-03 09:19:19 INFO tlwebaccess[68321]: [::ffff:] User u'cendio' reconnecting to existing session
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:] ----------------------------------------
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:] Traceback (most recent call last):
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]   File "/opt/thinlinc/sbin/tlwebaccess", line 340, in post_or_get
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]     i1ii1iiI , Ii11I1 = getattr ( I1I1I , action ) ( o0o0o0oO0oOO , OO , i1ii1iiI )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 158, in do_POST
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]     self . _POST_METHODS . get ( page_name , self . error_404 ) ( query ) )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 326, in home
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]     return self . check_authenticated ( iI1ii1Ii , i1i , Oo000o , o00oOO0o )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 400, in check_authenticated
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]     return self . thinlinc_login ( username , iiIi1IIi1I , screen_size_array )
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/main.py", line 570, in thinlinc_login
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:]     o0OoOo00o0o [ "result" ] [ "termserv_hostname" ] ,
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:] KeyError: 'result'
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:] ----------------------------------------
> 2017-02-03 09:19:19 ERROR tlwebaccess[68321]: [::ffff:] code 500, message Internal error on page '/main/'
> 2017-02-03 09:19:19 INFO tlwebaccess[68321]: [::ffff:] 'POST /main/ HTTP/1.1' 500 -
------- Comment #43 From cendio 2017-02-03 13:19:04 -------
> Your session is currently unreachable. Please try again later, or contact your system administrator.

Nice, works well now. Tested with tl-4.7.0post_5363.r32167.jenkins516.
------- Comment #44 From cendio 2017-03-14 10:25:06 -------
*** Bug 5699 has been marked as a duplicate of this bug. ***
------- Comment #45 From cendio 2017-03-29 16:24:55 -------
The documentation isn't properly updated. We have a chapter that describes the
login handling in the client that needs several changes.
------- Comment #47 From cendio 2017-03-30 15:31:04 -------
There wasn't actually that much more needed since the new stuff is in new
dialogs that only pop up when there's an issue. Still, a small section
describing this new state was added.
------- Comment #48 From cendio 2017-04-06 10:11:03 -------
Looks good.