Bug 7578 - Web Access connections dropped on service restart
Summary: Web Access connections dropped on service restart
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Web Access
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.12.1
Assignee: Niko Lehto
URL:
Keywords: frifl_tester, relnotes
Depends on:
Blocks:
 
Reported: 2020-10-27 14:36 CET by Pierre Ossman
Modified: 2021-02-02 10:04 CET

Description Pierre Ossman cendio 2020-10-27 14:36:03 CET
Doing "sudo systemctl restart tlwebaccess" drops all currently active Web Access sessions, which is very much not what we want. The sysadmin should be able to restart the service at any time to pick up new configuration.

This is a regression from 4.11.0, caused by our migration to systemd (bug 4290).

The likely reason is that before systemd the automatically generated service unit had "KillMode=process", but the default after conversion is "KillMode=control-group".
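As a rough illustration of the difference described above, the following sketch reads a unit file and reports the KillMode that would apply, falling back to systemd's default of "control-group" when the setting is absent. The unit text is a made-up stand-in, not the shipped tlwebaccess.service, and `effective_kill_mode` is a hypothetical helper name.

```python
import configparser

# systemd's documented default when a unit does not set KillMode.
SYSTEMD_DEFAULT_KILL_MODE = "control-group"


def effective_kill_mode(unit_text):
    """Return the KillMode set in [Service], or systemd's default if unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # unit file keys are case-sensitive
    parser.read_string(unit_text)
    return parser.get("Service", "KillMode", fallback=SYSTEMD_DEFAULT_KILL_MODE)


# Illustrative unit text only (assumption, not the real unit file).
CONVERTED_UNIT = """\
[Unit]
Description=ThinLinc Web Access

[Service]
ExecStart=/bin/bash --login -c /opt/thinlinc/sbin/tlwebaccess
"""

# The same unit with the setting the old generated unit had.
FIXED_UNIT = CONVERTED_UNIT + "KillMode=process\n"

if __name__ == "__main__":
    print(effective_kill_mode(CONVERTED_UNIT))
    print(effective_kill_mode(FIXED_UNIT))
```

With "control-group", a restart stops every process in the service's control group, which would include the processes serving active sessions. With "process", only the main process is stopped.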

For reference sshd also has "KillMode=process". It might not actually need it though since it moves all connections to their own user session, something we could also do (bug 7577).
Comment 2 Niko Lehto cendio 2021-01-12 16:23:20 CET
Tested this on RHEL8 server, could reproduce it on nightly.

Session connection now persists through tlwebaccess restarts.
Comment 3 Frida cendio 2021-01-15 13:49:41 CET
Tested on Ubuntu20.04.

Reproduced with tl-4.12.0 and the issue is fixed when installing nightly build.

Made sure that everything in the active session still works as expected. Also tested creating new sessions after the restart, which works fine.

Checked the release notes, looks good. Closing.
Comment 4 Pierre Ossman cendio 2021-01-22 14:23:29 CET
I think this causes breakage like for bug 7526. On upgrade tl-setup fails to stop the old tlwebaccess before starting the new one:

tlsetup.log:

> 2021-01-22 14:07:06,112: Starting service 'tlwebaccess'...
> 2021-01-22 14:08:36,258: Output (stderr):
> 2021-01-22 14:08:36,259:     Job for tlwebaccess.service failed because a timeout was exceeded. See "systemctl status tlwebaccess.service" and "journalctl -xe" for details.
> 2021-01-22 14:08:36,259: Failed to start service 'tlwebaccess'

tlwebaccess.log:

> 2021-01-22 13:57:48 INFO tlwebaccess[9956]: ThinLinc Web Access version 4.11.0 build 6323 starting...
> 2021-01-22 13:57:48 INFO tlwebaccess[9956]: ThinLinc TLS Service ready on port 300
> 2021-01-22 13:57:48 INFO tlwebaccess[9956]: ThinLinc Web Access running as PID 9956 on port 300.
> 2021-01-22 14:07:06 INFO tlwebaccess[11685]: ThinLinc Web Access version 4.12.1 build 6724 starting...
> 2021-01-22 14:07:06 ERROR tlwebaccess[11685]: Could not start HTTP service: [Errno 98] Address already in use
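The "Address already in use" error in the log above is what a second bind to a port gives while another socket still holds it. A minimal sketch of that situation, using an ephemeral port on localhost rather than the real Web Access port (`second_bind_errno` is a hypothetical helper for illustration):

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind a listener, then try to bind a second socket to the same port.

    Returns the errno from the failed second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the kernel pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # plays the role of the new service
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```

Here the first socket stands in for the old 4.11.0 process that was never stopped, and the second for the newly started one.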
Comment 5 Pierre Ossman cendio 2021-01-26 14:48:20 CET
The issue is that systemd has no idea which process is the main one for a SysV service. There is no standardised way for it to find the pid file, and it also disables its guessing mechanism ("GuessMainPID").

Before the upgrade this works fine as systemd delegates the job of stopping the service to the SysV script, which knows about the pid file. It also works fine after an upgrade for the other services, as systemd kills every process associated with the service.

However, with the changes for tlwebaccess on this bug, systemd will only kill the main process, and for a SysV service it has no idea which process that is.


Ideally we would have stopped the services during a switch from SysV to systemd (bug 7163), but that's too big a change for this release.
Comment 6 Pierre Ossman cendio 2021-01-26 14:51:04 CET
I don't see any way of fixing up systemd, so it looks like we'll have to compensate for things somewhere else. I tried using "systemctl set-property", but it only allows you to change some specific things and not fundamental properties like MainPID.

Some magic in tl-setup when it starts the services is probably needed.
Comment 7 Pierre Ossman cendio 2021-01-26 14:55:28 CET
To make things extra difficult systemd will clean up the pid file even if it fails to do anything. So if the sysadmin does a manual "systemctl stop tlwebaccess" then we lose a reliable way of seeing if the main process is still running.
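For context, a pid-file based liveness check typically looks something like the sketch below. It is only an illustration of why losing the pid file matters, not code from tl-setup; `pidfile_process_alive` is a hypothetical name and the pid file location is left to the caller.

```python
import os


def pidfile_process_alive(pidfile_path):
    """Return True if the pid file names a process that currently exists."""
    try:
        with open(pidfile_path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Pid file missing or unreadable: the check cannot find the process,
        # even if the old main process is in fact still running.
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

Once the pid file has been removed, this kind of check returns False regardless of whether the old process is still alive, which is the problem described above.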
Comment 9 Pierre Ossman cendio 2021-01-27 11:03:39 CET
Added a temporary hack (until bug 7163) that manually kills any old tlwebaccess process just before starting the new one. That should hopefully work in most cases, but it will fail if someone has manually run "systemctl stop tlwebaccess" before tl-setup.
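The actual hack is not shown in this bug. Purely as an illustration of the general idea, one possible approach is to look through the PIDs in the service's control group and signal those whose command line mentions tlwebaccess. Everything below is an assumption: the function names, the use of a cgroup.procs file, and the matching on the command line are hypothetical and not taken from tl-setup.

```python
import os
import signal


def find_stale_pids(cgroup_procs_path, proc_root="/proc",
                    needle="tlwebaccess", exclude=()):
    """List PIDs from a cgroup.procs file whose command line contains needle."""
    with open(cgroup_procs_path) as f:
        pids = [int(line) for line in f if line.strip()]
    stale = []
    for pid in pids:
        if pid in exclude:
            continue
        try:
            with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process went away between listing and inspection
        if any(needle.encode() in arg for arg in argv):
            stale.append(pid)
    return stale


def kill_stale(pids, sig=signal.SIGTERM):
    """Send sig to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

The `proc_root` and `exclude` parameters exist so the lookup can be tried against a fake directory tree and so a caller can skip its own process. The location of cgroup.procs depends on the cgroup layout of the host, so it is passed in rather than guessed.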
Comment 10 Frida cendio 2021-02-02 10:04:20 CET
Could reproduce on Ubuntu20.04 and RHEL7 when upgrading from tl-4.11.0 to old nightly build 6724. The issue is fixed when upgrading to today's nightly build 6736 instead. Also tested an upgrade from tl-4.12.0 to nightly build 6736. Works fine and the workaround is skipped.

When reproducing I did not get an error in tlsetup.log as described in comment #4. It seems like tl-setup does not recognize that something went wrong. This is what I got on RHEL7:

tlsetup.log:
>   2021-02-01 16:56:56,796: Starting services...
>   2021-02-01 16:56:56,796: Starting service 'vsmagent'...
>   2021-02-01 16:56:57,251: Starting service 'tlwebadm'...
>   2021-02-01 16:56:57,858: Starting service 'vsmserver'...
>   2021-02-01 16:56:58,371: Starting service 'tlwebaccess'...
>   2021-02-01 16:56:58,912: Services done.
tlwebaccess.log:
>   2021-02-01 16:53:45 INFO tlwebaccess[1599]: ThinLinc Web Access version 4.11.0 build 6323 starting...
>   2021-02-01 16:53:45 INFO tlwebaccess[1599]: ThinLinc TLS Service ready on port 300
>   2021-02-01 16:53:45 INFO tlwebaccess[1599]: ThinLinc Web Access running as PID 1599 on port 300.
>   2021-02-01 16:56:58 INFO tlwebaccess[3072]: ThinLinc Web Access version 4.12.0post build 6724 starting...
>   2021-02-01 16:56:58 ERROR tlwebaccess[3072]: Could not start HTTP service: [Errno 98] Address already in use
systemctl status:
>   ● tlwebaccess.service - ThinLinc Web Access
>      Loaded: loaded (/usr/lib/systemd/system/tlwebaccess.service; enabled; vendor preset: disabled)
>      Active: failed (Result: exit-code) since Mon 2021-02-01 16:56:58 CET; 5min ago
>     Process: 3053 ExecStart=/bin/bash --login -c /opt/thinlinc/sbin/tlwebaccess (code=exited, status=0/SUCCESS)
>    Main PID: 3072 (code=exited, status=1/FAILURE)
>      CGroup: /system.slice/tlwebaccess.service
>              └─1599 python-thinlinc /opt/thinlinc/sbin/tlwebaccess


On Ubuntu20.04 I got a systemd warning but no error in tlsetup.log: 

>   $ journalctl -u tlwebaccess
>   Feb 01 13:35:33 ubuntu2004 systemd[1]: Stopping ThinLinc Web Access...
>   Feb 01 13:35:33 ubuntu2004 systemd[1]: tlwebaccess.service: Found left-over process 1050 (python-thinlinc) in control group while starting unit. Ignoring.
>   Feb 01 13:35:33 ubuntu2004 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
>   Feb 01 13:35:33 ubuntu2004 systemd[1]: Starting ThinLinc Web Access...
