Our web-based services (currently tlwebadm and tlwebaccess) are really two servers each, not one: the python process (tlwebadm/tlwebaccess) and the tlstunnel process. The service scripts only track the first one, though. If the python process crashes, the tlstunnel process is left running. It keeps holding the sockets, so restarting the service is not enough to fix things. One fix would be to make the service script clever enough to track both daemons and kill them as needed on restart. We might also want to consider delegating the starting and stopping of tlstunnel entirely to the service script, rather than having the python process control it.
Example failure scenario:

# kill -9 931 (tlwebadm)
# /etc/init.d/tlwebadm status
ThinLinc Web Administration is stopped
# /etc/init.d/tlwebadm restart
Shutting down ThinLinc Web Administration
Starting ThinLinc Web Administration
# tail /var/log/tlwebadm.log
2014-01-07 09:20:10 INFO tlwebadm[931]: ThinLinc Web Administration version 4.1.1post build 4182 starting...
2014-01-07 09:20:10 INFO tlwebadm[931]: ThinLinc Web Administration running as PID 931 on port 1010.
2014-01-07 09:20:10 INFO tlwebadm[933]: ThinLinc TLS Service ready on port 1010.
2014-01-07 09:36:51 INFO tlwebadm[3615]: ThinLinc Web Administration version 4.1.1post build 4182 starting...
2014-01-07 09:36:51 ERROR tlwebadm[3615]: Could not bind to AF_UNIX socket /var/run/tlwebadm.sock. Check that there are no other processes using this socket. Exiting...
IMHO, the fact that an extra process (tlstunnel) is used should be considered an implementation detail and should not be "visible" at a higher level, i.e. at the service level. Thus, I believe we should try to fix the bug in the service code instead. For example, the service might need to record the tlstunnel pid in a file on disk and check for stray tlstunnel processes on startup.
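
A rough sketch of the pid file idea, just to make it concrete. This is not the actual service code; the function names and the pid file handling are made up for illustration, and it assumes the service writes the tlstunnel pid to a file when it spawns it.

```python
import os


def read_pid(pidfile):
    """Return the pid stored in pidfile, or None if missing or invalid."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # Reject 0 and negative values: os.kill() treats them as process groups.
    return pid if pid > 0 else None


def pid_is_running(pid):
    """Check whether a process with this pid exists, without signalling it."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stray(pidfile):
    """Return the recorded pid if that process is still alive, else None."""
    pid = read_pid(pidfile)
    if pid is not None and pid_is_running(pid):
        return pid
    return None
```

On startup the service could call find_stray() on its (hypothetical) tlstunnel pid file and kill whatever it returns before binding. Note that a bare pid check can hit an unrelated process if the pid has been reused, so real code should also verify that the pid actually belongs to tlstunnel.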
We could solve this by keeping a file descriptor open between the python service and tlstunnel. tlstunnel should then be able to do select() on that fd and detect if the python service has crashed.
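
A minimal sketch of the detection side, written in Python for brevity (tlstunnel itself would do the equivalent in its own select() loop). It assumes the python service creates a pipe and keeps the write end open for its whole lifetime, while tlstunnel inherits only the read end. The function name parent_alive is made up for illustration.

```python
import os
import select


def parent_alive(read_fd, timeout=0.0):
    """Return False once every write end of the pipe has been closed."""
    readable, _, _ = select.select([read_fd], [], [], timeout)
    if not readable:
        # Nothing to read: some process still holds the write end open.
        return True
    # A readable pipe that returns no data is at EOF, so the writer is gone.
    return os.read(read_fd, 1) != b""
```

The kernel closes the write end when the python process dies, even on kill -9, so no cooperation from the crashing process is needed. The one thing to watch is that tlstunnel must close its own inherited copy of the write end, otherwise it never sees EOF.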
This got fixed in r31160 for bug 5044. We no longer have a listening tlstunnel process that can get lost.