Bug 7689 - Fatal crash on XML-RPC timeout
Summary: Fatal crash on XML-RPC timeout
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.13.0
Assignee: Pierre Ossman
URL:
Keywords: linma_tester, relnotes
Depends on:
Blocks:
 
Reported: 2021-04-20 13:00 CEST by Pierre Ossman
Modified: 2021-05-06 16:38 CEST (History)
1 user (show)

See Also:
Acceptance Criteria:


Attachments

Description Pierre Ossman cendio 2021-04-20 13:00:24 CEST
It seems that the planets can align just right and cause vsmserver to crash and burn instead of just logging when an XML-RPC request times out:

> Traceback (most recent call last):
>   File "/opt/thinlinc/sbin/vsmserver", line 20, in <module>
>     VSMServer ( sys . argv )
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 155, in __init__
>     self . loop ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 423, in loop
>     I11I11i1I = self . run_delayed_calls ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 381, in run_delayed_calls
>     oOO0OOOo . func ( * oOO0OOOo . args , ** oOO0OOOo . kw )
>   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 171, in handle_timeout
>     self . parent . log . warning ( "Timeout communicating with socket %s, closing" % str ( self . socket . getpeername ( ) ) )
>   File "/usr/lib64/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> socket.error: [Errno 107] Transport endpoint is not connected

It is unclear if this is the server or client side of an XML-RPC call. The triggering condition though is massive load on the system.
Comment 2 Pierre Ossman cendio 2021-04-20 14:06:50 CEST
This was supposedly fixed back in r25956 on bug 2862. However there is something we missed as the issue is still with us. We also failed to reproduce the issue then as well, so we likely never got a good fix.



It is very unclear how this can happen. socket.getpeername() only throws that error before it is connected, and after socket.shutdown(). However we never call socket.shutdown(), and self.connected is only set after the socket is verifiably connected. So there is some subtle bug here. Could even be a bug in glibc or the kernel where the low level socket is in a broken state.
Comment 3 Pierre Ossman cendio 2021-04-20 14:08:08 CEST
As an alternative we can start using self.addr instead. It is set by asyncore and should contain something valid.

(And in a worst case scenario it will just contain None, which won't crash)
Comment 4 Pierre Ossman cendio 2021-04-20 14:17:25 CEST
(In reply to Pierre Ossman from comment #0)
> 
> It is unclear if this is the server or client side of an XML-RPC call. The
> triggering condition though is massive load on the system.

I'm guessing the issue is with the server side of things. The client code seems to avoid this logging.

Perhaps socket.accept() can give us a broken socket under heavy load? However asyncore has code to detect this, so that still wouldn't explain things. :/
Comment 5 Pierre Ossman cendio 2021-04-20 14:31:04 CEST
Since we cannot find the scenario that triggers this we'll have to fix it blindly. Completely avoiding socket.getpeername() should be sufficiently safe.
Comment 7 Pierre Ossman cendio 2021-04-20 14:35:58 CEST
Adjusted now.

Test by causing timeouts on incoming XML-RPC handlers.
Comment 8 Linn cendio 2021-05-06 16:38:14 CEST
Tested with build 2048 on Ubuntu 20.04, and everything seems to be working as expected. However, I did not manage to reproduce the issue. Below are the commands I used to cause timeouts as well as the scenarios I tested.

-- Commands --
The first line disables access from <ip> on <port>, and the second line enables access again.
> sudo iptables -I INPUT -p tcp -s <ip> --dpor <port> -j DROP
> sudo iptables -D INPUT -p tcp -s <ip> --dpor <port> -j DROP

-- Scenario 1; repeated shorter timouts --

1. Log in to ThinLinc through the client.

2. On the server, I made a script to alternate enabling or disabling access from the client's IP. The swap between enabling or disabling occured every 5 seconds.

3. I let the script run for 5 minutes while using the client a bit meanwhile. No odd log messages in vsmserver.log.


-- Scenario 2; one long timeout --

1. Log in to ThinLinc through the client.

2. On the server, disable access from the client's IP.

3. Check that the client still is considered connected, by checking the session status in e.g. tlwebadm. 

4. Let the client session sit for 10 minutes. No odd log message in vsmserver.log.

Note You need to log in before you can comment on or make changes to this bug.