Bug 2862 - vsmagent crashes when machine is under heavy load
Summary: vsmagent crashes when machine is under heavy load
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Agent (show other bugs)
Version: pre-1.0
Hardware: PC All
: P2 Major
Target Milestone: 4.0.0
Assignee: Aaron Sowry
URL:
Keywords: ossman_tester
Depends on:
Blocks:
 
Reported: 2008-08-14 11:05 CEST by Pierre Ossman
Modified: 2012-11-28 12:08 CET (History)
0 users

See Also:
Acceptance Criteria:


Attachments

Comment 1 Peter Åstrand cendio 2008-10-07 14:07:12 CEST
We also have bug 2919, but seems to be a bit different. 
Comment 2 Peter Åstrand cendio 2008-11-03 15:48:18 CET
From Issue6885, we can also see this on vsmserver:

Traceback (most recent call last):
  File "/opt/thinlinc/sbin/vsmserver", line 22, in ?
    VSMServer ( sys . argv )
  File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 141, in __init__
    self . loop ( )
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 430, in loop
    o0o0oOOOo0oo = self . run_delayed_calls ( )
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 388, in run_delayed_calls
    I1i1i1 . func ( * I1i1i1 . args , ** I1i1i1 . kw )
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 163, in handle_timeout
    self . parent . log . warning ( "Timeout reading socket %s, closing" % str ( self . socket . getpeername ( ) ) )
  File "<string>", line 1, in getpeername
socket.error: (107, 'Transport endpoint is not connected')
Comment 4 Peter Åstrand cendio 2009-03-17 13:51:06 CET
Difficult problem that happens seldom. Postpone. 
Comment 5 Aaron Sowry cendio 2012-08-03 14:33:51 CEST
Having some trouble reproducing this bug easily, however it looks like the problem lies in the logging of handle_timeout() in xmlrpc.py:

def handle_timeout(self, dolog=True):
    """Handle read or write timeout on socket"""
    if self.writable() and dolog:
        self.parent.log.warning("Timeout writing socket %s, closing" % str(self.socket.getpeername()))
    elif dolog:
        self.parent.log.warning("Timeout reading socket %s, closing" % str(self.socket.getpeername()))
    self.close()

If the timeout is due to the socket not being connected, socket.getpeername() will fail. I suppose we should either handle this case more elegantly within the method, or not bother with writing the peername in the logging message.
Comment 6 Aaron Sowry cendio 2012-08-03 16:18:25 CEST
Perhaps we could log socket.getsockname() instead of socket.getpeername()
Comment 7 Aaron Sowry cendio 2012-08-06 12:38:02 CEST
Fixed in r25565.
Comment 8 Aaron Sowry cendio 2012-08-08 10:09:55 CEST
Autotest updated in r25572
Comment 9 Pierre Ossman cendio 2012-10-03 10:50:06 CEST
getsockname() is a rather useless piece of information in most cases as it will just give you a local address of the machine, and the listening port of the server (i.e. things that are static and already well known).

The code should probably report both getsockname() and getpeername(), and changed to handle the fact that they can raise exceptions. Then we'll get useful information when possible.

(side note: to safely get the peer address in all cases, we would need to store it from the accept() call, but that's a lot more work)
Comment 10 Aaron Sowry cendio 2012-10-04 13:18:46 CEST
There are 2 problems here:

1) The function asynchat.writable() can return True even if nothing is connected to the socket. Our xmlrpc.handle_timout() function assumed that if asynchat.writable() is True, then there must be something connected to it. If nothing is connected to the socket, then socket.getpeername() will raise an exception, which is what was causing the initial problem.

2) Our xmlrpc.handle_timout() function assumed that if the socket is writable, then it must be a write timeout, otherwise it must be a read timeout. This isn't necessarily true; in fact, using the current handle_timout() implementation, it doesn't seem possible to tell exactly *what* timed out. The log text has therefore been changed to "communcation" timeout rather than "read/write" timeout.

Fixed in 25956.
Comment 11 Aaron Sowry cendio 2012-10-05 10:33:17 CEST
Autotests updated in r25959
Comment 12 Aaron Sowry cendio 2012-10-05 15:33:35 CEST
(In reply to comment #11)
> Autotests updated in r25959

Fixed again in r25960.
Comment 13 Pierre Ossman cendio 2012-10-17 13:23:20 CEST
Code review. Looks ok now.

Note You need to log in before you can comment on or make changes to this bug.