Bug 5442 - vsmserver crashes when all available file descriptors have been exhausted
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: trunk
Hardware: PC Linux Red Hat
Importance: P2 Normal
Target Milestone: MediumPrio
Assignee: Henrik Andersson
Reported: 2015-02-23 11:09 CET by Aaron Sowry
Modified: 2015-03-03 11:11 CET
CC List: 2 users

Description Aaron Sowry cendio 2015-02-23 11:09:49 CET
Some distributions set a low default limit on the number of file descriptors a process may have open. On CentOS, RHEL and Fedora, for example, the soft limit is only 1024 and the hard limit 4096. On large clusters vsmserver can exhaust this allocation, for example when many users log out at the same time (many agent hostname lookups, session database updates, log file writes, etc.). In most cases this simply causes specific threads to crash, for example:

--
2015-02-19 08:33:56 ERROR vsmserver.session: Unhandled exception trying to verify session for maastr1-ST0017 on VSM Agent cehca040:904: <class 'socket.gaierror'> [Errno -2] Name or service not known Traceback (most recent call last):
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 332, in xmlrpc_call
  File "/usr/lib64/python2.6/asyncore.py", line 337, in connect
  File "<string>", line 1, in connect_ex
gaierror: [Errno -2] Name or service not known
--

and errors like:

--
2015-02-19 08:33:56 ERROR vsmserver.session: FAILURE WRITING SESSION DATABASE!: [Errno 24] Too many open files: '/var/lib/vsm/sessions.temp'
--

However, in at least one instance it has caused vsmserver to crash completely:

--
Traceback (most recent call last):
  File "/opt/thinlinc/sbin/vsmserver", line 22, in <module>
  File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 142, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 430, in loop
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 388, in run_delayed_calls
  File "/opt/thinlinc/modules/thinlinc/vsm/sessionstore.py", line 239, in periodic_session_update
  File "/opt/thinlinc/modules/thinlinc/vsm/call_verifysession.py", line 34, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 280, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 136, in __init__
  File "/usr/lib64/python2.6/asyncore.py", line 288, in create_socket
  File "/usr/lib64/python2.6/socket.py", line 184, in __init__
socket.error: [Errno 24] Too many open files
--

We should probably do a couple of things here:

1) Investigate whether we can handle this error condition better. It is difficult to continue normally in such a situation, but perhaps we could wait and try again later.

2) Investigate whether we can make vsmserver more efficient in its fd usage.

3) Document this limitation (and possibly ways of raising the limit) in the TAG and/or Platform Specific Notes.
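
For points 2 and 3, the limits and the current descriptor usage can be inspected from inside a process. The following is only a minimal sketch and not ThinLinc code: the helper names are hypothetical, it assumes a modern Python 3 interpreter rather than the Python 2.6 seen in the tracebacks, and counting open descriptors via /proc is Linux-specific.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit_to_hard():
    """Raise the soft limit up to the hard limit (needs no extra privileges)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process (Linux only, reads /proc)."""
    return len(os.listdir("/proc/self/fd"))
```

Raising the hard limit itself requires administrator action outside the process, for example through the distribution's limits configuration.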
Comment 2 Pierre Ossman cendio 2015-03-03 11:11:21 CET
This bug is now about investigating these specific tracebacks, not a general fix for running out of file descriptors. We should look at these stacks and see if we can improve the error handling, e.g. by trying again later.
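
The "try again later" idea could look roughly like the sketch below. It is not ThinLinc code: `call_with_fd_retry`, its parameters and `make_socket` are hypothetical names for illustration, and it is written for Python 3. The blocking sleep is also an assumption made for brevity; in vsmserver's asynchronous main loop the retry would presumably be rescheduled as a delayed call instead.

```python
import errno
import socket
import time

# errno values meaning the process (EMFILE) or the system (ENFILE)
# has run out of file descriptors
FD_EXHAUSTED = (errno.EMFILE, errno.ENFILE)


def call_with_fd_retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); if it fails because descriptors are exhausted, wait and retry.

    Any other OSError, or exhaustion on the final attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except OSError as exc:
            if exc.errno not in FD_EXHAUSTED or attempt == attempts - 1:
                raise
            sleep(delay * (attempt + 1))  # linear back-off before the next attempt


def make_socket():
    """Example operation that needs a new descriptor."""
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)
```

A caller would wrap the failing operation, for example `sock = call_with_fd_retry(make_socket)`. Errors that are only a side effect of the exhaustion, such as the `gaierror` in the first traceback, carry a different errno and are deliberately not retried here.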
