Bug 7097 - vsmserver uses 100% CPU (broken control pipe)
Status: CLOSED FIXED
Product: ThinLinc
Component: VSM Server
Version: 1.3.1
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.10.0
Assigned To:
Reported: 2018-01-15 13:18 by
Modified: 2019-02-13 15:41
Acceptance Criteria:



Description From cendio 2018-01-15 13:18:02
I was setting up smart card authentication and somewhere during the process I
managed to get vsmserver in a state where it was consuming 100% CPU and
spamming this to the log:

> Unknown control command  received on control pipe
> Unknown control command  received on control pipe

No relevant log lines before that unfortunately, so I am not sure what the
trigger was.

Stopping the process also shows that something is wrong with the control pipe:

> 2018-01-15 13:02:03 INFO vsmserver: Got SIGTERM, signaling process to quit
> Traceback (most recent call last):
>   File "/opt/thinlinc/sbin/vsmserver", line 20, in <module>
>     VSMServer ( sys . argv )
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 156, in __init__
>     self . loop ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 418, in loop
>     if 31 - 31: iII111i . OOooOOo - ooOoO0o . OoooooooOO / OoooooooOO
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 170, in signal_handler
>     os . write ( self . control_pipe_write , "T" )
> OSError: [Errno 32] Broken pipe

This was seen on Fedora 27 with ThinLinc 4.8.0. I have failed to reproduce it
though.
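
For context, the control pipe in the traceback looks like the usual self-pipe pattern: the signal handler writes a single command byte to a pipe that the main select() loop watches. A minimal sketch of that pattern (only signal_handler, control_pipe_write and the "T" byte are taken from the traceback; the rest is guesswork, not the actual vsmserver code):

> import os, select, signal
>
> control_pipe_read, control_pipe_write = os.pipe()
>
> def signal_handler(signum, frame):
>     os.write(control_pipe_write, b"T")   # "T" = terminate, as in the traceback
>
> signal.signal(signal.SIGTERM, signal_handler)
>
> while True:
>     readable, _, _ = select.select([control_pipe_read], [], [])
>     if control_pipe_read in readable:
>         command = os.read(control_pipe_read, 1)
>         if command == b"T":
>             break
>         # If the fd behind control_pipe_read were hijacked by e.g. a closed
>         # socket, select() would keep reporting it readable and read() would
>         # return "" every time, so this loop would spin at 100% CPU -- and an
>         # empty command would also explain the double space in the log above.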
------- Comment #2 From cendio 2018-07-09 17:27:36 -------
I can't find a sane way for this bug to occur. There is nothing ever closing
that fifo.

My best guess right now is that we have a bug elsewhere which closes the wrong
fd and happens to nuke this one. It then gets replaced by something else (a
closed socket?). Rather unlikely, but it would match the observed behaviour.
------- Comment #4 From cendio 2018-10-16 13:35:25 -------
Note that the above suggestion also needs to solve how to break a select() that
is waiting for events. Either we need to start polling (i.e. set a maximum
timeout for select()), or we need to keep some file descriptor that we can make
active. The latter could still be made much simpler than the current system
though.

We also looked at using pselect() and relying on EINTR, but pselect() doesn't
exist in Python.
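
One shape the "keep some file descriptor" variant could take is signal.set_wakeup_fd() (available since Python 2.6), which makes the interpreter write a byte to an fd of our choosing whenever a signal arrives. A rough sketch, not a design decision, and it does nothing by itself against the fd getting clobbered:

> import fcntl, os, select, signal
>
> terminate = []
> signal.signal(signal.SIGTERM, lambda signum, frame: terminate.append(signum))
>
> # Dedicated wakeup pipe: the interpreter writes a byte here for every signal,
> # so a select() that also watches wakeup_read returns immediately.
> wakeup_read, wakeup_write = os.pipe()
> fcntl.fcntl(wakeup_write, fcntl.F_SETFL,
>             fcntl.fcntl(wakeup_write, fcntl.F_GETFL) | os.O_NONBLOCK)
> signal.set_wakeup_fd(wakeup_write)
>
> while not terminate:
>     try:
>         # A real loop would list the server sockets here as well.
>         readable, _, _ = select.select([wakeup_read], [], [])
>     except select.error:   # EINTR on Python 2; Python 3.5+ retries internally
>         continue
>     if wakeup_read in readable:
>         os.read(wakeup_read, 128)   # just drain; the Python handler has run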

There is also still the question of where this bug occurs. If it's in some
unrelated code then we might get other symptoms if we get rid of the control
pipe. Perhaps some other file descriptor will start randomly closing in that
case?


For now we'll wait and see if we can get more information on when this occurs
and hopefully get some clues to this mystery.
------- Comment #6 From cendio 2018-12-21 13:33:13 -------
To summarize yesterday's investigations and theories:

We believe the pipe is closed by libnss-sss when doing a username lookup from
Python.

* When our services daemonize, we close all open fds,
  including one presumably already opened by libnss-sss.

* Our services start using this fd for the control pipe.

* When another username lookup is performed, nss_sss takes
  over the fd again, closing the pipe.
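
A rough illustration of this sequence (it assumes "sss" is actually handling passwd lookups, and whether the last step really closes the pipe depends on the cache being invalidated in between, e.g. with sss_cache -U):

> import os, pwd
>
> # 1. First passwd lookup: libnss-sss mmaps its cache and keeps an fd open.
> pwd.getpwnam("root")
> print(sorted(os.listdir("/proc/self/fd"), key=int))
>
> # 2. daemon_init()-style cleanup: close every fd we did not open ourselves,
> #    including the cache fd that libnss-sss still thinks it owns.
> for fd in range(3, 256):
>     try:
>         os.close(fd)
>     except OSError:
>         pass
>
> # 3. The control pipe now gets the lowest free numbers -- the same numbers
> #    libnss-sss remembers for its cache.
> control_pipe_read, control_pipe_write = os.pipe()
> print(control_pipe_read, control_pipe_write)   # typically 3 and 4
>
> # 4. If the cache is invalidated before the next lookup, libnss-sss "cleans
> #    up" its stale fd numbers here -- closing our pipe instead.
> pwd.getpwnam("root")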


The commit from comment #5 does not seem to be related to the issue at hand.
------- Comment #7 From cendio 2018-12-21 14:36:42 -------
sssd also has a cache that clients access via an mmapped file. So they will
only mess with the socket if they get a cache miss.

Fortunately you can force it to ignore the cache by starting a program with
$SSS_NSS_USE_MEMCACHE=NO.

I've yet to reproduce the issue though. The key problem is getting an NSS call
to happen before we open the control pipe.
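
For example, something along these lines should force every lookup onto the socket path (username only as an example):

> $ SSS_NSS_USE_MEMCACHE=NO python -c "import pwd; pwd.getpwnam('derfian')"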
------- Comment #8 From cendio 2019-01-03 14:57:05 -------
The following test run produces an strace very similar to that given by the
customer:

> $ strace -vv python -c "import os, pwd; pwd.getpwnam('derfian'); os.close(3); pwd.getpwnam('derfian')"
------- Comment #9 From cendio 2019-01-03 15:01:03 -------
I think the cache might actually be important here. The strace shows an
munmap() right before the evil close(). This matches well with the cleanup of
the cache in nss-sss, so it looks like it is trying to close the cache and not
a socket.

Will have to dig further to understand when this happens.


I also found a bug related to the socket handling that might be able to cause
this. That sssd bug was fixed in sssd 1.6.1 and is not relevant in the reported
case, but older installations might have multiple ways of triggering the bug in
ThinLinc.
------- Comment #10 From cendio 2019-01-04 08:42:27 -------
So the python test case above shows this sequence:

> openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_dev=makedev(253, 0), st_ino=1323918, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16424, st_size=8406312, st_atime=1546526363 /* 2019-01-03T15:39:23.979942098+0100 */, st_atime_nsec=979942098, st_mtime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_mtime_nsec=699373353, st_ctime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_ctime_nsec=699373353}) = 0
> mmap(NULL, 8406312, PROT_READ, MAP_SHARED, 3, 0) = 0x7fde90df1000
> fstat(3, {st_dev=makedev(253, 0), st_ino=1323918, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16424, st_size=8406312, st_atime=1546526448 /* 2019-01-03T15:40:48.976367300+0100 */, st_atime_nsec=976367300, st_mtime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_mtime_nsec=699373353, st_ctime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_ctime_nsec=699373353}) = 0

Here nss-sss initially opens the cache as a result of the getpwnam() call.

> close(3)                                = 0

Here we deliberately screw up the cache file descriptor for nss-sss.

> fstat(3, 0x7fff6da21790)                = -1 EBADF (Bad file descriptor)

So the next call to getpwnam() gets a failure trying to verify the file
descriptor.

> munmap(0x7fde90df1000, 8406312)         = 0
> close(3)                                = -1 EBADF (Bad file descriptor)

Which results in it trying to free the cache, meaning unmapping the cache and
closing the file descriptor (even though it is already closed).

> getpid()                                = 21609
> fstat(-1, 0x7fff6da21870)               = -1 EBADF (Bad file descriptor)
> getpid()                                = 21609
> socket(AF_UNIX, SOCK_STREAM, 0)         = 3
> fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
> fcntl(3, F_GETFD)                       = 0
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> connect(3, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/nss"}, 110) = 0

This is then the socket fallback being used as it had problems with the cache.
------- Comment #11 From cendio 2019-01-04 08:47:00 -------
Compare this with the sequence from the customer report:

> getpeername(6, {sa_family=AF_INET, sin_port=htons(1022), sin_addr=inet_addr("xxxxxxxxx")}, [16]) = 0

This is the low port check in vsmagent, just before it calls the handler. And
the next thing we see is when ReqSessionHandler calls getpwnam() to normalise
the username:

> munmap(0x7fb43c40e000, 8406312)         = 0
> close(4)                                = 0

This is the cleanup sequence for the nss-sss cache. Note that there is no call
to fstat() before this. This means the mapped memory failed to validate, since
that is checked before it looks at the file descriptor (see
sss_nss_check_header() in nss-sss).

> getpid()                                = 1061
> fstat(-1, 0x7ffd0e1347d0)               = -1 EBADF (Ungültiger Dateideskriptor)
> getpid()                                = 1061
> socket(AF_UNIX, SOCK_STREAM, 0)         = 4
> fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
> fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
> fcntl(4, F_GETFD)                       = 0
> fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
> connect(4, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/nss"}, 110) = 0

And then we have the socket fallback again since it considered the cache bad
somehow.
------- Comment #12 From cendio 2019-01-04 08:49:18 -------
What we can learn from this is that the final triggering condition is the sss
cache going bad somehow. We have yet to work out when that happens. Perhaps
just when the cache gets invalidated or rotated? Restart of sssd?

It doesn't matter terribly much though as this should work fine anyway. The key
issue is why nss-sss thinks the cache file descriptor is the same as the
control pipe (i.e. 4). They should have independent fds.

We need earlier traces to see when and how this mixup occurs.
------- Comment #13 From cendio 2019-01-04 13:19:18 -------
Alright, so I've understood what is going on now. The key ingredients are:

 * $HOME must not be set when vsmserver/vsmagent starts

 * "sss" must be configured before "files" in nsswitch.conf

This is true on very modern systems as they want to use sssd's cache even for
local users. This requires sssd 1.16 or newer. RHEL 7 has a new enough version
but does not ship an nsswitch.conf that uses it, i.e. its default configuration
does not hit this bug.

What happens under these conditions is that Python wants to find
~/.local/lib/python2.x/site-packages. But since it doesn't have a $HOME, it
needs to do a getpwuid(getuid()) call to figure out the home directory. This
opens the cache and sets up the fd (which is then messed up by daemon_init()).
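
That lookup is essentially os.path.expanduser("~") falling back to NSS when $HOME is missing; roughly:

> import os, pwd
>
> # Roughly what expanduser("~") does during the user site-packages setup:
> # with no $HOME it has to ask NSS, which opens the sss cache fd.
> if "HOME" in os.environ:
>     home = os.environ["HOME"]
> else:
>     home = pwd.getpwuid(os.getuid()).pw_dir
> print(home)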

I'm not sure how the cache is then invalidated. I can easily provoke things
using the sss_cache tool, but I doubt that is what happens on the live systems.
I don't fully understand the cache handling though, so I'm having trouble
finding the conditions for when this can trigger. We'll have to assume it
happens under many real world conditions as we have both the customer report
and my initial description.
------- Comment #14 From cendio 2019-01-04 13:35:01 -------
You can see the problem before things actually explode by looking at the fds
and maps of the process:

> $ sudo ls -l /proc/`pgrep -f vsmagent`/fd
> total 0
> lr-x------. 1 root root 64 Jan  4 13:33 0 -> /dev/null
> l-wx------. 1 root root 64 Jan  4 13:33 1 -> /var/log/vsmagent.log
> l-wx------. 1 root root 64 Jan  4 13:33 2 -> /var/log/vsmagent.log
> lr-x------. 1 root root 64 Jan  4 13:33 3 -> 'pipe:[16229793]'
> l-wx------. 1 root root 64 Jan  4 13:33 4 -> 'pipe:[16229793]'
> lrwx------. 1 root root 64 Jan  4 13:33 5 -> 'socket:[16229798]'
> 
> $ sudo grep lib/sss /proc/`pgrep -f vsmagent`/maps
> 7fdfc86be000-7fdfc8ec3000 r--s 00000000 fd:00 1324160                    /var/lib/sss/mc/passwd

The cache file (/var/lib/sss/mc/passwd) should never be mapped without a
corresponding fd open to it. So this state shows a process that will get
screwed up as soon as the cache gets invalidated.
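
The same check can be scripted from inside (or against) a process; a small sketch along the lines of the commands above (the helper name is made up):

> import os
>
> def sss_cache_fd_missing(pid="self"):
>     """True if /var/lib/sss/mc/* is mapped but no open fd points to it."""
>     with open("/proc/%s/maps" % pid) as maps:
>         mapped = any("/var/lib/sss/mc/" in line for line in maps)
>     open_to_cache = False
>     fd_dir = "/proc/%s/fd" % pid
>     for fd in os.listdir(fd_dir):
>         try:
>             target = os.readlink(os.path.join(fd_dir, fd))
>         except OSError:      # fd vanished between listdir() and readlink()
>             continue
>         if "/var/lib/sss/mc/" in target:
>             open_to_cache = True
>     return mapped and not open_to_cache
>
> print(sss_cache_fd_missing())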
------- Comment #15 From cendio 2019-01-04 13:56:02 -------
Web Access and tlwebadm are also affected by this bug, but not in the same way.
In both servers fd 4 will be a new incoming connection. This gets forked off to
a child that does all the meaty stuff. If that child does a getpw*() call, then
it will drop the TCP connection (if the cache has been invalidated). In the
browser you generally get some generic error message in these cases.

A key point to note here is that each child inherits the cache state from the
parent. So once the cache is invalidated it will remain so until
tlwebaccess/tlwebadm is restarted. The parent doesn't do any getpw*() calls
itself to get the cache re-validated.

Web Access becomes entirely unusable at this point as it has to do a user
lookup for the login. tlwebadm mostly works as it does very few such calls. The
only places are when editing TLDC entries or doing test lookups on the health
page.
------- Comment #19 From cendio 2019-01-07 13:41:10 -------
All fixed. Tester should make sure the bug can no longer be provoked by:

 * Configuring "passwd: sss files" in nsswitch.conf on a modern system (sssd
1.16+)

 * Running sss_cache -U after the services have started

General testing should also be done:

 * Starting a session

 * Logging in via web access

 * User lookup in tlwebadm
------- Comment #20 From cendio 2019-01-09 12:57:02 -------
It seems like this can trigger whenever /etc/passwd or /etc/group changes. So
that could be what's been happening on the client system.
------- Comment #21 From cendio 2019-02-13 15:41:14 -------
Verified that I could reproduce the problem on Fedora 29 with:
 * sssd-2.0.0.4
 * sss before files for passwd & groups in /etc/nsswitch.conf
 * ThinLinc server build 5957

- 100% CPU for the vsmagent & vsmserver processes on the server
- Spam in logs: "Unknown control command received on control pipe"
- Couldn't login using Web Access: ERR_EMPTY_RESPONSE

Verified that it works with build 6040 on same server, same config:
 * Web Access login
 * Native client login
 * Web Adm
 * No spam in log
 * Sane CPU usage