
Bug 6125

Summary: Unable to login due to excessive (audio?) connections to ssh
Product: ThinLinc
Component: Other
Version: trunk
Status: CLOSED FIXED
Severity: Normal
Priority: P2
Target Milestone: 4.8.0
Hardware: PC
OS: Unknown
Reporter: Karl Mikaelsson <derfian@cendio.se>
Assignee: Karl Mikaelsson <derfian@cendio.se>
QA Contact: Bugzilla mail exporter <bugzilla-qa@cendio.se>
CC: botsch@cnf.cornell.edu, ossman@cendio.se, samuel@cendio.se
Keywords: ossman_tester, relnotes
Acceptance Criteria:
Attachments: strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts

Description From cendio 2017-01-09 15:34:24
Report from a customer:

Once in a while, the master stops communicating with the agent, so people
will be denied reconnecting to their sessions.

/var/log/vsmserver.log:
> 2017-01-05 10:10:20 WARNING vsmserver: Timeout trying to verify session for user on VSM Agent host:904
> 2017-01-05 10:10:20 INFO vsmserver.session: User with uid 1000 (user) requested a new session
> 2017-01-05 10:10:20 WARNING vsmserver.session: User user has 1 sessions, only 1 are allowed per user. Denying start of new session

> So, after enabling DEBUG2 in a few places... it looks like on each
> attempt to reconnect to a session, part of checking the session status
> involves running "netstat -n -A inet -A inet6 -t" on the agent.
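
As a rough illustration of why that check degrades: a status probe built on netstat has to read and scan every socket the kernel reports, including the TIME_WAIT flood described below. The sketch is hypothetical (the function name and port handling are made up; this is not ThinLinc's actual code), assuming a check that merely looks for an established connection to a session port:

# Hypothetical sketch, not ThinLinc's actual code: a session check that shells
# out to netstat the way the log above suggests. netstat prints every socket,
# so a pile of TIME_WAIT entries makes both the command and the parsing slow.
import subprocess

def session_has_client(port):
    """Return True if an ESTABLISHED connection exists on the given local port."""
    out = subprocess.check_output(
        ["netstat", "-n", "-A", "inet", "-A", "inet6", "-t"],
        universal_newlines=True)
    suffix = ":%d" % port
    for line in out.splitlines():
        fields = line.split()
        # Data rows: proto recv-q send-q local-address foreign-address state
        if len(fields) >= 6 and fields[5] == "ESTABLISHED" \
                and fields[3].endswith(suffix):
            return True
    return False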

Here, something is filling up /var/log/secure with "open failed: connect
failed: Connection refused" messages:

> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused


> This fills up the return of netstat with a boatload of TIME_WAIT
> entries, so netstat can take a little while to return.


> In addition to all those timeouts on connect, it looks like something,
> whilst a session is connected, is sitting there and opening connections
> to the local sshd for a particular user.

> [root@host ~]# netstat -anp | grep 4926 | grep -v WAIT
> tcp        0      0 127.0.0.1:4926              0.0.0.0:*      LISTEN      28920/sshd          
> tcp        0      0 127.0.0.1:59732             127.0.0.1:4926 ESTABLISHED -                   
> tcp        0     40 127.0.0.1:64894             127.0.0.1:4926 ESTABLISHED -                   
> tcp        0      1 127.0.0.1:46892             127.0.0.1:4926 SYN_SENT    -                   
> tcp        0      0 ::1:4926                    :::*           LISTEN      28920/sshd          

> [root@host ~]# ps -ef | grep 28920
> user2    28920 17074 19 09:57 ?        00:03:33 sshd: user2@notty 

> What's it doing, that it's sitting here repeatedly opening up
> connections to the same sshd?

> Well, I think I finally tracked it down, at least why the sshd process
> was going bananas... stracing of it showed:
>
> ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffdce201760) = -1 EINVAL (Invalid argument)
>
> lots of those, as if something screwed up and it could no longer get to
> whatever audio device.
>
> I started a brand new session and that error is not there and there
> aren't a zillion ssh connections.
------- Comment #2 From cendio 2017-01-09 15:58:13 -------
Also note bug 5982 where we are looking at a netstat replacement.
------- Comment #3 From cendio 2017-01-10 13:21:30 -------
The customer reported that the issue was encountered on a RHEL 6.8 system with
ThinLinc 4.7.0.
------- Comment #4 From cendio 2017-01-10 14:32:59 -------
The customer was using netstat from net-tools version 1.60, which is very old. This
means the issue could not have been caused by a netstat update.
------- Comment #5 From cendio 2017-01-11 16:12:01 -------
Did some more digging. There are two issues here:

 a) Reading /proc/net/tcp is slow

 b) netstat also does some more processing per entry that is slow

So skipping netstat and reading /proc/net/tcp directly gives some improvement, but
doesn't fully solve the issue.

The cause of the massive amount of sockets is probably bug 6133.


The proper fix seems to be to switch to getting sockets via Netlink (directly
or via "ss"). There is no performance problem with that method.
------- Comment #6 From cendio 2017-01-30 10:56:37 -------
> The proper fix seems to be to switch to getting sockets via Netlink (directly
> or via "ss"). There is no performance problem with that method.

Non-exhaustive list of netstat alternatives I've found so far:

Pure Python:

 ❌ socket.AF_NETLINK (Python >= 2.5)

 ❌ gnlpy (Python >= 2.6)
   https://github.com/facebook/gnlpy

 ❌ pyroute2 (Python >= 2.7)
   https://pypi.python.org/pypi/pyroute2

 ❌ NetLink (Python >= 3)
   https://pypi.python.org/pypi/NetLink/

Subprocesses:

 - ss (iproute2)
   https://wiki.linuxfoundation.org/networking/iproute2
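
A hedged sketch of the subprocess option, using the same ss arguments that show up later in the strace in comment #24 ("ss -n -t state established"); the helper name and the parsing are illustrative assumptions, not the final implementation:

# Sketch only: list established TCP connections by shelling out to ss.
# With a state filter, ss drops the State column, so data rows are
# Recv-Q Send-Q Local-Address:Port Peer-Address:Port (the layout may
# vary slightly between iproute2 versions).
import subprocess

def established_tcp():
    out = subprocess.check_output(
        ["ss", "-n", "-t", "state", "established"],
        universal_newlines=True)
    pairs = []
    for line in out.splitlines()[1:]:    # first line is the column header
        fields = line.split()
        if len(fields) >= 4:
            pairs.append((fields[2], fields[3]))   # (local, peer)
    return pairs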
------- Comment #7 From cendio 2017-01-30 10:58:31 -------
See also: bug 6142.
------- Comment #8 From cendio 2017-01-30 14:06:19 -------
ss availability on select distributions

| Dist         | Present | Package Name |
|--------------+---------+--------------|
| RHEL 5       | ✓       | iproute      |
| RHEL 6       | ✓       | iproute      |
| RHEL 7       | ✓       | iproute      |
|--------------+---------+--------------|
| Ubuntu 14.04 | ✓       | iproute2     |
| Ubuntu 16.04 | ✓       | iproute2     |
| Ubuntu 16.10 | ✓       | iproute2     |
|--------------+---------+--------------|
| SLES 10      | ✓       | iproute2     |
| SLES 11      | ✓       | iproute2     |
| SLES 12      | ✓       | iproute2     |
|--------------+---------+--------------|
------- Comment #9 From 2017-01-30 19:23:09 -------
I just reproduced the issue, again.

In both cases, the ThinLinc session was started from my home Fedora 25 machine
over a Cisco VPN connection. Disconnected session there and then connected to
it here at work from a thinstation, and that's when we get the multiple
connection attempts to sshd.

So, I can provide strace output from that sshd.

I have not yet had the problem when starting a session at work on a thinstation
and switching between thinstations.

I also haven't specifically tested starting at work on my RHEL6 box and then
switching to a thinstation.

At least this time, the Cisco VPN disconnected me in the middle of sessions
several times while at home. I would hope that didn't contribute to it. I
checked netstat a lot from home and never saw the issue pop up.
------- Comment #10 From 2017-01-30 19:24:40 -------
Created an attachment (id=770) [details]
strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection
attempts
------- Comment #11 From 2017-01-30 22:29:51 -------
I found a systemtap script from
http://unix.stackexchange.com/questions/186932/log-pid-of-each-connection-attempt
that shows me connect attempts.

First thing I'm noticing... if I start a new session on my RHEL6 desktop here,
and then disconnect that session, the escd I had running instantly goes bonkers
trying to connect to my sshd which is no longer there.

eg (systemtap output):

connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
[... the same line repeated continuously ...]

[root@universe ~]# ps -fu dwb7 |grep escd
dwb7      4519     1 82 16:17 ?        00:02:55 ./escd
--key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

While this doesn't appear to fill up netstat (since there is nothing for escd
to connect to), I imagine it could lead to file handle exhaustion. It would
appear that at *some* point in the future, escd finally stops.

If I reconnect, now on the thinstation, my sshd then starts back up, and for
whatever reason seems to be listening on port 4926... same as before.

And escd is now sitting there still doing zillions of connection attempts, but
now resulting in lots of TIME_WAIT states and filling up the netstat.

[root@universe ~]# lsof -nPi tcp:4926
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    15829 dwb7   12u  IPv6 2417213418      0t0  TCP [::1]:4926 (LISTEN)
sshd    15829 dwb7   13u  IPv4 2417213420      0t0  TCP 127.0.0.1:4926 (LISTEN)

Disconnect, and I'm back to escd trying to talk to the non-existent sshd.

Reconnect on my desktop, and escd is now happy and not trying to connect all
over the place.

Disconnect on my desktop, and escd goes nuts again, trying to connect to the
non-existent sshd.

If I *start* the session on the ThinStation, I don't see any of this behaviour
with escd, even though it is running:

dwb7     15711     1  0 16:27 ?        00:00:00 ./escd
--key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"
------- Comment #21 From cendio 2017-02-03 16:32:17 -------
Switching from netstat to ss will also solve bug 5982.
------- Comment #23 From cendio 2017-02-06 10:37:37 -------
That should be it - all netstat calls have been replaced with ss, which should
not exhibit the same poor performance with many simultaneous connections. To be
clear, this doesn't fix applications that make a lot of connections, but it
should no longer be a problem.
------- Comment #24 From cendio 2017-02-06 14:51:07 -------
The connection status check is now very fast. While netstat takes many seconds:

> [root@dhcp-253-21 tl-4.7.0post-server]# time netstat -n -t | wc -l
> 12578
> 
> real	0m12.156s
> user	0m0.101s
> sys	0m2.000s

ss is extremely fast:

> [pid  2442] 14:37:31.037513 execve("/usr/sbin/ss", ["ss", "-n", "-t", "state", "established"], [/* 22 vars */]) = 0
> [pid  2442] 14:37:31.055406 +++ exited with 0 +++

However, the invocation for unbinding ports is still concerning:

> [pid  2462] 14:37:31.187538 execve("/usr/sbin/ss", ["ss", "-n", "-t", "-a", "-p"], [/* 22 vars */]) = 0
> [pid  2462] 14:37:49.116805 +++ exited with 0 +++

That's a good 18 seconds. Not enough to trigger a timeout, but a fairly hefty
wait for the user.

Using ss's filters seems to make quite a dent in the time needed. Perhaps we can
use that?
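
A hedged sketch of what the filter idea could look like: restrict the query to the specific ports of interest, and only pay the /proc-walking cost of -p when one of them is actually occupied. The function name, port handling and exact filter expression are illustrative assumptions, not the implemented change:

# Sketch only: check a handful of ports with an ss filter expression instead
# of dumping every socket with "ss -n -t -a -p".
import subprocess

def who_occupies(ports):
    """Return ss output (with process info) for listeners on `ports`, or '' if free."""
    expr = "( %s )" % " or ".join("sport = :%d" % p for p in ports)
    base = ["ss", "-n", "-t", "-l"]
    # Cheap first pass without -p, so ss never has to dig through /proc.
    if not subprocess.check_output(base + [expr],
                                   universal_newlines=True).splitlines()[1:]:
        return ""
    # Something is listening on one of the ports: now ask again with -p.
    return subprocess.check_output(base + ["-p", expr], universal_newlines=True)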
------- Comment #25 From cendio 2017-02-06 15:01:14 -------
Side note: I can't reproduce these issues on my Fedora 25 machine, so it seems
modern kernels are more efficient. ss is still way faster than netstat, but
netstat never seems to get into those absurd conditions.
------- Comment #28 From cendio 2017-02-14 14:54:36 -------
Retested, and it works extremely well. For the "normal" case of no one occupying
ports it completely avoids digging around in /proc and is as fast as a normal
invocation of ss. Otherwise it still finishes in less than a second.