www.cendio.com
Bug 7231 - ThinLinc client frequently fails to connect to agent
: ThinLinc client frequently fails to connect to agent
Status: CLOSED FIXED
: ThinLinc
Client
: trunk
: PC Unknown
: P2 Normal
: 4.10.0
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2018-08-21 08:25 by
Modified: 2019-06-12 11:59 (History)
Acceptance Criteria:


Attachments
Fix issue with select for fifos (3.37 KB, patch)
2018-08-24 14:44, Henrik Andersson
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From cendio 2018-08-21 08:25:30
Under some circumstances the ThinLinc client fails to connect to the agent
server leaving following line in the log file:

   ssh[E]: select: An operation was attempted on something that is not a
socket.

and then ssh client is terminating.


Here follows a debug log for a successful connection:

  ssh[E]: debug1: Sending command: thinlinc-login dummy
  ssh[E]: THINLINC-LOGIN: HELLO 4.9.0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype eow@openssh.com
reply 0
  ssh[E]: 
  ssh[E]: COMMAND_EXITSTATUS: 0

And when it fails the following debug log is retreived:

  ssh[E]: debug1: Sending command: thinlinc-login dummy
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype eow@openssh.com
reply 0
  ssh[E]: debug1: channel 0: free: port listener, nchannels 3
  ssh[E]: debug1: channel 1: free: port listener, nchannels 2
  ssh[E]: debug1: channel 2: free: client-session, nchannels 1
  ssh[E]: debug1: fd 0 clearing O_NONBLOCK
  ssh[E]: debug1: fd 1 clearing O_NONBLOCK
  ssh[E]: debug1: fd 2 clearing O_NONBLOCK
  ssh[E]: select: An operation was attempted on something that is not a socket.
  ssh[E]: Transferred: sent 2516, received 3124 bytes, in f seconds
  ssh[E]: Bytes per second: sent f, received f
  ssh[E]: debug1: Exit status 0
  WinProcess: Terminating process 6684 (ssh.exe)...
------- Comment #2 From cendio 2018-08-21 08:35:40 -------
Gathered information from related support issues:

  - Issue only affects Windows client platform

  - Problem only occurs while client is connected to WiFi network,
    if connected to physical network the problem goes away

  - Issue does not affect all users with windows client platform

  - Users with problems retries the connection for 30 minutes and
    will eventually do a successful connection

  - Using ThinLinc client 4.8.1 solves the issue
------- Comment #3 From cendio 2018-08-21 08:37:52 -------
I haven't been able to reproduce the issue for debugging, so I have looked into
code.

Found commit r32964 which seems related (rewrite of Win32 select() emualtion)
that was a change for 4.9.0 release.
------- Comment #4 From cendio 2018-08-21 08:40:41 -------
(In reply to comment #3)

> Found commit r32964 which seems related (rewrite of Win32 select()
> emualtion) that was a change for 4.9.0 release.
>

I reverted this commit and rebuilt ssh.exe to sent it to the two customers
for testing. One replied with this:

 "I tried out the modified ssh.exe and as far as I can tell it does appear to
  fix the problem.  It's a little hard to be 100% sure because the conditions
  to create the failure are not precisely known and so the failure is not
  always reproducible.  In my case however, I did find the first time I tried
  to connect with the original unmodified ssh.exe, I had the failure occur.
  Then I tried 10 times with the modified ssh.exe and the thinlinc client
  connected correctly all 10 times.  Hopefully your other customer will also
  get the same promising results."

Which sounds good enough to dive into the code change in r32964 for bugs that
could trigger the issue.
------- Comment #5 From cendio 2018-08-24 10:58:48 -------
Commit r32964 removes a check for if a fd is a FIFO in the write set. This
means that if write fd set only have a FIFO fd and that the read set is empty
or only contains non tty/fifo fd(s), a select() call will be made on the FIFO
which is invalid.
------- Comment #6 From cendio 2018-08-24 11:07:17 -------
(In reply to comment #5)
> Commit r32964 removes a check for if a fd is a FIFO in the write set. This
> means that if write fd set only have a FIFO fd and that the read set is empty
> or only contains non tty/fifo fd(s), a select() call will be made on the FIFO
> which is invalid.
>

I added a debug log for 'client-session' creation and dumping write and read fd
sets that produces the following log
for when the issue appears:

  2018-08-22T15:26:54: ssh[E]: debug1: channel 2: new [client-session]
  2018-08-22T15:26:54: ssh[E]: debug1: ssh_session2_open: 'client-session'
channel using fds in(3), out(4), err(5)
  2018-08-22T15:26:54: ssh[E]: debug1: ssh_session2_open: channel_new: 2
  .
  .
  .
  2018-08-22T15:26:54: ssh[E]: debug1: client_input_channel_req: channel 2
rtype exit-status reply 0
  2018-08-22T15:26:54: ssh[E]: debug1: client_input_channel_req: channel 2
rtype eow@openssh.com reply 0
  2018-08-22T15:26:54: ssh[E]: debug1: win32_select writefds(5)
  2018-08-22T15:26:54: ssh[E]: debug1: win32_select readfds(436, 440, 424)
  2018-08-22T15:26:54: ssh[E]: debug1: channel 0: free: port listener,
nchannels 3
  2018-08-22T15:26:54: ssh[E]: debug1: channel 1: free: port listener,
nchannels 2
  2018-08-22T15:26:54: ssh[E]: debug1: channel 2: free: client-session,
nchannels 1
  2018-08-22T15:26:54: ssh[E]: debug1: fd 0 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: debug1: fd 1 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: debug1: fd 2 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: select: En åtgärd försökte utföras på något som
inte är en socket.
  2018-08-22T15:26:54: ssh[E]: Transferred: sent 1804, received 2916 bytes, in
f seconds
  2018-08-22T15:26:54: ssh[E]: Bytes per second: sent f, received f
  2018-08-22T15:26:54: ssh[E]: debug1: Exit status 0


fd 5 is connected to a FIFO and my conclusion is that all fds over 400 are
sockets.

This means that this bug as described in comment #5 is the reason for the
failed select()
------- Comment #7 From cendio 2018-08-24 14:44:38 -------
Created an attachment (id=884) [details]
Fix issue with select for fifos
------- Comment #8 From cendio 2018-08-27 12:42:26 -------
Peter have tested the fix on his system which had a very how rating of
reproducing the issue, one out of two tests fail. He verified that he couldn't
reproduce the issue with the fix (attached patached).
------- Comment #9 From cendio 2018-08-29 09:20:48 -------
We've gotten confirmation from one customer that the patch solves the issue.
------- Comment #12 From cendio 2018-10-17 15:32:37 -------
Customers have reported seeing this issue even for active connections and not
only for new ones.
------- Comment #14 From cendio 2018-10-17 16:57:15 -------
(In reply to comment #12)
> Customers have reported seeing this issue even for active connections and not
> only for new ones.

Disregard this comment, it seems like another bug.
------- Comment #16 From cendio 2018-10-29 15:35:55 -------
I saw this issue on Windows 10 when connecting from our internal network out to
eudemo.thinlinc.com with a regular wired connection.
------- Comment #18 From cendio 2019-01-24 13:20:43 -------
The triggering condition seems to be that the final output from thinlinc-login
and the command terminating happens at the same time. I.e. the ssh client will
see both the output and the exit in the protocol stream without having to call
select() to get more data.

I have not figured out a good way to cause this buffering with an unmodified
client. But patching ssh easily provokes. Simply put a "Sleep(500);" at the top
of win32_select() and the bug will appear.
------- Comment #20 From cendio 2019-01-24 14:13:03 -------
Should be fixed now. Tester probably needs to use a modified ssh client, or
figure out some other way of provoking the bug.
------- Comment #21 From cendio 2019-02-18 12:49:49 -------
With 4.9.0, I can reproduce the problem on my laptop with Windows 7, if
connected via mobile phone tethering and downloading web pages during login. 2
out of 3 login fails. With 4.9.0post build 6045, it works correctly.