Bug 7231 - ThinLinc client frequently fails to connect to agent
Summary: ThinLinc client frequently fails to connect to agent
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Client (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.10.0
Assignee: Pierre Ossman
URL:
Keywords: relnotes
Depends on:
Blocks:
 
Reported: 2018-08-21 08:25 CEST by Henrik Andersson
Modified: 2020-08-31 13:41 CEST (History)
2 users (show)

See Also:
Acceptance Criteria:


Attachments
Fix issue with select for fifos (3.37 KB, patch)
2018-08-24 14:44 CEST, Henrik Andersson
Details

Description Henrik Andersson cendio 2018-08-21 08:25:30 CEST
Under some circumstances the ThinLinc client fails to connect to the agent server leaving following line in the log file:

   ssh[E]: select: An operation was attempted on something that is not a socket.

and then ssh client is terminating.


Here follows a debug log for a successful connection:

  ssh[E]: debug1: Sending command: thinlinc-login dummy
  ssh[E]: THINLINC-LOGIN: HELLO 4.9.0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0
  ssh[E]: 
  ssh[E]: COMMAND_EXITSTATUS: 0

And when it fails the following debug log is retreived:

  ssh[E]: debug1: Sending command: thinlinc-login dummy
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
  ssh[E]: debug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0
  ssh[E]: debug1: channel 0: free: port listener, nchannels 3
  ssh[E]: debug1: channel 1: free: port listener, nchannels 2
  ssh[E]: debug1: channel 2: free: client-session, nchannels 1
  ssh[E]: debug1: fd 0 clearing O_NONBLOCK
  ssh[E]: debug1: fd 1 clearing O_NONBLOCK
  ssh[E]: debug1: fd 2 clearing O_NONBLOCK
  ssh[E]: select: An operation was attempted on something that is not a socket.
  ssh[E]: Transferred: sent 2516, received 3124 bytes, in f seconds
  ssh[E]: Bytes per second: sent f, received f
  ssh[E]: debug1: Exit status 0
  WinProcess: Terminating process 6684 (ssh.exe)...
Comment 2 Henrik Andersson cendio 2018-08-21 08:35:40 CEST
Gathered information from related support issues:

  - Issue only affects Windows client platform

  - Problem only occurs while client is connected to WiFi network,
    if connected to physical network the problem goes away

  - Issue does not affect all users with windows client platform

  - Users with problems retries the connection for 30 minutes and
    will eventually do a successful connection

  - Using ThinLinc client 4.8.1 solves the issue
Comment 3 Henrik Andersson cendio 2018-08-21 08:37:52 CEST
I haven't been able to reproduce the issue for debugging, so I have looked into code.

Found commit r32964 which seems related (rewrite of Win32 select() emualtion) that was a change for 4.9.0 release.
Comment 4 Henrik Andersson cendio 2018-08-21 08:40:41 CEST
(In reply to comment #3)

> Found commit r32964 which seems related (rewrite of Win32 select()
> emualtion) that was a change for 4.9.0 release.
>

I reverted this commit and rebuilt ssh.exe to sent it to the two customers
for testing. One replied with this:

 "I tried out the modified ssh.exe and as far as I can tell it does appear to
  fix the problem.  It's a little hard to be 100% sure because the conditions
  to create the failure are not precisely known and so the failure is not
  always reproducible.  In my case however, I did find the first time I tried
  to connect with the original unmodified ssh.exe, I had the failure occur.
  Then I tried 10 times with the modified ssh.exe and the thinlinc client
  connected correctly all 10 times.  Hopefully your other customer will also
  get the same promising results."

Which sounds good enough to dive into the code change in r32964 for bugs that
could trigger the issue.
Comment 5 Henrik Andersson cendio 2018-08-24 10:58:48 CEST
Commit r32964 removes a check for if a fd is a FIFO in the write set. This means that if write fd set only have a FIFO fd and that the read set is empty or only contains non tty/fifo fd(s), a select() call will be made on the FIFO which is invalid.
Comment 6 Henrik Andersson cendio 2018-08-24 11:07:17 CEST
(In reply to comment #5)
> Commit r32964 removes a check for if a fd is a FIFO in the write set. This
> means that if write fd set only have a FIFO fd and that the read set is empty
> or only contains non tty/fifo fd(s), a select() call will be made on the FIFO
> which is invalid.
>

I added a debug log for 'client-session' creation and dumping write and read fd sets that produces the following log
for when the issue appears:

  2018-08-22T15:26:54: ssh[E]: debug1: channel 2: new [client-session]
  2018-08-22T15:26:54: ssh[E]: debug1: ssh_session2_open: 'client-session' channel using fds in(3), out(4), err(5)
  2018-08-22T15:26:54: ssh[E]: debug1: ssh_session2_open: channel_new: 2
  .
  .
  .
  2018-08-22T15:26:54: ssh[E]: debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
  2018-08-22T15:26:54: ssh[E]: debug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0
  2018-08-22T15:26:54: ssh[E]: debug1: win32_select writefds(5)
  2018-08-22T15:26:54: ssh[E]: debug1: win32_select readfds(436, 440, 424)
  2018-08-22T15:26:54: ssh[E]: debug1: channel 0: free: port listener, nchannels 3
  2018-08-22T15:26:54: ssh[E]: debug1: channel 1: free: port listener, nchannels 2
  2018-08-22T15:26:54: ssh[E]: debug1: channel 2: free: client-session, nchannels 1
  2018-08-22T15:26:54: ssh[E]: debug1: fd 0 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: debug1: fd 1 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: debug1: fd 2 clearing O_NONBLOCK
  2018-08-22T15:26:54: ssh[E]: select: En åtgärd försökte utföras på något som inte är en socket.
  2018-08-22T15:26:54: ssh[E]: Transferred: sent 1804, received 2916 bytes, in f seconds
  2018-08-22T15:26:54: ssh[E]: Bytes per second: sent f, received f
  2018-08-22T15:26:54: ssh[E]: debug1: Exit status 0


fd 5 is connected to a FIFO and my conclusion is that all fds over 400 are sockets.

This means that this bug as described in comment #5 is the reason for the failed select()
Comment 7 Henrik Andersson cendio 2018-08-24 14:44:38 CEST
Created attachment 884 [details]
Fix issue with select for fifos
Comment 8 Henrik Andersson cendio 2018-08-27 12:42:26 CEST
Peter have tested the fix on his system which had a very how rating of reproducing the issue, one out of two tests fail. He verified that he couldn't reproduce the issue with the fix (attached patached).
Comment 9 Pierre Ossman cendio 2018-08-29 09:20:48 CEST
We've gotten confirmation from one customer that the patch solves the issue.
Comment 12 Samuel Mannehed cendio 2018-10-17 15:32:37 CEST
Customers have reported seeing this issue even for active connections and not only for new ones.
Comment 14 Samuel Mannehed cendio 2018-10-17 16:57:15 CEST
(In reply to comment #12)
> Customers have reported seeing this issue even for active connections and not
> only for new ones.

Disregard this comment, it seems like another bug.
Comment 16 Samuel Mannehed cendio 2018-10-29 15:35:55 CET
I saw this issue on Windows 10 when connecting from our internal network out to eudemo.thinlinc.com with a regular wired connection.
Comment 18 Pierre Ossman cendio 2019-01-24 13:20:43 CET
The triggering condition seems to be that the final output from thinlinc-login and the command terminating happens at the same time. I.e. the ssh client will see both the output and the exit in the protocol stream without having to call select() to get more data.

I have not figured out a good way to cause this buffering with an unmodified client. But patching ssh easily provokes. Simply put a "Sleep(500);" at the top of win32_select() and the bug will appear.
Comment 20 Pierre Ossman cendio 2019-01-24 14:13:03 CET
Should be fixed now. Tester probably needs to use a modified ssh client, or figure out some other way of provoking the bug.
Comment 21 Peter Åstrand cendio 2019-02-18 12:49:49 CET
With 4.9.0, I can reproduce the problem on my laptop with Windows 7, if connected via mobile phone tethering and downloading web pages during login. 2 out of 3 login fails. With 4.9.0post build 6045, it works correctly.

Note You need to log in before you can comment on or make changes to this bug.