www.cendio.com
Bug 5618 - multi-threaded VNC rect decoding
: multi-threaded VNC rect decoding
Status: CLOSED FIXED
: ThinLinc
VNC
: pre-1.0
: PC Unknown
: P2 Normal
: 4.6.0
Assigned To:
:
:
: 5360
: 5106
  Show dependency treegraph
 
Reported: 2015-08-25 15:35 by
Modified: 2016-05-06 07:56 (History)
Acceptance Criteria:


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From cendio 2015-08-25 15:35:01
We should try to make use of more than one CPU when decoding incoming rects in
the VNC client. The rects are however not independent of each other, so there
are some issues to overcome:

 - The number of bytes used to encode a rect is not known on a high level. Each
decoder would have to handle reading its rects before we can start a parallel
decode.

 - Rects might overlap. We need some locking/queueing to handle these in order.
We also need to check if its better to stall the decoding of these rects, or
buffer the writes on a queue.

 - Some encodings might read existing data (e.g. CopyRect). These also need to
be handled in a serial manner with regard to other rects that write to these
areas.

 - Some encodings might have hidden dependencies between rects. E.g all zlib
based encodings tend to reuse the same zlib stream, meaning they have to be
decoded in a serial manner. Tight makes this extra complex by having four
possible streams.

We also need to fix bug 5615 if we want to use OpenMP for this.


TurboVNC had a threaded handling of rects in its Unix client:

https://github.com/TurboVNC/turbovnc/tree/27388e5962cb83abdcf07a25d79aabfaed353cd8/unix/vncviewer

Note that it doesn't seem to handle overlaps. Still, it could be useful to look
at.
------- Comment #1 From cendio 2015-11-09 14:21:43 -------
OpenMP doesn't seem useful here unfortunately. We read a single rect at a time
and then return, an OpenMP needs to contain the entire parallel section within
one block.

So we'll have to do this manually ourselves. Which might not be a major loss
given the complex dependencies we have between rects.
------- Comment #2 From cendio 2015-11-11 17:07:05 -------
Urgh. XP doesn't have condition variables which makes this a lot more
difficult. Perhaps it is time to implement bug 5618 and drop XP support?
------- Comment #3 From cendio 2015-11-16 15:17:50 -------
I have something that works now. Need to do more testing though, primarily for
regressions. We don't want slow single core machines to be penalised too much
by this change.
------- Comment #4 From cendio 2015-11-16 15:19:49 -------
I tested it on a somewhat slow armhf machine and found a fun issue; vncviewer
now consumes so much CPU that X and ssh are getting starved. The entire machine
becomes very unresponsive when this happens. Probably need to do something
about this problem.
------- Comment #5 From cendio 2015-11-18 09:06:30 -------
Waiting for the X server seems to solve that issue, as well as the massive
tearing. Which isn't really surprising after looking at the code. We are using
shared memory to give the pixel data to the X server. Yet we don't wait for the
X server to finish reading it before we start writing to it again.
------- Comment #6 From cendio 2015-11-18 16:36:44 -------
However this exposed bug 5720.
------- Comment #7 From cendio 2015-11-19 13:40:01 -------
Implementing this bug has screwed with how the bandwidth estimation code works,
so I need to have a look at that as well. Perhaps bug 2494 should be included
at that point.
------- Comment #9 From cendio 2015-12-03 16:58:50 -------
I have had a look at the auto mode, comparing how it worked prior to these
changes with now. Initially I used my phone as a wifi hotspot and limited the
networks it used.

I measured recieve bandwidth using the Task Manager on windows. JPEG quality
(JQ), full color (fc), and pixel format depth (pfd) was gathered from
tlclient.log.

I had two test cases:
* testing moving the profile chooser around on the solid background
* played the first 20 seconds of this video:
https://www.youtube.com/watch?v=tQ-WuUHNIz8

Looking at the logfile, I can gather from my tests that the threshold for
changing from quality level 8 to 6 seems to be 16000 kbit/s. The threshold for
disabling full color and changing pixel format depth seems to both be at 250
kbit/s. This didn't change between the old and the new code.

Windows 10 client tests:


3G networks, using new code:
        | recieve    | JQ    | fc    | pfd
----------------+---------------+-------+-------+------
Moving a window    |~350Kbps    | 8    | yes    | 24
Playing a video    |~12Mbps    | 6-8    | yes    | 24

The logfile reported heavy fluctuations between quality level 6 and 8 while
playing the test video. It changed back and forth roughly once every
half-second.


2G networks, using new code:
        | recieve    | JQ    | fc    | pfd
----------------+---------------+-------+-------+------
Moving a window    |~200Kbps    | 6    | yes    | 24
Playing a video    |~200Kbps    | 6    | yes/no| 8-24

The quality level was constantly at 6. The pixel format depth and full color
fluctuated a few times though.


3G networks, using old code:
        | recieve    | JQ    | fc    | pfd
----------------+---------------+-------+-------+------
Moving a window    |~400Kbps    | 8    | yes    | 24
Playing a video    |~12Mbps    | 6-8    | yes    | 24

Same as with new code.


I have exceeded my data limit for the month so I have to continue my tests
using a different method.
------- Comment #10 From cendio 2015-12-08 16:49:03 -------
A performance report is available here:

https://github.com/TigerVNC/tigervnc/tree/master/tests/results/multicore

The short summary is that we're seeing between 20% and 125% improvement in the
tests. Real world scenarios will probably see less as there are other bottle
necks besides the CPU.
------- Comment #11 From cendio 2015-12-08 16:50:30 -------
I also had a look at if we should leave one "spare" CPU for ssh and other
processes. But I could not find any scenario where they were competing for
resources, so for now we'll keep things as they are.
------- Comment #12 From cendio 2015-12-17 11:31:11 -------
Using the throttle machine that is available in /import/vmware-images/Throttle/
to limit bandwidth we got the following results testing the auto setting:

First we were playing a maximized video in a session with a resolution of
1280x1024. The new code resulted in a threshold for switching down from level 8
to level 6 JPEG compression on 18 mbit/s and 128 kbit/s for turning off full
color. The old code showed similar results with a first threshold at 15 mbit/s
and the second threshold at 128 kbit/s.

Lastly we tested scrolling through text documents in the same session
resolution. The new code displayed the same 18 mbit/s and 128 kbit/s thresholds
as in the previous test. The old code had thresholds of 16 mbit/s and 128
kbit/s.

This is similar enough. The new code does seem a little bit more accurate at
estimating bandwidth in fact. Both are horrible in reality though.
------- Comment #13 From cendio 2016-01-21 10:27:00 -------
I've tested the new multi-threaded code on several different terminals and
workstations, and the general experience is much better for streaming video
when there's more cores to use. Single-core performance feels about the same as
before.

Looks good to me.
------- Comment #14 From cendio 2016-04-12 14:38:51 -------
We've decided to support XP for a while longer, so we need some hack to get the
viewer running there (albeit without multi-core support).
------- Comment #16 From cendio 2016-04-18 14:02:15 -------
Seems to work fine. Linux and modern Windows gets multiple threads, and XP gets
just the one.
------- Comment #17 From cendio 2016-04-19 08:32:37 -------
When running nightly ThinLinc client 4.5.0post build 5097 on XP machine i get:

  "Warning: this Windows version is not supported"
------- Comment #18 From cendio 2016-04-19 08:59:52 -------
(In reply to comment #17)
> When running nightly ThinLinc client 4.5.0post build 5097 on XP machine i get:
> 
>   "Warning: this Windows version is not supported"

This is handled on bug #5360
------- Comment #19 From cendio 2016-04-19 09:05:09 -------
Verified using(In reply to comment #14)
> We've decided to support XP for a while longer, so we need some hack to get the
> viewer running there (albeit without multi-core support).

Tested using ThinLinc client 4.5.0post build 5097.

Verified that client works as expected on Windows XP platform (with 2 cores)
and verified that using process explorer that only one thread was used.

Verified that client on a Window 10 platform (with 4 cores) works as expected
and that vncviewer used more than one thread for decoding.
------- Comment #20 From cendio 2016-05-04 14:30:17 -------
*** Bug 5876 has been marked as a duplicate of this bug. ***
------- Comment #21 From cendio 2016-05-04 14:30:58 -------
Getting crashes on HP t610 in the update synchronisation code. See bug 5876.
------- Comment #23 From cendio 2016-05-04 17:57:25 -------
The X server on HP's machine has some serious bug. It sends out extra
completion events, which screw up our counter. The window id in those extra
event is also completely bogus (seems to be the actual id OR:ed with
0x40000001). For extra fun you get a segfault in the X server if you do
xwininfo on that id.

Fortunately it was fairly easy to work around. I suspect getting a fix from HP
anytime soon is not feasible.
------- Comment #24 From cendio 2016-05-05 11:49:08 -------
Nightly build works fine.
------- Comment #25 From cendio 2016-05-06 07:56:19 -------
Verified using ThinLinc client 4.6.0 build 5113 on a few client workstations
including HP terminal. Seems to work as expected.