Bug 4328 - very slow performance with AMD G-T56N APU
Reported: 2012-06-07 13:53
Modified: 2012-11-28 12:33
Description From cendio 2012-06-07 13:53:05
We got reports from Oetiker that he was getting horrible performance on the HP
t610, which is a high-end machine with lots of power. We're also seeing the
same issue on Wyse Z90D, which has exactly the same processor.

No clue as to what the problem is at this point. Probably something wrong with
the SIMD code.
------- Comment #1 From cendio 2012-06-08 12:53:36 -------
The problem is in the SSE2 code. Forcing it off gives expected performance.
------- Comment #2 From cendio 2012-06-08 13:06:47 -------
Problem is in the YUV to RGB conversion routine, which is good as it is one of
the simpler ones.
------- Comment #3 From cendio 2012-06-08 13:47:42 -------
This is the code that is slow for some odd reason:

    pcmpeqb    xmmH,xmmH            ; xmmH=(all 1's)
    maskmovdqu xmmA,xmmH            ; movntdqu XMMWORD [edi], xmmA
    add    edi, byte SIZEOF_XMMWORD    ; outptr
    maskmovdqu xmmD,xmmH            ; movntdqu XMMWORD [edi], xmmD
    add    edi, byte SIZEOF_XMMWORD    ; outptr
    maskmovdqu xmmF,xmmH            ; movntdqu XMMWORD [edi], xmmF
    add    edi, byte SIZEOF_XMMWORD    ; outptr
------- Comment #4 From cendio 2012-06-08 17:00:52 -------
The culprit is "maskmovdqu". This is a silly little instruction that serves
little purpose, and AMD therefore decided not to waste silicon on it.

It is in the JPEG code because it's trying to emulate the instruction
"movntdq" without the alignment requirement it normally has. The proper
instruction for that is "movdqu", but it has slightly different cache
behaviour (it does not bypass the cache).

I'm not sure the cache avoidance of the current code is beneficial, so we need
to look further into that.
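
For reference, a minimal sketch in C intrinsics (not the actual libjpeg-turbo
assembly) of why "maskmovdqu" with an all-ones mask can be swapped for "movdqu":
both leave the same 16 bytes in memory, and only the cache hint differs. The
function names store_maskmov, store_plain and stores_match are made up for
this illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>

/* Store 16 bytes the way the slow path does: maskmovdqu with an all-ones mask. */
static void store_maskmov(unsigned char *dst, __m128i v)
{
    assert(dst != NULL);
    __m128i all_ones = _mm_cmpeq_epi8(v, v);        /* pcmpeqb xmmH,xmmH */
    _mm_maskmoveu_si128(v, all_ones, (char *)dst);  /* maskmovdqu */
}

/* Store the same 16 bytes with a plain unaligned store: movdqu. */
static void store_plain(unsigned char *dst, __m128i v)
{
    assert(dst != NULL);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Returns 1 if both variants leave identical bytes at an unaligned address. */
int stores_match(void)
{
    unsigned char src[16], a[32] = {0}, b[32] = {0};
    for (int i = 0; i < 16; i++)
        src[i] = (unsigned char)(i * 7 + 1);
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    store_maskmov(a + 1, v);  /* a + 1 is deliberately misaligned */
    _mm_sfence();             /* conservative fence after the non-temporal store */
    store_plain(b + 1, v);
    return memcmp(a, b, sizeof a) == 0;
}
```

This only demonstrates that the stored bytes are the same. It says nothing
about timing, which is the part that differs so much on this APU.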

I'm also concerned by the fact that this code is in the fallback section for
unaligned buffers. That code will never be fast, so we are calling something
incorrectly somewhere.
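
A rough sketch, again in C intrinsics and under the assumption that the output
pointer is the only thing deciding which path is taken, of the aligned versus
unaligned split described above. store_rows is a hypothetical helper and is not
taken from the libjpeg-turbo source.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy n bytes (n must be a multiple of 16) to dst.
 * Returns 1 if the aligned fast path was used, 0 if the fallback was used.
 */
int store_rows(unsigned char *dst, const unsigned char *src, size_t n)
{
    int aligned = ((uintptr_t)dst & 15) == 0;

    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        if (aligned)
            /* movntdq: bypasses the cache, requires 16-byte alignment */
            _mm_stream_si128((__m128i *)(dst + i), v);
        else
            /* movdqu: any alignment, goes through the cache */
            _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    _mm_sfence();  /* make the non-temporal stores globally visible */
    return aligned;
}
```

If callers always hand in 16-byte aligned output buffers, the fallback branch is
never reached, which is why ending up in it points at a problem in the caller.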
------- Comment #5 From cendio 2012-06-13 11:18:36 -------
maskmovdqu has been eliminated in upstream libjpeg-turbo. Need to upgrade our
build system.

The performance boost is almost 10x on the Bobcat architecture, but we're also
seeing improvements of up to 10% on other CPUs.
------- Comment #6 From cendio 2012-06-13 11:20:04 -------
Avoiding the cache or not in the rest of the code didn't have any measurable
effect on performance for simple tests. It might show something on higher level
tests, but we don't have time for that right now.

Aligning buffers better was moved to a separate bug.
------- Comment #7 From cendio 2012-06-14 10:51:27 -------
DRC found that the 32-bit code is producing bad output. Need to investigate.
------- Comment #8 From cendio 2012-07-02 12:38:21 -------
Several more fixes were done upstream. Brought in to our build system in
------- Comment #9 From cendio 2012-10-04 10:26:14 -------
A few tests comparing the original 3.4.0 client release with build 3671
reveal a huge difference in performance. The tests involved video playback
and web page scrolling.