I have been seeing an intermittent problem for some time where audio refuses to work in a session. It always happens just as audio should start to play, and only on my Fedora 23 workstation. Sometimes it resolves itself, and sometimes I have to kill the session. There are no errors in any log.
The problem is also related somehow to the client's PulseAudio server, as bypassing the session server does not help. Local sound on the workstation still works though.
Unfortunately it has been very difficult to debug as it only happens about once every two weeks. That changed this week, when I found a test case that can reproduce it:
- Ubuntu 14.04
- Super Tux Kart
This seems to trigger the bug nine times out of ten. I have not tried to recreate the system from scratch to see how fragile the setup is.
Discussion started on the upstream mailing list:
Problem identified. It is caused by a reduction in latency (buffer size) and all related parameters. The scenario is this:
1. Large latency, large buffer, large target fill, large minimum request. Silence in queue (i.e. buffer is full).
2. Buffer drains slightly, making it fall below target fill. The missing amount is however still below the minimum request, so no request is sent to the client.
3. The client requests a reduced latency, buffer is reduced, target fill is reduced, minimum request is reduced. The buffer now greatly exceeds target fill as it was almost up to the previous target fill level. This means that the server will not be asking the client for more data for a while.
4. Some time later we've drained most of the excess and are almost back down to the target fill level. However the data requested in 2 is sufficiently large that we never fall back down below target fill. Hence we never start requesting more data, and we already decided in 2 not to send a request for the first portion.
So the fundamental problem here is that requesting data from the client can be triggered not only by the buffer emptying, but also by parameters changing. Specifically, changes to the minimum request size are not handled properly.
In theory this can be caused by any program that triggers a massive reduction in buffer latency.
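To make the scenario above easier to follow, here is a toy simulation of steps 1 to 4. It is only a sketch under assumptions, not PulseAudio's code: the class, its fields and the numbers are all made up, and it assumes the server sends a request only when the accumulated unsent amount crosses the minimum request from below (edge-triggered). The level-triggered variant is included purely for comparison and is not the patch sent upstream.

```python
class ToyStream:
    """Toy playback stream. Hypothetical names; not PulseAudio's actual code."""

    def __init__(self, target_fill, min_request, edge_triggered=True):
        self.target_fill = target_fill
        self.min_request = min_request
        self.edge_triggered = edge_triggered
        self.fill = target_fill   # step 1: buffer starts full (of silence)
        self.pending = 0          # counted as missing, but no request sent yet
        self.requests = []        # sizes of the requests actually sent

    def set_latency(self, target_fill, min_request):
        # Step 3: the client asks for lower latency. Nothing is re-evaluated
        # here, which is the assumed source of the stall.
        self.target_fill = target_fill
        self.min_request = min_request

    def drain(self, nbytes):
        """Play up to nbytes, then decide whether to ask the client for more."""
        self.fill -= min(nbytes, self.fill)
        previous = self.pending
        missing = self.target_fill - self.fill - self.pending
        if missing > 0:
            self.pending += missing
        if self.edge_triggered:
            # Assumption: only fire when crossing the threshold from below.
            send = previous < self.min_request <= self.pending
        else:
            # Comparison only: fire whenever enough is pending.
            send = self.pending >= self.min_request
        if send:
            self.requests.append(self.pending)
            self.fill += self.pending   # assume the client answers in full at once
            self.pending = 0


def run_scenario(edge_triggered, ticks=20):
    s = ToyStream(target_fill=1000, min_request=200,
                  edge_triggered=edge_triggered)
    s.drain(100)                                    # step 2: below min request
    s.set_latency(target_fill=200, min_request=50)  # step 3
    for _ in range(ticks):                          # step 4
        s.drain(100)
    return s


if __name__ == "__main__":
    for mode in (True, False):
        s = run_scenario(mode)
        print("edge" if mode else "level", "fill:", s.fill,
              "requests sent:", len(s.requests))
```

In this toy model the edge-triggered stream ends up with an empty buffer and no request ever sent, because the 100 units left pending from step 2 are already above the new minimum request of 50, so the threshold is never crossed again. The level-triggered variant keeps requesting data.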
Sent suggested patches to upstream:
However this only fixes the problem long term, as the bug is in the system's PulseAudio, not ours. It's not obvious whether we can work around it until then.
The fix seems to provoke some glitches in the audio though. Not sure if it means the patch is bad, or if it simply exposes bugs in the tunnel module. I can see some chatter about buffer sizes in the log, but no underruns.
I turned up logging on the other two servers (system and session), and unfortunately nothing is logged from those either when the sound is crackling.
A large glitch was however noticed by the system server, which promptly increased its minimum latency to 4.0 ms. However our tunnel module fought back a bit and it took a few turns until it got the latency up high enough.
There is definitely more that can be done here, but I'm moving it to a separate bug. Opened bug 5903 for improving the latency handling.
The initial crackling is still a mystery though. Perhaps we should just start at a few ms minimum rather than zero?