www.cendio.com
Bug 4768 - Windows load balancing doesn't take the number of CPU:s into account when calculating load averages
Status: CLOSED FIXED
Product: ThinLinc
Component: WTS Tools
Version: trunk
Platform: PC Unknown
Importance: P2 Major
Target Milestone: 4.1.1
 
Reported: 2013-08-13 18:32 by
Modified: 2013-11-19 08:52 (History)

Description From cendio 2013-08-13 18:32:22
(Reported on the ThinLinc-technical mailing list by Jens Langner - thanks!)

The load average numbers for Windows servers have a tendency to drop far into
negative numbers, because the algorithm does not take the number of CPUs into
account when doing the calculations.

The problematic part of the algorithm is this:

> free_bogomips = EST_BOGOMIPS * (1 - loadinfo.loadavg)

loadinfo.loadavg is a value reported from the Windows side. What it actually
means and what range it should have is a bit unclear, judging from comments 18
and 20 of bug 3864. It used to be a number in the range from 0 to 1, but is now
a value from 0 to the number of cores in the Windows system.

The load balancer, however, still assumes that the number is 0 for no load and 1
for full load on all cores. As servers gain more and more cores, the load
balancer will report a server as fully loaded when it is running at merely
1/(number of cores) of its capacity.
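
To make the failure mode concrete, here is a minimal Python sketch of the calculation quoted above. The value of EST_BOGOMIPS and both helper functions are assumptions made for illustration; they are not taken from the ThinLinc source.

```python
# Illustrative sketch only: EST_BOGOMIPS and both helpers are assumed
# names and values, not the actual ThinLinc implementation.
EST_BOGOMIPS = 4000.0  # assumed placeholder capacity estimate


def free_bogomips_unscaled(loadavg):
    """The formula from the description, which expects loadavg in 0..1."""
    return EST_BOGOMIPS * (1 - loadavg)


def free_bogomips_scaled(loadavg, num_cpus):
    """Same formula after dividing a 0..num_cpus load by the CPU count."""
    # Clamp so that an out-of-range report can never yield a negative result.
    normalized = min(max(loadavg / num_cpus, 0.0), 1.0)
    return EST_BOGOMIPS * (1 - normalized)


# An 8-core server reporting a load of 2.0 is only 25% busy.
print(free_bogomips_unscaled(2.0))   # negative: treated as overloaded
print(free_bogomips_scaled(2.0, 8))  # three quarters of the estimate remain
```

With a load reported on the 0 to number-of-cores scale, the unscaled formula goes negative as soon as the reported value exceeds 1, which matches the symptom described here.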
------- Comment #2 From cendio 2013-08-14 08:28:34 -------
For reference, the loadavg we report from the VSM agent is adjusted on the agent
side, not the master. It's probably best to use the same principle here.
------- Comment #3 From cendio 2013-09-04 11:06:54 -------
Fixed in r27829.

With regards to comment #2: I decided against changing the nrpe_nt code back to
reporting 0..1 because it would affect all other users of nrpe_nt.
------- Comment #4 From cendio 2013-10-15 09:24:44 -------
(In reply to comment #3)
> Fixed in r27829.
> 
> With regards to comment #2: I decided against changing the nrpe_nt code back to
> reporting 0..1 because it would affect all other users of nrpe_nt.

This is confusing; it would be better to revert to the earlier behaviour, that
is, how it worked before:

r116 | hean01 | 2012-05-25 08:28:02 +0200 (fre, 25 maj 2012) | 4 lines
------- Comment #5 From cendio 2013-10-15 12:52:07 -------
Fixed in r28035, r28036.
------- Comment #6 From cendio 2013-10-17 12:54:39 -------
Looks good now.
------- Comment #7 From 2013-11-18 11:40:14 -------
As the initial reporter of this bug, and having just installed 4.1.1 on our
systems, I am curious about the actual status of the Windows load balancing
algorithm in 4.1.1. As I don't have access to the ThinLinc sources, I can only
guess from the comments above what was changed, and to me it seems that nothing
was changed and that the behavior of the tl-best-winserver and check_nrpe
functionality is the same as in 4.1.0.

Is this the case, and if so, why was the bug closed without a change?
And if not, what was actually changed in the algorithm?
------- Comment #8 From cendio 2013-11-18 14:04:04 -------
(In reply to comment #7)
> As the initial reporter of this bug, and having just installed 4.1.1 on our
> systems, I am curious about the actual status of the Windows load balancing
> algorithm in 4.1.1. As I don't have access to the ThinLinc sources, I can only
> guess from the comments above what was changed, and to me it seems that nothing
> was changed and that the behavior of the tl-best-winserver and check_nrpe
> functionality is the same as in 4.1.0.
> 
> Is this the case, and if so, why was the bug closed without a change?
> And if not, what was actually changed in the algorithm?

Hi Jens,

The initial fix for this bug was to scale the load value back into the
range of 0-1 from 0-<cpus>. I initially solved this by scaling the
value I received from wts-tools on the "client" side (on the ThinLinc
server). However, not everyone was happy with this solution, which led
me to revert my own fix and then to revert the change in nrpe_nt that
had changed the reported load value from 0-1 to 0-<cpus>.

Since all changes in the 4.1.1 release happened on the Windows side of
things, this means you also need to upgrade wts-tools to 4.1.1 when
you upgrade your ThinLinc server to 4.1.1. Perhaps this wasn't
communicated clearly enough in the comments here or in the
release notes.
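
Since the fix lives in wts-tools, a ThinLinc server paired with an older wts-tools could presumably still receive values above 1. As a rough sketch, a consumer of the value could flag that case; the function below is hypothetical and not part of ThinLinc or wts-tools.

```python
def check_reported_load(loadavg):
    """Return (clamped_value, suspicious) for a load expected in 0..1.

    Hypothetical helper: `suspicious` is True when the value falls outside
    0..1, which may indicate a wts-tools that still reports 0..<cpus>.
    """
    suspicious = loadavg < 0.0 or loadavg > 1.0
    clamped = min(max(loadavg, 0.0), 1.0)
    return clamped, suspicious


# A value above 1 is clamped and flagged; an in-range value passes through.
print(check_reported_load(3.5))
print(check_reported_load(0.4))
```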
------- Comment #9 From 2013-11-19 08:52:42 -------
(In reply to comment #8)

> The initial fix for this bug was to scale the load value back into the
> range of 0-1 from 0-<cpus>. I initially solved this by scaling the
> value I received from wts-tools on the "client" side (on the ThinLinc
> server). However, not everyone was happy with this solution, which led
> me to revert my own fix and then to revert the change in nrpe_nt that
> had changed the reported load value from 0-1 to 0-<cpus>.

Thanks for that information. Now it's clear to me what exactly was changed and
that the load_avg value returned by check_nrpe will only be between 0 and 1.
I have therefore changed my own 'tl-best-winserver' script to reflect that
change in ThinLinc 4.1.1.

For reference, and in case you are interested in reviewing or somehow integrating
my tl-best-winserver script with ThinLinc (it might be interesting for some
users), please find the latest version here:

https://github.com/hzdr/thinstation/blob/master/ts/5.1/packages/hzdr/bin/scripts/tl-best-winserver

To explain why we have our own version of tl-best-winserver:

1. On our thin clients (thinstation-based) we run our own GUI, which lets users
either connect directly to our Windows terminal servers via xfreerdp or, if a
user chooses to connect to a Linux server, use ThinLinc instead. Thus we
needed a way to query our Windows servers for the same load balancing
information that ThinLinc uses internally.
2. We needed a tl-best-winserver command-line program that allows overriding
the username, which is currently not possible with the version that ships with
ThinLinc.
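
As a rough illustration of the selection step such a script performs, the sketch below picks the least-loaded server from load values on the 0 to 1 scale. The function, the server names, and the idea of ranking on load alone are simplifying assumptions; neither the linked script nor ThinLinc's own tl-best-winserver is claimed to work this way.

```python
def pick_best_server(loads):
    """Return the name of the server with the lowest load (0..1 scale).

    `loads` maps a hypothetical server name to its reported load, or to
    None when the server could not be queried.
    """
    reachable = {name: load for name, load in loads.items() if load is not None}
    if not reachable:
        raise ValueError("no Windows server reported a load value")
    return min(reachable, key=reachable.get)


# Unreachable servers (None) are skipped; the lowest load wins.
print(pick_best_server({"wts1": 0.8, "wts2": 0.3, "wts3": None}))
```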

> Since all changes in the 4.1.1 release happened on the Windows side of
> things, this means you also need to upgrade wts-tools to 4.1.1 when
> you upgrade your ThinLinc server to 4.1.1. Perhaps this wasn't
> communicated clearly enough in the comments here or in the
> release notes.

Indeed, neither the release notes nor the comments here were particularly clear
on that.