Bugzilla – Full Text Bug Listing
|Summary:||Avoid writing the session database on all session changes|
|Product:||ThinLinc||Reporter:||Karl Mikaelsson <firstname.lastname@example.org>|
|Component:||VSM Server||Assignee:||Karl Mikaelsson <email@example.com>|
|Status:||CLOSED FIXED||QA Contact:||Bugzilla mail exporter <firstname.lastname@example.org>|
In large clusters with many concurrent users, the session file can grow by about 2-4 KB per user session. Since this file is pickled to disk every time a session is created or closed, this can lead to unnecessary disk I/O when many sessions are created or closed in a short time. The upside is that we never lose session data in a crash (unless the crash is caused by writing the file, in which case we lose the data for the event that triggered the write). We should consider moving to a model where a session change event only marks the file as "dirty" and a timer takes care of writing the file every N seconds. This gives us a larger window in which session data can be lost, but it means that ThinLinc can scale to a larger number of concurrent users.
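A minimal sketch of the proposed "dirty flag plus timer" model, for illustration only. The class name `SessionStore`, its methods, and the `interval` parameter are hypothetical and are not taken from the vsmserver source; `interval` stands in for the "N seconds" mentioned above.

```python
import pickle
import threading


class SessionStore:
    """Hypothetical store: changes only mark it dirty, a timer persists it."""

    def __init__(self, path, interval=5.0):
        self.path = path
        self.interval = interval  # the "N seconds" between write attempts
        self.sessions = {}
        self.dirty = False
        self.writes = 0           # number of disk writes, for illustration
        self._lock = threading.Lock()
        self._timer = None
        self._running = False

    def add(self, session_id, data):
        with self._lock:
            self.sessions[session_id] = data
            self.dirty = True     # no disk I/O here

    def remove(self, session_id):
        with self._lock:
            if session_id in self.sessions:
                del self.sessions[session_id]
                self.dirty = True

    def flush(self):
        """Pickle the sessions to disk, but only if something changed."""
        with self._lock:
            if not self.dirty:
                return False
            with open(self.path, "wb") as f:
                pickle.dump(self.sessions, f)
            self.dirty = False
            self.writes += 1
            return True

    def start(self):
        """Begin flushing every `interval` seconds."""
        self._running = True
        self._schedule()

    def _schedule(self):
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def _tick(self):
        self.flush()
        if self._running:
            self._schedule()

    def stop(self):
        """Cancel the timer and write any pending changes one last time."""
        self._running = False
        if self._timer is not None:
            self._timer.cancel()
        self.flush()
```

With this shape, any number of `add` and `remove` calls between two timer ticks cost a single write, at the price of losing up to `interval` seconds of changes if the process crashes.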
As part of the get_load XMLRPC call, vsmagent appends a list of dead sessions, which vsmserver can then clean up automatically without having to verify each session through a proper verify_session call. vsmserver loops over the list of dead sessions and calls remove_session for each one. This triggers a write of the session store for each reported session. The proposed fix for this bug would make sure that this loop is no longer a bottleneck.
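To illustrate why the loop is a bottleneck, here is a small sketch that counts writes for the same list of dead sessions with and without deferred writes. Everything in it is an assumption made for the example: `SessionStore`, `handle_dead_sessions`, and the `deferred` flag are hypothetical names, and `_write` only counts calls instead of pickling a file.

```python
class SessionStore:
    """Hypothetical store that writes either eagerly or on an explicit flush."""

    def __init__(self, sessions, deferred):
        self.sessions = dict(sessions)
        self.deferred = deferred
        self.dirty = False
        self.writes = 0

    def _write(self):
        # Stand-in for pickling the whole session file to disk.
        self.writes += 1
        self.dirty = False

    def remove_session(self, session_id):
        if session_id not in self.sessions:
            return
        del self.sessions[session_id]
        if self.deferred:
            self.dirty = True   # leave the write to a later flush
        else:
            self._write()       # one full write per removed session

    def flush(self):
        # In the proposed model a timer would call this periodically.
        if self.dirty:
            self._write()


def handle_dead_sessions(store, dead_sessions):
    """Remove every session that the agent reported as dead."""
    for session_id in dead_sessions:
        store.remove_session(session_id)


# 50 dead sessions: eager mode writes 50 times, deferred mode writes once.
eager = SessionStore({i: {} for i in range(50)}, deferred=False)
handle_dead_sessions(eager, range(50))

lazy = SessionStore({i: {} for i in range(50)}, deferred=True)
handle_dead_sessions(lazy, range(50))
lazy.flush()
```

Since each write serializes the whole session file, the eager variant does work proportional to the number of dead sessions times the file size, while the deferred variant does it once.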
Time estimation ranges from "review and commit attached patches" to "review and solve the problem another way".
With the report for the devmeeting written and presented, patches committed, broken autotests fixed, new autotests added, and test runs showing that everything works as intended, I believe I'm done.
The delay seems to work fine. I tested creating hundreds of sessions and then killing them. I would then either a) wait for load info to update, or b) verify the sessions by bringing up their details in tlwebadm. In both cases only a single write was made, some time after the sessions were removed. I could also see that the database was written directly on each new session, so the delay only applies to removed sessions.