Bug 7632 - Agent permanently down if started when network is unavailable
Summary: Agent permanently down if started when network is unavailable
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Agent
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.12.1
Assignee: Frida Flodin
URL:
Keywords: linma_tester, relnotes
Depends on:
Blocks:
 
Reported: 2021-01-26 12:53 CET by Pierre Ossman
Modified: 2021-02-02 17:13 CET

See Also:
Acceptance Criteria:
* If an agent or HA node fails to look up any allowed clients, it should *not* result in that agent/node being permanently down.
* If the DNS changes it should be reflected in the ThinLinc setup without needing to restart the services.
* An old IP, no longer associated with an allowed client, should not be allowed to communicate.


Description Pierre Ossman cendio 2021-01-26 12:53:51 CET
If the agent starts when the network is down, broken or otherwise unavailable, then the agent can end up being permanently offline, as it will reject all connections from the master.

This happens because we only look up allowed hosts when starting, not when the connections actually happen. If those lookups fail then we never retry.

You can spot these problems in the vsmserver log:

> 2021-01-26 11:36:33 WARNING vsmserver.loadinfo: VSM Agent tl.example.com:904 responded with permission denied in request for loadinfo. Marking as down.
> 2021-01-26 11:36:42 WARNING vsmserver.session: VSM Agent tl.example.com:904 responded with permission denied verifying sessions. 

The agent will often not log anything, but might log "Couldn't lookup host" in some cases.



Bug 4290 makes this rather common as the vsmagent service now starts before the network in many cases. Also see bug 4243 and bug 7531 for similar issues.
Comment 1 Pierre Ossman cendio 2021-01-26 12:55:13 CET
Note that the same code is also responsible for master to master communication for HA, so the same issue exists there.


The fact that we only look things up during startup also means that we fail to notice any changes in the DNS once the service is running.
Comment 3 Frida Flodin cendio 2021-01-28 17:39:26 CET
So after some investigation into a solution for this I've concluded the following:

1. Looking at other projects that could have similar problems, it seems the
   common approach is to always do a DNS lookup for every new request. I looked
   at the Apache module mod_authz_host [1]. The problem is that a DNS lookup
   could take a long time. The architecture of Apache makes sure that a request
   does not block other requests.

2. Can we do a DNS lookup asynchronously? There seem to be some libraries [2,
   3] for Python, but they need installation and Python > 3.6. Since we, right
   now, need to support Python < 3.6, this is not an option.

Given this I think that the way forward for now is to go with the
straightforward solution. That is to always check with the DNS server when we
get a request. If we see a real need for optimization we might need to look
into a more complex solution.

[1] https://httpd.apache.org/docs/2.4/mod/mod_authz_host.html
[2] https://pypi.org/project/async-dns/
[3] https://pypi.org/project/adns/
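
The per-request lookup proposed in this comment could look roughly like the sketch below. This is not the actual ThinLinc code: the function names `resolve_allowed` and `is_allowed`, the injectable `resolver` parameter and the example hostnames are all assumptions made for illustration. The point is only that the allowed hosts are resolved each time a peer connects, so a failed lookup affects that one request instead of being remembered forever.

```python
import socket


def resolve_allowed(hostnames, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses the allowed hosts resolve to right now.

    Hypothetical helper, not part of ThinLinc.
    """
    allowed = set()
    for name in hostnames:
        try:
            _canonical, _aliases, addresses = resolver(name)
        except socket.error:
            # socket.gaierror and socket.herror are subclasses of
            # socket.error. A failed lookup is not cached: the next
            # request simply triggers a new lookup.
            continue
        allowed.update(addresses)
    return allowed


def is_allowed(peer_ip, hostnames, resolver=socket.gethostbyname_ex):
    """Check a connecting peer against a fresh lookup of the allowed hosts."""
    return peer_ip in resolve_allowed(hostnames, resolver)
```

Because nothing is stored between calls, a changed DNS record takes effect on the next request, and an address that no longer belongs to an allowed host is rejected. The cost is one blocking lookup per request, which is the trade-off discussed above.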
Comment 5 Pierre Ossman cendio 2021-01-29 10:53:32 CET
I did a check with strace and socket.gethostbyname_ex() does not do any actual DNS requests if the input is already an IP address. So we don't have to do anything extra there ourselves.
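
A small illustration of the behaviour described here, using only the standard library. The wrapper name `resolve` and the addresses are illustrative assumptions. Whether any network traffic occurs cannot be seen from Python itself (that is what the strace check was for), but the call does return the literal address unchanged.

```python
import socket


def resolve(host):
    """Return the IPv4 address list for a hostname or an IPv4 literal."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return addresses


# An IPv4 literal is passed straight through, so allowed-host entries
# given as IP addresses need no special handling by the caller.
print(resolve("192.0.2.10"))
```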
Comment 10 Frida Flodin cendio 2021-02-01 13:46:22 CET
Given comment #3 we have decided to drop the following acceptance criterion:

>  * A DNS lookup that takes long time to complete should not lag or lock the ThinLinc system.

Right now it seems to require a lot of work and we don't know if it is a real problem. Instead we will write a warning in the log when the DNS lookup is too slow.
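
A minimal sketch of such a warning is shown below. It is not the actual ThinLinc implementation: the logger name, the one-second threshold, the message text and the function `timed_lookup` are all assumptions. The `resolver` and `clock` parameters exist only so the function can be exercised without a real network.

```python
import logging
import socket
import time

log = logging.getLogger("example.dnslookup")  # hypothetical logger name
SLOW_LOOKUP_SECONDS = 1.0  # assumed threshold, not taken from ThinLinc


def timed_lookup(name, resolver=socket.gethostbyname_ex,
                 clock=time.monotonic, threshold=SLOW_LOOKUP_SECONDS):
    """Resolve a name and log a warning if the lookup was slow."""
    start = clock()
    try:
        return resolver(name)[2]
    finally:
        # Runs for both successful and failed lookups, since a lookup
        # that eventually fails may also have blocked for a long time.
        elapsed = clock() - start
        if elapsed > threshold:
            log.warning("DNS lookup of %s took %.1f seconds", name, elapsed)
```

The lookup is still blocking, so this only makes a slow DNS server visible in the log. It does not keep other requests from being delayed.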
Comment 11 Frida Flodin cendio 2021-02-01 13:50:30 CET
I have tested the following:
- Server to Agent communication
- Agent denying a server that is not allowed
- Server to Server communication (HA)


Also checked the acceptance criteria:

> * If an agent or HA node fails to look up any allowed clients, it should *not* result in that agent/node being permanently down.
It does not. 


> * If the DNS changes it should be reflected in the ThinLinc setup without needing to restart the services.
Works, tested by modifying /etc/hosts.


> * An old IP, no longer associated with an allowed client, should not be allowed to communicate.
It is not allowed.
Comment 12 Linn cendio 2021-02-02 17:13:38 CET
Tested on RHEL 8 with Jenkins build 1845. Everything seems to work well and the logs look good.

- Server to agent communication
- Agent denies unauthorised servers
- Both servers update their session database when using HA
- Master reflects changes in tlwebadm


Acceptance criteria:

> * If an agent or HA node fails to look up any allowed clients, it should *not* result in that agent/node being permanently down.
Can successfully reconnect when agent/node is back on network.


> * If the DNS changes it should be reflected in the ThinLinc setup without needing to restart the services.
Works.


> * An old IP, no longer associated with an allowed client, should not be allowed to communicate.
Old IPs are not allowed, either as agents or as HA servers.
