High-Availability Keepalived Linux ThinLinc

Setting up Keepalived to provide IP failover in a ThinLinc HA pair

May 8, 2023
Written by: Martin Östlund

What problem are we trying to solve?

A typical ThinLinc cluster consists of one “master” server, and several identical “agent” servers behind it, where the sessions themselves are run. Having multiple agent servers allows ThinLinc to perform load-balancing, and also provides a certain degree of redundancy. If one agent goes down, there are still others available to host sessions. However, as there is only one master server, this presents a potential single point of failure for the entire cluster. To address this, ThinLinc provides high-availability (HA) features which enable administrators to eliminate this single point of failure by having a second master server, which is only used if the primary one fails.

The ThinLinc Administrator’s Guide (https://www.cendio.com/resources/docs/tag/HA-overview.html#ha-theory-of-operation) states: “Both machines have an unique hostname and an unique IP address, but there is also a third IP address that is active only on the node currently responsible for the VSM server service. This is usually referred to as a resource IP address, which the clients are connecting to. ThinLinc does not move this resource IP address between servers, supplementary software is required for this purpose.”

This guide outlines one possible way to achieve this.

Different use cases

Other projects can also share a VIP (virtual IP address) between two or more hosts, for example Pacemaker from the Linux-HA project, but they can be daunting to set up and maintain, especially if the sole purpose is to provide a VIP between two hosts.

To give you an example, Pacemaker is a great choice if your setup is designed with a chain of different services that all depend on each other somehow. It might be necessary to bring other services up (or down) on another machine in case of a failure on the primary. In this case, you’re not only moving the VIP, but the service itself. You might need to do things like mounting a remote file system or ensuring that other services are brought down on the host that just lost the VIP. For that use-case, Pacemaker from Linux-HA is a good choice.

In a ThinLinc HA setup, two equal machines are used to keep the VSM service running and the session database in sync. One of the machines is considered to be primary, and the other one secondary. The primary machine is normally handling VSM server requests, and if that machine is, for whatever reason, offline, the secondary is ready to perform its duties. There are no other dependencies involved in a ThinLinc HA pair, and it’s rather simple in its design. That’s why keepalived is a good fit for providing a VIP failover between the pair.

To avoid this single point of failure, and to allow a failure of one VSM server to be completely transparent to ThinLinc users, we need a convenient way to make the transition to the secondary as smooth as possible. Simply telling the users “Hey, use this IP address instead” is neither efficient nor user-friendly. That’s where keepalived comes in.

Keepalived is software written in C that provides simple tooling for managing high availability on Linux servers. It uses the Virtual Router Redundancy Protocol (VRRP) to create a virtual IP address that can be shared among multiple machines. If one machine fails, another machine can take over its tasks because keepalived moves the VIP to it.

Once keepalived is installed and configured on both VSM servers, the primary VSM server will handle incoming requests to the VIP. In case of failure, the VIP is moved over to the secondary VSM server within a few seconds, which will happily continue serving requests.

Once the primary server is available again, VSM server will synchronize its session database from the secondary server and the VIP will be moved back to the primary.

Let’s get to it.

The prerequisites for this setup are that you have two Linux servers available to you.

If you’re unsure about some of the commands used in this guide, please consult the man pages or the keepalived documentation. Links are provided at the end of this guide.

The example of this guide will use the following setup:
Primary node with the hostname tlha-primary and IP address 10.0.0.2
Secondary node with the hostname tlha-secondary and IP address 10.0.0.3
VIP address of 10.0.0.4 with the DNS name tlha.

The above IP addresses and hostnames are just examples. Replace them with values that are appropriate for your environment.

The distribution we’re using is Red Hat Enterprise Linux 8.

To allow for services to bind to an IP address not present on the system, one must configure the kernel parameter net.ipv4.ip_nonlocal_bind.

The following command has to be executed on both tlha-primary and tlha-secondary:

[cendio@tlha-primary ~]$ echo "net.ipv4.ip_nonlocal_bind = 1" | sudo tee -a /etc/sysctl.d/100-nonlocal_bind.conf

Reboot both machines to have the change take effect. Once the machines are back up, log in again and, on both machines, run the following command and verify that it reports net.ipv4.ip_nonlocal_bind = 1:

[root@tlha-primary ~]# sysctl net.ipv4.ip_nonlocal_bind
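
If a reboot is inconvenient, sysctl --system reloads the files under /etc/sysctl.d on a running system. The check itself can also be scripted. The following is a minimal sketch in POSIX shell; the function name nonlocal_bind_enabled and its optional path argument are assumptions made for illustration, not part of any tool.

```shell
# Sketch: succeed if non-local binding is enabled.
# The optional first argument (a file to read the value from) is an
# assumption added for illustration; by default the live kernel value
# is read from /proc.
nonlocal_bind_enabled() {
    value_file="${1:-/proc/sys/net/ipv4/ip_nonlocal_bind}"
    [ -r "$value_file" ] && [ "$(cat "$value_file")" = "1" ]
}

# Typical use on each node (commented out so the snippet has no side effects):
#   sudo sysctl --system     # reload /etc/sysctl.d/*.conf without a reboot
#   nonlocal_bind_enabled && echo "ip_nonlocal_bind is enabled"
```

Reading the value from /proc avoids parsing the output of sysctl, which makes the function easy to use in a larger health-check script.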

We are now ready to install keepalived. On both machines:

[cendio@tlha-primary ~]$ sudo dnf install keepalived

After keepalived has been installed, a basic configuration file is placed in /etc/keepalived/keepalived.conf. Move it out of the way on both tlha-primary and tlha-secondary:

[cendio@tlha-primary keepalived]$ sudo mv /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak

Then create a new configuration on both nodes. Some settings differ between tlha-primary and tlha-secondary, so pay extra attention to notification_email, notification_email_from, router_id, state and priority below.

On tlha-primary, make sure the contents of keepalived.conf are:

! Configuration File for keepalived

global_defs {
    notification_email {
        root@tlha-primary
    }
    notification_email_from root@tlha-primary
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id tlha-primary
}

vrrp_instance TL_VSM {
    state MASTER
    interface ens192
    virtual_router_id 51
    priority 101 
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.0.0.4/24
    }
}

On tlha-secondary, make sure the contents of keepalived.conf are:

! Configuration File for keepalived

global_defs {
    notification_email {
        root@tlha-secondary
    }
    notification_email_from root@tlha-secondary
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id tlha-secondary
}

vrrp_instance TL_VSM {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 100 
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.0.0.4/24
    }
}

You’ll need to change values such as the email addresses, router_id, interface, auth_pass and virtual_ipaddress to something that fits your environment.
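
Because a few settings must match on both nodes while others must differ, a quick consistency check can catch copy-and-paste mistakes before keepalived is started. The following is a minimal sketch, not part of keepalived: get_setting and check_pair are hypothetical helper names, and it assumes that both configuration files have been copied to one machine for comparison.

```shell
# Sketch: print the value of a single-valued keepalived.conf keyword.
# get_setting and check_pair are hypothetical helpers, not keepalived tools.
get_setting() {
    # $1 = configuration file, $2 = keyword (for example: priority)
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Compare the configuration of the primary ($1) with the secondary ($2).
check_pair() {
    primary_conf="$1"
    secondary_conf="$2"
    status=0
    # These settings must be identical on both nodes.
    for key in virtual_router_id advert_int auth_pass; do
        if [ "$(get_setting "$primary_conf" "$key")" != "$(get_setting "$secondary_conf" "$key")" ]; then
            echo "MISMATCH: $key differs between the nodes"
            status=1
        fi
    done
    # The primary should have the higher priority.
    # A missing priority is treated as 0 in this sketch.
    p1=$(get_setting "$primary_conf" priority)
    p2=$(get_setting "$secondary_conf" priority)
    if [ "${p1:-0}" -le "${p2:-0}" ]; then
        echo "WARNING: primary priority is not higher than secondary priority"
        status=1
    fi
    return "$status"
}

# Example (file names are placeholders):
#   check_pair primary-keepalived.conf secondary-keepalived.conf && echo "pair looks consistent"
```

The parser only looks at the first word of each line, so it is suitable for simple files like the ones above but not for configurations with several vrrp_instance blocks.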

Setting: Explanation
notification_email: The email address that alerts are sent to if you choose to configure alert emails on state changes. The above configuration will not send any alerts; further configuration in the vrrp_instance block is necessary.
router_id: A string that identifies this machine.
state: Initial state of this VRRP instance (MASTER on tlha-primary, BACKUP on tlha-secondary).
interface: The network interface that the VRRP instance runs on and that the VIP is assigned to (ens192 in this example).
virtual_router_id: Arbitrary number used to differentiate multiple instances of vrrpd running on the same machine/network interface. We only have one instance in this scenario. The value must be the same on both machines.
priority: The node with the highest priority holds the VIP, which is why tlha-primary (101) is set higher than tlha-secondary (100).
advert_int: The advertisement interval in seconds; the default is one second. If the backup nodes fail to receive three consecutive VRRP advertisements, the backup server with the highest assigned priority takes over as the primary server and assigns the virtual IP addresses to its own network interface.
auth_pass: Shared password used to authenticate the VRRP advertisements exchanged between the servers. This must be the same on both machines, and the example value 1111 should be replaced with your own.
virtual_ipaddress: Our VIP for tlha, 10.0.0.4.
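
To get a feeling for how long a failover takes with these settings, the takeover delay can be estimated from advert_int and the priority of the backup node. The sketch below assumes VRRP version 2 as specified in RFC 3768, where Master_Down_Interval = 3 * Advertisement_Interval + (256 - Priority) / 256 seconds. The function name master_down_interval is a hypothetical helper, and the result is an estimate rather than a measured value.

```shell
# Sketch: estimate how many seconds a backup waits without advertisements
# before it takes over, using the VRRPv2 formula from RFC 3768:
#   Master_Down_Interval = 3 * Advertisement_Interval + (256 - Priority) / 256
# master_down_interval is a hypothetical helper name.
master_down_interval() {
    # $1 = advert_int in seconds, $2 = priority of the backup node
    awk -v adv="$1" -v prio="$2" \
        'BEGIN { printf "%.3f\n", 3 * adv + (256 - prio) / 256 }'
}

# Example for tlha-secondary (advert_int 1, priority 100):
#   master_down_interval 1 100    # prints 3.609
```

With the example configuration, the secondary would therefore take over roughly 3.6 seconds after the primary disappears without warning. When keepalived is stopped cleanly, the primary announces this with a priority 0 advertisement and the takeover happens sooner.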

There are many more configuration options available, and we recommend that you read up on what else you can configure. /usr/share/doc/keepalived is a good source of example configurations.

If you have firewalld running on your Red Hat system, you must allow VRRP traffic to pass between the keepalived machines. To do so, run the following commands on both tlha-primary and tlha-secondary:

[cendio@tlha-primary keepalived]$ sudo firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
[cendio@tlha-primary keepalived]$ sudo firewall-cmd --reload
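
To confirm that the rule is active after the reload, the output of firewall-cmd --list-rich-rules can be inspected. The following is a minimal sketch; has_vrrp_rule is a hypothetical helper name, and it assumes that firewalld lists the rule in the same form in which it was entered above.

```shell
# Sketch: succeed if the rich-rule listing on stdin contains the VRRP accept rule.
# has_vrrp_rule is a hypothetical helper name.
has_vrrp_rule() {
    grep -q 'protocol value="vrrp" accept'
}

# Typical use on each node (commented out so the snippet has no side effects):
#   sudo firewall-cmd --list-rich-rules | has_vrrp_rule && echo "VRRP traffic is allowed"
```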

We are now all set to fire up keepalived. Execute the following on both tlha-primary and tlha-secondary:

[cendio@tlha-primary keepalived]$ sudo systemctl enable --now keepalived

Verify that keepalived is running:

[cendio@tlha-primary keepalived]$ sudo systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-04-28 08:08:45 CEST; 9min ago
  Process: 3099 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3100 (keepalived)
    Tasks: 2 (limit: 11368)
   Memory: 1.8M
   CGroup: /system.slice/keepalived.service
           ├─3100 /usr/sbin/keepalived -D
           └─3101 /usr/sbin/keepalived -D
Apr 28 08:08:48 tlha-primary Keepalived_vrrp[3101]: Sending gratuitous ARP on ens192 for 10.0.0.4
Apr 28 08:08:48 tlha-primary Keepalived_vrrp[3101]: Sending gratuitous ARP on ens192 for 10.0.0.4

You should now see that our VIP, 10.0.0.4, has been assigned on interface ens192 on tlha-primary:

[cendio@tlha-primary ~]$ ip a s ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b1:a3:25 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 10.0.0.2/24 brd 10.0.0.255 scope global dynamic noprefixroute ens192
       valid_lft 2545sec preferred_lft 2545sec
    inet 10.0.0.4/24 scope global secondary ens192
       valid_lft forever preferred_lft forever

Executing the above command on tlha-secondary should show the absence of the VIP. Let’s try to fail over the VIP to tlha-secondary. On the tlha-primary machine, stop keepalived:

[cendio@tlha-primary ~]$ sudo systemctl stop keepalived

Verify that the VIP has failed over to tlha-secondary:

[cendio@tlha-secondary ~]$ ip a s ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b1:2a:69 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 10.0.0.3/24 brd 10.0.0.255 scope global dynamic noprefixroute ens192
       valid_lft 2070sec preferred_lft 2070sec
    inet 10.0.0.4/24 scope global secondary ens192
       valid_lft forever preferred_lft forever
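
For monitoring or scripting, the same check can be done without reading the output by eye. The following is a minimal sketch that parses the one-line output format of ip -o -4 addr show; holds_vip is a hypothetical helper name, and the interface and address are the example values used in this guide.

```shell
# Sketch: succeed if the given IPv4 address appears in the output of
# `ip -o -4 addr show` supplied on stdin.
# holds_vip is a hypothetical helper name.
holds_vip() {
    # $1 = VIP without prefix length, for example 10.0.0.4
    awk -v vip="$1" '
        $3 == "inet" { split($4, addr, "/"); if (addr[1] == vip) found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Typical use on either node (commented out so the snippet has no side effects):
#   ip -o -4 addr show dev ens192 | holds_vip 10.0.0.4 && echo "this node holds the VIP"
```

The address is compared exactly after stripping the prefix length, so 10.0.0.4 does not match an address such as 10.0.0.40.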

This can also be verified by looking at the journal for keepalived on tlha-secondary when stopping keepalived on tlha-primary:

[cendio@tlha-secondary ~]$ journalctl -u keepalived -f
Apr 28 08:34:50 tlha-secondary Keepalived_vrrp[1879]: (TL_VSM) Backup received priority 0 advertisement
Apr 28 08:34:50 tlha-secondary Keepalived_vrrp[1879]: (TL_VSM) Receive advertisement timeout
Apr 28 08:34:50 tlha-secondary Keepalived_vrrp[1879]: (TL_VSM) Entering MASTER STATE
Apr 28 08:34:50 tlha-secondary Keepalived_vrrp[1879]: (TL_VSM) setting VIPs.
Apr 28 08:34:50 tlha-secondary Keepalived_vrrp[1879]: (TL_VSM) Sending/queueing gratuitous ARPs on ens192 for 10.0.0.4
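
The state transitions in the journal can also be extracted with a small filter. The following is a minimal sketch based on the “Entering MASTER STATE” message shown above; last_vrrp_state is a hypothetical helper name, and the pattern may need adjusting if your keepalived version words its log messages differently.

```shell
# Sketch: print the most recent VRRP state found in keepalived log lines on stdin.
# last_vrrp_state is a hypothetical helper name; the pattern follows the
# "Entering MASTER STATE" log message.
last_vrrp_state() {
    sed -n 's/.*Entering \([A-Z]*\) STATE.*/\1/p' | tail -n 1
}

# Typical use on either node (commented out so the snippet has no side effects):
#   journalctl -u keepalived --no-pager | last_vrrp_state
```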

After starting keepalived on tlha-primary again, the same commands as above show that the VIP has been moved back.

Problems?

If you experience any problems following this guide, or want to ask follow-up questions, please join us at our ThinLinc community forum.

Conclusion

Setting up keepalived can greatly enhance the availability and reliability of your ThinLinc infrastructure. With keepalived, you can ensure that your VSM service remains accessible even in the event of a machine failure. By following the steps outlined in this guide, you should be able to successfully configure keepalived on your system. You can then follow the ThinLinc Administrator’s Guide (https://www.cendio.com/resources/docs/tag/HA.html) to continue setting up ThinLinc HA for VSM.

References

man 8 keepalived
man 5 keepalived.conf
Resources and examples: /usr/share/doc/keepalived
Project homepage: https://www.keepalived.org/index.html