I recently received a question about how to test that, if the Integrated Load Balancer (ILB) master goes down, another node actually takes over the VIP (i.e., failover). While there is technically no supported way to do this today, read on to learn how I would test it from the CLI.
How the ILB Works
The ILB works by having one node in the cluster elected as the ILB master. In general, Log Insight tries to ensure the ILB master is not the cluster master. As you may recall, the cluster master is responsible for running the UI and issuing queries against ingested data. If the cluster master goes down, you will lose the UI and the ability to query, but ingestion and the ILB will continue to function, assuming the rest of the nodes in the cluster are online.
The ILB master may change for a variety of reasons:
- The ILB master node is upgraded
- The ILB master node crashes
- The ILB master node shuts down due to a host failure
- The ILB believes it is network isolated
In all of these scenarios, the Log Insight cluster is aware of the issue and a new ILB master election takes place. The new ILB master will then take over the VIP by sending a gratuitous ARP and continue to handle ingestion and query traffic.
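If you want to watch the takeover happen, one option is to check which MAC address a client on the same layer-2 network has cached for the VIP before and after the failover. The following is a minimal sketch, assuming a Linux client with the iproute2 `ip` command. The function names and the example address are hypothetical, and the MAC address only changes if the new ILB master is a different node with its own network interface.

```shell
#!/bin/sh
# Sketch only: assumes a Linux client on the same subnet as the VIP.

# Extract the link-layer (MAC) address from `ip neigh show` output on stdin.
mac_from_neigh() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Print the MAC address this client currently has cached for the VIP.
# $1 is the VIP address (hypothetical example: 10.0.0.5).
vip_mac() {
    ip neigh show "$1" | mac_from_neigh
}

# Example usage (run before and after the failover and compare):
#   vip_mac 10.0.0.5
```

The neighbor entry only exists if the client has recently talked to the VIP, so ping the VIP first if the function prints nothing.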
Why You Should Test Failover
If you come from an operations background, you likely know the answer to this question. In short, a failover or disaster recovery plan is only as good as its test execution. Your procedure needs to be properly documented and regularly tested so that it works when you actually need it. In the case of Log Insight, the failover is handled automatically, but there are still good reasons to test. For example, a 2-node Log Insight cluster is not supported. If you attempt to test failover with a 2-node cluster, you will find that it does not work. If you did not know that 2-node clusters are unsupported, a failover test may be the only way you discover that losing a node results in losing the cluster.
How to Test Failover of the ILB
The process is actually very straightforward. Before I outline the steps, there are several important things to note:
- This process requires stopping a Log Insight node, which will reduce the maximum ingestion rate of the cluster and could impact ingestion and/or query.
- If the ILB master is also the cluster master and you stop the cluster master then the UI and all queries will no longer work.
- If you have a 2-node cluster and stop the load balancer master, the other node will not take over the IP — 2-node clusters are not supported.
- If your cluster is not currently healthy (e.g. a node is currently down) and you run this test, you may impact ingestion and/or query.
WARNING: Proceed at your own risk!
The steps I would take to test failover are:
- SSH to the VIP, which represents the node acting as the load balancer master.
- Confirm the node you are connected to is not the cluster master. If it is, you may not want to proceed with the test given the above warning.
- Run “service loginsight stop”.
- After the command finishes, in a separate terminal, SSH to the VIP again and you should now be connected to a different node, which represents the new load balancer master.
- Connect to the VIP via a web browser and confirm you can log in.
- Go to the Administration > System Monitor > Statistics section and confirm events are being ingested.
- When you are satisfied with the failover test, go back to the node you shut down and run “service loginsight start”.
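The SSH checks in the steps above can be sketched as a couple of small shell helpers. This is a rough sketch under my own assumptions, not a supported tool: the VIP hostname is a placeholder, the function names are hypothetical, and it assumes root SSH access and that each node reports a distinct hostname.

```shell
#!/bin/sh
# Sketch only: the VIP below is a placeholder, not a real address.
VIP="${VIP:-loginsight-vip.example.com}"

# Print the hostname of whichever node currently answers SSH on the VIP.
current_ilb_master() {
    ssh -o ConnectTimeout=5 "root@${VIP}" hostname
}

# Succeed only if both hostnames are non-empty and different,
# meaning a different node now holds the VIP.
failed_over() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Example usage:
#   before=$(current_ilb_master)
#   ... stop Log Insight on the ILB master as described above ...
#   after=$(current_ilb_master)
#   failed_over "$before" "$after" && echo "VIP moved from $before to $after"
```

Because a different node answers after the failover, your SSH client may complain that the host key for the VIP has changed. That is expected if each node has its own host key.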
© 2015, Steve Flanders. All rights reserved.