Log Insight 2.0 introduced a clustering feature that made it possible to go beyond the configuration maximums of a single node. However, scale is not the only reason to consider clustering when architecting a Log Insight solution. In this post, I would like to discuss the other benefits of leveraging a cluster.
1. High Availability (HA)
In addition to scale, clustering makes it possible to support ingestion HA: if a node goes down, ingestion can continue. This assumes that you are forwarding traffic to a load balancer; sending traffic directly to a Log Insight node does not provide ingestion HA. Log Insight 2.0 required an external load balancer (ELB). Log Insight 2.5 adds an integrated load balancer (ILB). In either case, clients must be configured to forward traffic to the load balancer, not directly to nodes, for ingestion HA to work.
Given that clustering provides ingestion HA, this also means you can upgrade a cluster without ingestion downtime. Let me qualify that statement:
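To make the client-side requirement concrete, here is a minimal sketch of a Python client that forwards syslog events to a load balancer address rather than to an individual node. The FQDN `loginsight-vip.example.com` and port 514 are assumptions for illustration, not values from this post; substitute whatever your ELB or ILB actually listens on.

```python
import logging
import logging.handlers
import socket

# Hypothetical load balancer FQDN; replace with your ELB or ILB address.
LOAD_BALANCER_FQDN = "loginsight-vip.example.com"
SYSLOG_PORT = 514  # assumed standard syslog port


def build_forwarder(host, port=SYSLOG_PORT, use_tcp=False):
    """Return a logger that forwards syslog messages to host:port.

    Point `host` at the load balancer, never at a single node,
    otherwise a node failure interrupts ingestion for this client.
    """
    socktype = socket.SOCK_STREAM if use_tcp else socket.SOCK_DGRAM
    handler = logging.handlers.SysLogHandler(address=(host, port), socktype=socktype)
    logger = logging.getLogger("li-forwarder-%s-%d" % (host, port))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep forwarded events out of the root logger
    logger.addHandler(handler)
    return logger
```

A client would call `build_forwarder(LOAD_BALANCER_FQDN)` for UDP, or pass `use_tcp=True` for TCP, and then log through the returned logger as usual.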
- If the ELB fails over (e.g. due to a failure and/or upgrade) then you will have minimal ingestion downtime for UDP.
- If the ELB fails over (e.g. due to a failure and/or upgrade) then you will have minimal ingestion downtime for TCP unless you have properly configured stateful failover.
- If you are using an ELB and you properly remove a Log Insight node from the ELB and put the node in maintenance mode before upgrading it, then you should have no ingestion downtime for events sent over UDP.
- If you are using an ELB and you properly remove a Log Insight node from the ELB and put the node in maintenance mode before upgrading it, then you should have no ingestion downtime for events sent over TCP, apart from a possible brief gap because long-lived TCP sessions may not reconnect immediately.
- If you are using the ILB and have configured clients to forward traffic via UDP, then you will have minimal ingestion downtime while the load balancer master fails over to another node, just as when the ELB fails over.
- If you are using the ILB and have configured clients to forward traffic via TCP, then you should have no ingestion downtime, apart from a possible brief gap because long-lived TCP sessions may not reconnect immediately.
Long story short, the behavior for both ELB and ILB is the same. For ingestion HA to function properly (e.g. during upgrades and failover), it is important that you know and follow certain guidelines:
- In Log Insight 2.5, clustering requires a minimum of three nodes to provide ingestion HA; two nodes do not provide ingestion HA
- In Log Insight 2.5, clustering provides N+1 ingestion HA redundancy; if you lose two or more nodes you lose ingestion HA
- In Log Insight 2.0 and newer, during an upgrade the master node must be upgraded first
- In Log Insight 2.5, during an upgrade worker nodes must be upgraded one at a time, not in bulk
Note: If you are using a Log Insight cluster only to provide HA, you could use small-sized nodes; however, in production environments medium-sized nodes are the minimum recommended. This is because the number of CPUs allocated dictates the number of queries that can run at a time.
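The guidelines above can be sketched as a few small checks. This is only an illustration of the rules as stated in this post; the function names and node names are hypothetical and nothing here calls a Log Insight API.

```python
MIN_NODES_FOR_INGESTION_HA = 3  # guideline for Log Insight 2.5
MAX_TOLERATED_NODE_FAILURES = 1  # N+1 redundancy: losing two or more nodes loses HA


def has_ingestion_ha(cluster_size):
    """True if the cluster is large enough to provide ingestion HA."""
    return cluster_size >= MIN_NODES_FOR_INGESTION_HA


def survives_failures(cluster_size, failed_nodes):
    """True if ingestion HA still covers the given number of lost nodes."""
    return has_ingestion_ha(cluster_size) and failed_nodes <= MAX_TOLERATED_NODE_FAILURES


def upgrade_order(master, workers):
    """Return nodes in upgrade order: master first, then workers one at a time."""
    return [master] + list(workers)
```

For example, `upgrade_order("li-master", ["li-worker-1", "li-worker-2"])` yields the master followed by each worker in turn, which is the sequence to follow without batching the workers.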
2. Load Distribution
Leveraging a cluster makes it possible to handle load distribution better. For example, spikes in ingested events are expected, especially when issues are being experienced in the environment. If you are using a cluster for HA, you have resources available to handle a node going down; in other words, you have over-provisioned the cluster. The additional capacity can absorb spikes in ingestion as they occur.
In addition, individual nodes may experience performance problems from time to time. For example, the ESXi host a node runs on may be oversubscribed, so the node does not receive all the resources it requires and cannot keep up with the ingestion rate. The node can indicate when such a performance problem occurs. If you are leveraging the integrated load balancer (ILB), the node can inform the ILB to back off. If you are leveraging an external load balancer (ELB), the node has no way to notify the ELB of the issue.
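The over-provisioning argument is simple arithmetic, sketched below. The per-node rate is a placeholder you would take from the sizing guidance for your node size; the numbers in the usage note are hypothetical, not Log Insight maximums.

```python
def ingestion_headroom(node_count, per_node_eps, steady_state_eps):
    """Return spare events/sec as (all nodes up, one node down).

    A negative second value means the cluster cannot absorb the
    steady-state load after losing a node, so it is not sized for HA.
    """
    total_capacity = node_count * per_node_eps
    degraded_capacity = max(node_count - 1, 0) * per_node_eps
    return (total_capacity - steady_state_eps, degraded_capacity - steady_state_eps)
```

With three nodes at an assumed 2,000 events per second each and a steady load of 3,500, `ingestion_headroom(3, 2000, 3500)` returns `(2500, 500)`: 2,500 events per second are available for spikes while all nodes are healthy, and 500 remain if one node is down.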
3. Future Growth
You may only need a single node to start, but over time you may ingest more events and need to move to a cluster. When you do, you will need to leverage a load balancer, which means clients will need to connect to a different endpoint. You should be using an FQDN for clients, so the change should only be to DNS. Even so, if clients point to the FQDN of the master, either the FQDN the clients use or the FQDN of the master will need to change.
Of course, you can overcome these issues by properly architecting a Log Insight solution that accounts for future growth. If you start with a cluster, this is not a concern. Otherwise, consider leveraging a DNS CNAME that points to the master FQDN to make it easy to transition later.
I have covered Log Insight backup and recovery in depth. While a cluster is technically not a backup, it does provide some of the same benefits. For one, all configuration except for the SSL certificate is replicated across the cluster. This means that if you permanently lose a single node, you can recover all configuration information on that node from the other nodes in the cluster.
Note: This is not to say a cluster can be used as a backup strategy! For example, if you lose more than one node then some amount of configuration information will be lost. Proper backup is essential for production environments.
As you can see, there are a variety of reasons why you may wish to deploy a Log Insight cluster above and beyond scale. Understanding business requirements will ensure a properly architected Log Insight solution.
© 2015, Steve Flanders. All rights reserved.