I have covered Log Insight DR in the past. In this post, I would like to discuss the operational process as well as the thought process of failover and failback. Read on to learn more!
Operational Process
Let’s say you have two separate Log Insight instances that are both receiving the same traffic. Typically, you would designate one instance as the primary and the other instance as the secondary. Why? Because if one instance goes down, it will be missing the events the other instance received during the outage, which may impact queries. This means you need to ensure users connect to the primary instance during normal operation and to the secondary instance during failover. How can this be accomplished? With DNS, of course! One of the easiest configurations is to give each instance VIP its own DNS A record (e.g., li1 and li2) and then create a CNAME record (e.g., li) that points to the desired instance. To ensure you can fail over quickly, the only thing to keep in mind is the TTL of the CNAME record (or the zone default it inherits): resolvers may cache the old target for up to that long, so a short TTL means a faster failover.
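To make the record layout concrete, here is a minimal sketch of the li / li1 / li2 setup described above. The domain, IP addresses, and TTL values are placeholders I made up for illustration, and the output only resembles zone-file syntax; adapt it to whatever DNS tooling you actually use.

```python
from dataclasses import dataclass

# Assumptions for illustration only: the example.com domain, the 192.0.2.x
# addresses, and the TTL values are placeholders, not values from this post.
DOMAIN = "example.com."
INSTANCE_VIPS = {"li1": "192.0.2.11", "li2": "192.0.2.12"}


@dataclass(frozen=True)
class Record:
    name: str
    ttl: int  # seconds a resolver may cache the answer
    rtype: str
    value: str


def build_records(active, cname_ttl=60):
    """Return A records for both VIPs plus the `li` CNAME aimed at `active`."""
    if active not in INSTANCE_VIPS:
        raise ValueError(f"unknown instance: {active!r}")
    records = [Record(name, 3600, "A", vip) for name, vip in INSTANCE_VIPS.items()]
    records.append(Record("li", cname_ttl, "CNAME", f"{active}.{DOMAIN}"))
    return records


def to_zone_lines(records):
    """Render records in a zone-file-like layout for review."""
    return [f"{r.name}\t{r.ttl}\tIN\t{r.rtype}\t{r.value}" for r in records]


normal = build_records("li1")    # normal operation: li -> li1
failover = build_records("li2")  # failover: li -> li2
print("\n".join(to_zone_lines(failover)))
```

Note that only the CNAME differs between the two record sets, which is what keeps the failover change small. Clients that already cached the old answer may keep using it for up to `cname_ttl` seconds.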
The most important thing to keep in mind from an operational perspective is that you need to ensure the configuration between your primary and secondary Log Insight instances stays in sync. For example, let’s assume a Super Admin user adds a new user or group to the primary Log Insight instance. How does this configuration change make it to the secondary Log Insight instance? If the two instances are not kept in sync, this may have serious implications during failover (e.g., someone cannot authenticate with Log Insight). The answer to this question depends on your operational process but may involve limiting the Super Admin list, leveraging proper change management techniques, and/or making all configuration changes via API calls.
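One way to catch drift before a failover is to compare the configuration of the two instances on a schedule. The sketch below assumes you have already exported each instance's configuration into a plain dictionary of section name to list of entries; how you obtain that export (API calls, change-management records, etc.) is up to you, and the section names shown are hypothetical.

```python
def config_drift(primary, secondary):
    """Return the sections whose entries differ between the two instances.

    Both arguments are assumed to be dicts mapping a section name
    (e.g. "users") to a list of entries exported from that instance.
    """
    drift = {}
    for section in sorted(set(primary) | set(secondary)):
        on_primary = set(primary.get(section, []))
        on_secondary = set(secondary.get(section, []))
        if on_primary != on_secondary:
            drift[section] = {
                "only_on_primary": sorted(on_primary - on_secondary),
                "only_on_secondary": sorted(on_secondary - on_primary),
            }
    return drift


# Hypothetical exports: a Super Admin added "bob" on the primary only.
primary_cfg = {"users": ["alice", "bob"], "groups": ["ops"]}
secondary_cfg = {"users": ["alice"], "groups": ["ops"]}

for section, diff in config_drift(primary_cfg, secondary_cfg).items():
    print(section, diff)
```

An empty result means the exported sections match; anything else is a change that still needs to be applied to the other instance.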
Thought Process
Like any DR situation, failover must be considered carefully. Assuming you follow the operational process above, the only delay to the DR process is the time it takes to propagate the DNS change. The problem becomes ensuring that the failover instance has all the data you expect it to. Let’s say you have 30 days of retention on both Log Insight instances, and on day 10, the secondary Log Insight instance goes down. On day 12, the secondary Log Insight instance comes back online. On day 20, the primary Log Insight instance goes down. (Note: this is a worst-case scenario and not something I would expect you to see typically.) If you choose to fail over on day 20, you will not have the complete retention that you expected, because the secondary is missing the events from days 10 through 12, but you will be able to access the data that is available. Having one instance up is still better than having no instance up while you determine the root cause of the outage.
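The arithmetic behind this scenario can be sketched as follows. This is a simplified model, not anything Log Insight provides: days are plain numbers, outage windows are (start, end) pairs that I assume do not overlap, and an instance is assumed to have ingested nothing while it was down.

```python
def missing_days(outages, today, retention_days):
    """Days of data an instance lacks inside its current retention window.

    outages: list of (start_day, end_day) pairs when the instance was down.
    Assumes the pairs do not overlap one another.
    """
    window_start = today - retention_days
    missing = 0
    for start, end in outages:
        # Clip each outage to the part that is still within retention.
        clipped_start = max(start, window_start)
        clipped_end = min(end, today)
        if clipped_end > clipped_start:
            missing += clipped_end - clipped_start
    return missing


secondary_outages = [(10, 12)]  # down on day 10, back on day 12

print(missing_days(secondary_outages, today=20, retention_days=30))  # 2
print(missing_days(secondary_outages, today=42, retention_days=30))  # 0
```

The second call shows that the gap eventually ages out: once the outage is older than the retention period, the instance is as complete as one that never went down.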
This same scenario applies to failback as well. Do you really want to fail back to a Log Insight instance that is missing recent events? Probably not. This means if you do decide to fail over from your primary Log Insight instance to your secondary, then you should stay on your secondary instance (i.e., your secondary becomes your primary and vice versa). Of course, any configuration changes made during a failover operation will need to be properly reflected on the other Log Insight instance once it is back online. In general, I would advise against configuration changes while one Log Insight instance is unavailable to limit the possibility of configuration drift.
© 2017 – 2021, Steve Flanders. All rights reserved.