I have covered Log Insight reference architectures in the past, but I have received a few inquiries about large Log Insight deployments. In this post, I will cover a variety of different large Log Insight deployments and the reference architecture information you need to know. Read on to learn more!
Large Ingestion Rate
Log Insight has documented configuration maximums as well as a Log Insight calculator to assist with sizing. The question becomes, what if you have a non-standard ingestion rate? First, let me define what I mean:
- Events in non-standard format (e.g. XML): Just ingest as you normally would. No best practices here, just note that anything beyond syslog format requires custom parsing to leverage static fields. The Log Insight agent provides built-in parsing capabilities, or you can roll your own and send via the ingestion API.
- Ingestion rate over the configuration maximum: Nothing over the configuration maximums is officially supported, but in general you can expect to be able to increase the cluster size from 12 to 18 nodes if needed. You could also try increasing the CPU and memory above the configuration maximums if you have available resources. WARNING: Not officially supported. Proceed at your own risk.
- Events significantly larger than 200 bytes on average: This means your maximum ingestion rate will go down. Your events may also appear collapsed by default on Interactive Analytics. No changes can be made here today.
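To get a feel for how much the ingestion rate might drop with larger events, here is a rough Python sketch. It assumes the ceiling is governed by bytes per second, so the sustainable events per second shrink in proportion to the average event size. That proportional model, the function name, and the 5,000 EPS figure in the example are illustrative assumptions, not documented Log Insight behavior. Treat the result as a first-order estimate only.

```python
BASELINE_EVENT_BYTES = 200  # average event size mentioned above


def estimated_max_eps(rated_eps, avg_event_bytes, baseline_bytes=BASELINE_EVENT_BYTES):
    """Scale a rated events-per-second figure by average event size.

    Assumption: throughput is limited by bytes per second, so EPS falls
    linearly as events grow past the baseline size. Smaller events are
    not assumed to raise the rated figure.
    """
    if avg_event_bytes <= 0:
        raise ValueError("avg_event_bytes must be positive")
    if avg_event_bytes <= baseline_bytes:
        return float(rated_eps)
    return rated_eps * baseline_bytes / avg_event_bytes


# Hypothetical example: a node rated for 5,000 EPS receiving 800-byte events.
print(estimated_max_eps(5000, 800))  # 1250.0
```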
Large Query Rate
Large query rate can be broken down into two primary categories:
- Large number of user alerts
  - Ensure users are subscribing to alerts instead of cloning them
  - Ensure alerts are running as infrequently as possible (e.g. not all “on any match”). You can check and edit this on the /admin/alerts page
- Large number of concurrent queries
  - Use large size nodes: increasing the number of CPUs always helps here
  - Add more nodes: if you are near or above an average of 5,000 EPS per node, add a few more nodes to better distribute the query load
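The 5,000 EPS per node guideline can be turned into a quick node-count check. The Python sketch below is only an illustration: the function names are made up for this example, and because the guidance is to add nodes when you are near the threshold, the result is best read as a lower bound.

```python
import math

TARGET_EPS_PER_NODE = 5000  # threshold from the guideline above


def nodes_needed(total_eps, target_eps_per_node=TARGET_EPS_PER_NODE):
    """Smallest node count that keeps the average at or below the target."""
    if total_eps < 0 or target_eps_per_node <= 0:
        raise ValueError("total_eps must be >= 0 and target must be > 0")
    return max(1, math.ceil(total_eps / target_eps_per_node))


def extra_nodes(total_eps, current_nodes, target_eps_per_node=TARGET_EPS_PER_NODE):
    """How many nodes to add to the current cluster (0 if already enough)."""
    return max(0, nodes_needed(total_eps, target_eps_per_node) - current_nodes)


# Example: a 6-node cluster ingesting 36,000 EPS averages 6,000 EPS per node.
print(extra_nodes(36000, 6))  # 2
```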
Large Retention Requirements
In general, Log Insight is meant for real-time troubleshooting and root cause analysis. A typical maximum retention policy for Log Insight is 30 days. The maximum storage allowed assuming a 12-node cluster is 48 TB, which works out to 1.6 TB a day.
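The retention arithmetic above is simple division, and it can be handy to rerun it with your own numbers. A minimal Python sketch (the function names are illustrative, and compression or indexing overhead is not modeled):

```python
def daily_ingest_budget_tb(total_storage_tb, retention_days):
    """Storage available per day of retention, in TB."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return total_storage_tb / retention_days


def storage_per_node_tb(total_storage_tb, node_count):
    """Storage each node holds if it is spread evenly, in TB."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return total_storage_tb / node_count


# Figures from the text: 48 TB across 12 nodes, kept for 30 days.
print(daily_ingest_budget_tb(48, 30))  # 1.6
print(storage_per_node_tb(48, 12))     # 4.0
```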
If you require more retention than this, you will need to set up an event forwarder to duplicate the traffic to a separate Log Insight instance. On the second Log Insight instance, you could go above the configuration maximums (e.g. cluster size and/or storage per node). WARNING: Not officially supported. Proceed at your own risk.
Whether or not you exceed the configuration maximum for retention, you may need to run queries over a large time range (e.g. all time). These queries are typically expensive in Log Insight and could impact other users of the system. It is ideal if you can move all long-running queries to a dedicated Log Insight instance. This way all real-time queries (e.g. user alerts, interactive analysis, etc.) are not impacted by long-running queries. This means you would have a Log Insight instance that forwards to a secondary Log Insight instance.
Log Insight Forwarders
Log Insight forwarders should only be used to forward events to another destination (e.g. a central Log Insight instance). Assuming you are following this model, you can manually disable some services running on the forwarder to save resources. These changes will ONLY impact the forwarder instance. All changes are made from the /internal/config page (be sure to select the “Show all settings” checkbox) and modify existing settings rather than adding new ones to the configuration. WARNING: Not officially supported. Proceed at your own risk.
- Disable machine learning (i.e. the Event Types and Event Trends tabs on Interactive Analytics)
<spock> <leo enabled="false" /> <rex enabled="false" /> </spock>
- Disable aliasing (i.e. vSphere datastoreID to name mapping)
<alias> <enable-query-alias value="false" /> <enable-alias-learning value="false" /> </alias>
- Disable user alerts
<alerts> <suspend-alerts value="true" /> </alerts>
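Because these are edits to existing settings rather than new entries, it can help to see the three changes as attribute updates on an XML document. The Python sketch below is purely illustrative: it works on a hypothetical config excerpt with an assumed `<config>` root element, and the function name is made up. The real change is still made by hand on the /internal/config page.

```python
import xml.etree.ElementTree as ET

# (element path, attribute, new value) for each forwarder change listed above.
FORWARDER_OVERRIDES = [
    ("spock/leo", "enabled", "false"),
    ("spock/rex", "enabled", "false"),
    ("alias/enable-query-alias", "value", "false"),
    ("alias/enable-alias-learning", "value", "false"),
    ("alerts/suspend-alerts", "value", "true"),
]


def apply_forwarder_overrides(config_xml, overrides=FORWARDER_OVERRIDES):
    """Return config_xml with each existing setting updated in place.

    Raises KeyError if a setting is missing, since these are meant to be
    changes to existing elements, not additions.
    """
    root = ET.fromstring(config_xml)
    for path, attribute, value in overrides:
        element = root.find(path)
        if element is None:
            raise KeyError(f"setting not found: {path}")
        element.set(attribute, value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical excerpt; the <config> root and starting values are assumptions.
SAMPLE_CONFIG = (
    '<config>'
    '<spock><leo enabled="true" /><rex enabled="true" /></spock>'
    '<alias><enable-query-alias value="true" />'
    '<enable-alias-learning value="true" /></alias>'
    '<alerts><suspend-alerts value="false" /></alerts>'
    '</config>'
)

print(apply_forwarder_overrides(SAMPLE_CONFIG))
```

Failing loudly on a missing element is deliberate: silently adding a new element would contradict the "changes, not additions" rule.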
Large Instances in General
For large Log Insight instances, you can consider applying some, all, or none of the changes in the Log Insight Forwarders section above, depending on which features you are leveraging. Assuming the large central instances receive their traffic primarily from Log Insight forwarders (which is the best practice), you should consider disabling L7 load balancing on the central instance. L7 load balancing works to provide a fair share of ingestion across nodes. If Log Insight forwarder instances provide the majority of the ingestion, then the L7 load balancing has already happened on the forwarder and does not need to be repeated on the central instance. Disabling L7 load balancing on large Log Insight instances will significantly reduce network traffic and overall system load. WARNING: Not officially supported. Proceed at your own risk.
<importer> <peer-message-router-enabled value="false" /> </importer>
Note: The above change requires a manual restart of every node in the instance.
© 2017, Steve Flanders. All rights reserved.