A pretty common question I get asked is: what are the best practices when deploying and configuring Log Insight? While the information is available in the official documentation, there is no consolidated list. In this post, I will cover what I consider to be the best practices.
- Minimum of 3-node cluster: When deploying Log Insight in a production environment, use a minimum of three nodes. The reasons for this include ingestion HA, the ability to easily scale out, and the ability to easily perform maintenance. Note that 2-node clusters are not supported and do not provide ingestion HA.
- Minimum of medium-sized nodes: Log Insight supports a variety of node sizes, but for production environments I would use a minimum of medium-sized nodes, regardless of ingestion rate. The reason for this has to do with the way Log Insight maps queries to available CPUs. If you expect a high number of concurrent queries, then large-sized nodes should be used if possible. Note that when determining concurrent query load, user alerts need to be factored in as well. If you can support large, go large.
- All nodes the same size: Do not mix node sizes in a cluster. All nodes should be the same size, or performance will be limited by the smallest node.
- All nodes in the same datacenter: Log Insight does not support geo-clusters today. If you have multiple datacenters then you should deploy Log Insight forwarders in each datacenter — discussed below.
- All nodes in the same L2 network: The Log Insight Integrated Load Balancer (ILB) — discussed next — requires that all nodes be in the same L2 network. I explicitly separated out this best practice as L2 can technically be extended cross-datacenter.
- Use of the Integrated Load Balancer: To properly balance traffic across nodes in a cluster and to minimize administrative overhead, the ILB should be used. Note that use of an external load balancer is technically unsupported. Also note that the ILB should be configured even on single-node instances — all queries and ingestion should point to it so that a cluster can easily be supported in the future if needed.
- Local Active Directory domain controllers configured: Active Directory can be configured through the Administration section of the Log Insight UI. By default, Log Insight attempts to connect to any domain controllers available. If any domain controller is experiencing issues then you may experience slow logins into Log Insight — up to 20 minutes! You should follow KB 2079763 and specify the domain controllers closest to the Log Insight cluster.
- Use of Active Directory service account as the binding user: When configuring Active Directory in Log Insight, you need to provide credentials for a binding user. It is strongly recommended that you use a service account for this binding user, because if the binding user's credentials expire or the account gets locked out, no Active Directory users will be able to log in to the UI.
- Archiving cleanup script leveraged: If archiving is configured, then it is the user's responsibility to ensure that the archive destination is cleaned up and does not become full. Log Insight does not manage the archive destination today and will continue to attempt to archive sealed buckets. A simple cronjob is sufficient for this best practice — future post to come.
- Restrict the ports accessible to Log Insight via an external firewall: Per the Security Guide, Log Insight should sit behind an external firewall with only the required ports allowed into and out of the network.
- Use Log Insight forwarders for all client traffic: While you can send traffic directly to a Log Insight cluster, in a production environment I would strongly advise configuring forwarders in front of the central Log Insight instance. These forwarders should be deployed in every datacenter at a minimum — including the datacenter the central Log Insight instance is in. Further granularity may be defined — per network, per department, etc — if desired.
- Cluster forwarders: Log Insight forwarders should be clustered for ingestion HA — meaning a minimum of three nodes. In terms of sizing, small is sufficient, since the forwarders will not be used as the primary means of storing or querying ingested events.
- When adding storage to Log Insight, always add new virtual disks with a maximum size of 2 TB: Never attempt to extend an existing virtual disk, and never attempt to add a virtual disk greater than 2 TB. Note that as of Log Insight 2.5, each node supports a maximum of 2 TB.
- Configure DRS anti-affinity rules so that no two Log Insight nodes run on the same physical host: Log Insight 2.5 provides N+1 redundancy, so if you lose more than one node, the cluster may become unavailable.
- Never remove a node once it has been joined to a cluster: Removing a node is technically unsupported and may result in data loss. The only time the remove option should be considered is if a node is permanently lost.
- Never attempt to reduce the storage on any node: Log Insight supports increasing retention storage, but it does not support reducing the storage capacity. Attempting to reduce the storage capacity may result in data loss.
- Point all ingestion traffic to a FQDN (via DNS): If you want or need to change the IP address of the ingestion target in the future, it will be easier if everything points to a DNS record.
- Replace the self-signed SSL certificate with a trusted certificate: To avoid the additional configuration otherwise required for agents, event forwarding, and browser or API access, replace the self-signed SSL certificate with a trusted one.
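To illustrate the FQDN recommendation, this is what a syslog client configuration might look like — a minimal rsyslog forwarding rule, where the hostname is a placeholder for your cluster's DNS record:

```
# /etc/rsyslog.d/50-loginsight.conf
# Forward all messages to the Log Insight ILB via its DNS record (@@ = TCP).
*.* @@loginsight.example.com:514
```

If the ILB's IP address ever changes, only the DNS record needs updating — not every client.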
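A few of the items above lend themselves to quick scripts. For the Active Directory best practice, the following sketch checks LDAP reachability to each domain controller you plan to specify per KB 2079763 — the hostnames are placeholders for your environment:

```shell
#!/usr/bin/env bash
# Check that each domain controller Log Insight will use answers on the
# LDAP port (TCP/389). Hostnames below are placeholders.
probe_dc() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/389" 2>/dev/null
}

for dc in dc1.corp.local dc2.corp.local; do
  if probe_dc "$dc"; then
    echo "$dc: reachable"
  else
    echo "$dc: UNREACHABLE - fix before specifying it in Log Insight"
  fi
done
```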
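For the archiving cleanup item, here is a minimal sketch of the kind of cronjob involved, assuming a placeholder archive mount point of /mnt/li-archive and a 90-day archive retention window:

```shell
#!/usr/bin/env bash
# Prune a Log Insight archive destination: delete archived buckets older
# than the retention window, then remove any directories left empty.
cleanup_archive() {
  local dir="$1" retention_days="$2"
  find "$dir" -type f -mtime +"$retention_days" -delete
  find "$dir" -mindepth 1 -type d -empty -delete
}

# /mnt/li-archive and 90 days are assumptions - adjust for your environment.
if [ -d /mnt/li-archive ]; then
  cleanup_archive /mnt/li-archive 90
fi
```

Schedule it daily from root's crontab, for example `0 2 * * * /usr/local/bin/li-archive-cleanup.sh`.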
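And for the SSL certificate item, replacing the certificate involves uploading a single PEM file containing the private key and the full certificate chain. Below is a sketch of building and sanity-checking that bundle with openssl — the file names are placeholders, and the concatenation order (key, leaf, intermediates, root) is an assumption you should verify against the documentation for your version:

```shell
#!/usr/bin/env bash
# Concatenate the private key and certificate chain into one PEM bundle,
# then sanity-check that both the key and a certificate parse.
make_bundle() {
  cat "$@" > loginsight-bundle.pem
  openssl rsa  -in loginsight-bundle.pem -check -noout
  openssl x509 -in loginsight-bundle.pem -noout -subject
}

# Placeholder file names - substitute your own:
# make_bundle server.key server.crt intermediate.crt root.crt
```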
© 2015, Steve Flanders. All rights reserved.
3 comments on “Log Insight Best Practices: Server”
I have 4 nodes including the Master. Two are in a separate datacenter, I see this is not supported. How do I move these nodes? Would it just be better to delete? This is a Lab setup, not prod.
I also think I’m suffering from performance issues from this setup.
You could power them down and move them; however, you will likely need to change the IP addresses, which means you will need to reference my backup and recovery series (part 4 specifically). In general, I would not recommend deleting, as this may cause issues, but given it is a lab setup, you could also deploy and add two more nodes in the proper datacenter, then remove the old nodes one at a time, waiting about 15 minutes between removals. I hope this helps!