Log Insight and fsck

Depending on how long you have been running Log Insight and how many times you have restarted it, you may notice that the virtual appliance eventually runs an fsck check during boot. Depending on how much retention you have configured on your virtual appliance, the fsck check could take minutes or hours to complete. In this post, I would like to talk about the fsck configuration of the virtual appliance and why automatic fsck checks are critical.

Default Configuration

In order to determine the default fsck configuration on the Log Insight virtual appliance, you need to check two things. First, what are the mount options:

# grep /storage /etc/fstab
/dev/data/var /storage/var ext3 defaults 0 2
/dev/data/core /storage/core ext3 defaults 0 2

The last number is the fsck pass number, with values defined as:

  • 0: Do not check the file system
  • 1: Check this file system first (reserved for the root file system)
  • 2: Check this file system after those with a pass number of 1 (all other file systems)
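
As an aside, if you want a quick look at the pass number for every entry in /etc/fstab, a one-liner along these lines should work (assuming the standard six-field fstab layout, it simply prints the mount point and the last field of each entry):

# awk '!/^#/ && NF {print $2, $NF}' /etc/fstab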

In this case, we know that the /storage partitions will both be checked. Next we need to determine how frequently they are checked:

# dumpe2fs -h /dev/mapper/data-core | grep -i 'mount count'
dumpe2fs 1.41.9 (22-Aug-2009)
Mount count: 3
Maximum mount count: 32

As you can see, a check is forced after 32 mounts and my instance is currently at 3 mounts.
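
Note that ext3 can also force a check based on elapsed time rather than mount count. To see whether a time-based interval is configured as well, something along these lines should work (shown for data-core; repeat for data-var):

# dumpe2fs -h /dev/mapper/data-core | grep -i 'check interval'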

Why run fsck?

Now the question becomes: why is the Log Insight virtual appliance configured to check the /storage partitions? For those familiar with fsck, you know it can both check for and repair file system issues. In the case of Log Insight, the data on the /storage partitions is the most important data on the appliance. Any corruption of this data may result in incomplete or lost events. While Log Insight has checks in place and can handle file system corruption, it is limited in the options it has available. Instead of reimplementing a file system checking tool, Log Insight offloads that responsibility to the virtual appliance. Log Insight does keep backups and checksums of the information it writes, but if it cannot access the data, or finds corruption in it, at some point it has to ignore that data.
Running fsck clears up any file system issues that may exist and gives Log Insight the ability to recover information that was previously corrupt. While this should be a rare occurrence, data integrity is a top priority for Log Insight users.
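
For illustration, here is roughly what an on-demand check of one of the /storage partitions would look like: stop Log Insight so the partition is not in use, unmount it, run fsck with -f (force a check) and -p (automatically fix safe problems), then remount and restart the service. This is only a sketch; the loginsight service name is an assumption, and fsck must never be run against a mounted file system:

# service loginsight stop
# umount /storage/core
# fsck.ext3 -f -p /dev/mapper/data-core
# mount /storage/core
# service loginsight start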

Why does fsck take so long?

The time fsck takes depends on two primary aspects:

  1. The amount of space that needs to be checked
  2. The amount of time it takes to repair any issues

With the default storage allocation for Log Insight, an fsck operation should be quick: I would say well under an hour and possibly under 10 minutes. At the maximum storage allocation, the operation will take much longer, easily over an hour.
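
To get a rough sense of the worst case for your own appliance, check how much space is allocated to, and used by, the /storage partitions, for example:

# df -h /storage/core /storage/var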

Why does the length of fsck matter?

For those reading this and thinking, what is the big deal? Consider the situation where you have a datacenter outage and lose all systems. For production environments you typically have an RTO and RPO to meet. You fix the underlying issue and power on all the systems, only to learn that Log Insight will take over an hour from the time it was powered on to start up. This may be a problem for your RTO and may also make troubleshooting issues during the recovery more difficult.
Now arguably, you can overcome this issue in a variety of ways, including:

  • Properly distributing Log Insight nodes across your infrastructure
  • Leveraging agents and forwarders everywhere
  • Configuring active-active DR for Log Insight

For those with less strict business requirements, the length of time Log Insight is down for an fsck operation is often just a nuisance, but at the same time it provides the data integrity that is critical for any log analysis product.

What can be done?

Today, the best option is to stick with the default configuration and know that Log Insight could be down for a while on restart. Ensure you have a proper architecture in place (reference architectures are coming in a future post) to meet business requirements, and the additional restart time should not be an issue.
Longer term, file system checks could be initiated by Log Insight itself and interactive prompting could be added.
Finally, there is the unsupported path of removing fsck checking.

DISCLAIMER: THIS IS FULLY UNSUPPORTED AND NOT RECOMMENDED FOR ANY ENVIRONMENT. PROCEED AT YOUR OWN RISK AND KNOW THAT PERMANENT DATA LOSS COULD OCCUR. I AM INCLUDING THIS INFORMATION BECAUSE IF I DO NOT SOMEONE ELSE WILL.

Given that the system is Linux, and that I have already provided the steps to confirm fsck checking is enabled, all that needs to be done is to disable this checking. There are several ways to achieve this (a quick verification sketch follows the list):

  1. Change the mount options in /etc/fstab to:
    /dev/data/var /storage/var ext3 defaults 0 0
    /dev/data/core /storage/core ext3 defaults 0 0
  2. Remove the check intervals:
    # tune2fs -c 0 -i 0 /dev/mapper/data-core
    # tune2fs -c 0 -i 0 /dev/mapper/data-var
  3. Skip check on restart:
    # shutdown -rf now
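
If you go the tune2fs route, you can verify that the counters were actually cleared. Something like the following should report a maximum mount count of -1 and a check interval of 0 (repeat for data-var):

# dumpe2fs -h /dev/mapper/data-core | grep -iE 'maximum mount count|check interval'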

Summary

Log Insight is configured to run fsck automatically to ensure data integrity. The fsck operation can take hours to complete depending on the amount of data in the repository. Proper Log Insight architecture should be in place to mitigate the restart time of the Log Insight virtual appliance when the fsck operation runs. Disabling the fsck operation is completely unsupported and may lead to permanent data loss.

© 2015, Steve Flanders. All rights reserved.

2 comments on "Log Insight and fsck"

I still think it is a major pain point for the solution. VMware should find a better way to deal with this. My experience tells me that 2 TB could easily take a full day/working day before it is done.

Thanks for the comment, Michael. I agree, though note this should be a very rarely occurring scenario.
