A while back, I came across an entry from Michael White’s newsletter about changing VMware ESXi host logging levels:
Changing VMware ESXi host logging level
Someone was talking about doing this, and using this as a guide, but I would like to say you may not want to do this. There may be a reason for the log levels, and if you change them it may be harder to support you if you call VMware for help. And your syslog of choice should be able to handle the volume and I know – in fact better than most – that ESXi logs are noisy, but you can deal with that with good searching.
I completely agree with Michael and this post will explain why.
Developers configure logging on a device so that it outputs the information necessary to troubleshoot a problem. In terms of logging verbosity, the most common levels are (a minimal sketch after the list shows how these levels typically nest):
- error
- warning
- information
- debug
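To make the hierarchy concrete, here is a minimal sketch using Python's standard logging module, purely as an illustration (ESXi uses its own level names, including verbose and trivia): a logger set to a given level keeps that level and everything more severe, and drops everything below it.

```python
import logging

# Minimal illustration of the severity hierarchy: a logger set to INFO
# emits info, warning, and error messages but suppresses debug messages.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("example")

log.debug("per-request cache details")           # dropped at INFO
log.info("snapshot operation started")           # kept
log.warning("datastore latency above threshold") # kept
log.error("failed to reach syslog collector")    # kept
```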
In my opinion, every device should be set to the information level. The reasons for this are straightforward:
- If you get an error/warning message, it is often too late to prevent the issue
- If you get an error/warning message, you may not be able to determine the root cause without the information-level logs
- Some important logs are recorded at the wrong severity (e.g. I have seen error messages logged as info and info messages logged as error)
- Error/warning messages may occur so infrequently that you cannot tell whether remote logging is working properly
So why do people choose the error or warning verbosity level? Some common reasons are:
- Too much information
- Too much storage space required to keep the messages
- Too expensive to query the information (e.g. products that charge per GB of logs)
- Too difficult to find the important logs (usually due to lack of an enterprise logging solution)
- Logs are used as a monitoring tool
- Only care about error and warning events (even though root cause analysis is still expected)
- Unless an error/warning is seen, logs are not/rarely analyzed
While I understand the reasons, I have lived in the operations world and know that being able to troubleshoot issues quickly and completely is critical. Storage is cheap, and powerful analytics products such as Log Insight encourage, through their licensing models, collecting more log data and analyzing it quickly. If you look specifically at ESXi, it defaults to information logging and such information is critical in support bundles that VMware support uses to troubleshoot problems.
So the next time you hear someone ask how to reduce the verbosity of logs, ask them why they want to do it. Challenge them to defend their reasons and encourage them to look into other solutions if cost or other constraints are getting in the way. The amount of information in an environment grows every day, and reducing that information by dropping it is not the solution.
UPDATE: VMware has several KB articles on log verbosity, and I would like to call out http://kb.vmware.com/kb/1004795. It states that you should not increase the verbosity level (unless requested by VMware support) as doing so can cause issues (e.g. increased resource utilization, a full disk, etc.). Unfortunately, to my knowledge VMware does not have a KB stating that you should not reduce the verbosity level, likely because doing so will not cause any problems in your environment per se. The issue with reducing verbosity is that VMware support may not be able to complete troubleshooting of an issue you report. As such, they may require you to adjust the verbosity level and attempt to reproduce the issue. When you have a critical support case open with VMware, the last thing you want to do is make configuration changes, attempt to reproduce a problem, upload new support bundles to the case, and wait for an update. Again, leave the verbosity level alone.
© 2014, Steve Flanders. All rights reserved.
One detail that would affect your above advice: the default logging level on ESXi 5.1 for the vpxa and hostd logs is “verbose” not “info”. Also, wouldn’t it be a worthy option to set the local logging level to “info”, but the syslog level to “warning”? That way you can still use the local logs for forensic troubleshooting, and not send 8 million messages to a syslog collector per day, which happened in our organization.
Short answer: I would not recommend it
Long answer: I agree that having the logging level for ESXi 4.x and newer set to verbose by default is a bit much, but ESXi is a complex system and some of the logs needed to troubleshoot issues are only found at verbose rather than info. You can call this a limitation of ESXi (I do), but it is what it is. This means that you should NOT change the local logging level from verbose to info, as doing so may make it impossible to troubleshoot issues and VMware support may be unable to help you when you have a problem.
As for syslog logging, it comes down to requirements. If you want alerting of issues as they happen, then "warning" may be sufficient for your needs. If you want information such as the number of datastores in your environment, how much vMotion network traffic you are consuming, and what VM snapshot operations have been performed, then "warning" is not for you. In addition, if you want to correlate events across systems in your environment, you need more than "warning". Eight million messages is a lot, but the events are small, so they do not consume much network bandwidth; they do consume disk space (on average 250 MB a day), but disk is cheap. The biggest issue with eight million events is how you query through them and find the needle in the haystack. This is why you need a tool like Log Insight that can match patterns, determine schema, extract fields, provide predefined queries, and more.
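To give a rough idea of what I mean by extracting fields (this is a toy sketch, not how Log Insight actually works), even a simple regular expression can turn a flood of text lines into queryable fields. The line layout and the sample message below are assumptions for illustration only; real ESXi log lines vary by version and component.

```python
import re

# Toy field extraction: split a syslog-style line into queryable pieces.
# The layout (timestamp, hostname, process, message) is assumed for
# illustration; real ESXi log lines differ by version and component.
LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<hostname>\S+)\s+(?P<process>[\w.\-\[\]]+):\s+(?P<message>.*)$"
)

sample = "2014-03-01T10:15:42Z esx01.example.com Hostd: info hostd[1234] Task completed"

match = LINE.match(sample)
if match:
    fields = match.groupdict()
    # Once messages become fields, questions like "all warnings from esx01
    # in the last hour" are a query instead of a manual search.
    print(fields["hostname"], fields["process"], "->", fields["message"])
```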
I realize this is an older thread... I like what Brad suggests, but I don't see how you can set the log levels differently for local vs. remote logging on ESXi v5 or above. Anyone?
It is not possible/supported to set different log levels for local logging vs. syslog. Also, changing from verbose to info is not recommended, as it will prevent GSS from being able to diagnose problems in your environment.
We're receiving approx. 20,000 events an hour with the verbose default level, which really isn't supportable as it's causing network congestion and packet loss for other syslog applications. So we'll be dropping it to "info", which will hopefully remove a lot of the noise whilst keeping the error/warning messages plus info data for troubleshooting.
Thanks for the comment! I hear this a lot, but usually the issue is the environment, not the verbosity level. Let me explain: 20,000 events an hour / 60 minutes in an hour / 60 seconds in a minute is roughly 5.6 events per second. At an average size of 200 bytes per event, that is roughly 1,100 bytes a second per ESXi host. Let's assume you have 10,000 ESXi hosts, so 10,000 * 1,100 is about 11,000,000 bytes per second. Convert from bytes into bits and you get about 0.09 gigabit per second. Unless we are talking about a remote office connection, I assume it is fair to say most people have at least a 1 gigabit connection, so even this very large ESXi deployment is consuming less than a tenth of the link; a more typical deployment of a few hundred hosts would use well under one percent. Note all of this assumes full duplex, but I figure most users leverage full duplex today.
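If you want to redo that back-of-the-envelope math with your own numbers, here is the same calculation as a few lines of Python; the event rate, message size, and host count are just the assumptions from the paragraph above, so substitute your own figures.

```python
# Back-of-the-envelope syslog bandwidth estimate; all three inputs are
# assumptions from the discussion above, so adjust them to your environment.
events_per_hour = 20_000     # per ESXi host
avg_event_bytes = 200        # assumed average message size
hosts = 10_000               # deliberately very large deployment

events_per_second = events_per_hour / 3600
bytes_per_second = events_per_second * avg_event_bytes * hosts
gigabits_per_second = bytes_per_second * 8 / 1e9

print(f"{events_per_second:.1f} events/s per host")      # ~5.6
print(f"{bytes_per_second / 1e6:.1f} MB/s aggregate")    # ~11.1
print(f"{gigabits_per_second:.3f} Gbit/s on the wire")   # ~0.089
```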
Steve –
Syslog has always been used at different levels; that is called giving power to the administrators. VMware should allow administrators the ability to readily modify logging levels as they see fit without jumping through hoops, period.
Hey Amir — Thanks for the comment. VMware does allow users to change the verbosity level as desired (e.g. https://blogs.vmware.com/vsphere/2013/04/new-component-based-logging-for-hostd-in-esxi-5-1-update-1.html); however, allowing this functionality and recommending changing the defaults are two different things. You are free to change the verbosity level, but doing so means that anyone looking at the logs (e.g. GSS) may not be able to troubleshoot the problem. VMware recommends not changing the default verbosity level for this reason.
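For what it is worth, if you want to see what a host is currently set to before making any decisions, something like the following pyVmomi sketch can read the advanced setting. Treat it as a rough example rather than an official procedure: the connection details are placeholders, and the option name Config.HostAgent.log.level is the hostd setting I am aware of on 5.x-era hosts, so verify it against your ESXi version.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Rough sketch: read the current hostd log level from every ESXi host.
# Connection details are placeholders; the option name below is an assumption
# based on 5.x-era hosts, so confirm it for your ESXi version.
OPTION = "Config.HostAgent.log.level"

ctx = ssl._create_unverified_context()  # lab convenience only; validate certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        # QueryOptions returns the OptionValue objects matching the given key.
        for opt in host.configManager.advancedOption.QueryOptions(OPTION):
            print(f"{host.name}: {opt.key} = {opt.value}")
finally:
    Disconnect(si)
```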
“VMware recommends not changing the default verbosity level for this reason.”
Except for the fact that their documentation states verbose logging is not recommended for production environments.
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004795
“VMware does not recommend enabling trivia/verbose logging in a production environment”
With our View and Server environments we were generating over 100K events per minute. Our current disk consumption rate is 70 GB per day. It just didn't make sense to maintain that level of logging.
Hey Bryan — Thanks for the comment. I think you misread my comment. I recommended not changing the default verbosity level (not changing the level to verbose). In the post I stated, “If you look specifically at ESXi, it defaults to information logging and such information is critical in support bundles that VMware support uses to troubleshoot problems.” The point is that you should not decrease the verbosity level, because you may lose important information, and you should not increase it, because of the resource concerns mentioned in the update above. I agree that devices can be chatty and generate a lot of logs; however, disk is (typically) cheap and, depending on the business requirements, logging is critical. In your case you may have reduced the logging rate, but that may also prevent GSS from troubleshooting any support bundles you upload — it is always a trade-off.
Thanks Steve. I have numerous syslog client cases open right now where we are having a devil of a time simply reducing traffic.
Here is an example of a case I just opened up: # 17354804401. There seems to be a lot of different syslog issues with the various ESXi versions. I don’t exactly understand why there isn’t simply a global switch that can immediately reduce the syslog level across all sources.
Just doing a Google search on "ESXi syslog" reveals the travails people go through simply to configure or reduce logging. Forgive the whining, and thanks for your help.
It is an industry standard, so I understand the frustration. Today the storage cost associated with verbose logging is often not the issue; the issue is the ability to quickly find the needle in the haystack. Products like LI help, but it is a difficult problem to solve.
The level of logging for various products is ridiculous, vSAN and vCenter in particular. Today I found that my little 3-host cluster was generating literally millions of messages every 5 minutes, causing the Log Insight VM to peg its CPU trying to process it all. This creates issues of its own because I can't troubleshoot the root cause until I troubleshoot how to deal with the firehose of messages. I probably don't have much option but to disconnect vCenter from Log Insight to troubleshoot the problem. How much sense does that make? The tool that is supposed to help me deal with the logs can't do the job it is designed for. Under more normal circumstances, vSAN and vCenter dominate the other sources being ingested by Log Insight, making information from those other sources harder to surface.
Maybe people wouldn’t need to have support troubleshoot as many issues if they would make logging more useful. Millions of log messages in under 5 minutes is a dump, not a log.