Log Insight 3.0 Agents: CLF Parser

The third agent parser I want to take a look at is the CLF parser. Read on to learn how it works!


How the Parser Works

The CLF format is very common for Apache HTTP events, but it turns out the format is extremely flexible and can be used for far more than Apache events. At first glance, the CLF parser may look daunting. First, you may think it is just for Apache events, because the CLF parser supports a “format” option, which defaults to the following:

For those familiar with Apache HTTP events the format should be recognizable. Second, you may feel overwhelmed by all the parameters the format option supports. From the official documentation:

‘%a’: “remote_ip”
‘%A’: “local_ip”
‘%B’, ‘%b’: “response_size”
‘%{VARNAME}C’: dependent on the name of variable specified in the format
‘%D’: “request_time_ms”
‘%E’: “error_status”
‘%{VARNAME}e’: dependent on the name of variable specified in the format
‘%F’, ‘%f’: “file_name”
‘%h’: “remote_host”
‘%H’: “request_protocol”
‘%{VARNAME}i’: dependent on the name of variable specified in the format
‘%k’: “keepalive_request_count”
‘%l’: “remote_log_name”
‘%L’: “request_log_id”
‘%M’: “log_message” (parser stops parsing of input log after reaching this specifier)
‘%m’: “request_method”
‘%{VARNAME}n’: dependent on the name of variable specified in the format
‘%{VARNAME}o’: dependent on the name of variable specified in the format
‘%p’: “server_port”
‘%P’: “process_id”
‘%q’: “query_string” (this is not generated by Apache, and might be excluded)
‘%r’: “request”
‘%R’: “response_handler”
‘%s’: “status_code”
‘%t’: “timestamp” will work as event timestamp on ingestion, engages timestamp parser. To override timestamp auto detection, date & time format can be specified in curly braces: %{%Y-%m-%d %H:%M:%S}t, see timestamp parser for more details.
‘%T’: “request_time_sec”
‘%t’: “date_time” (“timestamp” will work as event timestamp on ingestion)
‘%u’: “remote_auth_user”
‘%U’: “requested_url”
‘%v’: “server_name”
‘%V’: “self_referential_server_name”
‘%X’: “connection_status”
‘%I’: “received_bytes”
‘%O’: “sent_bytes”
‘%S’: “transferred_size”

Oh man… Again, for Apache logs the options are likely familiar. Each parameter names a field (the key), and whatever text sits at that parameter’s position in the event becomes the value.
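To make the specifier-to-key idea concrete, here is a minimal Python sketch that lists the field names a format string would produce. It is not the agent's code: the `field_names` helper is hypothetical, the lookup table is only a subset of the list above, and the Apache Common Log Format string is used purely as a familiar example.

```python
import re

# Subset of the specifier table above (specifiers with fixed field names).
FIELD_NAMES = {
    "h": "remote_host",
    "l": "remote_log_name",
    "u": "remote_auth_user",
    "t": "timestamp",
    "r": "request",
    "s": "status_code",
    "b": "response_size",
    "m": "request_method",
    "U": "requested_url",
    "M": "log_message",
}

# Specifiers whose field name comes from the {VARNAME} part.
VARNAME_SPECIFIERS = set("Ceino")

# Matches %x as well as %{...}x and captures the braces content and the letter.
SPECIFIER = re.compile(r"%(?:\{([^}]*)\})?([A-Za-z])")


def field_names(fmt):
    """Return the field name each specifier in fmt would produce, in order."""
    names = []
    for braces, letter in SPECIFIER.findall(fmt):
        if letter in VARNAME_SPECIFIERS:
            names.append(braces or None)  # None: no VARNAME was given
        else:
            names.append(FIELD_NAMES.get(letter, "?"))  # "?": not in this subset
    return names


print(field_names('%h %l %u %t "%r" %s %b'))
print(field_names("%t [%{component}i] %M"))
```

The second call shows a VARNAME specifier in action: the name inside the braces becomes the key.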

Basic Example

As it turns out, because each parameter names the field (key) and its position in the event determines the value, the CLF parser can be used for way more than Apache HTTP events. Let’s use a concrete example. Let’s say I have a log message like the following:

I could define the above parser as:

If we break down the format we have the following:

  • %t = timestamp (key) with a value of “2015-10-10 05:34:12+0400” — %t works just like the timestamp parser described in a later post
  • %{component}i = component (key) with a value of “com.loginsight.blah”
  • %m = request_method (key) with a value of “GET”
  • %U = requested_url (key) with a value of “/admin/users”
  • %M = log_message (key) with a value of “[Test message]”
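The breakdown above can be sketched in Python by turning a format string into a regular expression. Treat everything here as an assumption: the sample line and the format string are reconstructed from the values in the breakdown (using %m for the request method, per the specifier table), `format_to_regex` is a hypothetical helper, and the real agent certainly handles %t and the other specifiers more thoroughly than this.

```python
import re

# Fixed-name specifiers used in this example only.
FIXED = {"m": "request_method", "U": "requested_url"}
SPECIFIER = re.compile(r"%(?:\{([^}]*)\})?([A-Za-z])")


def format_to_regex(fmt):
    """Translate a CLF-style format string into an anchored regex (sketch)."""
    parts, pos = [], 0
    for m in SPECIFIER.finditer(fmt):
        # Literal text between specifiers (spaces, brackets) must match exactly.
        parts.append(re.escape(fmt[pos:m.start()]))
        varname, letter = m.group(1), m.group(2)
        if letter == "M":
            parts.append(r"(?P<log_message>.*)")      # rest of the event
        elif letter == "t":
            # Simplification: one date token, one space, one time token.
            parts.append(r"(?P<timestamp>\S+ \S+)")
        elif letter in FIXED:
            parts.append(rf"(?P<{FIXED[letter]}>\S+)")
        elif varname:
            parts.append(rf"(?P<{varname}>\S+)")       # key taken from VARNAME
        else:
            parts.append(r"\S+")                       # unnamed: matched, not kept
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("^" + "".join(parts) + "$")


# Reconstructed sample event and format (assumptions, not quotes).
line = "2015-10-10 05:34:12+0400 [com.loginsight.blah] GET /admin/users [Test message]"
fmt = "%t [%{component}i] %m %U %M"

fields = format_to_regex(fmt).match(line).groupdict()
for key, value in fields.items():
    print(f"{key} = {value}")
```

Each named group plays the role of a key, and the text it captures is the value, which is the whole idea behind the format option.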

This means each % parameter simply defines a specific key name for the field at its position in the event. C, e, i, o, and n are special parameters that require a VARNAME to be specified, and the VARNAME is used as the key. This means the above format could also be written as:

Now there are a few very important notes about the above format:

  • %t is a special case in that it can contain spaces between the date and the time. Again, this option behaves like the timestamp parser, which will be discussed in a future post.
  • The square brackets and spaces in the above format are REQUIRED. Again, you are defining the format of the event and it must match exactly.
  • %M is another special case in that it matches everything else to the end of the event. Replacing %M with %{log_message}i would not work.

The only parameters you really need for the CLF parser are %t and one of the VARNAME ones (I prefer %{VARNAME}i), because you can set VARNAME to whatever you want. What is also cool is that while you could exclude fields parsed via the CLF parser using the exclude_fields option, you can also do it by not specifying a VARNAME for the parameters that require it. For example, if I use the format:

Then I will only get the timestamp and log_message fields and the other fields will be ignored.
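To picture what “ignored” means, here is a hand-written Python regex standing in for a format shaped like `%t [%i] %i %i %M`. That format string is my guess at the shape, and the sample line is hypothetical. Unnamed positions are still matched, so the literal brackets and spaces remain mandatory, but nothing is captured for them.

```python
import re

# Hypothetical regex equivalent of a format like "%t [%i] %i %i %M".
PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) "  # %t: date and time
    r"\[\S+\] "                   # [%i]: must be present, value not kept
    r"\S+ \S+ "                   # %i %i: must be present, values not kept
    r"(?P<log_message>.*)$"       # %M: everything to the end of the event
)

line = "2015-10-10 05:34:12+0400 [com.loginsight.blah] GET /admin/users [Test message]"
fields = PATTERN.match(line).groupdict()
print(fields)

# Same event without the square brackets around the second token:
# the pattern no longer matches, so no fields come out at all.
no_brackets = "2015-10-10 05:34:12+0400 com.loginsight.blah GET /admin/users [Test message]"
print(PATTERN.match(no_brackets))
```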

IMPORTANT: The timestamp can be used as part of the timestamp parser — discussed in a future post — but it will not be an actual field in the Log Insight server.

Advanced Example

Let’s say I have a more complex log file containing the following:

How can I extract the contents within each set of square brackets? The answer is with the same format used above. The only caveat is that the log_message field will contain the closing square bracket within the value.

How about if the log file looks like the following:

How can I parse this one? With the CLF parser this is not possible. Wait, why? The CLF parser requires a known format, and that format is space sensitive. Given that the third set of square brackets has a dynamic number of space-separated values, a known pattern cannot be determined. Now if I did not care about the third and fourth sets of square brackets I could use %i to ignore them or %M to store both in the log_message field, but if I want to parse them then the CLF parser will not help. Instead, if I wish to parse all the square brackets in the above log file, I should use the CSV parser with the delimiter option set to the open and close square brackets.
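To see why splitting on the brackets copes where a fixed format cannot, here is a small Python sketch. It is not the CSV parser, only an illustration of delimiter-based splitting, and both sample events are hypothetical.

```python
import re


def bracket_fields(line):
    """Return the text inside each [...] group, in order."""
    return re.findall(r"\[([^\]]*)\]", line)


# Hypothetical events: the third group holds a varying number of words.
short_event = "[2015-10-10 05:34:12] [INFO] [one two] [done]"
long_event = "[2015-10-10 05:34:12] [INFO] [one two three four] [done]"

print(bracket_fields(short_event))
print(bracket_fields(long_event))
```

Both events yield four values because the brackets, not the spaces, mark the field boundaries.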

Summary

As you can see, the CLF parser is very flexible, but only works for events that follow a known format. Be sure to define the format option carefully, or else no fields will be sent. For other log formats, see my previous posts on other agent parsing options. Do you have a need for the CLF parser?

© 2015, Steve Flanders. All rights reserved.
