If you have worked in software, you can probably relate to the fact that time is hard. In this post, I will provide some background on why time is hard and then focus on how to properly set up NTP so that you have accurate time in your environment, in an attempt to make time a little easier.
Why is Time Important?
Time is money, but when it comes to devices keeping time, why does it matter? There are a lot of reasons including:
- Troubleshooting and root cause analysis: when an issue happens, you want to be able to go back in time to pinpoint when it started, when it was resolved, and what happened around the same time
- Configuration management: knowing when changes were made
- Security: knowing who did what and when
Why is Time Hard?
Time is hard for a variety of reasons including:
- Different time zones: What time it is depends on what time zone you are in. When working with people in different time zones, talking about time and ensuring everyone is looking at the same time can be hard. Have a look at this post where I talk about the impact this has on log analysis and what Log Insight does about it.
- Different ways of representing time: There are several standards for time formats, and there are a ton of non-standard ways to represent time. When time is represented differently across systems, analyzing or correlating events can be hard. Have a look at this post where I talk about the impact this has on log analysis and what Log Insight does about it.
- Different ways of viewing time: While you may work primarily in one time zone you may travel to others. When you do, how do your devices handle the time change? Some change automatically, while others need to be manually changed. This makes time hard.
- Accurate time is important across systems: How do you know the time on your device is accurate? What happens if it is not accurate? What if you have two devices showing different times? Unless you have accurate time across devices, they may not work properly, which makes time hard.
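To make the time zone and format points above concrete, here is a minimal Python sketch. The instant, the fixed UTC-5 offset, and the syslog-style format string are illustrative assumptions of mine, not anything taken from Log Insight. It shows one instant rendered four ways, and why normalizing to UTC before comparing is the safe approach.

```python
from datetime import datetime, timedelta, timezone

# One arbitrary instant, defined in UTC (illustrative value).
instant = datetime(2015, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

# The same instant as seen from a fixed UTC-5 offset (no DST handling here).
local = instant.astimezone(timezone(timedelta(hours=-5)))

print(instant.isoformat())               # ISO 8601, UTC
print(local.isoformat())                 # ISO 8601, UTC-5
print(int(instant.timestamp()))          # seconds since the Unix epoch
print(local.strftime("%b %d %H:%M:%S"))  # syslog-style: no year, no zone

# Timezone-aware datetimes compare by instant, not by wall-clock text.
same_instant = local == instant
print(same_instant)
```

The last format is the troublesome one: without a year or a zone, the text alone does not tell you which instant it refers to.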
What is NTP?
NTP (Network Time Protocol) is a standard (see RFC 958, RFC 1305 and RFC 5905) for synchronizing clocks over a network. The idea was to create a protocol that was resilient to failures and accurate enough for most use cases. In its most basic configuration, it queries one or more upstream NTP servers and, based on a very specialized algorithm, determines which time is most accurate and updates the local time of the device if necessary. Of course, there are a lot of configuration options depending on your use case, including collecting time from peers and securing the connection used to collect time from other devices.
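As a rough illustration of the first step in that algorithm, the sketch below computes the clock offset and round-trip delay from the four timestamps of a single client/server exchange, using the formulas given in RFC 5905. The function name and the sample timestamps are my own, chosen only for illustration. A real NTP client does far more, including filtering many samples and comparing multiple servers.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Return (offset, delay) in seconds for one NTP exchange.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0  # how far the client is behind the server
    delay = (t4 - t1) - (t3 - t2)           # round trip minus server processing time
    return offset, delay


# Hypothetical exchange: the client clock is 0.5 s behind the server,
# each network leg takes 0.25 s, and the server takes 0.125 s to reply.
offset, delay = ntp_offset_and_delay(t1=100.0, t2=100.75, t3=100.875, t4=100.625)
print(offset, delay)
```

Note that the offset formula assumes the outbound and return network legs take about the same time. An asymmetric path shows up as an error in the computed offset.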
Why is NTP Hard?
Many people do not understand how NTP works. While the internal workings can remain a mystery to most people, how to configure NTP is extremely important and not well understood. To illustrate my point, I would ask a simple question: how many NTP servers do you configure your devices to point to? I suspect the top answers will be (in this order):
- I did not change the default configuration so whatever NTP comes with
- I configured two NTP servers
- I configured something else
If you gave the first answer, then you are likely OK, assuming your devices can actually connect to the NTP servers specified in the default configuration file (if they can access the Internet, then you are good). If you gave the second answer, then it is possible, and even likely, that your time is not consistently accurate. If you gave the third answer, then you likely know what I am about to talk about.
Proper NTP Configuration
Did you know that in order to ensure you have the best chance of having accurate time on your devices on a consistent basis you should have a MINIMUM of FOUR NTP servers configured? Let me explain:
- One NTP server: you will have accurate time on all your devices with one NTP server, but you have no redundancy for when the NTP server goes down. This is equivalent to N+0 redundancy.
- Two NTP servers: having two NTP servers does not provide redundancy because the client will not know which NTP server to trust – what if the time between the two NTP servers is different? Which one is right? While you may consider this N+1 redundancy, since NTP cannot determine which one to trust and it is possible that different devices will trust different servers this is worse than N+0 redundancy.
- Three NTP servers: having three NTP servers is the same as having one NTP server because while NTP will now be able to determine if one NTP server is incorrect (known as a falseticker), it will only have two NTP servers left and be unable to determine which one to trust if they disagree. While you may consider this N+1 redundancy, since NTP cannot determine which one to trust if one is determined bad it is still N+0 redundancy.
- Four NTP servers: having four NTP servers makes it possible for one NTP server to be determined “bad” (falseticker) and removed from consideration while still being able to determine the most accurate time based on the remaining three NTP servers. This is equivalent to N+1 redundancy.
- More than four NTP servers: this makes it possible to support more than one falseticker. This is equivalent to greater than N+1 redundancy.
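The counting argument above can be sketched with a toy majority vote. To be clear, this is not the real NTP selection algorithm, which is considerably more involved. It is a simplified stand-in where each server reports an offset and an error bound (in milliseconds), and a group of servers is trusted only if their intervals overlap and the group is a strict majority. The function name and the sample values are hypothetical.

```python
def agreeing_majority(sources):
    """sources maps a server name to (offset_ms, max_error_ms).

    Returns the sorted names of the largest group of servers whose
    intervals [offset - error, offset + error] share a common point,
    or [] if that group is not a strict majority of all servers.
    """
    intervals = {name: (off - err, off + err) for name, (off, err) in sources.items()}
    best = []
    # The point covered by the most intervals can always be found
    # at the lower edge of one of the intervals.
    for low, _ in intervals.values():
        group = [name for name, (a, b) in intervals.items() if a <= low <= b]
        if len(group) > len(best):
            best = group
    return sorted(best) if len(best) > len(sources) / 2 else []


good_a, good_b, good_c = (0, 10), (2, 10), (-3, 10)
bad = (500, 10)  # a falseticker that is half a second off

two = agreeing_majority({"a": good_a, "b": bad})
three = agreeing_majority({"a": good_a, "b": good_b, "c": bad})
four = agreeing_majority({"a": good_a, "b": good_b, "c": good_c, "d": bad})
print(two, three, four)
```

With two servers that disagree, no group reaches a majority, so there is nothing to trust. With three, the falseticker is outvoted, but only two servers remain. With four, the falseticker is outvoted and three agreeing servers are left.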
Now you may be saying, “but what about the prefer option?” If you are familiar with NTP, then you may know that it offers a prefer configuration option. The man page states:
Marks the server as preferred. All other things being equal, this host will be chosen for synchronization among a set of correctly operating hosts. See the Mitigation Rules and the prefer Keyword page for further information. This option is valid only with the server and peer commands.
This means you could have TWO NTP servers with one being preferred and get N+1 redundancy, right? WRONG. Did you read the Mitigation Rules like the man page references? Probably not because they are not available in the man page. Here is a link to the Mitigation Rules and here is an important excerpt:
In the prefer scheme the clustering algorithm is modified so that the prefer peer is never discarded; on the contrary, its potential removal becomes a termination condition. If the original algorithm were about to toss out the prefer peer, the algorithm terminates right there. The prefer peer can still be discarded by the sanity checks and intersection algorithms, of course, but it will always survive the clustering algorithm. The prefer peer is used as long as it survives the sanity checks and intersection algorithm. If it does not survive or for some reason it fails to provide updates, it will eventually become unreachable and the clock selection will remitigate to select the next best source.
In short, the preferred server will be used as the official time source only if it makes it to the clustering algorithm. Given the use of only two NTP servers, this is not guaranteed; in fact, it is a 50/50 shot. Still not convinced that you need four NTP servers? Have a look at this link and this link.
In addition, it is important to note that the NTP servers used by your devices should NOT be virtual machines. It is very common practice to use authentication devices like Microsoft Active Directory domain controllers as NTP servers. The problem is that many domain controllers today are virtual, and virtual machines should never be used if you wish to ensure accurate time.
Summary
As you can see, time is hard. Even when focusing just on NTP, the configuration is not as easy as it appears. When configuring NTP remember:
- Use a MINIMUM of FOUR NTP servers
- The prefer option can be used, but you still need a MINIMUM of FOUR NTP servers
- NTP servers specified should NOT be virtual machines
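If you want to check an existing configuration against the first two points, a small script can help. The sketch below is only that, a sketch: the function name is mine, the sample configuration is illustrative (substitute servers you actually trust), and it only looks at `server` and `peer` lines in ntp.conf-style text, so it ignores anything else that may add time sources.

```python
def list_time_sources(conf_text):
    """Return a list of (host, is_preferred) for server/peer lines."""
    sources = []
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 2 and fields[0] in ("server", "peer"):
            sources.append((fields[1], "prefer" in fields[2:]))
    return sources


# Illustrative configuration only; the hostnames are placeholders.
SAMPLE_CONF = """\
# upstream time sources
server 0.pool.ntp.org iburst prefer
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst
"""

sources = list_time_sources(SAMPLE_CONF)
if len(sources) < 4:
    print("Warning: fewer than four NTP servers configured:", len(sources))
else:
    print("OK:", len(sources), "NTP servers configured")
```

Note that the sample still lists four servers even though one is marked prefer, in line with the second point above.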
© 2015, Steve Flanders. All rights reserved.