Optimizing DNS Time-to-Live (TTL) Settings for Performance and Reliability

July 17, 2024

In the realm of IT, where every millisecond counts, understanding the nuances of DNS (Domain Name System) settings can significantly impact your network’s efficiency and user experience. One crucial aspect of DNS management is the Time to Live (TTL) setting.

DNS TTL refers to the duration that a DNS record is cached by a DNS resolver or server before it must query the authoritative DNS server again. Essentially, it’s the “expiry date” for the cached information.
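
To see the TTL in practice, you can query a record and read the value that comes back with the answer. Below is a minimal sketch using the third-party dnspython library (an assumption on our part; any DNS client, such as dig, exposes the same field):

```python
# Minimal sketch: query an A record and read the TTL returned with the
# answer. Requires the third-party dnspython library (pip install dnspython).
import dns.resolver

answer = dns.resolver.resolve("vercara.com", "A")
print(f"TTL: {answer.rrset.ttl} seconds")
for rdata in answer:
    print(rdata.address)
```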

DNS TTL settings play a critical role in network performance and reliability as well as the user experience. Understanding and optimizing your DNS TTL settings can make the difference between a seamless user experience and one riddled with delays and disruptions. In the following sections, we’ll explore best practices for setting your DNS TTL, helping you strike the perfect balance between performance and flexibility.

DNS fundamentals.

DNS is a hierarchical database, with data spread across a large number of authoritative servers. In general, end users never communicate with these authoritative servers directly; instead, they talk to recursive servers when they need information. On the authoritative side, the data is arranged hierarchically, organized by domain names. At the top, the root servers provide a starting point, but they only know the names and addresses of the servers that handle the top-level domains (TLDs) such as .com, .us, .bank, etc.

The TLD servers know about the next level down. For example, if you want to look up www.vercara.com, the set of servers responsible for .com can direct you to any of half a dozen nameservers within UltraDNS that are responsible for vercara.com. In general, a recursive server would have to ask the root servers, which would return a reference to the servers that know about .com, which in turn would return a reference to the UltraDNS servers responsible for vercara.com, which would finally return information about www.vercara.com. You can already see one reason that recursive resolvers are helpful: they save a lot of complexity for the end user’s machine!
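
For illustration, here is a teaching-aid sketch of that referral chain using dnspython, starting at a root server (198.41.0.4 is a.root-servers.net). It omits error handling, TCP fallback, and DNSSEC, and it cheats by using the system resolver when a referral carries no glue address, so treat it as a sketch rather than a real iterative resolver:

```python
# Follow DNS referrals from the root down to the authoritative answer.
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

server = "198.41.0.4"  # start at a root server (a.root-servers.net)
query = dns.message.make_query("www.vercara.com", dns.rdatatype.A)

while True:
    response = dns.query.udp(query, server, timeout=5)
    if response.answer:  # the authoritative server answered the question
        print(response.answer[0])
        break
    # Otherwise it's a referral: take a nameserver from the authority
    # section and look for its "glue" address in the additional section.
    ns_name = response.authority[0][0].target
    glue = [rr for rrset in response.additional for rr in rrset
            if rrset.name == ns_name and rrset.rdtype == dns.rdatatype.A]
    if glue:
        server = glue[0].address
    else:
        # No glue (the nameserver lies outside this zone); as a shortcut,
        # ask the system recursive resolver for its address.
        server = dns.resolver.resolve(ns_name, "A")[0].address
    print(f"referred to {ns_name} ({server})")
```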

The importance of TTL settings.

The other reason for having end-user machines use recursive resolvers is that the results of queries can be cached. If you use a large recursive resolver, chances are that it already has references to the first few steps in its cache. For example, it’s likely to already know the servers responsible for .com; if you’re looking for information about widely used hosts (e.g., facebook.com, amazon.com, etc.), the information you need may already be in the cache as well.

If the data you’re requesting is already in cache, it can save the recursive servers work, and they can provide the information you need faster. Most end users use a recursive resolver provided by their ISP (e.g. Comcast users might use Comcast’s recursive servers). Others may choose to use recursive resolvers provided by widely known networking companies (e.g. Google’s 8.8.8.8).
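
You can make this caching visible with a small experiment (dnspython again): query a public recursive resolver twice and watch the cached TTL count down. Note that large anycast services may route the two queries to different cache nodes, so the effect is not guaranteed on every run:

```python
# Query a public recursive resolver twice; a cache hit returns the
# remaining TTL, so the second value should be about 10 seconds lower.
import time
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]  # Google's public recursive resolver

first = resolver.resolve("www.vercara.com", "A")
time.sleep(10)
second = resolver.resolve("www.vercara.com", "A")

print(f"first TTL:  {first.rrset.ttl}")
print(f"second TTL: {second.rrset.ttl}")
```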

Businesses will often deploy their own DNS recursive resolvers or use a recursive service, such as Vercara’s UltraDDR, to provide screening of queries to proactively block requests that are considered risky. This includes requests associated with malware or requests that fall outside corporate policy, such as those targeting gambling or adult industry sites.

As you can guess, caching reduces the workload on large recursive resolvers. It also reduces the number of queries that they send to authoritative servers, reducing DNS overhead overall. However, the question remains: how long should content be cached?

Some DNS information is quite static and can be held for a long time. Other information is more dynamic and should only be held for a short period. For example, many CDN providers want to switch server addresses quickly when a server has problems, or to shift traffic for load balancing, so they prefer that caches expire quickly.

Unfortunately, these goals conflict: reducing load on recursive resolvers argues for long cache times, while being able to switch the data for a particular domain name quickly argues against them, since a cached response may be out of date after a server address changes. And, in general, only the people responsible for a particular domain name know how the trade-off affects them: they understand how often their applications change IP addresses and how much downtime will cost while obsolete addresses linger in caches after a switch.

To solve that issue, DNS records have a field called the Time to Live (TTL). It specifies the maximum amount of time a recursive resolver is allowed to cache the record. That way, the DNS administrator for the domain has some control over how long recursive servers cache the data.
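
In standard zone-file syntax, the TTL is the numeric field on each record, with $TTL supplying a default. The snippet below is an invented example zone, parsed with dnspython purely to show the per-record TTLs an administrator controls:

```python
# Parse an invented example zone and print each record set's TTL.
import dns.rdatatype
import dns.zone

ZONE_TEXT = """
$ORIGIN example.com.
$TTL 3600
@    86400 IN SOA ns1.example.com. hostmaster.example.com. 2024071701 7200 900 1209600 3600
@    86400 IN NS  ns1.example.com.
www    300 IN A   192.0.2.10
"""

zone = dns.zone.from_text(ZONE_TEXT, origin="example.com.")
for name, node in zone.nodes.items():
    for rdataset in node.rdatasets:
        print(name, rdataset.ttl, dns.rdatatype.to_text(rdataset.rdtype))
```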

Finding the best TTL settings.

Intuition says that DNS administrators would use short TTLs for everything, but that approach has costs. One is that most major DNS providers bill by the query, which can have a significant impact on customers. For example, one customer reported that their DNS bill had doubled even though, they claimed, nothing had changed. Most of their queries were for addresses such as www.theirdomain.com or theirdomain.com, so the investigation focused on those. The records had 30-second TTLs, which meant that recursive resolvers kept them cached only briefly before having to look them up again. It turned out that the customer had been working on a new web server and wanted to be able to update the address quickly, so they had lowered the TTL for those two records from the original 3,600 seconds (about an hour) to 30 seconds while they did the work.

They forgot to set it back to 3,600 seconds when the work on the server was finished, causing a huge increase in queries and the resulting large DNS bill. Once the TTLs were reset to 3,600 seconds, their query count immediately dropped by half.
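
A back-of-the-envelope model shows why TTL changes move bills. The resolver count below is invented purely for illustration; real traffic depends on how many distinct caches ask for the name and how popular it stays:

```python
# Rough upper bound on authoritative query volume as a function of TTL:
# every caching resolver refreshes the record once per TTL, all day.
def daily_queries(num_resolvers: int, ttl_seconds: int) -> int:
    return num_resolvers * (86400 // ttl_seconds)

for ttl in (30, 900, 3600):
    print(f"TTL {ttl:>5}s -> ~{daily_queries(10_000, ttl):,} queries/day")
# TTL    30s -> ~28,800,000 queries/day
# TTL   900s -> ~960,000 queries/day
# TTL  3600s -> ~240,000 queries/day
```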

Different DNS record types, as well as records associated with specific domains, have different optimal TTLs. Some record types are rarely changed, and even more rarely changed without advance planning. Others may change frequently.

For example, NS records, which list the nameservers responsible for a zone, almost never change without advance planning. As a result, it rarely makes sense to set the TTL for NS records very low; 3,600 seconds (one hour) to 86,400 seconds (one day) is the generally reasonable range.

TTLs longer than a day could interfere with an emergency migration to a new provider, while TTLs shorter than an hour typically mean that you’re paying for too many repeat queries. In general, unless you’re in the process of moving to a new provider, a day is probably the most reasonable value.

Other examples of records that can typically have relatively long TTLs are MX and SRV records. While they are important record types, both allow backup servers to be preconfigured, which is a much better way to prepare for a server failure than planning to change the record in an emergency.

Other record types, such as A, AAAA, and sometimes CNAME, may need to be changed quickly in an emergency. In general, TTLs for important records should match the time it would take to fix a problem. That's rarely less than about 15 minutes if the fix involves administrative intervention, since someone has to notice that a web server is down, provision a backup, turn it on, and so on.

When servers are automatically monitored and updated to switch to other servers that are already available, it can make sense to have much lower TTLs. This will cost more because of the higher query rate, but that can be a small cost compared to an extended outage.

For records that are monitored pools (e.g., UltraDNS SiteBacker, Traffic Controller, Simple Load Balancing, etc.) or records that are monitored and automatically updated (through an API or another rapid mechanism), TTLs as low as 30 seconds can be appropriate. Even in those cases, it is important to consider the entire ecosystem. If you are using a pool that probes every 15 minutes, a TTL of 10 to 15 minutes makes more sense than 30 seconds.
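
One way to encode that rule of thumb is to derive the TTL from the probe interval. The helper below is hypothetical (the 0.9 factor and 30-second floor are invented defaults, not a vendor recommendation), but it captures the idea of keeping the TTL at or slightly below the monitoring interval:

```python
# Hypothetical helper: tie a monitored record's TTL to the probe interval
# instead of defaulting to a very low value like 30 seconds.
def suggested_ttl(probe_interval_seconds: int, floor: int = 30) -> int:
    """Return a TTL slightly below the probe interval, never below the floor."""
    return max(floor, int(probe_interval_seconds * 0.9))

print(suggested_ttl(900))  # 15-minute probes -> 810-second TTL
print(suggested_ttl(20))   # very aggressive probing -> clamped to 30 seconds
```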

Different record types have different recommended TTLs based on industry best practices. In the table below, we have a recommended range of TTLs for various commonly used record types.

 

| Record Type    | Min TTL if not monitored              | Max TTL                  |
|----------------|---------------------------------------|--------------------------|
| A [1]          | 300 to 900 seconds (5 to 15 minutes)  | 86,400 seconds (one day) |
| AAAA [2]       | 300 to 900 seconds (5 to 15 minutes)  | 86,400 seconds (one day) |
| Apex Alias [3] | 900 seconds (15 minutes)              | 86,400 seconds (one day) |
| CNAME [4]      | 900 seconds (15 minutes)              | 86,400 seconds (one day) |
| MX [5]         | 3,600 seconds (one hour)              | 86,400 seconds (one day) |
| NS             | 3,600 seconds (one hour)              | 86,400 seconds (one day) |
| SOA [6]        | 3,600 seconds (one hour)              | 86,400 seconds (one day) |
| SPF [7]        | should not be used                    | should not be used       |
| SRV [8]        | 3,600 seconds (one hour)              | 86,400 seconds (one day) |
| TXT            | 3,600 seconds (one hour)              | 86,400 seconds (one day) |

NOTE: these are recommendations, based on personal experience and working with many UltraDNS customers. In some environments, TTLs outside these ranges may be preferable. Vercara consultants can help provide guidance based on your specific situation.

The importance of Min Cache.

There is another TTL-like value in DNS that can be important. It's the final parameter in the Start of Authority (SOA) record, called Min Cache. It controls the maximum time that a recursive resolver can remember a "no such record" answer. More precisely, a recursive resolver is allowed to "negative cache" a name for at most the minimum of the Min Cache value and the TTL of the SOA record.

Negative caching can decrease query counts and load on recursive resolvers, sometimes significantly. On a strictly cost basis, then, you would want this setting to be quite high. However, it also affects how long it takes for a newly added record to become visible, which argues for a relatively low Min Cache value. A reasonable compromise is somewhere in the 900 to 3,600-second range (15 minutes to one hour), with 3,600 as the more common setting.
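
You can inspect both values that bound negative caching for a zone by reading its SOA record. A minimal sketch with dnspython:

```python
# Read the SOA record's TTL and its final "minimum" (Min Cache) field;
# the lower of the two bounds how long "no such name" answers are cached.
import dns.resolver

answer = dns.resolver.resolve("vercara.com", "SOA")
soa = answer[0]
print(f"SOA TTL:   {answer.rrset.ttl} seconds")
print(f"Min Cache: {soa.minimum} seconds")
print(f"Negative-cache limit: {min(answer.rrset.ttl, soa.minimum)} seconds")
```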

Optimizing DNS performance with proper TTL settings.

Properly setting DNS TTL values is essential for optimizing DNS performance: it balances efficient query processing against the need to distribute updates promptly. By following the guidelines in this article and adjusting for specific use cases, DNS administrators can ensure their networks operate effectively and cost-efficiently.

To learn more about how to optimize the performance of your DNS, further secure your online presence, and get the latest cybersecurity news, visit our resources page.

Footnotes

[1] A records that are monitored and automatically updated, either using UltraDNS' monitored pools (SiteBacker, Traffic Controller, Simple Failover, etc.) or by the zone owner through the API or by sending notifies and doing zone transfers, can be lower; they should typically be about the same as or slightly below the monitoring interval.

[2] AAAA records that are monitored and automatically updated, either using UltraDNS' monitored pools (SiteBacker, Traffic Controller, Simple Failover, etc.) or by the zone owner through the API or by sending notifies and doing zone transfers, can be lower; they should typically be about the same as or slightly below the monitoring interval.

[3] The TTL returned for an Apex Alias record will be the lower of the TTL configured for the record and the TTL of the target record.

[4] CNAME records that are monitored and automatically updated, either using UltraDNS' monitored pools (SiteBacker, Traffic Controller, Simple Failover, etc.) or by the zone owner through the API or by sending notifies and doing zone transfers, can be lower; they should typically be about the same as or slightly below the monitoring interval.

[5] Best practice for MX records is to preconfigure one or more backup servers with appropriate priorities. Because of this fallback, the TTL for MX records can be relatively long. If you do not configure backup servers, it may be useful to lower the TTL, possibly to somewhere in the 300 to 900 second (5 to 15 minute) range.

[6] SOA records provide metadata about the zone. One critical value, the last field in the SOA record, called "Min Cache" in UltraDNS' portal, controls how long a recursive resolver can remember an NXDOMAIN response ("no such name exists"). The negative caching interval actually used will be the lower of the "Min Cache" field in the SOA and the TTL of the SOA.

[7] The "true" SPF record type is deprecated and should not be used. Use a TXT-style SPF record instead.

[8] Best practice for SRV records is to preconfigure one or more backup servers with appropriate priorities. Because of this fallback, the TTL for SRV records can be relatively long. If you do not configure backup servers, it may be useful to lower the TTL, possibly to somewhere in the 300 to 900 second (5 to 15 minute) range.
