D. Lawrence
Request for Comments: 8767
Updates: 1034, 1035, 2181
Category: Standards Track
ISSN: 2070-1721 Google
March 2020

Serving Stale Data to Improve DNS Resiliency

Abstract

This document defines a method (serve-stale) for recursive resolvers
to use stale DNS data to avoid outages when authoritative nameservers
cannot be reached to refresh expired data. One of the motivations
for serve-stale is to make the DNS more resilient to DoS attacks and
thereby make them less attractive as an attack vector. This document
updates the definitions of TTL from RFCs 1034 and 1035 so that data
can be kept in the cache beyond the TTL expiry; it also updates RFC
2181 by interpreting values with the high-order bit set as being
positive, rather than 0, and suggests a cap of 7 days.

1. Introduction

Traditionally, the Time To Live (TTL) of a DNS Resource Record (RR)
has been understood to represent the maximum number of seconds that a
record can be used before it must be discarded, based on its
description and usage in [RFC1035] and clarifications in [RFC2181].

This document expands the definition of the TTL to explicitly allow
for expired data to be used in the exceptional circumstance that a
recursive resolver is unable to refresh the information. It is
predicated on the observation that authoritative answer
unavailability can cause outages even when the underlying data those
servers would return is typically unchanged.

We describe a method below for this use of stale data, balancing the
competing needs of resiliency and freshness.

This document updates the definitions of TTL from [RFC1034] and
[RFC1035] so that data can be kept in the cache beyond the TTL
expiry; it also updates [RFC2181] by interpreting values with the
high-order bit set as being positive, rather than 0, and also
suggests a cap of 7 days.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

For a glossary of DNS terms, please see [RFC8499].

3. Background

There are a number of reasons why an authoritative server may become
unreachable, including Denial-of-Service (DoS) attacks, network
issues, and so on. If a recursive server is unable to contact the
authoritative servers for a query but still has relevant data that
has aged past its TTL, that information can still be useful for
generating an answer under the metaphorical assumption that "stale
bread is better than no bread."

[RFC1035], Section 3.2.1 says that the TTL "specifies the time
interval that the resource record may be cached before the source of
the information should again be consulted." [RFC1035], Section 4.1.3
further says that the TTL "specifies the time interval (in seconds)
that the resource record may be cached before it should be
discarded."

A natural English interpretation of these remarks would seem to be
clear enough that records past their TTL expiration must not be used.
However, [RFC1035] predates the more rigorous terminology of
[RFC2119], which softened the interpretation of "may" and "should".

[RFC2181] aimed to provide "the precise definition of the Time to
Live," but Section 8 of [RFC2181] was mostly concerned with the
numeric range of values rather than data expiration behavior. It
does, however, close that section by noting, "The TTL specifies a
maximum time to live, not a mandatory time to live." This wording
again does not contain BCP 14 key words [RFC2119], but it does convey
the natural language connotation that data becomes unusable past TTL
expiry.

As of the time of this writing, several large-scale operators use
stale data for answers in some way. A number of recursive resolver
packages, including BIND, Knot Resolver, OpenDNS, and Unbound,
provide options to use stale data. Apple macOS can also use stale
data as part of the Happy Eyeballs algorithms in mDNSResponder. The
collective operational experience is that using stale data can
provide significant benefit with minimal downside.

4. Standards Action

The definition of TTL in Sections 3.2.1 and 4.1.3 of [RFC1035] is
amended to read:

TTL  a 32-bit unsigned integer number of seconds that specifies the
     duration that the resource record MAY be cached before the
     source of the information MUST again be consulted. Zero values
     are interpreted to mean that the RR can only be used for the
     transaction in progress, and should not be cached. Values
     SHOULD be capped on the order of days to weeks, with a
     recommended cap of 604,800 seconds (7 days). If the data is
     unable to be authoritatively refreshed when the TTL expires, the
     record MAY be used as though it is unexpired. See Sections 5
     and 6 of [RFC8767] for details.

Interpreting values that have the high-order bit set as being
positive, rather than 0, is a change from [RFC2181], the rationale
for which is explained in Section 6. Suggesting a cap of 7 days,
rather than the 68 years allowed by the full 31 bits of Section 8 of
[RFC2181], reflects the current practice of major modern DNS
resolvers.
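
As an illustration only, the amended TTL interpretation could be sketched as follows. The function name effective_ttl and the constant name are hypothetical; the 604,800-second cap is the recommended value from the definition above, and a real resolver would normally make it configurable.

```python
MAX_CACHE_TTL = 604_800  # 7 days, the recommended cap (assumed default)


def effective_ttl(wire_ttl: int, cap: int = MAX_CACHE_TTL) -> int:
    """Return the TTL a cache might store for a TTL field read off the wire."""
    if not 0 <= wire_ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must fit in an unsigned 32-bit field")
    # The field is read as unsigned, so a set high-order bit just means
    # "very large" and is clamped like any other oversized value,
    # instead of being treated as 0.
    return min(wire_ttl, cap)
```

Under this sketch, effective_ttl(0x80000000) yields the cap rather than 0, and a TTL of 0 stays 0.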

When returning a response containing stale records, a recursive
resolver MUST set the TTL of each expired record in the message to a
value greater than 0, with a RECOMMENDED value of 30 seconds. See
Section 6 for explanation.
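
A minimal sketch of this rule, assuming a cache that stores an absolute expiry time per record. The function name response_ttl and its parameters are hypothetical, and rounding the remaining lifetime up to a whole second is a choice made for the example, not something this document specifies.

```python
import math

STALE_ANSWER_TTL = 30  # RECOMMENDED value for expired records


def response_ttl(expires_at: float, now: float,
                 stale_ttl: int = STALE_ANSWER_TTL) -> int:
    """Pick the TTL to place on a record in an outgoing response."""
    if stale_ttl <= 0:
        raise ValueError("the TTL on stale records must be greater than 0")
    remaining = expires_at - now
    if remaining > 0:
        # Unexpired record: report the remaining lifetime as usual.
        return math.ceil(remaining)
    # Expired record served as stale: never 0, never negative.
    return stale_ttl
```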

Answers from authoritative servers that have a DNS response code of
either 0 (NoError) or 3 (NXDomain) and the Authoritative Answer (AA)
bit set MUST be considered to have refreshed the data at the
resolver. Answers from authoritative servers that have any other
response code SHOULD be considered a failure to refresh the data and
therefore leave any previous state intact. See Section 6 for a
discussion.
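
The refresh rule can be sketched as below. The helper names and the dictionary-based cache are hypothetical. Responses without the AA bit (referrals, for example) are not covered by the rule above; this sketch simply does not count them as a refresh, which is an assumption.

```python
NOERROR = 0
NXDOMAIN = 3


def refreshes_cache(rcode: int, aa_bit: bool) -> bool:
    """True if a response counts as having refreshed the cached data."""
    return aa_bit and rcode in (NOERROR, NXDOMAIN)


def apply_response(cache: dict, key, rcode: int, aa_bit: bool, new_entry) -> bool:
    """Update the cache only on a refresh; otherwise keep previous state."""
    if refreshes_cache(rcode, aa_bit):
        cache[key] = new_entry
        return True
    # Any other RCODE (ServFail, Refused, ...) is a failure to refresh,
    # so whatever was cached before stays intact.
    return False
```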

5. Example Method

There is more than one way a recursive resolver could responsibly
implement this resiliency feature while still respecting the intent
of the TTL as a signal for when data is to be refreshed.

In this example method, four notable timers drive considerations for
the use of stale data:

* A client response timer, which is the maximum amount of time a
  recursive resolver should allow between the receipt of a
  resolution request and sending its response.

* A query resolution timer, which caps the total amount of time a
  recursive resolver spends processing the query.

* A failure recheck timer, which limits the frequency at which a
  failed lookup will be attempted again.

* A maximum stale timer, which caps the amount of time that records
  will be kept past their expiration.

Most recursive resolvers already have the query resolution timer and,
effectively, some kind of failure recheck timer. The client response
timer and maximum stale timer are new concepts for this mechanism.
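
For orientation, the four timers could be grouped into one configuration object, as in the sketch below. The class and field names are hypothetical. The defaults are drawn from the values this section suggests (1.8 seconds, 10 to 30 seconds, 30 seconds, and 1 to 3 days); picking the low end of each range is an assumption of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeStaleTimers:
    """All values in seconds; every field is meant to be configurable."""
    client_response: float = 1.8     # longest a client is kept waiting
    query_resolution: float = 10.0   # total work allowed per lookup
    failure_recheck: float = 30.0    # minimum gap between failed refreshes
    max_stale: float = 86_400.0      # how long expired records are kept


timers = ServeStaleTimers(max_stale=3 * 86_400.0)  # keep stale data 3 days
```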

When a recursive resolver receives a request, it should start the
client response timer. This timer is used to avoid client timeouts.
It should be configurable, with a recommended value of 1.8 seconds as
being just under a common timeout value of 2 seconds while still
giving the resolver a fair shot at resolving the name.

The resolver then checks its cache for any unexpired records that
satisfy the request and returns them if available. If it finds no
relevant unexpired data and the Recursion Desired flag is not set in
the request, it should immediately return the response without
consulting the cache for expired records. Typically, this response
would be a referral to authoritative nameservers covering the zone,
but the specifics are implementation dependent.

If iterative lookups will be done, then the failure recheck timer is
consulted. Attempts to refresh from non-responsive or otherwise
failing authoritative nameservers are recommended to be done no more
frequently than every 30 seconds. If this request was received
within this period, the cache may be immediately consulted for stale
data to satisfy the request.
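
One way to picture the failure recheck timer is a small table of recent failures, as sketched below. The class, its methods, and the choice to key failures by question are all hypothetical; an implementation might equally track failures per zone or per server.

```python
FAILURE_RECHECK_SECONDS = 30.0  # recommended minimum gap between attempts


class FailureRecheck:
    """Remembers when a refresh last failed for a given question."""

    def __init__(self, interval: float = FAILURE_RECHECK_SECONDS):
        self.interval = interval
        self._last_failure = {}  # (qname, qtype) -> time of last failure

    def record_failure(self, key, now: float) -> None:
        self._last_failure[key] = now

    def record_success(self, key) -> None:
        self._last_failure.pop(key, None)

    def may_attempt(self, key, now: float) -> bool:
        """False while inside the recheck period: go straight to stale data."""
        last = self._last_failure.get(key)
        return last is None or now - last >= self.interval
```

A request that arrives while may_attempt returns False would be answered from stale data, if any is cached, without starting a new lookup.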

Outside the period of the failure recheck timer, the resolver should
start the query resolution timer and begin the iterative resolution
process. This timer bounds the work done by the resolver when
contacting external authorities and is commonly around 10 to 30
seconds. If this timer expires on an attempted lookup that is still
being processed, the resolution effort is abandoned.

If the answer has not been completely determined by the time the
client response timer has elapsed, the resolver should then check its
cache to see whether there is expired data that would satisfy the
request. If so, it adds that data to the response message with a TTL
greater than 0 (as specified in Section 4). The response is then
sent to the client while the resolver continues its attempt to
refresh the data.
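
The interplay of the two timers can be sketched with asyncio, as below. This is an illustration under stated assumptions, not a resolver implementation: resolve stands in for the iterative resolution process, stale stands in for the result of the expired-cache lookup, and falling back to stale data when resolution fails before the client response timer fires is a choice made for the example.

```python
import asyncio


async def answer_with_fallback(resolve, stale, client_timer=1.8, query_timer=10.0):
    """Return (answer, is_stale) for one client request.

    resolve: hypothetical coroutine function doing the iterative lookup.
    stale:   expired cached data for the request, or None if there is none.
    """
    # The query resolution timer bounds the whole refresh attempt.
    task = asyncio.create_task(asyncio.wait_for(resolve(), query_timer))
    # Consume a late failure so an abandoned refresh does not log a warning.
    task.add_done_callback(lambda t: t.cancelled() or t.exception())

    # The client response timer bounds how long the client is kept waiting.
    done, _pending = await asyncio.wait({task}, timeout=client_timer)
    if done and task.exception() is None:
        return task.result(), False
    if stale is not None:
        # Answer from expired data now. The task is left running, so a real
        # resolver could still refresh its cache when the lookup completes.
        return stale, True
    # Nothing stale to offer: wait for the outcome (or its exception).
    return await task, False
```

In a long-running resolver the event loop outlives the request, so the refresh continues after the stale response is sent; in a short script run with asyncio.run, the pending task is cancelled when the loop shuts down.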

When no authorities are able to be reached during a resolution
attempt, the resolver should attempt to refresh the delegation and
restart the iterative lookup process with the remaining time on the
query resolution timer. This resumption should be done only once per
resolution effort.

Outside the resolution process, the maximum stale timer is used for
cache management and is independent of the query resolution process.
This timer is conceptually different from the maximum cache TTL that
exists in many resolvers, the latter being a clamp on the value of
TTLs as received from authoritative servers and recommended to be
7 days in the TTL definition in Section 4. The maximum stale timer
should be configurable. It defines the length of time after a record
expires that it should be retained in the cache. The suggested value
is between 1 and 3 days.
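
The distinction between TTL expiry and the maximum stale timer might be expressed as a three-way classification, sketched here. The function name, the state labels, and the 1-day default (the low end of the suggested range) are assumptions of the example.

```python
MAX_STALE_SECONDS = 86_400  # 1 day; the suggested range is 1 to 3 days


def cache_state(expires_at: float, now: float,
                max_stale: float = MAX_STALE_SECONDS) -> str:
    """Classify a cached record by its age relative to TTL expiry."""
    if now < expires_at:
        return "fresh"   # normal cache hit
    if now < expires_at + max_stale:
        return "stale"   # kept only as a fallback when refresh fails
    return "purge"       # past the maximum stale timer; remove from cache
```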

6. Implementation Considerations

This document mainly describes the issues behind serving stale data
and intentionally does not provide a formal algorithm. The concept
is not overly complex, and the details are best left to resolver
authors to implement in their codebases. The processing of
serve-stale is a local operation, and consistent variables between
deployments are not needed for interoperability. However, we would
like to highlight the impact of various implementation choices,
starting with the timers involved.

The most obvious of these is the maximum stale timer. If this
variable is too large, it could cause excessive cache memory usage,
but if it is too small, the serve-stale technique becomes less
effective, as the record may not be in the cache to be used if
needed. Shorter values, even less than a day, can effectively handle
the vast majority of outages. Longer values, as much as a week, give
time for monitoring systems to notice a resolution problem and for
human intervention to fix it; operational experience has been that
sometimes the right people can be hard to track down and
unfortunately slow to remedy the situation.

Increased memory consumption could be mitigated by prioritizing
removal of stale records over non-expired records during cache
exhaustion. Eviction strategies could consider additional factors,
including the last time of use or the popularity of a record, to
retain active but stale records. A feature to manually flush only
stale records could also be useful.
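
One possible eviction ordering along these lines is sketched below: expired records are candidates before unexpired ones, and within each group the least recently used record goes first. The tuple layout and function name are hypothetical, and popularity is left out for brevity.

```python
def eviction_order(entries, now: float):
    """Return cache keys in the order they would be evicted.

    entries: iterable of (key, expires_at, last_used) tuples.
    """
    # Sort key: unexpired records (True) sort after expired ones (False),
    # then older last_used values come first within each group.
    ranked = sorted(entries, key=lambda e: (e[1] > now, e[2]))
    return [key for key, _expires_at, _last_used in ranked]
```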

The client response timer is another variable that deserves
consideration. If this value is too short, there exists the risk
that stale answers may be used even when the authoritative server is
actually reachable but slow; this may result in undesirable answers
being returned. Conversely, waiting too long will negatively impact
user experience.

The balance for the failure recheck timer is responsiveness in
detecting the renewed availability of authorities versus the extra
resource use for resolution. If this variable is set too large,
stale answers may continue to be returned even after the
authoritative server is reachable; per [RFC2308], Section 7, this
should be no more than 5 minutes. If this variable is too small,
authoritative servers may be targeted with a significant amount of
excess traffic.

Regarding the TTL to set on stale records in the response,
historically TTLs of 0 seconds have been problematic for some
implementations, and negative values can't effectively be
communicated to existing software. Other very short TTLs could lead
to congestive collapse as TTL-respecting clients rapidly try to
refresh. The recommended value of 30 seconds not only sidesteps
those potential problems with no practical negative consequences, it
also rate-limits further queries from any client that honors the TTL,
such as a forwarding resolver.

As for the change to treat a TTL with the high-order bit set as
positive and then clamping it, as opposed to [RFC2181] treating it as
zero, the rationale here is basically one of engineering simplicity
versus an inconsequential operational history. Negative TTLs had no
rational intentional meaning that wouldn't have been satisfied by
just sending 0 instead, and similarly there was realistically no
practical purpose for sending TTLs of 2^25 seconds (1 year) or more.
There's also no record of TTLs in the wild having the most
significant bit set in the DNS Operations, Analysis, and Research
Center's (DNS-OARC's) "Day in the Life" samples [DITL]. With no
apparent reason for operators to use them intentionally, that leaves
either errors or non-standard experiments as explanations as to why
such TTLs might be encountered, with neither providing an obviously
compelling reason as to why having the leading bit set should be
treated differently from having any of the next eleven bits set and
then capped per Section 4.
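
The figures in this argument can be checked with a few lines of arithmetic. The sketch assumes a 365-day year and the 604,800-second cap from Section 4; the variable names are illustrative.

```python
CAP = 604_800                      # 7-day cap from Section 4
SECONDS_PER_YEAR = 365 * 24 * 3600

# 2^25 seconds is a little over one year.
years_at_2_25 = 2**25 / SECONDS_PER_YEAR

# Bits of the 32-bit TTL whose value on its own already exceeds the cap.
bits_over_cap = [bit for bit in range(32) if 2**bit > CAP]

# Bit 31 is the leading bit; the remaining entries are the "next eleven".
next_bits = [bit for bit in bits_over_cap if bit != 31]
```

Here bits_over_cap covers bits 20 through 31, so setting any of the eleven bits below the leading one already produces a TTL that is clamped to the cap.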

Another implementation consideration is the use of stale nameserver
addresses for lookups. This is mentioned explicitly because, in some
resolvers, getting the addresses for nameservers is a separate path
from a normal cache lookup. If authoritative server addresses are
not able to be refreshed, resolution can possibly still be successful
if the authoritative servers themselves are up. For instance,
consider an attack on a top-level domain that takes its nameservers
offline; serve-stale resolvers that had expired glue addresses for
subdomains within that top-level domain would still be able to
resolve names within those subdomains, even those it had not
previously looked up.

The directive in Section 4 that only NoError and NXDomain responses
should invalidate any previously associated answer stems from the
fact that no other RCODEs that a resolver normally encounters make
any assertions regarding the name in the question or any data
associated with it. This comports with existing resolver behavior
where a failed lookup (say, during prefetching) doesn't impact the
existing cache state. Some authoritative server operators have said
that they would prefer stale answers to be used in the event that
their servers are responding with errors like ServFail instead of
giving true authoritative answers. Implementers MAY decide to return
stale answers in this situation.

Since the goal of serve-stale is to provide resiliency for all
obvious errors to refresh data, these other RCODEs are treated as
though they are equivalent to not getting an authoritative response.
Although NXDomain for a previously existing name might well be an
error, it is not handled that way because there is no effective way
to distinguish operator intent for legitimate cases versus error
cases.

During discussion in the IETF, it was suggested that, if all
authorities return responses with an RCODE of Refused, it may be an
explicit signal to take down the zone from servers that still have
the zone's delegation pointed to them. Refused, however, is also
overloaded to mean multiple possible failures that could represent
transient configuration failures. Operational experience has shown
that purposely returning Refused is a poor way to achieve an explicit
takedown of a zone compared to either updating the delegation or
returning NXDomain with a suitable SOA for extended negative caching.
Implementers MAY nonetheless consider whether to treat all
authorities returning Refused as preempting the use of stale data.

7. Implementation Caveats

Stale data is used only when refreshing has failed in order to adhere
to the original intent of the design of the DNS and the behavior
expected by operators. If stale data were to always be used
immediately and then a cache refresh attempted after the client
response has been sent, the resolver would frequently be sending data
that it would have had no trouble refreshing. Because modern
resolvers use techniques like prefetching and request coalescing for
efficiency, it is not necessary that every client request needs to
trigger a new lookup flow in the presence of stale data, but rather
that a good-faith effort has been recently made to refresh the stale
data before it is delivered to any client.

It is important to continue the resolution attempt after the stale
response has been sent, until the query resolution timeout, because
some pathological resolutions can take many seconds to succeed as
they cope with unavailable servers, bad networks, and other problems.
Stopping the resolution attempt when the response with expired data
has been sent would mean that answers in these pathological cases
would never be refreshed.

The continuing prohibition against using data with a 0-second TTL
beyond the current transaction explicitly extends to it being
unusable even for stale fallback, as it is not to be cached at all.
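
A toy cache illustrating the 0-second rule is sketched below. The class and method names are hypothetical. Dropping an older cached copy when a 0-TTL answer arrives is an assumption of the example; this document only requires that the 0-TTL data itself never be cached or served as stale.

```python
class Cache:
    def __init__(self):
        self._entries = {}  # key -> (rdata, expires_at)

    def store(self, key, rdata, ttl: int, now: float) -> bool:
        """Cache an answer; return False if it must not be cached."""
        if ttl == 0:
            # Usable for the transaction in progress only. Assumption: a
            # new 0-TTL answer also supersedes any older cached copy.
            self._entries.pop(key, None)
            return False
        self._entries[key] = (rdata, now + ttl)
        return True

    def stale_lookup(self, key, now: float):
        """Return expired data for key, or None if there is none."""
        entry = self._entries.get(key)
        if entry is not None and entry[1] <= now:
            return entry[0]
        return None
```

Because a 0-TTL answer never enters the table, it can never be found later by stale_lookup.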

Be aware that Canonical Name (CNAME) and DNAME records [RFC6672]
mingled in the expired cache with other records at the same owner
name can cause surprising results. This was observed with an initial
implementation in BIND when a hostname changed from having an IPv4
Address (A) record to a CNAME. The version of BIND being used did
not evict other types in the cache when a CNAME was received, which
in normal operations is not a significant issue. However, after both
records expired and the authorities became unavailable, the fallback
to stale answers returned the older A instead of the newer CNAME.
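
One way to avoid this surprise is to evict conflicting types at the moment a CNAME is cached, as in the sketch below. This is not a description of how BIND addressed the issue. The cache layout and function name are hypothetical, DNAME is not handled, and the sketch ignores the DNSSEC-related types that may legitimately share an owner name with a CNAME.

```python
def store_rrset(cache: dict, owner: str, rtype: str, rdata, expires_at: float) -> None:
    """Store an RRset in a cache shaped as owner -> {rtype: (rdata, expires_at)}."""
    by_type = cache.setdefault(owner, {})
    if rtype == "CNAME":
        # A newly learned CNAME replaces whatever else was cached at this
        # owner, so stale fallback cannot resurrect the older A record.
        by_type.clear()
    else:
        # Conversely, other data arriving means the CNAME is gone.
        by_type.pop("CNAME", None)
    by_type[rtype] = (rdata, expires_at)
```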

8. Implementation Status

The algorithm described in Section 5 was originally implemented as a
patch to BIND 9.7.0. It has been in use on Akamai's production
network since 2011; it effectively smoothed over transient failures
and longer outages that would have resulted in major incidents. The
patch was contributed to the Internet Systems Consortium, and the
functionality is now available in BIND 9.12 and later via the options
stale-answer-enable, stale-answer-ttl, and max-stale-ttl.

Unbound has a similar feature for serving stale answers and will
respond with stale data immediately if it has recently tried and
failed to refresh the answer by prefetching. Starting from version
1.10.0, Unbound can also be configured to follow the algorithm
described in Section 5. Both behaviors can be configured and
fine-tuned with the available serve-expired-* options.

Knot Resolver has a demo module here:
<https://knot-resolver.readthedocs.io/en/stable/modules-serve_stale.html>.

Apple's system resolvers are also known to use stale answers, but the
details are not readily available.

In the research paper "When the Dike Breaks: Dissecting DNS Defenses
During DDoS" [DikeBreaks], the authors detected some use of stale
answers by resolvers when authorities came under attack. Their
research results suggest that more widespread adoption of the
technique would significantly improve resiliency for the large number
of requests that fail or experience abnormally long resolution times
during an attack.

9. EDNS Option

During the discussion of serve-stale in the IETF, it was suggested
that an EDNS option [RFC6891] should be available. One proposal was
to use it to opt in to getting data that is possibly stale, and
another was to signal when stale data has been used for a response.

The opt-in use case was rejected, as the technique was meant to be
immediately useful in improving DNS resiliency for all clients.

The reporting case was ultimately also rejected because even the
simpler version of a proposed option was still too much bother to
implement for too little perceived value.

10. Security Considerations

The most obvious security issue is the increased likelihood of DNSSEC
validation failures when using stale data because signatures could be
returned outside their validity period. Stale negative records can
increase the time window where newly published TLSA or DS RRs may not
be used due to cached NSEC or NSEC3 records. These scenarios would
only be an issue if the authoritative servers are unreachable (the
only time the techniques in this document are used), and thus
serve-stale does not introduce a new failure in place of what would
have otherwise been success.

Additionally, bad actors have been known to use DNS caches to keep
records alive even after their authorities have gone away. The
serve-stale feature potentially makes the attack easier, although
without introducing a new risk. In addition, attackers could combine
this with a DDoS attack on authoritative servers with the explicit
intent of having stale information cached for a longer period of
time. But if attackers have this capacity, they probably could do
much worse than prolonging the life of old data.

In [CloudStrife], it was demonstrated how stale DNS data, namely
hostnames pointing to addresses that are no longer in use by the
owner of the name, can be used to co-opt security -- for example, to
get domain-validated certificates fraudulently issued to an attacker.
While this document does not create a new vulnerability in this area,
it does potentially enlarge the window in which such an attack could
be made. A proposed mitigation is that certificate authorities
should fully look up each name starting at the DNS root for every
name lookup. Alternatively, certificate authorities should use a
resolver that is not serving stale data.

11. Privacy Considerations

This document does not add any practical new privacy issues.

12. NAT Considerations

The method described here is not affected by the use of NAT devices.

13. IANA Considerations

This document has no IANA actions.
