In today’s digital landscape, the Domain Name System (DNS) plays a critical role in ensuring that internet traffic is properly directed to its intended destination. To improve the speed and reliability of DNS queries, many organizations rely on distributed DNS anycast systems. However, as we learned the hard way, even the most well-designed and carefully maintained systems can be vulnerable to hijacking due to routing issues. In this blog post, we’ll share our experience of how our distributed DNS anycast system was hijacked and what we learned from the incident. We hope that by sharing our story, we can help other organizations avoid similar pitfalls and strengthen their own DNS infrastructure.
The Domain Name System (DNS) is an essential component of the internet, translating human-readable domain names into machine-readable IP addresses. DNS resolution is required for every internet transaction, including browsing websites, sending emails, and using applications. The performance and reliability of DNS resolution can have a significant impact on the user experience. To improve the performance and reliability of DNS queries, many organizations use anycast, a networking technique that allows multiple servers to share the same IP address.
Anycast enables requests to be processed by the server that is topologically closest to the user, improving the speed and efficiency of DNS resolution. In a traditional (unicast) DNS infrastructure, multiple servers are deployed across different locations, and each server has a unique IP address. When a user sends a DNS query, the request goes to whichever server the client is configured to use, regardless of the user's location. This can result in longer DNS resolution times, as the query may have to travel further to reach the server.
In contrast, a DNS anycast infrastructure enables multiple servers to share the same IP address, allowing requests to be automatically directed to the nearest available server. When a user sends a DNS query to an anycast IP address, the request is automatically routed to the nearest available DNS server, based on network topology and routing protocols. This can significantly reduce the distance that the query must travel, improving the speed and efficiency of DNS resolution.
One of the key benefits of DNS anycast is improved performance. By distributing DNS servers across multiple locations, the system can reduce latency and minimize the distance that DNS queries must travel. This can lead to faster DNS resolution times and a better user experience. Additionally, DNS anycast can provide greater resiliency in the face of network disruptions or outages. If one server becomes unavailable, the anycast IP address can automatically redirect queries to a different server, ensuring that the service remains available.
Another benefit of DNS anycast is improved security. DNS anycast can provide a level of protection against distributed denial-of-service (DDoS) attacks, which can overwhelm servers with a flood of traffic. In a DNS anycast infrastructure, the traffic is automatically distributed among multiple servers, reducing the impact of a DDoS attack on any one server.
BGP (Border Gateway Protocol) is a protocol used to exchange routing information between internet service providers (ISPs). BGP can also be used to support anycast DNS by advertising the same IP address from multiple geographically distributed DNS servers to the internet. When a user sends a DNS query, the request is automatically routed to the nearest DNS server based on BGP routing tables. This ensures that the request is processed by the nearest available server, reducing latency and improving the overall performance of the DNS infrastructure. BGP is a widely adopted standard for internet routing, and its use in anycast DNS enables organizations to leverage existing network infrastructure to improve the performance and reliability of their DNS services.
In some cases, DNS anycast can be implemented locally within an ISP’s network. For example, an ISP may have multiple DNS servers distributed across different locations, with each server having a unique IP address. The ISP can enable BGP on each DNS server and advertise a common anycast IP address to the network. When a user sends a DNS query, the query is automatically routed to the nearest available DNS server based on BGP routing tables.
To control the traffic flow between the anycast DNS servers, the ISP can use BGP attributes such as AS-path and MED. For example, the ISP may advertise a lower MED value from the DNS server closest to a given set of users, ensuring that their queries are directed to that server first. If the nearest server becomes unavailable, BGP routers will automatically route queries to the next closest server based on the remaining best path. By using BGP attributes to control the traffic flow, ISPs can ensure that DNS queries are processed by the nearest available DNS server, improving the speed and efficiency of DNS resolution. Anycast DNS is a popular way to improve the speed and efficiency of DNS resolution. However, like any other system, it is susceptible to attacks and issues that can disrupt its functioning.
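As a rough illustration of the path selection described above (hypothetical PoP names and values, not our production data), BGP best-path selection among several sites advertising the same anycast address can be sketched as preferring the shortest AS-path and breaking ties on the lowest MED:

```python
# Illustrative sketch only: real BGP best-path selection compares many more
# attributes (local-pref, origin, router-id, ...); here we keep just the two
# discussed in the text, AS-path length and MED.
from dataclasses import dataclass

@dataclass
class Advertisement:
    pop: str           # which DNS site originated this route (hypothetical names)
    as_path_len: int   # shorter AS-path is preferred
    med: int           # lower MED wins on an AS-path-length tie

def best_path(ads):
    # Prefer the shortest AS-path first, then the lowest MED.
    return min(ads, key=lambda a: (a.as_path_len, a.med))

ads = [
    Advertisement("karachi", as_path_len=2, med=10),
    Advertisement("lahore",  as_path_len=2, med=5),
    Advertisement("dubai",   as_path_len=3, med=1),
]
print(best_path(ads).pop)  # lahore: same AS-path length as karachi, lower MED
```

If the winning site withdraws its advertisement (server failure), removing it from the list and re-running the selection yields the next closest server, which is exactly the failover behavior anycast relies on.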
Incident: In our case, we experienced a significant increase in QPS (Queries per Second) on all our DNS servers, which raised suspicion of a DDoS attack.
We immediately implemented mitigation measures on our security devices, but the problem persisted despite disabling top DNS queriers and tweaking multiple parameters on the DNS side. After some investigation, we discovered that the issue was not related to a DDoS attack but rather a routing issue.
Some internet destinations were not reachable, and upon further investigation, we found that a quarter of the IPv4 internet in the forward direction was pointing to a stub test router. This was because our IGP (Interior Gateway Protocol), IS-IS, was announcing a 192.0.0.0/2 route, which covers one-fourth of the IPv4 address space.
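For reference, 192.0.0.0/2 spans 192.0.0.0 through 255.255.255.255, exactly one quarter of the IPv4 space. Python's standard ipaddress module makes this easy to verify (the sample address is from a documentation range, not one of ours):

```python
import ipaddress

net = ipaddress.ip_network("192.0.0.0/2")
print(net.num_addresses)             # 1073741824 addresses, i.e. 2**30
print(net.num_addresses / 2**32)     # 0.25 -- a quarter of all IPv4 space
print(net.broadcast_address)         # 255.255.255.255: the /2 runs to the very top
print(ipaddress.ip_address("198.51.100.7") in net)  # True: any 192-255.x.x.x address is covered
```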
RCA: This route was mistakenly introduced into the network with the wrong prefix length by a CoA (Change of Authorization) update pushed from the database for an IPoE subscriber. We removed that subscriber and its associated route from the network.
On the stub node we had the following route:
Routing entry for 192.0.0.0/2
  Known via "subscriber IPSUB_SUBSCR", distance 1, metric 0
  Installed Apr 29 09:32:24.843 for 02:01:43
  Routing Descriptor Blocks
    202.165.247.4, from 0.0.0.0
      Route metric is 0
  No advertising protos.
On our other nodes we had this route in the RIB:
Routing entry for 192.0.0.0/2
  Known via "isis ntlcore", distance 115, metric 30, type level-2
  Installed Apr 29 09:32:24.740 for 01:58:46
  Routing Descriptor Blocks
    58.65.165.41, from 172.16.33.147, via Bundle-Ether9
      Route metric is 30
  No advertising protos.
Two factors explain why our network was impacted by this least-specific route. First, our BGP policies accepted only the default route from the internet, plus a few thousand routes matching criteria on three AS-paths; had we carried the full internet routing table on all our devices, more-specific routes would have shadowed the /2 and it would not have affected us. Second, we had no control over the export policies of IS-IS, which ran with plain, unfiltered redistribution of connected, static, and subscriber routes. This lack of control led to the issue.
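The first factor comes down to longest-prefix matching: when the RIB holds only a default route and the leaked /2, the /2 is the most specific match for a quarter of all destinations. A minimal sketch (hypothetical next-hop names):

```python
import ipaddress

# Toy RIB mirroring our situation: BGP policy admitted little beyond the
# default route, so the leaked /2 sat alongside 0.0.0.0/0 with nothing
# more specific to shadow it.
rib = {
    ipaddress.ip_network("0.0.0.0/0"):   "internet-gateway",
    ipaddress.ip_network("192.0.0.0/2"): "stub-test-router",  # the leak
}

def lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    # Longest-prefix match: among entries containing dst, pick the longest mask.
    matches = [net for net in rib if addr in net]
    return rib[max(matches, key=lambda n: n.prefixlen)]

print(lookup("10.1.2.3"))     # internet-gateway: only the default route matches
print(lookup("203.0.113.9"))  # stub-test-router: 203.x.x.x falls inside 192.0.0.0/2
```

With a full table, a covering /8-through-/24 route for 203.0.113.9 would have won instead, which is why networks carrying full routes were unaffected.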
To resolve the issue, we implemented a new export policy on our network devices. On IOS-XR, the policy, called "slash_2", drops any prefix of length /18 or shorter (anything matching 0.0.0.0/0 le 18) and passes everything else.
route-policy slash_2
  if destination in (0.0.0.0/0 le 18) then
    drop
  else
    pass
  endif
end-policy

router isis 1
 address-family ipv4 unicast
  redistribute connected level-2 route-policy slash_2
  redistribute static level-2 route-policy slash_2
  redistribute subscriber level-2 route-policy slash_2
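The matching logic of slash_2 reduces to a one-line predicate on prefix length. A sketch of that predicate (illustration of the filter's semantics only, not a config generator):

```python
import ipaddress

def slash_2_pass(prefix: str) -> bool:
    # Mirrors the route-policy: `0.0.0.0/0 le 18` matches any prefix of
    # length /0 through /18, which the policy drops; /19 and longer pass.
    return ipaddress.ip_network(prefix).prefixlen > 18

print(slash_2_pass("192.0.0.0/2"))   # False: the leaked /2 is dropped
print(slash_2_pass("0.0.0.0/0"))     # False: a default route would be dropped too
print(slash_2_pass("10.20.0.0/24"))  # True: a typical connected subnet still passes
```

The /18 boundary was our judgment call: legitimate connected, static, and subscriber routes in our network are always more specific than /18, so anything shorter can only be a misconfiguration.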
Meanwhile, on Juniper, we created the ISIS-EXPORT policy. Because the Junos default IS-IS export policy rejects non-IS-IS routes, a final accept term is needed so that the remaining, more-specific direct and static routes are still redistributed:

set policy-options policy-statement ISIS-EXPORT term slash_2 from protocol direct
set policy-options policy-statement ISIS-EXPORT term slash_2 from protocol static
set policy-options policy-statement ISIS-EXPORT term slash_2 from route-filter 0.0.0.0/0 upto /18
set policy-options policy-statement ISIS-EXPORT term slash_2 then reject
set policy-options policy-statement ISIS-EXPORT term pass_rest from protocol [ direct static ]
set policy-options policy-statement ISIS-EXPORT term pass_rest then accept
By implementing these policies, we regained control over the export policies of IS-IS and prevented future issues arising from the unfiltered redistribution of connected, static, and subscriber routes.
The high QPS on our DNS servers was caused by customers requesting resolution for URLs whose IP addresses fell inside the hijacked /2 prefix. As customers could not reach their intended destinations, their applications retried, re-resolving the same names and creating a retry storm that further inflated the QPS on our servers. This produced a DDoS-like situation, which we initially suspected to be an actual DDoS attack. Upon investigation, however, we realized that it was a result of the routing hijack and its impact on our DNS infrastructure.
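The query inflation from retries is easy to estimate: queries for healthy destinations are sent once, while queries for unreachable destinations are sent once plus however many retries the client attempts. A back-of-the-envelope model (illustrative numbers, not our measured figures):

```python
def effective_qps(base_qps: float, failing_fraction: float, retries: int) -> float:
    # Healthy lookups are sent once; lookups for unreachable destinations
    # are re-sent, so each counts (1 + retries) times.
    healthy = base_qps * (1 - failing_fraction)
    failing = base_qps * failing_fraction * (1 + retries)
    return healthy + failing

# With a quarter of destinations unreachable (the hijacked /2) and 3 retries
# per failure, load grows 75% with no change in the client population.
print(effective_qps(100_000, failing_fraction=0.25, retries=3))  # 175000.0
```

This is why the traffic profile mimicked a volumetric attack: the extra queries came from the existing, legitimate client base.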
Conclusion:
The incident involving the routing hijack of a /2 prefix highlights the importance of having a comprehensive understanding of the network's routing and configuration. It was initially suspected to be a DDoS attack, but thorough investigation revealed the root cause to be a misconfigured IGP exporting a less specific route, sending traffic to a test router. The incident underscores the value of proper network monitoring, BGP route validation, and explicit IGP export policies in identifying and mitigating such issues promptly. Careful configuration and monitoring go a long way toward ensuring the performance and reliability of essential network services like DNS.