Mission Control: We Never Quit on Our Customers
When you're a security and networking engineer, knowing when something is wrong is important. Not giving up until you repair its root cause is critical - even if the root cause is with a third-party provider who doesn't think there is a problem.
Mission Control engineers, like myself, pride themselves on never giving up until a problem is resolved to our satisfaction and the satisfaction of our customers. One of our company values is to solve customer problems whatever it takes, and we live by that value each day.
When I first saw the 'NTP: NOT_SYNCED' error message, I didn't think it would take two months and nearly a dozen people to resolve.
Network Time Protocol (NTP) is a networking protocol for clock synchronization between systems. Having exact and synchronized time across a compute and network environment is critical. When you're tracking down network problems or containing a cyberthreat, knowing exactly when something occurred relative to other events is essential. Without synchronized time, coordinated processes could fail, time-based activities would launch when they shouldn't, and application, security, and error logs would have incorrect timestamps, making them less helpful or nearly useless.
As is the case with all sophisticated Open Systems services, engines are running on important systems. They actively and passively monitor the state of the service delivery platform (SDP). If something is malfunctioning, such as a service that is not running, the engine tries to repair it and restart the affected service. If that's not possible, the event is forwarded to one of our Mission Control operations centers, where a ticket is created for further investigation and processing.
A local monitoring process observed that an SDP could no longer synchronize its clock with public time servers. In this case, the SDP affected was hosted on a VM from an ISP. It was the primary gateway to a service being used and a critical part of this backend infrastructure. Since it was my responsibility to monitor and manage this service, I investigated the 'NTP: NOT_SYNCED' message.
My original thought was that the host configuration was incorrect. But that was not the case because NTP worked with the same configuration on similar cloud hosts. My next thought was that the configuration with the ISP was not correct. That was not causing the problem either because NTP worked fine on its peer.
I suspected it had to be something else. After some more investigation and packet traces on the NTP server (the one that receives and replies to requests from the affected host), I could confirm that UDP traffic (at least NTP and DNS) left the service platform being used and was received successfully by the server providing the NTP service, but the replies were not getting from that server back to the affected host.
My experience told me that the issue was not on the SDP. I read through all the open issues on the service provider's platform without any success. I continued working on the problem for several hours while also clearing other tickets. I created a firewall group allowing UDP port 123, because NTP is a UDP-based service: NTP servers use well-known port 123 to talk to each other and to NTP clients. Unfortunately, that led to a dead end. After some more research, I discovered that the ISP service being used applies DDoS protection everywhere, even if you don't know it. After learning this, I was fairly sure that DDoS protection trigger thresholds could be at the root of the problem.
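To make the "NTP is UDP on port 123" point concrete, here is a minimal sketch of a single NTP client exchange in Python. It is not the tooling used in this case. The function names are made up for illustration, and the server name you pass in is your own choice. The sketch sends one 48-byte client request and returns nothing if the reply never arrives, which is the symptom described above.

```python
import socket
import struct
from typing import Optional

NTP_PORT = 123                  # well-known UDP port for NTP
NTP_UNIX_DELTA = 2_208_988_800  # seconds from the NTP epoch (1900) to the Unix epoch (1970)


def build_request() -> bytes:
    """Return a 48-byte client request: LI=0, version=3, mode=3 (client)."""
    return bytes([0x1B]) + bytes(47)


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp (bytes 40-47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet too short")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def probe(server: str, timeout: float = 2.0) -> Optional[float]:
    """Send one request; return the server time, or None if no usable reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, NTP_PORT))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and name-resolution failures
            return None
    try:
        return parse_transmit_time(reply)
    except ValueError:
        return None
```

Because UDP is connectionless, a lost reply looks exactly like a timeout on the client. That is why TCP- or ICMP-oriented checks say little about a problem like this one.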
Then something disconcerting happened - 'NTP:OK'. This means the issue is technically resolved, and NTP is working again. The internet is constantly changing, causing issues to appear and disappear. The challenge now is trying to determine the root cause of something that is suddenly working. Although I had control over some of the environment, I had no control over the internet and certainly not the infrastructure at the ISP.
As luck would have it, 'NTP:NOT_SYNCED' reappeared. This was a mixed blessing because it meant I had something to work with, but it was intermittent. As time went on, the error kept coming and going. Anyone who has tried to resolve an intermittent issue knows it's like chasing a ghost. To make it worse, I didn't have control over the entire environment. I had to contend with the internet and a third-party ISP who kept their platform closed for security reasons.
At this point, I had double-checked everything on my end and couldn't do anything further. I could only escalate it to the ISP. I had my "monitoring setup" with commands running every second and packet captures on the server. Now I had to wait until the error appeared again to perform my capture and escalate it to the ISP. For this environment, the ISP is responsible for the network and the hosting of the VM for the service. Luckily, I didn't have to wait long until 'NTP:NOT_SYNCED' reappeared, and I could open a case with the ISP.
I received a notification from a support representative asking me if I had time. Not much later, I was on a call with six other people working for a third-party contractor doing 1st level support for the ISP. I was happy that I could discuss the issue and explained my DDoS suspicions. I was confident that the problem would be resolved, if not by that night, then by the end of the week.
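The exact commands in that monitoring setup are not listed here, so the following is only a sketch of the idea, written in Python with a hypothetical helper name. Given one probe result per second, it keeps just the moments when the state flips. For an intermittent fault, those timestamps are what you line up against packet captures.

```python
from typing import Iterable, List, Tuple


def state_changes(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, str]]:
    """Return (timestamp, label) for the first sample and for every change of state."""
    changes: List[Tuple[float, str]] = []
    previous = None
    for timestamp, synced in samples:
        if synced != previous:
            changes.append((timestamp, "NTP:OK" if synced else "NTP:NOT_SYNCED"))
            previous = synced
    return changes


# Example: one probe result per second (True means a reply arrived).
samples = [(0, True), (1, True), (2, False), (3, False), (4, True)]
for timestamp, label in state_changes(samples):
    print(timestamp, label)
```

Recording only the transitions keeps the log short enough to hand over with a support case, even when the probe runs every second for days.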
I was wrong. They didn't understand the problem. What followed was a frustrating series of support calls, conferences, and emails. They kept telling me the commands useful to debug TCP or ICMP, but they were not telling me the commands for UDP. NTP is UDP, which is what was being affected. I was asked countless questions that, if you know how UDP and networking in general works, wouldn't lead to solving the problem. I was asked to click here and there in their customer portal to no avail. Obviously, these were all steps I had already taken before creating the case. But I respect that they are following a process, and I was patient, did what they asked, and provided all the information they requested.
I then mentioned again that I had discovered that the service being used applies DDoS protection when needed, everywhere, and you may not know it. Perhaps this could be causing the NTP synchronization communication to flip-flop.
The 1st level support representative didn't know anything further and asked his advisor, who was not any better. As hard as it was to believe, the email thread was so long that I could not write replies any longer. The email application would freeze and had to be restarted.
At this point, our ISP relationship manager became involved, with more back and forth emails and calls with the contractor doing 1st level support. Unfortunately, 'NTP: NOT_SYNCED' prevailed. Yet, they refused to escalate to engineers at the ISP itself, claiming the servers being used were too powerful to cause these types of issues. They also didn't explore the possibility that the DDoS configuration thresholds, which I proposed, may be causing the problem.
To understand how poorly 1st level support works when trying to resolve real-world issues, it's probably best to give you an example of what happened:
- I write to our 1st level support representative in the morning at 08:00 to tell him that the issue is ongoing. The representative assigned to the case is always the same. You may get lucky with someone competent and relentless, but that's usually not the case.
- I receive an autoreply: I will be available from 12:30 PM-09:30 PM Mon-Fri (IST) (GMT+5:30)
- I finally get a message, and he asks if the issue is still present. I tell him yes.
- One hour later, I get a reply, "I'm engaging the internal team for packet capture."
- Two hours later, I ask about the status. The reply is, "Still waiting for a reply from the internal team."
- Another 3 hours later: "We have the internal team with us for the captures. Will you be available now?" I answer, "Yes."
- Some more chats back and forth and another hour later, "They are setting up access - allow just a few minutes."
- Finally, at around 19:00 (remember that I wrote my first email of the day at 08:00), they have their capture.
- The next day, I ask about the status of the analysis. His reply, "They are still checking the capture."
- This went on and on, with me spending hours supporting a 1st level support contractor, but without any success.
- And then, I received this email on June 23, 2021, at 13:41: "Hello Christian, As per the troubleshooting call with the team, we observed that we don't find any issue with our service. We tried all possible troubleshooting steps required to isolate the issue, and as the packets are leaving our service (using VM and host captures), we conclude the issue is not with our service. Hence, no further troubleshooting can be done in this case. Sorry for the inconvenience caused. Thank you for all your help and support during the troubleshooting."
As you can imagine, I was extremely frustrated, especially since I explained the likely cause of the problem is related to DDoS and that it's coming from somewhere in their environment.
At this point, with all my other efforts exhausted, I contacted our relationship manager. Although the SDP was still working, we were spending too much time monitoring it for issues caused by the failing NTP synchronization and needed it resolved.
It worked. I finally had a 1st level support engineer from the ISP. Someone who replied to emails and did understand the issue. It still took several night shifts because of the time difference between our two locations, but on July 9 at 20:06 we had good news.
"Yes, we have found something," and later at 21:09: "DDoS is the issue, both VIPs exceeded our thresholds for basic PIP."
I was relieved. I suspected DDoS was the issue on the first day of the ticket and mentioned it to the 1st level support contractor several times. They denied it every time. Fortunately, the ISP was able to reproduce the problem.
To give you some technical background, some ISPs have platforms with built-in DDoS protection. Because it's an internal failsafe in the platform, no one in their support teams receives any alarms or metrics when the protection mechanism is triggered or released. In many cases, it's a fixed limit for TCP and UDP. In this case, UDP traffic was affected, and the threshold is a fixed value independent of host traffic. For reasons of security, these values are not revealed. Once triggered, all UDP traffic, including DNS and NTP, is limited. As traffic decreases, the protection stops. This explains the NTP service flip-flopping as DDoS protection turns on and off.
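A toy model in Python shows why a fixed threshold produces this on-and-off pattern. Everything in it is an assumption for illustration: the real thresholds are not disclosed, the numbers are invented, and "limited" is simplified to "NTP replies are dropped while the protection is active."

```python
from typing import List


def ntp_status(udp_rates: List[int], threshold: int) -> List[str]:
    """Map each interval's total UDP packet rate to the NTP state a client would see.

    Assumed model: protection is active in any interval whose rate exceeds the
    fixed threshold, and NTP replies are dropped for as long as it is active.
    """
    return ["NTP:NOT_SYNCED" if rate > threshold else "NTP:OK" for rate in udp_rates]


# Hypothetical UDP packets per second, hovering around an invented threshold.
rates = [100, 900, 1200, 300, 1500, 800]
print(ntp_status(rates, threshold=1000))
```

When traffic hovers around the limit, the state changes from one interval to the next. From the affected host, this looks like an intermittent network fault, not a deliberate protection mechanism.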
In the end, we solved the issue and changed configurations and settings to prevent NTP synchronization issues from ever occurring again.
At Open Systems, when you submit a ticket, it is handled directly by a level 3 engineer. We don't give up and leave a problem unsolved. It doesn't matter if the problem is in the customer environment, with their third-party provider, or somewhere on the internet. Our commitment is to bring passionate levels of care to keep our customers secure.