
When Disaster Strikes… Or what happens when a SaaS provider goes out of business

It was just a regular Thursday morning, with the usual network and security tasks slowly unfolding as Raphi was drinking his first cup of coffee at home. Routine stuff. "Pling!" Yet another message. Except it wasn’t quite what Raphi had expected. Very soon, he knew he would be spending his weekend grappling with a disaster-level scenario.
Figure 1: Email from a vendor describing the situation.
As a Systems Engineer at Open Systems, Raphi had a decade of experience, so he knew that the product manager who wanted him to lead the Crisis Response team was not exaggerating. The email from the partnering vendor had come out of the blue, without any warning signs. It stated that they had let go of most of their personnel and had no one left to support their product. To appreciate its significance, you need to know that Open Systems is in the business of connecting and securing users of hybrid environments seamlessly, simply, and efficiently, while protecting them from threats. 24x7. Around the globe.
Open Systems operations centers, which go by the name of “Mission Control”, run some 10,000 edge devices in over 180 countries. This means reliably managing web proxies, spam and malware filtering, VPN tunnels, firewalling, connectivity, routing and network optimization, encryption, threat protection – in high-availability setups for global organizations, around the clock. To do so, experienced, level-3 engineers in Mission Control work from Switzerland, India, and Hawaii so they are present for customers in all time zones. They each spend about one day a week in direct support functions, facing customer needs at the front lines, and the rest of their time developing, maintaining, and tuning networks and systems to ensure they perform effectively. Security is integrated into every action.
Structure the Chaos
Figure 2: First steps to create order out of chaos.
At the office, Raphi reserved a meeting room to serve as the command post. Next came the creation of a dedicated communications channel and a structure for documentation where his team could later report their findings. The very first sitrep (situation-report) meeting he called would be crucial for swift alignment and coordinated action. But before that, it was his responsibility to describe the situation clearly and concisely.
And come up with a plan.
He noted the basics:
- Vendor – out of business.
- Vendor’s product used in proxy service.
- Product still working, but if it goes down, the situation will worsen.
- Weekend soon.
He concluded that he wanted to address three things:
- Sustain operations.
- In parallel, communicate clearly to customers.
- Tackle a new solution.
Then it was time for action.
Within a few hours of that initial sitrep meeting, the first coordinated response was in place.
First Day
Raphi was free to focus on creating a high-level plan to address the situation. He was particularly conscious not to micromanage.
High-Level Plan
His plan needed to be understandable by all parties: customers, executives, engineers, account managers, and anyone else involved in operations and Mission Control.
Figure 3: High-level plan, sketched out by Raphi.
The plan stated that in the first days, weeks, and months, the company had to sustain operations. Chris, an engineer in the Web Security team, was put in charge of that. A new solution, led by another Open Systems engineer, would realistically be ready in a few months. Until then, frequent but short updates had to be communicated, and as time passed, their frequency could be decreased.
Communication
The team decided to communicate quickly and transparently, without sugarcoating the situation. Within the first 24 hours, customer communication took the form of emails, news notifications in the customer portal, and phone calls. Meanwhile, in the background, Raphi was working on a detailed plan of action.
Figure 4: Internal and external communication was crucial.
Failure Modes
For sustaining operations at this early stage, Chris’s top priority was to explain the technical setup to everyone involved, so that everybody understood which components were affected.
How the Affected Components Work
Figure 5: Diagram of the affected components.
At Open Systems, the Secure Web Gateway (SWG) secures corporate web traffic flowing from customers to the internet. Various feeds in the SWG analyze the traffic for threats or use the URL Filter to classify it, a service provided by the vendor that had just gone out of business.
The URL Filter consisted of two main components: a locally deployed daemon and some vendor endpoints in the cloud.
As live web traffic passed through the SWG (see Figure 5), the URL Filter would, as a first step, assign a category to the requested URL, for instance Finance, Personal Storage, or News. To do so, the daemon first looked up the URL’s category in a local cache; on a cache miss, it queried the vendor’s cloud endpoints.
In a second step, based on the category, the URL Filter applied a corporate policy that stated whether the traffic should be allowed or not, for example “allow News, but block Personal Storage websites.”
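To make the flow concrete, here is a minimal sketch in Python of how such a two-step lookup could work. All names in it (POLICY, local_cache, cloud_lookup) are illustrative assumptions, not the vendor’s actual API.

```python
# Illustrative sketch of the two-step URL Filter flow described above.
# All names (POLICY, local_cache, cloud_lookup) are hypothetical and do not
# reflect the vendor's actual API.

POLICY = {"News": "allow", "Finance": "allow", "Personal Storage": "block"}

local_cache: dict[str, str] = {}  # domain -> category, maintained by the local daemon


def cloud_lookup(domain: str) -> str:
    """Placeholder for the lookup against the vendor's cloud endpoints."""
    raise NotImplementedError


def categorize(domain: str) -> str:
    """Step 1: assign a category, consulting the local cache first."""
    if domain in local_cache:
        return local_cache[domain]
    category = cloud_lookup(domain)    # cache miss: ask the vendor cloud
    local_cache[domain] = category
    return category


def decide(domain: str) -> str:
    """Step 2: apply the corporate policy to the category."""
    category = categorize(domain)
    return POLICY.get(category, "allow")   # e.g. allow News, block Personal Storage
```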
Because the vendor systems were still running, Chris knew it was vital to identify and address their future failure modes. He was certain the systems would fail at some point. The essential questions were: when and how?
Considerations: (1) Quality Degradation and (2) Daemon Failure
First, his team considered quality degradation (see Figure 6). By then, the vendor had confirmed that the systems were unmaintained, which meant no new websites would be registered. It was safe to assume that as time went on, the URL Filter would be increasingly unable to classify traffic or would classify it incorrectly. This was easy to check with Open Systems telemetry. Chris also expected an increased number of support requests to Mission Control about incorrectly classified traffic.
Figure 6: Two failure modes of the existing solution.
The second failure mode to consider was: What if vendor cloud endpoints went offline completely?
To simulate this scenario, Chris’s team applied some local firewall rules to their test SWG. They confirmed what they had already suspected, namely that the proprietary daemon used a cache and was thus capable of sustaining some level of operations for a limited duration. Eventually, all categorization requests would time out.
Open Systems had deliberately configured the URL Filter in “fail open” mode to keep systems running under certain failure conditions. In this case, if requests timed out, the SWG would end up in an “allow all” situation, because corporate policies could no longer be applied.
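Conceptually, the fail-open decision boils down to something like the following sketch, reusing the hypothetical categorize() and POLICY names from the sketch above: if the category lookup cannot complete, the request is allowed rather than blocked.

```python
# Sketch of the "fail open" behaviour, reusing the categorize()/POLICY names
# from the earlier sketch. If categorization cannot complete because the
# vendor endpoints are unreachable, the gateway allows the request rather
# than blocking all customer traffic. The exception handling is illustrative.

def decide_fail_open(domain: str) -> str:
    try:
        category = categorize(domain)   # may hang on the (dead) cloud endpoints
    except TimeoutError:
        return "allow"                  # no category -> corporate policy cannot be applied
    return POLICY.get(category, "allow")
```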
A curious detail the team discovered was that any attempt to restart the proprietary daemon would fail if its cloud endpoints were offline: the vendor had set it up to validate its license against those endpoints. A classic move.
Having understood the two failure modes and how to identify them, it was time to communicate internally. Chris’s team then created a runbook to guide the engineers in Mission Control for any eventualities.
Both Chris and Raphi looked back at an intense first day with sighs of relief:
- Coordinated crisis response ✅
- Technical assessment ✅
- Communication ✅
First Week
To keep operations running, the engineers at Open Systems decided to temporarily mimic the URL Filter vendor’s functionality. This would give the SWG product manager the necessary time to evaluate a long-term solution for replacing the URL Filter components.
Technical Operations
Chris’s team went ahead with building their own local URL Filter feed (see Figure 7). Their idea was to plug it into UPSID, the unified policy and security intelligence daemon, which would pick up the work if the vendor endpoints suddenly went offline.
Figure 7: Building a temporary local URL Filter feed.
From a security and compliance point of view, knowing which traffic to block is key. That’s why the engineers narrowed down their solution to recognizing and blocking the traffic that would have been blocked when everything was working normally. Allowed traffic was unaffected.
To help them in their quest, they chose to make use of SWG Telemetry, which typically contains the URL, its category, and the associated action.
An existing data processing pipeline was used to compile an index of the top 100,000 blocked domains globally. By the end of the first week, the first version of the index was deployed at all customer sites alongside the temporary URL Filter feed, ready to take over whenever the vendor endpoints finally went offline.
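A minimal sketch of that aggregation step might look like the following, assuming telemetry records that carry at least a domain and the action taken. The field names, file format, and cut-off are hypothetical; the actual Open Systems pipeline is more involved.

```python
# Minimal sketch: derive a "top blocked domains" index from SWG telemetry.
# It assumes newline-delimited JSON records carrying at least "domain" and
# "action" fields; field names, the file path, and the cut-off are
# hypothetical, and the real pipeline is more involved.

import json
from collections import Counter


def build_block_index(telemetry_path: str, top_n: int = 100_000) -> set[str]:
    blocked: Counter[str] = Counter()
    with open(telemetry_path) as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("action") == "block":
                blocked[record["domain"]] += 1
    return {domain for domain, _ in blocked.most_common(top_n)}


# The resulting index can then be shipped to each SWG, where the temporary
# local feed answers "block" for domains in the index and leaves all other
# traffic untouched.
if __name__ == "__main__":
    index = build_block_index("swg_telemetry.jsonl")
    print(f"{len(index)} domains in the temporary block index")
```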
Continuing Communication
As Raphi followed the progress of the technical operations teams, he continued to lead communication updates on all channels. He discovered that briefing the executive teams had two advantages: first, execs who were aware of and understood what was being done were unlikely to interfere; second, they put their experience to good use by posing pertinent questions, which improved the response.
Figure 8: Continued communication.
An internal FAQ was organized so that customer-facing employees had sufficiently precise information to settle a variety of customer concerns. That also meant the engineers actively working on solutions could focus more on their direct tasks. Internally, the high-level plan was available and communicated to every employee.
For external communication, Open Systems continued to use its web portal to fan out information to its entire customer base. Email was used sparingly at that point, and only in specific circumstances.
The cycle of simple sitrep meetings with key employees continued every 6 hours, gradually becoming less frequent as the situation stabilized.
Figure 9: Continued cycle of: 1. Review, 2. Outlook, 3. Align & Decide.
At every such meeting, Chris, as the Sustaining Operations Lead, would first review what had been done and what was planned. Then it was the turn of the New Solution Lead, followed by Raphi, who acted as the Communication Lead. Through this process, decisions that didn’t align became immediately obvious and could be discussed, aligned, and acted upon.
Things were progressing smoothly until a surge of tickets hit Mission Control: the vendor endpoints had gone offline. Luckily, they recovered after 15 minutes, making it unnecessary to activate the contingency plans. Nevertheless, it served as a good reminder that the situation was anything but stable, and that the engineering efforts had to continue at full speed.
At the end of the first week, Raphi reported:
- First mitigation was activated ✅
- New solution was being researched ✅
- Regular meetings and communication ✅
First Month(s)
As the weeks turned to months, boredom began creeping in. This was not a bad thing.
By then, it was sufficient to have sitrep meetings on a weekly basis. The local feed for sustaining operations was good enough, and the focus shifted to building the new solution.
Building a New URL Filter Solution
Figure 10: Timeframes of building a new URL Filter solution.
Replacing a URL filter, even at the best of times, is a lengthy and meticulous process.
- Three vendors were evaluated in the first month.
- Integration of the preferred product began right after it had been seen in action on secure web gateways in production on the Open Systems company network.
- The contract with a new vendor was signed three months into the crisis.
- The first customer migration took place one month after that.
- In this early migration phase, there were headaches in engineering and in Mission Control. The secure web gateways of Open Systems are deployed in some 180 countries, so it was necessary to ensure smooth operations even in regions with heterogeneous connectivity environments.
- Technical account managers then began to migrate customers one by one. This was, consciously, a more conservative approach than simply flipping a switch to migrate all customers at the same time. The goal was to ensure that the systems, which had been fine-tuned for each customer’s corporate policy over the past 10 years, continued to perform well and reliably.
- The last customer was migrated seven months into the crisis.
The crisis was over.
Fun fact: the old vendor endpoints in the cloud were still reachable and providing service.
Zooming Out and Looking Back
Open Systems achieved the all-important goal of keeping customers secure and compliant while ensuring their business continuity. To round off the crisis handling, Raphi and Chris wanted to reflect on and summarize what went well, what didn’t, and why – in context.
Engineering Perspective
Figure 11: Engineering perspective.
From the engineering perspective, Chris noted the following:
- A proprietary daemon in the hot path of customer web traffic with a weird license validation mechanism that could lead to failure if the vendor endpoints were unreachable. That sounds like trouble. If the engineers were to start from scratch, they would certainly have designed this system differently. However, evaluating and selecting a URL filter vendor is not quite like selecting a public cloud, or a frontend framework, or a programming language. The options are limited.
- This crisis pushed everyone out of their comfort zone. For the engineering teams, it was great to have some familiar tools, technologies, and processes they could rely on. For example, the unified policy and security intelligence daemon gave them the necessary abstraction between the core web proxy and the various intelligence feeds. It tremendously facilitated integration and mitigation, and later, component replacement.
- Experienced engineers who were confident about operations were key to making effective, data-driven decisions, and so was having telemetry at their fingertips as well as an existing data processing pipeline.
- The Open Systems configuration and runtime platform typically rolled out hot fixes within 24 hours to all Secure Web Gateway deployments, including to areas with poor connectivity and to a variety of customer setups in various industries.
- For the vendor replacement task, a shortlist of three alternative vendors was already available, having been proactively scouted two years earlier, prompted by contract negotiations. The product manager could use that list immediately.
Crisis-Lead Perspective
Figure 12: Crisis-lead perspective.
From the crisis lead perspective, Raphi observed the following:
- Although there were processes in place, the engineering teams could have benefitted from regular, proactive crisis training with staged scenarios prepared in advance. This type of training has since been integrated into the organization.
- The rules to communicate fast and transparently were good. In practice, continuous mass mailings turned out to be inefficient for this situation and of questionable effectiveness. The news section in the customer portal performed better. Communication had to focus on setting realistic customer expectations over time frames spanning several months.
- “Hope for the best but plan for the worst” proved to be a useful mantra. Although the vendor endpoints were reachable most of the time, and the engineers hoped they would remain that way, they could not rely on this. They decided to go with what they could control and mitigate, while moving on with the URL Filter replacement.
- The crisis response process worked well, i.e. a high-level plan with sitreps followed by action. Keeping it simple paid off particularly well: new people were easy to onboard because everyone, from the executive team to the engineers, could understand the plan.
- The mantra of “Trust your people” proved itself too. This means getting to know the individuals working in the organization, and understanding what knowledge and skills they possess. Then, if a crisis strikes, it is much easier to team up with these people and face the crisis together – effectively.
Takeaways
Don’t have a crisis in the first place.
But if one does happen:
- Step up and team up.
- Make a good, simple plan.
- Lead.
This blog post is based on a presentation given at SREcon in October 2024.