Choosing a cloud service provider that offers true information security can be a daunting task. In addition to building new external partnerships, you may find yourself evangelizing different ways of working within your organization, adopting new ways of contracting, or managing unfamiliar projects that emerge from data provided by your service provider. You will also inevitably change the way your organization considers risk by sharing data with a third party.
Minimizing risk means choosing a partner that takes information security very seriously. Many cloud providers are quite good at security—after all, the success of their business may depend on it. Other providers are not as diligent with their security management and thus may not be a great match for your company, depending on your appetite for risk.
Information security covers a broad range of topics, including physical security, network security, application security, access security, transit security, and more. Although constant vigilance and good data hygiene are necessary, they are not always sufficient. Once good security measures are in place, they will only stand the test of time through a process of ongoing improvement.
Surprisingly, resiliency and disaster recovery are two crucial parts of an information security portfolio that are often ignored. What’s worse, remediation procedures need constant updating to remain current as your systems evolve. Even companies with great “security” can suffer confusion and paralysis in a real emergency.
In our decades of experience operating cloud software, we’ve learned to combat the fragile nature of complex systems by making a science out of resiliency and disaster recovery. We’ve also learned that these urgent situations occur at the worst possible times: in the middle of the night, often with corresponding connectivity issues, and seemingly always with technical leadership on vacation. So, what does it take to be prepared for an eventual systems breach or failure? Planning, process and practice.
As part of our Service Level Agreements at WaterSmart, in addition to continuous system monitoring and maintenance, we rigorously document and regularly test our operational procedures with the goal of being ready in the event of real necessity. For example, during a recent recovery exercise using our test environment, WaterSmart successfully completed a full recovery from a simulated total hardware failure of our primary database cluster in just over two hours.
For a detailed understanding of the steps taken to accomplish this remarkable achievement, let’s break down the recovery process timeline:
- 1:00 pm - We disabled the database hardware serving 70+ utility partner tenancies, simulating a rare, dead-in-the-water outage
- 1:01 pm - Our automatic monitoring framework broadcast an emergency alert to our operations team via cellphone and desktop notifications
- 1:03 pm - We assigned the incident owner role to an individual on our DevOps team
- 1:04 pm - The incident owner created a secure, live-chat war room using our enterprise communication tool for command-and-control, and then distributed the online link to our recovery playbook
- 1:08 pm - We disabled all inbound and outbound-facing APIs, and put customer-facing websites into maintenance mode
- 1:10 pm - We provisioned 3 new machines comprising over a terabyte of storage from our cloud service provider and securely assigned these servers to our private service network
- 1:14 pm - Storage provisioning completed
- 1:20 pm - We began database software installation and configuration using pre-defined roles from a configuration-as-code system called ‘Chef’
- 1:40 pm - We completed system configuration and brought the new database cluster online
- 1:46 pm - We retrieved the most recent encrypted daily backups, and started the restore process
- 3:04 pm - We completed database cluster restoration to all 70+ utility partner tenancies
- 3:16 pm - We enabled all services and websites
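To see where the time went, the timeline above can be turned into per-step durations. The following is a minimal sketch in Python, not part of our actual tooling: the timestamps come from the list above, while the shortened step labels and the helper names (`to_minutes`, `step_durations`, `total_minutes`) are assumptions made for illustration.

```python
# Timestamps copied from the recovery timeline (all pm, same afternoon).
# The labels are shortened paraphrases of each step.
TIMELINE = [
    ("1:00", "Database hardware disabled"),
    ("1:01", "Emergency alert broadcast"),
    ("1:03", "Incident owner assigned"),
    ("1:04", "War room created, playbook link distributed"),
    ("1:08", "APIs disabled, websites in maintenance mode"),
    ("1:10", "New machines provisioned"),
    ("1:14", "Storage provisioning completed"),
    ("1:20", "Database installation and configuration began"),
    ("1:40", "New database cluster online"),
    ("1:46", "Restore from backups started"),
    ("3:04", "Database cluster restoration completed"),
    ("3:16", "Services and websites enabled"),
]


def to_minutes(clock):
    """Convert an 'H:MM' clock string to minutes past twelve."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def step_durations(timeline):
    """Return (label, minutes) pairs: the time from each step to the next one."""
    durations = []
    for (start, label), (end, _next_label) in zip(timeline, timeline[1:]):
        durations.append((label, to_minutes(end) - to_minutes(start)))
    return durations


def total_minutes(timeline):
    """Minutes elapsed between the first and last entries."""
    return to_minutes(timeline[-1][0]) - to_minutes(timeline[0][0])


for label, minutes in step_durations(TIMELINE):
    print(f"{minutes:3d} min  {label}")
print(f"{total_minutes(TIMELINE):3d} min  total")
```

Run against this timeline, the sketch reports 136 minutes end to end, of which 78 minutes were spent between starting the restore and completing it. That makes the backup restore the natural first place to look for further automation or speed-ups.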
After the exercise, we met to update our recovery playbook based on new learnings and notes, and scheduled new work tasks designed to further improve automation within the process.
These types of failures are extremely rare, but planning, process and practice can prevent system failures from becoming disasters that interrupt business operations and create unnecessary stress and confusion for our utility partners. Restoring from bare iron and backups in approximately two hours is a proud achievement for our team, and we’re excited to bring this level of discipline to our family of WaterSmart partners. As you evaluate cloud service partners, ask some questions about how they prepare and practice for operations emergencies; their responses may be a good indicator of their commitment to excellence.