Improving Reliability and Changing IT Culture at Nationwide


Peer Practices
Written by Kara Bobowski

Jason Patterson

AVP, Site Reliability Engineering

Nationwide

Jason Patterson, AVP of Site Reliability Engineering, along with Guru Vasudeva, SVP & CIO of Infrastructure and Operations, Todd Kasper, VP of Cloud Operations, and Doreen Luke, VP of Technical Operations, has been tackling technology downtime and exhausting, reactive production support at Nationwide. Along the way, they established a Reliability Engineering program, including a Site Reliability Engineering team, which together have dramatically reduced customer downtime, improved confidence in monitoring, and inspired a shift toward proactive reliability efforts.

Kasper, Luke, and Vasudeva are the executive sponsors of the resulting program and team, and Patterson is the self-described “highly vocal, passionate driver” of it. The program established cross-cutting focus areas across the Infrastructure and Operations departments, while the team of Site Reliability Engineers focused surgically on specific customer experiences.

While the team was initially established to enable reliable cloud adoption, that experience positioned the engineers well to partner with application developers and technology engineers across the enterprise, improving observability, protecting against faults, and applying self-healing platform features.
 

Confronting the Technology Challenge

In late 2020, the organization experienced a series of technology incidents that impacted multiple lines of business. As Patterson says, “It was a little bit of a trifecta that all three lines of our business were impacted with large outages.” Their response was to focus and “create a program and deliverables with regular report outs.”

The challenges occurred about seven months after the internal team of consultants was up and running, and the outages led to the establishment of a focused program staffed by the existing team. Patterson explains, “We were adopting modern technologies in a challenged way, and that spawned the reliability engineering program.”

And where do you start when faced with multiple challenges? Patterson says, “Challenges weren't being met from platform to platform or system to system, so the focus was on the platform and the customer experience.” 
 

Establishing a Reliable Solution

Patterson explains that they started with “all the noise in the system – including a lot of false starts or false incidents.” The team embedded themselves in those areas and learned that 70-80% of the alerts were not really a problem. “We had so many outage alerts that it was hard for the team to take them seriously,” he adds.

“We had to establish trust around alerts, bring those down, so if you get an alert in the middle of the night, it’s a real problem,” Patterson continues. To build trust, the team took a number of actions, including reviewing monitoring tools, instrumenting code bases, and onboarding a developer to help with the code base.

Along the way, they improved their monitoring and shifted the culture around false alarms. Patterson notes, “There's a lot of fear in saying, if we get fewer false alarms, are we going to miss some? But what we were able to show is that we are actually improving monitoring. So, we're reducing false alarms. We're reducing user-reported issues. We're reducing support-reported issues. And actually we made both better.”

“You're not being hit up for all these false alarms during the day, and they're not waking you up at night. Now, we can step back and say, where do we need to improve?”

 

The reduction in false alarms also freed up a lot of time and attention for the team. In addition, the improved monitoring surfaced issues that Patterson and his team didn’t know about. “That was perhaps the scariest thing,” he adds.
 

Outcomes of Reliable Performance and Lessons Learned

While the focus was on specific customer-facing areas that were causing most of the challenges, improving reliability there had impacts across other parts of the business. The demonstrated shift in the digital customer experience with Nationwide set the bar for other areas of the enterprise.

“Along the way, I would say we created some of those objectives where the rest of the enterprise can follow,” Patterson says. “We helped establish a good baseline.”

Now, the outages on key customer-facing platforms are almost nonexistent. In fact, around 500 critical platforms have shown remarkable improvement in reliability and performance. With the program and the team out of crisis mode, they are looking to create momentum and urgency for other IT strategies and initiatives.

One of the biggest lessons learned, according to Patterson, is that this nimble and responsive approach can lead to “taking over” the problem, instead of enabling others to solve it. With their success, other teams were looking for them to own the challenges. “If you improve it, you own it,” Patterson says.

But that led to stress and burnout on his team, which impacted IT talent retention and recruitment. “My biggest ‘watch’ item is not taking over these spaces that we’re working on, but taking more of an approach that says, ‘I will show you how to improve so you can replicate it.’”

As to the future of the responsive team and program, Patterson says the team “lives on,” as they try to determine what intake looks like, what funding looks like, and where they should prioritize their time. But he feels positive about the team and their “willingness to change, evolve it and expect better.”

 

Special thanks to Jason Patterson and Nationwide.

by CIOs, for CIOs


