Imagine you were a passenger on an airliner and the pilot announced, “I have no instruments in the cockpit, so please let me know if one of our engines stops running.” This is how many of the software applications in Veterans Affairs (VA) operated in the past. The VA Office of Information Technology (OIT) assumed its computer programs were working fine until users alerted OIT that an application had broken.
June 6, 2019 – a turning point
That all changed on June 6th, 2019, the 75th anniversary of World War II’s D-Day. As we approached the day that our software supporting the MISSION Act was supposed to go live, I was worried. I was the co-pilot responsible for delivering the IT needed for the MISSION Act, and my dashboard gave me only limited visibility into software performance. While we had many infrastructure (hardware and network) monitoring tools, we lacked what was needed to manage software applications: a robust application performance monitoring toolset and mindset.
Application Performance Monitoring (APM) tools help an IT organization monitor the performance of its software. APM is an electrocardiogram (EKG) for software applications: it shows the software’s heartbeat. APM plays a proactive role in ensuring that applications run and run well. Without APM, IT managers are flying blind, which was the case in 2019.
When Dev meets Ops
The lack of APM also exposed a cultural crack in VA/OIT. “Dev” (software development teams) and “Ops” (IT operations teams) worked pretty much independently. Software teams who created VA applications enjoyed a good night’s sleep, with the peace of mind that came from not knowing how well or how poorly their own applications were operating. The operational performance of the code they built was somebody else’s job, i.e., the IT operations organization’s.
As the Deputy Assistant Secretary for Development and Operations (DAS DevOps), I wanted to bridge this cultural crack between the teams who built the software and those who ran the software. To do that, I instituted a couple of rules.
- YBIYOI: You Build It, You Own It. Software teams who built code could not “drop and run”. There would be no throwing the code over the fence and hoping someone else would worry about its operation. The software folks needed to feel like they were part of a larger team that was responsible for the end user’s experience (UX). We needed to cultivate a culture of IT empathy: to learn and care about the Veteran whose life would be impacted by software code, and to care about their IT operations teammates as well. To do this, software teams needed APM data about the user’s experience, such as response time and whether the program was up or down.
- APM needed to be built in, not added on. We needed our software applications to report their own health status. In an enterprise as large as VA, something is always breaking, and it is very difficult to quickly pinpoint the root cause of a disruption. Sometimes the cause is a non-VA issue, such as a construction crew digging up network fiber or a hurricane knocking out power. But regardless of the cause, the software systems themselves must be built to work well with APM tools.
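To make the second rule concrete, here is a minimal sketch of what "built in" health reporting could look like. It is an illustration only, not the tooling VA adopted: the `HealthMonitor` class, its field names, and the one-second slowness threshold are all hypothetical. It records the two signals mentioned above, response time and whether the program is up or down.

```python
import time
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class HealthMonitor:
    """Hypothetical helper: collects response times and failures for one app."""
    app_name: str
    slow_threshold_s: float = 1.0  # assumed threshold, not a VA figure
    durations: list = field(default_factory=list)
    failures: int = 0

    def observe(self, func, *args, **kwargs):
        """Run func, timing the call and counting any exception as a failure."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            # Recorded for both successful and failed calls.
            self.durations.append(time.perf_counter() - start)

    def status(self):
        """Return a health report the application could expose to an APM tool."""
        total = len(self.durations)
        avg = mean(self.durations) if self.durations else 0.0
        if total and self.failures == total:
            state = "down"        # every observed call failed
        elif self.failures or avg > self.slow_threshold_s:
            state = "degraded"    # some failures, or slow on average
        else:
            state = "up"
        return {
            "app": self.app_name,
            "state": state,
            "requests": total,
            "failures": self.failures,
            "avg_response_s": avg,
        }
```

A development team would wrap its request handlers with `monitor.observe(...)` and publish `monitor.status()` where operations can read it, so the people who built the code and the people who run it look at the same numbers. A real APM product collects far more than this, but the principle is the same: the application reports on itself instead of waiting for a user to complain.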
To implement the above ideas, we had to acknowledge that VA OIT’s DIY software monitoring and observability approaches were nonexistent or inadequate. We went to industry to find open-source and industry-leading packages that could observe, gather, and display real-time performance information about the applications and the user experience. Then, we had to inculcate the concept of APM into the DevOps build and product release process and into the day-to-day operational and troubleshooting cadence.
On the right path to a long journey
Since VA has such a large and complex software inventory, VA/OIT still has a long way to go with APM. But with APM tools and processes now in place, VA can track the performance of specific applications. With increased observability, VA can spot discrepancies between application environments and detect performance problems while still maintaining the stability of the production environment.
Continued investment in modern APM tools is not a “nice to have” but a “must have”. As the number of VA applications continues to grow, APM will increasingly enable VA/OIT’s managers to understand the complex web of dependencies that underlie outages and to respond quickly enough to prevent a cascade of related problems. Before long, artificial intelligence bots will begin to use APM data to autonomously observe, operate, and dynamically provision resources, thereby reducing downtime and shortening mean time to restore (MTTR).
APM for VA software applications is necessary now and critical for the future. APM helps VA avoid or recover from outages, increase VA/OIT productivity and observability, offer insights into investments needed for innovation, and understand and improve the customer experience of Veterans.
Your article is so true. I work with the VA on a regular basis on their APM and Full-Stack Observability projects. Like most government agencies, they run mission-critical solutions across their enterprise, and when something goes “bump in the night” they need accurate, timely answers and need to react as quickly as possible.