Question Asked | Answer Given |
What is the most common type of problem your operations triage team runs into today? | At a high level, poor system performance, in many cases caused by change. That requires us to understand both technology and process. They truly go hand in hand. |
How is the larger work-from-home user base impacting triage and performance? | WFH provides a new opportunity to understand how we address performance, the tools we use, and the access we create. The VA has stayed ahead of the curve throughout the pandemic as we’ve seen a sizeable portion of our resources move to WFH. |
Is there a role for AI/RPA etc. in your org, and if so, what are the plans for leveraging newer technologies? | Later in this webcast we speak to AI and AIOps. This is increasingly becoming an important tool in our strategy to improve system performance for our users. Several groups have started work on applying ML to our response to known system faults, so that those faults can be addressed without human intervention. |
Does your team get involved early, i.e. as new applications and tools are rolled out, are you able to do the synthetic monitoring things you discussed BEFORE they hit the real users? | OTG’s SREs not only support triage efforts when systems perform poorly or fail, but also take part in the early stages of planning and deployment. Regarding SynthMon, most of our focus right now is on developing solutions for existing applications. Once we have a substantial body of solutions in place, it will be important to focus our efforts on bringing SynthMon strategies and solutions into pre-prod and promoting them along with the app’s code base. |
You mentioned that there are still major issues that are reported by users, though automated alerting is growing. What is the biggest challenge in fully automating Incident alerting? | Coordination and ensuring we capture the right information are probably at the top of the list. As mentioned in the webcast, we continue to build out system performance measurement and response. In many cases, we choose a standard value because system-specific data is not available or not timely. Once that data is captured, though, we still need to interpret it and understand how it best fits within both system and business SLOs, SLIs and, in turn, SLAs. We’re succeeding in capturing data. Our next big step is ensuring it’s the right data, the data that allows us to measure what matters. |
Have these principles been applied to other Federal agencies? What recommendations would you give for other agencies to enact similar services as what your office is supplying at VA? | Senior and executive leadership is a must for creating an SRE start-up organization. Finding the right individuals with an insatiable desire to solve problems is also key. Finally, SREs need to be strong communicators, evangelists, and arbitrators on a daily basis. Understanding that we are here to help improve, not to poke around, is core to a solid relationship with our system and business partners. |
How do you evaluate the quality of the data your DevOps org collects? Is there a VA data standard you adhere to (or other coordination with VA CDO possibly) to ensure data collected is valuable, actionable and reusable? | Data is our friend and nemesis. We’re continually verifying and re-verifying both our performance and incident data to ensure we’re accurately capturing not only the most important data, but the right data. Within our SRE team alone, we have a peer review process for both data capture and reporting. Within our ECC (Enterprise Command Center) team, there is a constant battle rhythm of reviewing the data and the threshold settings for our APM tools. This too allows us to better understand poor system behavior and respond more quickly. |
You mentioned how your efforts have improved VA’s telehealth services. What are some of the other veteran programs/services IT triage is helping to improve? | Our SREs have been involved in many of VA’s most critical systems. These include those supporting Veteran Benefits, Emergency Room Management, and user Authentication, to name a few. In many cases, we reach beyond the OIT organization to provide support wherever we can, as best we can. |