Explore related resources and telemetry: reducing mean time to diagnose issues for DevOps in troubleshooting scenarios

TL;DR

In 2024, I led the UX for a strategic project aiming to reduce mean time to diagnose issues across AWS. The solution took the form of a side panel tool integrated into over 60 AWS services, allowing operators to efficiently explore related resources and telemetry during troubleshooting.

Presented at AWS re:Invent in Las Vegas, the tool saw strong week-over-week adoption growth and received highly positive feedback, especially from some of AWS’s biggest-spending customer cohorts. Users reported significantly reduced diagnostic time and improved troubleshooting efficiency.

The challenge

Before the side panel tool, DevOps teams faced time-consuming troubleshooting due to fragmented tools and scattered information. Operators relied heavily on experience or runbooks and had to jump between multiple pages and tabs, making it hard to know the next steps in complex architectures.

Our objective

Save valuable time during critical troubleshooting scenarios by helping users connect the dots faster, guiding them on the best next steps, providing easy navigation of complex architectures through a resource map, and offering quick access to related telemetry such as metrics and logs.

Key goals

  • Reduce mean time to diagnose issues.

  • Enhance troubleshooting efficiency by guiding users on where to go next.

  • Deliver an easy-to-use, integrated solution with high customer satisfaction.

  • Align with AWS goals to remove friction in troubleshooting scenarios.

My role

I led UX strategy and design direction, aligning cross-functional teams and stakeholders. Responsibilities included designing interaction flows, prototypes, and pixel-perfect mockups; collaborating closely with PMs and engineers; conducting user testing; supporting post-launch analysis; and contributing to project communications.

Key contribution highlights

  • Spearheaded UX strategy to unify stakeholder needs into a user-focused vision.

  • Designed the contextual side panel interface integrated into 60+ AWS services.

  • Created interactive prototypes to secure alignment and buy-in.

  • Led multiple rounds of user testing to inform iterative design improvements.

  • Optimized information architecture to reduce cognitive load.

  • Collaborated cross-functionally to maintain design fidelity during implementation.

  • Contributed to post-launch analysis driving continuous enhancements.

Design process

1. Research & discovery
Analyzed telemetry data and conducted operator interviews to identify troubleshooting pain points and workflow inefficiencies.

2. Information architecture optimization
Restructured content to provide unified access to resources and telemetry within a logical side panel.

3. Interaction design
Developed interactive prototypes focusing on intuitive navigation and contextual guidance for next best steps.

4. User testing
Conducted iterative usability testing with DevOps engineers and operations managers, refining workflows based on feedback.

5. Visual design & handoff
Delivered pixel-perfect mockups and collaborated closely with engineering for seamless implementation.

Results

  • Integrated into 60+ AWS services with strong adoption growth.

  • Consistent week-over-week increase in usage.

  • Significant reduction in mean time to diagnose reported by users.

  • High satisfaction ratings, especially from AWS’s largest customer cohorts.

  • We’re still refining the tool and measuring its impact. For example, I pushed for adding new entry points on the critical path, and we have since seen significant growth in usage.

What I learned

  • Collaboration across teams was essential to success.

  • Involving stakeholders and getting their buy-in early was important.

  • Post-launch analysis of usage data allowed us to spot areas for improvement and make changes that drove significant adoption.

Beyond the brief: going the extra mile for users and the team

  • Engaged with users who left feedback.

  • Measured KPIs for the feature: usefulness, ease of use, satisfaction, whether the feature helps users know where to go next, and whether it helps them save time while troubleshooting.

  • Maintained high team morale through transparent communication and celebration of wins.

  • Convinced stakeholders to keep funding improvements for this feature.