Anomaly detection: The evolution of an experience over 3 years


Problem and context:

When I joined New Relic's AI and machine learning team, it was in the middle of re-chartering. The team had built a way to automatically detect anomalies in complex environments, based on what are commonly known as the golden signals: errors, response time, and throughput.

My task was to figure out how to deliver these anomalies to New Relic users in a way that would help them troubleshoot issues across their tech stack. From a business perspective, this was seen as added value for New Relic’s regular customers.

I had to make sure I fully understood the problem I was trying to solve. While the engineering team focused on the underlying detection algorithms, I spent time getting to know the users and the problem space. Through a lot of reading and conversations, I quickly developed a straw-person user journey for troubleshooting.

As a team, we decided to focus on a single part of the journey, the why:

I developed this problem statement to frame our work:

How might we help incident responders understand and troubleshoot problems faster in complex environments?

Delivering anomalies in multiple contexts

Anomalies with alert notifications:

Our first effort was to deliver detected anomalies alongside existing alert notifications, providing more context about the problem.

One of the design challenges I faced was figuring out how much attention to give the anomaly versus the alert notification. I went through many iterations.

What we first delivered was a banner that would appear in context with an alert, on the alert details page. The copy became very important – we couldn’t definitively say that the anomaly was part of the root cause of why the alert had triggered, but rather that it was additional context and something worthy of investigation.

Anomalies in the New Relic mobile app:

Soon after building the desktop experience, we wanted to bring it to the New Relic mobile app. Mobile is one of the primary touchpoints for alert responders, as they’re often on call and not always in front of their laptop.

Anomalies in Slack:

The next most important touchpoint was Slack, where most New Relic users already received alert notifications.

Designing this experience in Slack became the most uniquely challenging aspect of this project. Slack limits what visual elements can be included and requires the use of its own design components (Block Kit), but there was so much potential in meeting users where they already were and delivering vital information they could then act on.
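
To explore those constraints quickly, I prototyped notification layouts in Slack’s Block Kit Builder. The sketch below is illustrative only; the copy, service name, link, and action IDs are placeholders I’ve invented here, not the actual New Relic payload:

```typescript
// Illustrative Block Kit payload for an anomaly notification.
// All text, targets, and action IDs are placeholders, not New Relic's real schema.
const anomalyMessage = {
  channel: "#incident-response",
  blocks: [
    {
      type: "section",
      text: {
        type: "mrkdwn",
        text: ":warning: *Anomaly detected* on `checkout-service`\nError rate is unusually high compared to its recent baseline.",
      },
    },
    {
      type: "context",
      elements: [
        { type: "mrkdwn", text: "Golden signal: errors · detected a few minutes ago" },
      ],
    },
    {
      type: "actions",
      elements: [
        {
          type: "button",
          text: { type: "plain_text", text: "Analyze" },
          url: "https://one.newrelic.com", // deep link back into the product
          action_id: "open_analysis",
        },
      ],
    },
  ],
};
```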

Anomaly detection outside of alert notifications

Our analytics showed a high level of interaction with anomalies in Slack, and leadership saw a lot of promise in delivering a quick way to detect problems automatically without creating complex alert thresholds.

Delivering anomalies “automatically,” independent of alert notifications in Slack, required a configuration experience within the New Relic product. This would allow users to configure which types of anomalies were delivered and where they went (a Slack channel or a webhook).
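
The shape below is a hypothetical illustration of the kind of information that configuration had to capture; every field name and value is invented for this sketch and does not reflect New Relic’s actual API:

```typescript
// Hypothetical configuration shape, for illustration only.
// Field names and values are invented and do not reflect New Relic's actual API.
type GoldenSignal = "errors" | "responseTime" | "throughput";

interface AnomalyDestination {
  type: "slack" | "webhook";
  target: string; // a Slack channel ID or a webhook URL
}

interface AnomalyNotificationConfig {
  signals: GoldenSignal[];            // which types of anomalies to deliver
  destinations: AnomalyDestination[]; // where to deliver them
  muted: boolean;                     // lets users pause notifications without deleting the config
}

const exampleConfig: AnomalyNotificationConfig = {
  signals: ["errors", "responseTime"],
  destinations: [
    { type: "slack", target: "C0123456789" },
    { type: "webhook", target: "https://example.com/anomaly-hook" },
  ],
  muted: false,
};
```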

My team had a lot of concerns that we’d overwhelm users with too many notifications. To mitigate this, I designed some carefully scripted interactivity that allowed users to mute or turn off notifications. In an ideal world, we’d allow users to provide more direct feedback on the detections, which would influence the algorithms. That work was on the horizon, but it turned out to be a pretty challenging engineering problem to solve.

Getting user feedback:

When we started working on gathering user feedback for anomalies, we began with the desktop experience. I iterated on how best to do this in both the desktop application and in Slack, and created an A/B test in the desktop application to identify which approach worked best.

Once our focus had shifted to delivering anomalies in Slack, the ability to provide feedback became the biggest requested feature from customers. It was challenging for me to figure out exactly what to ask users to get the right feedback. Often, an anomaly would occur that was expected (for example, throughput increasing during health checks that would run at regular intervals). What we were detecting wasn’t wrong, but users didn’t necessarily want to be notified every time.

I worked with my team of ML engineers to figure out how to ask the right question with a few different possible responses:
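
The sketch below illustrates, in Block Kit terms, the general shape of a question with a few response options. The wording, the specific options, and the action IDs are placeholders for this sketch, not the shipped design:

```typescript
// Illustrative feedback prompt built from Slack Block Kit blocks.
// The question wording, response options, and action IDs are placeholders.
const feedbackBlocks = [
  {
    type: "section",
    text: { type: "mrkdwn", text: "Was this anomaly useful to you?" },
  },
  {
    type: "actions",
    elements: [
      {
        type: "button",
        text: { type: "plain_text", text: "Yes, helpful" },
        action_id: "anomaly_feedback_helpful",
        value: "helpful",
      },
      {
        type: "button",
        text: { type: "plain_text", text: "This was expected" }, // e.g. routine health checks
        action_id: "anomaly_feedback_expected",
        value: "expected",
      },
      {
        type: "button",
        text: { type: "plain_text", text: "Not useful" },
        action_id: "anomaly_feedback_not_useful",
        value: "not_useful",
      },
    ],
  },
];
```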

It proved incredibly difficult for the application to actually adjust the anomalies it detected over time; as it turns out, machine learning is hard. So for the time being, we used the feedback to “gather information.”

Research revealed that we HAD to solve the feedback problem.

Shortly after we launched the feedback feature, I conducted a pretty big research study on the end-to-end anomaly detection experience. Almost all users expected that providing feedback would actually change their detections, a completely reasonable expectation. The engineering team was still working on this feedback problem when I left the organization.

There was a lot of promise in the ability to get more context for anomalies. As part of this research study, I showed some customers an early mockup of a new experience my team was exploring – the ability to “analyze” an anomaly and get more context and information.

Users could click through from Slack to open this view and get more details about the anomaly that was identified:

A shift in strategy:

Our team’s focus shifted to developing this new experience, and I refined the designs to focus on questions from the troubleshooting journey:

  • Has anything else been detected?

  • What’s unique about this problem?

  • Where should I start troubleshooting? 

  • What other metrics might be affected?

  • What’s not being affected?

  • What’s happening upstream and downstream of this application?

I refined the design further. We built it, released it, and I added a microsurvey using Sprig to evaluate the effectiveness of both the analysis and the design. This also helped with recruiting customers to talk to for continuous discovery – users could indicate in the microsurvey if they’d be willing to talk to a member of the product team.

Value delivered:

  • Created a culture within the team of continuous discovery and customer feedback that helped us evolve and improve the experience over time.

  • Rapid storyboard iteration that helped the team make design decisions quickly

  • Worked with Slack’s Block Kit Builder to quickly prototype what our Slack experience would look like

  • Gathered customer feedback that helped drive the strategic focus of the product

My contributions:

  • Lead/solo designer responsible for end-to-end product and interaction design

  • Planned and executed generative and evaluative research

  • Contributed to product strategy, working closely with PM
