Do’s and Don’ts for Performance Teams

Rico Mariani

Published in Level Up Coding · Jan 28, 2020 · 8 min read

Approximately correct and hopefully helpful

Photo by Kelly Sikkema on Unsplash

This article is about setting up your performance team for success. It’s about goals and practices and management, not about technical problems. If you’re looking for an article about the importance of locality on modern processors you should stop reading now.

Working on a performance team can be like running a marathon. I think everyone remembers the times when awesome stuff happened: the huge result you landed with a lot of hard work. But those pushes are often the exception. I think maybe perf people get addicted to those moments; they sure help you get through the other times.

I’m getting ahead of myself.

Introduction and Context

Many years back a friend (thank you Jason) introduced me to the “offense” and “defense” parlance of performance. It’s funny because I had been using much more complicated words to describe it and when he said “offense and defense” I thought it was such a simple way of explaining the situation. Everyone remembers the offense, and organizations immediately value it. That’s “hero time.” But, actually, you win with good defense. And good defense requires that everyday attention-to-detail that you find among the best performance people.

So, for your organization to work, you need it to be effective on both fronts. And I can tell you from experience it is very easy to burn out on defense. That’s the trench warfare part. Even the best organizations are plagued by the normal entropy of engineering — the average change is far more likely to be bad for performance than good. That’s just the math of it.

With defense being so important, as a manager you might be tempted to set lots of goals around it. “No regressions”, “hold the line”, etc. That’s all well and good. But, tempting as it is, if you demand this of your performance team, they will all just quit. The stubborn ones might take longer but, in the end, you’ll have nobody. You certainly won’t have the rock stars you want. So what do you do? Good defense is essential.

Let’s start with some unfortunate realities. First, even though it’s super-hard, defensive results rarely play well when it comes to story-telling. There are exceptions of course, but as a rule when you’re discussing the superstars you rarely tell the story of how these 5 people held the line. And even if you try it can be a tough sell. You might hear responses like “We could have held the line by sending everyone home” — which is totally exasperating. On the flip side you might hear “we prevented 25 regressions of about 5–7% each,” but that’s problematic in the reverse way. Regressions can be arbitrarily bad, and in fact the bad ones are generally the easiest to find and fix. So, if you prevent a 2000% regression, that’s almost certainly less hard than preventing a 0.2% regression. And it’s those paper-cuts that ultimately kill you. If your code is churning with any kind of frequency, then 0.2% on each iteration will soon spell death.
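To put rough numbers on those paper-cuts (my own back-of-the-envelope sketch, using the 0.2% figure above): small regressions compound multiplicatively, so it only takes a few hundred of them to double your time.

```python
# Rough arithmetic for the paper-cut claim: if each change lands a 0.2%
# regression, the slowdowns compound multiplicatively.
import math

per_change = 1.002  # hypothetical: every change is 0.2% slower
changes_to_double = math.log(2) / math.log(per_change)
print(f"~{changes_to_double:.0f} such changes double your time")  # ~347
```

A busy codebase can absorb a few hundred changes in weeks, not years.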

Great, so to do your job you have to find and fix 0.2% regressions; if you fail, you’re screwed, and if you succeed, nobody is impressed. Wow! This is such a great job! Everyone should be doing this!

Looking at offense, the situation is likewise wonky. If you find, say, a 50% improvement in the code, it’s far more likely that the code was horrible in the first place and you just made it not suck than that you’re a genius who came up with a super-clever way to solve the problem twice as fast. In fact, giant performance gains are almost certainly the result of giant existing mistakes rather than elite engineering in the fixes.

It’s actually the accumulated small wins that tend to be the hardest, and often the best, work. Engineers who manage to squeeze out another 10% by doing a series of cleanups — often leaving the code much better than they found it — are probably the cleverest. And, again, the hard work frustratingly looks less impressive. If Jane improves your time by 28% and Nancy improves it by 7%, does that mean Jane is 4x better than Nancy at finding and fixing problems? She might be, but not because of those two numbers.

What are we to do about this? How do you recognize and reward? Where should there be accountability?

Important Observations From Experience

#1 Aspirations are not automatically team success metrics

When you’re planning a project, or a milestone of a project, you often want to set some kind of goal right away. If you’re lucky, you have some good user-facing research to tell you what your performance needs to be in order to be successful. If you’re less lucky, you’ll just have some intuition. Either way, you’re likely to want to make this concrete, so you define some aspiration. I think this is a good idea, but you must remember that these things are not grounded in reality by default.

For instance, you can say “Look, we need this thing to be about two seconds or we’re screwed,” and that’s great. That insight will help set the tone for what you can and can’t do, and it’s invaluable for roughing out engineering plans. But until you have done some research you don’t know that two seconds is even possible. You can tell you’re getting into trouble when “about two seconds” turns into 2000ms on some official-looking spreadsheet.

So, success criteria are great, but people have to sign up for them, and the more important a criterion is, the sooner you need to be sure you can actually make it happen. Which brings us to the next guideline.

#2 Customer-facing goals are basically never good engineering goals

You may think I’m a heretic but I’m famous for saying this and I may as well stick to my guns: Time sucks. Customers care about time so we can’t ignore it, but time-based analyses are the hardest to control and generally the least probative. In #1 I talked about checking if things are achievable — one of the essential ways you do this is to get an understanding of what work has to happen to do the job in question. You do this in terms of consumption: so many network round-trips, so much I/O, so many instructions, whatever it may be. When you can write the bill for the important operations and use those costs to model the likely time, then you’ll have a sense that you can do it, and how to keep it under control. If you can’t model it, you’ll be left to try to measure the time and costs after the fact and hope you can make it work. That’s not very forward-looking and it could leave you in a bad place at the finish line.
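Here’s what that bill-of-work idea can look like in practice. This is a minimal sketch; the operation names and per-unit costs are hypothetical numbers chosen for illustration, not measurements from any real product.

```python
# A minimal sketch of a "bill of work" cost model. You budget consumption
# (round-trips, I/O, instructions) and derive a plausible time from it,
# rather than measuring a time and hoping the costs underneath work out.

UNIT_COST_MS = {          # assumed per-operation costs on target hardware
    "network_round_trip": 50.0,
    "disk_read": 5.0,
    "million_instructions": 1.0,
}

def estimated_time_ms(bill: dict[str, float]) -> float:
    """Sum the bill of work against the assumed unit costs."""
    return sum(count * UNIT_COST_MS[op] for op, count in bill.items())

# Hypothetical feature: 3 round trips, 12 reads, 200M instructions.
bill = {"network_round_trip": 3, "disk_read": 12, "million_instructions": 200}
print(estimated_time_ms(bill), "ms")  # 3*50 + 12*5 + 200*1 = 410.0 ms
```

If the bill already blows the “about two seconds” aspiration, you know before you write a line of product code; if it fits with room to spare, you also know which line items to watch as the code evolves.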

You can tell you’re in trouble when all you have are times (or not even that) and they aren’t broken into work. Split-times help, but they’re no replacement for raw costs.

#3 Defense should center around engineering metrics, not customer metrics

Now, I’ll begin by saying that you must make sure your engineering metrics actually work, so it’s vital to validate them against customer reality. But when you’re telling your organization what’s wrong with a build, you need to do it in terms of engineering metrics. These should be concrete counts, like “we are painting twice as many pixels as last week for the same work” or “we do 12 reads now, we used to do 9,” or whatever the case may be. Urgency can be expressed by reference to how things affect the customer (e.g. we’re 50% worse on our top metrics because of [thing an engineer understands]).

The trouble with the top-level metrics is that they are frequently not actionable; I like to say “that’s data I can only cry about.” Remember, time is a consequence of bad consumption, not the other way around. If you’re looking for root causes, you start with how the resources were spent. A suite of good resource checks will help you to find the blame, or at least find the people you should ask.

The other lovely thing about consumption is that it can often be easily measured in many contexts, on a variety of hardware, at different development stages (e.g. unit tests), and it will still show problems clearly. Bytes, paints, reads, instructions retired, locks taken: all these things can be invariant under many kinds of conditions and are invaluable in root-causing problems. Remember, this work isn’t just about identifying problems; it’s about understanding and explaining them.
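As a concrete illustration of defending with counts rather than times, here is a minimal sketch of a consumption check that could run as an ordinary unit test. The counters, budgets, and helper function are hypothetical stand-ins for whatever instrumentation your product actually has.

```python
# A minimal sketch of a consumption regression gate. Counts like reads and
# paints are stable across machines, so they can be asserted even in
# ordinary unit tests, long before a lab run measures wall-clock time.

import unittest

def run_scenario_and_count() -> dict[str, int]:
    # Stand-in for instrumented product code that tallies resource use.
    return {"disk_reads": 9, "pixels_painted": 1_000_000, "locks_taken": 4}

BUDGET = {"disk_reads": 9, "pixels_painted": 1_000_000, "locks_taken": 4}

class ConsumptionRegressionTest(unittest.TestCase):
    def test_scenario_stays_within_budget(self):
        counts = run_scenario_and_count()
        for metric, budget in BUDGET.items():
            # On failure the message names the metric and shows both numbers.
            self.assertLessEqual(counts[metric], budget, metric)

if __name__ == "__main__":
    unittest.main()
```

Because these counts don’t wobble with the machine the way wall-clock time does, a failure points straight at a consumption change rather than at noise.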

You can tell you’re in trouble when problems are reported in terms of customer-facing goals only and engineers feel like finding regressions is mysterious at best, and often hopeless.

#4 Orchestration seeps into every performance problem

Understanding what work happened, and why, is essential to getting data you can work with. For instance, I spent many years working on the Edge browser (RIP) and for many performance problems, I could tell you that we were burning more time in the formatting loop without even looking. But that doesn’t mean there is anything wrong with the formatting code; in fact it may not have changed in ages. The question then becomes: WHY is there more formatting? What caused the extra dirty state? You simply will not find these kinds of answers in a typical profile — the code causing the dirt could itself be super-fast, and it might never be sampled. Even if it was, you’re unlikely to look at something so tiny, and there are potentially thousands of tiny somethings for it to hide among.

The way you understand this is to log the important events that drive cost (not just the thing that is costly) so you can see how they change. You can use begin/end pairs sometimes, but even that may not be enough. You may need something like “here are the 27 reasons we dirty the tree” and “here’s a count of each kind,” so when one suddenly rears its ugly head you have a hope of finding it: “look, this style change is now driving lots more formatting, what happened?”
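A minimal sketch of that idea, with hypothetical reason names standing in for whatever actually dirties your tree:

```python
# A minimal sketch of logging *why* cost happens, not just that it does.
# The reason names are hypothetical stand-ins for something like the
# "27 reasons we dirty the tree" mentioned above.

from collections import Counter
from enum import Enum, auto

class DirtyReason(Enum):
    STYLE_CHANGE = auto()
    DOM_MUTATION = auto()
    VIEWPORT_RESIZE = auto()
    FONT_LOAD = auto()

dirty_counts: Counter[DirtyReason] = Counter()

def mark_tree_dirty(reason: DirtyReason) -> None:
    # Called wherever the product invalidates layout; the reason is the payload.
    dirty_counts[reason] += 1

# After a scenario run, compare the histogram against last week's build:
# a spike in STYLE_CHANGE asks "what new code is churning styles?" even
# though the culprit itself may be too cheap to show up in a profile.
for reason, count in dirty_counts.most_common():
    print(reason.name, count)
```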

You can tell you’re in trouble if you can’t understand the how and why from your lab results: how your work is originating, how it’s being queued or throttled, and how it’s being dispatched. Without this information, it’s impossible to have a clear understanding of the big-picture problems.

The Do’s and Don’ts

So, with these things in mind, you may have a sense of how you can create and keep a successful performance team and not burn them out. I offer these bullet-sized suggestions based on the above and a little experience.

Do: Set customer-facing aspirations.

  • Don’t assume they are success criteria.

Do: Measure things customers care about (like latency).

  • Don’t use them as your primary defense mechanism.

Do: Create rich engineering metrics.

  • Don’t assume their success automatically makes happy customers.

Do: Set top-level goals for your perf team.

  • Don’t make them responsible for all metrics.

Do: Make product changes for customer benefit.

  • Don’t hold all goals steady in the face of changes.

Do: Expect your perf team to work on the hardest performance problems.

  • Don’t expect them to handle all the problems, all the way.

Do: Expect your performance team to have a framework for successful engineering.

  • Don’t blame them for the consequences when it is ignored.

Do: Emphasize excellence for the organization above all else.

  • Don’t limit the team to only defense, or only offensive analysis.

If you do and don’t as above, and reward your team accordingly, you’re much more likely to keep and grow a team of performance specialists. If you ignore most of it, or demand miracles like perfect defense, your team will pretty much implode.

Finally, though what I’ve written was centered on performance teams, a lot of this really applies to every quality-focused team with maybe some adjustments.

If you’ve read this far, thanks, and I hope there was something useful for you.


I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.