When there are many managers in a large organization, how do you make sure that any performance ratings, promotions, and subsequent compensation changes are fair, relative to each other? A common mechanism is for all the managers to get together and compare notes — or “calibrate” — on their teams’ ratings and promotions.

Why Calibrations?

Calibrations are primarily about reducing the bias of ratings from individual managers, and increasing equity between managers. Secondarily, they are a great way to train managers on how the company thinks about performance. They also give senior managers data on the highest potential folks in a large organization, for strategic organization design.

In most companies, there is a fixed budget in dollars for things like compensation changes and promotions. Calibrations distribute this budget as fairly as possible. This post will not explicitly cover compensation, but know that most companies will want to link compensation to some combination of company, business unit and individual performance.

In a “pay for performance” system, you want to have outsized rewards for outsized impact. But the budget is still fixed. If you put twice as many people in the exceeding expectations category, then those people will get half the compensation increase they would have otherwise gotten. In order to maximize the impact of rewards, you do need to make decisions between cases.
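
To make the math concrete, here is a minimal sketch in Python; the pool size, the share reserved for the top rating, and the headcounts are illustrative assumptions, not figures from any real compensation plan.

    # Illustrative only: a fixed merit pool, with an assumed share reserved for
    # the "exceeding expectations" bucket.
    merit_pool = 100_000          # fixed dollars available for increases
    exceeding_share = 0.5         # assumed portion of the pool for the top rating

    for exceeding_count in (5, 10):
        per_person = merit_pool * exceeding_share / exceeding_count
        print(f"{exceeding_count} people exceeding -> ${per_person:,.0f} each")
    # Doubling the bucket from 5 to 10 people halves each person's increase.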

Elephant in the Room: Stack Ranking

Stack ranking means that you create an ordered list of individuals according to their performance. It’s a mechanism to force the discussion about how to allocate rewards for greatest impact. The version that most people object to is being forced to label some people as low performers. Don’t do that.

If you have a large organization, it’s reasonable for there to be a bell curve of performance. In practice, I have seen that for cohorts of 50 people or more — at both large companies and start-ups — performance will roughly fit a curve. This loosely comports with the central limit theorem and the common rule of thumb that samples of 30 or more start to look approximately normal. But that does not mean that the people at the low end of the curve are necessarily low performers. I do not think it’s acceptable to mandate a fixed percentage here.
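
A small simulation can illustrate that intuition. Modeling performance as the sum of many small, independent factors is my assumption here, purely for illustration, not a claim about any real rating data.

    import random

    random.seed(7)
    # Assumed model: each person's "performance" is the sum of 12 small,
    # independent factors, which is the central limit theorem intuition.
    cohort = [sum(random.uniform(-1.0, 1.0) for _ in range(12)) for _ in range(50)]

    # Crude five-bucket histogram: the middle buckets hold most of the cohort.
    lo, hi = min(cohort), max(cohort)
    buckets = [0] * 5
    for score in cohort:
        idx = min(int((score - lo) / (hi - lo) * 5), 4)
        buckets[idx] += 1
    print(buckets)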

You can keep yourselves honest by pressure testing people in the “meeting expectations” bucket. This is less of a budget issue, and more about holding managers and individuals accountable for real performance problems. While it’s not fair to absolutely require a fixed number of people to be labeled underperforming, there is a definite tendency for managers to avoid these difficult conversations by giving someone an unwarranted “pass”.

Stack ranking gets a bad name from being forced on too-small cohorts, and from forcing low performance ratings. Companies that earned a bad reputation for this in the past commonly still have ratings and curves, but have substantially changed how the ratings are generated and reconciled. Some do away with ratings, but still have differentiated compensation changes that could be reverse engineered into ratings. Some move the reconciliation to the Director level, and give Directors more discretion in fitting the curve.

How to Run Calibrations

I’m going to outline a process that works well for cohorts of about 50 individuals, and can scale up to any size by layering on multiple calibration rounds. The goal of calibrations is to have fair and consistent performance ratings across organizations, roles, and levels. You want to reward high performance, and make sure that appropriate poor performance conversations are happening.

This process can also be scaled up and down in terms of how much time it takes. You can cap write-ups in length. You could potentially do the entire thing asynchronously, with no meeting. Reducing the size of the cohort will also save time. These are all trade-offs against how much you want ratings to be equitable across teams.

Pre-work Before Calibrations

Before you can calibrate, you need initial ratings and write-ups, which means gathering data. This is where self reviews and 360 reviews come in.

Self reviews should use the same format as the eventual manager review. You should have a defined template for this up front. They should reference your existing framework for expectations per role and level. You should ask people to provide a rating for themselves. Any training you can do here to help folks write good self reviews will pay off.

Individuals and their manager should determine a set of three to five 360 reviewers per person. The format of a 360 should be short. Ask reviewers to timebox their effort, skip pleasantries and platitudes, and focus on situation, behavior, and impact. 360s can be anonymous or not, as long as it’s clear to the reviewer.

Managers should draft their reviews in parallel with self reviews and 360s, and then incorporate any new information prior to calibrations. It’s at this point that managers should document initial ratings and promotion candidates. You are going to want to gather these in some central system, even if it’s just a spreadsheet.
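
As a sketch of what that central system might hold, here is one possible record shape per individual; the field names are my assumptions, and a spreadsheet row with the same columns works just as well.

    from dataclasses import dataclass

    @dataclass
    class CalibrationCase:
        name: str
        level: int
        manager: str
        proposed_rating: str        # e.g. "meeting expectations"
        promotion_proposed: bool
        writeup_link: str           # link to the manager write-up
        self_review_link: str = ""
        notes: str = ""             # 360 highlights, open questions, etc.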

The Calibration Meeting

For the actual calibrations, you should pick one facilitator. It’s this person’s responsibility to set an agenda, set a timebox for each individual, and manage time during the meeting. The facilitator will need to have access to the preliminary ratings and promotions for every person to be discussed. They should create an agenda that has the name of each individual, the order they will be discussed, and a link to the manager write-up. It may help to give the attendees basic guidance about how many people would need to receive each rating in order to hit the budget.
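
One way to produce that guidance is to translate the budget into rough target counts per rating. The rating names and percentages below are illustrative assumptions about what a budget might imply, not prescriptions.

    # Illustrative: turn assumed distribution guidance into head counts the room
    # can sanity-check proposed ratings against.
    cohort_size = 50
    assumed_distribution = {
        "exceeding expectations": 0.20,
        "meeting expectations": 0.70,
        "below expectations": 0.10,
    }
    guidance = {rating: round(cohort_size * share)
                for rating, share in assumed_distribution.items()}
    print(guidance)  # {'exceeding expectations': 10, 'meeting expectations': 35, 'below expectations': 5}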

Calibration meetings have a reputation for taking forever. A meeting of six or more hours to decide the ratings for 50 individuals is not uncommon. It’s also common for certain managers to dominate the airtime, which can result in bias. How can these meetings be run more efficiently, and still result in equitable outcomes?

After several iterations, I’ve arrived at a format that I love. It’s pretty simple:

  1. Send out a pre-read with the links to each write-up ahead of time. Group by level, and then by proposed rating. For example, all people at a certain level who are up for an exceeding expectations rating should be calibrated on in a block (a small grouping sketch follows this list).
  2. Solicit Q&A about each individual right in the agenda doc. This is the most valuable feedback the manager will ever get about how well they are calibrated.
  3. In the calibration meeting, each candidate starts with 5 minutes of silent reading. Managers do not “present”. Attendees can write additional questions during this time.
  4. The manager answers the written questions live for the rest of the time, while someone else takes notes.
  5. At the end of the timebox, everyone submits a “confidence vote” on the candidate.
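
For the grouping in step 1, a minimal sketch might look like the following; it builds on the hypothetical CalibrationCase records from earlier.

    from itertools import groupby

    def preread_blocks(cases):
        """Group cases by level, then by proposed rating, so that similar cases
        are read and discussed back to back."""
        ordered = sorted(cases, key=lambda c: (c.level, c.proposed_rating))
        for (level, rating), block in groupby(
                ordered, key=lambda c: (c.level, c.proposed_rating)):
            yield level, rating, [c.name for c in block]

Each block can then be scheduled as a contiguous chunk of the agenda, in the order it will be discussed.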

The confidence voting is the real secret sauce here. The idea is that every person present at calibrations submits a “blind” score for how confident they are in this individual being “at the bar” for the proposed rating or promotion. Blind means that scores are submitted without seeing anyone else’s score, to avoid bias. You can use any scale, but I like a star system with options for:

  • Not Ready
  • Stretch Case
  • Solid Case
  • Slam Dunk

You can also include a free-text field for comments. If not, encourage folks to leave any comments about why this case was not a “Slam Dunk” in the Q&A notes.
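
A minimal sketch of collecting those blind votes, assuming a simple 1-4 mapping of the scale above and an in-memory store (both assumptions, not a prescribed tool):

    CONFIDENCE_SCALE = {
        "Not Ready": 1,
        "Stretch Case": 2,
        "Solid Case": 3,
        "Slam Dunk": 4,
    }

    votes = {}   # case name -> list of (score, comment) pairs from attendees

    def submit_vote(case_name, label, comment=""):
        """Record one attendee's blind vote; nothing is revealed until after
        the meeting."""
        votes.setdefault(case_name, []).append((CONFIDENCE_SCALE[label], comment))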

The confidence voting system can also be used to pressure test low performers. Typically, managers are somewhat reluctant to put individuals up for discussion as potential low performers. Assuming you do not have time to calibrate on every single individual in the organization, it helps if you have some proxy data you can use to identify people to talk about here. For example, you could identify individuals who have missed delivery milestones, or who have not done the expected volume of candidate interviews. Group these folks together, and vote on confidence at a “meeting expectations” rating. This is how the group can come up with calibrated ratings for potential low performers.
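
As an illustration, a simple filter over whatever proxy data you have could nominate people to pressure test; the signal names and thresholds below are entirely hypothetical.

    def pressure_test_candidates(records):
        """Pick people to discuss at the "meeting expectations" bar, based on
        hypothetical proxy signals."""
        return [r["name"] for r in records
                if r["missed_milestones"] >= 2
                or r["interviews_done"] < r["interviews_expected"]]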

Post Calibrations

The confidence votes are tallied and revealed after calibrations. Scores are averaged together per case, and the results are made available for all calibration attendees. It’s at this point that the overall decision maker for the organization can make final decisions by defining a “cut line” in the resulting graph of averaged scores. It’s up to them to figure out how closely to hew to any distribution guidance.
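
Tallying is just averaging and sorting. This sketch assumes the votes structure from the earlier snippet; where the cut line falls is entirely up to the decision maker.

    def tally(votes):
        """Average each case's blind votes and sort descending, so the decision
        maker can scan the list for a cut line."""
        averages = {case: sum(score for score, _ in vs) / len(vs)
                    for case, vs in votes.items()}
        return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

    # Nothing here forces the cut line to sit exactly at "Solid Case" (3.0);
    # relative position against peers up for the same rating matters more.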

Calibrations need one decision maker. Typically this is the manager that all of the calibration attendees, and all of the individuals being calibrated on, report into. Alternatively, you could randomize cohorts across an organization to further mitigate bias, at the cost of the calibrators having less context.

With the voting system, the decision maker does not need to make the final decisions live during the meeting. Instead, they can look at the results after the fact, and decide. An average score under the “Solid Case” line does not automatically mean that the rating or promotion is rejected. It may be, but what matters much more is how the score compares to other individuals up for the same rating or promotion.

I’ve been surprised, in that decision-making role, by just how tight the range of scores on any given individual is in practice, and also by how much differentiation there is between cases. When that happens, it’s a sign that the managers in those calibrations are indeed thinking about performance and expectations in an aligned way. In that case, the decision maker knows that they are not unilaterally making decisions that may be biased.

Large Organizations & Multiple Calibration Rounds

A single calibration session can only scale up to cover about 50 individuals. Assume that you only discuss people who are not solidly “meeting expectations”, and that those folks will be roughly 50% of the total. At a minimum of eight minutes per person, that is already a meeting of more than three hours. The total number of managers, or meeting attendees, would be between 6 and 12.
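
The sizing math is simple enough to sketch; the 50% discussion rate and the eight-minute timebox are the assumptions stated above.

    cohort_size = 50
    discussed = cohort_size * 0.50       # skip people solidly "meeting expectations"
    minutes_per_person = 8
    total_hours = discussed * minutes_per_person / 60
    print(round(total_hours, 1))         # 3.3 hours, before any breaks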

You will most likely have multiple calibrations, due to visibility issues. It’s typical for managers at level N to only sit in on calibrations discussing individual contributors at N-1 and below, or similar. That means that higher-level individuals are likely to be discussed by managers with more scope, who manage larger orgs and therefore probably have hundreds of individuals under them, far too many for a single session.

There should be one overall facilitator to create the timeline and define the cohorts for how these multiple calibrations will be scheduled, so that no cohort is over about 50 individuals, and individuals are not discussed multiple times.

Sometimes, you may choose to intentionally discuss individuals multiple times. This can be when you want to give managers practice writing their cases, and provide feedback in time to affect the final decisions. It can also be when you want to gather data from managers who work with a high-level individual regularly, before presenting the case in a higher level calibration.

In all these calibrations, you should strive for (or outright require) each cohort to individually meet its budget guidance. This is how you scale to any organization size without having to directly compare every individual with every other individual. It will not be possible to draw cut lines across cohorts later, because the relative confidence scoring between cohorts is not directly comparable.

Last Word on Confidence Voting

The confidence voting system in particular delivers some key benefits that most calibrations struggle to achieve. The meeting time and conversations about individuals can be timeboxed. You can choose not to discuss the up to 50% of people who are solidly in the “meeting expectations” camp, if time does not allow. The final results can be justified with data, and you can reason about exactly how “calibrated” the group was. You can choose how closely to follow any given budget guidance. You can also pressure test low performers.