Noise: A Flaw in Human Judgment by Daniel Kahneman, Olivier Sibony, & Cass Sunstein

When you want to find good judges, look for those who are well trained, more intelligent, and actively open-minded. The last trait relates to how they think. In some jobs, such as medicine or weather forecasting, judgments are verifiable. In many others, such as wine tasting, book and movie reviewing, and essay grading, they are not. In such fields, practitioners are considered respect-experts because of the respect they enjoy from their peers.
Intelligence or general mental ability (GMA) is correlated with good performance in virtually all domains. It becomes more important as the job becomes more complex, which includes those that require judgment. Better judges are more likely to engage in reflective thinking (type 2) rather than trusting their gut (type 1). If you want to reduce error it is better to remain open to counterarguments and to know that you might be wrong.

Trying to remove bias is a form of decision hygiene. It can be done before or after a decision is made. Automatic enrollment in things like pension plans and free school food programs is an example of the before variety. You can also train people to look for bias. You should designate someone as a decision observer to look for bias in real time so that it is an ongoing process. Consider using a checklist like the one provided in the appendix.

Where there is judgment there is bias, even when it comes to fingerprint analysis. When getting a second opinion in any field, it’s vital that the person giving the second opinion not know what the first opinion was. People making judgments also should not have access to information they don’t need to make a judgment.

Many judgments involve forecasting, and forecasters tend to be overconfident. This includes fields as diverse as climate forecasting and predicting Supreme Court outcomes. If your organization relies on forecasts from time to time, you can improve them by selecting the best forecasters and having them work together. It is vital that they produce initial forecasts independently before they defend them and consider the forecasts of other members of the team. This gives you a “wisdom of crowds” situation using a select crowd. Once members of the team have shared their forecasts and considered other opinions, they each produce a new forecast, which may or may not differ from the original. These second forecasts are then averaged to reduce noise. Final forecasts should not be aggregated until it is absolutely necessary. This gives the team maximum access to useful information and the most time for consideration.
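The averaging step can be sketched in a few lines of Python. This is only an illustration, not a method taken from the book: the function name team_forecast, the forecaster names, and the probabilities are all made up, and the check simply refuses to average a revised forecast from anyone who did not first submit an independent one.

```python
from statistics import mean

def team_forecast(initial, revised):
    """Average the revised forecasts (hypothetical helper, not from the book).

    initial: {forecaster: probability} submitted independently, before discussion
    revised: {forecaster: probability} submitted after discussion
    """
    # Everyone who revises must have committed to an independent forecast first.
    missing = set(revised) - set(initial)
    if missing:
        raise ValueError(f"no independent initial forecast from: {sorted(missing)}")
    return mean(list(revised.values()))

# Made-up probabilities for a single yes/no question.
initial = {"ana": 0.60, "ben": 0.75, "cho": 0.40, "dee": 0.55}
revised = {"ana": 0.60, "ben": 0.70, "cho": 0.50, "dee": 0.55}

print(round(team_forecast(initial, revised), 4))
```

A plain mean is used here because the summary only says the second forecasts are averaged. A team might prefer a median or a weighted average, but that would be a separate design choice.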

There is a lot more noise in medical diagnosis and treatment than most of us would think. The amount of noise varies from one condition to another, with some having little or no noise and others having a lot. Increased skill and training can reduce noise. Second opinions exist because the medical community recognizes the existence of noise. Where there are guidelines to follow, the amount of noise is usually less. In that respect, guidelines work like using artificial intelligence to diagnose and treat, since the same rules are applied to every case. In terms of noise, psychiatry is an extreme case. The likelihood of agreement between two opinions seems to be only a little more than 50%. Some like to diagnose depression while others prefer anxiety.

Performance ratings can be absolute, where two people can have the same rating, or ranked, where they can’t. The latter eliminates level noise between rankers, as everyone can only put one person on top and so on down. Forcing a ranking onto an absolute measure is illogical, cruel, and absurd. (Doug: This doesn’t stop schools from using it when they use percentiles.) It forces a differentiated distribution on an undifferentiated reality. Almost all people who get and give ratings hate the process, and some organizations have eliminated them altogether. The best you can do is work hard to clarify your rating scale and train people well in its use.
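A small Python sketch may help show why ranking removes level noise. The names, scores, and the to_ranks helper are invented for illustration, and the helper assumes no tied scores. One rater is lenient and the other harsh, so their absolute scores disagree everywhere, yet both produce the same ranking.

```python
def to_ranks(scores):
    """Convert absolute scores to ranks, 1 = best (hypothetical helper; assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Made-up ratings of the same three people on a 1-10 scale.
lenient_rater = {"ana": 9, "ben": 8, "cho": 7}
harsh_rater = {"ana": 6, "ben": 5, "cho": 4}

# The absolute scores differ by three points for every person (level noise),
# but the rankings derived from them are identical.
print(to_ranks(lenient_rater))
print(to_ranks(harsh_rater))
```

The sketch also hints at the cost noted above: ranks are always spread from first to last even when the underlying scores are nearly equal.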

Standard interviews are almost useless. They are full of bias and noise due in large part to the impact of first impressions. (Doug: Keep in mind that you are not hiring someone to interview for a living.) The authors cite Google’s efforts as a way to make them more reliable. These are called structured interviews. You first need to define clearly and specifically what you are looking for. In the case of teachers your list of categories might include GPA, difficulty of courses taken, teaching experience, appearance, personality, and so forth. Each rater should independently rate each candidate for each category using a rubric of some sort. These ratings are then aggregated from multiple interviews. Once all of the evidence has been collected and analyzed it’s time for the judgment and intuition part, which sadly comes first in many organizations.
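The rating and aggregation steps might look something like the following Python sketch. It is not Google's or the authors' actual procedure: the 1-5 rubric, the category names, the rater labels, and the aggregate_ratings function are all assumptions made for illustration. Each rater scores the candidate on each category without seeing the others' scores, and only then are the scores averaged per category.

```python
from statistics import mean

def aggregate_ratings(ratings):
    """Average independent ratings per category (hypothetical helper).

    ratings: {rater: {category: score}} for a single candidate
    returns: {category: mean score across the raters who scored it}
    """
    categories = sorted({c for by_category in ratings.values() for c in by_category})
    return {
        c: mean([r[c] for r in ratings.values() if c in r])
        for c in categories
    }

# Made-up rubric scores (1-5) for one teaching candidate.
ratings = {
    "rater_1": {"teaching experience": 4, "course difficulty": 3},
    "rater_2": {"teaching experience": 5, "course difficulty": 3},
}

print(aggregate_ratings(ratings))
```

The per-category averages are kept separate rather than collapsed into one number, so the final judgment can still weigh each category once all the evidence is in.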

The structured process described for interviews can also be used for any kind of decision. First, you come up with as many separate assessments as possible, each covering a different, independent aspect of the decision. You then have everyone analyze, report, and rate each aspect. You present everyone’s findings to the group and discuss each one. The goal here is to get at the truth rather than sell a position. Finally, you bring in judgment and intuition in order to make the decision. This won’t eliminate noise and bias, but it will introduce a measure of what the authors call decision hygiene.