On the first of March, 1932, an intruder entered the New Jersey home of aviator Charles Lindbergh. The intruder used a ladder to enter the bedroom of little Charles Jr. and kidnapped the sleeping infant. A little over two months later, the baby’s body was found nearby.
The intruder had left a ransom note on the windowsill in the baby’s bedroom.
The Lindberghs paid the ransom money—a small fortune in those days—and some of the dollar bills they handed over soon surfaced in New York City. The trail ultimately led to Richard Hauptmann, a German immigrant with a criminal record, who was promptly arrested.
During Hauptmann’s trial, eight handwriting experts testified to similarities between the ransom note and other specimens of Hauptmann’s writing. Hauptmann was convicted of capital murder and was electrocuted in 1936, four years after the kidnapping.
Hauptmann’s case is one of several famous criminal cases in which forensic handwriting analysis served as a key piece of evidence.
Forensic handwriting analysis relies mainly on expert judgment, namely the side-by-side comparison of different samples of handwriting, similar to the comparison of fingerprints that we have discussed here earlier. How reliable is this expert judgment? If eight handwriting experts agree, can we be reasonably confident of Hauptmann’s guilt?
In a recent article in the Psychonomic Bulletin & Review, researchers Kristy Martire, Bethany Growns, and Danielle Navarro examined one aspect of forensic handwriting expertise, namely the ability to assign probabilities to idiosyncratic features of a person’s handwriting. This is an important question to examine because forensic scientists are increasingly required to assign probabilities to their expert judgments, rather than simply declaring a “match” or “mismatch”.
Martire and colleagues gained access to a recent database that estimated the frequency of idiosyncratic handwriting features in a representative sample of Americans. Access was granted before the database was published, permitting the researchers to score expert performance before the information was available to the expert community. In addition to probing the performance of forensic handwriting experts, Martire and colleagues also examined the performance of American and non-American novices. The inclusion of a non-U.S. comparison group allowed the researchers to estimate the importance of exposure to culturally specific environmental probabilities.
Martire and colleagues recruited 18 court-practicing handwriting specialists, from both the U.S. and elsewhere, who wrote an average of around 30 reports per year. A further 77 non-expert participants were also recruited. Participants were presented with 60 feature exemplars (30 cursive and 30 printed) that were selected to represent five levels of actual occurrence probability in the corpus: 1%, 25%, 50%, 75%, and 99%. Participants were asked to estimate the occurrence probability of each feature (without access to the true values, of course).
A sample trial is presented in the figure below:
What is your estimate of the frequency of this feature? Out of 100 Americans, how many would use two strokes to print a lower case ‘z’? How well do you think you would do at this task overall?
The figure below summarizes the main results of the study by Martire and colleagues. Each set of bars on the right shows the mean absolute error (that is, the absolute difference between actual and estimated percentages, averaged across trials) for the category of actual occurrence frequencies shown on the left. So the top bar chart shows results for features whose actual occurrence was 1%, the next chart for features occurring at 25%, and so on down to the bottom chart, for features that occur in 99% of all individual samples.
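To make the error measure concrete, here is a minimal sketch in Python, using made-up numbers rather than the study’s data, of how mean absolute error is computed overall and per occurrence level:

```python
import numpy as np

# Made-up illustration: each entry pairs a feature's true occurrence
# rate (in percent) with one participant's estimate of it.
true_pct = np.array([1, 1, 25, 25, 50, 50, 75, 75, 99, 99], dtype=float)
estimated_pct = np.array([5, 2, 30, 40, 55, 45, 60, 80, 90, 99], dtype=float)

abs_err = np.abs(estimated_pct - true_pct)
print(f"Overall MAE: {abs_err.mean():.1f} percentage points")

# MAE broken down by level of actual occurrence, as in the figure.
for level in np.unique(true_pct):
    level_mae = abs_err[true_pct == level].mean()
    print(f"  features at {level:4.0f}%: MAE = {level_mae:.1f}")
```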
The figure suggests that the experts were more accurate than the novices overall, with that advantage being accentuated for the most extreme actual probabilities (1% and 99%). In addition to expertise, the country of origin also mattered: the most accurate participants were the American experts (20% error), followed by the non-U.S. experts (22%), and the two groups of novices. Curiously, the non-U.S. novices performed better (24%) than the American novices (28%).
Those group differences tell only part of the story, as becomes apparent when performance is considered at the individual level. The results of this analysis are shown in the next figure, which plots each individual participant’s calibration. That is, the grey curves within each panel represent the estimated calibration of each individual between subjective probabilities (on the Y-axis) and actual frequencies of occurrence (X-axis). Perfect calibration corresponds to the diagonal set of black diamonds. If a participant’s grey line fell exactly onto that diagonal, then that person’s subjective estimate would match the actual frequency of occurrence at every level.
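To illustrate what a calibration curve captures, here is a minimal sketch with a hypothetical participant’s responses (the study fit model-based calibration curves; raw per-level means are used here only to convey the idea):

```python
import numpy as np

# Hypothetical responses: two estimates per level of actual occurrence.
levels = [1, 25, 50, 75, 99]
responses = {1: [5, 10], 25: [20, 35], 50: [55, 50], 75: [60, 70], 99: [85, 95]}

# One point of the calibration curve per level: actual frequency (x)
# against the participant's mean subjective estimate (y).
for level in levels:
    mean_estimate = np.mean(responses[level])
    print(f"actual {level:2d}% -> mean subjective estimate {mean_estimate:.1f}%")
# Perfect calibration: the mean estimate equals the actual value at every level.
```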
The figure shows striking heterogeneity among the novices (right-hand panels): their calibration curves take on a bewildering range of shapes and slopes and curvatures. The experts (left-hand panels), by contrast, are quite consistent and differ less from each other. The U.S.-based experts, in particular, cluster tightly together and are near the perfect calibration line (black diamonds).
In a final analysis, Martire and colleagues sought evidence for the “wisdom of crowds” effect—that is, would the novices in the aggregate perform as well as the experts? The first set of results in the figure above seems to rule out that possibility—after all, on average the experts did better than the novices. There are, however, better ways to aggregate performance than by forming a simple average.
The results of this analysis are shown in the next figure:
Two aspects of this pattern are particularly noteworthy: First, averaging all participants together is worse than considering the experts alone (center panel; recall that we are plotting error, so a larger number means worse performance). Second, when the data across individuals are instead aggregated using a hierarchical Bayesian model (right-most panel), then the inclusion of all participants, novices included, reduces the error compared to the inclusion of the experts alone.
The reasons underlying this result are fascinating but too technical to explain in full in this post. You can check out the Wikipedia entry on Stein’s paradox for a thumbnail sketch of why certain ways of combining individual responses are superior to simple averaging. In a nutshell, this occurs because the measurement error in each participant’s responses is reduced by “borrowing” information from all other participants.
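To give a flavor of the principle, here is a simplified, simulated sketch using James–Stein-style shrinkage toward the grand mean (not the hierarchical Bayesian model the authors used): each noisy estimate borrows from all the others, and the overall error typically drops.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the task: 60 features at the five occurrence
# levels used in the study, with one noisy estimate per feature.
true_p = rng.choice([0.01, 0.25, 0.50, 0.75, 0.99], size=60)
sigma = 0.15  # noise level, assumed known for this sketch
estimates = true_p + rng.normal(0, sigma, size=60)

# James-Stein-style shrinkage toward the grand mean: each estimate
# borrows information from all the others.
grand_mean = estimates.mean()
k = len(estimates)
shrink = max(0.0, 1 - (k - 3) * sigma**2 / np.sum((estimates - grand_mean) ** 2))
shrunk = grand_mean + shrink * (estimates - grand_mean)

def mse(est):
    return np.mean((est - true_p) ** 2)

print(f"MSE of raw estimates:    {mse(estimates):.4f}")
print(f"MSE of shrunk estimates: {mse(shrunk):.4f}")  # typically lower
```

The same logic, implemented as a full hierarchical Bayesian model over all participants, is what allowed the novices’ responses to add usable signal in the study.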
So how many people did you think used two strokes for a lower case ‘z’? If your guess was 2 out of 100, you were pretty close: the true frequency in the corpus is 0.021, or about 2 in 100. And if you were way off, then don’t worry too much. Provided your response is combined with those of many other novices using a hierarchical Bayesian model, you would still have made a contribution to the wisdom of the crowd.
Psychonomics article featured in this blogpost:
Martire, K., Growns, B., & Navarro, D. (2018). What do the experts know? Calibration, precision, and the wisdom of crowds among forensic handwriting experts. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-018-1448-3.