There has been a longstanding belief in the scientific validity of fingerprint evidence, based on the apparent permanence and uniqueness of individual fingerprints, the experience-based claims of trained fingerprint examiners, and the longstanding courtroom acceptance of this forensic technique. Yet systematic scientific study of the accuracy of latent fingerprint identification is a very recent development, still very much in progress. In the past, fingerprint identification was sometimes even claimed to be “infallible” or to have a “zero error rate” so long as the method was appropriately applied by an experienced examiner. High-profile cases in which errors were discovered, along with the inherent implausibility of assertions of infallibility, led to doubts about such claims of accuracy, but only in the last few years have scientific efforts to assess the strengths and limitations of fingerprint identification gained traction. The 2009 National Academy of Sciences report on forensic science spotlighted both the limits of our knowledge and the need for basic research, and studies of examiner accuracy have begun to appear since that report. The available data suggest a low level of false positive errors by experts under experimental conditions and a substantially higher rate for false negatives. While these data suggest that well-trained, experienced examiners are highly accurate when making positive identifications, it is also clear that errors still occur. Understanding what characteristics of print pair comparisons make errors more or less likely is thus critical to assess both the power and limits of this important forensic technique.
Fingerprint examiners can specialize and become latent or tenprint examiners or both. A latent examiner focuses on comparing “chance” fingerprints left accidentally at crime scenes or elsewhere, to possible source prints. A tenprint examiner, by contrast, compares fingerprints purposefully collected in controlled circumstances (such as at a police station) with those on file in a database. In police stations, impressions from all ten fingers are often collected on a single sheet, which is why they are called tenprints. Tenprints are also referred to as “known prints” because the identity of the source of the impression is known. In this paper, we use the term known print to refer to such prints. Latent prints have to be processed in order to be made visible, and often contain only a portion of a finger or other friction ridge area. They are often smudged, distorted, and may contain artifacts or noise due to the surface upon which they were left, or as a result of processing. By contrast, known prints are collected in controlled situations where poor impressions can be retaken, so they are typically larger, clearer, and richer in information content than latent images. Latent prints tend to be highly variable in quality, while known prints generally capture fingerprint information with high fidelity. Known prints are often acquired by law enforcement agencies using ink or a scanner. A sample latent and known print are shown in Figure 1.
Figure 1. Sample fingerprint images used in the study.
The image on the left is a latent print. Note large areas of the image that are smudged or missing. Contrast and ridge clarity vary greatly across the fingerprint area. These and other aspects of the image could make comparison difficult. The image on the right is a known print and is much clearer.
Until recently, there were virtually no scientific studies of how often fingerprint examiners made errors. However, recent studies have provided helpful information for this assessment. Ulery, Hicklin, Buscaglia, and Roberts had 169 latent print examiners compare an independent sampling of 100 fingerprint pairs (from a set of 744), each pair consisting of one latent print and one known print. Ulery et al. found that 7.5% of matching pairs were labeled non-matches (false negatives), while only 0.1% of non-matching pairs were labeled matches (false positives). Similar results were found by Tangen, Thompson, and McCarthy: 7.88% errors for matching pairs (false negatives) and 0.68% errors for non-matching pairs (false positives). These studies took place in experimental conditions quite different from actual casework. Error rates from these studies likely do not fully reflect real-world performance, but they do indicate high levels of performance by experts. Studies also indicate that experts perform far better than novices at fingerprint matching tasks.
From a research point of view, the low false positive rates among fingerprint examiners make the discovery of determinants of such errors quite difficult. High accuracy leads to little variability in performance, undermining standard statistical analyses. However, the low number of these errors should not be taken as an indication that studying them has little practical importance. A false match can lead to a false conviction, and a false exclusion can lead investigators to focus their attention on erroneous leads or to fail to convict the actual perpetrator. Furthermore, the realities and pressures in real criminal casework may substantially increase error rates, including false positives. In addition, even if these experimental error rates were established to be similar to those in actual practice, these low error rates are multiplied by a very large number of fingerprint comparisons, so the absolute quantity of real-world errors would not be de minimis.
Ironically, the practical importance of understanding when and why fingerprint comparison errors occur is likely to increase as technology advances. It is common for a latent print to be submitted to an AFIS (automated fingerprint identification system) database, where automated routines return a number of most likely potential matches. Error rates (especially of the false-positive type) may increase as databases get larger (currently some databases include tens of millions of prints). The reason for this is that as a database grows, an AFIS searching that database is increasingly likely to find close non-matches (prints that are highly similar to the latent but are in fact from a different individual, often termed “look-alikes”). Obviously, searching larger databases also increases the chances of finding a true match, but such progress can also make the task of the human examiner more demanding and, potentially, more error-prone.
From a visual information processing perspective, it is therefore interesting and important to determine what visual characteristics of fingerprints influence the ease and accuracy of comparisons. Ultimately, it may be possible to evaluate a fingerprint comparison in terms of the quantity and quality of visual information available in order to predict likely error rates in comparisons. Better understanding of objective metrics could also help determine when a print pair contains or lacks sufficient information to make an identification or exclusion, that is, to determine when an “inconclusive” assessment is warranted. These considerations motivate the present study. Its primary goals are (1) to measure expert examiner performance, and (2) to create a predictive framework by which one could assign an appropriate level of confidence in expert decisions, derived from an objective assessment of characteristics of the pair of images involved in a particular fingerprint comparison. These two goals are interconnected: examiner performance levels (error rates) are likely to depend on the complexity and difficulty of the comparison. Specifically, as comparisons become more difficult, errors are more likely to occur. A single overall ‘error rate’ for latent fingerprint comparison would be insufficiently granular, as it would fail to recognize that some comparisons are likely far easier than others, and thus far less prone to error. Hence, the characterization and prediction of error rates must be a function of the difficulty of the comparison. Notwithstanding this relationship, no previous research on fingerprint identification has attempted to generate objective models for the assessment of fingerprint difficulty.
Perceptual Aspects of Fingerprint Expertise
If asked to give reasons for a conclusion in a given comparison, fingerprint examiners would display significant explicit knowledge relating to certain image features, such as global configurations, ridge patterns and minutiae, as these are often explicitly tagged in comparison procedures, and they are pointed out in training of examiners. It would be a mistake, however, to infer that the processes of pattern comparison and the determinants of difficulty are therefore fully available for conscious report or explicit description. As in many other complex tasks in which learning has led to generative pattern recognition (the ability to find relevant structure in new instances) and accurate classification, much of the relevant processing is likely to be at least partly implicit –.
Like many other tasks in which humans, with practice and experience, attain high levels of expertise, feature extraction and pattern classification in fingerprint examination involves perceptual learning – experience-induced changes in the way perceivers pick up information , . With extended practice, observers undergo task-specific changes in the information selected – coming to discover new features and relationships that facilitate classification in that domain. Evidence supporting this claim comes from increased perceptual learning when these features are exaggerated during training .
There are also profound changes in fluency: What initially requires effort, sustained attention, and high cognitive load comes to be done faster, with substantial parallel processing and reduced cognitive load . In turn, becoming more automatic at extracting basic information frees up resources for observers to discover even more subtle or complex structural information, e.g., . This iterative cycle of discovery and automaticity followed by higher-level discovery is believed to play a significant role in attaining the impressive levels of performance humans can attain in areas such as chess, chemistry, mathematics, and air traffic control, to name just a few domains , .
While several studies have explored the influence of bias and emotional context on fingerprint matching and classification –, there has been relatively little work investigating perceptual aspects of expertise among examiners or perceptual learning processes that lead to expertise. One exception is a study by Busey and Vanderkolk , in which novice and expert fingerprint examiner performance was compared on a configural processing task. Subjects were shown a small image patch from a fingerprint and, after a mask, attempted to match the image patch to a luminance- and orientation-adjusted version presented along with a distractor fingerprint image patch. Subjects did not have an opportunity to closely examine the fingerprint patches as they only appeared for one second. This presentation procedure required subjects to rely heavily on the broad patterns of ridge flow (configural information) to perform the matching task. Experts exhibited nearly perfect performance, while novices had an accuracy of 0.8. Compared to the novices, experts may have utilized fingerprint information more efficiently, focused on entirely different information, and/or more effectively filtered out irrelevant information; the study did not provide information for distinguishing among these possibilities. However, the ease of information processing of task-relevant information is a hallmark of fluency effects in perceptual learning.
Similarly, Thompson et al.  found that the amount of visible area in a target print was positively correlated with classification accuracy among novices. Interestingly, this relationship also depended on the source of the print (e.g., index vs. pinky), although it was unclear why one finger should hold more information than another since presented areas were constant across prints. Marcon  had naïve observers rate “high quality” (known prints) and “low quality” latents for distinctiveness. Performance for categorizing pairs of prints as coming from the same source or a different source was higher for high-quality and high-distinctiveness images. Together, these studies show that performance suffers when fingerprint image quality is low, but reveal little about the specific nature of the information that correlates with low or high quality.
Fingerprint Features in the Standard Taxonomy
The first step in latent print examination is often manual preprocessing. For example, the region of the image that contains the fingerprint could be selected from the background and oriented upright. If a fingerprint is to be submitted to a database for automated comparison, key features need to be identified and labeled. Automated searches are then carried out by software that finds fingerprints on file with similar spatial relationships among the features labeled in the submitted fingerprint. This is the only part of the examination and comparison process that is automated. The software returns a list of potential matches, many of which can be quickly excluded. Some will be closer non-matches or a match, and these require further scrutiny by a human examiner.
Whether examiners are provided with potential matches via automated database searches or via investigative work, they often make their match decisions using the ACE-V approach: Analysis, Comparison, Evaluation, and Verification. The examiner first looks at the latent print closely (analysis), then compares the two prints relative to each other, looking for both similarities and differences (comparison). They then evaluate those similarities and differences to arrive at a decision about whether the prints match or not (evaluation). In the final step, a second examiner independently verifies the conclusion (verification). Mnookin points out that there is no formalized process for any of these steps. There is no method or metric for specifying which features should be used for comparison, nor any general measure for what counts as sufficient information to make a decision. Examiners rely on their experience and training rather than formal methods or quantified rubrics at each step of the process.
Despite the lack of a formalized procedure, some attempts have been made to formally describe and classify the kinds of features that might be found in a fingerprint. Three types of features are commonly used to describe the information used for fingerprint comparison. Level I features are global descriptors of ridge flow easily seen with the naked eye. These include patterns in the central region (the “core”) of the fingerprint. Cores can be classified into a limited number of typical patterns such as left- and rightward loops, whorls, tented arches, and arches. Deltas, another Level I feature, are triangular patterns of ridge flow that often occur on the sides of loops and whorls. A leftward loop and a delta are indicated by the green and yellow boxes, respectively, in Figure 2. Level I features are too common to be sufficient for identification, but they can be used for exclusion purposes as well as to guide inspection of the more detailed Level II and Level III features.
Figure 2. Various image features commonly identified by expert examiners.
Red circles indicate minutiae (ridge bifurcations or endings); blue circles indicate pores (they appear as small white dots along a ridge); the yellow square indicates the delta; the green rectangle indicates the core, in this case a leftward loop.
Level II features include minutiae such as ridge bifurcations and ridge endings. Level II features are found where fingerprint ridges and valleys split or end. Minutiae are highlighted in red circles in Figure 2. The power of fingerprints for identification purposes is largely due to the high variability in the existence and relative positions of these features across fingers and individuals. Scarring, which occurs naturally with age and wear, can also add unique ridge patterns to a fingerprint. However, while scars can be used to compare the fingerprint found at a crime scene to that of a suspect in custody, they may not always exist in fingerprints on file that can be old and therefore predate the markings.
Level III features are the smallest fingerprint features used by some examiners for comparison. These include the positions of sweat pores and ridge thickness. Pores are indicated in light blue circles in Figure 2. The visibility of Level III features depends on the quality of the prints and examiners do not uniformly make use of them for comparison purposes.
What properties of the images in fingerprint pairs are most important and informative in comparing fingerprints? What visual qualities of individual prints or of print pairs make accurate matching performance more or less likely? Although we relied on regression methods to provide answers to this question, it was important to develop as inputs to the regression analyses a wide variety of possible image characteristics that could be relevant. To generate such factors, we were guided by vision science, intuition, insights from fingerprint examiners, and prior work on image processing of fingerprints , as well as the standard taxonomy of levels of pattern information in fingerprints (described above). Some variables intuitively seem likely to relate to the sheer quantity of available information; for example, having greater print area available for comparison might make comparisons more accurate. However, this might well be oversimplified; quality of information might matter as much or more than total print size. We created several image quality metrics that appear sensitive to smudging, missing regions, poor contrast, etc. These metrics were computed in an automated fashion on the fingerprint images themselves, and were designed to relate in a variety of ways to the presence or absence of visual information that examiners use, and could therefore function as independent variables that are predictive of examiner performance.
We hypothesized that difficulty would be a function both of the characteristics of the individual prints (the latent and the potential match) and also of the characteristics of the pair. Because known prints are obtained under relatively standardized conditions, they are subject to significantly less variability than latent prints obtained from crime scenes. Accordingly, we expected that more of the variability in visual information quality affecting fingerprint comparisons would be determined by characteristics of latent prints. An especially poor quality latent might be more difficult to assess than a higher quality one, all else being equal. However, we also believed that pair difficulty would be a function of interaction effects between the latent and the known, not simply a function of the information quantity and quality of each independently. We therefore developed quantitative measures involving both individual prints and print pairs.
A general description and motivation for the image features we selected or developed is provided below. Except where noted, we assessed each predictor variable for both the latent print and the known print. For many variables, we also derived a variable that expressed an interaction or relationship of the values of a variable for the latent and known print combined (such as the ratio of latent print area to the known print area, or the Euclidean sum of contrast variability for the latent and known print combined). For details about the procedures used to derive the measures we used, please see the supplementary materials.
The area of a print was defined as the number of pixels in the fingerprint after the fingerprint was segmented from the background. Although machine vision algorithms exist that could have been used for determining the region of usable print image, the algorithms we examined were not as good as human segmentation, and different human observers in pilot work produced strong agreement. Accordingly, we segmented fingerprints from their surrounds by having human observers designate their boundaries (see supplementary materials for details). In general, we expected that larger areas, especially of latent prints, would provide more information for making comparisons.
To relate the relative area of a latent to a potentially matching known print, we divided the area of the latent fingerprint by the area of the known print. Typically the known print, obtained under controlled conditions, presents a more complete image. Thus, Area Ratio relates to the proportion of known print information potentially available in the latent print. However, for non-matching prints, the area of the latent may be larger than that of the known print because of differing finger sizes. Occasionally, even for a matching latent and known print, the latent could be larger than the known print due to smearing. The ratio was therefore not strictly in the range [0,1] and cannot be considered a true proportion.
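The area and Area Ratio definitions above can be illustrated with a minimal sketch. It assumes that each segmented print is available as a binary mask (rows of 0/1 values, 1 marking fingerprint pixels); the mask representation and the function names `print_area` and `area_ratio` are illustrative assumptions, not the procedure used in the study.

```python
def print_area(mask):
    """Count the foreground (fingerprint) pixels in a binary mask."""
    return sum(1 for row in mask for value in row if value)


def area_ratio(latent_mask, known_mask):
    """Latent area divided by known-print area.

    The result is not bounded by 1: a latent can cover more pixels
    than the known print it is paired with.
    """
    known_area = print_area(known_mask)
    if known_area == 0:
        raise ValueError("known-print mask has no foreground pixels")
    return print_area(latent_mask) / known_area


# Toy masks (hypothetical data): the latent covers 3 pixels, the known print 5.
latent_mask = [[0, 1, 1],
               [0, 1, 0]]
known_mask = [[1, 1, 1],
              [1, 1, 0]]
ratio = area_ratio(latent_mask, known_mask)
```

Swapping the two masks gives a ratio above 1, which is why the measure is not treated as a true proportion.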
We measured the mean and standard deviation of pixel intensity taking into account all of the pixels in each fingerprint image (with intensities scaled in the range of [0,255]). The mean intensity and standard deviation of intensity provide two related but different measures, sensitive to different image characteristics. Very dark images (low mean intensity) might indicate the presence of large smudges that produce large, dark areas. Low standard deviation in intensity would make ridges (transitions from light to dark) difficult to detect.
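As a rough illustration of these two measures, the sketch below computes the mean and standard deviation of pixel intensity for a grayscale image stored as rows of integers in [0, 255]. The use of the population standard deviation (`pstdev`) and the inclusion of every pixel in the array are assumptions; the text does not specify either detail.

```python
from statistics import mean, pstdev


def intensity_stats(image):
    """Return (mean, standard deviation) of all pixel intensities."""
    pixels = [p for row in image for p in row]
    return mean(pixels), pstdev(pixels)


# Hypothetical toy images: one with strong light/dark variation,
# one uniformly dark as a heavy smudge might be.
crisp = [[0, 255],
         [0, 255]]
smudged = [[10, 12],
           [11, 13]]

crisp_mean, crisp_sd = intensity_stats(crisp)
smudged_mean, smudged_sd = intensity_stats(smudged)
```

The smudged example has both a low mean and a low standard deviation, the combination the text associates with dark regions in which ridges are hard to detect.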
The image was divided into 50×50 pixel regions and the average pixel intensity was computed within each region. The mean of the block intensities is the same as the overall mean Image Intensity. The standard deviation of these regional averages (standard deviation of block intensity), however, can provide additional information about variability in image intensity across the image. Low variability is indicative of many similar areas across the image, but does not provide information about whether those regions have low or high contrast (e.g., an all-black image and an image with 50% white and 50% black pixels evenly distributed across the image would both have low Block Intensity variability). When pixel intensities are not uniformly distributed across the image, variability of block intensity is high (i.e., some regions of the image are darker than others). For latent images, this may indicate the presence of a smudge or worse contact (lighter impression) in some regions of the image.
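The block-based measure can be sketched as follows. The block size is a parameter so that small toy images can be used; the handling of partial blocks at the image edges (kept here as smaller tiles) and the use of the population standard deviation are assumptions. Note that the mean of the block means equals the overall image mean only when all blocks contain the same number of pixels.

```python
from statistics import mean, pstdev


def block_means(image, block=50):
    """Mean intensity of each block x block tile (edge tiles may be smaller)."""
    height, width = len(image), len(image[0])
    means = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [p
                    for row in image[top:top + block]
                    for p in row[left:left + block]]
            means.append(mean(tile))
    return means


def block_intensity_sd(image, block=50):
    """Standard deviation of the per-block mean intensities."""
    return pstdev(block_means(image, block))


# Hypothetical 4x4 images, analysed with 2x2 blocks.
all_black = [[0] * 4 for _ in range(4)]
checker = [[255 * ((r + c) % 2) for c in range(4)] for r in range(4)]
half_dark = [[0, 0, 255, 255] for _ in range(4)]

sd_black = block_intensity_sd(all_black, block=2)
sd_checker = block_intensity_sd(checker, block=2)
sd_half = block_intensity_sd(half_dark, block=2)
```

The all-black and evenly mixed images both give zero variability, as the text describes, while the image with one dark half and one light half gives a large value.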
Deviation from Expected Average Intensity (DEAI).
Intensity, as coded above, may be a useful predictor variable, but both intuition and pilot work led us to believe that it might not capture some significant aspects of intensity variations. We therefore developed a separate intensity measure – deviation from expected average intensity. In an ideal fingerprint image, one might expect approximately half of the pixels to be white (valleys) and half to be black (ridges). The expected mean intensity would therefore be half of the range, or 127.5 (with the brightest pixel normalized to 255 and the darkest to 0). The absolute deviation of the observed average from the expected average was computed using the following formula:

$$\mathrm{DEAI} = -\left|\bar{I} - 127.5\right|$$

where $\bar{I}$ is the observed mean pixel intensity of the image.
Using absolute value here ensures that deviations from the midpoint of the intensity range in either direction are scored as equivalent; the negative sign ensures that the measure increases as the mean pixel intensity approaches 127.5 (large deviations produce a large negative value of the measure). While ridges (black regions), on average, are thicker than valleys (white regions), making the average intensity slightly lower than 127.5, the difference was relatively small and was ignored.
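A minimal sketch of this measure, following the verbal description (the negative absolute deviation of the mean intensity from 127.5), is given below. The function name `deai` and the list-of-rows image representation are illustrative assumptions.

```python
def deai(image, expected_mean=127.5):
    """Negative absolute deviation of mean intensity from the expected mean.

    The value is 0 when the mean intensity equals the midpoint and becomes
    more negative as the image gets darker or lighter overall.
    """
    pixels = [p for row in image for p in row]
    observed_mean = sum(pixels) / len(pixels)
    return -abs(observed_mean - expected_mean)


# Hypothetical toy images.
balanced = [[0, 255], [255, 0]]   # half black, half white
too_dark = [[0, 0], [0, 0]]       # all black
too_light = [[255, 255]]          # all white

deai_balanced = deai(balanced)
deai_dark = deai(too_dark)
deai_light = deai(too_light)
```

The all-black and all-white images receive the same score, which reflects the symmetric treatment of deviations in either direction.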
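Michelson contrast was computed for each segmented fingerprint. Michelson contrast is defined as:

$$C_{M} = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}}$$

where $I_{\max}$ and $I_{\min}$ are the maximum and minimum pixel intensities in the segmented image.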
This contrast measure produces a value between 0 (least contrast) and 1 (most) by dividing the difference of maximum and minimum intensity values by their sum. Michelson contrast is typically calculated from luminance values. In our images, we calculate Michelson contrast from pixel intensity values, which is appropriate given that fingerprint images may be displayed on a variety of monitors with different Gamma corrections.
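The computation described above (difference of the maximum and minimum intensities divided by their sum) can be sketched as follows. Returning 0 for an all-black image, where the denominator would be zero, is an assumption made here for illustration; the text does not say how that case was handled.

```python
def michelson_contrast(image):
    """Michelson contrast from pixel intensities: (max - min) / (max + min)."""
    pixels = [p for row in image for p in row]
    lowest, highest = min(pixels), max(pixels)
    if highest + lowest == 0:
        # All pixels are 0; treat contrast as zero (assumption).
        return 0.0
    return (highest - lowest) / (highest + lowest)


# Hypothetical toy images.
full_range = michelson_contrast([[0, 255]])
uniform = michelson_contrast([[100, 100]])
partial = michelson_contrast([[64, 192]])
```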
The preceding measure obtained the Michelson contrast for an entire image. We also computed contrast for smaller image regions – block contrast – by segmenting the entire image into 50×50 pixel regions. Block Contrast is defined as the mean Michelson contrast across the blocks. To illustrate the difference between overall contrast and block contrast, the Michelson contrast of an entire image containing all gray pixels except for one white and one black pixel would be 1. Block Contrast, however, would be very low, since most regions of the image would have 0 contrast. If black and white pixels were distributed more evenly across the image such that they appeared in each block, then Block Contrast would be high. High values of the measure may indicate the presence of clear ridges and valleys in many areas of the fingerprint. A separate but related predictor was the standard deviation of block contrast across blocks. Small standard deviation values could indicate high information content throughout the image (Block Contrast close to 1 everywhere) or that the image was uniformly smudged (Block Contrast close to 0 everywhere).
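The gray-image example in this paragraph can be reproduced with a short sketch. The block size is a parameter so that a small toy image can stand in for a real scan; keeping partial edge tiles, using the population standard deviation, and returning 0 contrast for an all-black tile are assumptions.

```python
from statistics import mean, pstdev


def michelson(pixels):
    """Michelson contrast of a flat list of intensities (0.0 if all zero)."""
    lowest, highest = min(pixels), max(pixels)
    if highest + lowest == 0:
        return 0.0
    return (highest - lowest) / (highest + lowest)


def block_contrasts(image, block=50):
    """Michelson contrast of each block x block tile of the image."""
    height, width = len(image), len(image[0])
    values = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [p
                    for row in image[top:top + block]
                    for p in row[left:left + block]]
            values.append(michelson(tile))
    return values


def block_contrast_stats(image, block=50):
    """Return (mean, standard deviation) of the per-block contrasts."""
    values = block_contrasts(image, block)
    return mean(values), pstdev(values)


# Hypothetical 8x8 gray image with one white and one black pixel.
gray = [[128] * 8 for _ in range(8)]
gray[0][0] = 255
gray[7][7] = 0
overall = michelson([p for row in gray for p in row])
gray_mean, gray_sd = block_contrast_stats(gray, block=2)

# Black and white pixels spread evenly, so every block contains both.
checker = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]
checker_mean, checker_sd = block_contrast_stats(checker, block=2)
```

In the gray image the overall contrast is 1 while the mean block contrast stays low, because 14 of the 16 blocks have zero contrast. In the evenly mixed image every block has contrast 1, so the mean is high and the standard deviation is zero.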
Orientation-sensitive filters were used to detect edges in the fingerprint image. The relative responses of these filters were then used to identify “high reliability” regions where ridge orientation was uniquely specified (see supplementary materials for details). The proportion of high reliability regions was computed, resulting in an overall reliability score for each print. Ridge Reliability ranged between 0 and 1, with larger values indicating a greater proportion of print area with well-defined ridge orientation. An additional, relational predictor was computed by taking the Euclidean sum of the Ridge Reliability for the latent and known print (Ridge Reliability Sum). Large values of this measure indicate a high proportion of regions with well-defined ridge orientation in both the latent and known prints.
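The relational predictor can be sketched under one reading of "Euclidean sum", namely the square root of the sum of squares of the two Ridge Reliability scores. This interpretation is an assumption; the exact definition is given in the supplementary materials, which are not reproduced here. The filter-based reliability scores themselves are taken as given inputs.

```python
import math


def ridge_reliability_sum(latent_reliability, known_reliability):
    """Root-sum-of-squares of two reliability scores (assumed definition)."""
    for value in (latent_reliability, known_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("Ridge Reliability is expected to lie in [0, 1]")
    return math.hypot(latent_reliability, known_reliability)


# Hypothetical scores for a latent and a known print.
combined = ridge_reliability_sum(0.6, 0.8)
```

Under this reading the combined value ranges from 0 to the square root of 2 and is large only when both prints have a high proportion of well-defined ridge orientation.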
Visibility of Cores and Deltas.
Earlier we described global configurations – Cores and Deltas – that provide Level I information to fingerprint examiners. The fact that ridge flow in fingerprints tends to follow a circular pattern dictates that there will be some global core (a whorl, loop, or arch) at or near the center of each print. Likewise the transition from a core, especially loops and whorls, to the circular ridge flow tends to give rise to deltas (see Figure 2). As there will be only one core and at most a small number of deltas in any print, these serve as important reference points in making comparisons . Unlike all of the other variables we used, which could take on a continuous range of values, Cores and Deltas are binary (either present or not).
Relations Among Basic Predictors
To remove effects on regression coefficients of differing scales of various predictors, we standardized all continuous metrics by subtracting the mean and dividing by the standard deviation. Standardization made some measures that were strictly non-negative (like Standard Deviation of Intensity) take on negative values. As is often recommended in using regression methods , we also examined the variables for collinearity and found that several predictors were highly correlated. For example, the mean and standard deviation Intensity measures were correlated (Pearson's r = −0.77 for latents and −0.44 for known prints). High correlation among predictors is an undesirable feature for regression models  because it makes it harder to assess the individual effect of those predictors. If two predictors had a correlation of greater than 0.5, we removed one of them. After removal, the variance inflation factor, a measure of collinearity, for all continuous metrics was less than 5, indicating that collinearity was sufficiently reduced –.
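The standardization and correlation-based pruning steps can be sketched as follows. Several details are assumptions made for illustration: the sample standard deviation is used, the 0.5 threshold is applied to the absolute value of Pearson's r, and when two predictors are too strongly correlated the one listed first is kept. The predictor names and values are hypothetical, and the variance inflation factor check is not shown.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def pearson_r(x, y):
    """Pearson correlation computed from standardized scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def prune_correlated(predictors, threshold=0.5):
    """Keep a predictor only if |r| with every kept predictor is <= threshold."""
    kept = {}
    for name, values in predictors.items():
        if all(abs(pearson_r(values, other)) <= threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical predictor values for five prints.
predictors = {
    "mean_intensity": [100, 120, 140, 160, 180],
    "sd_intensity": [60, 52, 45, 30, 20],
    "area": [5, 1, 4, 2, 3],
}
kept = prune_correlated(predictors)
z_area = standardize(predictors["area"])
```

In this toy data the two intensity measures are strongly negatively correlated, so only the first is retained, while area is only weakly correlated with it and is kept.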
In addition, we included two-way interactions between all predictors that applied to both a latent and known print. For example, in addition to the Standard Deviation of Block Contrast for the latent and known print, we included the interaction between the two terms. In addition to Area Ratio and Ridge Reliability Sum, these interactions are relational predictors that encode something about the relative quality of information in a latent and known print.
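A two-way interaction between a latent and a known-print predictor is conventionally entered into a regression as the product of the two variables. The sketch below assumes that convention and assumes the product is formed from the standardized values; neither detail is stated explicitly in the text, and the data are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def interaction_term(latent_values, known_values):
    """Element-wise product of standardized latent and known predictors."""
    z_latent = standardize(latent_values)
    z_known = standardize(known_values)
    return [a * b for a, b in zip(z_latent, z_known)]


# Hypothetical values of one metric for three print pairs.
latent_metric = [1, 2, 3]
known_metric = [3, 2, 1]
interaction = interaction_term(latent_metric, known_metric)
```

The product is positive when the latent and the known print deviate from their respective means in the same direction and negative when they deviate in opposite directions, which is how the term encodes relative quality within a pair.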
Overview of the Experiment
We developed a database of fingerprint images, both of the known prints and their corresponding latents, and computed a variety of metrics that we hypothesized would relate to image quality and information content. Our primary focus was accuracy, but we also measured response times and asked experts to provide subjective judgments of difficulty and confidence for each print pair. We tested expert fingerprint examiners in a task requiring a forced-choice judgment of whether two prints matched. As will be described below, the task approximated what examiners do in their real-life work in some ways but differed in others. For example, in the study reported here, images appeared on a computer monitor; examiners were limited in comparison time for each pair; and judgments were constrained to indicating that a pair of prints came from the same source or different sources, i.e., “inconclusive” was not a permitted response for difficult comparisons. These features of our design were chosen so that we could collect important data, including best-guess match determinations for difficult comparisons, and to permit us to obtain enough data to allow us to explore the set of image characteristics that might predict difficulty. We fit a regression model to measure how various image characteristics predict performance. To foreshadow some of the results, we found that a subset of image features such as measures of the reliability of ridge orientation information, the ratio of the visible area of the latent to the known print, and measures of contrast and intensity information were predictive of performance. The model accurately identified print pairs that had low accuracies, suggesting that it can be used as a valid tool for identifying potentially difficult comparisons and that in general, it may be feasible to use these methods to predict error rates for print pairs, as a function of comparison difficulty, with reasonable accuracy.
This study was performed in accordance with the guidelines of the Declaration of Helsinki. All experts provided written, informed consent after the general purpose of the study was explained and were fully aware of the purpose and procedure of the study. Participation was voluntary. The study was approved by the institutional review board of the University of California, Los Angeles.
Fifty-six fingerprint examiners (18 male, 35 female, three not reported) participated in the study. Forty participants self-reported as latent print examiners, three as known print examiners, ten as both, and three did not report. Reported experience ranged from 1 to 25 years (Latent: Mean = 9.54, SD = 6.97; Ten-Print: Mean = 10.45, SD = 8.07). Twenty-seven participants reported being IAI certified, and thirty-two reported that their labs were accredited.
Participants were either directly recruited at the 2011 IAI (International Association for Identification) Educational Conference or via a flyer sent out in advance of the conference. As incentive, all participants were entered into a raffle to win an iPad 2. All participants signed informed consent forms prior to participating. As indicated above, some limited demographic information was collected, but it was stored separately from individual participant IDs such that the two could not be linked.
All stimuli were displayed on laptop computers with 17-inch monitors at a resolution of 1024×768 pixels. Stimuli were presented using a program accessed online; data were stored on the website's server.
Fingerprints were collected from 103 individuals. Each individual first used a single finger to produce a clear, known print using ink, as is often done in police stations. Then, using the same finger, they touched a number of surfaces in a variety of ways (with varying pressure, smudges, etc.) to create a range of latent fingerprint marks that reflect those found at a crime scene. Professional fingerprint examiners who participated in the study reported that these prints were similar to those that they encounter in their everyday casework. The latent fingerprints were lifted using powder and were scanned at 500 dpi using the FISH system. Image dimensions ranged from 826 pixels in height to 1845 pixels and from 745 pixels in width to 1825 pixels. The latent prints varied in clarity, contrast, and size. For each individual who contributed to the database, we collected a total of six prints: one known print and five matching latent prints. Across individuals we varied the fingers used. Each scanned fingerprint was oriented vertically and approximately centered. Some individuals contributed multiple sets of prints from different fingers.
To create the non-matching pair of prints, we did not want to randomly select a known and a latent, as such pairs would often be too obviously different. This would make the “non-match” decisions nearly uniformly easy, and would also, by default, indicate which were the “matching” pairs. Therefore, we obtained similar, but non-matching, known prints by submitting each latent print to an AFIS search process. An expert selected from the AFIS candidate list what he deemed to be the most similar non-matching print. That enabled us to produce non-matching pairs with a relatively high degree of similarity. The final database consisted of 1,133 fingerprint images – five latent prints from 103 fingers (515), 103 known prints that matched (103), and another 515 known prints, to provide a potential non-match for each of the latents. Since we used an AFIS database from a different country from where we collected the known prints, it was highly unlikely that an actual match would be presented by the AFIS database search as a candidate. Furthermore, the expert who selected the most similar print from the AFIS candidate list verified for each comparison that this was a similar print, but not an actual match.
Of the 1,133 fingerprint images, 200 latent and known print pairs were selected and used for the study; half were a match and half were a close non-match. Individual print metrics were computed for each image or image pair (see below) and prints were selected to (approximately) uniformly sample each feature space. Known prints were sampled without replacement, but multiple latent prints from the same finger were occasionally selected since each latent could be paired with a different known print image (the match or a close non-match). Print pairs were then grouped into batches of 20, each containing ten matches and ten non-matches. Latent prints from the same finger did not appear within the same batch.
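The batching constraints described above (ten matches and ten non-matches per batch, with no finger repeated within a batch) can be illustrated with a small greedy sketch. The record keys `pair_id`, `finger_id`, and `is_match`, the function name `make_batches`, and the greedy placement strategy are all assumptions made for illustration; the study does not describe the procedure that was actually used.

```python
from collections import Counter


def make_batches(pairs, batch_size=20):
    """Greedily group print pairs into balanced batches.

    Each batch receives batch_size // 2 matches and batch_size // 2
    non-matches, and no finger appears twice in the same batch.
    `pairs` is a list of dicts with hypothetical keys 'pair_id',
    'finger_id', and 'is_match'.
    """
    n_batches = len(pairs) // batch_size
    per_type = batch_size // 2
    finger_counts = Counter(p["finger_id"] for p in pairs)
    # Place pairs from frequently reused fingers first; they are hardest to fit.
    ordered = sorted(
        pairs,
        key=lambda p: (-finger_counts[p["finger_id"]], p["finger_id"], p["is_match"]),
    )
    batches = [
        {"pairs": [], "fingers": set(), "match": 0, "nonmatch": 0}
        for _ in range(n_batches)
    ]
    for p in ordered:
        kind = "match" if p["is_match"] else "nonmatch"
        for b in batches:
            if b[kind] < per_type and p["finger_id"] not in b["fingers"]:
                b["pairs"].append(p)
                b["fingers"].add(p["finger_id"])
                b[kind] += 1
                break
        else:
            # A greedy pass can get stuck; a real procedure might backtrack.
            raise ValueError(f"could not place pair {p['pair_id']}")
    return [b["pairs"] for b in batches]


# Synthetic data: 100 fingers, each contributing one match and one non-match pair.
demo_pairs = [
    {"pair_id": 2 * f + int(m), "finger_id": f, "is_match": m}
    for f in range(100)
    for m in (False, True)
]
demo_batches = make_batches(demo_pairs)
```

Because the pass is greedy, it raises an error rather than silently violating a constraint when no valid slot remains.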
A group of experts made match/non-match judgments and provided confidence and difficulty ratings on a subset of 200 print pairs selected from a database of over a thousand fingerprint images. Two fingerprint images that were either from the same finger (match) or from two different fingers (non-match) were presented side-by-side. Images were presented on computer screens and were oriented upright. Examiners had a maximum of three minutes to evaluate each pair of images. Performance was recorded for each print-pair tested, and a model was fit predicting performance based on the set of image features computed for each image in the database.
Participants were tested in a large room, seated at desks with individual laptop computers. Before data collection began, each participant was asked to sign a consent form, and then given written instructions detailing how the stimuli would be presented and the judgments they would be required to make. Participants were told that they would be asked to compare latent-known print pairs and determine whether they were matches or non-matches (without the option to choose “inconclusive” as a response). Participants were also told that they would be asked for confidence and difficulty ratings for each of their judgments. The instructions emphasized that this procedure was not intended to replicate real-world conditions and that participants should simply try to maximize accuracy. Participants were also instructed to refrain from using any fingerprint examiner tools not provided by the experimenter, such as a compass.
When the experimental program was initiated, participants were asked to report their age, gender, years of experience, specialization, IAI certification, lab accreditation, and lab affiliation. Reporting this information was optional.
Next, the experiment began. On each trial, two fingerprints were presented side-by-side. The latent print was always on the left. A button in the top-left corner of each image window allowed participants to zoom in on each image individually. Fingerprint image size was constrained within the bounds of each window, so that each print was always viewed through an aperture of 460 pixels by 530 pixels. The initial presentation of the images had them scaled to fit entirely in this window. A single level of zoom allowed participants to magnify the image. Participants could also translate each image independently within its window (whether the image was zoomed or unzoomed) either by dragging it with the mouse or by using arrow buttons in the top-left corner of each image window. No other image manipulation features were available.
Participants made a match/non-match judgment by clicking a button at the bottom of the screen. Specifically, participants were asked: “Do these prints come from the same source or a different source?” Participants then made difficulty and confidence ratings by clicking on a Likert scale. The participants were asked: “How difficult is the comparison?” and “How confident are you in your decision?” On the Likert scales, “1” corresponded to least difficult/least confident and “6” corresponded to most difficult/most confident. Once all responses were recorded, an additional button appeared allowing the participant to advance to the next trial. Supplementary Figure S1 shows a sample screenshot of the experiment.
Participants had three minutes to complete each trial. A message was given after two and a half minutes warning that the trial would end in 30 seconds. If the full three minutes elapsed without a decision, that trial was ended, and the participant moved on to the next trial. After presentation of a set of 20 print pairs, participants were given a short break and asked if they wanted to complete another set of 20 comparisons.
Each set of 20 print pairs contained ten match and ten non-match comparisons, though examiners were not provided with this information. The order in which print pairs were presented within a set was randomized across subjects. The sets were presented in a pseudo-random order so that approximately ten participants completed each set. Although the number of trials completed by individual participants varied based on their availability and willingness to do more comparisons, most participants completed two sets of prints (40 print pairs).
If the participant made a match/non-match judgment, but time expired before they could make difficulty or confidence ratings, the data were retained. There were thirteen such trials. If only difficulty and confidence ratings were provided, but a comparison judgment was not made before time expired, the trial was excluded from the analyses. Twenty such trials were excluded from the total of 2,312 comparisons (fewer than 1%). For one subject, time expired on eight of the trials they completed. There was no consistency in which print pairs had time expire – for two of those pairs, time expired for two subjects, for the rest, time expired for only one subject.
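The retention rule above reduces to a single condition: a trial is kept whenever a match/non-match judgment was recorded, regardless of whether the ratings were completed. A minimal sketch follows, assuming a hypothetical trial record with `judgment`, `difficulty`, and `confidence` fields (these names do not come from the study).

```python
def usable_trials(trials):
    """Keep every trial that has a recorded match/non-match judgment.

    Each trial is a dict with hypothetical keys 'judgment' (True, False,
    or None if time expired first), 'difficulty', and 'confidence'.
    Missing ratings do not disqualify a trial.
    """
    return [t for t in trials if t["judgment"] is not None]


demo_trials = [
    {"judgment": True, "difficulty": 2, "confidence": 5},         # complete
    {"judgment": False, "difficulty": None, "confidence": None},  # ratings timed out: retained
    {"judgment": None, "difficulty": 4, "confidence": 3},         # no judgment: excluded
]
kept = usable_trials(demo_trials)
```

Applied to the counts reported above, removing the 20 excluded trials from 2,312 leaves 2,292 comparisons for analysis.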
Responses were aggregated across participants and prints. Overall accuracy (percent of correctly classified latent-known print pairs, averaged across subjects) was 91% (range: 8.3–100%, SD 17%). Overall accuracy was 86% for “match” trials (14% false negatives) and 97% for “non-match” trials (3% false positives). Of the 2,292 comparisons, there were 200 errors, resulting in an overall error rate of 9.6%. There was some variability in performance among experts (range: 79–100%, SD 5%).
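The relationship between accuracy, false negatives, and false positives can be made concrete with a short sketch. It pools trials rather than averaging across subjects, so it is a simplification of the aggregation reported above; the field names `is_match` and `said_match` and the toy data are assumptions for illustration only.

```python
def error_rates(trials):
    """Compute pooled accuracy, false negative rate, and false positive rate.

    Each trial is a dict with hypothetical keys 'is_match' (ground truth)
    and 'said_match' (the examiner's forced-choice response).
    """
    match = [t for t in trials if t["is_match"]]
    nonmatch = [t for t in trials if not t["is_match"]]
    correct = sum(t["said_match"] == t["is_match"] for t in trials)
    return {
        "accuracy": correct / len(trials),
        # False negative: a true match judged to be a non-match.
        "false_negative_rate": sum(not t["said_match"] for t in match) / len(match),
        # False positive: a true non-match judged to be a match.
        "false_positive_rate": sum(t["said_match"] for t in nonmatch) / len(nonmatch),
    }


# Toy data: four match trials (one missed) and four non-match trials (all correct).
demo = (
    [{"is_match": True, "said_match": True}] * 3
    + [{"is_match": True, "said_match": False}]
    + [{"is_match": False, "said_match": False}] * 4
)
rates = error_rates(demo)
```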
Across all participants, 118 of the 200 print pairs produced 100% accuracy. Mean difficulty and confidence ratings for these pairs were 2.62 and 5.23 respectively, compared to ratings of 4.06 and 4.15 for prints that were misclassified by at least one participant. Of the 118 pairs that produced no errors, 72 were non-matches and 46 were matches. The lowest accuracy, 8.3% (1/12), corresponded to false negatives for a “match” print-pair. Average accuracy for each print pair is shown in Figure 3 sorted by increasing accuracy.
Correlations Among Dependent Measures
We measured the correlations among the three dependent measures. There was a strong negative correlation between average difficulty and confidence ratings (r(198) = −0.91, p<0.001) and weaker correlations between average accuracy and confidence (r(198) = 0.52, p<0.001), and between average accuracy and difficulty (r(198) = −0.50, p<0.001). These results suggest that experts' confidence in their judgments is well matched to their perceived difficulty of the judgments, and further that both experts' perceived confidence and difficulty are predictive of performance. There was also a strong positive correlation between response time (RT) and difficulty (r(198) = 0.71, p<0.001) and a negative correlation between response time and confidence (r(198) = −0.59, p<0.001). Accuracy was highest and RT lowest for prints that were rated least difficult. Accuracy decreased and RT increased as print difficulty ratings increased. Altogether, these results suggest that experts faced with difficult print comparisons tend to have lower confidence in their judgments, and take more time to ultimately make a match/non-match decision. Excluding the 118 prints with 100% accuracy, the correlations between accuracy and confidence and between accuracy and difficulty were qualitatively weaker, but the difference did not reach significance. The full set of correlations is shown in Table 1.
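The notation r(198) reflects the degrees of freedom for a Pearson correlation over 200 print pairs (n − 2). The sketch below shows the standard computation and the associated t statistic; the function names and the six-item toy ratings are assumptions for illustration and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation and its degrees of freedom (n - 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), n - 2


def t_from_r(r, df):
    """Convert a correlation into the t statistic used to test r = 0."""
    return r * math.sqrt(df / (1.0 - r * r))


# Toy per-pair mean ratings: harder pairs receive lower confidence.
difficulty = [1, 2, 3, 4, 5, 6]
confidence = [6, 5, 5, 3, 2, 1]
r_demo, df_demo = pearson_r(difficulty, confidence)
```

With 198 degrees of freedom, even the moderate correlations reported above correspond to large t values, which is why all of them reach p<0.001.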
We fit a crossed, logistic regression model in which print pair performance (1 = accurate; 0 = inaccurate) was crossed with expert and print identity. This is a type of mixed-effects model and is appropriate for analyzing these data for several reasons. First, not every subject evaluated every print pair. A mixed-effects approach enables the examination of both the predictor variables and the random effects due to inter-subject differences (i.e., differences between expert performance and differences between evaluations of the same print pair by multiple experts). Second, a mixed-effects approach allows one to model individual item differences by fitting data from individual trials instead of aggregating across all presentations of an item. Differing levels of expertise and experience, as well as differences in comparison strategy and decision thresholds, could give rise to variability in participant performance independent of the fingerprint features. Variability across items could occur if some comparisons were easier than others irrespective of differences in measured image features. Including these sources of variability in the model allows us to test whether print comparisons and experts differed from one another, instead of assuming they were all equivalent, and simply averaging across participants and items. Data were fit using the “arm” and “lme4” packages for R version 2.15.2.
For each of i print-pair comparisons (items) and j experts (subjects), we model the probability of an accurate response, $y_{i,j} = 1$, as

$$\Pr(y_{i,j} = 1) = \operatorname{logit}^{-1}\left(X_{i,j}\beta + \mathrm{printID}_i + \mathrm{expertID}_j\right) \tag{1}$$

where $X_{i,j}$ is a vector describing the features measured on a print pair, $\beta$ is a vector of coefficients (the fixed effects; one coefficient for each feature), $\mathrm{expertID}_j$ is the expert-specific random effect, which allows the intercepts to vary across experts, and $\mathrm{printID}_i$ is the item-specific random effect. expertID and printID were normally distributed.
The regression equation can be rewritten and expanded as:

$$\Pr(y_{i,j} = 1) = \operatorname{logit}^{-1}\left(\beta_0 + \mathrm{printID}_i + \mathrm{expertID}_j + \beta_1 x_{1,i,j} + \beta_2 x_{2,i,j} + \dots + \beta_n x_{n,i,j}\right) \tag{2}$$

where n is the number of predictors. In this form, it can be seen that printID and expertID can be grouped with $\beta_0$ as intercept terms. Because printID and expertID are vectors, the equation reflects that each combination of print and expert has its own intercept term. It is this combined term ($\beta_0 + \mathrm{printID}_i + \mathrm{expertID}_j$) that varies across experts and items. Multi-level modeling allows one to capture possible differences between individual subjects or test items without fitting a separate regression equation for each item (by applying a distribution over the terms that vary, in this case printID and expertID).
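To illustrate how the combined intercept term enters a prediction, the following sketch evaluates a logistic model with print-specific and expert-specific intercept offsets. All coefficient and feature values are made up for illustration, and `predict_accuracy` is a hypothetical helper, not part of the “arm” or “lme4” packages.

```python
import math


def inv_logit(z):
    """Map a linear predictor onto a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_accuracy(beta0, betas, features, print_effect=0.0, expert_effect=0.0):
    """Predicted probability of a correct response for one print pair and expert.

    The intercept is beta0 + print_effect + expert_effect; the remaining
    terms are the fixed-effect coefficients times the measured features.
    """
    intercept = beta0 + print_effect + expert_effect
    linear = intercept + sum(b * x for b, x in zip(betas, features))
    return inv_logit(linear)


# Hypothetical values: two features, one difficult print, one strong expert.
p_average = predict_accuracy(2.0, [0.8, -0.5], [0.3, 1.2])
p_hard_print = predict_accuracy(2.0, [0.8, -0.5], [0.3, 1.2], print_effect=-1.5)
p_strong_expert = predict_accuracy(2.0, [0.8, -0.5], [0.3, 1.2], expert_effect=0.4)
```

Setting both random effects to zero gives the prediction for a typical expert on a typical print pair with those feature values.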
Individual differences among experts may arise due to differences in experience, training, and other factors. These could manifest themselves as different baselines of performance, or intercept terms in the model. All else being equal, one expert might do better with the exact same print pair than another expert. This variability is captured by the expertID term in the model. It is also possible to model item-specific (in this case, print-pair-specific) effects; these are represented by printID. PrintID captures differences in print comparison difficulty inherent to individual print pairs and not related to the features used to predict print pair accuracy. In constructing a model, it is assumed that the error terms are uncorrelated; however, it is possible that print pair errors are correlated across participants. Inclusion of the item-specific term captures this potential non-independence (Baayen et al., 2008). A likelihood ratio test showed that the model with the predictors fit the data better than a null model with only the random effects terms (χ²(17) = 53.27, p<0.001).
Comparing a model that included the random expert effect (expertID) to one that did not, we found that the Akaike Information Criterion (AIC) was slightly smaller for the model that included the effect, but the Bayesian Information Criterion (BIC) was smaller for the model that did not. Both of these measures are information-theoretic metrics of goodness-of-fit that penalize overfitting the data with excess parameters. Qualitatively, a more parsimonious model that fit the data almost as well would have a smaller AIC and BIC. The fact that the criteria move in opposite directions when the model includes expertID suggests that any differences between the models should be treated with caution. A likelihood ratio test comparing the two models was significant (χ²(1) = 4.79, p<0.05). ExpertID terms ranged from −0.52±0.69 to 0.44±0.77. All values of expertID were within two standard errors of zero. In terms of Equation 2, this means that β0 + expertID was not reliably different from β0. Based on these analyses, we felt justified in averaging across experts and ignoring between-expert differences in all subsequent modeling steps by removing the expertID term. This same analysis could not justify excluding the print-pair specific term, printID, and so it was retained in the model.
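The opposite movement of AIC and BIC follows from their penalty terms. With AIC = 2k − 2 ln L and BIC = k ln(N) − 2 ln L, adding parameters changes each criterion by its penalty minus the likelihood ratio statistic. The sketch below works this through for the reported statistic of 4.79 with one extra parameter. It assumes that N is the 2,292 retained observations and that both criteria were computed from the same likelihoods as the test; neither detail is stated in the text.

```python
import math


def delta_aic(lr_stat, extra_params):
    """AIC(larger model) - AIC(smaller model); negative favors the larger model."""
    return 2 * extra_params - lr_stat


def delta_bic(lr_stat, extra_params, n_obs):
    """BIC(larger model) - BIC(smaller model); negative favors the larger model."""
    return extra_params * math.log(n_obs) - lr_stat


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


lr = 4.79      # reported likelihood ratio statistic for adding expertID
n_obs = 2292   # assumed: number of retained comparisons
d_aic = delta_aic(lr, 1)
d_bic = delta_bic(lr, 1, n_obs)
p_value = chi2_sf_1df(lr)
```

Under these assumptions the AIC difference is negative while the BIC difference is positive, because ln(2292) exceeds 4.79 while 2 does not. This matches the pattern described above, in which a significant test still leaves the two criteria in disagreement.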
We simplified the model further by removing predictors (fixed effects) based on minimization of the AIC. A likelihood ratio test revealed no statistically significant difference between a model that included all of the predictors and the reduced model (χ²(11) = 9.55, p>0.05), indicating that the removal of predictors increased parsimony without significantly impacting predictive ability.
This analysis included all print pairs used in the study. This was done because the goal of the study was to create a model of difficulty for novel comparisons for which ground truth regarding whether or not a print pair shares a common source is unavailable. A separate analysis using only matching pairs showed highly similar results, including all of the predictors that proved to be reliable in the main analysis.
The Accuracy Model
The model obtained for accuracy was:
Where L and K indicate whether the predictor applies to a latent or known print image respectively, and LxK indicates predictors that apply to print pairs. printID is the item-specific, random effect. The parameters of the fitted model are shown in Table 2. All predictors were significant (Wald's z, ps<0.05), except for Delta (L) and DEAI (LxK) which were marginally significant (p = 0.054 and p = 0.053 respectively). It should be noted that there is some disagreement on how to calculate p-values for the Wald statistic in unbalanced, mixed-effects data due to difficulty in determining the appropriate degrees of freedom and they should therefore be interpreted with caution.
To get a more intuitive notion of model performance, we used the predicted proportions from the logistic regression as estimates of average performance across experts. The resulting fit was very good (adjusted R² = 0.91). We also computed the root mean squared error (RMSE) by taking the square root of the mean of the squared differences between predicted and observed values. Values closer to 0 indicated better performance. The error for the fitted model (RMSE = 0.06) was lower than for a null model that only included the printID random effect (RMSE = 0.18).
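Both fit measures can be computed directly from per-pair predicted and observed accuracies. The sketch below uses the conventional definitions: RMSE as the square root of the mean squared difference, and adjusted R² with the usual degrees-of-freedom correction. Whether the study used exactly this adjustment is an assumption, and the four toy values are illustrative only.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed proportions."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def adjusted_r2(predicted, observed, n_predictors):
    """R squared with the standard penalty for the number of predictors."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Toy per-pair accuracies (observed) and model predictions.
observed = [1.0, 0.9, 0.5, 0.8]
predicted = [0.95, 0.9, 0.6, 0.75]
demo_rmse = rmse(predicted, observed)
demo_adj_r2 = adjusted_r2(predicted, observed, n_predictors=1)
```

Because accuracy is a proportion between 0 and 1, an RMSE of 0.06 corresponds to a typical prediction error of about six percentage points.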
Validation of the Regression Model for Accuracy
The dataset was split into training and testing sets. The training set contained 180 (90%) of the print pairs (2,063 individual observations), and the testing set contained the remaining 20 print-pairs (10%, 229 observations). The testing set print pairs were a representative sample of the overall dataset, containing 12 pairs with perfect accuracy and 8 pairs with less-than-perfect accuracy. This was important in order to ensure that the training set did not have too few pairs with low accuracies (there were only 24 pairs with average accuracies below 80%). We replicated the model selection procedure for data only from the training set. The same predictors were selected with comparable coefficients, except for Delta (L) which was replaced with Core (L). For both the full and training datasets, the coefficients for these two predictors, Delta (L) and Core (L), were not significantly different from zero and were within two standard deviations of zero. Nevertheless, they could not be excluded based on the selection procedure described above. The fit of the model to the training set was comparable to the fit of the model to the full set (adjusted R² = 0.89, RMSE = 0.07).
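A split of this kind is made at the level of print pairs, so that every observation of a held-out pair lands in the testing set. The sketch below draws 12 perfect-accuracy and 8 imperfect-accuracy pairs for testing. The seeded random draw, the function name `split_by_pair`, and the synthetic accuracies are assumptions for illustration; the text does not say how the testing pairs were chosen.

```python
import random


def split_by_pair(pair_accuracy, n_test_perfect=12, n_test_imperfect=8, seed=0):
    """Split print-pair IDs into training and testing sets, stratified by accuracy.

    `pair_accuracy` maps a pair ID to its mean accuracy across experts.
    """
    rng = random.Random(seed)
    perfect = sorted(pid for pid, acc in pair_accuracy.items() if acc == 1.0)
    imperfect = sorted(pid for pid, acc in pair_accuracy.items() if acc < 1.0)
    test = set(rng.sample(perfect, n_test_perfect))
    test |= set(rng.sample(imperfect, n_test_imperfect))
    train = set(pair_accuracy) - test
    return train, test


# Synthetic accuracies: 118 perfect pairs and 82 imperfect pairs, as in the study's counts.
demo_accuracy = {pid: (1.0 if pid < 118 else 0.75) for pid in range(200)}
train_ids, test_ids = split_by_pair(demo_accuracy)
```

Observations would then be assigned to a set according to the pair they belong to, which keeps repeated evaluations of a testing pair out of the training data.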
We used this regression model fitted to the training set to predict accuracy for the withheld testing set of 20 print pairs. The percentage of variance explained was lower for the testing set than for the training set (adjusted R² = 0.64), suggesting some amount of overfitting. The error, however, was comparable between the training and testing sets (RMSE = 0.07 for the testing set). The model's predictions are shown in Figure 4.