Forensic Face Recognition: From characteristic descriptors to strength of evidence

Research output: Thesis › PhD Thesis - Research UT, graduation UT › Academic


Abstract

Forensic Face Recognition (FFR) is the use of biometric face recognition for several applications in forensic science. Biometric face recognition uses the face modality as a means to discriminate between human beings; forensic science is the application of science and technology to law enforcement. There are two image types involved in FFR. The trace image often captures a crime scene and is usually taken under uncontrolled conditions. The reference image is a photograph of a suspect and is taken under controlled conditions. In general, as described by Meuwly and Veldhuis [1], FFR includes scenarios of ID verification, identification, investigation and intelligence, and evaluation of strength of evidence. The evaluation of strength of evidence is commonly referred to as forensic evidence evaluation. The strength of evidence, in combination with prior assumptions, can be used by a court of law in reaching its verdict on whether a suspect is considered guilty or not. This dissertation is primarily concerned with topics related to forensic evidence evaluation in the domain of FFR.
The field of face recognition has made impressive improvements in the last two decades. State-of-the-art biometric face recognition can recognise faces with low error rates (e.g. a false-rejection probability of 1% at a false-acceptance probability of 0.1%) [2]. Although face recognition systems can in principle be used for investigation and intelligence purposes, forensic evidence evaluation is still largely a manual process performed by human FFR-examiners. During their assessment of trace and reference images, they are able to compensate for common influences on the quality of trace material. We refer to [3] for a study on (performance) differences between FFR-examiners and non-examiners. These influences include image compression artifacts, lens distortion, perspective effects, low resolution, interlacing, pose, illumination, and expression. Also, partial occlusion of the face is commonly encountered in trace images. These influences restrict the use of a standard face recognition system. An additional reason for reluctance towards the use of face recognition systems is their reliance on abstract, general feature descriptors like SIFT [4] and LBP [5]. These descriptors are not endowed with any forensic meaning and are hardly understandable outside the technical computer vision domain, in particular in a court of law.
During the manual forensic evidence evaluation process, traces and references are assessed by the FFR-examiner, who will pay attention to mostly shape-like and potentially highly discriminating facial features [6]. The Facial Identification Scientific Working Group (FISWG) [7] has published the Facial Image Comparison Feature List for Morphological Analysis [8]. It describes characteristic descriptors (facial features) that can be used during forensic evidence evaluation. Although this feature list is not a formal standard, similar forensic evidence evaluation procedures in The Netherlands and Sweden [9–11] indicate that it can be regarded as an informal standard, representative of those used throughout other countries as well [12].
The mere fact that the characteristic descriptors are documented in the FISWG Feature List does not automatically imply their suitability, in particular for their intended use under forensically relevant conditions. In fact, little research has been done on this topic. The transition from the Frye to the Daubert rule and the very critical report of the National Research Council of the National Academies on the state of forensic science in the USA are additional incentives to initiate such research on FISWG characteristic descriptors.
Prior to 2000, admissibility of expert evidence presented to a US trial court was governed by the Frye rule. This rule states that evidence is admissible as long as its method is “(...) sufficiently established to have gained general acceptance in the particular field in which it belongs.” [13]. In almost all jurisdictions, this rule has been superseded by the Daubert rule (“a trial judge must ensure that any and all scientific testimony or evidence admitted is not only relevant, but reliable”) [13]. This rule puts more emphasis on the methodology being scientific. This includes the use of peer-reviewed methods, insight into known or potential error rates, the formulation of hypotheses, and the conducting of experiments to prove or to falsify hypotheses. In other words, there has been a shift from conclusions or opinions under the Frye rule to strength of evidence established in a scientific manner under the Daubert rule. A summary of forensic facial expert testimony illustrating the dire, non-scientific approach in some selected cases can be found in [14]. In 2009 the National Research Council of the National Academies published an elaborate and critical report [15] on the current state of forensic science in the USA. It includes an in-depth discussion of the Frye and Daubert rules and their implications for the current practice of forensic science. In total, 13 recommendations have been formulated. Recommendation (3) is of particular interest: “Research is needed to address issues of accuracy, reliability, and validity in the forensic science disciplines. (...)”.
Considering this discussion, we are interested in several aspects related either directly or indirectly to the FISWG characteristic descriptors. These aspects start in the vicinity of the current practice, the human FFR-examiner, and they gradually zoom out towards the presentation of a practical framework for forensic evidence evaluation that can in principle also be applied to research outside the FFR domain. These eight aspects in turn form the basis of the research questions addressed in this dissertation.
The first aspect is how well FFR-examiners and non-examiners perform on a comparison task when they use FISWG characteristic descriptors versus a best-effort approach. The results are indicative of the added value of characteristic descriptors over an alternative approach.
Starting from the second aspect, we set the human aside and focus on the design and usage of biometric classifiers. The previously mentioned face recognition systems are examples of biometric classifiers. In general, a classifier compares a trace (having a questioned label) and a reference (having a known label), outputs a comparison score that encapsulates how convinced the classifier is that trace and reference input have a common label, and, given a threshold, makes a decision. If the comparison score exceeds this threshold, the decision is affirmative: trace and reference are assumed to have a common label; otherwise different labels are assumed. Although in this dissertation we use the term classifier, we are mostly interested in the produced comparison score. A biometric classifier is a classifier that uses biometric features as its input. In particular, we will primarily focus on biometric classifiers that use characteristic descriptors as their input. Furthermore, we are interested in comparison scores that are either modelled as or converted to strength of evidence. The input and output of such classifiers have a clear forensic meaning and are understandable by a court of law, as opposed to the previously mentioned abstract, general feature descriptors like SIFT. Also, by using biometric classifiers that are specialised in a specific characteristic descriptor, we have by design the guarantee that only the descriptor is taken into account during the computation of strength of evidence.
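The comparison-and-threshold behaviour described above can be sketched as follows. This is a minimal illustration and not the implementation used in the dissertation: the feature vectors, the choice of negative Euclidean distance as comparison score, and the names `comparison_score` and `compare` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Comparison:
    score: float      # comparison score; higher means "more likely a common label"
    same_label: bool  # decision obtained by thresholding the score


def comparison_score(trace, reference):
    """Hypothetical score: negative Euclidean distance between feature vectors."""
    return -math.dist(trace, reference)


def compare(trace, reference, threshold):
    """Compute the comparison score and decide affirmatively if it exceeds the threshold."""
    score = comparison_score(trace, reference)
    return Comparison(score=score, same_label=score > threshold)


# Toy feature vectors (assumed values, not data from the dissertation).
result = compare([0.0, 0.0], [3.0, 4.0], threshold=-6.0)
```

Here the score is kept alongside the decision because, as stated above, the comparison score is the quantity of main interest; the threshold only matters when a hard decision is required.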
Returning to the second aspect, it focuses on classifiers using FISWG characteristic descriptors as their input, producing strength of evidence, and how they perform in general in relation to other biometric classifiers that use non-forensic features, under relatively well-conditioned settings. General performance is measured by considering the comparison scores of a biometric classifier when it is offered a set of trace-reference pairs of multiple subjects whose ground truth (same source, different source) is known.
The third aspect extends the previous aspect by using trace images that are more representative of various forensic use cases. It considers the general performance of biometric classifiers using characteristic descriptors as their input, also in relation to face recognition systems.
The fourth aspect shifts the focus from the biometric classifier to mainly the properties of the characteristic descriptors themselves. In particular, it considers (a) their measurability and (b) the influence of measurement variation on the value of characteristic descriptors and the produced strength of evidence. Measurability refers to the extent to which characteristic descriptors can be extracted. Furthermore, in this dissertation, most characteristic descriptors have been extracted from manual annotation. This is due to the lower quality of trace images and the general difficulty of implementing a semantic definition of a characteristic descriptor in a robust extraction algorithm.
The fifth aspect considers differences between general and subject based performance. Subject based performance is measured by considering the comparison scores of a biometric classifier when it is offered a set of trace-reference pairs for which the traces only originate from the subject at hand, the references come from multiple subjects, and for each pair the ground truth (same source, different source) is known. The reason to consider this is that a biometric classifier using a characteristic descriptor as its input might have poor general performance, whereas the subject based performance might be better or even good. We believe that this behaviour is typical of the face modality in a forensic context; looking into this matter seems warranted. Insight into the variation of subject based performance is indicative of the proportion of cases in which the characteristic descriptor could be used to discriminate a subject. Moreover, inspecting the appearance of a characteristic descriptor of a particular subject whose biometric classifier exhibits a good subject based performance connects its phenotype to that performance and is potentially beneficial for identifying discriminative characteristic descriptors in general. Finally, it shows the contribution of each characteristic descriptor but also their limits. This aspect is taken into account by considering empirical results and a theoretical construction creating a gap between perfect subject based and general random performance.
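The difference between general and subject based performance can be made concrete with a small sketch. It is illustrative only: AUC is used as the performance measure, the scores are toy values rather than results from the dissertation, and the function names and the `(subject, score, is_same_source)` tuple layout are assumptions for the example.

```python
from collections import defaultdict


def auc(same_scores, diff_scores):
    """Fraction of (same, different) score pairs ranked correctly; ties count half."""
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))


def general_auc(pairs):
    """Pool the comparison scores of all subjects before computing the AUC."""
    same = [score for _, score, is_same in pairs if is_same]
    diff = [score for _, score, is_same in pairs if not is_same]
    return auc(same, diff)


def subject_based_auc(pairs):
    """Compute one AUC per subject, using only pairs whose trace is from that subject."""
    by_subject = defaultdict(lambda: ([], []))
    for subject, score, is_same in pairs:
        by_subject[subject][0 if is_same else 1].append(score)
    return {subj: auc(same, diff)
            for subj, (same, diff) in by_subject.items() if same and diff}


# Toy data: (subject of the trace, comparison score, same source?).
pairs = [
    ("A", 4.0, True), ("A", 3.0, False),
    ("B", 2.0, True), ("B", 1.0, False),
]
general = general_auc(pairs)
per_subject = subject_based_auc(pairs)
```

In this toy data each subject is separated perfectly on its own (AUC 1.0), yet the pooled AUC is only 0.75, because the score ranges of the two subjects differ. This is the kind of gap between subject based and general performance that the paragraph above refers to.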
The sixth aspect considers the suitability of facial marks in forensic evidence evaluation and extends the previous subject based performance to a broader subject based approach. Facial marks are interesting as they are representative of FISWG characteristic descriptors that have a potential to be very discriminative. This aspect describes a proto-framework that contains possible choices during the design and evaluation of biometric classifiers that use features derived from facial mark locations. An example choice is whether to consider a classifier that is trained with subject based data. It also incorporates other, forensically relevant, performance characteristics that can be evaluated at a subject based level. The proto-framework is created as a response to existing facial mark classifier studies.
The seventh aspect extends the proto-framework of the previous aspect into a framework, applicable to the design and evaluation of biometric classifiers for forensic evidence evaluation in general, in principle even applicable outside the FFR domain, with a special emphasis on the subject based approach. Also, its applicability is shown by considering two relevant applications in the domain of FFR, of which one extends the facial mark study.
The eighth, and final, aspect complements the previous aspects in an abstract manner. Although the subject based performance might be reasonable or even good in some cases, a large proportion of biometric classifiers will probably have a performance that is poor to the extent that it is unclear whether it could have been produced by a random classifier, that is, a classifier that essentially outputs random comparison scores without considering the trace and reference inputs. This aspect takes a particular performance measure, the Area Under the Curve (AUC), and quantifies the boundary between random and non-random performance.
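One generic way to ask whether an observed AUC could have come from a random classifier is a label permutation test, sketched below. This is a swapped-in, standard technique chosen for illustration; the abstract does not state how the dissertation quantifies the boundary, so this should not be read as its method. The function names, the number of permutations, and the seed are assumptions.

```python
import random


def auc(same_scores, diff_scores):
    """Fraction of (same, different) score pairs ranked correctly; ties count half."""
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))


def permutation_p_value(same_scores, diff_scores, n_permutations=2000, seed=0):
    """Estimate how often randomly relabelled scores reach the observed AUC or better."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = auc(same_scores, diff_scores)
    pooled = list(same_scores) + list(diff_scores)
    n_same = len(same_scores)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between scores and ground truth
        if auc(pooled[:n_same], pooled[n_same:]) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Toy scores (assumed values): well separated versus indistinguishable.
p_separated = permutation_p_value([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
p_identical = permutation_p_value([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```

A small value suggests that the observed AUC is unlikely under random scoring, whereas a large value means the performance cannot be distinguished from that of a random classifier with the available number of comparisons.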
Overall, we believe that by addressing these eight aspects in this dissertation, the FISWG characteristic descriptors are considered from relevant points of view and as such our approach does justice to the intention encapsulated in the Daubert rule.


Original language: English
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • Veldhuis, Raymond N.J., Supervisor
  • Spreeuwers, Lieuwe Jan, Advisor
Award date: 3 Nov 2017
Place of Publication: Enschede
Publisher: University of Twente
Print ISBNs: 978-90-365-4375-0
DOIs: 10.3990/1.9789036543750
Publication status: Published - 3 Nov 2017

Keywords

  • forensic face recognition
  • characteristic descriptors
  • strength of evidence

Cite this

@phdthesis{f17897da47e644e49e3982dc3f9d25d4,
title = "Forensic Face Recognition: From characteristic descriptors to strength of evidence",
keywords = "forensic face recognition, characteristic descriptors, strength of evidence",
author = "Zeinstra, {Christopher Gerard}",
note = "CTIT Ph.D. thesis series no. 17-439",
year = "2017",
month = "11",
day = "3",
doi = "10.3990/1.9789036543750",
language = "English",
isbn = "978-90-365-4375-0",
publisher = "University of Twente",
address = "Netherlands",
school = "University of Twente",

}

Forensic Face Recognition: From characteristic descriptors to strength of evidence. / Zeinstra, Christopher Gerard.

Enschede : University of Twente, 2017. 208 p.


TY - THES

T1 - Forensic Face Recognition

T2 - From characteristic descriptors to strength of evidence

AU - Zeinstra, Christopher Gerard

N1 - CTIT Ph.D. thesis series no. 17-439

PY - 2017/11/3

Y1 - 2017/11/3

N2 - Forensic Face Recognition (FFR) is the use of biometric face recognition for several appli- cations in forensic science. Biometric face recognition uses the face modality as a means to discriminate between human beings; forensic science is the application of science and tech- nology to law enforcement. There are two image types involved in FFR. The trace image often captures a crime scene and is most of the time taken under uncontrolled conditions. The reference image is a photograph of a suspect and is taken under controlled conditions. In general, as described by Meuwly and Veldhuis [1], FFR includes scenarios of ID verifica- tion, identification, investigation and intelligence, and evaluation of strength of evidence. The evaluation of strength of evidence is commonly referred to as forensic evidence evaluation. The strength of evidence, in combination with prior assumptions, can be used by a court of law in its verdict whether a suspect is considered guilty or not. This dissertation is primarily concerned with topics related to forensic evidence evaluation in the domain of FFR.The field of face recognition has made impressive improvements in the last two decades. State-of-the-art biometric face recognition can recognise faces with low error rates (e.g. a false-rejection probability of 1% at a false-acceptance probability of 0.1%) [2]. Although face recognition systems in principle can be used for investigation and intelligence purposes, forensic evidence evaluation is still largely a manual process performed by human FFR- examiners. They are able to amortise common influences on the quality of trace material during their assessment of trace and reference images. We refer to [3] for a study on (per- formance) differences between FFR-examiners and non-examiners. The influences include image compression artifacts, lens distortion, perspective effects, low resolution, interlacing, pose, illumination, and expression. 
Also, partial occlusion of the face is commonly encoun- tered in trace images. These influences restrict the use of a standard face recognition system. An additional reason to be somewhat reluctant towards the use of face recognition systems is their use of abstract, general feature descriptors like SIFT [4] and LBP [5]. These descrip- tors are not endowed with any forensic meaning and are hardly understandable outside the technical computer vision domain, in particular in a court of law.During the manual forensic evidence evaluation process, traces and references are as- sessed by the FFR-examiner who will pay attention to mostly shape like and potentially1 highly discriminating facial features [6]. The Facial Identification Scientific Working Group (FISWG) [7] has published the Facial Image Comparison Feature List for Morphological Analysis [8]. It describes characteristic descriptors (facial features) that can be used during forensic evidence evaluation. Although this feature list is not a formal standard, similar foren- sic evidence evaluation procedures in The Netherlands and Sweden [9–11] indicate that it can be regarded as an informal standard, representative of those used throughout other countries as well [12].The mere fact that the characteristic descriptors are documented in the FISWG Feature List does not automatically imply their suitability, in particular for their intended use under forensically relevant conditions. Actually, little research is done on this topic. The transfer from the Frye to the Daubert rule and the very critical report of the National Research Coun- cil of the National Academies on the state of forensic science in the USA, is an additional incentive to initiate such research on FISWG characteristic descriptors.Prior to 2000, admissibility of expert evidence presented to a US trial court was governed by the Frye rule. This rule states that evidence is admissible as long its method is “(...) 
sufficiently established to have gained general acceptance in the particular field in which it belongs.” [13]. In almost all jurisdictions, this rule has been superseded by the Daubert rule (“a trial judge must ensure that any and all scientific testimony or evidence admitted is not only relevant, but reliable”) [13]. This rule puts more emphasis on the used methodology being scientific. This includes the use of peer reviewed methods, insight in known or potential error rates, the formulation of hypotheses, and the conduction of experiments to prove or to falsify hypotheses. In other words, there has been a shift from conclusions or opinions under the Frye rule to strength of evidence established in a scientific manner under the Daubert rule. A summary of forensic facial expert testimony illustrating the dire, non-scientific approach in some selected cases can be found in [14]. In 2009 the National Research Council of the National Academies published an elaborate and critical report [15] on the current state of forensic science in the USA. It includes an in depth discussion of the Frye and Daubert rules and its implications on current practice of forensic science. In total 13 recommendations have been formulated. Recommendation (3) is of particular interest: “Research is needed to address issues of accuracy, reliability, and validity in the forensic science disciplines. (...)”.Considering this discussion, we are interested in several aspects related either directly or indirectly to the FISWG characteristic descriptors. These aspects start in the vicinity of the current practice, the human FFR-examiner, and they gradually zoom out towards the presentation of a practical framework for forensic evidence evaluation that in principle also can be applied to research outside the FFR domain. 
These eight aspects in turn form the basis of the research questions addressed in this dissertation.
The first aspect is how well FFR-examiners and non-examiners perform on a comparison task when they use FISWG characteristic descriptors versus a best-effort approach. The results are indicative of the added value of characteristic descriptors over an alternative approach.
Starting from the second aspect, we set the human aside and focus on the design and usage of biometric classifiers. The previously mentioned face recognition systems are examples of biometric classifiers. In general, a classifier compares a trace (having a questioned label) and a reference (having a known label), outputs a comparison score that encapsulates how convinced the classifier is that the trace and reference inputs have a common label, and, given a threshold, makes a decision. If the comparison score exceeds this threshold, the decision is affirmative: trace and reference are assumed to have a common label; otherwise, different labels are assumed. Although in this dissertation we use the term classifier, we are mostly interested in the produced comparison score. A biometric classifier is a classifier that uses biometric features as its input. In particular, we will primarily focus on biometric classifiers that use characteristic descriptors as their input. Furthermore, we are interested in comparison scores that are either modelled as or converted to strength of evidence. The input and output of such classifiers have a clear forensic meaning and are understandable by a court of law, as opposed to the previously mentioned abstract, general feature descriptors like SIFT.
Also, by using biometric classifiers that are specialised for a specific characteristic descriptor, we have by design the guarantee that only that descriptor is taken into account during the computation of strength of evidence.
The second aspect itself focuses on classifiers that use FISWG characteristic descriptors as their input and produce strength of evidence, and on how they perform in general in relation to other biometric classifiers that use non-forensic features, under relatively well-conditioned settings. General performance is measured by considering the comparison scores of a biometric classifier when it is offered a set of trace-reference pairs of multiple subjects whose ground truth (same source, different source) is known.
The third aspect extends the previous aspect by using trace images that are more representative of various forensic use cases. It considers the general performance of biometric classifiers using characteristic descriptors as their input, also in relation to face recognition systems.
The fourth aspect shifts the focus from the biometric classifier to mostly properties of the characteristic descriptors themselves. In particular, it considers (a) their measurability and (b) the influence of measurement variation on the value of characteristic descriptors and the produced strength of evidence. Measurability refers to the extent to which characteristic descriptors can be extracted. Furthermore, in this dissertation, most characteristic descriptors have been extracted from manual annotation. This is due to the lower quality of trace images and the general difficulty of implementing a semantic definition of a characteristic descriptor in a robust extraction algorithm.
The fifth aspect considers differences between general and subject based performance.
Subject based performance is measured by considering the comparison scores of a biometric classifier when it is offered a set of trace-reference pairs for which the traces only originate from the subject at hand, the references come from multiple subjects, and for each pair the ground truth (same source, different source) is known. The reason to consider this is that a biometric classifier using a characteristic descriptor as its input might have poor general performance, whereas the subject based performance might be better or even good. We believe that this behaviour is exemplary for the face modality in a forensic context; looking into this matter seems warranted. Insight into the variation of subject based performance is indicative of the proportion of cases in which the characteristic descriptor could be used to discriminate a subject. Moreover, inspecting the appearance of a characteristic descriptor of a particular subject whose biometric classifier exhibits a good subject based performance connects its phenotype to that performance and is potentially beneficial for identifying discriminative characteristic descriptors in general. Finally, it shows the contribution of each characteristic descriptor but also its limits. This aspect is taken into account by considering empirical results and a theoretical construction that creates a gap between perfect subject based performance and random general performance.
The sixth aspect considers the suitability of facial marks in forensic evidence evaluation and extends the previous subject based performance to a broader subject based approach. Facial marks are interesting as they are representative of FISWG characteristic descriptors that have the potential to be very discriminative. This aspect describes a proto-framework that contains possible choices during the design and evaluation of biometric classifiers that use features derived from facial mark locations.
An example choice is whether to consider a classifier that is trained with subject based data. It also incorporates other, forensically relevant, performance characteristics that can be evaluated at a subject based level. The proto-framework is created as a response to existing facial mark classifier studies.
The seventh aspect extends the proto-framework of the previous aspect into a framework applicable to the design and evaluation of biometric classifiers for forensic evidence evaluation in general, in principle even outside the FFR domain, with a special emphasis on the subject based approach. Its applicability is shown by considering two relevant applications in the domain of FFR, one of which extends the facial mark study.
The eighth, and final, aspect complements the previous aspects in an abstract manner. Although the subject based performance might be reasonable or even good in some cases, a large proportion of biometric classifiers will probably have a performance that is poor to the extent that it is unclear whether it could have been produced by a random classifier, that is, a classifier that essentially outputs random comparison scores without considering the trace and reference inputs. This aspect takes a particular performance measure, the Area Under the Curve (AUC), and quantifies the boundary between random and non-random performance.
Overall, we believe that by addressing these eight aspects in this dissertation, the FISWG characteristic descriptors are considered from relevant points of view and as such our approach does justice to the intention encapsulated in the Daubert rule.
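The eighth aspect can be illustrated with a minimal sketch. It is not the method of the dissertation, which this abstract does not describe. It assumes that the AUC is computed as the Mann-Whitney statistic over same-source and different-source comparison scores, and that the boundary with random performance is approximated by simulating a classifier that outputs uniformly random scores. The function names `auc` and `random_auc_upper_bound`, the significance level `alpha`, and the toy scores are all hypothetical.

```python
import random


def auc(same_scores, diff_scores):
    """AUC as the fraction of (same-source, different-source) score pairs
    in which the same-source score is higher; ties count as one half."""
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))


def random_auc_upper_bound(n_same, n_diff, alpha=0.05, trials=2000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of the AUC obtained
    by a random classifier, i.e. one whose scores ignore trace and reference."""
    rng = random.Random(seed)
    null_aucs = []
    for _ in range(trials):
        same = [rng.random() for _ in range(n_same)]
        diff = [rng.random() for _ in range(n_diff)]
        null_aucs.append(auc(same, diff))
    null_aucs.sort()
    index = min(trials - 1, int((1.0 - alpha) * trials))
    return null_aucs[index]


# Toy comparison scores (made up for illustration only).
same_scores = [0.9, 0.8, 0.75, 0.6, 0.55]      # same-source pairs
diff_scores = [0.7, 0.5, 0.4, 0.35, 0.3, 0.2]  # different-source pairs

observed = auc(same_scores, diff_scores)
bound = random_auc_upper_bound(len(same_scores), len(diff_scores))

print(f"observed AUC: {observed:.3f}")
print(f"random-classifier bound at alpha=0.05: {bound:.3f}")
print("distinguishable from random:", observed > bound)
```

With few trace-reference pairs the simulated bound lies well above 0.5, which is why an AUC moderately above 0.5 on a small subject based set need not indicate non-random performance.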

KW - forensic face recognition

KW - characteristic descriptors

KW - strength of evidence

UR - http://dx.doi.org/10.3990/1.9789036543750

U2 - 10.3990/1.9789036543750

DO - 10.3990/1.9789036543750

M3 - PhD Thesis - Research UT, graduation UT

SN - 978-90-365-4375-0

PB - University of Twente

CY - Enschede

ER -