Forensic speaker identification is the application of science to the problem of identifying an unknown speaker in a criminal investigation. A voice is much more than just a string of words. Although DNA evidence grabs the headlines, DNA cannot talk: it cannot be recorded planning, carrying out or confessing to a crime [1]. A person’s voice can be used successfully as a biometric feature because it is well accepted by users and can be recorded easily with microphones and other low-cost hardware [2]. It offers an alternative and more secure means of granting entry to a secured area, with no need to remember a password or lock combination and no reliance on keys, magnetic cards or other fallible devices that can easily be stolen. At present, the wide availability of telephones, mobile phones and tape recorders has led to their misuse, making them efficient tools in the commission of criminal offences such as kidnapping, extortion, blackmail threats, obscene calls, anonymous calls, harassment calls, ransom calls, terrorist calls and match fixing. Criminals misuse these modes of voice communication in the belief that they will remain incognito and that nobody will recognize them. Fortunately, this is no longer true: the voice can identify the offender and tie the crime to him [3].
Speaker identification is less complicated, and leads to a more definite opinion, when the expert deals with normal, undisguised voice samples. The problem arises when disguised voice samples, involving either accidental or attempted disguise, are submitted for identification. A further difficulty is the case of speakers who sound almost alike and share the same sex, age and dialect.
Speech is the vocalized form of human communication [4]. Human beings express their ideas, thoughts and feelings orally to one another through a series of complex movements that alter and mold the basic tone created by the voice into specific, decodable sounds [5]. Speech development is a gradual process that requires years of practice. Communication is a process, a series of events that allows the speaker to express thoughts and emotions and the listener to understand them. Speech communication begins as thought that is transformed into language for expression [6].
The speech signal is a multidimensional acoustic wave [7] (as shown in fig 1) that conveys information about the words or message being spoken, the identity of the speaker, the language spoken, the presence and type of speech pathologies, and the physical and emotional state of the speaker. A person’s speech also contains features that may reveal geographical origin, ethnicity or race, age, sex, education level, and religious orientation and background [8, 9, 10]. Humans are often able to extract this identity information when the speech comes from a speaker they are acquainted with.
Speech is a compelling biometric for several well-known reasons, particularly because it is the only modality available in a large set of situations [11].
SPEECH MECHANISM AND ITS UNIQUENESS
The mechanism of speech is very complex, and to analyse any language it is important to understand the processes that make up the message that a speaker transmits and a listener receives [12]. For any sound to be produced, there must be some disturbance in the air. In speech, this disturbance is produced by the movement of certain organs of the body, such as the muscles of the chest, the vocal cords, the tongue and the lips. The disturbance travels as sound waves to the ear of the listener, who interprets the waves as sound.
During inhalation, air from the environment is drawn into the lungs, where it is held for a short time; during exhalation it is expelled from the lungs under pressure and sent to the larynx. The function of the larynx, particularly of the part known as the vocal folds, is to set the molecules of this breath stream into vibration [13] (as shown in fig 2). For sound to be produced, these molecules have to vibrate at a rate that falls within a particular range. The process by which the molecules of air are set into vibration is known as phonation.
The vibration pattern of the molecules produced by phonation is complex. It contains a wide range of frequencies and has a buzzing sound. This buzz is moulded into speech sounds by the vocal tract, which consists of the pharynx (throat), the oral cavity and the nasal cavity. The configuration, or shape, of the vocal tract at a particular moment determines what speech sound will be produced. The configuration can be changed by the movement of several structures within the tract, specifically the tongue, lips, lower jaw and soft palate [14].
Representation of speech mechanism
For two voices to be indistinguishable, the two individuals would need identical vocal mechanisms and identical coordination of their articulators, which is highly improbable. Hence the human voice is a unique personal trait.
Speaker recognition may be defined as any activity in which a speech sample is attributed to a person on the basis of its acoustic or perceptual properties [15]. The information content of a spoken utterance includes speaker characteristics, the spoken phrase, emotions, additional noise and channel transformations [16]. Speaker recognition can be divided into speaker identification and speaker verification. Speaker identification determines which registered speaker, from amongst a set of known speakers, produced a given utterance: the unknown speaker is identified as the speaker whose model best matches the input utterance. Speaker verification accepts or rejects the identity claim of a speaker: is the speaker the person they say they are [17, 18, 19]? In speaker recognition, the identification is made from the voice itself, not by analysing the language used, by remembering what the speaker looks like or by any other means. The general term is sometimes used when it is not clear whether the process is one of verification or identification [20].
In a scheme for the mechanical recognition of speakers, it is desirable to use acoustic parameters that are closely related to the voice characteristics that distinguish speakers. This involves selecting parameters that are motivated by known relations between the voice signal and vocal-tract shapes and gestures [21]. Speaker recognition distinguishes between low-level and high-level information. High-level information includes dialect, accent, talking style, the subject matter or context, and phonetic, prosodic and lexical information [22]. At present these features are recognized and analysed only by humans. Low-level features comprise information such as the fundamental frequency (F0), formant frequencies, pitch, intensity, rhythm, tone, spectral magnitude and the bandwidths of an individual’s voice [23]. An ideal feature would:
Have low intra-speaker variability and high inter-speaker variability
Be robust against noise and distortion
Occur frequently and naturally in speech
Be easy to measure from the speech signal
Be difficult to mimic
Not be affected by the speaker’s health or by long-term variations in the voice
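The first criterion in the list above can be made concrete with a simple ratio of between-speaker variance to average within-speaker variance, often called an F-ratio. The sketch below is only an illustration: the function name and the F0 values are hypothetical and do not come from any study cited here.

```python
from statistics import mean, pvariance


def f_ratio(samples_by_speaker):
    """Between-speaker variance of the speaker means divided by the
    average within-speaker variance. Larger values suggest a feature
    that separates speakers better."""
    speaker_means = [mean(values) for values in samples_by_speaker.values()]
    between = pvariance(speaker_means)
    within = mean([pvariance(values) for values in samples_by_speaker.values()])
    return between / within  # undefined if every speaker's samples are identical


# Hypothetical mean F0 measurements (Hz) from three sessions per speaker.
stable_feature = {
    "A": [118.0, 122.0, 120.0],
    "B": [178.0, 182.0, 180.0],
    "C": [148.0, 152.0, 150.0],
}
# The same speakers measured with a feature that varies a lot within a speaker.
unstable_feature = {
    "A": [100.0, 140.0, 120.0],
    "B": [160.0, 200.0, 180.0],
    "C": [130.0, 170.0, 150.0],
}

print(f_ratio(stable_feature), f_ratio(unstable_feature))
```

A feature with tight clusters per speaker and well-separated speaker means scores higher, which is the property the list asks for.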
There are different ways to categorize the features. From the viewpoint of their physical interpretation, they can be divided into the following groups [24]:
Short-term spectral features- These features, as the name suggests, are computed from short frames of about 20 to 30 milliseconds in duration. They are usually descriptors of the resonance properties of the supralaryngeal vocal tract.
Voice source features- These features characterize the glottal excitation signal of voiced sounds, for example the glottal pulse shape and the fundamental frequency, and it is reasonable to assume that they carry speaker-specific information.
Spectro-temporal features- It is reasonable to assume that spectro-temporal signal details such as formant transitions and energy modulations contain useful speaker-specific information.
Prosodic features- Prosody refers to non-segmental aspects of speech, including syllable stress, intonation patterns, speaking rate and rhythm. One important aspect of prosody is that, unlike the traditional short-term spectral features, it spans long segments such as syllables, words and utterances, and reflects differences in speaking style, language background, sentence type and the emotion of the speaker.
High-level features- These features attempt to capture conversation-level characteristics of speakers, such as the characteristic use of words (“uh-huh”, “you know”, “oh yeah”, etc.). Other such features are the dialect of the language used in the conversation, the accent of the speaker and the style of speaking.
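The short-term analysis described in the first item above can be sketched as follows: the signal is cut into frames of about 25 ms, and a spectrum is computed for each frame. This is a minimal illustration using only the Python standard library; the frame and hop lengths, the Hamming window, the 8 kHz rate and the test tone are assumptions made for the example, and a real system would use an FFT library rather than the slow direct transform shown here.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping short frames (lengths in milliseconds)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitude of a Hamming-windowed frame at each DFT bin up to Nyquist."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def peak_frequency(frame, sample_rate):
    """Frequency in Hz of the strongest spectral bin in one frame."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)


# Synthetic stand-in for speech: 0.1 s of a 400 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 400 * i / SAMPLE_RATE) for i in range(800)]
frames = frame_signal(tone, SAMPLE_RATE)
peaks = [peak_frequency(frame, SAMPLE_RATE) for frame in frames]
print(len(frames), peaks[0])
```

In practice the per-frame spectrum is reduced further to a compact descriptor of the vocal tract resonances; the peak frequency is used here only because it is easy to check by eye.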
Any alteration, distortion or deviation from normal speech, irrespective of its cause, is defined as speech disguise. Disguise can take many forms and can be very damaging to both lay and technical speaker identification [25]. Criminals often disguise their voices. The effect of the disguise is that the acoustic features of the criminal exemplar are altered so that they become less similar to the acoustic features of the same person’s undisguised utterances. Research on disguise has tended to be of two types. One type was non-electronic and attempted to measure the ability of non-expert listeners to identify people who were disguising their voices in a variety of ways. The second type was electronic, often involving speech spectrograms, or so-called “voiceprints” [26].
The detection of voice disguise is fundamental in forensic applications. Different kinds of approaches give significant discrimination results, and complementary formant-based and automatic analyses could be fused to increase the recognition rate [27].
MOTIVATION IN STUDYING DISGUISED SPEECH [28]
Generally, the expert faces two types of challenges while examining the questioned sample. First, a disguised voice is often used in the commission of a crime when the criminal fears being caught, and it is often necessary to identify or verify a suspect on the basis of that disguised voice. Some means is needed to:
Determine that a voice on a recording has been disguised
Determine the method of disguise
Perform computer speaker identification despite the disguise
The second challenge is that speaker identification is essentially incapable of accurately determining the identity of a speaker when a test sample of disguised speech is compared with a reference based on the speaker’s normal speaking mode. To date, and to the best of our knowledge, this remains true. One goal of forensic speaker recognition research is to reverse that situation, at least for a large and useful subset of disguise types.
TYPES OF DISGUISE
Disguised speech can be of two types:
Non-deliberate or accidental disguise- This form of voice disguise involves alterations that result from some involuntary state of the individual. It covers temporary changes in a person’s speech caused by a change in physical state, such as chewing, eating or illness, or in emotional state, such as stress, anger, fear, nervousness, cheerfulness, surprise or sadness. Research has been carried out on developing robust and precise automatic speaker verification systems based on these speaker-based variations in features [29].
Deliberate or attempted disguise- Samples of attempted disguise are frequently encountered in cases of anonymous, ransom and threatening calls, where speakers make a deliberate effort to change their voice by altering its phonetic, phonemic and prosodic features in order to hide their identity for fear of being caught.
TECHNIQUES USED FOR SPEAKER RECOGNITION
In this era of telephone, radio and tape-recorder communication, the human voice may often prove to be valuable evidence for associating an individual with a criminal act. Telephoned bomb threats, obscene calls and tape-recorded ransom messages have become frequent enough to warrant the interest of law enforcement officials in scientific techniques capable of transforming the voice into a form suitable for personal identification [31]. The task of speaker identification is to determine who the speaker of a given utterance is. To do so, it is necessary either to know a great deal about that person’s speech characteristics (a rare occurrence) or to be able to match the voice of the unknown talker to one from a group of suspects.
Various methodologies for approaching the problem of speaker identification have been proposed. For identification, well-recognized standard techniques are used to maintain the validity of the work, and the choice among them depends on the requirements of the case:
1) Listener method or Auditory analysis-
The voice of a person is as easily distinguishable by the ear as the face is by the eye. Speaker recognition by listening is the oldest method of all. Here a person attempts to recognize a voice by its familiarity [32]. The human ability to recognize many familiar people by their voices is exceptional in both accuracy and adaptability [33]. In this method, the decision about similarities and dissimilarities is taken by human experts after listening to the speech samples. One approach is repeated listening to the available audio files by a group of experts looking for similarities in linguistic, phonetic and acoustic features. The utterances in the recorded conversation are first segregated by speaker through repeated listening. The segregated conversation of each speaker is then heard repeatedly to identify linguistic and phonetic features such as articulation rate, flow of speech, degree of vowel and consonant formation, rhythm, striking time and pauses. Clue words are selected from both the questioned and the specimen samples of the speaker and are then used for instrumental analysis.
Human listeners are robust speaker recognizers when presented with degraded speech. Listener performance is a function of acoustic variables such as the signal-to-noise ratio, the speech bandwidth, the amount of speech material, and distortions introduced into the speech signal by speech coding and transmission systems. This robustness arises because several sources of knowledge contribute in various ways to speaker recognition, providing weak, moderate or high discriminating power. Auditory speaker recognition has long been used and accepted in forensics as part of the testimony of a victim or witness. Before the invention of the telephone and of sound recording equipment, it could be the key evidence on the basis of which a suspect was identified with, or excluded from, an offence committed in the dark or against a blindfolded victim [34]. However, as with any human decision process, the listener method leads to a subjective decision. Nevertheless, it is still used in some countries for forensic speaker identification.
2) Instrumental analysis or Spectrographic method-
The spectrographic method of speaker recognition makes use of an instrument that converts speech signals into a visual display. Voice analysis has matured into a sophisticated identification technique that uses the latest available technology. Aural and spectrographic analyses are combined to form the conclusion about the identity of the voice in question [35]. In 1941, an electromechanical acoustic spectrograph was developed by Dr. Ralph Potter of Bell Telephone Laboratories, with the idea of converting sounds into pictures [36].
A sound spectrograph is an instrument that gives a permanent record of the changing energy-frequency distribution of a speech wave over time [37] (as shown in fig 3 and fig 4). Spectrograms are visual representations of the speech signal; they convey information about the speaker’s message as well as about the speaker. In this method, the opinion about similarities or dissimilarities between two samples is based on their phonetic and acoustic elements, such as frequencies, amplitude, plosive duration and unvoiced signals at different positions. The sound spectrograph is more commonly known as the voiceprint analyser. Voice patterns are transformed into visual patterns that are drawn on paper as it moves through the instrument at a controlled speed. By analysing the charts, the examiner can compare a tape of an individual’s normal speech pattern with a tape of the same person being questioned about his or her involvement in a crime or other misbehaviour [38]. These voiceprints may be an important aid to law enforcement agencies in identifying criminals. Much like fingerprint identification, voiceprint identification uses the unique features found in the spectrographic impressions of people’s utterances [39].
In the classical analogue spectrograph, a magnetic tape recorder and playback unit converts the sounds into electronic signals. These signals are passed through a variable electronic bandpass filter, which selects the frequency band to be analysed, before a stylus measures the energy in that band and records the result on electrically sensitive paper. The paper is mounted on a drum that rotates during playback, so that variations of the signal over time are plotted. When the whole length of the speech sample has been analysed at one frequency band, the band of the filter and the position of the stylus are shifted correspondingly, and the tape is played again to analyse a new part of the frequency spectrum. This process is repeated until the entire desired frequency range has been analysed. In each spectrogram, the horizontal dimension is time, the vertical dimension is frequency, and the darkness represents intensity on a compressed scale [40]. Differences in amplitude are shown on a grey scale in which black represents the most intense and white the least intense components.
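The band-by-band procedure of the analogue spectrograph has a direct digital counterpart, sketched below under stated assumptions: the signal is analysed one frequency band at a time, the energy in each time slice is measured, compressed to decibels and mapped to a "grey" level. The band frequencies, frame length, 40 dB display range and the character ramp standing in for paper darkness are all choices made for this illustration, not properties of any particular instrument.

```python
import math


def band_energy(frame, freq_hz, sample_rate):
    """Energy of one time slice at one analysis frequency (single-bin DFT)."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(frame))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(frame))
    return re * re + im * im


def spectrogram(samples, sample_rate, band_freqs, frame_len=200):
    """One row per frequency band, one column per successive time slice."""
    frames = [samples[s:s + frame_len]
              for s in range(0, len(samples) - frame_len + 1, frame_len)]
    return [[band_energy(frame, freq, sample_rate) for frame in frames]
            for freq in band_freqs]


def to_grey(spec, floor_db=-40.0, shades=" .:*#"):
    """Compress energies to dB relative to the peak and map them to shades;
    '#' plays the role of black (most intense), ' ' of white."""
    peak = max(max(row) for row in spec)
    rows = []
    for row in spec:
        chars = []
        for energy in row:
            db = 10 * math.log10(max(energy, 1e-12) / peak)
            level = (max(db, floor_db) - floor_db) / -floor_db  # 0.0 .. 1.0
            chars.append(shades[min(int(level * len(shades)), len(shades) - 1)])
        rows.append("".join(chars))
    return rows


# Synthetic signal at 8 kHz: 0.1 s at 400 Hz followed by 0.1 s at 1200 Hz.
RATE = 8000
signal = [math.sin(2 * math.pi * (400 if i < 800 else 1200) * i / RATE)
          for i in range(1600)]
bands = [400, 800, 1200]
rows = to_grey(spectrogram(signal, RATE, bands))
for freq, row in reversed(list(zip(bands, rows))):  # highest band on top
    print(f"{freq:>5} Hz |{row}|")
```

Time runs left to right and frequency bottom to top, so the jump from the lower to the higher tone appears as a dark bar that moves upward halfway through the display.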
Since 1962, when it was first put forward as a foolproof method of personal identification, voice identification by spectrographic analysis, the “voiceprint” technique, has been in a legal limbo. Recent developments in both science and the law indicate, however, that despite initially adverse scientific and judicial reaction, spectrographic voice identification is perhaps coming of legal age [41].
3) Computerized approach-
This is a semi-automatic approach to the recognition of speech samples, which involves three stages:
In this method, the parameters of the signal are extracted by means of a spectrum analyser, and recognition is carried out by a computer system on the basis of stored data for the control samples of the speakers.
However, the error rates of machines are often more than an order of magnitude greater than those of humans, as machine performance degrades below that of humans in noise, with channel variability, and for spontaneous speech [42].
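The recognition step described above, in which extracted parameters are compared with stored data for known speakers, might look like the following in its simplest form. This is a sketch only: the three-element feature vectors (mean F0 and two formant values in Hz), the Euclidean distance and the threshold are hypothetical choices, and features on different scales would normally be normalized before distances are computed.

```python
import math


def identify(test_vector, reference_models):
    """Closed-set identification: return the stored speaker whose reference
    vector is closest to the test vector, together with that distance."""
    distances = {name: math.dist(test_vector, reference)
                 for name, reference in reference_models.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


def verify(test_vector, claimed_reference, threshold):
    """Verification: accept the identity claim only if the test vector is
    within the threshold distance of the claimed speaker's reference."""
    return math.dist(test_vector, claimed_reference) <= threshold


# Hypothetical stored data for control samples: [mean F0, F1, F2] in Hz.
references = {
    "speaker_1": [120.0, 700.0, 1200.0],
    "speaker_2": [210.0, 650.0, 1900.0],
}
questioned = [125.0, 690.0, 1250.0]

best_match, distance = identify(questioned, references)
print(best_match, round(distance, 1))
print(verify(questioned, references["speaker_2"], threshold=100.0))
```

The two functions mirror the distinction between identification (best match among known speakers) and verification (accept or reject a single claim).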
4) Modern technique using software: BATVOX 3.0 [43]-
BATVOX 3.0 is an automatic speaker recognition application designed to allow the biometric identification of speakers in an investigation by comparing voice models with a set of audio files added to the system. Audio files entered into BATVOX 3.0 have to fulfil certain conditions:
The files must be .wav files with linear PCM coding, a sampling frequency of 8 kHz, 16-bit resolution and a single (mono) channel.
Each file must contain at least 7 seconds of net speech.
The signal-to-noise ratio of each file must be more than 10 dB.
The test and training audio files should contain the voices of speakers of the same sex speaking the same language, and should have the same channel characteristics.
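Some of the file conditions listed above can be pre-checked before a recording is submitted. The sketch below uses Python's standard `wave` module and is not part of BATVOX; the function names are hypothetical. It checks the total duration of the file rather than the net speech it contains, and it does not measure the signal-to-noise ratio, so passing these checks is necessary but not sufficient.

```python
import wave


def check_params(sample_rate, sample_width_bytes, channels, n_frames,
                 min_seconds=7.0):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if sample_rate != 8000:
        problems.append(f"sampling frequency is {sample_rate} Hz, not 8 kHz")
    if sample_width_bytes != 2:
        problems.append(f"resolution is {8 * sample_width_bytes}-bit, not 16-bit")
    if channels != 1:
        problems.append(f"{channels} channels, not mono")
    # Total duration only: net speech can be shorter than this.
    if sample_rate > 0 and n_frames / sample_rate < min_seconds:
        problems.append(f"shorter than {min_seconds} seconds")
    return problems


def check_wav(path):
    """Open a .wav file and run the checks above on its header values."""
    try:
        with wave.open(path, "rb") as wav:
            return check_params(wav.getframerate(), wav.getsampwidth(),
                                wav.getnchannels(), wav.getnframes())
    except wave.Error as exc:
        # The wave module rejects files it cannot read as PCM WAV data.
        return [f"not a readable linear PCM .wav file: {exc}"]


# Header values of an 8-second, 8 kHz, 16-bit mono recording pass every check.
print(check_params(8000, 2, 1, 8 * 8000))
# A 1-second, 16 kHz stereo recording fails three of them.
print(check_params(16000, 2, 2, 16000))
```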
LIMITATIONS OF SPEAKER IDENTIFICATION [44]
Short-duration samples are difficult to analyse properly
Samples in which the questioned and specimen languages differ are difficult to analyse
Emotional variability between the questioned and specimen samples [45]
Misspoken or misread prompted phrases
Poorly recorded or noisy samples are difficult to analyse [46]
An insufficient number of comparable words
Disguise in speech samples poses a problem for speaker recognition, and the degree of disguise has to be judged by the expert
Extreme emotional states (e.g. stress or duress) [47, 48]
A change in the physical state of the speaker (e.g. eating or the effect of ethanol) [49]
The attitude with which the speaker delivers the speech
Channel mismatch or a mismatch in recording conditions (e.g. different microphones used for enrolment and verification) [50]
A pronunciation speed in the test data that differs from that in the training data
Ageing (the vocal tract can drift away from the stored models with age) [53, 54]
ACCURACY IN SPEAKER RECOGNITION
To obtain accurate results from speaker recognition, particular emphasis must be placed on the following factors:
The collected samples should be at least 60 seconds long
The voice samples should be recorded under low-noise conditions, so that their signal-to-noise ratio is high
The characteristics of the instruments used
The skill of the examiner making the judgment
The examiner’s knowledge of the case
The examiner’s knowledge of the language in question [55]
The properties of the voices involved
Any delay in the examination of the samples [56]
The questioned and control samples should be in the same language
The expert should be competent to deal with cases involving disguised speech samples
CRITERIA FOR IDENTIFICATION
A listener may recognize a voice even without seeing the speaker. There are cues in voice and speech behaviour that are individual and thus make it possible to recognize familiar voices [57]. A person’s ability to control the muscles of the vocal tract during an utterance is learned in childhood. These habits affect the range of sounds that the individual can effectively produce, which is a subset of the set of sounds that he or she could create with his or her own vocal tract. It is not easy for an individual to change these physical characteristics voluntarily [58]. The speech wave is the response of the vocal tract filter system to one or more sound sources, and may be uniquely specified in terms of source and filter characteristics [59]. Data obtained from measurements of the acoustic properties of human voices are very different from DNA profiles: acoustic data are continuous, not discrete, and a speaker never says the same thing in exactly the same way twice. The strength of evidence from a forensic voice comparison therefore cannot be expressed as a match probability and must be expressed in the form of a full likelihood ratio [60]. It has been observed that trained professional examiners can make very reliable decisions when samples are obtained in the manner described. Studies have produced strong evidence that even very good mimics cannot duplicate another’s speech patterns [61].
The criteria for identifying speech samples with the different techniques are as follows:
Auditory analysis- In this method, identification is based on the following voice characteristics:
Quality of speech sample- Synthetic speech can be compared and evaluated with respect to intelligibility, naturalness and suitability for the intended application [62]. The characteristics assessed include pronunciation, accent, speech sounds such as vowels and consonants, plosives, fricatives, nasal and throat sounds and the coupling effect, grammar, stress, syllable stress, intonation, rhythm, fluency, pacing, phrasing and blending [63]. Each person possesses a unique voice quality, which depends on a number of anatomical features such as the dimensions of the oral tract, pharynx and nasal cavity, the shape and size of the tongue and lips, the position of the teeth and tissue density.
Linguistic features- Linguistics is the scientific study of natural language. These features include the stylistic impression of the speech, the delivery of the speech (the style in which it is delivered, i.e. manuscript, memorized, impromptu or extemporaneous [64]) and phonation (the process by which the vocal folds produce certain sounds through quasi-periodic vibration, or any oscillatory state of any part of the larynx that modifies the airstream, of which voicing is one example [65]).
Articulatory speech- This is speech produced by the movement, or articulation, of the articulators. It involves the flow of speech (which depends on the fluency of the speaker [66]), plosive formation (first a complete closure of the air passage at some point in the vocal tract, then removal of the closure, causing a sudden release of the blocked air with some explosive noise) and nasality (nasal consonants have a continuous full closure at some point in the oral cavity; since the velum is in the low position, opening the velopharyngeal port, air is let out through the nasal cavity).
Prosodic analysis- This involves the intonation pattern, the dynamics of loudness (dynamics refers to the volume of a sound or note, and loudness is the strength of the sensation received through the ear), speech rate (the relative timing of different speech events in spoken utterances), speech variations, striking-time features and pauses (number/length/pattern).
Voice impairment- Speech or language impairment (SLI) means a communication disorder, such as stuttering, impaired articulation, language impairment, or a voice impairment, that adversely affects a person’s educational performance. Speech and language disorders refer to problems in communication and related areas such as oral motor function. These delays and disorders range from simple sound substitutions to the inability to understand or use language or use the oral-motor mechanism for functional speech and feeding. Some causes of speech and language disorders include hearing loss, neurological disorders, brain injury, mental retardation, drug abuse, physical impairments such as cleft lip or palate, and vocal abuse or misuse. Frequently, however, the cause is unknown.
Temporal measurements- The temporal properties of speech play an important role in linguistic contrast. Speech can be said to comprise three main temporal features, based on dominant fluctuation rates: envelope, periodicity and fine structure. Each feature has distinct acoustic manifestations, auditory and perceptual correlates, and roles in linguistic contrasts [67]. These measurements involve the phonation-time (P/T) ratio, the speech-time (S/T) rate and speech bursts (number/length/pattern).
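Pause and burst measurements of the kind listed above can be approximated automatically from frame energies. The sketch below assumes that the phonation-time ratio means the fraction of the recording occupied by speech, which is one common reading and is not defined in the text; the energy threshold, the 10 ms frame length and the sample values are likewise hypothetical.

```python
from itertools import groupby


def temporal_measures(frame_energies, threshold, frame_seconds=0.01):
    """Classify each frame as speech or pause by an energy threshold, then
    report the share of speech frames and the lengths of bursts and pauses."""
    flags = [energy > threshold for energy in frame_energies]
    runs = [(is_speech, len(list(group))) for is_speech, group in groupby(flags)]
    burst_frames = [length for is_speech, length in runs if is_speech]
    pause_frames = [length for is_speech, length in runs if not is_speech]
    return {
        # Assumed definition: speech frames divided by all frames.
        "pt_ratio": sum(burst_frames) / len(flags) if flags else 0.0,
        "burst_seconds": [n * frame_seconds for n in burst_frames],
        "pause_seconds": [n * frame_seconds for n in pause_frames],
    }


# Hypothetical normalized energies for ten consecutive 10 ms frames.
energies = [0.9, 0.8, 0.7, 0.0, 0.0, 0.6, 0.9, 0.8, 0.7, 0.1]
measures = temporal_measures(energies, threshold=0.5)
print(measures["pt_ratio"], len(measures["burst_seconds"]),
      len(measures["pause_seconds"]))
```

A simple energy threshold is sensitive to background noise, so the examiner's auditory judgment remains the reference for where pauses actually fall.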
Spectrographic analysis- The spectrograph is an instrument used to analyse the complex waveforms of sound and their changes over time. This is done through spectrograms, which are graphic displays of amplitude as a function of both frequency and time [68]. In this method, clue words are selected from the questioned and the specimen samples on the basis of the auditory analysis and are then subjected to voice spectrographic analysis. A trained examiner may be able to give an opinion about the similarity between the two samples on the basis of characteristics such as:
Fundamental frequency- This is the frequency of vibration of the vocal cords, produced by their rapid opening and closing [69] (as shown in fig 5). The fundamental frequency of a periodic signal is the inverse of its period length, where the period is the smallest repeating unit of the signal [70]. In a voice spectrogram, the horizontal distance between vertical striations is an indication of the fundamental frequency. It also reflects the pitch of the voice, i.e. the rate of vibration of the vocal cords.
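Because the fundamental frequency is the inverse of the period, it can be estimated by finding the shift at which a frame of the signal best matches itself. The following sketch uses normalized autocorrelation, one of several possible techniques and not necessarily the one used by any tool named in this article; the search range of 75 to 400 Hz, the 8 kHz rate and the synthetic test signal are assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 as sample_rate / period, where the period is the lag
    (in samples) with the highest normalized autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        head, tail = frame[:-lag], frame[lag:]
        energy = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        if energy == 0.0:
            continue  # silent frame: no period to find
        score = sum(x * y for x, y in zip(head, tail)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Synthetic voiced frame: 125 Hz fundamental plus a weaker second harmonic,
# 50 ms long at 8 kHz. The period is 8000 / 125 = 64 samples.
RATE_HZ = 8000
voiced = [math.sin(2 * math.pi * 125 * i / RATE_HZ)
          + 0.5 * math.sin(2 * math.pi * 250 * i / RATE_HZ)
          for i in range(400)]
f0 = estimate_f0(voiced, RATE_HZ)
print(f0)
```

On real speech, estimators of this kind can lock onto half or double the true period, so the search range should be set with the expected speaker in mind.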
Software, BATVOX 3.0- The working of this software depends upon the following elements [43]:
Case- The repository of the audio files, models and calculations that belong to the same investigation or forensic case.
Audio file- The first element entered into the system, used to build the models and to run biometric calculations. Audio files in BATVOX can be classified into two main types:
Test audio: an unknown audio file that is compared with a suspect model in order to find out whether both belong to the same speaker.
Training audio: an audio file recorded from a known speaker, used to create a voice model that can be compared with the test audio files.
Model- A representation of the characteristics of the speaker’s voice, generated from the audio files.
Training of a model- A biometric process that extracts the characteristics of the voice from the audio samples and thus generates a model.
Session- A group of calculations gathered together, according to the user’s criteria, because they have some aspects in common. The calculations included in a session can be identifications and LR calculations.
Identification- The objective of speaker identification is to classify a voice whose origin is not known.
Likelihood ratio (LR)- A ratio of two probabilities: the likelihood that the test audio belongs to the suspect, and the likelihood that it does not. One of the differences between the LR and identification is the way in which the results are expressed.
Normalization- The process of correcting the effects that a lack of alignment has on statistical scoring. This lack of alignment is caused by the heterogeneous nature of the audio system.
Reference population- These samples are required for the calibration of the instrument. For a proper selection of the reference population, its characteristics should match those of the disputed speaker. These characteristics include the sex of the speaker, the channel type, the net spoken length and the language [75].
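The likelihood ratio described in the list above can be illustrated with a deliberately simple model. The sketch below does not reproduce the BATVOX calculation, which is not described in this article: it assumes that comparison scores under each hypothesis follow a normal distribution, fitted to hypothetical same-speaker scores and to scores obtained against a reference population.

```python
from statistics import NormalDist


def likelihood_ratio(score, same_speaker_scores, reference_population_scores):
    """LR = p(score | test audio is from the suspect)
          / p(score | test audio is from someone in the reference population),
    with each density modelled, as an assumption, by a fitted normal curve."""
    same = NormalDist.from_samples(same_speaker_scores)
    different = NormalDist.from_samples(reference_population_scores)
    return same.pdf(score) / different.pdf(score)


# Hypothetical comparison scores (arbitrary units).
suspect_vs_own_recordings = [2.0, 2.5, 3.0, 2.2, 2.8]
suspect_vs_reference_population = [-1.0, 0.0, 0.5, -0.5, 1.0]

lr_high = likelihood_ratio(2.5, suspect_vs_own_recordings,
                           suspect_vs_reference_population)
lr_low = likelihood_ratio(0.0, suspect_vs_own_recordings,
                          suspect_vs_reference_population)
print(lr_high > 1.0, lr_low < 1.0)
```

An LR above 1 supports the hypothesis that the test audio comes from the suspect, and an LR below 1 supports the alternative; how far it lies from 1 expresses the strength of that support. The choice of reference population directly shapes the denominator, which is why its characteristics need to match those of the disputed speaker.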
[1] Phil Rose & James R Robertson, “Forensic Speaker Identification”, Taylor & Francis, 1999
[2] Mohamed Chenafa et al, “Biometric System Based on Voice Recognition Using Multiclassifiers”, Springer Berlin / Heidelberg, Volume 5372/2008
[3] B.R. Sharma, “Scientific Criminal Investigation”, Universal Law Publishing Company
[4] “Definitions of speech”, (en.wikipedia.org/wiki/Speech)
[5] “National Institute on Deafness and other Communication Disorders (NIDCD)”, (www.nidcd.nih.gov/directory)
[6] Dennis C. Tanner & Matthew E. Tann