Week 6: Signal analysis and the source - filter model of speech
1. Recapitulation: Spectral analysis
- Filters and filtering
- Digital spectral analysis: the Fast-Fourier Transform (FFT)
- The frequency components in spectral analysis of a vowel
- Harmonic components (of the voice):
- Formant frequencies:
2. The ‘waterfall plot’
- A little historical background
- How the (analogue) spectrograph worked
- continuous recording drum (2.5 secs)
- analysing filter of variable centre frequency and 2 bandwidth settings
- electric stylus, electro-sensitive paper
- Broadband and narrow band spectrogram displays
- Spectrograms of high pitched (children’s and women’s) voices
- have greater frequency 'gaps' between harmonic components.
- Therefore we need to vary the analysing filter bandwidth for optimal broad and narrow band displays.
- Suggest an appropriate broad band filter bandwidth for a voice with an average f0=320 Hz.
- But even if the bandwidth is optimal, accurate estimation of formant frequencies is difficult for children’s vowels.
- Explain why this is so.
4. Steps in Signal processing
- Analogue transduction: microphones
- Signal digitization
- Signal processing
- parameter extraction
- feature analysis
- (automated) speech/speaker recognition
5. Transduction
- Properties of the transducer
- Signal fidelity:
- frequency response
- dynamic range
6. Digitisation of signals
- Analogue to digital conversion
- sampling rate (in Hz or kHz)
- quantization level (in bits per sample)
- Digitising speech
- What is an appropriate sampling rate?
- Quantization noise
- Obtaining an optimal dynamic range (S/N ratio)
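The relation between the two digitisation parameters and signal quality can be sketched numerically. This is a minimal illustration, not part of the original notes: it assumes an ideal uniform quantizer and a full-scale sine wave, for which the theoretical signal-to-quantization-noise ratio is about 6.02 dB per bit plus 1.76 dB, and it uses the Nyquist rule that a sampling rate can only represent frequencies up to half that rate. The function names are hypothetical.

```python
def nyquist_frequency(sampling_rate_hz):
    """Highest frequency that can be represented without aliasing."""
    return sampling_rate_hz / 2.0


def quantization_snr_db(bits_per_sample):
    """Theoretical S/N ratio of an ideal uniform quantizer (full-scale sine)."""
    return 6.02 * bits_per_sample + 1.76


def quantize(sample, bits_per_sample):
    """Round a sample in [-1.0, 1.0] to the nearest quantization step."""
    step = 2.0 / (2 ** bits_per_sample)
    return round(sample / step) * step


# Assumed example settings, chosen only for illustration.
for rate_hz in (8000, 16000, 44100):
    print(rate_hz, "Hz sampling -> frequencies up to", nyquist_frequency(rate_hz), "Hz")

for bits in (8, 12, 16):
    print(bits, "bits ->", round(quantization_snr_db(bits), 2), "dB theoretical S/N")

# Quantization noise is the difference between a sample and its rounded value.
sample = 0.3
print("8-bit quantization error:", sample - quantize(sample, 8))
```

The quantization error of any single sample is never larger than half a quantization step, which is why each extra bit (halving the step) adds roughly 6 dB of dynamic range.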
7. Signal processing
- Some time and frequency domain analysis methods
- zero crossing rate: simple time domain technique, for acoustic segmentation - obstruents vs voiced sonorants
- short term average energy trace (RMS energy)
- fundamental frequency trace, based upon autocorrelation analysis.
- the FFT (fast Fourier transform), for computing the short term spectrum.
- digital spectrogram
- calculated from a series of FFTs
- small window [64-128 points] yields broad band spectrogram
- large window [256-512 points] yields narrow band spectrogram
- Linear predictive coding (LPC)
- a computationally efficient estimate of the vocal tract resonances and voice fundamental frequency.
- used for speech compression in telephony and in speech synthesis
- assumptions of the LPC model:
- No secondary resonating cavities, such as nasal cavity,
- which introduce anti-resonances (zeros) into the vocal tract filter.
- No non-linear components in the model.
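The three time domain measures listed above (zero crossing rate, RMS energy and an autocorrelation-based fundamental frequency estimate) can be sketched as follows. This is a minimal sketch under stated assumptions: the "voiced" frame is a synthetic sum of three harmonics of 200 Hz, the "obstruent-like" frame is uniform random noise, the sampling rate is 8000 Hz, and the f0 search range of 75 to 500 Hz is an arbitrary choice. None of these values come from the original notes, and all function names are hypothetical.

```python
import math
import random

SAMPLING_RATE_HZ = 8000  # assumed for this sketch


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def autocorrelation_f0(frame, sampling_rate_hz, f0_min=75.0, f0_max=500.0):
    """Estimate f0 from the lag with the largest autocorrelation value."""
    min_lag = int(sampling_rate_hz / f0_max)
    max_lag = int(sampling_rate_hz / f0_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sampling_rate_hz / best_lag


def synthetic_voiced_frame(f0_hz=200.0, n_samples=320):
    """Sum of three harmonics with falling amplitude (a stand-in for a vowel)."""
    amplitudes = [1.0, 0.5, 0.25]
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0_hz * i / SAMPLING_RATE_HZ)
            for k, a in enumerate(amplitudes))
        for i in range(n_samples)
    ]


def synthetic_noise_frame(n_samples=320, seed=0):
    """Uniform random noise (a stand-in for a voiceless fricative)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


voiced = synthetic_voiced_frame()
noise = synthetic_noise_frame()
for name, frame in (("voiced", voiced), ("noise", noise)):
    print(name, "ZCR:", round(zero_crossing_rate(frame), 3),
          "RMS:", round(rms_energy(frame), 3))
print("estimated f0 of voiced frame:", autocorrelation_f0(voiced, SAMPLING_RATE_HZ), "Hz")
```

The noise frame should show a much higher zero crossing rate than the voiced frame, which is the basis for using this measure to separate obstruents from voiced sonorants.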
8. Parameter extraction
- Components of the Source - filter model:
- The glottal source spectrum
- The filter function, showing resonance peaks of vocal tract
- The output vowel spectrum
- A crude simulation of the Source - Filter model:
- Can you hear which vowels the model is trying to simulate?
- Why do these 'vowels' sound so 'machine like'?
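A crude source-filter simulation of this kind can be sketched in a few lines. This is an assumed reconstruction, not the demonstration the notes refer to: the glottal source is replaced by a simple impulse train at a constant f0, the vocal tract filter is a cascade of second-order digital resonators (one per formant), and the formant frequencies, the 80 Hz bandwidths and the 120 Hz f0 are illustrative values only. All function names are hypothetical.

```python
import math
import struct
import wave


def impulse_train(f0_hz, duration_s, sampling_rate_hz):
    """Idealised source: one unit impulse per pitch period, constant f0."""
    period = int(round(sampling_rate_hz / f0_hz))
    n_samples = int(round(duration_s * sampling_rate_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, centre_hz, bandwidth_hz, sampling_rate_hz):
    """Second-order digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sampling_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * centre_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants_hz, f0_hz=120.0, duration_s=0.5,
                     sampling_rate_hz=16000, bandwidth_hz=80.0):
    """Pass the source through one resonator per formant, then normalise."""
    signal = impulse_train(f0_hz, duration_s, sampling_rate_hz)
    for formant in formants_hz:
        signal = resonator(signal, formant, bandwidth_hz, sampling_rate_hz)
    peak = max(abs(s) for s in signal)
    return [0.9 * s / peak for s in signal]


def write_wav(path, samples, sampling_rate_hz=16000):
    """Save samples in [-1, 1] as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sampling_rate_hz)
        wav_file.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))


# Illustrative formant values (Hz); not measurements from any speaker.
samples = synthesize_vowel([500.0, 1500.0, 2500.0])
print(len(samples), "samples synthesised")
```

To listen to the result, call `write_wav("vowel.wav", samples)` (the file name is arbitrary) and try other formant values to hear how the vowel quality changes while the source stays the same.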
10. Modeling the vocal tract resonators:
- The simplest case: a tube of uniform width, open at one end.
- Resonance in a closed pipe:
- Standing waves
- A 17 cm tube has its first resonant frequency at about 500 Hz
- second resonant frequency at about 1500 Hz
- third resonant frequency at about 2500 Hz ....
- What vowel does the uniform tube resonator model?
- Formula for resonances in a closed pipe of uniform width, open at one end:
- Fn = (2n - 1)c / (4L)
- where n = formant (resonance) number
- L = length of tube in cm
- c = speed of sound in air (35,000 cm/sec)
- Fn = nth resonant frequency in Hz
- With this formula and a spectrogram of the neutral vowel [ə], you should be able to estimate the length of a speaker’s vocal tract.
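The quarter-wavelength formula can be checked with a short calculation, and inverted to estimate tube length from a measured resonance. This is a minimal sketch using the speed of sound given in these notes (35,000 cm/sec). The 17.5 cm length is chosen because it gives round numbers, and the function names are hypothetical.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # value used in these notes


def tube_resonance_hz(n, length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """nth resonance of a uniform tube closed at one end, open at the other."""
    return (2 * n - 1) * c / (4.0 * length_cm)


def tube_length_cm(n, resonance_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Invert the formula: tube length implied by the nth resonance."""
    return (2 * n - 1) * c / (4.0 * resonance_hz)


# Resonances of a 17.5 cm tube.
for n in (1, 2, 3):
    print("F%d =" % n, tube_resonance_hz(n, 17.5), "Hz")

# Length estimates from each of three resonances (illustrative input values).
for n, formant_hz in ((1, 500.0), (2, 1500.0), (3, 2500.0)):
    print("from F%d:" % n, tube_length_cm(n, formant_hz), "cm")
```

With real spectrogram measurements the three length estimates will not agree exactly, because a real vocal tract is not a perfectly uniform tube. Averaging the estimates is one simple way to combine them.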
11. Multiple tube models of the vocal tract resonances:
- Two tube model of the vocal tract:
- It is possible to approximate most pure vowels, though the sizes and lengths of the tubes are not accurate models of vocal cavities.
- Three tube model of the vocal tract:
- Contains a constriction in mid-region of the vocal tract
- divides the oral from the pharyngeal cavity:
- a feature of the Homo sapiens vocal tract, caused by low placement of the larynx in the throat.
- The rear tube + constriction forms a 'Helmholtz' resonator.
- The resonances of this complex tube are determined by:
- the length of the front tube
- the length of the back tube
- the location of the constriction
- The three tube model is useful in understanding why the lip rounding - spreading contrast correlates with the back - front feature of vowels in the world's languages.
- An 'n' tube model:
- A close model of the vocal tract can be obtained by considering the cross-sectional area function of the vocal tract.
- The vocal tract is modelled as a series of tubes of varying cross-sectional areas.
- Each tube is approximately 0.5 cm in length and circular in cross-sectional shape;
- Tube widths are based on cross-sectional x-ray images of the vocal tract.
- This model also makes simplifying assumptions (e.g. energy loss due to the soft-tissue walls of the vocal tract is not modelled).
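The Helmholtz resonator formed by the rear tube and the constriction in the three tube model can be given a rough numerical form. The sketch below uses the standard Helmholtz formula, f = (c / 2π) × sqrt(A / (V × l)), where A is the cross-sectional area of the constriction, l its length and V the volume of the cavity behind it. The dimensions are hypothetical values chosen only to show the direction of the effect, and the function name is an assumption.

```python
import math

SPEED_OF_SOUND_CM_PER_S = 35000.0  # value used in these notes


def helmholtz_resonance_hz(neck_area_cm2, neck_length_cm, cavity_volume_cm3,
                           c=SPEED_OF_SOUND_CM_PER_S):
    """Resonant frequency of a cavity that opens through a narrow neck."""
    return (c / (2.0 * math.pi)) * math.sqrt(
        neck_area_cm2 / (cavity_volume_cm3 * neck_length_cm)
    )


# Hypothetical dimensions: the same back cavity with a narrower or wider constriction.
for area_cm2 in (0.2, 0.5, 1.0):
    frequency = helmholtz_resonance_hz(area_cm2, 2.0, 50.0)
    print("constriction area", area_cm2, "cm^2 ->", round(frequency, 1), "Hz")
```

Under these assumptions a narrower constriction or a larger back cavity lowers the resonance.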
12. Vowel Formant chart:
- A plot of F1 - F2 in Hz of different vowel sounds approximates the vowel quadrilateral.
- A strong argument for the auditory nature of the vowel space.
- A better approximation of the vowel quadrilateral, and the perceptual relations among vowel sounds is obtained by plotting the formant frequencies in Bark (or some other equal interval pitch scale), and by plotting (F2-F1) instead of F2 on the horizontal axis.
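The conversion to Bark and the (F2 - F1) axis can be sketched as follows. This example assumes Traunmüller's (1990) approximation, z = 26.81 f / (1960 + f) - 0.53, which is only one of several published Hz-to-Bark formulas. The three formant pairs are illustrative values, not measurements, and the function names are hypothetical.

```python
def hz_to_bark(frequency_hz):
    """Traunmueller (1990) approximation of the Bark scale."""
    return 26.81 * frequency_hz / (1960.0 + frequency_hz) - 0.53


def vowel_chart_coordinates(f1_hz, f2_hz):
    """Return (x, y) = (F2 - F1, F1), both in Bark, for a vowel chart."""
    f1_bark = hz_to_bark(f1_hz)
    f2_bark = hz_to_bark(f2_hz)
    return f2_bark - f1_bark, f1_bark


# Illustrative (F1, F2) values in Hz; not measured data.
example_vowels = {
    "high front": (300.0, 2300.0),
    "low central": (750.0, 1200.0),
    "high back": (320.0, 900.0),
}

for label, (f1, f2) in example_vowels.items():
    x, y = vowel_chart_coordinates(f1, f2)
    print(label, "-> F2-F1 =", round(x, 2), "Bark, F1 =", round(y, 2), "Bark")
```

To make the plot resemble the vowel quadrilateral, both axes are conventionally reversed, so that high front vowels appear at the top left.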
13. Australian and New Zealand vowels
- An F1-F2 plot reveals the expected phonetic differences between OZ and NZ vowels. What are these differences?
14. Measuring formant frequencies and formant trajectories:
- Monophthongal (one-target) vowels may be measured at the centre or steady state portion of the vowel.
- Allowing for co-articulation effects; formant transitions.
- Diphthongs (two-target vowels): problems determining the second target, and the status of the “glide” (the formant trajectory between targets).
- Approximating formant trajectories.