A Novel Approach to Speaker Weight Estimation Using a Fusion of the i-vector and NFA Frameworks

Document Type: Research Article

Authors

1 Aalborg University

2 KU Leuven

Abstract

This paper proposes a novel approach for automatic speaker weight estimation from spontaneous telephone speech signals. Each utterance is modeled using two frameworks: the i-vector framework, which is based on factor analysis of Gaussian Mixture Model (GMM) mean supervectors, and the Non-negative Factor Analysis (NFA) framework, which is based on a constrained factor analysis of GMM weight supervectors. The information available in both the Gaussian means and the Gaussian weights is then exploited through a feature-level fusion of the i-vectors and the NFA vectors. Finally, least-squares support vector regression (LSSVR) is employed to estimate the weight of speakers from the given utterances.
The proposed approach is evaluated on spontaneous telephone speech signals from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) corpora. To investigate its effectiveness, the method is compared with i-vector-based speaker weight estimation and with an alternative fusion scheme, score-level fusion. Experimental results over 2339 utterances show that the correlation coefficients between the actual and the estimated weights of female and male speakers are 0.49 and 0.56, respectively, which indicates the effectiveness of the proposed method in speaker weight estimation.
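The pipeline summarized in the abstract (feature-level fusion, LSSVR, and evaluation by correlation coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the kernel or the hyperparameters, so the RBF kernel, the values of `gamma` and `sigma2`, all function names, and the tiny synthetic vectors and weights below are assumptions made purely for illustration. Feature-level fusion is taken here to mean concatenating the two vectors of an utterance.

```python
import math

def fuse(ivector, nfa_vector):
    """Feature-level fusion, assumed here to be plain concatenation."""
    return list(ivector) + list(nfa_vector)

def rbf_kernel(a, b, sigma2):
    """RBF kernel (an assumed choice; the abstract names no kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma2))

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def lssvr_fit(X, y, gamma=10.0, sigma2=1.0):
    """Solve the LSSVR system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma2) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularization on the kernel diagonal
        A.append(row)
    z = solve(A, [0.0] + list(y))
    return z[0], z[1:]  # bias b, dual coefficients alpha

def lssvr_predict(x, X, b, alpha, sigma2=1.0):
    """Estimate f(x) = b + sum_i alpha_i * k(x, x_i)."""
    return b + sum(a * rbf_kernel(x, xi, sigma2) for a, xi in zip(alpha, X))

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Synthetic placeholder data (NOT from the paper): four "utterances".
ivectors = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.1], [0.3, 0.6]]
nfa_vectors = [[0.2], [0.7], [0.1], [0.9]]
weights_kg = [62.0, 80.0, 55.0, 91.0]

X = [fuse(i, n) for i, n in zip(ivectors, nfa_vectors)]
b, alpha = lssvr_fit(X, weights_kg)
estimates = [lssvr_predict(x, X, b, alpha) for x in X]
print("correlation on the toy training set:", pearson(weights_kg, estimates))
```

In a real experiment the regressor would be trained and evaluated on disjoint sets of speakers, and the hyperparameters would be tuned on held-out data. The correlation printed here is measured on the training samples themselves and says nothing about the performance reported above.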
