Multi-Layer Perceptron with Advanced Acoustic Features for Speech Emotion Recognition in Education Evaluation

Muhammad Afiq Tamamul Wafa

Abstract

Traditional methods for evaluating lecturer performance in education, such as student surveys, are often limited by their subjective nature. This study explores the development of an objective framework that complements these evaluations through Speech Emotion Recognition (SER). The research uses a specialized Indonesian speech emotion dataset and applies data augmentation techniques to improve model generalization. A set of advanced acoustic features, including Mel-Frequency Cepstral Coefficients (MFCCs), Chroma, and Spectral Contrast, along with their statistical variations, is used to represent vocal expressions. A Multi-Layer Perceptron (MLP) neural network was designed and trained on these features to classify five emotions: happy, angry, sad, surprised, and neutral. The resulting model achieved an overall classification accuracy of 94%, with high precision, recall, and F1-scores across all emotions, indicating a balanced and reliable system. An analysis of critical features was also conducted, revealing the significance of the standard deviation of the Chroma and MFCC features. This study shows that an MLP model paired with feature engineering can serve as a powerful and objective tool for providing deeper insights into student feedback, contributing a valuable new methodology for quality assurance in higher education.
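The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the frame-level values are synthetic placeholders rather than real MFCC, Chroma, or Spectral Contrast measurements, the layer sizes and ReLU activation are assumptions, and the weights are random instead of trained. It only shows how per-frame features might be collapsed into mean and standard deviation statistics and passed through an MLP that outputs one probability per emotion.

```python
import math
import random
from statistics import fmean, pstdev

# The five emotion classes named in the abstract.
EMOTIONS = ["happy", "angry", "sad", "surprised", "neutral"]


def summarize_frames(frames):
    """Collapse an (n_frames x n_coeffs) matrix into a fixed-length vector:
    per-coefficient means followed by per-coefficient standard deviations."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax (subtracts the maximum before exponentiating)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(features, layers):
    """Hidden layers use ReLU (an assumption); the output layer uses softmax."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


def random_layer(rng, n_in, n_out):
    """Hypothetical untrained layer: random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Placeholder input: 20 frames x 4 coefficients of synthetic noise.
frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(20)]
features = summarize_frames(frames)  # 4 means + 4 standard deviations

# Assumed architecture: one hidden layer of 8 units, 5 output classes.
layers = [random_layer(rng, len(features), 8),
          random_layer(rng, 8, len(EMOTIONS))]
probabilities = mlp_forward(features, layers)
predicted = EMOTIONS[probabilities.index(max(probabilities))]
```

Because the weights are untrained, the predicted label here carries no meaning. In practice the statistics would be computed from features extracted from real recordings, and the weights would be learned from the labeled dataset.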

Article Details

How to Cite
Tamamul Wafa, M. A. (2026). Multi-Layer Perceptron with Advanced Acoustic Features for Speech Emotion Recognition in Education Evaluation. Journal of Informatics Information System Software Engineering and Applications (INISTA), 8(1), 53-61. https://doi.org/10.20895/inista.v8i1.2096
Section
Centive 2025
