Multi-Layer Perceptron with Advanced Acoustic Features for Speech Emotion Recognition in Education Evaluation

Muhammad Afiq Tamamul Wafa

doi:10.20895/inista.v8i1.2096

Published Feb 11, 2026

Muhammad Afiq Tamamul Wafa

Telkom University Purwokerto

Abstract

Traditional methods for evaluating lecturer performance in education, such as student surveys, are often limited by their subjective nature. This study explores the development of an objective framework that complements these evaluations through Speech Emotion Recognition (SER). The research uses a specialized Indonesian speech emotion dataset and applies data augmentation to improve model generalization. A set of advanced acoustic features, including Mel-Frequency Cepstral Coefficients (MFCCs), Chroma, and Spectral Contrast, together with their statistical variations, is used to represent vocal expressions. A Multi-Layer Perceptron (MLP) neural network was designed and trained on these features to classify five emotions: happy, angry, sad, surprised, and neutral. The resulting model achieved an overall classification accuracy of 94%, with high precision, recall, and F1-scores across all emotions, indicating a balanced and reliable system. A critical feature analysis was also conducted, revealing the significance of the standard deviation of the Chroma and MFCC features. This study shows that an MLP model paired with feature engineering can serve as a powerful and objective tool for providing deeper insights into student feedback, contributing a valuable new methodology for quality assurance in higher education.
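The "statistical variations" of frame-level features mentioned in the abstract can be illustrated with a minimal sketch. It assumes each feature family (MFCC, Chroma, Spectral Contrast) arrives as a matrix with one row per coefficient and one column per analysis frame, and that the summary statistics are the per-row mean and standard deviation. The abstract confirms only the standard deviation; the mean, the function name pool_features, and the toy values are assumptions made for illustration, not the author's implementation.

```python
from statistics import fmean, pstdev


def pool_features(frames):
    """Collapse a frame-level feature matrix into one fixed-length vector.

    `frames` is a list of rows, one row per coefficient (for example one
    MFCC band), each row holding that coefficient's value in every frame.
    Returns all per-row means followed by all per-row standard deviations.
    """
    means = [fmean(row) for row in frames]
    # Population standard deviation (divides by N, not N - 1).
    stds = [pstdev(row) for row in frames]
    return means + stds


# Hypothetical toy values standing in for real extractor output:
# two "MFCC" rows and one "Chroma" row, four frames each.
mfcc = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
chroma = [[0.5, 0.5, 1.5, 1.5]]

# Concatenate the pooled families into the vector an MLP would consume.
vector = pool_features(mfcc) + pool_features(chroma)
print(len(vector))
```

Pooling in this way gives every recording a vector of the same length regardless of its duration, which is what a fixed-input classifier such as an MLP requires.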

How to Cite

Tamamul Wafa, M. A. (2026). Multi-Layer Perceptron with Advanced Acoustic Features for Speech Emotion Recognition in Education Evaluation. Journal of Informatics Information System Software Engineering and Applications (INISTA), 8(1), 53-61. https://doi.org/10.20895/inista.v8i1.2096

Issue

Vol 8 No 1 (2025): November 2025

Section

Centive 2025

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

