Determining Air Quality Influential Parameters Using Machine Learning Techniques

Evita Fitri; Andi Saryoko

doi:10.20895/dinda.v4i2.1567

Evita Fitri Universitas Nusa Mandiri https://orcid.org/0009-0007-2778-6110
Andi Saryoko Universitas Nusa Mandiri

Keywords: air quality, data preprocessing, gradient boosted trees, machine learning, normalization, outliers, PM10, PM2.5, random forest regression

Abstract

Air quality is an important issue for public health and the environment. This research aims to develop air quality prediction models for the PM₁₀ and PM₂.₅ parameters using several regression and machine learning approaches. The dataset consists of Air Pollutant Standard Index (ISPU) data from a number of stations in the Jakarta area, observed from January to April 2024. The research method comprises data collection, a literature review, and the testing of several machine learning models. Outliers were handled with the Numeric Outliers node, and the data were normalized before being split into training and testing sets. The models evaluated are Linear Regression, Random Forest Regression, Gradient Boosted Trees, and Multilayer Perceptron (MLP), validated with 10-fold cross-validation. The results show that the Random Forest Regression and Gradient Boosted Trees models provide good prediction performance for both PM₁₀ and PM₂.₅. Random Forest Regression achieved the lowest RMSE on the testing data for PM₁₀ (0.048) and PM₂.₅ (0.037), while Gradient Boosted Trees achieved the lowest RMSE on the training data for PM₂.₅ (0.032). Handling outliers and normalizing the data improved the prediction accuracy of the models. Suggestions for future research include exploring new models, adding meteorological and socio-economic variables, and applying the models in real-time air quality monitoring systems.
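The workflow summarized in the abstract (outlier handling, normalization, train/test split, four regressors, 10-fold cross-validation, RMSE) can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation. The synthetic data, the column layout, the 1.5 × IQR outlier rule, the 70/30 split, and all hyperparameters are hypothetical choices made only so the example runs; none of them come from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPRegressor


def make_synthetic(n_rows=300, seed=0):
    """Hypothetical stand-in for ISPU data: 3 pollutant columns + 1 target column."""
    rng = np.random.default_rng(seed)
    features = rng.gamma(shape=2.0, scale=10.0, size=(n_rows, 3))
    target = 0.5 * features[:, 0] + 0.3 * features[:, 1] + rng.normal(0.0, 2.0, n_rows)
    return np.column_stack([features, target])


def iqr_mask(data, k=1.5):
    """True for rows whose every value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return np.all((data >= q1 - k * iqr) & (data <= q3 + k * iqr), axis=1)


def min_max(data):
    """Scale each column to the range [0, 1]."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo)


def evaluate(X, y, seed=0):
    """Return {model name: (mean 10-fold CV RMSE on training data, RMSE on test data)}."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest Regression": RandomForestRegressor(n_estimators=50, random_state=seed),
        "Gradient Boosted Trees": GradientBoostingRegressor(random_state=seed),
        "MLP": MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=seed),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        scores = cross_val_score(
            model, X_tr, y_tr, cv=cv, scoring="neg_root_mean_squared_error"
        )
        model.fit(X_tr, y_tr)
        test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        results[name] = (float(-scores.mean()), float(test_rmse))
    return results


if __name__ == "__main__":
    raw = make_synthetic()
    # Same order as the abstract: drop outliers, normalize, then split.
    clean = min_max(raw[iqr_mask(raw)])
    for name, (cv_rmse, test_rmse) in evaluate(clean[:, :-1], clean[:, -1]).items():
        print(f"{name}: CV RMSE={cv_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

The sketch normalizes before splitting because that is the order the abstract describes. Fitting the scaling on the training portion only would avoid passing information from the test set into the model.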