Determining Air Quality Influential Parameters Using Machine Learning Techniques

Keywords: air quality, data preprocessing, gradient boosted trees, machine learning, normalization, outliers, PM10, PM2.5, random forest regression

Abstract

Air quality is an important public health and environmental issue. This research aims to develop an air quality prediction model for the PM10 and PM2.5 parameters using several regression and machine learning approaches. The dataset consists of Air Pollutant Standard Index (ISPU) data from a number of stations in the Jakarta area, observed from January to April 2024. The research method comprises dataset collection, a literature review, and testing of several machine learning models. Outliers were handled with the Numeric Outliers node, and the data were normalized before being split into training and testing sets. The models evaluated are Linear Regression, Random Forest Regression, Gradient Boosted Trees, and Multilayer Perceptron (MLP), validated with 10-fold cross-validation. The results show that Random Forest Regression and Gradient Boosted Trees gave good prediction performance for both PM10 and PM2.5. Random Forest Regression had the lowest RMSE on the testing data for PM10 (0.048) and PM2.5 (0.037), while Gradient Boosted Trees had the lowest RMSE on the training data for PM2.5 (0.032). Handling outliers and normalizing the data improved the prediction accuracy of the models. Suggestions for future research include exploring new models, adding meteorological and socio-economic variables, and applying the models in real-time air quality monitoring systems.
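The preprocessing and validation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow. It assumes an interquartile-range rule for outlier removal, min-max scaling for normalization, and a one-variable least-squares fit as a stand-in for the four models evaluated. The column names `pm10` and `pm25`, the function names, and the synthetic data are all hypothetical.

```python
import random
import statistics


def iqr_filter(rows, column, k=1.5):
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if lo <= r[column] <= hi]


def min_max(values):
    """Scale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def k_fold_rmse(x, y, k=10, seed=0):
    """Mean held-out RMSE of the least-squares fit over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [slope * x[i] + intercept for i in test]
        scores.append(rmse([y[i] for i in test], preds))
    return statistics.mean(scores)


# Synthetic placeholder data, NOT the ISPU measurements from the study.
rng = random.Random(42)
rows = []
for _ in range(200):
    pm10 = rng.uniform(20, 80)
    rows.append({"pm10": pm10, "pm25": 1.2 * pm10 + rng.gauss(0, 5)})
rows.append({"pm10": 500.0, "pm25": 30.0})  # one injected outlier

clean = iqr_filter(rows, "pm10")
x = min_max([r["pm10"] for r in clean])
y = min_max([r["pm25"] for r in clean])
cv_rmse = k_fold_rmse(x, y, k=10)
print(f"rows kept: {len(clean)} of {len(rows)}, 10-fold RMSE: {cv_rmse:.3f}")
```

Because both variables are scaled to [0, 1] before fitting, the RMSE is expressed in normalized units, which is presumably also why the RMSE values reported in the abstract are small fractions.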

Published
2024-08-06
Section
Articles