TY - JOUR
T1 - Spatiotemporal data partitioning for distributed random forest algorithm
T2 - Air quality prediction using imbalanced big spatiotemporal data on spark distributed framework
AU - Asgari, Marjan
AU - Yang, Wanhong
AU - Farnaghi, M.
N1 - Publisher Copyright: © 2022 The Author(s)
PY - 2022/8
Y1 - 2022/8
N2 - Spatiotemporal air quality datasets are typically collected hourly at monitoring stations deployed non-uniformly across a metropolitan city. These datasets are not only big, which poses challenges for the storage and processing capacity of centralized computing systems, but also imbalanced and spatially heterogeneous, which may result in biased air quality prediction. To address these challenges, we designed and developed a parallel air quality prediction system equipped with a spatiotemporal data partitioning method, a distributed machine learning algorithm, Hadoop's distributed data storage platform and its resource scheduler/manager, and Spark's efficient in-memory execution environment, which is suitable for running iterative algorithms such as machine learning. Our proposed spatiotemporal partitioning method accounted for the imbalance and spatial heterogeneity of big air quality data in predictive models while complying with the load-balancing requirement of distributed computing systems. The Distributed Random Forest algorithm in the H2O library of the Spark framework was selected as the distributed machine learning algorithm to develop the air quality predictive model. This algorithm is an ensemble forest with algorithm-level adjustments to perform as efficiently as possible on big imbalanced datasets. An application of the parallel air quality prediction system to Tehran, Iran, showed that the system achieved a considerable speedup and improved both the overall accuracy and the class precision of air quality prediction when working with imbalanced big spatiotemporal air quality datasets. A future research direction is to add data streaming and visualization functions to the system to provide rapid and reliable air quality prediction in support of environmental health management.
KW - Air quality prediction
KW - Big spatiotemporal data
KW - Distributed random forest algorithm
KW - Distributed systems
KW - Imbalanced data
U2 - 10.1016/j.eti.2022.102776
DO - 10.1016/j.eti.2022.102776
M3 - Article
AN - SCOPUS:85133263461
SN - 2352-1864
VL - 27
JO - Environmental Technology and Innovation
JF - Environmental Technology and Innovation
M1 - 102776
ER -