## Data availability

Data will be made available on request.

Multi-view multi-task spatiotemporal graph convolutional network for air quality prediction

Air quality prediction

Graph convolutional network

Multi-task learning

Multi-view learning

Air pollution has been a concern in recent years because it is not only a key risk factor for non-communicable diseases, including chronic obstructive pulmonary disease, asthma, lung disease, and cardiovascular diseases, but also imposes enormous costs on the economy and the environment (Howse et al., 2021). Accurate air quality prediction is therefore essential for both the government and citizens. On the one hand, it helps the government make informed decisions, such as restricting traffic based on odd-even license plates and ordering factories to shift production to off-peak periods. On the other hand, it allows residents to make reasonable travel plans and reduce unnecessary travel.

Existing studies fall into two main groups: traditional methods based on parametric models and deep learning-based methods. Traditional time-series prediction algorithms, such as autoregressive (AR), moving average (MA), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA) models (Pan et al., 2012), build a linear model to learn time-series patterns. However, these algorithms impose strict stationarity requirements on the time series and consider only the historical data of a single station. They achieve poor results because they capture only trends within a time series and ignore the complex relationships between stations. Statistical models, such as support vector regression (SVR) (Chang and Lin, 2011), likewise simplify the problem into individual time-series predictions. With the development of deep learning, researchers have proposed many deep learning-based methods for air quality prediction. Deep learning models excel at extracting spatiotemporal dependencies in fields such as traffic prediction (Yu et al., 2021; Li et al., 2021) and air quality prediction (Liang et al., 2018; Zheng et al., 2013; Li et al., 2016; Luo et al., 2019; Qi et al., 2018; Yi et al., 2018). Recurrent neural networks (RNNs) (Liang et al., 2018; Zhang et al., 2019), such as the Gated Recurrent Unit (GRU), are applied to model complex sequential interactions. Recently, graph neural networks have been increasingly used to model complex spatial correlations. Several works (Wang et al., 2021; Yu et al., 2018; Ma et al., 2019) use graph convolutional networks to handle spatial dependencies together with RNNs (e.g., gated recurrent units or long short-term memory units) to process temporal dependencies. Despite substantial research, air quality prediction remains challenging due to subtle spatiotemporal interactions and predisposing factors. Graph convolution has also been applied to air quality prediction to capture geographical spatial neighbor information (Wang et al., 2021; Han et al., 2021a; Han et al., 2021b). However, most of these works consider only the relationships between adjacent stations within the actual geographic area and disregard the influence of non-geographic space.

As illustrated in Fig. 1, the air quality of station A and station B, geographically adjacent, is similar. Furthermore, although station A and station C are geographically distant, they exhibit similar patterns. Stations A, B, and C demonstrate actual spatial interactions associated with air quality and other spatial relations. This paper calls the non-geographic spatial relationships logical spatial relationships. Existing work focuses almost exclusively on modeling single spatial interactions but ignores the interaction of multi-spatial relationships. Recently, ATGCN (Wang et al., 2021) considered various types of relationships between stations for the first time and designed a late fusion of distinct types of relationships, which resulted in the model ignoring low-level interactions between different types of relationships.

To address the issues raised previously, we propose a Multi-View Multi-Task Spatiotemporal Graph Convolutional Network *named* M2 for predicting the air quality of citywide air quality stations. M2 consists of a multi-view encoder and multi-task classification regression decoders. On the one hand, the multi-view encoder section considers the impact on air quality from three perspectives: geographic spatial view, logical spatial view, and temporal view. First, geographic and logical spatial views model diverse spatial linkages using graph convolutional networks. Next, the attention mechanism combines two spatial representations at each time interval. Finally, the fused representation is passed through a Gated Recurrent Unit (GRU) to generate a representation that incorporates all three views. On the other hand, predicting air quality is divided into two subtasks: regression, which predicts the value of air quality, and classification, which predicts the level of air quality. The value prediction task is the primary task, while the level prediction task is an ancillary task.

The main contributions of our study are threefold:

- We propose a multi-task deep learning-based framework for air quality prediction to address the insufficient accuracy of existing air quality predictions. The framework considers both air quality value and level and incorporates their joint influence on future air quality.
- We design a unified multi-view model that simultaneously considers geographical spatial, temporal, and logical spatial relationships. We aggregate neighbor information from different spatial views, fuse it with an attention mechanism at each time interval to obtain a comprehensive representation, and use a GRU to capture temporal information.
- Our extensive experiments on two real-world datasets demonstrate that M2 achieves better forecasting performance than baselines of different types.

The rest of the paper is structured as follows: in Section 2, we review the existing methods from which we got the inspiration. Section 3 gives notations and mathematically restates the air quality prediction problem. This is followed by details of our M2 in Section 4. Next, Section 5 describes our experiment setup and reports the results. Finally, Section 6 concludes this work and discusses possible future extensions.

2.1. Air quality prediction

Air quality prediction has garnered considerable attention, and several effective approaches have been proposed. These works can be classified into statistical learning-based methods and deep learning-based methods. Statistical learning-based methods mainly include linear regression, ARIMA, and SVR (Chang and Lin, 2011; Howse et al., 1997; Díaz-Robles et al., 2008). Linear regression predicts air quality by identifying relationships between air quality and relevant features. ARIMA discovers patterns in historical data over time and uses them to predict future air quality. Sánchez et al. (2013) establish a highly nonlinear urban air quality model based on SVM. However, these methods produce poor results because it is challenging to model the relationships among all influencing factors.

Recently, researchers have attempted to use deep learning-based methods to predict air quality. Liang et al. (2018) proposed a multi-level attention-based recurrent neural network, named GeoMAN, to predict the readings of a station over the next few hours. To further improve prediction performance, Multi-Group (Zhang et al., 2019) models local weather influences and fuses heterogeneous data for next-day air quality prediction. However, these models ignore the fact that some distant stations can also have strong dependencies due to high similarities in other aspects. Later, ATGCN (Wang et al., 2021) modeled diverse inter-station relationships for air quality prediction of citywide stations. However, all of the above works still ignore the influence of air quality levels on air quality prediction tasks.

Therefore, this paper proposes a novel model that takes air quality level as an auxiliary task to predict air quality value better.

2.2. Graph convolutional networks in spatiotemporal prediction

Graph convolutional networks extend convolutional neural networks to non-Euclidean data. In recent years, they have been widely used for spatial-temporal prediction tasks such as traffic prediction and air quality prediction.

Graph convolutional networks can be divided into two categories: spatial-based and spectral-based. Spectral-based methods treat the graph convolution operation as removing noise from a graph signal. Spatial-based methods update node information by designing different strategies to aggregate the features of each node's neighbors. The diffusion GCN (Li et al., 2018) was proposed to process a graph with two different directions for traffic forecasting. Wu et al. (2019) stack graph convolution layers to predict spatial-temporal graph data with long-range temporal sequences. Wang et al. (2021) encode three types of relationships among stations into graphs and design a parallel GCN-based encoder-decoder architecture to generate multi-interval air quality predictions for all stations.

Building on this prior research, we construct multi-view graphs to reflect the similarities across stations from various semantic perspectives and then use multiple graph convolutions to capture different semantic spatial-temporal correlations of air quality patterns.

Multi-task learning (MTL) is a common solution for learning multiple related tasks (Caruana, 1997). A single-task model learns only limited features and ignores information from other tasks, whereas MTL facilitates knowledge transfer between tasks. MTL has been widely used in many fields, such as natural language processing and spatial-temporal prediction (Wang et al., 2020; Liu et al., 2016). Wang et al. (2020) observe that crowd flow and the origin-destination locations of flow trajectories are highly correlated and affect each other, so they predict both tasks simultaneously to achieve better performance. Liu et al. (2016) treat each station as a task to predict water quality.

In this paper, we utilize air quality level as an auxiliary task for air quality prediction.

To begin with, we introduce some necessary notations and then mathematically restate the air quality prediction problem.

**Notations**. Assume an area has $N$ air quality monitoring stations, represented by the set $S = \{S_1, S_2, \dots, S_N\}$. Each station has inherent properties, such as id, longitude, and latitude. Moreover, the points of interest (POIs) and road networks (RNs) around each station $S_i$ are counted and denoted as $P_i$ and $R_i$, respectively. Specifically, we jointly represent $P_i$ and $R_i$ as $\mathit{PR}_i = [P_i, R_i]$, and the logical spatial information of all stations is denoted as $\mathit{PR} \in \mathbb{R}^{N \times q}$, where $q$ is the logical spatial feature dimension of each station.

Let $X_t \in \mathbb{R}^{N \times m}$ denote the features of all stations at time interval $t$, including meteorological data (e.g., temperature, humidity, air pressure, wind speed, wind direction, etc.) and air quality data (e.g., PM2.5, PM10, O3, NO2, SO2, etc.). Let $X^{S_i} \in \mathbb{R}^{T \times m}$ denote the features of station $S_i$. Specifically, we take PM2.5 as the target air pollutant. Let $\widehat{Y}_{t+\tau}^{\mathit{value}} \in \mathbb{R}^{N}$ and $\widehat{Y}_{t+\tau}^{\mathit{level}} \in \mathbb{R}^{N}$ denote the value and level of PM2.5 at time interval $t+\tau$, respectively.

**Problem Statement**. Given past meteorological data, air quality data, POI data, and road network data, we make multi-step predictions of air quality values and levels separately. To capture the geographical spatial relationship, we define the $N$ correlated air quality stations as a weighted directed graph $\mathcal{G}_G = (S, \mathcal{E}_G, A_G)$, where $S$ is a set of $|S| = N$ nodes, $\mathcal{E}_G$ is a set of edges, and $A_G \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix described in detail later. Similarly, we design a logical spatial graph $\mathcal{G}_L = (S, \mathcal{E}_L, A_L)$, where $A_L \in \mathbb{R}^{N \times N}$. The air quality prediction problem aims to learn a function $\mathcal{F}$ that simultaneously predicts the air quality values and levels of the next $T'$ time intervals, given the data of the past $T$ time intervals and the graphs $\mathcal{G}_G$ and $\mathcal{G}_L$:

$$\left[X_{(t-T+1):t}, \mathit{PR}, \mathcal{G}_G, \mathcal{G}_L\right] \stackrel{\mathcal{F}}{\to} \left[\widehat{Y}_{(t+1):(t+T')}^{\mathit{value}}; \widehat{Y}_{(t+1):(t+T')}^{\mathit{level}}\right],$$

where $X_{(t-T+1):t} \in \mathbb{R}^{N \times m \times T}$, $\widehat{Y}_{(t+1):(t+T')}^{\mathit{value}} \in \mathbb{R}^{N \times T'}$, and $\widehat{Y}_{(t+1):(t+T')}^{\mathit{level}} \in \mathbb{R}^{N \times T'}$.

In this section, we outline the architecture of our proposed model, M2. Multi-task learning is adopted to predict air quality, with the value prediction of air quality as the main task (regression task) and the level prediction of air quality as the auxiliary task (classification task). Fig. 2 illustrates the framework of M2, which is based on the encoder-decoder architecture. M2 contains four modules: a denoising block that filters out noise, an encoder module that encodes past spatio-temporal information from three views, a decoder module for future value prediction, and a decoder module for future level prediction. The details of the model are described in the following subsections.

Intuitively, clean and stable data can improve the model's prediction performance, so the original data need to be denoised. Inspired by Zhou et al. (2022), we utilize the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT) to filter out the irrelevant noise of each station. Given the input data $X^{S_i} \in \mathbb{R}^{T \times m}$ of station $S_i$, we first perform the FFT along the feature dimension to convert $X^{S_i}$ to the frequency domain:

$$\mathbf{F}^{S_i} = \mathcal{F}\left(X^{S_i}\right) \in \mathbb{C}^{T \times m} \tag{1}$$

where $\mathcal{F}(\cdot)$ denotes the one-dimensional FFT. Note that $\mathbf{F}^{S_i}$ represents the spectrum of $X^{S_i}$ and is a complex tensor. The spectrum can then be modulated by multiplying it with a learnable filter $\mathbf{W} \in \mathbb{C}^{T \times m}$:

$$\tilde{\mathbf{F}}^{S_i} = \mathbf{W} \odot \mathbf{F}^{S_i} \tag{2}$$

where $\odot$ denotes element-wise multiplication. Finally, we use the IFFT to convert the modulated spectrum $\tilde{\mathbf{F}}^{S_i}$ back to the time domain and update the representations:

$$\tilde{\mathbf{X}}^{S_i} \leftarrow \mathcal{F}^{-1}\left(\tilde{\mathbf{F}}^{S_i}\right) \in \mathbb{R}^{T \times m} \tag{3}$$

where $\mathcal{F}^{-1}$ denotes the inverse 1D FFT, which transforms the complex tensor into a real tensor. With the FFT and IFFT, the noise in the raw data can be effectively reduced, allowing us to obtain more accurate feature representations. We also introduce a residual connection and layer normalization to mitigate vanishing gradients and unstable training:

$$\tilde{X}^{S_i} = \text{LayerNorm}\left(\tilde{\mathbf{X}}^{S_i} + X^{S_i}\right) \tag{4}$$
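
The denoising steps in Eqs. (1)-(4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `denoise_block` is hypothetical, the transform is assumed to run over the $T$ time steps of each feature (`axis=0`), the real part of the inverse transform is kept, and layer normalization is applied over the feature axis without learnable gain or bias. In a trained model the filter would be learned; here an all-pass filter is used only to show the data flow.

```python
import numpy as np


def denoise_block(x, w, eps=1e-5):
    """Sketch of Eqs. (1)-(4) for one station.

    x: real array of shape (T, m); w: complex filter of shape (T, m).
    """
    spectrum = np.fft.fft(x, axis=0)            # Eq. (1): FFT over the T time steps (assumed axis)
    modulated = w * spectrum                    # Eq. (2): element-wise filtering
    filtered = np.fft.ifft(modulated, axis=0)   # Eq. (3): back to the time domain
    filtered = filtered.real                    # assumption: keep the real part only
    residual = filtered + x                     # residual connection
    mean = residual.mean(axis=-1, keepdims=True)
    var = residual.var(axis=-1, keepdims=True)
    return (residual - mean) / np.sqrt(var + eps)  # Eq. (4): LayerNorm without gain/bias


rng = np.random.default_rng(0)
x = rng.normal(size=(24, 5))                    # 24 time steps, 5 features (illustrative sizes)
w_identity = np.ones((24, 5), dtype=complex)    # all-pass filter: leaves the spectrum unchanged
out = denoise_block(x, w_identity)
```

With the all-pass filter, the inverse transform recovers the input exactly, so the block reduces to layer normalization of `2 * x`; a learned filter would instead attenuate the frequency components it treats as noise.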

4.2. Geographical spatial view in the encoder module

The geographical spatial view models the relationship between geographic neighbors of stations according to their actual locations. Motivated by the First Law of Geography (Tobler, 1970), i.e., “Everything is related to everything else, but near things are more related than distant things,” we observe that the closer stations are to one another, the more similar their air quality is. In recent years, graph convolution has gained widespread application in various domains due to its ability to model the interaction between nodes on a non-Euclidean graph. Thus, we employ graph convolution to capture the geographical spatial relationships across stations.

We construct a geographical spatial graph $A_G = (S, \mathcal{E}_G)$, which associates each station with its natural geographical neighbors to pick up geographical spatial information. In $A_G$, a node $S_i$ represents a station and an edge $e_{ij}^{G} \in \mathcal{E}_G$ represents that $S_i$ and $S_j$ are neighbors. Here, $A_G$ also represents its adjacency matrix, where $A_G[i,j] = 1$ if there exists an edge $e_{ij}^{G}$ in $\mathcal{E}_G$; otherwise, it is 0. Following STGCN (Yu et al., 2018), the weighted adjacency matrix $A_G$ can be formed as

$$A_{G_{ij}} = \begin{cases} \exp\left(-\dfrac{d_{ij}}{\sigma^{2}}\right), & d_{ij} < \xi_G \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $\sigma$ and $\xi_G$ are the thresholds that control the distribution and sparsity of $A_G$. $A_{G_{ij}}$ is the edge weight, which is related to $d_{ij}$ (the distance between stations $i$ and $j$).
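
As an illustration of Eq. (5), the sketch below builds the weighted adjacency matrix from a pairwise distance matrix. The function name `geographic_adjacency`, the toy distances, and the values of `sigma` and `xi_g` are assumptions chosen for the example, not settings from the paper.

```python
import numpy as np


def geographic_adjacency(dist, sigma, xi_g):
    """Eq. (5): exp(-d_ij / sigma^2) where d_ij < xi_g, and 0 elsewhere.

    dist: (N, N) array of pairwise station distances.
    """
    weights = np.exp(-dist / sigma**2)
    return np.where(dist < xi_g, weights, 0.0)


# Three hypothetical stations: 0 and 1 are close, 2 is far from both.
dist = np.array([[0.0, 2.0, 9.0],
                 [2.0, 0.0, 7.0],
                 [9.0, 7.0, 0.0]])
A_G = geographic_adjacency(dist, sigma=2.0, xi_g=5.0)
```

Because $d_{ii} = 0 < \xi_G$, this reading of Eq. (5) gives every station a self-loop of weight 1, while pairs farther apart than the threshold are disconnected.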

Given the denoised data $\tilde{X}_t \in \mathbb{R}^{N \times m}$ at time interval $t$ and the geographical spatial graph $A_G$, the accumulated information is formulated as:

$$s_t^{G} = \mathit{GConv}\left(\tilde{X}_t, A_G\right) = \sum_{k=1}^{K} A_G^{k} \tilde{X}_t W_G^{k} \tag{6}$$

where $\tilde{X}_t$ denotes the input of the graph convolution. The graph convolution step $k$ ranges from 1 to $K$, so that information is aggregated from the $k$-order neighbors with the learnable feature weight matrix $W_G^{k} \in \mathbb{R}^{m \times h}$, where $h$ is the dimension of a hidden state vector. Finally, for each time interval $t$, we obtain $s_t^{G} \in \mathbb{R}^{N \times h}$ as the geographical spatial view representation for all stations.
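
The $K$-step aggregation in Eq. (6) can be sketched as below. The sketch reads $A_G^{k}$ as the $k$-th matrix power of the adjacency matrix and $W_G^{k}$ as the $k$-th weight matrix; the function name `gconv` and all sizes are illustrative assumptions, and the weights are random stand-ins for learned parameters.

```python
import numpy as np


def gconv(x, adj, weights):
    """Eq. (6): sum over k of A^k X W^k.

    x: (N, m) node features; adj: (N, N) adjacency matrix;
    weights: list of K matrices of shape (m, h). Returns (N, h).
    """
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    adj_power = np.eye(adj.shape[0])
    for w_k in weights:                  # k = 1, ..., K
        adj_power = adj_power @ adj      # A^k, reaching k-order neighbors
        out += adj_power @ x @ w_k
    return out


rng = np.random.default_rng(0)
N, m, h, K = 4, 3, 2, 2
x_t = rng.normal(size=(N, m))
adj = rng.random(size=(N, N))
weights = [rng.normal(size=(m, h)) for _ in range(K)]
s_t = gconv(x_t, adj, weights)
```

The same function applies to any adjacency matrix of matching size, so one routine can serve every spatial view with its own set of weights.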

The logical spatial view models the relationship between logical neighbors of stations according to their POIs and RNs. In addition to geographical spatial neighbor similarity, POIs and road networks also lead to logical spatial similarity between stations. For example, stations in industrial parks would have equally poor air quality, while stations in wetlands would have relatively fresh air. Although such stations are geographically far apart, they are still similar. Thus, we construct a logical spatial graph $A_L = (S, \mathcal{E}_L)$ to obtain the similarity between stations. The weighted adjacency matrix $A_L$ can be formed as

$$A_{L_{ij}} = \begin{cases} \cos\left(\mathit{PR}_i, \mathit{PR}_j\right), & \cos\left(\mathit{PR}_i, \mathit{PR}_j\right) > \xi_L \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $\xi_L$ is the threshold that determines the sparsity of the adjacency matrix $A_L$, and $\cos(\cdot,\cdot)$ is the cosine similarity used to measure the similarity between stations.
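
A minimal sketch of Eq. (7) is given below: it computes pairwise cosine similarities between the POI and road-network feature vectors and zeroes out entries that do not exceed the threshold. The function name `logical_adjacency`, the toy feature vectors, and the threshold value are assumptions made for the example.

```python
import numpy as np


def logical_adjacency(pr, xi_l):
    """Eq. (7): cosine similarity of PR_i and PR_j, kept only if it exceeds xi_l.

    pr: (N, q) array of logical spatial features, one row per station.
    """
    norms = np.linalg.norm(pr, axis=1, keepdims=True)
    unit = pr / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    cos = unit @ unit.T
    return np.where(cos > xi_l, cos, 0.0)


# Hypothetical counts: stations 0 and 1 share the same POI profile at different scales,
# station 2 has a very different profile.
pr = np.array([[10.0, 0.0, 2.0],
               [20.0, 0.0, 4.0],
               [0.0, 5.0, 1.0]])
A_L = logical_adjacency(pr, xi_l=0.5)
```

Cosine similarity ignores the overall magnitude of the counts, so two stations with proportional POI profiles are treated as logical neighbors regardless of how far apart they are.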

Given the denoised data $\tilde{X}_t$ at time interval $t$ and the logical spatial graph $A_L$, the accumulated information is formulated as:

$$s_t^{L} = \mathit{GConv}\left(\tilde{X}_t, A_L\right) = \sum_{k=1}^{K} A_L^{k} \tilde{X}_t W_L^{k} \tag{8}$$

where $\tilde{X}_t$ denotes the input of the graph convolution. The graph convolution step $k$ ranges from 1 to $K$, so that information is aggregated from the $k$-order neighbors with the learnable feature weight matrix $W_L^{k} \in \mathbb{R}^{m \times h}$, where $h$ is the dimension of a hidden state vector. Finally, for each time interval $t$, we obtain $s_t^{L} \in \mathbb{R}^{N \times h}$ as the logical spatial view representation for all stations.

The temporal view models sequential relations. Geographical and logical spatial relationships have varying effects on the stations depending on the time of day. To differentiate the relative relevance of the contextual states from different views at different moments, we design an attention fusion unit that improves the capacity to select relevant information from distinct views:

$$\begin{aligned} \tilde{s}_t &= \alpha_t^{G} s_t^{G} + \alpha_t^{L} s_t^{L} \\ \alpha_t^{G} &= \frac{\exp\left(s_t^{G}\right)}{\exp\left(s_t^{G}\right) + \exp\left(s_t^{L}\right)} \\ \alpha_t^{L} &= \frac{\exp\left(s_t^{L}\right)}{\exp\left(s_t^{G}\right) + \exp\left(s_t^{L}\right)} \end{aligned} \tag{9}$$
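
The fusion in Eq. (9) can be sketched as an element-wise two-way softmax over the two view representations. Treating the exponentials and products as element-wise is an assumption of this sketch, as are the function name `attention_fuse` and the toy inputs; subtracting the element-wise maximum before exponentiating does not change the weights and only avoids overflow.

```python
import numpy as np


def attention_fuse(s_g, s_l):
    """Eq. (9): softmax weights over the two views, applied element-wise.

    s_g, s_l: arrays of the same shape, e.g. (N, h).
    """
    shift = np.maximum(s_g, s_l)          # numerical stability; cancels in the ratio
    e_g = np.exp(s_g - shift)
    e_l = np.exp(s_l - shift)
    alpha_g = e_g / (e_g + e_l)
    alpha_l = e_l / (e_g + e_l)
    fused = alpha_g * s_g + alpha_l * s_l
    return fused, alpha_g, alpha_l


s_g = np.array([[1.0, 2.0], [0.0, -1.0]])
s_l = np.array([[1.0, 0.0], [3.0, -1.0]])
fused, alpha_g, alpha_l = attention_fuse(s_g, s_l)
```

Where the two views agree, both weights equal 0.5 and the fused value equals the shared value; where they differ, the larger activation receives the larger weight.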

We utilize a GRU to capture the underlying temporal correlations between stations. The GRU, an improved variant of recurrent neural networks (RNNs), is chosen here because it addresses the vanishing gradient problem (Huang et al., 2019) and performs well in sequential modeling. The operation of the GRU can be expressed as follows:

$$\begin{aligned} \mathbf{R}_t &= \sigma\left(\mathbf{W}_R\left[\mathbf{H}_{t-1} \parallel \tilde{\mathbf{s}}_t\right] + \mathbf{b}_R\right) \\ \mathbf{Z}_t &= \sigma\left(\mathbf{W}_Z\left[\mathbf{H}_{t-1} \parallel \tilde{\mathbf{s}}_t\right] + \mathbf{b}_Z\right) \\ \tilde{\mathbf{H}}_t &= \tanh\left(\mathbf{W}_{\tilde{H}}\left[\mathbf{R}_t \odot \mathbf{H}_{t-1} \parallel \tilde{\mathbf{s}}_t\right] + \mathbf{b}_{\tilde{H}}\right) \\ \mathbf{H}_t &= \left(1 - \mathbf{Z}_t\right) \odot \mathbf{H}_{t-1} + \mathbf{Z}_t \odot \tilde{\mathbf{H}}_t \end{aligned} \tag{10}$$

where $\mathbf{R}_t$ and $\mathbf{Z}_t$ denote the reset gate and the update gate at time interval $t$; $\mathbf{W}_R$, $\mathbf{W}_Z$, $\mathbf{W}_{\tilde{H}}$, $\mathbf{b}_R$, $\mathbf{b}_Z$, and $\mathbf{b}_{\tilde{H}}$ are trainable parameters; and $\parallel$ and $\odot$ represent the concatenation operation and the Hadamard product, respectively. The hidden states $\mathbf{H}_t \in \mathbb{R}^{N \times h}$ are updated under the control of the two gates: the reset gate determines how much previous information is retained, and the update gate balances the candidate hidden state against the previous memory. At each time step $t$, the extracted context state of station $S_i$ is represented by the $i$-th state vector, denoted by $h_t^{i} \in \mathbb{R}^{h}$.
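
One GRU update following Eq. (10) can be sketched as below. The sketch uses a row-vector convention, so the concatenated input is multiplied on the right by weight matrices of shape $(2h, h)$; the function names, the parameter dictionary, and the random initialization are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(h_prev, s_t, params):
    """One step of Eq. (10). h_prev, s_t: (N, h); returns the new hidden state (N, h)."""
    hs = np.concatenate([h_prev, s_t], axis=-1)               # [H_{t-1} || s_t]
    r = sigmoid(hs @ params["W_R"] + params["b_R"])           # reset gate
    z = sigmoid(hs @ params["W_Z"] + params["b_Z"])           # update gate
    rhs = np.concatenate([r * h_prev, s_t], axis=-1)          # [R_t * H_{t-1} || s_t]
    h_cand = np.tanh(rhs @ params["W_H"] + params["b_H"])     # candidate state
    return (1.0 - z) * h_prev + z * h_cand


rng = np.random.default_rng(0)
N, h, T = 4, 3, 6
params = {name: rng.normal(scale=0.1, size=(2 * h, h)) for name in ("W_R", "W_Z", "W_H")}
params.update({name: np.zeros(h) for name in ("b_R", "b_Z", "b_H")})
fused_seq = rng.normal(size=(T, N, h))      # stand-in for the fused inputs over T intervals
H = np.zeros((N, h))
for s_t in fused_seq:
    H = gru_step(H, s_t, params)
```

Each row of the final `H` plays the role of the context state $h_t^{i}$ of one station after the whole input window has been processed.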

We consider value prediction to be the primary task and employ an LSTM to make predictions for each station. In the value decoder, as shown in the upper right part of Fig. 2, we concatenate the output of the encoder $h_t^{i} \in \mathbb{R}^{h}$ and the last output of the value decoder $c_{\tau-1}^{i,\mathit{value}}$ to update the hidden state with the LSTM as below:

$$c_{\tau}^{i,\mathit{value}} = \mathit{LSTM}^{v}\left(h_t^{i} \parallel c_{\tau-1}^{i,\mathit{value}}\right) \tag{11}$$

Note that the output of the LSTM contains the effects of the geographical spatial, temporal, and logical spatial views. Then, we use a fully connected network for the final value prediction at future time interval $\tau$ and use the Rectified Linear Unit (ReLU) as the activation function:

$$y_{\tau}^{i,\mathit{value}} = \mathit{ReLU}\left(W_v c_{\tau}^{i,\mathit{value}} + b_v\right) \tag{12}$$

where $W_v$ and $b_v$ are learnable parameters. ReLU is a standard activation function whose output is non-negative, which is consistent with the air quality values being normalized to $[0,1]$. We later denormalize the prediction to obtain the actual air quality values.

Similar to the value prediction task, we use an LSTM to make multi-step predictions for each station individually, with level prediction as an auxiliary task. In the level decoder, as shown in the lower right part of Fig. 2, we concatenate the output of the encoder $h_t^{i}$ and the last output of the level decoder $c_{\tau-1}^{i,\mathit{level}}$ to update the hidden state with the LSTM as below:

$$c_{\tau}^{i,\mathit{level}} = \mathit{LSTM}^{l}\left(h_t^{i} \parallel c_{\tau-1}^{i,\mathit{level}}\right) \tag{13}$$

Then, we use a fully connected network for the final level prediction at future time interval $\tau$ and use ReLU as the activation function:

$$y_{\tau}^{i,\mathit{level}} = \mathit{ReLU}\left(W_l c_{\tau}^{i,\mathit{level}} + b_l\right) \tag{14}$$

where $W_l \in \mathbb{R}^{h \times 6}$ and $b_l \in \mathbb{R}^{6}$ are learnable parameters.

The network is trained by backpropagation with the Adam optimizer. We minimize the following loss function, which has two terms:

$$\begin{aligned}\mathcal{L}\left(\theta \right)&={\mathcal{L}}_{\mathit{value}}+\lambda {\mathcal{L}}_{\mathit{level}}\\ {\mathcal{L}}_{\mathit{value}}&=\mathit{MSELoss}\left({\widehat{Y}}^{\mathit{value}},{Y}^{\mathit{value}}\right)\\ {\mathcal{L}}_{\mathit{level}}&=\mathit{CrossEntropyLoss}\left({\widehat{Y}}^{\mathit{level}},{Y}^{\mathit{level}}\right)\end{aligned} \tag{15}$$

where *λ* is the trade-off weight between the two losses and *θ* denotes all learnable parameters of the model. MSELoss is the mean squared error between the prediction ${\widehat{Y}}^{\mathit{value}}$ and the corresponding ground truth ${Y}^{\mathit{value}}$, and CrossEntropyLoss is the cross-entropy loss between the predicted ${\widehat{Y}}^{\mathit{level}}$ and the ground-truth ${Y}^{\mathit{level}}$.
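To make the two-term objective of Eq. (15) concrete, the sketch below computes it in plain Python. It rests on assumptions not stated in the text: the cross-entropy is taken as softmax cross-entropy over the six level scores, levels are encoded as 0-based class indices, and both losses are averaged over samples. All function names are illustrative.

```python
import math
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and true (normalized) values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy_loss(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean softmax cross-entropy; each row of `scores` holds one score per level."""
    total = 0.0
    for row, label in zip(scores, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[label]
    return total / len(labels)


def joint_loss(value_pred, value_true, level_scores, level_true, lam: float) -> float:
    """Eq. (15): L = L_value + lambda * L_level."""
    return mse_loss(value_pred, value_true) + lam * cross_entropy_loss(level_scores, level_true)


# One sample with a perfect value prediction and uninformative level scores:
loss = joint_loss([0.5], [0.5], [[0.0] * 6], [2], lam=0.001)
```

With uniform scores over six levels the cross-entropy equals ln 6, so the example loss is 0.001 · ln 6, which shows how a small weight keeps the auxiliary term from dominating the value loss.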

In this section, we evaluate the proposed model empirically on two real-world datasets. The following research questions (RQs) guide our experiments:

- RQ1: How does the proposed model perform compared to existing air quality prediction methods?
- RQ2: How does each component contribute to the performance of the proposed model?
- RQ3: How do different hyperparameters affect the performance of the proposed model?

Our experiments are conducted on two real-world datasets: Beijing and Tianjin. Both datasets contain air quality data and meteorological data collected by sensors. Statistics of these datasets are shown in Table 1. Furthermore, other details of the datasets are introduced below:

- Beijing: Air quality data (PM2.5, PM10, AQI, NO_{2}, SO_{2}, CO, and O_{3}) and meteorological data (temperature, wind speed, humidity, and wind direction) are collected hourly from the national urban air quality real-time release platform of the China Environmental Monitoring Station.^{1} In addition, similar to Zheng et al. (2013), we consider 12 types of POIs on Amap^{2} and count the number of POIs of each type for each station.
- Tianjin: Air quality data (PM2.5, PM10, NO_{2}, SO_{2}, CO, and O_{3}) and meteorological data (temperature, pressure, humidity, wind speed, and wind direction) are collected hourly.^{3} Additionally, we include POI data covering 20 categories, scraped from a public website.^{4}

| Properties | | Beijing | Tianjin |
|---|---|---|---|
| Air quality | Stations | 35 | 26 |
| | Time span | 01/2016–01/2018 | 01/2014–04/2015 |
| | Features | 7 | 6 |
| | Records | 1,079,040 | 214,760 |
| Meteorology | Features | 4 | 5 |
| POIs | Categories | 12 | 20 |
| Time | Features | 3 | 3 |

In addition, similar to Zhang et al. (2019), we extract three time features from the timestamp of each data point: the hour of the day, the day of the week, and the month. We fill missing (null) values by linear interpolation. We then split the data in chronological order, using the first 70% as the training set, the next 20% as the validation set, and the remaining 10% as the test set.
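The preprocessing described above can be sketched in a few lines. This is only an illustration under stated assumptions: missing values are represented as `None`, gaps at the start or end of a series copy the nearest known value (the text does not say how edge gaps are handled), and the function names are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Sequence, Tuple


def time_features(ts: datetime) -> Tuple[int, int, int]:
    """Hour of day (0-23), day of week (0 = Monday), and month (1-12)."""
    return ts.hour, ts.weekday(), ts.month


def interpolate_linear(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries by linear interpolation between the nearest known values."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled
    # Assumption: gaps before the first / after the last known value copy that value.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def chronological_split(records: Sequence, train_pct: int = 70, val_pct: int = 20):
    """Split time-ordered records into train / validation / test without shuffling."""
    n = len(records)
    a = n * train_pct // 100
    b = n * (train_pct + val_pct) // 100
    return records[:a], records[a:b], records[b:]
```

Integer percentages are used for the split boundaries so that they do not suffer from floating-point rounding. Keeping the split chronological ensures the model is never validated or tested on data that precedes its training data.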

We use two standard metrics, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), to evaluate the models. Smaller values indicate better performance. The two metrics are defined as:

$$\mathrm{MAE}=\frac{1}{N}\sum _{i=1}^{N}\left|{\widehat{y}}_{i}-{y}_{i}\right| \tag{16}$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum _{i=1}^{N}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}} \tag{17}$$

where $N$ is the number of evaluated samples, ${\widehat{y}}_{i}$ is the prediction, and ${y}_{i}$ is the ground truth.
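Both metrics are straightforward to compute; the following minimal sketch implements Eqs. (16) and (17) directly (the function names are illustrative).

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error, Eq. (16)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error, Eq. (17)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))
```

Because RMSE squares each error before averaging, a few large errors raise it much more than they raise MAE, so RMSE is never smaller than MAE on the same predictions.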

We compare M2 with the following nine baselines.

- HA: we predict PM2.5 with the historical average method at each time step. For example, the average of the data observed from 01:00 to 12:00 is used to forecast the next 6 h (13:00–18:00).
- ARIMA (Box and Pierce, 1970): the autoregressive integrated moving average model is a well-known method for predicting time series.
- SVR (Chang and Lin, 2011): support vector regression is a traditional time series prediction method that learns feature mapping functions.
- Seq2Seq (Sutskever et al., 2014): we implement a two-layer sequence-to-sequence model for air quality prediction, with LSTM as the RNN implementation.
- MGED-Net (Zhang et al., 2019): MGED-Net uses a multi-group encoder and a single decoder. The grouped features are fed to the encoders, and their outputs are fused and decoded for air quality prediction.
- GWaveNet (Wu et al., 2019): Graph WaveNet captures hidden spatial dependencies through spatial-temporal graph modeling to make predictions.
- GMAN (Zheng et al., 2020): GMAN proposes spatial and temporal attention mechanisms with gated fusion to model complex temporal and spatial correlations, and designs a switching attention mechanism that mitigates error propagation to improve long-term prediction performance.
- ATGCN (Wang et al., 2021): ATGCN encodes multiple inter-station relationships into graphs and designs parallel GCN-based encoding and decoding modules to aggregate features from related stations using the different graphs.
- STPGCN (Zhao et al., 2022): STPGCN proposes an adaptive inference module for spatiotemporal position-aware relations. It captures dynamic spatiotemporal relations by combining spatial and temporal position embeddings and integrates these relations into the graph convolution layer to aggregate and update node features.
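As a point of reference for the simplest baseline in the list, the HA forecast can be sketched as below. The sketch assumes hourly readings and repeats the mean of the observed window for every future step; the function name and default horizon are illustrative.

```python
from typing import List, Sequence


def historical_average(history: Sequence[float], horizon: int = 6) -> List[float]:
    """Forecast every future step as the mean of the observed history window."""
    mean = sum(history) / len(history)
    return [mean] * horizon


# Twelve hypothetical hourly PM2.5 readings (01:00-12:00) forecast 13:00-18:00.
forecast = historical_average([float(v) for v in range(1, 13)])
```

Since the forecast is constant over the horizon, HA cannot follow any trend within the prediction window, which is consistent with it serving as a naive lower bar for the learned models.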

As shown in Table 2, we divide PM2.5 into six levels, from 1 to 6, based on the standard issued by China's Ministry of Environmental Protection. In all experiments, 12 h of historical data is used to predict air quality in the next 6 h. All experiments are conducted on a Linux server (CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, GPU: NVIDIA TESLA V100 16GB). The hidden size is 64, the batch size is 256, and the maximum number of training epochs is 500. Early stopping on the validation set is used.

| PM2.5 range (μg/m³) | Pollution level | Level |
|---|---|---|
| 0–35 | Excellent | 1 |
| 35–75 | Good | 2 |
| 75–115 | Mildly contaminated | 3 |
| 115–150 | Moderately contaminated | 4 |
| 150–250 | Heavily contaminated | 5 |
| >250 | Severely contaminated | 6 |
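The mapping from a PM2.5 concentration to its level, which provides the labels for the auxiliary task, can be sketched as a lookup over the breakpoints of Table 2. The table leaves the boundary values ambiguous (35 appears in two ranges), so this sketch assumes that each upper bound is inclusive; the function name is illustrative.

```python
from bisect import bisect_left

# Upper bounds of levels 1-5 from Table 2; anything above 250 is level 6.
BREAKPOINTS = [35, 75, 115, 150, 250]


def pm25_to_level(pm25: float) -> int:
    """Return the pollution level (1-6), treating each upper bound as inclusive (assumption)."""
    return bisect_left(BREAKPOINTS, pm25) + 1
```

For example, under this convention a reading of exactly 35 is assigned level 1 and a reading of 36 is assigned level 2.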

As presented in Table 3, our model M2 achieves the lowest MAE and RMSE on both datasets, which supports the effectiveness of our approach. The naive baseline generally gives high errors because it considers only the temporal correlations of air quality. Deep learning-based models outperform the traditional statistical methods, demonstrating their superior capacity. Graph WaveNet generally performs well, which might stem from its use of adaptive graphs to model relationships between nodes; this indicates that adaptive graph-based methods can exploit valuable but latent spatial dependencies in historical air quality data. GMAN performs well on the Tianjin dataset, mainly due to its powerful attention mechanism. ATGCN has high accuracy on the Beijing dataset, probably because it performs better when data is abundant. STPGCN performs well on MAE but not on RMSE, indicating that it is more sensitive to outliers. To gain further insight into the design choices, we also compare M2 with its variants in the ablation analysis.

| Model | Beijing MAE | Beijing RMSE | Tianjin MAE | Tianjin RMSE |
|---|---|---|---|---|
| HA | 29.60 | 32.04 | 34.78 | 37.51 |
| ARIMA | 25.77 | 29.22 | 30.66 | 34.69 |
| SVR | 23.63 | 26.84 | 27.18 | 30.89 |
| Seq2Seq | 15.41 | 23.22 | 19.58 | 30.63 |
| MGED-Net (19′) | 16.05 | 22.74 | 17.58 | 26.19 |
| GWavenet (19′) | 14.90 | 22.44 | 16.99 | 24.15 |
| GMAN (20′) | 14.82 | 23.48 | 15.74 | 23.36 |
| ATGCN (21′) | 14.82 | 21.99 | 17.00 | 26.53 |
| STPGCN (22′) | 13.53 | 23.59 | 16.52 | 24.70 |
| M2 | 13.08 | 20.86 | 14.10 | 21.91 |

To validate the effectiveness of the proposed components, we evaluate the following variants:

- M2 w/o auxiliary-task level prediction (M2 w/o m): we remove the air quality level prediction part (the auxiliary task) and predict only the air quality value.
- M2 w/o dual-view attention-based fusion module (M2 w/o a): instead of using an attention-based fusion module in each time interval, we directly add the representations of the two spatial views, in order to demonstrate the effectiveness of the proposed module.
- M2 w/o logical spatial view (M2 w/o l): we consider only the geographical and temporal views.
- M2 w/o geographical spatial view (M2 w/o g): we consider only the logical and temporal views.

The results of the ablation study are shown in Fig. 3, which indicates that every key design contributes to the performance of the proposed model. M2 outperforms M2 w/o a, which demonstrates the importance of fusing the different views at each interval. Performance degrades dramatically when either spatial view is removed, which confirms the influence of both the geographical and the logical relationships between stations. M2 also outperforms M2 w/o m, indicating that air quality level prediction helps air quality value prediction. Notably, the auxiliary task brings a larger improvement on the Beijing dataset than on the Tianjin dataset. On the one hand, this may be because the datasets have different characteristics: Beijing has more data records and more monitoring stations than Tianjin. On the other hand, auxiliary tasks may work better on datasets with more complex air quality and meteorological conditions, where they help capture the underlying air quality patterns and thus improve the accuracy of the main task. Conversely, on datasets with relatively simple and stable conditions, their contribution may be limited. Removing any component increases the error noticeably, which validates the relevance of each component.

This section analyzes the effects of the hyper-parameter K, the number of graph convolution steps, and the hyper-parameter *α*, the weight of the auxiliary task in the loss function (written *λ* in Eq. (15)). In our model, K is selected from $\{1, 2, 3, 4\}$ and *α* from $\{0.1, 0.01, 0.001, 0.0001, 0.00001\}$. Every hyper-parameter is tuned on the validation set, and an early stopping strategy is used: training stops if validation performance has not improved for 15 iterations.
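The early stopping rule can be sketched as follows. The sketch assumes that an "iteration" corresponds to one validation check (for example, one epoch) and that improvement means a strictly lower validation loss; the function name and return convention are hypothetical.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 15) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch where `patience` checks have passed
    without a new best loss; otherwise it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# The best loss occurs at epoch 1; with patience 15 training stops at epoch 16.
stop, best = early_stopping_epoch([1.0, 0.9] + [0.95] * 20)
```

In practice the parameters saved at `best_epoch` would be the ones evaluated on the test set, although the text does not state which checkpoint is used.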

Fig. 4 shows how performance varies with K. K = 3 is optimal for the Beijing dataset, whereas K = 2 is optimal for the Tianjin dataset. A small number of graph convolution steps degrades performance because little neighbor information is aggregated. As the number of steps increases, more neighbor information is captured and performance improves. Beyond a certain point, however, performance degrades again, possibly because the additional steps introduce noise.

Next, we consider the effect of the auxiliary task weight *α* on prediction performance. As shown in Fig. 5, each dataset has its own optimal *α*, and both too large and too small a weight degrade predictive performance. The results show that *α* = 0.001 is best for the Beijing dataset, whereas *α* = 0.0001 is best for the Tianjin dataset. The optimal *α* differs across datasets, which likely depends on dataset characteristics, task correlation, the distribution of training samples, and how the weights are adjusted during training.

As different seasons have different sources of pollutants (Ma et al., 2019), we use the season as the scale at which to investigate differences in air quality prediction. For the Beijing dataset, March 2017 to May 2017 represents spring, June 2017 to August 2017 represents summer, September 2017 to November 2017 represents autumn, and December 2017 to February 2018 represents winter. Table 4 compares GMAN, ATGCN, STPGCN, and M2 in predicting PM2.5 concentrations in each season. The impact of climate on air quality varies by season. For instance, in winter, lower temperatures and weaker solar radiation can increase pollutant concentrations in the air. M2 achieves the lowest MAE and RMSE of the four models in winter (12.74 and 22.63), when pollution sources are diverse and PM2.5 concentrations are high. This suggests that our model predicts more stably during highly polluted seasons. Meteorological conditions such as wind speed, wind direction, temperature, humidity, air pressure, and precipitation also significantly influence air quality. Strong winds help disperse and dilute pollutants, thereby reducing the level of pollution, whereas temperature inversions and stagnant wind conditions can lead to pollutant accumulation and worsening air quality. Our model takes meteorological information as an input, which leads to more precise predictions. In short, air quality is affected by many factors, and predicting it accurately requires combining multiple perspectives, including environmental domain knowledge and data-driven approaches.
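The seasonal grouping used for this comparison follows directly from the month of each record. A minimal sketch, assuming records carry a standard timestamp (the function name is illustrative):

```python
from datetime import datetime


def season_of(ts: datetime) -> str:
    """Map a timestamp to its season: Mar-May spring, Jun-Aug summer, Sep-Nov autumn, Dec-Feb winter."""
    if 3 <= ts.month <= 5:
        return "spring"
    if 6 <= ts.month <= 8:
        return "summer"
    if 9 <= ts.month <= 11:
        return "autumn"
    return "winter"
```

Grouping test records by `season_of` and computing MAE and RMSE within each group would yield a per-season breakdown of the kind reported in Table 4.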

| Model | Spring MAE | Spring RMSE | Summer MAE | Summer RMSE | Autumn MAE | Autumn RMSE | Winter MAE | Winter RMSE |
|---|---|---|---|---|---|---|---|---|
| GMAN (20′) | 15.72 | 39.09 | 14.15 | 22.23 | 11.41 | 16.35 | 15.37 | 26.74 |
| ATGCN (21′) | 21.08 | 32.31 | 13.50 | 18.97 | 26.46 | 30.60 | 16.29 | 25.71 |
| STPGCN (22′) | 20.81 | 38.52 | 12.80 | 19.01 | 14.74 | 23.56 | 15.48 | 28.25 |
| M2 | 17.05 | 23.78 | 12.68 | 19.17 | 14.46 | 20.00 | 12.74 | 22.63 |

We also compare the efficiency of M2 with that of three baseline models, GMAN, ATGCN, and STPGCN, on the Beijing and Tianjin datasets. The training times of these models are presented in Fig. 6. All models run faster on the Tianjin dataset than on the Beijing dataset, mainly because of its smaller data volume. The multi-layer ST-Attention blocks in GMAN make its training significantly slower than that of the other models. M2 is significantly faster than GMAN, slower than ATGCN, and comparable to STPGCN. Although M2 is not the fastest model, it achieves the highest accuracy, which indicates that it offers a better trade-off between efficiency and accuracy for air quality prediction.

In this paper, we studied the multi-interval air quality prediction problem and proposed a multi-view multi-task learning approach based on an encoder-decoder architecture. The encoder builds on GCN and GRU and incorporates three views, a geographical spatial view, a logical spatial view, and a temporal view, to capture the potential relationships between stations from different perspectives. The decoder uses LSTMs to predict air quality values and levels. Experiments on two datasets show that our model outperforms the baseline models and illustrate the utility of multi-view and multi-task modeling.

In the future, we plan to extend our proposed model to other spatial-temporal prediction problems, such as leveraging the traffic speed level or the traffic accident risk level. To further enhance our approach, we will explore different types of auxiliary tasks and their applicability in various datasets. Additionally, we will consider incorporating more environmental domain knowledge into air quality prediction to increase the model's interpretability and persuasiveness.

**Shanshan Sui:** Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Validation, Resources. **Qilong Han:** Supervision, Writing – review & editing, Funding acquisition, Project administration.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

This work was supported by the National Key Research and Development Program of China under Grant No. 2020YFB1710200.

Data will be made available on request.

- Box and Pierce, 1970. Distribution of residual autocorrelations in autoregressive integrated moving average time series models. vol. 65 (1970), pp. 1509-1526
- Caruana, 1997. Multitask learning. Machine Learning, vol. 28 (1997), pp. 41-75
- Chang and Lin, 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol., 2 (3) (2011), pp. 27:1-27:27
- Díaz-Robles et al., 2008. A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. vol. 42 (2008), pp. 8331-8340
- Han et al., 2021a. Joint air quality and weather prediction based on multi-adversarial spatiotemporal networks. Thirty-Fifth AAAI Conference on Artificial Intelligence (2021), pp. 4081-4089
- Han et al., 2021b. Fine-grained air quality inference via multi-channel attention model. International Joint Conference on Artificial Intelligence (2021), pp. 2512-2518
- Howse et al., 1997. Comparing neural network and regression models for ozone forecasting. Journal of the Air and Waste Management Association, 47 (1997), pp. 653-663
- Howse et al., 2021. Air pollution and the noncommunicable disease prevention agenda: opportunities for public health and environmental science. Environ. Res. Lett., 16 (2021), Article 065002
- Huang et al., 2019. Deep dynamic fusion network for traffic accident forecasting. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019 (2019), pp. 2673-2681
- Li et al., 2016. Deep learning architecture for air quality predictions. Environ. Sci. Pollut. Res., 23 (22) (2016), pp. 22408-22417
- Li et al., 2018. Diffusion convolutional recurrent neural network: data-driven traffic forecasting. International Conference on Learning Representations (2018)
- Li et al., 2021. Traffic flow prediction over muti-sensor data correlation with graph convolution network. Neurocomputing, 427 (2021), pp. 50-63
- Liang et al., 2018. Geoman: multi-level attention networks for geo-sensory time series prediction. Proceedings of the 27th International Joint Conference on Artificial Intelligence (2018), pp. 3428-3434
- Liu et al., 2016. Urban water quality prediction based on multi-task multi-view learning. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), pp. 2576-2581
- Luo et al., 2019. Accuair: winning solution to air quality prediction for KDD cup 2018. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), pp. 1842-1850
- Ma et al., 2019. Spatial and seasonal characteristics of particulate matter and gaseous pollution in China: implications for control policy. Environ. Pollut., 248 (2019), pp. 421-428
- Pan et al., 2012. Utilizing real-world transportation data for accurate traffic prediction. 12th IEEE International Conference on Data Mining (2012), pp. 595-604
- Qi et al., 2018. Deep air learning: interpolation, prediction, and feature analysis of fine-grained air quality. IEEE Trans. Knowl. Data Eng., 30 (12) (2018), pp. 2285-2297
- Sánchez et al., 2013. Nonlinear air quality modeling using support vector machines in Gijón urban area (Northern Spain) at local scale. vol. 14 (2013), pp. 291-305
- Sutskever et al., 2014. Sequence to sequence learning with neural networks. Neural Information Processing Systems (2014), pp. 3104-3112
- Wang et al., 2020. Multi-task adversarial spatial-temporal networks for crowd flow prediction. The 29th ACM International Conference on Information and Knowledge Management (2020), pp. 1555-1564
- Wang et al., 2021. Modeling inter-station relationships with attentive temporal graph convolutional network for air quality prediction. The 14th ACM International Conference on Web Search and Data Mining (2021), pp. 616-634
- Wu et al., 2019. Graph wavenet for deep spatial-temporal graph modeling. Proceedings of the 28th International Joint Conference on Artificial Intelligence (2019), pp. 1907-1913
- Yi et al., 2018. Deep distributed fusion network for air quality prediction. Y. Guo, F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery, ACM (2018), pp. 965-973
- Yu et al., 2018. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden (2018), pp. 3634-3640
- Yu et al., 2021. Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing, 423 (2021), pp. 135-147
- Zhang et al., 2019. Multi-group encoder-decoder networks to fuse heterogeneous data for next-day air quality prediction. Proceedings of the 28th International Joint Conference on Artificial Intelligence (2019), pp. 4341-4347
- Zhao et al., 2022. Spatial-temporal position-aware graph convolution networks for traffic flow forecasting. IEEE Transactions on Intelligent Transportation Systems, vol. 23 (2022), pp. 20202-20216
- Zheng et al., 2013. U-air: when urban air quality inference meets big data. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery (2013), pp. 1436-1444
- Zheng et al., 2020. GMAN: a graph multi-attention network for traffic prediction. The Thirty-Fourth AAAI Conference on Artificial Intelligence (2020), pp. 1234-1241
- Zhou et al., 2022. Filter-enhanced MLP is all you need for sequential recommendation. WWW, ACM (2022), pp. 2388-2399

© 2023 Elsevier B.V. All rights reserved.