Building ML models from a small amount of labelled data together with large amounts of unlabelled data corresponds to a learning paradigm known as semi-supervised learning. Over the years, semi-supervised learning (SSL) has attracted considerable attention from the research community, owing to the ease of acquiring large amounts of data and the inherent difficulty of labelling it. In recent years, deep-learning-based semi-supervised methods have achieved impressive results. Relying on ideas such as self-supervised learning [25] and consistency regularization [26], these methods have been successfully applied to many image classification problems, achieving performance comparable to fully supervised approaches while using only a small number of labels [27]-[29].
Currently, most SSL approaches work in the single-instance learning setting, where the goal is to predict the label $y$ of a data sample $\mathbf{x}$. However, a different setting of particular interest for remote disease screening is that of Multiple-Instance Learning (MIL), where the goal is to predict a label $y$ for a bag of samples (or instances) $\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots\}$. Throughout the learning procedure, the labels of the instances in the bag are unknown and the only available annotation is a label that describes the entire bag. This situation is regularly encountered in practice. In our case, for example, PD tremor may manifest only for a fraction of the time during everyday life, depending on the disease's stage, the symptom's intermittence or levodopa intake. Hence, to detect tremor from sensor measurements obtained in-the-wild, we must resort to MIL, as there is no easy way to obtain detailed ground truth of the on-off periods.
We focus on the combination of semi-supervised and multiple-instance learning. In particular, we are interested in whether unlabelled bags can be used to improve a multiple-instance classifier. Interestingly, this question has received very little research attention. The early work of [30] used unlabelled bags in a content-based image retrieval setting and proposed a method that transforms the MI problem into a single-instance, graph-based label propagation [31] problem with MI constraints. The follow-up work of [32] adopted a similar graph-based approach and proposed enforcing agreement between instance-level and bag-level image representations for inductive image annotation. Further modifications to the graph optimization objective were proposed in [33]. An interesting approach is the regularization-based MI SSL method of [34] for a video annotation task, in which similar instances are encouraged to share similar labels, while [35] proposed an instance-level label propagation scheme that incorporates MI constraints into label propagation. Finally, the problem was also explored in [36].
The rest of this paper is organized as follows. In Section II, we present the related literature in the fields of SSL and MIL, focusing on the approaches most relevant to ours. In Section III, we present our MI SSL method and its proposed variants. Section IV presents preliminary, proof-of-concept experimental results. Section V then introduces the problem of in-the-wild tremor detection and shows how our method can be applied to improve performance on it. Finally, in Section VI, we discuss the potential benefits and caveats of the proposed approach.
II. Preliminaries
A. Semi-Supervised Learning
Semi-supervised learning (SSL) refers to the setting where, in addition to a fully labelled set $\mathcal{D}_{l}=\{(\mathbf{x}_{i}, y_{i})\}_{i=1}^{L}$, we are also given a set of unlabelled data points $\mathcal{D}_{u}=\{\mathbf{x}_{i}\}_{i=L+1}^{L+U}$ drawn i.i.d. from the same marginal distribution. The goal is to exploit $\mathcal{D}_{u}$ in order to learn a better classifier than what would be possible using only $\mathcal{D}_{l}$. In general, it is not evident how unlabelled data can help, as knowledge of the marginal does not directly contribute to the data likelihood for a given model. In fact, $\mathcal{D}_{u}$ can be helpful only if certain assumptions hold. The most common one is the smoothness assumption, which states that if two points $\mathbf{x}_{1}, \mathbf{x}_{2}$ in a high-density region are close, then so should be their predicted labels $y_{1}, y_{2}$. This suggests that the learnt classifier must be smooth in high-density regions. The well-known low-density separation assumption, which requires the decision boundary to lie in a low-density region, is an alternative view of the smoothness assumption.
Early SSL techniques for neural networks were designed to enforce the low-density separation assumption on the resulting classifier. They did so by penalizing a decision boundary in high-density regions, for example by encouraging the model output distribution to have low entropy [38]. More recent techniques employ a similar regularization scheme, in which the model is encouraged to be invariant across label-preserving transformations of the input data (e.g. a small amount of additive Gaussian noise that does not change the label). This principle is called consistency regularization and is used in many state-of-the-art methods for semi-supervised image classification, like Pseudo-Ensembles [39], Temporal Ensembling [40] and Mean Teacher [41].
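To make this concrete, the following is a minimal PyTorch sketch of a generic consistency-regularization term, our own illustration rather than code from the cited works; the classifier `model`, the noise scale `sigma`, and the choice of additive Gaussian noise with a KL divergence are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, sigma=0.1):
    """Penalize changes in the predictive distribution under a small,
    label-preserving perturbation (here: additive Gaussian noise)."""
    with torch.no_grad():
        p_clean = F.softmax(model(x), dim=1)           # predictions on the clean input
    x_noisy = x + sigma * torch.randn_like(x)          # perturbed copy of the input
    logp_noisy = F.log_softmax(model(x_noisy), dim=1)  # predictions on the perturbed input
    # KL divergence between the two output distributions, averaged over the batch
    return F.kl_div(logp_noisy, p_clean, reduction="batchmean")
```

For unlabelled samples, a term like this is simply added to the supervised loss with some weighting factor.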
One of the most effective realizations of this principle is Virtual Adversarial Training (VAT), which, instead of random perturbations, uses the virtual adversarial perturbation, i.e. the perturbation of bounded norm that maximally changes the model's predictive distribution:

$$\mathbf{r}_{vadv} = \underset{\mathbf{r};\, \|\mathbf{r}\|_{2}=\epsilon}{\arg\max}\; D\left[p(y \mid \mathbf{x}; \hat{\theta}),\, p(y \mid \mathbf{x}+\mathbf{r}; \hat{\theta})\right] \qquad (1)$$
where $p(y \mid \mathbf{x}; \hat{\theta})$ is our model, $D$ is a distribution divergence metric and $\hat{\theta}$ denotes the model parameters at the current step. The optimization problem of Eq. 1 can be approximated efficiently with just an additional forward-backward pass through the network. Having estimated $\mathbf{r}_{vadv}$, the model is then encouraged to be smooth along its direction by minimizing the Local Distributional Smoothing (LDS) loss at each data point:
$$\mathrm{LDS} = D\left[p(y \mid \mathbf{x}; \hat{\theta}),\, p(y \mid \mathbf{x} + \mathbf{r}_{vadv}; \hat{\theta})\right] \qquad (2)$$
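As a rough illustration of how Eqs. 1 and 2 are typically computed with one extra forward-backward pass, here is a minimal PyTorch sketch; the function name `lds_loss`, the hyperparameters `eps` and `xi`, and the choice of KL as the divergence $D$ are our assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def _l2_normalize(d):
    # Normalize each sample in the batch to unit L2 norm
    d_flat = d.flatten(1)
    norm = d_flat.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return (d_flat / norm).view_as(d)

def lds_loss(model, x, eps=1.0, xi=1e-6):
    """One-step approximation of the virtual adversarial direction (Eq. 1)
    followed by the LDS penalty (Eq. 2), with KL as the divergence D."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)        # p(y|x; theta_hat), treated as a constant target

    # Finite-difference / power-iteration step starting from a random unit direction d
    d = _l2_normalize(torch.randn_like(x)).requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]      # direction of maximal change of the output

    # Scale the adversarial direction to the eps-ball and apply the LDS penalty
    r_vadv = eps * _l2_normalize(grad)
    logp_adv = F.log_softmax(model(x + r_vadv), dim=1)
    return F.kl_div(logp_adv, p, reduction="batchmean")
```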
Empirical results show that encouraging consistency along the virtual adversarial direction $\mathbf{r}_{vadv}$ yields a significant performance improvement compared to consistency under random perturbations. In the following, we will present a VAT-based approach for semi-supervised multiple-instance learning. We choose this particular technique because its underlying objective is conceptually intuitive and elegant, and its theoretical grounding is solid.
B. Multiple-Instance Learning
In Multiple-Instance Learning (MIL) we are again given a set of samples together with their labels, $D=\{(X_{i}, y_{i})\}_{i=1}^{L}$. The difference is that each sample is now itself a bag of instances, i.e. $X_{i}=\{\mathbf{x}_{i}^{1}, \mathbf{x}_{i}^{2}, \ldots, \mathbf{x}_{i}^{K}\}$ with $\mathbf{x}_{i}^{j} \in \mathbb{R}^{N}$, while $y_{i}$ refers to the entire bag $X_{i}$ and not to any one instance $\mathbf{x}_{i}^{j}$. The goal in this scenario is to learn a bag classifier. Since a bag is an unordered set of instances without dependencies between its members, our classifier should be permutation-invariant with respect to the ordering of the bag instances. Theoretical results [43] suggest that a bag function $f(X)$ is permutation-invariant if and only if it can be decomposed in the form:

$$f(X) = \rho\left(\mathbf{z}\right), \qquad \mathbf{z} = \sum_{\mathbf{x} \in X} \phi(\mathbf{x}) \qquad (3)$$
where $\phi$ is an embedding function $\mathbb{R}^{N} \mapsto \mathbb{R}^{M}$, $\mathbf{z} \in \mathbb{R}^{M}$ is the embedding of $X$ and $\rho$ a suitable transformation $\mathbb{R}^{M} \mapsto \mathcal{Y}$ (with $\mathcal{Y}$ we denote the label domain).
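For illustration, this decomposition maps directly onto a small embed-sum-classify network. The sketch below is our own, with placeholder layer sizes and names, not code from the paper:

```python
import torch
import torch.nn as nn

class SumPoolMIL(nn.Module):
    """Permutation-invariant bag classifier of the form rho(sum_j phi(x_j))."""
    def __init__(self, in_dim, emb_dim, n_classes):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())  # instance embedding
        self.rho = nn.Linear(emb_dim, n_classes)                         # bag-level head

    def forward(self, bag):
        # bag: (K, in_dim) -- an unordered set of K instances
        h = self.phi(bag)     # (K, emb_dim) instance embeddings
        z = h.sum(dim=0)      # permutation-invariant pooling -> bag embedding z
        return self.rho(z)    # bag-level logits
```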
For models based on neural networks, the transformation $\phi$ is usually a high-capacity CNN that is used either for feature extraction or for direct instance classification. The transformation $\rho$ is then either a classification head that takes us from the embedding space to the class space, or simply the identity. A rather interesting modification to the above stems from incorporating an attention mechanism on the sum of Equation 3. This approach [44] defines the bag embedding as a non-linear combination (with learnable parameters $\mathbf{V}, \mathbf{w}$) of the instance features:

$$\mathbf{z} = \sum_{j=1}^{K} a_{j}\, \phi(\mathbf{x}^{j}), \qquad a_{j} = \frac{\exp\left\{\mathbf{w}^{\top} \tanh\left(\mathbf{V}\, \phi(\mathbf{x}^{j})\right)\right\}}{\sum_{k=1}^{K} \exp\left\{\mathbf{w}^{\top} \tanh\left(\mathbf{V}\, \phi(\mathbf{x}^{k})\right)\right\}} \qquad (4)$$
The attention parameters $\mathbf{V}$ and $\mathbf{w}$ can be easily modelled as neural networks, thus allowing the whole model to be learnable end-to-end. In addition, attention scores provide an elegant way of identifying key instances within a bag. Attention-based MIL has been successfully applied to many problems [22], [45], [46]. Owing to its attractive properties and performance, we will use it as our core MIL model, which we will enhance in the next section with a semi-supervised component.
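To make the attention pooling concrete, here is a minimal PyTorch sketch of a bag embedding weighted by attention scores computed from parameters $\mathbf{V}$ and $\mathbf{w}$, in the spirit of [44]; module and dimension names are our own placeholders:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-weighted bag embedding: z = sum_j a_j * h_j,
    with a_j proportional to exp(w^T tanh(V h_j))."""
    def __init__(self, emb_dim, attn_dim):
        super().__init__()
        self.V = nn.Linear(emb_dim, attn_dim, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h):
        # h: (K, emb_dim) instance embeddings of a single bag
        scores = self.w(torch.tanh(self.V(h)))   # (K, 1) unnormalized attention scores
        a = torch.softmax(scores, dim=0)         # (K, 1) attention weights over instances
        z = (a * h).sum(dim=0)                   # (emb_dim,) bag embedding
        return z, a.squeeze(1)                   # weights returned for key-instance analysis
```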
III. Semi-Supervised Multiple-Instance Learning
In this section, we present our approach for utilizing unlabelled bags of instances in order to improve a multiple-instance classifier. As VAT provides a principled and elegant way of using unlabelled data, we elect to use it for semi-supervised MIL over other SSL approaches (e.g. Mean Teacher) that could also be applied directly to the same problem.
First, we extend VAT to the multiple-instance scenario. To that end, we introduce the concept of a bag perturbation, a set $R=(\mathbf{r}_{1}, \mathbf{r}_{2}, \ldots, \mathbf{r}_{K})$ that, when added element-wise to a given bag $X$, slightly perturbs it. The Multiple-Instance Local Distributional Smoothing (MI-LDS) loss can now be defined as:
$$\mathrm{MI\text{-}LDS}(X, \hat{\theta}) = D\left[p(y \mid X; \hat{\theta}),\, p\left(y \mid X + R_{vadv}; \hat{\theta}\right)\right] \qquad (5)$$
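As an illustration of how the MI-LDS term could be computed for a single unlabelled bag, here is a minimal sketch; it is our own, assuming a bag classifier `model` that maps a (K, N) bag tensor to bag-level logits, KL as the divergence $D$, and a single power-iteration step as in VAT:

```python
import torch
import torch.nn.functional as F

def mi_lds_loss(model, bag, eps=1.0, xi=1e-6):
    """MI-LDS for one unlabelled bag: a virtual adversarial bag perturbation
    R_vadv = (r_1, ..., r_K) is estimated with one power-iteration step, and the
    divergence between bag-level predictions on X and X + R_vadv is penalized."""
    with torch.no_grad():
        p = F.softmax(model(bag), dim=-1)            # bag-level prediction on X, kept constant

    # Random bag perturbation (one r_k per instance), normalized jointly over the bag
    R = torch.randn_like(bag)
    R = xi * R / R.norm().clamp_min(1e-12)
    R.requires_grad_(True)

    kl = F.kl_div(F.log_softmax(model(bag + R), dim=-1), p, reduction="sum")
    grad = torch.autograd.grad(kl, R)[0]             # direction changing the bag prediction most

    R_vadv = eps * grad / grad.norm().clamp_min(1e-12)  # virtual adversarial bag perturbation
    logp_adv = F.log_softmax(model(bag + R_vadv), dim=-1)
    return F.kl_div(logp_adv, p, reduction="sum")    # MI-LDS(X, theta_hat)
```

Note that constraining the perturbation norm jointly over the whole bag, rather than per instance, is one possible design choice in this sketch.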