Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Self-supervised sound source localization; possibly similar in spirit to Sound of Pixels.

2018.4

Paper: https://arxiv.org/pdf/1804.03641.pdf

Project page: http://andrewowens.com/multisensory


Abstract.

The thud of a bouncing ball, the onset of speech as lips open — when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator’s voice from a foreign official’s speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory.

1 Introduction

As humans, we experience our world through a number of simultaneous sensory streams. When we bite into an apple, not only do we taste it, but — as Smith and Gasser [1] point out — we also hear it crunch, see its red skin, and feel the coolness of its core. The coincidence of sensations gives us strong evidence that they were generated by a common, underlying event [2], since it is unlikely that they co-occurred across multiple modalities merely by chance. These cross-modal, temporal co-occurrences therefore provide a useful learning signal: a model that is trained to detect them ought to discover multimodal structures that are useful for other tasks. In much of traditional computer vision research, however, we have been avoiding the use of other, non-visual modalities, arguably making the perception problem harder, not easier.

In this paper, we learn a temporal, multisensory representation that fuses the visual and audio components of a video signal. We propose to train this model without using any manually labeled data. That is, rather than explicitly telling the model that, e.g., it should associate moving lips with speech or a thud with a bouncing ball, we have it discover these audio-visual associations through self-supervised training [3]. Specifically, we train a neural network on a “pretext” task of detecting misalignment between audio and visual streams in synthetically-shifted videos. The network observes raw audio and video streams — some of which are aligned, and some that have been randomly shifted by a few seconds — and we task it with distinguishing between the two. This turns out to be a challenging training task that forces the network to fuse visual motion with audio information and, in the process, learn a useful audio-visual feature representation. 

By comparison, Sound of Pixels is non-temporal, as can be seen from its model architecture.

We demonstrate the usefulness of our multisensory representation in three audiovisual applications: (a) sound source localization, (b) audio-visual action recognition; and (c) on/off-screen sound source separation. Figure 1 shows examples of these applications. In Fig. 1(a), we visualize the sources of sound in a video using our network’s learned attention map, i.e. the impact of an axe, the opening of a mouth, and moving hands of a musician. In Fig. 1(b), we show an application of our learned features to audio-visual action recognition, i.e. classifying a video of a chef chopping an onion. In Fig. 1(c), we demonstrate our novel on/off-screen sound source separation model’s ability to separate the speakers’ voices by visually masking them from the video. 

The main contributions of this paper are: 1) learning a general video representation that fuses audio and visual information; 2) evaluating the usefulness of this representation qualitatively (by sound source visualization) and quantitatively (on an action recognition task); and 3) proposing a novel video-conditional source separation method that uses our representation to separate on- and off-screen sounds, and is the first method to work successfully on real-world video footage, e.g. television broadcasts. Our feature representation, as well as code and models for all applications are available online.

2 Related work

Evidence from psychophysics. While we often think of vision and hearing as being distinct systems, in humans they are closely intertwined [4] through a process known as multisensory integration. Perhaps the most compelling demonstration of this phenomenon is the McGurk effect [5], an illusion in which visual motion of a mouth changes one’s interpretation of a spoken sound. Hearing can also influence vision: the timing of a sound, for instance, affects whether we perceive two moving objects to be colliding or overlapping [2]. Moreover, psychologists have suggested that humans fuse audio and visual signals at a fairly early stage of processing [7,8], and that the two modalities are used jointly in perceptual grouping. For example, the McGurk effect is less effective when the viewer first watches a video in which the audio and visuals are unrelated, as this causes the signals to become “unbound” (i.e. not grouped together) [9,10]. This multi-modal perceptual grouping process is often referred to as audio-visual scene analysis [11,7,12,10]. In this paper, we take inspiration from psychology and propose a self-supervised multisensory feature representation as a computational model of audio-visual scene analysis.

Self-supervised learning. Self-supervised methods learn features by training a model to solve a task derived from the input data itself, without human labeling. Starting with the early work of de Sa [3], there have been many self-supervised methods that learn to find correlations between sight and sound [13,14,15,16]. These methods, however, have either learned the correspondence between static images and ambient sound [15,16], or have analyzed motion in very limited domains [14,13] (e.g. [14] only modeled drumstick impacts). Our learning task resembles Arandjelović and Zisserman [16], which predicts whether an image and an audio track are sampled from the same (or different) videos. Their task, however, is solvable from a single frame by recognizing semantics (e.g. indoor vs. outdoor scenes). Our inputs, by contrast, always come from the same video, and we predict whether they are aligned; hence our task requires motion analysis to solve. Time has also been used as a supervisory signal, e.g. predicting the temporal ordering in a video [17,18,19]. In contrast, our network learns to analyze audio-visual actions, which are likely to correspond to salient physical processes.

Audio-visual alignment. While we study alignment for self-supervised learning, it has also been studied as an end in itself [20,21,22], e.g. in lip-reading applications [23]. Chung and Zisserman [22], the most closely related approach, train a two-stream network with an embedding loss. Since aligning speech videos is their end goal, they use a face detector (trained with labels) and a tracking system to crop the speaker’s face. This allows them to address the problem with a 2D CNN that takes 5 channel-wise concatenated frames cropped around a mouth as input (they also propose using their image features for self-supervision; while promising, these results are very preliminary).

Sound localization. The goal of visually locating the source of sounds in a video has a long history. The seminal work of Hershey et al. [24] localized sound sources by measuring mutual information between visual motion and audio using a Gaussian process model. Subsequent work also considered subspace methods [25], canonical correlations [26], and keypoints [27]. Our model learns to associate motions with sounds via self-supervision, without us having to explicitly model them.

Audio-Visual Source Separation. Blind source separation (BSS), i.e. separating the individual sound sources in an audio stream — also known as the cocktail party problem [28] — is a classic audio-understanding task [29]. Researchers have proposed many successful probabilistic approaches to this problem [30,31,32,33]. More recent deep learning approaches involve predicting an embedding that encodes the audio clustering [34,35], or optimizing a permutation invariant loss [36]. It is natural to also want to include the visual signal to solve this problem, often referred to as Audio-Visual Source Separation. For example, [37,25] masked frequencies based on their correlation with optical flow; [12] used graphical models; [27] used priors on harmonics; [38] used a sparsity-based factorization method; and [39] used a clustering method. Other methods use face detection and multi-microphone beamforming [40]. These methods make strong assumptions about the relationship between sound and motion, and have mostly been applied to lab-recorded video. Researchers have proposed learning-based methods that address these limitations, e.g. [41] use mixture models to predict separation masks. Recently, [42] proposed a convolutional network that isolates on-screen speech, although this model is relatively small-scale (tested on videos from one speaker). We do on/off-screen source separation on more challenging internet and broadcast videos by combining our representation with a u-net [43] regression model.

Concurrent work. Concurrently and independently from us, a number of groups have proposed closely related methods for source separation and sound localization. Gabbay et al. [44,45] use a vision-to-sound method to separate speech, and propose a convolutional separation model. Unlike our work, they assume speaker identities are known. Ephrat et al. [46] and Afouras et al. [47] separate the speech of a user-chosen speaker from videos containing multiple speakers, using face detection and tracking systems to group the different speakers. Work by Zhao et al. [48] and Gao et al. [49] separates sound for multiple visible objects (e.g. musical instruments). This task involves associating objects with the sounds they typically make based on their appearance, while ours involves the “fine-grained” motion-analysis task of separating multiple speakers. There has also been recent work on localizing sound sources using a network’s attention map [50,51,52]. These methods are similar to ours, but they largely localize objects and ambient sound in static images, while ours responds to actions in videos.

3 Learning a self-supervised multisensory representation

We propose to learn a representation using self-supervision, by training a model to predict whether a video’s audio and visual streams are temporally synchronized.

Figure 2: Fused audio-visual network. We train an early-fusion multisensory network to predict whether video frames and audio are temporally aligned. We include residual connections between pairs of convolutions [53]. We represent the input as a T × H × W volume, and denote strides with “/2”. To generate misaligned samples, we synthetically shift the audio by a few seconds.

This network may be used to produce the final outputs, or to provide an optimization target, similar to the discriminator D in a GAN.

Aligning sight with sound. During training, we feed a neural network video clips. In half of them, the vision and sound streams are synchronized; in the others, we shift the audio by a few seconds. We train a network to distinguish between these examples. More specifically, we learn a model p_θ(y | I, A) that predicts whether the image stream I and audio stream A are synchronized, by maximizing the log-likelihood:
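E_{I,A,t} [ (1/2) log p_θ(y = 1 | I, A_0) + (1/2) log p_θ(y = 0 | I, A_t) ]    (1)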

where A_s is the audio track shifted by s seconds, t is a random temporal shift, θ are the model parameters, and y is the event that the streams are synchronized. This learning problem is similar to noise-contrastive estimation [54], which trains a model to distinguish between real examples and noise; here, the noisy examples are misaligned videos.

Fused audio-visual network design. Solving this task requires the integration of low-level information across modalities. In order to detect misalignment in a video of human speech, for instance, the model must associate the subtle motion of lips with the timing of utterances in the sound. We hypothesize that early fusion of audio and visual streams is important for modeling actions that produce a signal in both modalities. We therefore propose to solve our task using a 3D multisensory convolutional network (CNN) with an early-fusion design (Figure 2).

Before fusion, we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations. The fused network then undergoes a series of 3D convolutions, followed by global average pooling [55]. We add residual connections between pairs of convolutions. We note that the network architecture resembles ResNet-18 [53] but with the extra audio subnetwork, and 3D convolutions instead of 2D ones (following work on inflated convolutions [56]).
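To make the fusion step concrete, here is a minimal PyTorch-style sketch of the early-fusion idea described above. It is not the paper’s released implementation: the layer counts, channel widths, and kernel sizes are illustrative placeholders, and the residual connections of the actual trunk are omitted for brevity.

```python
import torch
import torch.nn as nn

class EarlyFusionAVNet(nn.Module):
    """Illustrative early-fusion audio-visual network (not the paper's exact architecture)."""
    def __init__(self):
        super().__init__()
        # Video branch: a few 3D convs that downsample time by a factor of 4.
        self.video = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Audio branch: strided 1D convs over the raw stereo waveform, reducing its rate.
        self.audio = nn.Sequential(
            nn.Conv1d(2, 64, kernel_size=65, stride=4, padding=32), nn.ReLU(inplace=True),
            nn.Conv1d(64, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(inplace=True),
            nn.Conv1d(64, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(inplace=True),
        )
        # Fused trunk: 3D convs over the channel-wise concatenation of the two streams.
        self.fused = nn.Sequential(
            nn.Conv3d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(128, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(128, 1)  # aligned vs. misaligned logit

    def forward(self, frames, waveform):
        # frames: (B, 3, T, H, W), waveform: (B, 2, S)
        v = self.video(frames)                              # (B, 64, T', H', W')
        a = self.audio(waveform)                            # (B, 64, Ta)
        a = nn.functional.interpolate(a, size=v.shape[2])   # match the video's temporal length
        # Spatially tile audio activations, then concatenate channel-wise.
        a = a[:, :, :, None, None].expand(-1, -1, -1, v.shape[3], v.shape[4])
        x = torch.cat([v, a], dim=1)
        x = self.fused(x)
        x = x.mean(dim=[2, 3, 4])                           # global average pooling
        return self.classifier(x)                           # alignment logit
```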

Training. We train our model with 4.2-second videos, randomly shifting the audio by 2.0 to 5.8 seconds. We train our model on a dataset of approximately 750,000 videos randomly sampled from AudioSet [57]. We use full frame-rate videos (29.97 Hz), resulting in 125 frames per example. We select random 224 × 224 crops from resized 256 × 256 video frames, apply random left-right flipping, and use 21 kHz stereo sound. We sample these video clips from longer (10 sec.) videos. Optimization details can be found in Section A1.
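A minimal sketch of how such aligned/misaligned training pairs might be constructed. The clip length, shift range, and sample rates follow the numbers above; the function name and the choice of a random shift direction are illustrative assumptions.

```python
import random

FPS = 29.97            # video frame rate
SR = 21000             # audio sample rate (Hz)
CLIP_SEC = 4.2         # clip length used for the alignment task

def make_alignment_example(frames, waveform, start_sec):
    """Return (video_clip, audio_clip, label): label=1 if aligned, 0 if shifted.

    frames:   (T, H, W, 3) array for a full ~10-second video
    waveform: (S, 2) stereo waveform for the same video
    """
    aligned = random.random() < 0.5
    # Shift direction is an assumption; the paper only states a 2.0-5.8 s shift.
    shift = 0.0 if aligned else random.uniform(2.0, 5.8) * random.choice([-1, 1])

    f0 = int(start_sec * FPS)
    video_clip = frames[f0 : f0 + int(CLIP_SEC * FPS)]

    a0 = int((start_sec + shift) * SR)
    a0 = max(0, min(a0, waveform.shape[0] - int(CLIP_SEC * SR)))  # keep the shift in bounds
    audio_clip = waveform[a0 : a0 + int(CLIP_SEC * SR)]

    return video_clip, audio_clip, int(aligned)
```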

Task performance. We found that the model obtained 59.9% accuracy on held-out videos for its alignment task (chance = 50%). While at first glance this may seem low, we note that in many videos the sounds occur off-screen [15]. Moreover, we found that this task is also challenging for humans. To get a better understanding of human ability, we showed 30 participants from Amazon Mechanical Turk 60 aligned/shifted video pairs, and asked them to identify the one with out-of-sync sound. We gave them 15 seconds of video (so they have significant temporal context) and used large, 5-second shifts. They solved the task with 66.6% ± 2.4% accuracy.

To help understand what actions the model can predict synchronization for, we also evaluated its accuracy on categories from the Kinetics dataset [58] (Figure A1). It was most successful for classes involving human speech, e.g. news anchoring, answering questions, and testifying. Of course, the most important question is whether the learned audio-visual representation is useful for downstream tasks. We therefore turn our attention to applications.

4 Visualizing the locations of sound sources

One way of evaluating our representation is to visualize the audio-visual structures that it detects. A good audio-visual representation, we hypothesize, will pay special attention to visual sound sources — on-screen actions that make a sound, or whose motion is highly correlated with the onset of sound. We note that there is ambiguity in the notion of a sound source for in-the-wild videos. For example, a musician’s lips, their larynx, and their tuba could all potentially be called the source of a sound. Hence we use this term to refer to motions that are correlated with production of a sound, and study it through network visualizations.

To do this, we apply the class activation map (CAM) method of Zhou et al. [59], which has been used for localizing ambient sounds [52]. Given a space-time video patch I_x, its corresponding audio A_x, and the features assigned to them by the last convolutional layer of our model, f(I_x, A_x), we can estimate the probability of alignment with:
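p(y | I_x, A_x) = σ(w⊤ f(I_x, A_x))    (2)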

where y is the binary alignment label, σ the sigmoid function, and w is the model’s final affine layer. We can therefore measure the information content of a patch — and, by our hypothesis, the likelihood that it is a sound source — by the magnitude of the prediction |w⊤ f(I_x, A_x)|.
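A rough sketch of this visualization procedure, assuming the network exposes its last convolutional feature map and its final affine layer. The attribute names `features` and `classifier` are hypothetical; shapes are illustrative.

```python
import torch

def cam_heatmap(model, frames, waveform):
    """Class-activation-style sound source map: |w^T f(I_x, A_x)| per space-time location.

    Assumes `model.features(frames, waveform)` returns the last conv activations with
    shape (B, C, T, H, W) and `model.classifier` is a linear layer with weight (1, C).
    Both attribute names are hypothetical.
    """
    with torch.no_grad():
        feats = model.features(frames, waveform)            # (B, C, T, H, W)
        w = model.classifier.weight.view(-1)                 # (C,)
        # Project each space-time feature vector onto w, then take the magnitude.
        scores = torch.einsum('bcthw,c->bthw', feats, w)     # (B, T, H, W)
        return scores.abs()                                  # higher = more informative patch
```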

One might ask how this self-supervised approach to localization relates to generative approaches, such as classic mutual information methods [24,25]. To help understand this, we can view our audio-visual observations as having been produced by a generative process (using an analysis similar to [60]): we sample the label y, which determines the alignment, and then conditionally sample I_x and A_x. Rather than computing mutual information between the two modalities (which requires a generative model that self-supervised approaches do not have), we find the patch/sound that provides the most information about the latent variable y, based on our learned model p(y | I_x, A_x).

Visualizations. What actions does our network respond to? First, we asked which space-time patches in our test set were most informative, according to Equation 2. We show the top-ranked patches in Figure 3, with the class activation map displayed as a heatmap and overlaid on its corresponding video frame. From this visualization, we can see that the network is selective to faces and moving mouths. The strongest responses that are not faces tend to be unusual but salient audio-visual stimuli (e.g. two top-ranking videos contain strobe lights and music). For comparison, we show the videos with the weakest response in Figure 4; these contain relatively few faces.

Next, we asked how the model responds to videos that do not contain speech, and applied our method to the Kinetics-Sounds dataset [16] — a subset of Kinetics [58] classes that tend to contain a distinctive sound. We show the examples with the highest response for a variety of categories, after removing examples in which the response was solely to a face (which appear in almost every category). We show results in Figure 5.

Finally, we asked how the model’s attention varies with motion. To study this, we computed our CAM-based visualizations for videos, which we have included in the supplementary video (we also show some hand-chosen examples in Figure 1(a)). These results qualitatively suggest that the model’s attention varies with on-screen motion. This is in contrast to single-frame models [50,52,16], which largely attend to sound-making objects rather than actions.

5 Action recognition

We have seen through visualizations that our representation conveys information about sound sources. We now ask whether it is useful for recognition tasks. To study this, we fine-tuned our model for action recognition using the UCF-101 dataset [64], initializing the weights with those learned from our alignment task. We provide the results in Table 1, and compare our model to other unsupervised learning and 3D CNN methods.

UCF101 is an action recognition dataset collected from YouTube, covering 101 action classes. Each class is performed by 25 people, each contributing 4-7 groups of clips, for a total of 13,320 videos at 320×240 resolution (about 6.5 GB). The clips are highly diverse in how the actions are captured, including camera motion, appearance, pose, object scale, background, and lighting changes. The 101 classes fall into five groups: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.

We train with 2.56-second subsequences, following [56], which we augment with random flipping and cropping, and small (up to one frame) audio shifts. At test time, we follow [65] and average the model’s outputs over 25 clips from each video, and use a center 224 × 224 crop. Please see Section A1 for optimization details.

Analysis. We see, first, that our model significantly outperforms self-supervised approaches that have previously been applied to this task, including Shuffle-and-Learn [17] (82.1% vs. 50.9% accuracy) and O3N [19] (60.3%). We suspect this is in part due to the fact that these methods either process a single frame or a short sequence, and they solve tasks that do not require extensive motion analysis. We then compared our model to methods that use supervised pretraining, focusing on the state-of-the-art I3D [56] model. While there is a large gap between our self-supervised model and a version of I3D that has been pretrained on the closely-related Kinetics dataset (94.5%), the performance of our model (with both sound and vision) is close to the (visual-only) I3D pretrained with ImageNet [66] (84.2%).

Next, we trained our multisensory network with the self-supervision task of [16] rather than our own, i.e. creating negative examples by randomly pairing the audio and visual streams from different videos, rather than by introducing misalignment. We found that this model performed significantly worse than ours (78.7%), perhaps due to the fact that its task can largely be solved without analyzing motion.

Finally, we asked how components of our model contribute to its performance. To test whether the model is obtaining its predictive power from audio, we trained a variation of the model in which the audio subnetwork was ablated (activations set to zero), finding that this results in a 5% drop in performance. This suggests both that sound is important for our results, and that our visual features are useful in isolation. We also tried training a variation of the model that operated on spectrograms, rather than raw waveforms, finding that this yielded similar performance (Section A2). To measure the importance of our self-supervised pretraining, we compared our model to a randomly initialized network (i.e. trained from scratch), finding that there was a significant (14%) drop in performance — similar in magnitude to removing ImageNet pretraining from I3D. These results suggest that the model has learned a representation that is useful both for vision-only and audio-visual action recognition.

6 On/off-screen audio-visual source separation

We now apply our representation to a classic audio-visual understanding task: separating on- and off-screen sound. To do this, we propose a source separation model that uses our learned features. Our formulation of the problem resembles recent audio-visual and audio-only separation work [34,36,67,42]. We create synthetic sound mixtures by summing an input video’s (“on-screen”) audio track with a randomly chosen (“off-screen”) track from a random video. Our model is then tasked with separating these sounds.

Task. We consider models that take a spectrogram of the mixed audio as input and recover spectrograms for the two mixture components. Our simplest on/off-screen separation model learns to minimize:
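||x_F − f_F(x_M, I)||_1 + ||x_B − f_B(x_M, I)||_1    (3)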

where x_M is the mixture sound, x_F and x_B are the spectrograms of the on- and off-screen sounds that comprise it (i.e. foreground and background), and f_F and f_B are our model’s predictions of them, conditional on the (audio-visual) video I.

Figure 6: Adapting our audio-visual network to a source separation task. Our model separates an input spectrogram into on- and off-screen audio streams. After each temporal downsampling layer, our multisensory features are concatenated with those of a u-net computed over spectrograms. We invert the predicted spectrogram to obtain a waveform. The model operates on raw video, without any preprocessing (e.g. no face detection).

We also consider models that segment the two sounds without regard for their on- or off-screen provenance, using the permutation invariant loss (PIT) of Yu et al. [36]. This loss is similar to Equation 3, but it allows the on- and off-screen sounds to be swapped without penalty:
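min( L(x̂_1, x̂_2), L(x̂_2, x̂_1) )    (4)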

where L(x_i, x_j) = ||x_i − x_F||_1 + ||x_j − x_B||_1, and x̂_1 and x̂_2 are the predictions.
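As a concrete illustration, the permutation invariant loss can be written in a few lines. This is a sketch under the definitions above; the spectrogram tensors can have any shape.

```python
import torch

def l1_pair(x_i, x_j, x_fg, x_bg):
    """L(x_i, x_j) = ||x_i - x_F||_1 + ||x_j - x_B||_1 for spectrogram tensors."""
    return (x_i - x_fg).abs().sum() + (x_j - x_bg).abs().sum()

def pit_loss(pred_1, pred_2, x_fg, x_bg):
    """Permutation invariant loss: only the better of the two assignments is penalized."""
    return torch.min(l1_pair(pred_1, pred_2, x_fg, x_bg),
                     l1_pair(pred_2, pred_1, x_fg, x_bg))
```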

6.1 Source separation model

We augment our audio-visual network with a u-net encoder-decoder [43,69,70] that maps the mixture sound to its on- and off-screen components (Figure 6). To provide the u-net with video information, we include our multisensory network’s features at three temporal scales: we concatenate the last layer of each temporal scale with the layer of the encoder that has the closest temporal sampling rate. Prior to concatenation, we use linear interpolation to make the video features match the audio sampling rate; we then mean-pool them spatially, and tile them over the frequency domain, thereby reshaping our 3D CNN’s time/height/width shape to match the 2D encoder’s time/frequency shape. We use parameters for u-net similar to [69], adding one pair of convolution layers to compensate for the large number of frequency channels in our spectrograms. We predict both the magnitude of the log-spectrogram and its phase (we scale the phase loss by 0.01 since it is less perceptually important). To obtain waveforms, we invert the predicted spectrogram. We emphasize that our model uses raw video, with no preprocessing or labels (e.g. no face detection or pretrained supervised features).
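The feature-injection step described above might look roughly like this. It is a sketch that assumes the multisensory features arrive as (B, C, T, H, W) tensors and the u-net encoder activations as (B, C', T', F) time-frequency tensors; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def inject_video_features(video_feats, encoder_feats):
    """Condition one u-net encoder layer on multisensory video features.

    video_feats:   (B, C, T, H, W)  activations from one temporal scale of the 3D CNN
    encoder_feats: (B, C', T', F)   activations of the u-net encoder layer with the
                                    closest temporal sampling rate
    """
    T_audio, F_bins = encoder_feats.shape[2], encoder_feats.shape[3]

    v = video_feats.mean(dim=[3, 4])                     # spatial mean-pooling -> (B, C, T)
    v = F.interpolate(v, size=T_audio, mode='linear',    # match the audio (time) rate
                      align_corners=False)               # -> (B, C, T')
    v = v[:, :, :, None].expand(-1, -1, -1, F_bins)      # tile over frequency -> (B, C, T', F)
    return torch.cat([encoder_feats, v], dim=1)          # channel-wise concatenation
```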

Training. We evaluated our model on the task of separating speech sounds using the VoxCeleb dataset [71]. We split the training/test sets to have disjoint speaker identities (72%, 8%, and 20% for training, validation, and test). During training, we sampled 2.1-second clips from longer 5-second clips, and normalized each waveform’s mean squared amplitude to a constant value. We used spectrograms with a 64 ms frame length and a 16 ms step size, producing 128 × 1025 spectrograms. In each mini-batch of the optimization, we randomly paired video clips, making one the off-screen sound for the other. We jointly optimized our multisensory network and the u-net model, initializing the weights using our self-supervised representation (see supplementary material for details).

6.2 Evaluation

We compared our model to a variety of separation methods: 1) we replaced our self-supervised video representation with other features; 2) we compared to audio-only methods that use blind separation; and 3) we compared to other audio-visual models.

Ablations. Since one of our main goals is to evaluate the quality of the learned features, we compared several variations of our model (Table 2). First, we replaced the multisensory features with the I3D network [56] pretrained on the Kinetics dataset — a 3D CNN-based representation that was very effective for action recognition (Section 5). This model performed significantly worse (11.4 vs. 12.3 spectrogram ℓ1 loss for Equation 3). One possible explanation is that our pretraining task requires extensive motion analysis, whereas even single-frame action recognition can still perform well [65,72].

Figure 7: Qualitative results from our on/off-screen separation model. We show input frames and spectrograms for two synthetic mixtures from our test set, and for two in-the-wild internet videos that contain multiple speakers. The first (a male/male mixture) contains more artifacts than the second (a female/male mixture). The third video is a real-world mixture in which a female speaker (simultaneously) translates a male Spanish speaker into English. Finally, we separate the speech of two (male) speakers on a television news show. While these real-world examples have no ground truth, the source separation method qualitatively separates the two voices. Please see our webpage (http://andrewowens.com/multisensory) for video source separation results.

We then asked how much of our representation’s performance comes from motion features, rather than from recognizing properties of the speaker (e.g. gender). To test this, we trained the model with only a single frame (replicated temporally to make a video). We found a significant drop in performance (11.4 vs. 14.8 loss). The drop was particularly large for mixtures in which two speakers had the same gender — a case where lip motion is an important cue.

One might also ask whether early audio-visual fusion is helpful — the network, after all, fuses the modalities in the spectrogram encoder-decoder as well. To test this, we ablated the audio stream of our multisensory network and retrained the separation model. This model obtained worse performance, suggesting the fused audio is helpful even when it is available elsewhere. Finally, while the encoder-decoder uses only monaural audio, our representation uses stereo. To test whether it uses binaural cues, we converted all the audio to mono and re-evaluated it. We found that this did not significantly affect performance, which is perhaps due to the difficulty of using stereo cues in in-the-wild internet videos (e.g. 39% of the audio tracks were mono). Finally, we also transferred (without retraining) our learned models to the GRID dataset [73], a lab-recorded dataset in which people speak simple phrases in front of a plain background, finding a similar relative ordering of the methods.

Audio-only separation. To get a better understanding of our model’s effectiveness, we compared it to audio-only separation methods. While these methods are not applicable to on/off-screen separation, we modified our model to have it separate audio using an extra permutation invariant loss (Equation 4) and then compared the methods using blind separation metrics [68]: signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). For consistency across methods, we resampled predicted waveforms to 16 kHz (the minimum used by all methods), and used the mixture phase to invert our model’s spectrogram, rather than the predicted phase (which none of the others predict).
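Reconstructing a waveform from a predicted magnitude and the mixture’s phase might look like this. It is a sketch that reuses the (assumed) STFT parameters from the training sketch above; the function name is illustrative.

```python
import torch

def invert_with_mixture_phase(pred_mag, mixture_spec, n_fft=2048, hop=336, win=1344):
    """Combine predicted magnitudes with the mixture's phase, then invert the STFT.

    pred_mag:     (B, freq_bins, frames) predicted magnitude spectrogram
    mixture_spec: (B, freq_bins, frames) complex STFT of the input mixture
    """
    phase = torch.angle(mixture_spec)              # reuse the mixture's phase
    complex_spec = torch.polar(pred_mag, phase)    # magnitude * exp(i * phase)
    return torch.istft(complex_spec, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=torch.hann_window(win))
```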

We compared our model to PIT-CNN [36]. This model uses a VGG-style [74] CNN to predict two soft separation masks via a fully connected layer. These maps are multiplied by the input mixture to obtain the segmented streams. While this method worked well on short clips, we found it failed on longer inputs (e.g. obtaining 1.8 SDR in the experiment shown in Table 2). To create a stronger PIT baseline, we therefore created an audio-only version of our u-net model, optimizing the PIT loss instead of our on/offscreen loss, i.e. replacing the VGG-style network and masks with u-net. We confirmed that this model obtains similar performance on short sequences (Table 3), and found it successfully trained on longer videos. Finally, we compared with a pretrained separation model [67], which is based on recurrent networks and trained on the TSP dataset [75].

We found that our audio-visual model, when trained with a PIT loss, outperformed all of these methods, except on the SAR metric (which largely measures the presence of artifacts in the generated waveform), where the u-net PIT model was slightly better. In particular, our model did significantly better than the audio-only methods when the genders of the two speakers in the mixture were the same (Table 2). Interestingly, we found that the audio-only methods still performed better on blind separation metrics when transferring to the lab-recorded GRID dataset, which we hypothesize is due to the significant domain shift.

Audio-visual separation. We compared to the audio-visual separation model of Hou et al. [42]. This model was designed for enhancing the speech of a previously known speaker, but we apply it to our task since it is the most closely related prior method. We also evaluated the network of Gabbay et al. [45] (a concurrent approach to ours). We trained these models using the same procedure as ours ([45] used speaker identities to create hard mixtures; we instead assumed speaker identities are unknown and mixed randomly). Both models take very short (5-frame) video inputs. Therefore, following [45] we evaluated 200 ms videos (Table 3). For these baselines, we cropped the video around the speaker’s mouth using the Viola-Jones [76] lip detector of [45] (we do not use face detection for our own model). These methods use a small number of frequency bands in their (Mel-) STFT representations, which limits their quantitative performance. To address these limitations, we evaluated only the on-screen audio, and downsampled the audio to a low, common rate (2 kHz) before computing SDR. Our model significantly outperforms these methods. Qualitatively, we observed that [45] often smooths the input spectrogram, and we suspect its performance on source separation metrics may be affected by the relatively small number of frequency bands in its audio representation.

6.3 Qualitative results

Our quantitative results suggest that our model can successfully separate on- and offscreen sounds. However, these metrics are limited in their ability to convey the quality of the predicted sound (and are sensitive to factors that may not be perceptually important, such as the frequency representation). Therefore, we also provide qualitative examples.

Real mixtures. In Figure 7, we show results for two synthetic mixtures from our test set, and two real-world mixtures: a simultaneous Spanish-to-English translation and a television interview with concurrent speech. We exploit the fact that our model is fully convolutional to apply it to these 8.3-second videos (4× longer than the training videos). We include additional source separation examples in the videos on our webpage. This includes a random sample of (synthetically mixed) test videos, as well as results on in-the-wild videos that contain both on- and off-screen sound.

Multiple on-screen sound sources. To demonstrate our model’s ability to vary its prediction based on the speaker, we took a video in which two people are speaking on a TV debate show, visually masked one side of the screen (similar to [25]), and ran our source separation model. As shown in Figure 1, when the speaker on the left is hidden, we hear the speaker on the right, and vice versa. Please see our video for results.

Large-scale training. We trained a larger variation of our model on significantly more data. For this, we combined the VoxCeleb and VoxCeleb2 [77] datasets (approximately 8× as many videos), as in [47], and modeled ambient sounds by sampling background audio tracks from AudioSet approximately 8% of the time. To provide more temporal context, we trained with 4.1-second videos (approximately 256 STFT time samples). We also simplified the model by decreasing the spectrogram frame length to 40 ms (513 frequency samples) and increased the weight of the phase loss to 0.2. Please see our webpage for results.

7 Discussion

In this paper, we presented a method for learning a temporal multisensory representation, and we showed through experiments that it was useful for three downstream tasks: (a) pretraining action recognition systems, (b) visualizing the locations of sound sources, and (c) on/off-screen source separation. We see this work as opening two potential directions for future research. The first is developing new methods for learning fused multisensory representations. We presented one method — detecting temporal misalignment — but one could also incorporate other learning signals, such as the information provided by ambient sound [15]. The other direction is to use our representation for additional audio-visual tasks. We presented several applications here, but there are other audio-understanding tasks that could potentially benefit from visual information and, likewise, visual applications that could benefit from fused audio information.
