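Before the Main Text

It has been a long time since I last wrote anything; I simply don't have the enthusiasm or the time I used to. Take a look at my recent exam and report schedule.

Life is impossible right now! All I can do is pull all-nighters. This post is my translation of the paper I presented in class for the Computer System Security course a while back. I went through it sentence by sentence with Google Translate; it still doesn't read entirely smoothly, but it should be a lot better than struggling through the original on your own.

Main Text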

Multimedia protocol tunneling enables the creation of covert channels by modulating data into the input of popular multimedia applications such as Skype. To be effective, protocol tunneling must be unobservable, i.e., an adversary should not be able to distinguish the streams that carry a covert channel from those that do not. However, existing multimedia protocol tunneling systems have been evaluated using ad hoc methods, which casts doubts on whether such systems are indeed secure, for instance, for censorship-resistant communication.

多媒体协议隧道技术通过将数据调制成Skype等流行多媒体应用程序的输入来创建隐蔽通道。为了有效，协议隧道必须是不可观测的，即对手不应该能够区分带有隐蔽通道的流和没有隐蔽通道的流。然而，现有的多媒体协议隧道系统一直是用临时性的（ad hoc）方法进行评估的，这使人怀疑这些系统是否确实安全，例如用于抗审查通信时。

In this paper, we conduct an experimental study of the unobservability properties of three state-of-the-art systems: Facet, CovertCast, and DeltaShaper. Our work unveils that previous claims regarding the unobservability of the covert channels produced by those tools were flawed and that existing machine learning techniques, namely those based on decision trees, can uncover the vast majority of those channels while incurring comparatively lower false positive rates. We also explore the application of semi-supervised and unsupervised machine learning techniques. Our findings suggest that the existence of manually labeled samples is a requirement for the successful detection of covert channels.

在本文中，我们对三种最先进系统的不可观察性进行了实验研究：Facet，CovertCast和DeltaShaper。我们的工作揭示了先前关于这些工具产生的隐蔽通道不可观察性的说法是有缺陷的，现有的机器学习技术，即基于决策树的技术，可以揭示绝大多数这些渠道，同时产生相对较低的误报率。我们还探讨了半监督和无监督机器学习技术的应用。我们的研究结果表明，手动标记样本的存在是成功检测隐蔽通道的必要条件。

1 Introduction

Multimedia protocol tunneling has emerged as a potentially effective technique to create covert channels which are difficult to identify. In a nutshell, this technique consists of encoding covert data into the video (and / or audio) channel of popular encrypted streaming applications such as Skype without requiring any changes to the carrier application. Systems such as Facet [30], CovertCast [34], and DeltaShaper [2] implement this technique, and introduce different approaches for data modulation that aim at raising the difficulty of an adversary to identify covert data transmissions.

多媒体协议隧道已经成为创建难以识别的隐蔽信道的潜在有效技术。简而言之，这种技术包括将隐蔽数据编码到流行的加密流媒体应用程序（如Skype）的视频（和/或音频）通道中，而无需对运营商应用程序进行任何更改。诸如Facet [30]，CovertCast [34]和DeltaShaper [2]等系统实现了这种技术，并引入了不同的数据调制方法，旨在提高对手识别隐蔽数据传输的难度。

An important property that all these systems strive to achieve is unobservability. A covert channel is deemed unobservable if an adversary that is able to scan any number of streams is not able to distinguish those that carry a covert channel from those that do not [20, 23]. Thus, an adversary aims at correctly detecting all streams that carry covert channels, among a set of genuine streams, as effectively as possible. In practice, a multimedia protocol tunneling system that provides a high degree of unobservability prevents an adversary from flagging a large fraction of covert flows (i.e., from attaining a high true positive rate) while flagging a low amount of regular traffic (i.e., while attaining a low false positive rate).

所有这些系统努力实现的一个重要特性是不可观察性。如果一个能够扫描任意数量流的对手无法区分携带隐蔽通道的流和不携带隐蔽通道的流，那么该隐蔽通道就被认为是不可观察的[20,23]。因此，对手的目标是在一组真实流中尽可能有效地正确检测出所有携带隐蔽通道的流。在实践中，提供高度不可观察性的多媒体协议隧道系统能够防止对手在标记大部分隐蔽流（即获得高真阳性率）的同时只标记少量常规流量（即获得低假阳性率）。

In spite of the efforts to build unobservable systems, the methodology currently employed for their evaluation raises concerns. To assess the unobservability of a system such as Facet, experiments are mounted in order to play regular traffic along with covert traffic, collect the resulting traces, and employ similarity-based classifiers (e.g., relying on the χ2 similarity function) to determine whether covert traffic can be detected with a low number of false positives [30]. However, each system has been evaluated with a different classifier, making results hard to compare. Furthermore, those studies use just one among the many machine learning (ML) techniques available today. Yet, providing a common ground for assessing the unobservability of multimedia protocol tunneling systems is a relevant problem which, nevertheless, has been overlooked in the literature. Considering that such systems emerged from the need to circumvent Internet censorship, flawed systems may pose life-threatening risks to end-users, e.g., journalists that report news in extreme conditions may be prosecuted, imprisoned, or even murdered if covert channels are detected.

尽管人们努力建立不可观察的系统，但目前用于评估这些系统的方法令人担忧。为了评估诸如Facet之类的系统的不可观察性，需要搭建实验，将常规流量与隐蔽流量一起播放，收集得到的流量记录，并采用基于相似性的分类器（例如，依赖于χ2相似度函数）来确定是否能在误报较少的情况下检测出隐蔽流量[30]。但是，每个系统都使用不同的分类器进行评估，结果难以比较。此外，这些研究仅使用了当今众多机器学习（ML）技术中的一种。然而，为评估多媒体协议隧道系统的不可观察性提供共同基础是一个重要问题，该问题却在文献中被忽略了。考虑到这些系统源于规避互联网审查的需要，有缺陷的系统可能对最终用户构成威胁生命的风险，例如，如果检测到隐蔽信道，在极端条件下报道新闻的记者可能会被起诉、监禁甚至谋杀。

To fill this gap, our goal is to systematically assess the unobservability of existing systems against powerful adversaries making use of traffic analysis techniques based on ML. We aim at understanding which ML techniques are better suited for the purpose of detecting covert channels in multimedia streams and what are the limitations of such techniques. In particular, we seek to explore ML techniques which have yielded successful results when applied in other domains (e.g., Tor hidden services fingerprinting [22]), but have not yet been studied in the context of covert traffic detection.

为了填补这一空白，我们的目标是系统地利用基于ML的流量分析技术评估现有系统对强大对手的不可观察性。我们的目标是了解哪种ML技术更适合于检测多媒体流中的隐蔽信道，以及这些技术的局限性。特别是，我们寻求探索ML技术，这些技术在应用于其他领域时已经取得了成功的结果（例如，Tor隐藏服务指纹识别[22]），但尚未在隐蔽流量检测的背景下进行研究。

In this paper, we present the first experimental study of the unobservability of covert channels produced by state-of-the-art multimedia protocol tunneling systems. We test three systems – Facet, CovertCast, and DeltaShaper – using the original code provided by their maintainers. For our study, we take a systematic approach by investigating a spectrum of anomaly detection techniques, ranging from supervised, to semi-supervised and unsupervised, where for each category we explore different classifiers, and investigate the trade-offs involved in the ability to flag a large amount of covert channels while minimizing false positives. From our study, we highlight the following three main contributions.

在本文中，我们首次对最先进的多媒体协议隧道系统所产生的隐蔽通道的不可观察性进行了实验研究。我们使用维护者提供的原始代码测试了三个系统 - Facet，CovertCast和DeltaShaper。在研究中，我们采取系统化的方法，考察了从监督、半监督到无监督的一系列异常检测技术，针对每一类探索不同的分类器，并研究在标记大量隐蔽通道的同时尽量减少误报所涉及的权衡。从我们的研究中，我们强调以下三个主要贡献。

First, our analysis reveals that some state-of-the-art systems are flawed. In particular, CovertCast flows can be detected with few false positives by an adversary, even when resorting to existing similarity-based classifiers. While the remaining systems exhibit different degrees of unobservability according to their parameterization, we show that none of the currently employed similarity-based classifiers can detect such channels without incurring large numbers of false positives. We also conclude that one of the existing similarity-based classifiers – using χ2 distance – consistently outperforms all others in the task of detecting covert channels.

首先，我们的分析表明，一些最先进的系统存在缺陷。特别是，即使在使用现有的基于相似性的分类器时，对手也可以几乎没有误报的检测到CovertCast流。虽然其余系统根据其参数化表现出不同程度的不可观察性，但我们表明当前使用的基于相似性的分类器都不能在不产生大量的误报的情况下检测到这样的通道。我们还得出结论，现有的基于相似性的分类器之一 - 使用χ2距离 - 在检测隐蔽通道的任务中始终优于所有其他分类器。

Second, we show that ML techniques based on decision trees and some of their variants are extremely effective at detecting covert traffic with reduced false positive rates. For example, an adversary employing XGBoost would be able to flag 90% of all Facet traffic while erroneously flagging only 2% of legitimate connections. Moreover, the performance of such techniques is very high, meaning that the adversary is able to classify traffic in a few seconds, with a relatively low number of samples per training set, and taking a low memory footprint. Additionally, the use of decision tree-based techniques allows us to understand which traffic features are most important for detecting the functioning of particular multimedia protocol tunneling systems. These findings suggest that, apart from their performance, decision tree-based techniques can provide meaningful insight into the inner workings of these systems and we propose that they should be used for assessing the unobservability of multimedia protocol tunneling systems in the future.

其次，我们表明基于决策树及其某些变体的ML技术在检测隐蔽流量方面可以有效的降低误报率。例如，使用XGBoost的对手将能够标记所有Facet流量的90％，同时错误地仅标记2％的合法连接。此外，这种技术的性能非常高，这意味着对手能够在几秒内对流量进行分类，每个训练集的样本数量相对较少，并且占用内存较少。另外，使用基于决策树的技术使我们能够理解哪些流量特征对于检测特定多媒体协议隧道系统的功能是最重要的。这些研究结果表明，除了性能之外，基于决策树的技术可以提供对这些系统内部工作的有意义的洞察，我们建议将它们用于评估未来多媒体协议隧道系统的不可观察性。

Third, we explore alternative ML approaches for the detection of covert channels when the adversary is assumed to be partially or totally deprived of labeled data. Our findings suggest that unsupervised learning techniques provide no advantage for the classification of multimedia protocol tunneling covert channels, while the application of semi-supervised learning techniques yields a significant fraction of false positives. However, we note that the performance of semi-supervised techniques can be significantly improved through the optimization of parameters or by providing algorithms with extra training data. The study of semi-supervised anomaly detection techniques with an ability to self-tune parameters can be a promising future direction of research which would enable adversaries to detect covert traffic while avoiding the burden of generating and manually labeling data.

第三，当假设对手被部分或完全剥夺标记数据时，我们探索用于检测隐蔽通道的替代ML方法。我们的研究结果表明，无监督学习技术对多媒体协议隧道隐蔽信道的分类没有任何优势，而半监督学习技术的应用产生了很大一部分误报。然而，我们注意到，通过优化参数或通过提供额外训练数据的算法，可以显着提高半监督技术的性能。具有自我调整参数能力的半监督异常检测技术的研究可以成为未来有希望的研究方向，这将使对手能够检测隐蔽的流量，同时避免生成和手动标记数据的负担。

We note that we synthesize a limited number of legitimate and covert traffic samples in laboratory settings for creating our datasets. While this is a common approach for generating datasets for the type of unobservability assessment we conduct in this paper, it is possible that adversaries possessing a privileged position in the network can build a more accurate representation of traffic.

我们注意到，我们在实验室环境中合成了有限数量的合法和隐蔽流量样本，以创建我们的数据集。虽然这是为我们在本文中进行的不可观测性评估类型生成数据集的常用方法，但是在网络中拥有特权位置的攻击者可以构建更准确的流量表示。

The remainder of our paper is organized as follows. Section 2 presents the methodology of our study. Section 3 presents the main findings of our study regarding the comparison of similarity-based classifiers. Section 4 presents the results obtained when assessing unobservability resorting to decision tree-based classifiers. Section 5 presents our first insights on using semi-supervised and unsupervised anomaly detection techniques for the identification of covert traffic. In Section 6, we discuss obtained results and we present the related work in Section 7. Lastly, we conclude our work in Section 8.

本文的其余部分安排如下。第2节介绍了我们研究的方法。第3节介绍了我们关于基于相似性的分类器比较的研究的主要发现。第4节介绍了在使用基于决策树的分类器评估不可观察性时获得的结果。第5节介绍了我们对使用半监督和非监督异常检测技术识别隐蔽流量的初步见解。在第6节中，我们讨论了获得的结果，并在第7节中介绍了相关的工作。最后，我们在第8节中总结了我们的工作。

2 Methodology

This section introduces the systems we analyzed, our adversary model, and the experimental setup of our study.

本节介绍我们分析的系统，我们的对手模型以及我们研究的实验设置。

2.1 Systems Under Analysis

Below, we describe three state-of-the-art approaches at multimedia protocol tunneling which serve as a basis for our study. We selected these systems because all of them encode data into video streams, and their code is publicly available for open testing. We note that although these systems have been conceived for the purpose of censorship circumvention, in practice, they may be used for other purposes, such as concealing criminal activity.

下面，我们描述了多媒体协议隧道的三种最先进的方法，它们是我们研究的基础。我们选择了这些系统，因为它们都将数据编码为视频流，并且它们的代码可公开用于开放测试。我们注意到，虽然这些系统是为审查规避而设想的，但在实践中，它们可能被用于其他目的，例如隐瞒犯罪活动。

Facet [30] allows clients to watch arbitrary videos by replacing the audio and video feeds of Skype videocalls. To watch a video, clients contact a Facet server by sending it a message containing the desired video URL. Afterwards, the Facet server downloads the requested video and feeds its content to microphone and camera emulators. Then, the server places a videocall to the client transmitting the selected video and audio instead. Thus, clients are not required to install any software in order to use the system. For approximating the traffic patterns of regular videocalls, Facet re-samples the audio frequency and overlays the desired video in a fraction of each frame while the remaining frame area is filled up by a video resembling a typical videocall. Decreasing the area occupied by the concealed video translates into increased resistance against traffic analysis.

Facet [30]允许客户通过替换Skype视频通话的音频和视频源来观看任意视频。要观看视频，客户通过向其发送包含所需视频URL的消息来联系Facet服务器。然后，Facet服务器下载所请求的视频并将其内容提供给麦克风和相机模拟器。然后，服务器将视频通话发送到发送所选视频和音频的客户端。因此，客户端不需要安装任何软件即可使用该系统。为了近似常规视频通话的流量模式，Facet重新采样音频并在每帧的一小部分内覆盖所需视频，而剩余帧区域由类似于典型视频通话的视频填充。减少隐藏视频占用的面积可以提高对流量分析的抵抗力。

CovertCast [34] scrapes and modulates the content of web pages into images which are distributed via livestreaming platforms such as YouTube. Multiple clients can consume the data being transmitted in a particular live stream simultaneously. CovertCast modulates web content by encoding it into colored matrix images. A colored matrix is parameterized by a cell size (adjacent pixels with a given color), the number of bits encoded in each cell (represented with a color), and the rate at which a matrix containing new data is loaded. Clients scrape and demodulate the images served through the live stream extracting the desired web content.

CovertCast [34]将网页内容划分并调整为图像，这些图像通过YouTube等直播平台进行分发。多个客户端可以同时使用在特定直播流中传输的数据。 CovertCast通过将网页内容编码为彩色矩阵图像来调整网页内容。通过单元大小（具有给定颜色的相邻像素），在每个单元中编码的比特数（用颜色表示）以及加载包含新数据的矩阵的速率来参数化彩色矩阵。客户端抓取并解调通过实时流提供的图像，提取所需的Web内容。

DeltaShaper [2] differentiates itself from the previous systems in that it allows for tunneling arbitrary TCP/IP traffic. This is achieved by modulating covert data into images which are transmitted through a bi-directional Skype videocall. DeltaShaper follows a similar data encoding mechanism to that of CovertCast. However, and similarly to Facet, a colored matrix is overlayed in a fraction of the call screen, on top of a typical chat video running in the background. This overlay, named payload frame, can be carefully parameterized to provide different levels of resistance against traffic analysis. On call start, DeltaShaper undergoes a calibration phase for adjusting its encoding parameters according to the current network conditions in order to preserve unobservability.

DeltaShaper [2]与上述的系统区别在于它允许隧道传输任意TCP / IP流量。这是通过将隐蔽数据调制成通过双向Skype视频通话传输的图像来实现的。 DeltaShaper遵循与CovertCast类似的数据编码机制。然而，与Facet类似，彩色矩阵覆盖在背景中运行的典型聊天视频之上，呼叫屏幕的一小部分中。这个名为有效载荷帧的叠加层可以仔细参数化，以提供不同级别的流量分析阻力。在通话开始时，DeltaShaper经历校准阶段，以根据当前网络状况调整其编码参数，以便保持不可观察性。

2.2 Adversary Model

To study the unobservability properties of the aforementioned systems, we emulate a state-level adversary which will attempt to detect the traffic of multimedia protocol tunneling tools while resorting to different anomaly detection techniques. The providers of encrypted multimedia applications which are used as carriers for covert channels are not assumed to collude with the adversary. Thus, the adversary cannot simply demand application providers to decipher and disclose raw multimedia content which could be easily screened for the presence of covert data. The adversary is also assumed to be unable to control the software installed in the computers of end-users. However, domestic ISPs are assumed to cooperate with the adversary, enabling it to monitor, store and inspect all traffic flows crossing its borders.

为了研究上述系统的不可观察性，我们模拟了一个状态级别的对手，它将尝试在采用不同的异常检测技术的同时检测多媒体协议隧道工具的流量。用作隐蔽频道的载体的加密多媒体应用的提供者不被认为与对手勾结。因此，攻击者不能简单地要求应用程序提供者破译和公开原始多媒体内容，这些内容可以容易地筛选出隐藏数据的存在。还假设对手无法控制最终用户计算机中安装的软件。但是，假设国内ISP与对手合作，使其能够监控，存储和检查跨越其边界的所有流量。

An adversary faces an inherent trade-off between the ability to correctly detect a large amount of covert channels and to erroneously flag legitimate flows. Flagging legitimate flows as covert channels is something that the adversary wants to avoid in most practical settings. For example, a censor that aims at blocking flows containing covert channels may not be willing to block large fractions of legitimate calls, that are used daily by companies and business, as these calls may be key for the economy of the censor’s regime [17]. Also, law-enforcement agencies may not be willing to risk to falsely flag legitimate actions of citizens as criminal activity.

对手在正确检测大量隐蔽通道的能力与错误标记合法流量之间面临着固有的权衡。将合法流量标记为隐蔽通道是对手在大多数实际环境中希望避免的事情。例如，旨在阻止包含隐蔽通道的流量的审查者可能不愿意阻止公司和企业每天使用的大部分合法呼叫，因为这些呼叫可能是审查者政权经济的关键[17]。此外，执法机构可能不愿意冒险将公民的合法行为错误地标记为犯罪活动。

2.3 Performance Metrics

In face of the previous observations, when comparing the different techniques we mainly use the following metrics: true positive rate, false positive rate, accuracy, and the area under the ROC curve. The True Positive Rate (TPR) measures the fraction of positive samples that are correctly identified as such, while the False Positive Rate (FPR) measures the proportion of negative samples erroneously classified as positive. Thus, adversaries will attempt to obtain a high TPR and a low FPR when performing covert traffic classification. Accuracy captures the fraction of correct labels output by the classifier among all predictions, and can be used as a summary of the classification performance since high accuracy implies a high true positive rate and a low false positive rate. The ROC curve plots the TPR against the FPR for the different possible cut-off points for classifiers possessing adjustable internal thresholds. The area under the ROC curve (ROC AUC) [16] summarizes this trade-off. While a classifier outputting a random guess has an AUC=0.5, a perfect classifier would achieve an AUC=1, where the optimal point on the ROC curve is FPR=0 and TPR=1.

面对先前的观察，在比较不同的技术时，我们主要使用以下指标：真阳性率，假阳性率，准确度和ROC曲线下的面积。真阳性率（TPR）测量正确识别的阳性样本的分数，而假阳性率（FPR）测量错误分类为阳性的阴性样本的比例。因此，当执行隐蔽流量分类时，攻击者将尝试获得高TPR和低FPR。准确度捕获所有预测中分类器输出的正确标签的分数，并且可以用作分类性能的总结，因为高准确度意味着高真阳性率和低误报率。对于具有可调内部阈值的分类器，ROC曲线将TPR与FPR绘制成不同的可能切除点。 ROC曲线下面积（AUC）[16]总结了这种权衡。虽然输出随机猜测的分类器具有AUC = 0.5，但是完美分类器将实现AUC = 1，其中ROC曲线上的最佳点是FPR = 0且TPR = 1
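As a small illustration of the metrics defined above, here is a minimal Python sketch that computes TPR, FPR, accuracy, and ROC AUC from labels and classifier scores. The function names, the label convention (1 = covert, 0 = legitimate), and the toy data are assumptions made for this example; they do not come from the paper. The AUC is computed through its rank interpretation: the probability that a randomly chosen covert sample receives a higher score than a randomly chosen legitimate one.

```python
def confusion_rates(y_true, y_pred):
    """Return (TPR, FPR, accuracy) for binary labels: 1 = covert, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)            # covert flows correctly flagged
    fpr = fp / (fp + tn)            # legitimate flows wrongly flagged
    accuracy = (tp + tn) / len(pairs)
    return tpr, fpr, accuracy


def roc_auc(y_true, scores):
    """AUC as P(score of a covert sample > score of a legitimate sample); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): four covert and four legitimate flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # one fixed cut-off point

print(confusion_rates(y_true, y_pred))  # (0.75, 0.25, 0.75)
print(roc_auc(y_true, scores))          # 0.8125
```

Moving the cut-off (0.5 here) trades TPR against FPR, which is exactly the curve that the AUC summarizes.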

2.4 Experimental Setup

For conducting our study, we were required to analyze a number of network traces produced by the systems described in Section 2.1. For our testbed, we used two 64-bit Ubuntu 14.04.5 LTS virtual machines (VMs) provisioned with a 2.40GHz Intel Core2 Duo CPU and 8GB of RAM configured in a LAN setting. We used the v4l2loopback camera emulator and the pulseaudio sound server to feed video and audio to the carrier multimedia applications. The prototypes of the considered systems were obtained from their respective websites [3, 29, 33]. Due to the deprecation of Skype v4.3 and the incompatibility of v4l2loopback with the latest Skype v8.x desktop version, we have resorted to Skype for Web. For gathering the traffic samples generated by each system, we captured the network packets produced by the carrier multimedia streams for a duration of 60 seconds after a given covert channel has been established. The methodology we followed for gathering traffic samples has been commonly used in the literature since it allows for the analysis of the unobservability properties of covert channels while executing in steady-state. Next, we describe the methodology we followed for generating our covert and legitimate traffic datasets.

为了进行我们的研究，我们需要分析由2.1节中描述的系统产生的许多网络迹线。对于我们的测试平台，我们使用两个64位Ubuntu 14.04.5 LTS虚拟机（VM）配置了2.40GHz Intel Core2 Duo CPU和8GB RAM配置在LAN设置中。我们使用v4l2loopback相机模拟器和pulseaudio声音服务器将视频和音频馈送到运营商多媒体应用程序。所考虑系统的原型来自各自的网站[3,29,33]。由于Skype v4.3的弃用以及v4l2loopback与最新的Skype v8.x桌面版本的不兼容，我们采用了Skype for Web。为了收集由每个系统生成的流量样本，我们在建立给定的隐蔽信道之后捕获由载波多媒体流产生的网络分组持续60秒。我们在收集流量样本时采用的方法已经在文献中普遍使用，因为它允许在稳态下执行时分析隐蔽通道的不可观察性。接下来，我们将介绍生成隐蔽和合法流量数据集时遵循的方法。

Facet: For building our covert video dataset, we collected 1000 YouTube videos from the YouTube-curated Top Shared and Liked playlist. The legitimate Skype video dataset consists of 1000 recorded live chat videos available on YouTube. We adapted the Facet prototype to sample three types of Facet transmissions, corresponding to scaling the covert videos on top of legitimate videos by a factor of 50%, 25% and 12.5% – the available prototype represents a proof-of-concept only capable of a (unmorphed) 100% scaling. Then, we gathered 1000 traffic samples for each scaling factor by combining a pair of legitimate and covert videos while following the audio and video morphing techniques detailed in Facet’s original description. To emulate legitimate Skype calls, we streamed the media comprising our legitimate Skype video dataset. The resolution of the camera emulator was set to 320x240. For gathering traffic samples, we used each of the available VMs as a Skype peer.

Facet：为了构建隐蔽视频数据集，我们从YouTube策划的Top Shared and Liked播放列表中收集了1000个YouTube视频。合法的Skype视频数据集由YouTube上提供的1000个录制的实时聊天视频组成。我们改编了Facet原型以采样三种类型的Facet传输，分别对应于将隐蔽视频按50％、25％和12.5％的比例缩放后叠加在合法视频之上 - 现有原型只是概念验证，仅支持（未变形的）100％缩放。然后，我们按照Facet原始描述中详述的音频和视频变形技术，将一对合法视频和隐蔽视频组合起来，为每个缩放因子收集了1000个流量样本。为了模拟合法的Skype通话，我们播放了合法Skype视频数据集中的媒体。相机模拟器的分辨率设置为320x240。为了收集流量样本，我们将每个可用的VM用作一个Skype对等端。

CovertCast: For building our legitimate live-streaming dataset, we crawled 200 live-streams included in the Live YouTube-curated list. Then, we generated 200 CovertCast live-streams by broadcasting several news websites already included in the available CovertCast prototype. The server component, responsible for scraping websites, was executed in one of our VMs and streamed modulated video frames to YouTube. We used a Windows laptop running Google Chrome as a CovertCast client. Each video was streamed with a 1280x720 resolution.

CovertCast：为了构建我们合法的直播数据集，我们抓取了LiveYouTube策划列表中包含的200个直播流。然后，我们通过广播已包含在可用的CovertCast原型中的几个新闻网站，生成了200个CovertCast直播流。负责抓取网站的服务器组件在我们的一个虚拟机中执行，并将流式调制视频帧执行到YouTube。我们使用运行谷歌Chrome的Windows laptop作为CovertCast客户端。每个视频以1280x720分辨率流式传输。

DeltaShaper: We emulated 300 legitimate bi-directional Skype calls by streaming a subset of our legitimate Skype video dataset. We gathered DeltaShaper traffic samples by establishing a DeltaShaper connection between the Skype endpoints installed in both VMs. We gathered data for two DeltaShaper configurations, found to provide traffic analysis resistance guarantees, and which respected the tuple (payload frame area, cell size, number of bits, framerate). These were the ⟨320 × 240, 8 × 8, 6, 1⟩ and ⟨160 × 120, 4 × 4, 6, 1⟩ tuples. Each video was streamed in a 640x480 resolution.

DeltaShaper：我们通过流式传输我们合法的Skype视频数据集的子集来模拟300个合法的双向Skype通话。我们通过在两个VM中安装的Skype端点之间建立DeltaShaper连接来收集DeltaShaper流量样本。我们收集了两个DeltaShaper配置的数据，这两个配置被发现能提供抗流量分析的保证，并且遵循元组（有效载荷帧区域，单元大小，位数，帧速率）的形式。它们是⟨320×240,8×8,6,1⟩和⟨160×120,4×4,6,1⟩元组。每个视频以640x480分辨率流式传输。

3 Similarity-based Classification

For the purpose of unobservability assessment, multiple similarity functions have been used to feed similarity-based classifiers. This section details the rationale behind each of these functions and how they have been used for the construction of similarity-based classifiers and applied to different multimedia protocol tunneling systems. Then, we conduct a comparative analysis of the performance of each of these classifiers.

出于不可观察性评估的目的，已经使用多个相似性函数来提供基于相似性的分类。本节详细介绍了这些功能背后的基本原理，以及它们如何用于构建基于相似性的分类器并应用于不同的多媒体协议隧道系统中。然后，我们对每个分类器的性能进行比较分析。

3.1 Currently Used Similarity Functions

Next, we introduce the three similarity-based classifiers which have been previously used for evaluating the unobservability of Facet, CovertCast, and DeltaShaper.

接下来，我们介绍三个基于相似度的分类器，这些分类器以前用于评估Facet，CovertCast和DeltaShaper的不可观察性。

In similarity-based classification [10], labeling is performed by taking into account the pairwise-similarities between the test sample and a set of labeled training samples (or a representative model based on these). In the context of traffic analysis, similarity scores are often obtained from the comparison of the frequency distribution of packet lengths or inter-arrival times of traffic samples. Pearson’s Chi-squared Test (χ2) [40] tells us whether the distributions of two categorical variables differ significantly from each other, by comparing the observed and expected frequencies of each category. The χ2 test is used in a classifier adapted for distinguishing Facet traffic [30, 51]. The classifier starts by building two models for legitimate and Facet traffic, respectively, using labeled samples. These models are based upon a selection of the bi-gram distribution of packet lengths, where bi-grams expected to hurt classification performance are identified and discarded. Test samples are compared to each of the models using the χ2 test. A simpler version of this classifier labels a sample according to the minimum distance obtained when compared against each model. A more sophisticated version of the classifier labels samples according to whether the ratio between the distance to each model surpasses a threshold. An adversary can adjust this threshold for balancing the expected true positive and false positive rates of the classifier.

在基于相似性的分类[10]中，通过考虑测试样本与一组标记的训练样本（或基于这些样本的代表性模型）之间的成对相似性来执行标记。在流量分析的背景下，相似性得分通常从分组长度的频率分布或流量样本的到达间隔时间的比较中获得。 Pearson的卡方检验（χ2）[40]通过比较每个类别的观察频率和预期频率，告诉我们两个分类变量的分布是否彼此显着不同。 χ2检验用于适用于区分Facet交易的分类器[30,51]。分类器首先使用已经标记的样本为合法和Facet流量两种情况构建两个模型。这些模型基于对包长度的二元分布的选择，其中预期会损害分类性能的二元组被识别和丢弃。使用χ2检验将测试样品与每个模型进行比较。该分类器的更简单版本根据与每个型号进行比较时获得的最小距离来标记样品。更复杂的分类器版本根据每个模型的距离之间的比率是否超过阈值来标记样本。攻击者可以调整此阈值以平衡分类器的预期真阳性和假阳性率。
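To make the two-model χ2 classifier described above more concrete, here is a rough Python sketch. It is only an illustration under assumptions: it uses the symmetric χ2 distance 0.5 · Σ (p − q)² / (p + q) between bi-gram frequency distributions, it skips the step that discards unhelpful bi-grams, and all function names and packet lengths are made up for the example. It is not the code of the Facet evaluation.

```python
from collections import Counter


def bigram_distribution(packet_lengths):
    """Relative frequency of consecutive packet-length pairs (bi-grams)."""
    bigrams = Counter(zip(packet_lengths, packet_lengths[1:]))
    total = sum(bigrams.values())
    return {bg: count / total for bg, count in bigrams.items()}


def build_model(samples):
    """Average the bi-gram distributions of several labeled traffic samples."""
    model = Counter()
    for sample in samples:
        for bg, freq in bigram_distribution(sample).items():
            model[bg] += freq / len(samples)
    return dict(model)


def chi2_distance(p, q):
    """Symmetric chi-squared distance between two sparse distributions."""
    keys = set(p) | set(q)  # every key has positive mass in p or q
    return 0.5 * sum(
        (p.get(k, 0.0) - q.get(k, 0.0)) ** 2 / (p.get(k, 0.0) + q.get(k, 0.0))
        for k in keys
    )


def classify(packet_lengths, legit_model, covert_model, threshold=1.0):
    """Flag as covert when the distance to the legitimate model exceeds
    threshold times the distance to the covert model.
    threshold=1.0 is the plain minimum-distance rule."""
    dist = bigram_distribution(packet_lengths)
    d_legit = chi2_distance(dist, legit_model)
    d_covert = chi2_distance(dist, covert_model)
    return "covert" if d_legit > threshold * d_covert else "legitimate"


# Hypothetical training traces (packet lengths in bytes).
legit_model = build_model([[100, 200, 100, 200, 100], [100, 200, 100, 200, 100, 200]])
covert_model = build_model([[1000, 1000, 1000, 1000], [1000, 1000, 900, 1000]])

print(classify([1000, 1000, 1000], legit_model, covert_model))
print(classify([100, 200, 100], legit_model, covert_model))
```

Raising `threshold` makes the classifier more reluctant to flag a flow, which lowers both the false positive rate and the true positive rate; this is the knob the adversary tunes.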

Kullback-Leibler Divergence (KL) [28] is a measure of relative entropy between two target distributions which is obtained by computing the information lost when trying to approximate one distribution with the other. The KL divergence is used for building a classifier for CovertCast traffic. The classifier aims at distinguishing a set of YouTube videos carrying modulated data from a set of regular YouTube videos through the comparison of the quantized frequency distribution of packet lengths. For each sample in a given set, the classifier computes its KL divergence from every other member in the same set and every member in the other set. Then, the classifier computes a success metric, corresponding to the number of times the KL divergence between a member of one set is more similar to another member of the same set, divided by the total KL divergences that were computed.

Kullback-Leibler Divergence（KL）[28]是两个目标分布之间相对熵的度量，它是通过计算当试图用另一个分布逼近一个分布时丢失的信息来获得的。 KL分歧用于构建CovertCast流量的分类器。该分类器旨在通过比较分组长度的量化频率分布来区分一组携带来自一组常规YouTube视频的调制数据的YouTube视频。对于给定集合中的每个样本，分类器计算其与同一组中的每个其他成员以及另一组中的每个成员的KL分歧。然后，分类器计算成功度量，该成功度量对应于一组成员之间的KL分歧与同一组的另一成员更相似的次数除以计算的总KL分歧。

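(Translator's note: D_KL(p||q) measures how far the distribution q is from the distribution p. The smaller the divergence, the closer q is to p, and therefore the closer the estimated distribution is to the true one.)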
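For readers who want to see the divergence computed, the following Python sketch quantizes packet lengths into a histogram and evaluates D_KL(p||q) = Σ p_i · log2(p_i / q_i). The bin width, the maximum length of 1500 bytes, the smoothing constant `eps`, and the sample data are all assumptions chosen for illustration; the paper does not specify them here, and the success metric of the CovertCast classifier is not reproduced.

```python
import math


def quantized_distribution(packet_lengths, bin_width=50, max_len=1500):
    """Histogram of packet lengths, normalized to sum to 1.
    Lengths above max_len fall into the last bin."""
    n_bins = max_len // bin_width + 1
    counts = [0] * n_bins
    for length in packet_lengths:
        counts[min(length // bin_width, n_bins - 1)] += 1
    total = len(packet_lengths)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) in bits. A small eps is added to every bin and both
    distributions are renormalized, so empty bins in q do not yield infinity."""
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (pi / p_sum) * math.log2((pi / p_sum) / (qi / q_sum))
        for pi, qi in zip(p_s, q_s)
    )


# Hypothetical traces: a regular stream and a stream with a different length profile.
regular = quantized_distribution([1400, 1400, 1350, 600, 1400, 200])
covert = quantized_distribution([1400, 1400, 1400, 1400, 1400, 1400])

print(kl_divergence(regular, regular))  # essentially zero
print(kl_divergence(regular, covert))   # positive
print(kl_divergence(covert, regular))   # differs from the previous line: KL is not symmetric
```

Because the divergence is asymmetric, a classifier built on it has to fix which distribution plays the role of p and which the role of q.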

Earth Mover's Distance (EMD) [43] measures the dissimilarity between two distributions, where the distance between single features can be defined in a distance matrix. Informally, this dissimilarity represents the necessary amount of work to turn one probability distribution into the other, where the cost of this transformation translates to the amount of observations moved times the distance defined in the associated distance matrix. The EMD (provided with a unitary distance matrix) is used for comparing the quantized frequency distribution of packet lengths of traffic samples, and is used as basis for building a classifier for DeltaShaper traffic. First, the classifier computes the pairwise EMD between each sample in the dataset and each legitimate sample, recording its average. The intuition is that legitimate samples will exhibit a smaller average EMD. An internal threshold adjusts the trade-off between the true positive and false positive rates of the classifier. For labeling a new sample, the classifier computes the pairwise distance of this sample to each legitimate sample and verifies whether its average EMD surpasses the threshold.

地球移动器的距离（EMD）[43]测量两个分布之间的不相似性，其中单个特征之间的距离可以在距离矩阵中定义。非正式地，这种不相似性表示将一个概率分布转换为另一个概率分布的必要工作量，其中该变换的成本转换为移动的观察量乘以相关距离矩阵中定义的距离。 EMD（提供有单一距离矩阵）用于比较业务样本的分组长度的量化频率分布，并且用作构建DeltaShaper业务的分类器的基础。首先，分类器计算数据集中每个样本与每个合法样本之间的成对EMD，记录其平均值。直觉是合法样品将表现出较小的平均EMD。内部阈值调整分类器的真阳性和假阳性率之间的权衡。为了标记新样本，分类器计算该样本与每个合法样本的成对距离，并验证其平均EMD是否超过阈值。
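The thresholded average-EMD rule described above can be sketched in a few lines of Python. One assumption should be stated loudly: this sketch uses the common one-dimensional EMD in which the ground distance between two histogram bins is the gap between their indices, computed through cumulative sums. The distance matrix used in the DeltaShaper evaluation may be defined differently, and the function names, histograms, and threshold below are hypothetical.

```python
def emd_1d(p, q):
    """EMD between two normalized histograms over the same ordered bins.
    ASSUMPTION: ground distance between bins i and j is |i - j|."""
    carried = 0.0   # mass that still has to be moved past the current bin
    work = 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi
        work += abs(carried)
    return work


def average_emd(sample, legit_samples):
    """Mean EMD from one sample to every legitimate reference sample."""
    return sum(emd_1d(sample, ref) for ref in legit_samples) / len(legit_samples)


def is_covert(sample, legit_samples, threshold):
    """Flag the sample when its average EMD to legitimate traffic exceeds the threshold."""
    return average_emd(sample, legit_samples) > threshold


# Hypothetical 4-bin packet-length histograms.
legit_samples = [
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.3, 0.2, 0.4],
]
new_legit = [0.1, 0.2, 0.3, 0.4]
new_covert = [0.7, 0.2, 0.1, 0.0]

print(average_emd(new_legit, legit_samples))
print(average_emd(new_covert, legit_samples))
print(is_covert(new_covert, legit_samples, threshold=0.5))
```

Sweeping `threshold` over a range of values yields the different operating points of the classifier, each with its own true positive and false positive rate.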


3.2 Main Findings

We now present the main findings of our analysis after assessing the unobservability of each system with all the similarity-based classifiers described above.

我们现在在使用上述所有基于相似性的分类器评估每个系统的不可观察性之后，呈现我们的分析的主要发现。

1. The claims on the unobservability guarantees of multimedia protocol tunneling systems are intimately tied to the classifier employed in their evaluation. This finding can be illustrated by the numbers in Table 1, which shows the accuracy, true positive and true negative rates obtained by the classifiers described in Section 3.1. For example, when detecting Facet s=50% traffic, we can see that the χ2 classifier performs relatively well, with an accuracy of 74.3%. Contrastingly, the KL and EMD classifiers attain an accuracy close to random guessing, providing an optimistic estimate on the unobservability of Facet s=50%. The values in Table 1 suggest a similar trend when detecting DeltaShaper and CovertCast traffic. Our results also suggest that χ2 outperforms all other similarity-based classifiers proposed for the assessment of unobservability. This can be inferred from the consistently higher accuracy values provided by χ2.

1.关于多媒体协议隧道系统不可观察性保证的主张与其评估中使用的分类器密切相关。这一发现可以通过表1中的数字来说明，表1显示了3.1节中描述的分类器获得的准确率，真阳性率和真阴性率。例如，当检测Facet s = 50％流量时，我们可以看到χ2分类器表现相对较好，准确率为74.3％。相比之下，KL和EMD分类器获得接近随机猜测的准确率，对Facet s = 50％的不可观察性给出了过于乐观的估计。表1中的值表明在检测DeltaShaper和CovertCast流量时存在类似趋势。我们的结果还表明，χ2优于为评估不可观察性而提出的所有其他基于相似性的分类器。这可以从χ2始终更高的准确率推断出来。

2. χ2 produces large false positive rates when classifying Facet and DeltaShaper traffic. Figure 1 depicts the ROC curve of the χ2 and EMD classifiers when detecting Facet and DeltaShaper traffic. Figure 1a shows that for correctly identifying 90% of all Facet traffic (TPR=90%), with s=50%, an adversary would tag 45% of legitimate connections (45% FPR) as covert traffic, while employing the χ2 classifier. For identifying 90% of all DeltaShaper ⟨320 × 240, 8 × 8, 6, 1⟩ traffic, the adversary would face an FPR=51%. Thus, even the deployment of the best performing similarity-based classifier results in a large number of misclassifications for legitimate traffic. Misclassifications are further aggravated should an adversary resort to the EMD classifier. Figure 1 confirms that χ2 performs only fairly in distinguishing covert channels (e.g., AUC=0.83 for Facet s=50%, AUC=0.74 for DeltaShaper ⟨320 × 240, 8 × 8, 6, 1⟩). We do not show a ROC curve for KL as the classifier is not adjustable by an internal threshold.

2.在对Facet和DeltaShaper流量进行分类时，χ2会产生较大的误报率。图1描绘了在检测Facet和DeltaShaper流量时χ2和EMD分类器的ROC曲线。图1a显示，在使用χ2分类器时，为了正确识别s = 50％时所有Facet流量的90％（TPR = 90％），对手会将45％的合法连接（45％FPR）标记为隐蔽流量。为了识别所有DeltaShaper ⟨320×240,8×8,6,1⟩流量的90％，对手将面临FPR = 51％。因此，即使部署性能最佳的基于相似性的分类器，也会导致合法流量的大量错误分类。如果对手采用EMD分类器，则误分类会进一步恶化。图1证实χ2在区分隐蔽通道时表现只是一般（例如，对于Facet s = 50％，AUC = 0.83，对于DeltaShaper ⟨320×240,8×8,6,1⟩，AUC = 0.74）。我们没有显示KL的ROC曲线，因为该分类器不能通过内部阈值调整。
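The statement "to reach TPR=90% the adversary must accept FPR=45%" is a single point read off a ROC curve. The Python sketch below shows how such a point can be found from classifier scores by lowering the threshold until the target TPR is reached. The function name, the convention that higher scores mean "more likely covert", and the toy scores are assumptions for illustration; the numbers are unrelated to Figure 1.

```python
def fpr_at_tpr(y_true, scores, target_tpr):
    """Smallest FPR among the thresholds whose TPR reaches target_tpr.
    A sample is flagged as covert when its score is >= the threshold;
    labels use 1 = covert, 0 = legitimate."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Lowering the threshold can only raise TPR and FPR, so the first
    # threshold that meets the target also gives the smallest FPR.
    for threshold in sorted(set(scores), reverse=True):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        if tpr >= target_tpr:
            return sum(s >= threshold for s in neg) / len(neg)
    return 1.0


# Toy scores (hypothetical): four covert and four legitimate flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

print(fpr_at_tpr(y_true, scores, 0.75))  # 0.25
print(fpr_at_tpr(y_true, scores, 1.0))   # 0.5
```

In this toy data, catching the last covert flow doubles the share of legitimate flows that get flagged, which mirrors the trade-off discussed in the finding above.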

3. CovertCast fails to provide unobservability. The results in Table 1 show that the χ2 classifier can correctly identify all of CovertCast streams while incurring only a 2% false positive rate. Additionally, the numbers show that the remaining classifiers can correctly identify >96.5% of CovertCast streams, albeit incurring a larger false positive rate (e.g., EMD: TPR=0.965, FPR=0.305). We conjecture two explanations that may justify the differences between our results and those published in the original CovertCast paper. Firstly, our results may stem from the use of a dataset which is one order of magnitude larger than the one used for CovertCast evaluation. This increased dataset may more accurately represent the patterns generated by legitimate YouTube streams’ traffic and reveal CovertCast activity. Secondly, implementation changes in YouTube may have impacted the unobservability properties provided by hardcoded data modulation parameters, which may in turn be no longer adequate to ensure unobservability.

3. CovertCast无法提供不可观察性。表1中的结果表明，χ2分类器可以正确地识别所有CovertCast流，同时仅产生2％的假阳性率。另外，数字显示剩余的分类器可以正确地识别> 96.5％的CovertCast流，虽然引起更大的假阳性率（例如，EMD：TPR = 0.965，FPR = 0.305）。我们推测两种解释可以证明我们的结果与原始CovertCast论文中发表的结果之间存在差异。首先，我们的结果可能源于数据集的使用，该数据集比用于CovertCast评估的数据集大一个数量级。这种增加的数据集可以更准确地表示合法YouTube流的流量生成的模式，并揭示CovertCast活动。其次，YouTube中的实施更改可能会影响硬编码数据调制参数提供的不可观察性属性，而这些属性又可能不再足以确保不可观察性。

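(PS: The post is too long for the platform to accept, annoyingly, so I'll split it into parts.)

After the Main Text

I also have the slides and the script for the class presentation; I'll upload them once I've handed in my assignments. Good grief. I didn't review parallelism at all for today's computer architecture class. What am I going to do tomorrow!

"Effective Detection of Multimedia Protocol Tunneling using Machine Learning": Translation (Part 1)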