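Time: Friday, November 21, 2014, 9:30 a.m.
Venue: Lecture Hall 603, Chenggong Building, Cangshan Campus
Speaker: Dr. Yu Wang, Deakin University, Australia
Hosts: School of Mathematics and Computer Science; Fujian Provincial Key Laboratory of Network Security and Cryptology
About the speaker: Yu Wang, Ph.D., is currently a postdoctoral researcher in the Network Security and Computing Laboratory at Deakin University, Australia. Dr. Wang's main research areas include network traffic modeling and classification, network and system security, and machine learning. Dr. Wang has published a number of papers in journals including IEEE Transactions on Parallel and Distributed Systems.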
Abstract: Network traffic classification is the process of associating network traffic flows with their underlying network protocols or applications; it is a fundamental technique of broad interest. Classification decisions can be made based on a variety of information carried in the network traffic, such as the port number fields in packet headers, the application-layer payload content, and the statistical properties of the traffic flows. Nonetheless, state-of-the-art approaches all rely on some form of a priori knowledge, such as the list of well-known and registered ports, protocol specifications, protocol signatures, and pre-labelled training data sets. As a result, labor-intensive pre-processing is required, and the ability to deal with previously unknown applications is limited.
In this talk, we will review some of our recent research towards automating the process of network traffic classification, which is based on novel statistics-based classification schemes. First, we will look at unsupervised learning (i.e., clustering), which is a useful and important tool in practice, where the training data usually come without class labels and unknown patterns are always emerging. Although previous studies have reported promising results from applying classic clustering algorithms such as K-Means and EM to the task, the quality of the resulting traffic clusters was far from satisfactory. To address this problem, we have proposed a constrained traffic clustering scheme that makes decisions by considering some background information in addition to the observed traffic statistics. Specifically, we make use of equivalence set constraints, which indicate that particular sets of flows are using the same application-layer protocol and which can be efficiently inferred from packet headers according to background knowledge of TCP/IP networking. We model the observed data and the constraints using a Gaussian mixture density and adapt an approximate algorithm for the maximum likelihood estimation of the model parameters. Next, we will cover another work that proposes to make use of unlabeled background data in the process of supervised learning, with the aim of enhancing the ability of statistics-based traffic classifiers to distinguish novel traffic patterns that are unknown at the time of training.
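To make the idea of equivalence set constraints more concrete, the following is a minimal Python sketch, not the speaker's implementation. It rests on two assumptions that are not stated in the abstract: that flows sharing the same destination IP address, destination port, and transport protocol are treated as one equivalence set, and that each set is assigned to a Gaussian mixture component as a whole, using a one-dimensional flow feature and a single mixing weight per set. The names `build_equivalence_sets`, `set_responsibilities`, and the flow dictionary keys are hypothetical.

```python
import math
from collections import defaultdict


def build_equivalence_sets(flows):
    """Group flow indices that share (dst_ip, dst_port, proto).

    Assumption: flows towards the same server endpoint use the same
    application-layer protocol. The abstract does not give the exact rule.
    """
    groups = defaultdict(list)
    for index, flow in enumerate(flows):
        key = (flow["dst_ip"], flow["dst_port"], flow["proto"])
        groups[key].append(index)
    return list(groups.values())


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def set_responsibilities(values, weights, means, variances):
    """Posterior over mixture components for one whole equivalence set.

    All flows in the set are forced to share a component, so the set-level
    likelihood is the product of the per-flow likelihoods (assumed form).
    """
    logs = [
        math.log(w) + sum(log_gauss(x, m, v) for x in values)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]
```

In a full EM loop, these set-level responsibilities would replace the usual per-flow responsibilities in the E-step, and the M-step would update the weights, means, and variances from them. The sketch stops at the E-step because the abstract does not describe the approximate estimation algorithm in detail.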