Technical tags

Research Project, Data Mining, C++ Language, Wireshark

Source code


In this project, we apply a sequential pattern mining algorithm to automatically generate signatures of an application from its network traffic. The proposed method can be used to obtain signatures of proprietary protocols, which have no public specifications. With the rapid development of networking technologies, new apps are published every day. Because publicly accessible documents that define the protocols used by these apps are lacking, specialists may not be able to extract comprehensive signatures in a timely manner. To classify these newly emerging network apps, we need an efficient method to obtain signatures from their traffic. These signatures can then be used to enhance a classifier so that it recognizes new apps. The method can be applied to monitoring traffic on an ISP's backbone network, improving quality of service, and other traffic management tasks. Extracting valid signatures poses the following challenges:
  1. Noise is mixed into the network data from which signatures are extracted.
  2. In existing systems, parameters must be adjusted manually for each data set to obtain high-quality signatures, so the procedure is not automatic.
  3. Keywords in a signature can be non-contiguous and can appear at different offsets in different packets.
  4. The previous steps may output a large number of keywords. A method is needed to merge these keywords into more concise signatures that cover the largest portion of network flows.
In our method, we assume that signatures can appear at any position in the payload. Adaptive sequential pattern mining with a dynamic minimum support is applied to the payload, and this adaptive approach can separate application traffic from noise traffic. Signature descriptions are generalized with regular expressions so that they can express signatures in a more general form. The resulting application signatures are combined from frequent sequences, then clipped and morphed.
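As an illustration of how a regular expression can describe keywords that are non-contiguous and appear at different offsets, the following is a minimal sketch. It is not the project's implementation: the function names `escapeRegex`, `buildSignaturePattern`, and `matchesSignature` are hypothetical, and joining keywords with a "match any bytes" gap is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Escape characters that have a special meaning in ECMAScript regexes,
// so that each keyword is matched literally.
std::string escapeRegex(const std::string& text) {
    static const std::string meta = R"(\^$.|?*+()[]{})";
    std::string out;
    for (char c : text) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Hypothetical helper: join the keywords in order, allowing any bytes
// (including line breaks) between them. This is an assumed signature form.
std::string buildSignaturePattern(const std::vector<std::string>& keywords) {
    assert(!keywords.empty());
    std::string pattern;
    for (std::size_t i = 0; i < keywords.size(); ++i) {
        if (i > 0) pattern += "[\\s\\S]*?";  // lazy gap of arbitrary content
        pattern += escapeRegex(keywords[i]);
    }
    return pattern;
}

// True if the payload contains all keywords in order, at any offsets.
bool matchesSignature(const std::string& payload, const std::string& pattern) {
    return std::regex_search(payload, std::regex(pattern));
}
```

For example, `buildSignaturePattern({"GET ", "HTTP/1.1"})` yields a pattern that matches a payload in which `GET ` is later followed by `HTTP/1.1`, regardless of where in the payload the two keywords occur. The gap uses `[\s\S]` rather than `.` because `.` does not match line terminators in the ECMAScript grammar used by `std::regex`.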


The core workflow for signature extraction is shown in the figure. Based on the classic generalized sequential pattern (GSP) algorithm, the proposed algorithm obtains signatures in four steps:
  1. Adjust the minimum support parameter based on the estimated interval in payloads that include keywords of signatures.
  2. Use the generalized sequential pattern (GSP) algorithm with constraints to mine common substrings.
  3. Synthesize signatures from discovered common substrings.
  4. Optimize signatures by merging, morphing and clipping.
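The mining step can be pictured with a simplified, level-wise sketch in the spirit of GSP. It is not the project's algorithm: it only handles contiguous byte substrings, uses a fixed absolute minimum support, and the names `mineFrequentSubstrings` and `keepMaximal` are hypothetical. A candidate of length k is counted only if both of its substrings of length k-1 were frequent in the previous pass.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Level-wise mining of contiguous substrings. The support of a substring is
// the number of payloads that contain it at least once.
std::map<std::string, std::size_t> mineFrequentSubstrings(
        const std::vector<std::string>& payloads,
        std::size_t minSupport,   // assumed: absolute payload count
        std::size_t maxLen) {     // assumed: upper bound on substring length
    assert(minSupport > 0);
    std::map<std::string, std::size_t> result;
    std::set<std::string> prev;  // frequent substrings of the previous length
    for (std::size_t len = 1; len <= maxLen; ++len) {
        std::map<std::string, std::size_t> counts;
        for (const std::string& p : payloads) {
            std::set<std::string> seen;  // count each candidate once per payload
            for (std::size_t i = 0; i + len <= p.size(); ++i) {
                std::string cand = p.substr(i, len);
                // Pruning: both (len-1)-substrings must already be frequent.
                if (len > 1 && (!prev.count(cand.substr(0, len - 1)) ||
                                !prev.count(cand.substr(1)))) {
                    continue;
                }
                seen.insert(cand);
            }
            for (const std::string& s : seen) ++counts[s];
        }
        prev.clear();
        for (const auto& kv : counts) {
            if (kv.second >= minSupport) {
                prev.insert(kv.first);
                result[kv.first] = kv.second;
            }
        }
        if (prev.empty()) break;  // no longer substring can be frequent
    }
    return result;
}

// Keep only frequent substrings that are not contained in a longer one.
std::vector<std::string> keepMaximal(
        const std::map<std::string, std::size_t>& frequent) {
    std::vector<std::string> out;
    for (const auto& a : frequent) {
        bool contained = false;
        for (const auto& b : frequent) {
            if (b.first.size() > a.first.size() &&
                b.first.find(a.first) != std::string::npos) {
                contained = true;
                break;
            }
        }
        if (!contained) out.push_back(a.first);
    }
    return out;
}
```

With the payloads `"xxGETyy"`, `"GETzz"`, `"qGETq"`, and `"noise"` and a minimum support of 3, the only maximal frequent substring is `GET`; the payload `"noise"` contributes nothing because none of its bytes reaches the support threshold.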

The minimum support is calculated with a sliding window: for each window, we set a minimum support based on the frequency of the words in that window.
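A per-window minimum support could be computed along the following lines. The exact rule used in the project is not given here, so this sketch assumes one: the threshold of a window is a fixed fraction (`ratio`) of the highest word frequency in that window, and never less than 1. The function name `windowMinSupport` and the parameters `window`, `step`, and `ratio` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// For each window over the word sequence, return an assumed minimum support:
// ceil(ratio * highest word frequency in the window), with a floor of 1.
std::vector<std::size_t> windowMinSupport(
        const std::vector<std::string>& words,
        std::size_t window,  // number of words per window
        std::size_t step,    // how far the window slides each time
        double ratio) {      // assumed fraction of the peak frequency
    assert(window > 0 && step > 0 && ratio > 0.0 && ratio <= 1.0);
    std::vector<std::size_t> supports;
    for (std::size_t start = 0; start < words.size(); start += step) {
        const std::size_t end = std::min(start + window, words.size());
        std::unordered_map<std::string, std::size_t> freq;
        std::size_t maxFreq = 0;
        for (std::size_t i = start; i < end; ++i) {
            maxFreq = std::max(maxFreq, ++freq[words[i]]);
        }
        const auto threshold =
            static_cast<std::size_t>(std::ceil(ratio * maxFreq));
        supports.push_back(std::max<std::size_t>(1, threshold));
        if (end == words.size()) break;  // last window reached the end
    }
    return supports;
}
```

Under this assumed rule, a window dominated by a repeated keyword gets a higher threshold than a window of mostly unique words, which is one way a dynamic minimum support can adapt to regions of the payload with and without keywords.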


To verify the presented method, we used it to mine signatures of several popular applications. The discovered signatures are listed in the figure. Some of the protocols in the figure have standard specifications. For example, the HTTP protocol is well defined, and the signatures we obtain for HTTP are very similar to the standard definition. The BitTorrent protocol is another one with a standard specification; with our method, the keywords of the protocol are correctly discovered.


The results show that the method can extract valid signatures from the network traffic of applications. The regular expressions of the signatures accurately describe keywords and their positions in network flows.