In this project, we apply a sequential pattern mining algorithm to automatically generate signatures of
an application from its network traffic.
The proposed method can derive signatures for proprietary protocols,
which have no public specifications.
With the rapid development of networking technologies, new Apps are published every day.
Because these Apps often lack publicly accessible documents that define their protocols,
specialists may not be able to extract comprehensive signatures in a timely manner.
To classify these newly emerging networking Apps, we need an efficient method to derive signatures from their traffic.
These signatures can then be used to help a classifier recognize new Apps.
The method can be applied to monitor traffic on an ISP's backbone network, improve quality of service,
or support other traffic management tasks.
Extracting valid signatures poses these challenges:
Noise is mixed into the network data from which signatures are extracted.
Existing systems require parameters to be adjusted manually for each data set to obtain high-quality signatures, so the procedure is not automatic.
Keywords in a signature can be discontinuous and can appear at different offsets in different packets.
The earlier steps may output a large number of keywords, so we need a method that merges them into more concise signatures covering most of the network flows.
In our method, we assume that signature keywords can appear at any position in the payload. An adaptive sequential pattern mining
procedure with a dynamic minimum support is applied to the payload.
The adaptive method can separate application traffic from noise traffic.
Signatures are described with regular expressions, which express them in a more general form.
The final application signatures are clipped and morphed signatures combined from frequent sequences.
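As a rough illustration of how a regular expression can describe keywords that are discontinuous and appear at different offsets, the sketch below joins an ordered list of keywords with bounded wildcards. The function name `keywords_to_regex`, the `max_gap` parameter, and the sample payloads are assumptions made for this example and are not taken from the project.

```python
import re

def keywords_to_regex(keywords, max_gap=64, anchored=False):
    """Build a bytes regex that matches the keywords in order,
    allowing up to max_gap arbitrary bytes between neighbours."""
    gap = b".{0,%d}?" % max_gap                 # lazy, bounded wildcard
    body = gap.join(re.escape(k) for k in keywords)
    prefix = b"^" if anchored else b""          # optionally pin to payload start
    return re.compile(prefix + body, re.DOTALL) # DOTALL: '.' also matches \n

# Hypothetical keywords and payloads, for illustration only.
sig = keywords_to_regex([b"GET ", b" HTTP/1.", b"Host: "])
payloads = [
    b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n",
    b"GET /a HTTP/1.0\r\nAccept: */*\r\nHost: example.org\r\n\r\n",
    b"\x13BitTorrent protocol",
]
matches = [bool(sig.search(p)) for p in payloads]
print(matches)
```

The bounded gap keeps matching cost predictable on long payloads. An unbounded `.*` would also work but can match keywords that are unrelated and far apart.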
Method
The core workflow for signature extraction is shown in the figure.
Based on the classic Generalized Sequential Pattern (GSP) algorithm, the proposed algorithm extracts signatures in 4 steps:
Adjust the minimum support parameter based on the estimated interval in payloads that contains signature keywords.
Use the GSP algorithm with constraints to mine common substrings.
Synthesize signatures from the discovered common substrings.
Optimize the signatures by merging, morphing, and clipping.
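The mining step above can be sketched as a level-wise search for common substrings. This is a simplified, Apriori-style stand-in for GSP restricted to contiguous byte sequences, not the project's implementation. The names `frequent_substrings` and `maximal`, the length limits, and the sample payloads are all assumptions.

```python
from collections import Counter

def frequent_substrings(payloads, min_support, min_len=2, max_len=32):
    """Return {substring: support}, where support is the number of
    payloads that contain the substring at least once."""
    frequent = {}
    prev = None  # frequent substrings found at the previous length
    for k in range(min_len, max_len + 1):
        counts = Counter()
        for p in payloads:
            seen = set()  # count each substring once per payload
            for i in range(len(p) - k + 1):
                s = p[i:i + k]
                # Pruning: both (k-1)-length parts must already be frequent.
                if prev is not None and (s[:-1] not in prev or s[1:] not in prev):
                    continue
                seen.add(s)
            counts.update(seen)
        level = {s: c for s, c in counts.items() if c >= min_support}
        if not level:
            break  # no longer substring can be frequent
        frequent.update(level)
        prev = set(level)
    return frequent

def maximal(frequent):
    """Keep only substrings not contained in a longer frequent substring."""
    return sorted(s for s in frequent
                  if not any(s != t and s in t for t in frequent))

# Hypothetical payloads: three similar requests and one noise packet.
payloads = [
    b"GET /a HTTP/1.1\r\n",
    b"GET /bc HTTP/1.1\r\n",
    b"GET /index HTTP/1.1\r\n",
    b"\x00\x01noise",
]
keywords = maximal(frequent_substrings(payloads, min_support=3, min_len=4))
print(keywords)
```

Keeping only maximal substrings is one simple way to reduce the number of keywords before the synthesis and optimization steps. The project's merging, morphing, and clipping may work differently.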
The minimum support is calculated over a sliding window:
for each window, it is set according to the frequency of the words in that window.
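One possible reading of the window-based minimum support is sketched below. The text does not give the exact rule, so everything here is assumed for illustration: the window slides over byte offsets, "words" are fixed-length byte n-grams, and the minimum support is a fixed ratio of the most frequent word's count. The name `window_min_support` and its parameters are hypothetical.

```python
import math
from collections import Counter

def window_min_support(payloads, window=8, step=4, word_len=3, ratio=0.5):
    """Return a list of (start_offset, min_support), one entry per window.
    ASSUMPTION: min_support = ceil(ratio * highest word frequency)."""
    longest = max((len(p) for p in payloads), default=0)
    result = []
    for start in range(0, max(longest - word_len + 1, 1), step):
        counts = Counter()
        for p in payloads:
            chunk = p[start:start + window]
            # Distinct words in this payload's slice of the window.
            words = {chunk[i:i + word_len]
                     for i in range(len(chunk) - word_len + 1)}
            counts.update(words)
        top = max(counts.values(), default=0)
        result.append((start, max(1, math.ceil(ratio * top))))
    return result

# Hypothetical payloads: the shared prefix is stronger than the tail.
payloads = [
    b"GET /a HTTP",
    b"GET /b HTTP",
    b"GET /c xxxx",
    b"zzzzzzzzzzz",
]
supports = window_min_support(payloads)
print(supports)
```

With this rule, windows where many payloads share the same words get a higher threshold, and windows with little agreement get a lower one. The threshold therefore adapts to the data instead of being tuned by hand.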
Result
To verify the presented method, we used it to mine signatures of some popular applications.
The discovered signatures are listed here:
Some of the protocols in the above figure have standard specifications.
For example, the HTTP protocol is well defined, and the signatures we obtain for it are very similar to the
standard definition. The BitTorrent protocol is another one with a standard specification; using our method,
the keywords in the protocol are correctly discovered.
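To make the comparison with standard definitions concrete, the sketch below writes two reference patterns directly from the public specifications: the HTTP/1.x request line, and the BitTorrent handshake, which starts with the byte 19 followed by the string "BitTorrent protocol". These patterns are illustrative assumptions and are not the signatures produced by the project. The names `REFERENCE_SIGNATURES` and `classify` are hypothetical.

```python
import re

# Patterns written from the public specs, for illustration only.
REFERENCE_SIGNATURES = {
    "HTTP": re.compile(
        rb"^(GET|POST|HEAD|PUT|DELETE|OPTIONS) \S+ HTTP/1\.[01]\r\n"),
    "BitTorrent": re.compile(rb"^\x13BitTorrent protocol"),
}

def classify(payload):
    """Return the first protocol whose pattern matches, else None."""
    for name, pattern in REFERENCE_SIGNATURES.items():
        if pattern.search(payload):
            return name
    return None

http_label = classify(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
# Handshake: length byte, protocol string, 8 reserved bytes, two 20-byte ids.
bt_label = classify(b"\x13BitTorrent protocol" + bytes(8) + bytes(40))
unknown_label = classify(b"\x00\x01\x02")
print(http_label, bt_label, unknown_label)
```

A mined signature could be checked against reference patterns like these by running both over the same flows and comparing which flows each one matches.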
Conclusion
The results show that the method can extract valid signatures from the network traffic of applications.
The regular expressions of the signatures accurately describe keywords and their positions in network flows.