For the past few years, NCC Group has been an industry partner to the Centre for Doctoral Training in Data Intensive Science (CDT in DIS) at University College London (UCL). The CDT is composed of over 80 academics from across UCL in areas such as High Energy Physics, Astrophysics, Atomic and Molecular Physics, Space and Climate Physics, Computer Science, Mathematics, Electrical Engineering, and Statistical Science, all with a shared interest in data-intensive methods. The program aims to equip PhD students in these fields with the data science skills to become leaders in academia and industry. As a part of this program, NCC Group works with a cohort of PhD students each year on a four-month project tackling a data science problem of industrial significance.
This year, given the ever-growing everyday problem of ransomware, and our continued research focus on machine learning, we focussed on the application of machine learning techniques for malware detection. Specifically, we sought to determine the efficacy of various individual ML primitives, as well as ensemble methods, for the static classification of Windows binaries in terms of whether or not they are malicious. In particular, the project focussed on PE (Portable Executable) files on Windows, which make up nearly half of all files submitted to VirusTotal. In the remainder of this post, I’ll describe some of this research at a very high level, and share some of the associated findings.
The full research paper, Machine Learning for Static Malware Analysis, written by CDT PhD students Emily Lewis, Toni Mlinarevic, and Alex Wilkinson of University College London, can be downloaded at the bottom of this post.
While our focus in 2021 was on malware analysis, keen readers of this blog may recall that our project with University College London for 2020 focussed on deepfakes, the report for which can be downloaded here.
Project Challenge
For this project, we challenged the researchers to develop a machine learning model that could independently evaluate whether an arbitrary Windows binary was or was not malicious. Using a dataset of 74,924 malware samples and 32,967 benign samples, the researchers trained several different machine learning models on features (characteristics) of the binaries, including:
- PE headers, which describe program structure and thus [generally] functionality;
- bytes n-grams, whereby the executable (in hex) is sampled in windows, each n bytes long;
- control flow graphs, in which the vertices of the graph are blocks of assembly instructions and the edges of the graph are control flow (for the purposes of this research, CFGs are specifically derived from opcodes);
- API call graphs, in which the vertices of the graph are functions and the edges of the graph are relationships between functions (for the purposes of this research, these graphs represent API calls).
The relative tradeoffs of these feature types as applied to malware analysis are discussed in detail within Section 3 of the paper.
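To make the bytes n-gram feature a little more concrete, here is a minimal Python sketch of how such features could be computed. It is not the researchers' implementation: the function names, the use of overlapping windows, and the fixed `vocabulary` of n-grams are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every window of n consecutive bytes in `data`.

    Assumption: windows overlap (stride of one byte). The paper may
    sample windows differently.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_features(data: bytes, vocabulary: list[bytes], n: int = 2) -> list[float]:
    """Turn n-gram counts into a fixed-length vector of relative frequencies.

    `vocabulary` is a hypothetical, pre-chosen list of n-grams; each output
    entry is the share of all windows in `data` matching that n-gram.
    """
    counts = byte_ngrams(data, n)
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


if __name__ == "__main__":
    # Toy input only; a real sample would be the raw bytes of a PE file.
    sample = b"MZ\x90\x00MZ"
    print(ngram_features(sample, [b"MZ", b"\x00\x00"]))
```

A fixed vocabulary keeps the feature vector the same length for every binary, which is what most classifiers expect as input.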
The researchers built the dataset themselves by pulling binaries from a number of Windows PCs/servers of various OS versions and from several hundred of the most popular Windows packages from a package manager, and by pulling malicious samples from an online malware repository (for details, see Section 2 of the paper). Upon creating a coarse-grained mapping of the sampled malware families, it was confirmed that the created dataset consisted of samples of several different strains of spyware, adware, and ransomware, offering sufficient indication for our proof-of-concept purposes that the classifier would be able to detect malware of a diverse range of origins.
Each individual model performed well in classifying the binaries as malicious or non-malicious. A hybrid (ensemble) method, trained using the prediction outputs of the individual models as its input features, improved overall classification performance to an accuracy of 98.9%, a meaningful improvement over the most effective individual model. This suggests both that machine learning is an effective tool for classifying potentially malicious compiled binaries and that ensemble methods can yield further performance gains on this type of classification task.
More specifically, the late fusion approach taken to the final classification task in this research is quite novel in this domain and has only once before been used for malware analysis of any kind. Late fusion involves training one model per feature type and then fusing the outputs of those models to reach a final decision (in contrast to early fusion, which trains a single model across all features/modalities). This research has shown that late fusion is a powerful future direction for ML-driven malware detection due to its robustness in the absence of complete featuresets and, more generally, for unseen samples, a likely real-world scenario when classifying binaries in the wild.
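As a rough illustration of why late fusion copes with incomplete featuresets, the Python sketch below fuses per-feature scores while skipping any feature type that could not be extracted. It is a simplification, not the method from the paper: the researchers trained a fusion model on the individual models' outputs, whereas this sketch substitutes a plain weighted average, and the names `late_fuse`, `scores`, `weights`, and `threshold` are hypothetical.

```python
from __future__ import annotations

from typing import Mapping, Optional


def late_fuse(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
    threshold: float = 0.5,
) -> tuple[float, bool]:
    """Fuse per-feature malware probabilities into a single decision.

    `scores` maps a feature type to the probability its own model assigned,
    or None when that feature could not be extracted for this binary.
    Assumption: fusion is a weighted average, standing in for the trained
    fusion model described in the paper.
    """
    available = {name: s for name, s in scores.items() if s is not None}
    if not available:
        raise ValueError("no per-feature model produced a score")
    # Renormalise over the models that actually ran, so a missing
    # feature type does not drag the fused score towards zero.
    total_weight = sum(weights[name] for name in available)
    fused = sum(weights[name] * s for name, s in available.items()) / total_weight
    return fused, fused >= threshold


if __name__ == "__main__":
    # Made-up scores for one binary; control flow graph extraction failed.
    example_scores = {
        "pe_headers": 0.9,
        "bytes_ngrams": 0.7,
        "control_flow_graph": None,
        "api_call_graph": 0.8,
    }
    equal_weights = {name: 1.0 for name in example_scores}
    print(late_fuse(example_scores, equal_weights))
```

An early fusion model, by contrast, would need some value for every feature before it could score the binary at all.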
Conclusion
Ultimately, the researchers were able to train a number of different machine learning models to identify malicious executables on the basis of features which included PE headers, bytes n-grams, control-flow graphs and API call graphs, all of which performed well. Making use of ensemble methods, the researchers were able to achieve a classification accuracy of 98.9%, suggesting that the particular featureset and ensemble model used (multi-modal late fusion) is effective for the detection of malware binaries at scale.
Future work in automated classification of malicious code – whether statically or dynamically, on both binaries and source code – remains an important and vast field. While there are numerous directions, perhaps the most interesting are those associated with ransomware (particularly that which is hard to detect or highly polymorphic in nature) given the vast societal implications of such work. Ideally, such tooling would be released as open source software or otherwise made widely accessible to defenders.
Acknowledgments
The students did an excellent job of quickly grasping the nuances of the problem of analyzing compiled binaries as opposed to compositional source code, as well as the weaknesses of traditional static malware analysis – a particularly strong feat considering that their primary fields are across mathematics and physics, not cybersecurity. They brought multiple sophisticated perspectives to the analysis of malware samples working backwards through the lens of data science toward uncovering signatures and identifiers within the associated samples, further supporting our long-held belief that true diversity in perspectives and areas of expertise is essential to creatively solving the many real-world problems involved in making the internet safer and more secure. They brought powerful critical analysis skills, an eagerness to learn, and refreshing creativity to the research, and made excellent progress especially on such a short timescale. We would like to thank University College London for their continued partnership with us through this program, and in particular to extend thanks to Professor Nikolaos Konstantinidis and Dr Tim Scanlon of UCL who lead this exceptional program and talented group of graduate students.
Research Paper
The full research paper can be downloaded below. This work is authored by Emily Lewis, Toni Mlinarevic, and Alex Wilkinson of University College London, and was mentored by Matt Lewis, Jennifer Fernick, and Mostafa Hassan of NCC Group.