Kaggle: Microsoft Malware Detection

Problem statement

Given a file, identify which type of malware it contains.

Data Source

Kaggle

For every malware sample, we have two files:

  • .asm file
  • .bytes file (the raw data contains the hexadecimal representation of the file’s binary content, without the PE header)

The training dataset totals 200 GB, of which 50 GB is .bytes files and 150 GB is .asm files.

There are 10,868 .bytes files and 10,868 .asm files, 21,736 files in total.

The given data contains 9 types of malware (9 classes):

  • Ramnit
  • Lollipop
  • Kelihos_ver3
  • Vundo
  • Simda
  • Tracur
  • Kelihos_ver1
  • Obfuscator.ACY
  • Gatak

Mapping to a ML problem

The target variable has 9 classes, so this is a multi-class classification problem. The evaluation metric is log-loss.
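
To make the metric concrete, here is a minimal sketch of multi-class log-loss in NumPy. The function name, the clipping constant, and the toy inputs are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices in [0, n_classes), shape (n_samples,)
    probs:  predicted probabilities, shape (n_samples, n_classes)
    """
    y_true = np.asarray(y_true)
    # Clip so that log(0) never occurs, then renormalise each row.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_probs)))

# A model that predicts 1/9 for every class scores ln(9), about 2.197.
uniform = np.full((4, 9), 1 / 9)
baseline = multiclass_log_loss([0, 3, 5, 8], uniform)
```

The uniform baseline is a useful yardstick: any model with a log-loss well below ln(9) is doing better than guessing blindly across the 9 classes.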

Exploratory Data Analysis

First, we separate the two file types into different folders.
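
One way this separation could be done is sketched below. The folder names (`train`, `byteFiles`, `asmFiles`) and the helper `split_by_extension` are hypothetical; the demo runs on a throwaway directory with made-up file names.

```python
import shutil
import tempfile
from pathlib import Path

def split_by_extension(source_dir, bytes_dir, asm_dir):
    """Move *.bytes and *.asm files from source_dir into separate folders."""
    source_dir, bytes_dir, asm_dir = Path(source_dir), Path(bytes_dir), Path(asm_dir)
    bytes_dir.mkdir(parents=True, exist_ok=True)
    asm_dir.mkdir(parents=True, exist_ok=True)
    moved = {"bytes": 0, "asm": 0}
    # sorted() materialises the listing before any file is moved.
    for path in sorted(source_dir.iterdir()):
        if path.suffix == ".bytes":
            shutil.move(str(path), str(bytes_dir / path.name))
            moved["bytes"] += 1
        elif path.suffix == ".asm":
            shutil.move(str(path), str(asm_dir / path.name))
            moved["asm"] += 1
    return moved

# Demo on a temporary directory with two made-up sample ids.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "train"
    raw.mkdir()
    for name in ["a1.bytes", "a1.asm", "b2.bytes", "b2.asm"]:
        (raw / name).write_text("")
    moved = split_by_extension(raw, root / "byteFiles", root / "asmFiles")
    remaining = list(raw.iterdir())
```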

Distribution of malware classes

[Figures: class distribution in the whole, train, test, and validation data sets]

Feature Extraction

  • .bytes file: each line of a .bytes file looks like this:

00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08

The first feature is the size of each .bytes file. Next, we remove the address from each line (00401000 in the sample above) and build a unigram bag of words over the remaining byte tokens, which gives another 257 features.
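
The unigram counting step can be sketched as follows. This is a minimal illustration, not the original code: the vocabulary is assumed to be the 256 two-digit hex values plus the "??" placeholder token that appears in the Kaggle .bytes files, which would account for the 257 features, and `byte_unigram_counts` is a hypothetical helper name.

```python
from collections import Counter

# Assumed vocabulary: 256 hex byte values plus the "??" placeholder (257 tokens).
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]

def byte_unigram_counts(lines):
    """Count byte tokens in a file, skipping the leading address on each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(tok.upper() for tok in tokens[1:])  # tokens[0] is the address
    return [counts.get(tok, 0) for tok in VOCAB]

# The sample line shown above.
sample = ["00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08"]
features = byte_unigram_counts(sample)
```

For a real file, the lines would come from iterating over the open file handle, and the file-size feature could be read with `os.path.getsize`.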

  • .asm file: the assembly files contain the following kinds of content:
    • Address
    • Segments
    • Opcodes
    • Registers
    • Function calls
    • APIs

Since the .asm data is 150 GB, we followed a discussion on Kaggle to choose 52 major commands (such as push and pop) and built a unigram bag of words over them.
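
A rough sketch of counting a fixed command list in .asm lines is shown below. The keyword list is a hypothetical subset (the post names only push and pop out of the 52), the sample lines are made up in an assumed disassembler layout, and `asm_keyword_counts` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical subset of the command list; only push and pop come from the post.
KEYWORDS = ["push", "pop", "mov", "call", "jmp", "add", "sub", "xor"]

def asm_keyword_counts(lines, keywords=KEYWORDS):
    """Count how often each keyword appears as a whitespace-separated token."""
    wanted = set(keywords)
    counts = Counter()
    for line in lines:
        code = line.split(";", 1)[0]  # drop trailing ';' comments
        counts.update(tok for tok in code.lower().split() if tok in wanted)
    return [counts.get(k, 0) for k in keywords]

# Made-up sample lines: segment:address, raw bytes, then the instruction.
sample = [
    ".text:00401000 56              push    esi",
    ".text:00401001 8D 44 24 08     lea     eax, [esp+8]",
    ".text:00401005 50              push    eax",
    ".text:00401006 8B F1           mov     esi, ecx",
    ".text:00401008 E8 1C 1B 00 00  call    sub_402B29 ; push args first",
]
vector = asm_keyword_counts(sample)
```

Restricting the vocabulary to a short fixed list keeps the feature vector small and lets each file be processed in a single streaming pass, which matters at this data size.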

In total, this gives roughly 300 features for the ML models.

Machine Learning modeling

Using parallel processing, we trained the following classifiers:

Classifier       Train log-loss   Validation log-loss   Test log-loss
Random Forest    0.016            0.016                 0.040
XGBoost          0.011            0.031                 0.032
Tuned XGBoost    0.012            0.034                 0.031
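
As a rough illustration of how a classifier like the random forest above might be trained and scored with log-loss, here is a sketch using scikit-learn. The data is synthetic, the hyperparameters are arbitrary, and `n_jobs=-1` is only one common way to parallelize; the post does not say how its own models were configured or parallelized.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, n_features = 9, 30

# Synthetic stand-in data: each class gets its own mean vector.
y = np.repeat(np.arange(n_classes), 60)
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
X = centers[y] + rng.normal(size=(len(y), n_features))

# Stratified split keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# n_jobs=-1 asks scikit-learn to build trees on all available cores.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

labels = np.arange(n_classes)
train_loss = log_loss(y_train, model.predict_proba(X_train), labels=labels)
test_loss = log_loss(y_test, model.predict_proba(X_test), labels=labels)
```

Comparing `train_loss` with `test_loss` is the same check the table makes: a large gap between the two suggests overfitting.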