Kaggle: Personalized Medicine: Redefining Cancer Treatment

Problem statement

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Data sources

Get the data from Kaggle.

We have two data files: one contains information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify those mutations. Both files share a common column called ID.

Columns in each data file:

  • training_variants (ID, Gene, Variation, Class)
  • training_text (ID, Text)

Mapping to a ML problem

There are 9 distinct classes of mutations to classify into, so this is a multi-class classification problem, with log-loss as the metric.
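To make the metric concrete, here is a minimal sketch of multi-class log-loss written with NumPy. The function name `multiclass_log_loss` and the toy labels are illustrative assumptions, not code from the original notebook. It also shows a useful reference point: a model that assigns equal probability to all 9 classes scores ln(9), roughly 2.197.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=1, keepdims=True)  # rows sum to 1
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(probs[rows, y_true])))

n_classes = 9
y_true = np.array([0, 3, 8, 5])  # toy zero-based class indices
uniform = np.full((len(y_true), n_classes), 1.0 / n_classes)

# A uniform prediction over 9 classes gives ln(9), about 2.197.
baseline = multiclass_log_loss(y_true, uniform)
```

Any useful model should score well below this uniform baseline.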

Exploratory Data Analysis

How is the data presented?

First file
Number of data points : 3321
Number of features : 4
Features : [‘ID’ ‘Gene’ ‘Variation’ ‘Class’]

Second file
Number of data points : 3321
Number of features : 2
Features : [‘ID’ ‘TEXT’]

We merge the two files on ID. Next, we look for missing data: the TEXT attribute is missing for 5 instances. We fill each NaN with that row's Gene and Variation, joined by a space.
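The merge and fill step can be sketched with pandas as follows. The three rows are toy values made up for illustration, and the frame names `variants` and `text` are assumptions; only the column names come from the data description above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two files (values are illustrative only).
variants = pd.DataFrame({
    "ID": [0, 1, 2],
    "Gene": ["GENE_A", "GENE_B", "GENE_C"],
    "Variation": ["Deletion", "V600E", "Amplification"],
    "Class": [1, 2, 4],
})
text = pd.DataFrame({
    "ID": [0, 1, 2],
    "TEXT": ["first clinical text", np.nan, "third clinical text"],
})

# Join on the shared ID column.
data = pd.merge(variants, text, on="ID", how="left")

# Replace missing TEXT with "<Gene> <Variation>" for that row.
missing = data["TEXT"].isnull()
data.loc[missing, "TEXT"] = (
    data.loc[missing, "Gene"] + " " + data.loc[missing, "Variation"]
)
```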
We then split the data into train (64%), test (20%) and validation (16%) sets. Since this is a multi-class problem, we split using stratified sampling so that every split keeps the same class distribution as the full data. The distribution of classes in each split is shown in the following graphs.
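One way to obtain a 64/20/16 stratified split is to call scikit-learn's `train_test_split` twice, each time holding out 20%. This is a sketch under that assumption, using synthetic labels rather than the real Class column; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
ids = np.arange(3321)
y = rng.randint(1, 10, size=3321)  # synthetic labels in 1..9

# First hold out 20% of all rows as the test set.
ids_rest, ids_test, y_rest, y_test = train_test_split(
    ids, y, test_size=0.2, stratify=y, random_state=0
)
# Then hold out 20% of the remaining 80% (16% overall) for validation.
ids_train, ids_cv, y_train, y_cv = train_test_split(
    ids_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0
)
```

With 3321 rows, this procedure yields 2124 train, 665 test and 532 validation rows.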



Featurizing Gene

There are 229 different categories of genes in the train data, and they are distributed as follows:

Next, we featurize Gene using one-hot encoding giving us 229 new features.
In test data 643 out of 665 data points are covered: 96.69%
In cross validation data 514 out of 532 data points are covered: 96.61%
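The encoding and the coverage check can be sketched like this. The gene lists are toy values, and the use of `OneHotEncoder` with `handle_unknown="ignore"` is an assumption about how unseen categories are handled; the original notebook may have used a different encoder.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy gene columns (illustrative values only).
train = pd.DataFrame({"Gene": ["BRCA1", "TP53", "BRCA1", "EGFR"]})
test = pd.DataFrame({"Gene": ["TP53", "KRAS", "EGFR", "BRCA1"]})

# Fit on train only; genes unseen in train become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train[["Gene"]])
test_onehot = encoder.transform(test[["Gene"]])

# Coverage: share of test rows whose gene also occurs in train.
covered = test["Gene"].isin(train["Gene"].unique()).sum()
coverage = 100.0 * covered / len(test)
```

The number of columns equals the number of distinct genes seen in train, which is why 229 genes give 229 features.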

Featurizing Variations

There are 1924 different categories of variations in the train data, and they are distributed as follows:

Next, we featurize Variation using one-hot encoding giving us 1924 new features.
In test data 64 out of 665 data points are covered: 9.62%
In cross validation data 56 out of 532 data points are covered: 10.52%

Featurizing Text

Total number of unique words in train data : 54850

Next, we featurize Text using one-hot encoding giving us 54850 new features.
97.125% of the words in the test data appear in the train data
98.056% of the words in the cross validation data appear in the train data
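A possible way to build binary word features and measure vocabulary overlap is sketched below. It assumes `CountVectorizer(binary=True)` with default tokenization, which may differ from the original preprocessing (for example in stop-word handling or minimum document frequency); the sentences are toy examples.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["mutation in kinase domain", "tumor suppressor loss of function"]
test_text = ["kinase mutation increases activity", "loss of tumor suppressor"]

# binary=True marks word presence (0/1) instead of counting occurrences.
text_vectorizer = CountVectorizer(binary=True)
train_text_onehot = text_vectorizer.fit_transform(train_text)
test_text_onehot = text_vectorizer.transform(test_text)

# Share of unique test words that also occur in the train vocabulary.
train_vocab = set(text_vectorizer.vocabulary_)
test_vocab = set(CountVectorizer().fit(test_text).vocabulary_)
word_coverage = 100.0 * len(test_vocab & train_vocab) / len(test_vocab)
```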

Combined one-hot encoded features:
(number of data points * number of features) in train data = (2124, 57039)
(number of data points * number of features) in test data = (665, 57039)
(number of data points * number of features) in cross validation data = (532, 57039)
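The three feature blocks are presumably stacked side by side into one sparse matrix per split. A minimal sketch with `scipy.sparse.hstack`, using tiny placeholder matrices in place of the real Gene, Variation and Text encodings:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_rows = 5
# Placeholder blocks standing in for the three encoded features.
gene_onehot = csr_matrix(np.eye(n_rows, 3))
variation_onehot = csr_matrix(np.eye(n_rows, 4))
text_onehot = csr_matrix(np.ones((n_rows, 6)))

# Column-wise stacking keeps the row count and adds up the widths.
features = hstack([gene_onehot, variation_onehot, text_onehot]).tocsr()
```

All three blocks must come from encoders fitted on the train split so that train, test and cross validation matrices end up with the same number of columns.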

Machine Learning model

Applying a class-balanced linear SVM to the one-hot encoded features gave the following results:
For values of best alpha = 0.01 The train log loss is: 0.76
For values of best alpha = 0.01 The cross validation log loss is: 1.24
For values of best alpha = 0.01 The test log loss is: 1.15
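The sketch below shows one way such a model could be tuned and scored. It is an assumption, not the original code: the `alpha` hyperparameter suggests scikit-learn's `SGDClassifier` with hinge loss, and since a linear SVM does not output probabilities, it is wrapped in `CalibratedClassifierCV` so that log-loss can be computed. The data is synthetic and the alpha grid is illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stacked feature matrix and class labels.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for alpha in [1e-4, 1e-3, 1e-2, 1e-1]:
    # Hinge loss + L2 penalty is a linear SVM; "balanced" reweights classes.
    svm = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                        class_weight="balanced", random_state=0)
    # Sigmoid calibration turns SVM scores into class probabilities.
    model = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_cv)
    results[alpha] = log_loss(y_cv, probs, labels=model.classes_)

# Pick the alpha with the lowest cross validation log-loss.
best_alpha = min(results, key=results.get)
```

The gap between the train log loss (0.76) and the cross validation log loss (1.24) reported above suggests some overfitting, so a wider alpha grid may be worth exploring.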

References