Kaggle: Quora question pair similarity

4 minute read

Problem statement

To predict which of the provided pairs of questions contain two questions with the same meaning.

Source

Kaggle: Quora question pair

Focus area

To output the probability that a pair of questions are duplicates, so that any threshold of choice can be applied with minimal misclassification.

Data source

The data is available on Kaggle, features of which are briefly summarised here -

  • id - the id of a training set question pair
  • qid1, qid2 - unique ids of each question (only available in train.csv)
  • question1, question2 - the full text of each question
  • is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Formulating an ML problem

Since the target column is binary (0 - not duplicate, 1 - duplicate), this is a binary classification problem.

The metric suggested by Kaggle for this competition is log-loss, which suits the goal of predicting how certain we are that two questions are similar, in terms of a probability. A perfect log-loss is 0 and the worst is infinity.
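
As a quick illustration (the labels and probabilities below are made up, not from the competition data), scikit-learn's `log_loss` shows how confidently wrong predictions are penalised far more heavily than confident correct ones:

```python
from sklearn.metrics import log_loss

# log-loss = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) )
y_true = [0, 1, 1, 0]

confident_correct = [0.05, 0.95, 0.90, 0.10]   # close to the truth -> small loss
confident_wrong   = [0.95, 0.05, 0.10, 0.90]   # confidently wrong  -> large loss

print(log_loss(y_true, confident_correct))  # small value, close to 0
print(log_loss(y_true, confident_wrong))    # large value, grows towards infinity
```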

Exploratory Data Analysis

Loading the data

The data from train.csv, which has nearly 400K rows, is loaded into a pandas dataframe.

Missing data

Only question2 has 2 missing values, which are filled with an empty string.
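
A minimal sketch of the loading and fill steps (the file path is an assumption):

```python
import pandas as pd

# Assumed location of the Kaggle training file
df = pd.read_csv("train.csv")

print(df.shape)            # nearly 400K rows, per the EDA above
print(df.isnull().sum())   # question2 shows a couple of missing values

# Replace missing question text with an empty string
df = df.fillna("")
```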

Distribution of target data

There are more non-duplicate question pairs (about 63%) than duplicate question pairs (about 37%).

Number of unique questions combining both columns question1, question2

  • Total number of unique questions: 537933
  • Number of unique questions that appear more than once: 111780 (20.8%); hence, nearly 80% of questions appear only once
  • Maximum number of times a single question is repeated: 157

Duplicate data

There are 0 duplicate rows.

Feature engineering

Constructed a few features (a sketch of how they can be computed follows the list):

  • freq_qid1 = Frequency of qid1’s
  • freq_qid2 = Frequency of qid2’s
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = number of common unique words in Question 1 and Question 2
  • word_Total = total number of words in Question 1 + total number of words in Question 2
  • word_share = word_Common / word_Total
  • freq_q1+freq_q2 = sum total of frequency of qid1 and qid2
  • freq_q1-freq_q2 = absolute difference of frequency of qid1 and qid2
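
A minimal sketch of these computations on the dataframe (column names follow the Kaggle schema; the helper function is illustrative only):

```python
# Frequency-based features from the question ids
df['freq_qid1'] = df.groupby('qid1')['qid1'].transform('count')
df['freq_qid2'] = df.groupby('qid2')['qid2'].transform('count')

# Simple length and word-count features
df['q1len'] = df['question1'].str.len()
df['q2len'] = df['question2'].str.len()
df['q1_n_words'] = df['question1'].apply(lambda s: len(str(s).split()))
df['q2_n_words'] = df['question2'].apply(lambda s: len(str(s).split()))

def word_common(row):
    # Number of unique words shared by the two questions
    w1 = set(str(row['question1']).lower().split())
    w2 = set(str(row['question2']).lower().split())
    return len(w1 & w2)

df['word_Common'] = df.apply(word_common, axis=1)
df['word_Total'] = df['q1_n_words'] + df['q2_n_words']
df['word_share'] = df['word_Common'] / df['word_Total'].replace(0, 1)

df['freq_q1+q2'] = df['freq_qid1'] + df['freq_qid2']
df['freq_q1-q2'] = (df['freq_qid1'] - df['freq_qid2']).abs()
```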

Checking how the features word_share and word_Common are going to be helpful -

As observed from the above plots, the PDFs of word_share for classes 0 and 1 can be useful, as they are neither completely overlapping nor ideally separated.

The word_Common doesn’t help much in distinguishing 0 from 1.
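
A minimal sketch of how these class-conditional distributions can be compared, using seaborn KDE plots (the styling choices are my own):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Overlay the distributions of word_share for duplicates vs non-duplicates
plt.figure(figsize=(8, 4))
sns.kdeplot(df[df['is_duplicate'] == 0]['word_share'], label='not duplicate (0)')
sns.kdeplot(df[df['is_duplicate'] == 1]['word_share'], label='duplicate (1)')
plt.xlabel('word_share')
plt.legend()
plt.show()
```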

Text preprocessing

  • Removing HTML tags
  • Removing punctuation
  • Performing stemming
  • Removing stopwords
  • Expanding contractions and special symbols (e.g., expanding the % symbol to “percent”)
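
A minimal sketch of these preprocessing steps, assuming NLTK stopwords, the PorterStemmer, and BeautifulSoup for stripping HTML (the exact libraries and the contraction list shown are assumptions):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords   # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    text = str(text).lower()
    # Remove html tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Expand a few symbols and contractions (illustrative subset)
    text = text.replace('%', ' percent ').replace('$', ' dollar ')
    text = text.replace("won't", "will not").replace("can't", "can not").replace("'ve", " have")
    # Remove punctuation
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Remove stopwords and perform stemming
    tokens = [stemmer.stem(w) for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)

df['question1'] = df['question1'].apply(preprocess)
df['question2'] = df['question2'].apply(preprocess)
```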

Advanced feature engineering

Added 15 more advanced features after tokenization, including features from fuzzywuzzy (a sketch of a few of them follows the list):

  • cwc_min = common_word_count / min(len(q1_words), len(q2_words))
  • cwc_max = common_word_count / max(len(q1_words), len(q2_words))
  • csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
  • csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
  • ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
  • ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
  • last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
  • first_word_eq = int(q1_tokens[0] == q2_tokens[0])
  • abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
  • mean_len = (len(q1_tokens) + len(q2_tokens))/2
  • fuzz_ratio
  • fuzz_partial_ratio
  • token_sort_ratio
  • token_set_ratio
  • longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
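
A hedged sketch of some of these token-level features (the helper function and the `SAFE_DIV` guard against empty questions are my own; the four ratio features come straight from fuzzywuzzy):

```python
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
SAFE_DIV = 1e-4  # guard against division by zero on empty questions

def token_features(q1, q2):
    q1_tokens, q2_tokens = str(q1).split(), str(q2).split()
    if not q1_tokens or not q2_tokens:
        return [0.0] * 8

    q1_words = set(w for w in q1_tokens if w not in STOP_WORDS)
    q2_words = set(w for w in q2_tokens if w not in STOP_WORDS)
    q1_stops = set(w for w in q1_tokens if w in STOP_WORDS)
    q2_stops = set(w for w in q2_tokens if w in STOP_WORDS)

    common_word_count = len(q1_words & q2_words)
    common_stop_count = len(q1_stops & q2_stops)
    common_token_count = len(set(q1_tokens) & set(q2_tokens))

    return [
        common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_min
        common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_max
        common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_min
        common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_max
        common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV), # ctc_min
        common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV), # ctc_max
        int(q1_tokens[-1] == q2_tokens[-1]),                                   # last_word_eq
        int(q1_tokens[0] == q2_tokens[0]),                                     # first_word_eq
    ]

# Fuzzy-matching features taken directly from fuzzywuzzy
df['fuzz_ratio'] = df.apply(lambda r: fuzz.ratio(str(r['question1']), str(r['question2'])), axis=1)
df['fuzz_partial_ratio'] = df.apply(lambda r: fuzz.partial_ratio(str(r['question1']), str(r['question2'])), axis=1)
df['token_sort_ratio'] = df.apply(lambda r: fuzz.token_sort_ratio(str(r['question1']), str(r['question2'])), axis=1)
df['token_set_ratio'] = df.apply(lambda r: fuzz.token_set_ratio(str(r['question1']), str(r['question2'])), axis=1)
```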

Feature analysis

Frequency of duplicate question pairs

Frequency of non-duplicate question pairs

Pair plot of features [‘ctc_min’, ‘cwc_min’, ‘csc_min’, ‘token_sort_ratio’]

As can be observed from the plots, there is partial separability among the features, which can be quite helpful in distinguishing 0 from 1.

Distribution of the token_sort_ratio

The PDFs are not perfectly overlapping, hence it will be a good feature.

Distribution of the fuzz_ratio

Here also, the PDFs are not perfectly overlapping, hence it will be a good feature.

Since there are 15 such features, let’s visualise them together using the t-SNE approach.

Dimensionality reduction by T-SNE

We can see that in some regions of the projection the 0s and 1s separate out easily, indicating that our features will be helpful.
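
A minimal sketch of the t-SNE projection, assuming the dataframe already holds all 15 advanced feature columns (the column list, sample size, and perplexity below are assumptions):

```python
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

feature_cols = ['cwc_min', 'cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max',
                'last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len',
                'fuzz_ratio', 'fuzz_partial_ratio', 'token_sort_ratio',
                'token_set_ratio', 'longest_substr_ratio']

# t-SNE is slow, so project only a subsample
sample = df.sample(5000, random_state=42)
X_scaled = MinMaxScaler().fit_transform(sample[feature_cols])

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)

sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=sample['is_duplicate'], alpha=0.5)
plt.title("t-SNE projection of the advanced features")
plt.show()
```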

TF-IDF weighted GloVe (Global Vectors)

Creating TF-IDF weighted GloVe features for question1 and question2.

GloVe and Word2Vec build on a similar core concept, which is capturing semantic similarity. Optionally, we could use Word2Vec instead.

Now we have basic features, advanced features, and TF-IDF weighted GloVe features for each question, making up the final set of independent features.
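
A hedged sketch of one way to build such vectors: the GloVe-based word vectors shipped with a large spaCy model, averaged per question with each word weighted by its IDF value (the spaCy model choice and the IDF-only weighting are assumptions, not necessarily the exact recipe used here):

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_lg')   # model with GloVe-style vectors (assumed)

# Fit TF-IDF over all question text to obtain per-word weights
questions = list(df['question1']) + list(df['question2'])
tfidf = TfidfVectorizer()
tfidf.fit(questions)
word2idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def tfidf_weighted_vector(text):
    # Weighted average of the word vectors for one question
    doc = nlp(str(text))
    vecs, weights = [], []
    for token in doc:
        if token.has_vector:
            vecs.append(token.vector)
            weights.append(word2idf.get(token.text.lower(), 1.0))
    if not vecs:
        return np.zeros(nlp.vocab.vectors_length)
    return np.average(vecs, axis=0, weights=weights)

df['q1_vec'] = df['question1'].apply(tfidf_weighted_vector)
df['q2_vec'] = df['question2'].apply(tfidf_weighted_vector)
```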

Machine Learning Model

Loading data

The CSV is stored in a SQLite database (created just for exploration), and the data is loaded back to initiate modelling. Next, the data is split into train and test sets.
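
A minimal sketch of this step, assuming the stored table holds only the engineered numeric features plus the target (the database file name, table name, and 70/30 split are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

# Store the featurised data in a SQLite database, just to explore it with SQL
engine = create_engine('sqlite:///quora_features.db')
df.to_sql('features', engine, if_exists='replace', index=False)

# Read it back and split into train and test
data = pd.read_sql_table('features', engine)
X = data.drop(columns=['is_duplicate'])
y = data['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```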

Implementing models

1. Logistic Regression

Implemented LR with hyper-parameter tuning and L2 regularisation, followed by calibration of the scores (log-loss requires well-calibrated probabilities, and plain LR scores may not be well calibrated).

The tuned LR gives a log-loss of 0.52 (overfitting was also checked for and did not occur).
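
A minimal sketch of this pipeline, using SGD-based logistic regression wrapped in a calibrator (the alpha value is a placeholder, not the tuned hyper-parameter):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# Logistic regression via SGD with L2 regularisation
# (loss='log_loss' is named 'log' in older scikit-learn versions)
lr = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4, random_state=42)

# Wrap in a calibrator so predict_proba is well calibrated for log-loss
clf = CalibratedClassifierCV(lr, method='sigmoid')
clf.fit(X_train, y_train)

print("train log-loss:", log_loss(y_train, clf.predict_proba(X_train)))
print("test  log-loss:", log_loss(y_test, clf.predict_proba(X_test)))
```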

2. Linear SVM

Also implemented a linear SVM, giving a log-loss of 0.489 (overfitting was also checked for and did not occur).

3. GBDT - XGBoost

Also implemented GBDT with XGBoost, giving a log-loss of 0.35 (overfitting was also checked for and did not occur).
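
A minimal sketch of the XGBoost model evaluated with log-loss (the hyper-parameter values shown are placeholders, not the tuned ones from this write-up):

```python
import xgboost as xgb
from sklearn.metrics import log_loss

# Gradient boosted decision trees with log-loss as the evaluation metric
model = xgb.XGBClassifier(
    n_estimators=400,
    max_depth=6,
    learning_rate=0.1,
    eval_metric='logloss',
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print("train log-loss:", log_loss(y_train, model.predict_proba(X_train)))
print("test  log-loss:", log_loss(y_test, model.predict_proba(X_test)))
```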

References