KL Divergence on Iris Dataset
I was doing research on brain-computer interfaces and was interested in the technique of transfer learning. Based on numerous papers I read, KL divergence is a common method for measuring the similarity of distributions between subjects' EEG signals. At that time, I found an article that explained how to use KL divergence to measure similarity on data with one variable [1], but I couldn't find an example of KL divergence on multivariate data.
As someone with minimal knowledge of information theory, I wanted to see how, given the equation, it could be applied to multivariate data. This blog is the kind of thing I wish I had found at the time. Here I will present a quick and simple example of applying Kullback-Leibler divergence to data with multiple features (multivariate data). In this example we will use our old friend, the iris dataset.
The Definition and Equation
Given two d-dimensional multivariate Gaussian distributions P ~ N(μ_p, Σ_p) and Q ~ N(μ_q, Σ_q), where μ and Σ are the mean vector and covariance matrix of the d-dimensional features over all samples, respectively, the KL divergence from P to Q can be calculated in closed form.
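Written out (this is the standard closed-form result for two Gaussians, matching references [2] and [3]):

D_{KL}(P \,\|\, Q) = \frac{1}{2}\left[ \operatorname{tr}\!\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_q - \mu_p)^{\top}\Sigma_q^{-1}(\mu_q - \mu_p) - d + \ln\frac{\det\Sigma_q}{\det\Sigma_p} \right]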
For those curious about the derivation, please look at references [2] and [3]. It's worth noting that:
- d-dimensional simply means we have d features in our data.
- The more similar the two distributions are, the closer the KL divergence is to zero, and vice versa.
The Dataset
The iris dataset is a classic public dataset consisting of 150 samples and 4 features. The data contain three classes of iris flower; for simplicity we will only compare the features of two of them. We will also test KL divergence on this two-class data with both two features and all four features, since using only two features allows us to visualize the feature distributions. These are the four cases on which KL divergence will be tested:
- Same class, two features
- Different class, two features
- Same class, four features
- Different class, four features
Load dataset
Let us import the necessary packages and load the dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

data = load_iris()
iris_data = pd.DataFrame(np.c_[data.data, data.target], columns=data.feature_names + ['target'])

# Peek at the dataset
iris_data.sample(5)
Let's look at the number of samples per class.
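A simple groupby will do; something like:

iris_data.groupby('target').size()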
Output:
target
0.0 50
1.0 50
2.0 50
dtype: int64
Because KL divergence measures the similarity between two distributions, we will only consider class 0 and class 1.
iris_two = iris_data[iris_data.target != 2]

iris_two.groupby('target').size()
Output:
target
0.0 50
1.0 50
dtype: int64
Define KL Divergence
Let’s convert the above equation into code:
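A minimal sketch of such a KL_div function, assuming both inputs are (n_samples, d) arrays or DataFrames over the same d features and that a Gaussian is fitted to each via its sample mean and covariance, could look like this:

def KL_div(X1, X2):
    """KL divergence D(P || Q) between Gaussians fitted to X1 (P) and X2 (Q)."""
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    d = X1.shape[1]

    # Fit a Gaussian to each dataset: mean vector and covariance matrix
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    cov1, cov2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)

    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1

    # Closed-form KL divergence between two multivariate Gaussians
    return 0.5 * (np.trace(cov2_inv @ cov1)
                  + diff @ cov2_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

The exact values printed below depend on details such as the covariance estimator, so your results may differ slightly.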
Now we are ready to compute KL divergence on each pre-defined case.
Case 1: Same class, two features
This case only considers class 0 and two of its features. We will split the class 0 data into two groups; our hypothesis is that the KL divergence should be close to zero.
iris0 = iris_two[iris_two.target == 0]

print(iris0.groupby('target').size())
print(iris0.shape)
Output:
target
0.0 50
dtype: int64
(50, 5)
There are 50 samples of class 0; we will split them into two sets of 25 samples each.
# Only take two features
iris0 = iris0[['sepal length (cm)', 'petal length (cm)']]

# Divide the data into two sets, _1 and _2
iris0_1 = iris0.iloc[:25]
iris0_2 = iris0.iloc[25:]

print('Group 1 shape:', iris0_1.shape)
print('Group 2 shape:', iris0_2.shape)
Output:
Group 1 shape: (25, 2)
Group 2 shape: (25, 2)
Next, we can visualize how the two features are distributed:
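A scatter plot of the two groups, in the same style as the one used later for the two classes, can be produced with something like:

# Plot the two halves of class 0 on the same axes
fig, ax = plt.subplots(figsize=(8, 5))
iris0_1.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='b', alpha=0.5)
iris0_2.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='r', alpha=0.5)
plt.show()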
From the plot above we can infer that iris0_1 and iris0_2 are really similar; their features are distributed in the same manner (of course, they come from the same class). The value of KL divergence should therefore be really small.
KL_0_0_2feat = KL_div(iris0_1, iris0_2)

print('The value of KL divergence of two data from the same class: %.3f' % KL_0_0_2feat)
print('This should be quite small, close to zero')
Output:
The value of KL divergence of two data from the same class: 0.190
This should be quite small, close to zero
Yes, 0.190 indicates that the two groups are highly similar. Let's proceed with KL divergence on two different classes.
Case 2: Different class, two features
Previously we stored the class 0 data in iris0; now let's grab the data of class 1.
# Take iris dataset of class 1
iris1 = iris_two[iris_two.target == 1]

print(iris1.groupby('target').size())
Output:
target
1.0 50
dtype: int64
Then, we can visualize the features of iris1 and iris0 to see their distributions.
# Visualize both datasets, class 0 and class 1
fig, ax = plt.subplots(figsize=(8, 5))

iris0.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='b', alpha=0.5)
iris1.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='r', alpha=0.5)
plt.show()
It's crystal clear that the two datasets are not similar: the red and blue features are distributed differently, so we expect the KL divergence to be quite high, or at least higher than the previous value.
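The computation is analogous to the previous case; a sketch, selecting the same two features from iris1 (the name iris1_2feat is introduced here just for illustration), is:

# Take the same two features of class 1 (iris1_2feat is a hypothetical helper name)
iris1_2feat = iris1[['sepal length (cm)', 'petal length (cm)']]

KL_0_1_2feat = KL_div(iris0, iris1_2feat)
print('The value of KL divergence of two data from different classes: %.3f' % KL_0_1_2feat)
print('This should be bigger than the previous result')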
Output:
The value of KL divergence of two data from different classes: 26.676
This should be bigger than the previous result
Bingo, the result matches our expectations. Now let's extend this example to all four features of our data.
Case 3: Same class, four features
# Take only class 0, all four features
iris0_4feat = iris_two.loc[iris_two.target == 0]

# Divide them into two groups
iris0_4feat_1 = iris0_4feat.iloc[:25, :-1]
iris0_4feat_2 = iris0_4feat.iloc[25:, :-1]

print('All data shape:', iris0_4feat.shape)
print('Group 1 shape:', iris0_4feat_1.shape)
print('Group 2 shape:', iris0_4feat_2.shape)
Output:
All data shape: (50, 5)
Group 1 shape: (25, 4)
Group 2 shape: (25, 4)
We expect the KL divergence to be small, because the two datasets come from the same class.
KL_0_0_4feat = KL_div(iris0_4feat_1, iris0_4feat_2)

print('The value of KL divergence of two data from the same class: %.3f' % KL_0_0_4feat)
print('This should be quite small ~ 0')
Output:
The value of KL divergence of two data from the same class: 0.857
This should be quite small ~ 0
Case 4: Different classes, four features
# Take class 0 and 1 only
iris0_4feat = iris_two.loc[iris_two.target == 0]
iris1_4feat = iris_two.loc[iris_two.target == 1]

# Remove target column
iris0_4feat = iris0_4feat.iloc[:, :-1]
iris1_4feat = iris1_4feat.iloc[:, :-1]

print('Group 1 shape:', iris0_4feat.shape)
print('Group 2 shape:', iris1_4feat.shape)
Output:
Group 1 shape: (50, 4)
Group 2 shape: (50, 4)
Next, compute KL divergence
KL_0_1_4feat = KL_div(iris0_4feat, iris1_4feat)

print('The value of KL divergence of two data from different classes: %.3f' % KL_0_1_4feat)
print('This should be quite large >> 0')
Output:
The value of KL divergence of two data from different classes: 52.724
This should be quite large >> 0
Recap
Now here’s a recap of all similarity results:
print('KL divergence between two classes')
print('using two features: %.3f' % KL_0_1_2feat)
print('using four features: %.3f' % KL_0_1_4feat)
print('')
print('KL divergence of the same class')
print('using two features: %.3f' % KL_0_0_2feat)
print('using four features: %.3f' % KL_0_0_4feat)
Output:
KL divergence between two classes
using two features: 26.676
using four features: 52.724

KL divergence of the same class
using two features: 0.278
using four features: 0.857
This is the end of this blog; constructive feedback and suggestions are always welcome. I hope this example is of use to you. Have a good day!
Reference
- [1] https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8
- [2] https://stats.stackexchange.com/questions/257735/kl-divergence-between-two-bivariate-gaussian-distribution
- [3] https://mr-easy.github.io/2020-04-16-kl-divergence-between-2-gaussian-distributions/