KL Divergence on Iris Dataset

Orvin Demsy
6 min read · Aug 21, 2020


I was doing research in brain-computer interface and was interested in the technique of transfer learning. Based on numerous papers I read, KL-divergence is a common method to measure similarity of distribution between subjects’ EEG signals. At that time, I found an article that explained KL divergence to measure similarity on data with one variable [1]. But I couldn’t find example KL divergence on multivariate data.

As someone who had minimal knowledge in information theory, I wanted to see how, given the equation, it can be applied to multivariate data. This blog is the kind of thing I wished I had found at that time. Here I will present a quick and simple example of applying Kullback-Leibler divergence on data with multiple features (multivariate data). In this example we will use our long-old friend, iris dataset.

The Definition and Equation

Given two d-dimensional multivariate Gaussian distributions P and Q,

$$P \sim \mathcal{N}(\mu_p, \Sigma_p), \qquad Q \sim \mathcal{N}(\mu_q, \Sigma_q)$$

where μ and Σ are the mean vector and covariance matrix of the d-dimensional features over all samples, respectively, the KL divergence can be calculated as follows:

$$D_{KL}(P \,\|\, Q) = \frac{1}{2}\left[\log\frac{|\Sigma_q|}{|\Sigma_p|} - d + \mathrm{tr}\!\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_q - \mu_p)^{T}\,\Sigma_q^{-1}\,(\mu_q - \mu_p)\right]$$
For those who are curious about the derivation, please look at references [2] and [3]. It’s worth noting that:

  1. d dimensional simply means we have d features in our data.
  2. The more similar two distributions are, the closer the KL divergence is to zero, and vice versa (a quick check of this is shown below).
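
As a quick check of the second point: if P and Q are identical (μ_p = μ_q and Σ_p = Σ_q), every term in the formula above cancels out:

$$D_{KL}(P \,\|\, P) = \frac{1}{2}\left[\log 1 - d + \mathrm{tr}(I_d) + 0\right] = \frac{1}{2}\left[0 - d + d + 0\right] = 0$$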

The Dataset

The iris dataset is a classic public dataset that consists of 150 samples and 4 features, covering three classes of iris flower. For simplicity we will only compare the features of two of the classes. We will test KL divergence on this two-class data using both two features and all four features; using only two features allows us to visualize the feature distributions. These are the four cases on which KL divergence will be tested:

  1. Same class, two features
  2. Different class, two features
  3. Same class, four features
  4. Different class, four features

Load dataset

Let us import the necessary packages and load the dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

data = load_iris()
iris_data = pd.DataFrame(np.c_[data.data, data.target], columns=data.feature_names + ['target'])

# Peek at dataset
iris_data.sample(5)
5 samples of iris dataset

Let’s look at the number of samples in each class.
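
The per-class counts shown below can be obtained with a pandas groupby, in the same style as the later snippets (a minimal sketch):

iris_data.groupby('target').size()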

Output:

target 
0.0 50
1.0 50
2.0 50
dtype: int64

Because KL divergence measures the similarity between two different distributions, we will only consider class 0 and class 1.

iris_two = iris_data[iris_data.target != 2]
iris_two.groupby('target').size()

Output:

target 
0.0 50
1.0 50
dtype: int64

Define KL Divergence

Let’s convert the above equation into code:
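
Below is a minimal sketch of a KL_div function implementing the formula above, assuming each argument is a DataFrame (or 2-D array) of shape samples × features, and that the Gaussian parameters are estimated by the sample mean and covariance:

def KL_div(df1, df2):
    # Estimate the Gaussian parameters (mean vector and covariance matrix) of each group
    mu1 = df1.values.mean(axis=0)
    mu2 = df2.values.mean(axis=0)
    sigma1 = np.cov(df1.values, rowvar=False)
    sigma2 = np.cov(df2.values, rowvar=False)

    d = df1.shape[1]                  # number of features
    inv_sigma2 = np.linalg.inv(sigma2)
    diff = mu2 - mu1

    # 0.5 * [ log(|Σq|/|Σp|) - d + tr(Σq⁻¹ Σp) + (μq - μp)ᵀ Σq⁻¹ (μq - μp) ]
    return 0.5 * (np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1))
                  - d
                  + np.trace(inv_sigma2 @ sigma1)
                  + diff @ inv_sigma2 @ diff)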

Now we are ready to compute KL divergence on each pre-defined case.

Case 1: Same class, two features

This case will only consider class 0 and two of its features. We will split the class 0 data into two groups; our hypothesis is that the KL divergence should be close to zero.

iris0 = iris_two[iris_two.target == 0]
print(iris0.groupby('target').size())
print(iris0.shape)

Output:

target 
0.0 50
dtype: int64
(50, 5)

There are 50 samples of class 0; we will split them into two groups of 25 samples each.

# Only take two features
iris0 = iris0[['sepal length (cm)', 'petal length (cm)']]
# Divide the data into two set _1 and _2
iris0_1 = iris0.iloc[:25]
iris0_2 = iris0.iloc[25:]
print('Group 1 shape:', iris0_1.shape)
print('Group 2 shape:', iris0_2.shape)

Output:

Group 1 shape: (25, 2)
Group 2 shape: (25, 2)

Next, we can visualize how the two features are distributed
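
A possible way to produce the scatter plot below, following the same pattern used later for the two classes (the figure size, colors, and alpha are assumptions):

# Visualize the two groups of class 0
fig, ax = plt.subplots(figsize=(8, 5))
iris0_1.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='b', alpha=0.5)
iris0_2.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='r', alpha=0.5)
plt.show()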

Feature distribution of iris0_1 and iris0_2

Given the above plot, we can infer that iris0_1 and iris0_2 are very similar; their features are distributed in the same manner (of course, they come from the same class). The value of the KL divergence should be very small.

KL_0_0_2feat = KL_div(iris0_1, iris0_2)
print('The value of KL divergence of two data from the same class: %.3f' % KL_0_0_2feat)
print('This should be quite small, close to zero')

Output:

The value of KL divergence of two data from the same class: 0.190
This should be quite small, close to zero

Yes, 0.190 indicates that the two groups have high similarity. Let’s proceed with KL divergence on two different classes.

Case 2: Different class, two features

Previously we already stored the class 0 data in iris0; let’s grab the data of class 1.

# Take iris dataset of class 1
iris1 = iris_two[iris_two.target==1]
print(iris1.groupby('target').size())

Output:

target 
1.0 50
dtype: int64

Then, we can visualize the features of iris1 and iris0 to see their distributions.

# Visualize both dataset class 0 and class 1
fig, ax = plt.subplots(figsize=(8, 5))
iris0.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='b', alpha=0.5)
iris1.plot(kind='scatter', x='sepal length (cm)', y='petal length (cm)', ax=ax, c='r', alpha=0.5)
plt.show()
Feature distribution of iris0 and iris1

It’s crystal clear that the two datasets are not similar; the red and blue features are distributed differently. We expect the KL divergence to be quite high, or at least higher than the previous one, as the snippet below shows.
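
The computation for this case follows the same pattern as Case 1; here iris1_2feat is a name introduced just to keep the same two features, so the shapes match iris0:

# Keep the same two features for class 1
iris1_2feat = iris1[['sepal length (cm)', 'petal length (cm)']]
KL_0_1_2feat = KL_div(iris0, iris1_2feat)
print('The value of KL divergence of two data from different classes: %.3f' % KL_0_1_2feat)
print('This should be bigger than previous result')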

Output:

The value of KL divergence of two data from different classes: 26.676
This should be bigger than previous result

Bingo, the result matches our expectations. Now let’s extend this example to all four features of our data.

Case 3: Same class, four features

# Take only class 0, all four features
iris0_4feat = iris_two.loc[(iris_two.target==0)]
# Divide them into two groups
iris0_4feat_1 = iris0_4feat.iloc[:25, :-1]
iris0_4feat_2 = iris0_4feat.iloc[25:, :-1]
print('All data shape:', iris0_4feat.shape)
print('Group 1 shape:', iris0_4feat_1.shape)
print('Group 2 shape:', iris0_4feat_2.shape)

Output:

All data shape: (50, 5) 
Group 1 shape: (25, 4)
Group 2 shape: (25, 4)

We expect the KL divergence to be small, because the two groups come from the same class.

KL_0_0_4feat = KL_div(iris0_4feat_1, iris0_4feat_2)
print('The value of KL divergence of two data from the same class: %.3f' % KL_0_0_4feat)
print('This should be quite small ~ 0')

Output:

The value of KL divergence of two data from the same class: 0.857
This should be quite small ~ 0

Case 4: Different classes, four features

# Take class 0 and 1 only
iris0_4feat = iris_two.loc[(iris_two.target==0)]
iris1_4feat = iris_two.loc[(iris_two.target==1)]
# Remove target column
iris0_4feat = iris0_4feat.iloc[:, :-1]
iris1_4feat = iris1_4feat.iloc[:, :-1]
print('Group 1 shape:',iris0_4feat.shape)
print('Group 2 shape:', iris1_4feat.shape)

Output:

Group 1 shape: (50, 4)
Group 2 shape: (50, 4)

Next, compute KL divergence

KL_0_1_4feat = KL_div(iris0_4feat, iris1_4feat)
print('The value of KL divergence of two data from different classes: %.3f' % KL_0_1_4feat)
print('This should be quite large >> 0')

Output:

The value of KL divergence of two data from different classes: 52.724
This should be quite large >> 0

Recap

Now here’s a recap of all similarity results:

print('KL divergence between two classes')
print('using two features: %.3f' % KL_0_1_2feat)
print('using four features: %.3f' % KL_0_1_4feat)
print('')
print('KL divergence of same class')
print('using two features: %.3f' % KL_0_0_2feat)
print('using four features: %.3f' % KL_0_0_4feat)

Output:

KL divergence between two classes
using two features: 26.676
using four features: 52.724

KL divergence of same class
using two features: 0.278
using four features: 0.857

This is the end of this blog; constructive feedback and suggestions are always welcome. I hope this example is of use to you. Have a good day!

Reference

  1. https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8
  2. https://stats.stackexchange.com/questions/257735/kl-divergence-between-two-bivariate-gaussian-distribution
  3. https://mr-easy.github.io/2020-04-16-kl-divergence-between-2-gaussian-distributions/
