How many clusters?

February 20, 2015

[Figure PYP0a_2: Mark's first plot, the expected number of clusters under a DP]

Sometimes people think that a Dirichlet Process (DP) can be used to pick the “right number of clusters”.  The following plots done by Monash Matlab whiz Mark Carman show that this has to be done very carefully.

Given N samples, Mark’s first plot shows the expected number of clusters, M, that one would get with a DP using concentration parameter \alpha.  The thing to notice is that the number of clusters is moderately well determined by the concentration parameter.  In fact the mean (expected value) of M is given by:

\alpha \left( \psi(\alpha+N) - \psi(\alpha)\right)

where \psi(\cdot) is the digamma function; for details, see Section 5.3 of the arXiv report by Marcus Hutter and me.  Moreover, the standard deviation of M is approximately the square root of the mean (for larger concentrations).
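
For the curious, here is a quick numerical check of that formula; this is my own sketch using SciPy's digamma, with arbitrary illustrative values of \alpha and N.

```python
# Sketch: mean and approximate sd of the number of clusters M under a DP.
# E[M] = alpha * (digamma(alpha + N) - digamma(alpha)); sd(M) ~ sqrt(E[M]).
from math import sqrt
from scipy.special import digamma

def expected_clusters(alpha, N):
    """Expected number of clusters for N samples from a DP with concentration alpha."""
    return alpha * (digamma(alpha + N) - digamma(alpha))

alpha, N = 5.0, 10_000          # arbitrary illustrative values
mean_M = expected_clusters(alpha, N)
print(f"E[M] = {mean_M:.1f}, sd(M) ~ {sqrt(mean_M):.1f}")
```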

So fixing the concentration parameter roughly fixes the number of clusters, which then grows with the sample size.  With a fixed concentration parameter you cannot really “estimate” the right number of clusters: roughly speaking, you are fixing that value ahead of time when you set the concentration.
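
To make that concrete, here is a small sketch of mine using the same formula: with \alpha held fixed, E[M] keeps growing with N while its relative spread shrinks, so the cluster count is largely decided in advance. The values are arbitrary.

```python
# Sketch: with alpha fixed, E[M] keeps growing (roughly like alpha * log N)
# while the relative spread sd/mean shrinks, so M is largely set in advance.
from math import sqrt
from scipy.special import digamma

alpha = 5.0                      # arbitrary illustrative value
for N in (100, 1_000, 10_000, 100_000):
    m = alpha * (digamma(alpha + N) - digamma(alpha))
    print(f"N={N:>6}: E[M]={m:6.1f}, sd/mean={sqrt(m) / m:.2f}")
```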

[Figure PYP0e_2: Mark's second plot, with an Exponential(\epsilon) prior on the concentration]

Mark’s second plot shows what we do to overcome this: we estimate the concentration parameter as well.  If we put a prior distribution, an Exponential with parameter \epsilon, on the concentration, we smear out the previous plots; the second figure shows the result for different values of \epsilon.  As you can see, these curves have a much higher variance, which is what you want.  With a given \epsilon you are still determining the broad range of the number of clusters, but you have a lot more latitude.
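
A rough way to see the smearing numerically is to draw \alpha from the Exponential prior and look at the induced spread of the expected cluster count. The sketch below is mine, not the code behind Mark's plots, and it assumes \epsilon is the rate of the Exponential.

```python
# Sketch: put an Exponential prior on alpha and look at the induced spread
# in the expected cluster count.  Here epsilon is treated as the *rate* of
# the Exponential (an assumption).
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
N, epsilon = 10_000, 0.1
alphas = rng.exponential(scale=1.0 / epsilon, size=100_000)   # alpha ~ Exp(epsilon)
means = alphas * (digamma(alphas + N) - digamma(alphas))      # E[M | alpha]
# The spread of E[M | alpha] alone already dwarfs the sqrt(mean) spread
# you would get with alpha held fixed.
print(f"mean E[M] = {means.mean():.1f}, sd across the prior = {means.std():.1f}")
```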

In implementation, this means we estimate the concentration \alpha, usually by sampling it.  If we use Chinese restaurant processes, there is a simple auxiliary variable sampling formula for the concentration (presented in the original HDP paper by Teh et al.).  If we use our blocked table indicator sampling, the posterior on the concentration is log-concave, so we can use either slice sampling or adaptive rejection sampling (ARS).  The implementation is moderately simple, and it works well.  However, it does mean your algorithm will now spend more time, since it also has to find the right concentration.
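
For concreteness, here is a sketch of the classic Escobar-West auxiliary variable update for a single DP's concentration under a Gamma(a, b) prior (an Exponential(\epsilon) prior is the case a = 1, b = \epsilon); the scheme in the HDP paper extends this style of update to the hierarchical setting. The function and the sampler state below are hypothetical.

```python
# Sketch: Escobar-West auxiliary variable Gibbs update for the DP concentration,
# assuming alpha ~ Gamma(a, b) with shape a and rate b (Exponential(eps): a=1, b=eps).
import numpy as np

def resample_alpha(alpha, n, k, a=1.0, b=0.1, rng=None):
    """One update of alpha, given n observations currently occupying k tables."""
    rng = rng if rng is not None else np.random.default_rng()
    eta = rng.beta(alpha + 1.0, n)                     # auxiliary variable
    odds = (a + k - 1.0) / (n * (b - np.log(eta)))     # pi_eta / (1 - pi_eta)
    shape = a + k if rng.random() < odds / (1.0 + odds) else a + k - 1.0
    return rng.gamma(shape, 1.0 / (b - np.log(eta)))   # NumPy's gamma takes a scale

# Hypothetical sampler state: 10,000 observations at 60 tables.
rng = np.random.default_rng(0)
alpha = 1.0
for _ in range(50):
    alpha = resample_alpha(alpha, n=10_000, k=60, rng=rng)
print(f"sampled concentration: {alpha:.2f}")
```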
