## Lectures: Learning with Graphical Models

July 15, 2017

I’m giving a series of lectures this semester combining graphical models and some elements of nonparametric statistics.  The intent is to build up to the theory of discrete matrix factorisation and its many variations.  The lectures start on 27th July and run roughly weekly; topics are listed in the calendar.

## On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process

July 30, 2016

Kar Wai Lim has just been told that the approval of his PhD has been confirmed (though the degree hasn’t yet been conferred, so he’s not officially a Dr.), and he spent the time post-submission pumping out journal and conference papers.  Ahhh, the unencumbered life of the fresh PhD!

This one:

“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning 78 (2016) 172–191.

includes what I believe is the world’s best tweet clusterer.  It certainly blows away the state-of-the-art tweet-pooling methods.  The main issue is that the current implementation only scales to a million or so tweets, not the 100 million or so expected in some communities.  That is easily addressed with a bit of coding work.

We did this to demonstrate the rich, largely unexplored possibilities for semantic hierarchies using simple Gibbs sampling with Pitman–Yor processes.   Lan Du (Monash) started this branch of research.  I challenge anyone to do this particular model with variational algorithms 😉   The machine learning community in the last decade unfortunately got lost in the complexities of Chinese restaurant processes and stick-breaking representations, for which complex semantic hierarchies are, well, a bit of a headache!

## What “50 Years of Data Science” Leaves Out

November 28, 2015

This blog post is from Sean Owen, Director of Data Science @Cloudera / London:

I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.

Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics; it is a superset, yes, but an important superset.

• Donoho, like Breiman before him, splits Statistics/Machine Learning into generative versus predictive modelling.  I never really understood this split, because nearly 40% of published ML is generative modelling, as is the majority of my work.
• Other important aspects of Data Science we cover in our Monash course are:
• data governance and data provenance
• the business processes and “operationalisation” (putting the results to work to achieve value)
• getting data, fusing different kinds of data, envisaging data science projects
• These are above and beyond the area of Greater Data Science (Donoho, Section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what industry calls a data scientist.

Also, as a Machine Learning guy who’s been doing Computational Statistics for 25 years, I think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks in areas like robotics, natural language processing and image processing.  These require statistical ingenuity, domain understanding, and computational trickery.   I maintain contacts in both the Statistics and NLP communities so I can do my work.

## Basic tutorial: Oldie but a goody …

November 7, 2015

A student reminded me of Gregor Heinrich’s excellent introduction to topic modelling, including a great introduction to the underlying foundations like Dirichlet distributions and multinomials.  Great reading for all students!  See

• G. Heinrich, Parameter estimation for text analysis, Technical report, Fraunhofer IGD, 15 September 2009, available at his publication page.

## Some diversions

July 4, 2015

Quora’s answers on “What are good ways to insult a Bayesian statistician?”   Excellent.  I’m insulted 😉

Great Dice Data: How Tech Skills Connect.  “Machine learning” is there (case sensitive) under the Data Science cluster in light green.

Data scientist payscales: “machine learning” raises your expected salary but “MS SQL server” lowers it!

Cheat sheets: the Machine Learning Cheat Sheet and the Probability Cheat Sheet.   Very handy!

## Dirichlet-multinomial distribution versus Pitman-Yor-multinomial

March 18, 2015

If you’re familiar with the Dirichlet distribution as the workhorse for discrete Bayesian modelling, then you should know about the Dirichlet-multinomial.  This is what happens when you combine a Dirichlet with a multinomial and integrate/marginalise out the common probability vector.

Graphically, you have a probability vector $\vec\theta$ that is the mean for a probability vector $\vec{p}$.  You then have data, a count vector $\vec{n}$, drawn using a multinomial distribution with mean $\vec{p}$. If we integrate out the vector $\vec{p}$ then we get the Dirichlet-multinomial.

The standard formulation is to have:
$\vec{p} \sim \mbox{Dirichlet}\left(\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

Manipulating this, we get the distribution for the Dirichlet-multinomial:
$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N, \mbox{\small Dirichlet-multi.}\right)$
$~~~~~~~~~~~~=~ \frac{1}{\mbox{\small Beta}(\alpha \vec\theta)}{N \choose \vec{n} } \int_{\mbox{\scriptsize simplex}} \prod_k p_{k}^{n_k+\alpha \theta_k-1} \mbox{d}\vec{p}$
$~~~~~~~~~~~~=~ {N \choose \vec{n} } \frac{\mbox{\small Beta}(\vec{n}+\alpha \vec\theta)}{\mbox{\small Beta}(\alpha \vec\theta)}$

Now we will rewrite this in the Pochhammer symbol notation we use for Pitman–Yor marginals, where:

• $(c|d)_{T}$ is the Pochhammer symbol (a generalised version is used, and a non-standard notation): a form of rising factorial with increment $d$, given by $c (c+d) (c+2d) ... (c+(T-1)d)$; it can be computed using Gamma functions.
• $(c)_{T}$ is the special case with unit increment, the ordinary rising factorial $c (c+1) (c+2) ... (c+(T-1))$, so it corresponds to $(c|1)_{T}$.
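As a quick illustration of the notation, the Pochhammer symbol is straightforward to compute directly.  A minimal Python sketch (the function name is mine):

```python
def pochhammer(c, d, T):
    """Generalised rising factorial: c * (c+d) * (c+2d) * ... * (c+(T-1)d)."""
    out = 1.0
    for i in range(T):
        out *= c + i * d
    return out

# (c)_T is the unit-increment special case (c|1)_T
standard = pochhammer(2.0, 1.0, 3)   # 2 * 3 * 4 = 24
general  = pochhammer(2.0, 0.5, 3)   # 2 * 2.5 * 3 = 15
```

For large $T$ you would work in log space (or via log-Gamma functions) rather than with this direct product, to avoid overflow.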

The Beta functions simplify to yield the Dirichlet-multinomial in Pochhammer symbol notation:

$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small Dirichlet-multi.}\right) ~=~ {N \choose \vec{n} } \frac{1}{(\alpha)_N} \prod_{k=1}^K (\alpha\theta_k)_{n_k}$
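To convince yourself the two forms agree, here is a small Python sketch (helper names are mine) that evaluates the Dirichlet-multinomial both via the Beta-function ratio and via the Pochhammer form, and checks that the probabilities sum to one over all count vectors of a small total:

```python
from itertools import product
from math import lgamma, exp

def log_beta(v):
    # multivariate Beta function: prod_k Gamma(v_k) / Gamma(sum_k v_k)
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def pochhammer(c, d, T):
    # generalised rising factorial c (c+d) ... (c+(T-1)d)
    out = 1.0
    for i in range(T):
        out *= c + i * d
    return out

def dirmult_beta_form(n, alpha, theta):
    # N-choose-n * Beta(n + alpha*theta) / Beta(alpha*theta)
    a = [alpha * t for t in theta]
    coeff = lgamma(sum(n) + 1) - sum(lgamma(c + 1) for c in n)
    return exp(coeff + log_beta([c + x for c, x in zip(n, a)]) - log_beta(a))

def dirmult_pochhammer_form(n, alpha, theta):
    # N-choose-n * prod_k (alpha*theta_k)_{n_k} / (alpha)_N
    N = sum(n)
    coeff = exp(lgamma(N + 1) - sum(lgamma(c + 1) for c in n))
    prod = 1.0
    for c, t in zip(n, theta):
        prod *= pochhammer(alpha * t, 1.0, c)
    return coeff * prod / pochhammer(alpha, 1.0, N)

N, alpha, theta = 4, 2.0, [0.5, 0.3, 0.2]
counts = [n for n in product(range(N + 1), repeat=3) if sum(n) == N]
total = sum(dirmult_pochhammer_form(n, alpha, theta) for n in counts)
match = all(abs(dirmult_beta_form(n, alpha, theta)
                - dirmult_pochhammer_form(n, alpha, theta)) < 1e-12
            for n in counts)
```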

Now let’s do the same with the Pitman-Yor process (PYP).
$\vec{p} \sim \mbox{PYP}\left(d,\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

The derivation of the combination is more detailed, but it can be found in the Buntine–Hutter arXiv report or my tutorials. For this, you have to introduce a new latent vector $\vec{t}$, where $t_k$ represents the subset of the count $n_k$ that is passed up from the $\vec{p}$-node to the $\vec{\theta}$-node to convey information about the data $\vec{n}$.  Keep this phrase in mind; it will be explained a bit later. These $\vec{t}$ are constrained as follows:

• constraint: $t_k \le n_k$
• constraint: $t_k > 0$ whenever $n_k > 0$
• totals: $N=\sum_k n_k$ and $T=\sum_k t_k$

With these, we get the PYP-multinomial in Pochhammer symbol notation:

$p\left( \vec{n},\vec{t}\,\middle\vert\, d,\alpha,\vec\theta,N,\mbox{\small PYP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{(\alpha|d)_T}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, d}^{n_k} \theta_k^{t_k}$

You can see the three main differences. With the PYP:

1. the terms in $\vec\theta$ appear as simple powers, just like a multinomial likelihood, so this form is readily used in a hierarchy;
2. you now have to work with the generalised Stirling numbers $\mathcal{S}_{t_k, d}^{n_k}$; and
3. you have to introduce the new latent vector $\vec{t}$.
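The generalised Stirling numbers themselves can be tabulated with the simple recursion $\mathcal{S}_{t, d}^{n+1} = \mathcal{S}_{t-1, d}^{n} + (n - td)\,\mathcal{S}_{t, d}^{n}$ (the form used in the Buntine–Hutter report; a production implementation works in log space to avoid overflow).  A minimal Python sketch, with names of my own choosing:

```python
def stirling_table(n_max, d):
    # S[n][t] holds the generalised Stirling number S^n_{t,d},
    # built up from S^0_{0,d} = 1 via the recursion
    # S^{n+1}_{t,d} = S^n_{t-1,d} + (n - t*d) * S^n_{t,d}.
    S = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1.0
    for n in range(n_max):
        for t in range(n_max + 1):
            val = (n - t * d) * S[n][t]
            if t > 0:
                val += S[n][t - 1]
            S[n + 1][t] = val
    return S

# d = 0 recovers the unsigned Stirling numbers of the first kind:
# row n = 3 should read 0, 2, 3, 1
S = stirling_table(3, 0.0)
```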

The key thing is that $\vec\theta$ only appears in the expression $\prod_{k=1}^K \theta_k^{t_k}$ which is a multinomial likelihood.   So, as said earlier, $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{p}$-node up to the $\vec{\theta}$-node. If $\vec{t}=\vec{n}$ then all the data is passed up, and the likelihood looks like $\prod_{k=1}^K \theta_k^{n_k}$, which is what you would get if $\vec{p}=\vec{\theta}$.

If we use a Dirichlet process (DP) rather than a PYP, the only simplification is that $(\alpha|d)_T$ simplifies to $\alpha^T$.  This small change means that the above formula can be rewritten as:

$p\left( \vec{n},\vec{t}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small DP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{1}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, 0}^{n_k}(\alpha\theta_k)^{t_k}$

This has quite broad implications as the $t_k$ are now independent variables!  In fact, their posterior distribution takes the form:

$p\left( t_k\,\middle\vert\, n_k,\alpha,\theta_k\right)~=~ \mathcal{S}_{t_k, 0}^{n_k} \frac{(\alpha\theta_k)^{t_k}}{ (\alpha\theta_k)_{n_k} }$
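A quick numerical check of this posterior (a sketch under my own naming): by the identity $\sum_t \mathcal{S}_{t, 0}^{n} x^t = (x)_n$, the probabilities over $t_k$ sum to one.

```python
def stirling_table(n_max, d):
    # generalised Stirling numbers S^n_{t,d} via the standard recursion
    S = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1.0
    for n in range(n_max):
        for t in range(n_max + 1):
            val = (n - t * d) * S[n][t]
            if t > 0:
                val += S[n][t - 1]
            S[n + 1][t] = val
    return S

def rising(c, T):
    # ordinary rising factorial (c)_T = c (c+1) ... (c+T-1)
    out = 1.0
    for i in range(T):
        out *= c + i
    return out

# posterior over t_k for a DP node with count n_k and base weight alpha*theta_k
n_k, alpha_theta = 6, 0.7
S = stirling_table(n_k, 0.0)
posterior = [S[n_k][t] * alpha_theta ** t / rising(alpha_theta, n_k)
             for t in range(n_k + 1)]
total = sum(posterior)
```

Note that $t_k = 0$ gets zero probability whenever $n_k > 0$, matching the constraint above.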

The penalty for using the PYP or DP is that you now have to work with the generalised Stirling numbers and the latent vector $\vec{t}$.  With the right software, the Stirling numbers are easy to handle.  While my students have built their own Java packages for that, I use a C library available from MLOSS.  The latent vectors $\vec{t}$ at each node require different sampling algorithms.

Now one final note: this isn’t how we implement these.  Sampling the full range of $\vec{t}$ is too slow … there are too many possibilities.  Moreover, when sampling $\vec{t}$, you ignore the remainder of the hierarchy.  For fast mixing of Markov chains, you want to sample a cluster of related variables.  The hierarchical CRP does this implicitly as it resamples and samples up and down the parent hierarchy.  So to achieve the same effect using the marginalised posteriors above, we have to Booleanise (turn into Booleans) the $\vec{t}$‘s and sample a cluster of related Booleans up and down the hierarchy.  We figured out how to do this in 2011; the paper is at ECML.

## How many clusters?

February 20, 2015

Sometimes people think that a Dirichlet Process (DP) can be used to pick the “right number of clusters”.  The following plots, done by Monash Matlab whiz Mark Carman, show that this has to be done very carefully.

Given $N$ samples, Mark’s first plot shows the expected number of clusters, $M$, that one would get with a DP using concentration parameter $\alpha$.  The thing to notice is that the number of clusters is moderately well determined by the concentration parameter.  In fact the mean (expected value) of $M$ is given by:

$\alpha \left( \psi(\alpha+N) - \psi(\alpha)\right)$

where $\psi(\cdot)$ is the digamma function; for details see the arXiv report by Marcus Hutter and me, Section 5.3.  Moreover, the standard deviation of $M$ is approximately the square root of the mean (for larger concentrations).
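Since $\psi(x+1) = \psi(x) + 1/x$, the expectation telescopes into a finite sum, which makes it easy to compute without any special-function library.  A minimal sketch (the function name is mine):

```python
def expected_clusters(alpha, N):
    # E[M] = alpha * (psi(alpha + N) - psi(alpha))
    #      = sum_{i=0}^{N-1} alpha / (alpha + i)   (telescoping the digamma)
    return sum(alpha / (alpha + i) for i in range(N))

# the expected number of clusters grows (slowly) with the sample size
small = expected_clusters(1.0, 100)
large = expected_clusters(1.0, 10000)
```

For $\alpha = 1$ this is just the harmonic number $H_N$, which shows the familiar logarithmic growth in $N$.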

So fixing a given concentration parameter roughly fixes the number of clusters, which increases as the sample size grows.  So with a fixed concentration parameter you cannot really “estimate” the right number of clusters; roughly, you are fixing the value ahead of time when setting the concentration.

Mark’s second plot shows what we do to overcome this: we have to estimate the concentration parameter as well.  If we put a prior distribution on the concentration, an Exponential with parameter $\epsilon$, we smear out the previous plots, so now we show plots for different values of $\epsilon$.  As you can see, these plots have a much higher variance, which is what you want.  With a given $\epsilon$, you are still determining the broad range of the number of clusters, but you have a lot more latitude.

In implementation, this means we estimate the concentration $\alpha$, usually by sampling it.  If we use Chinese restaurant processes, there is a simple auxiliary-variable sampling formula for the concentration (presented in the original HDP paper by Teh et al.).  If we use our blocked table-indicator sampling, the posterior on the concentration is log-concave, so we can use either slice sampling or adaptive rejection sampling (ARS).  The implementation is moderately simple, and it works well.  However, it does mean your algorithm will spend more time, as it has to find the right concentration as well.
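For reference, the auxiliary-variable update for a single DP (the Escobar–West scheme that the HDP paper generalises) is only a few lines.  This is a hedged sketch assuming a Gamma(a, b) prior on the concentration, not the exact code behind the plots:

```python
import math
import random

def resample_concentration(alpha, M, N, a=1.0, b=1.0, rng=random):
    """One auxiliary-variable Gibbs step for a DP concentration
    (Escobar & West 1995 scheme), with a Gamma(shape=a, rate=b) prior.
    M = current number of clusters, N = number of data points."""
    eta = rng.betavariate(alpha + 1.0, N)          # auxiliary variable
    rate = b - math.log(eta)
    odds = (a + M - 1.0) / (N * rate)              # mixture-component odds
    shape = a + M if rng.random() * (1.0 + odds) < odds else a + M - 1.0
    return rng.gammavariate(shape, 1.0 / rate)     # gammavariate takes scale

# run a short chain with fixed cluster statistics
rng = random.Random(0)
alpha = 1.0
for _ in range(100):
    alpha = resample_concentration(alpha, M=20, N=1000, rng=rng)
```

Note `random.gammavariate` is parameterised by shape and scale, hence the `1.0 / rate`.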