Archive for the ‘theory’ Category

Lectures: Learning with Graphical Models
July 15, 2017
I’m giving a series of lectures this semester combining graphical models and some elements of nonparametric statistics. The intent is to build up to the theory of discrete matrix factorisation and its many variations. The lectures start on 27th July and are mostly given weekly. Weekly topics are given in the calendar.

On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process
July 30, 2016
Kar Wai Lim has just been told they “confirmed the approval” of his PhD (though it hasn’t been “conferred” yet, so he’s not officially a Dr. yet), and he spent the post-submission time pumping out journal and conference papers. Ahhh, the unencumbered life of the fresh PhD!
This one:
“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning, 78 (2016) 172–191.
includes what I believe is the world’s best tweet clusterer. It certainly blows away the state-of-the-art tweet pooling methods. The main issue is that the current implementation only scales to a million or so tweets, and not the 100 million or more expected in some communities. That is easily addressed with a bit of coding work.
We did this to demonstrate the rich possibilities for semantic hierarchies, largely unexplored, that one has using simple Gibbs sampling with Pitman–Yor processes. Lan Du (Monash) started this branch of research. I challenge anyone to do this particular model with variational algorithms 😉 The machine learning community in the last decade unfortunately got lost in the complexities of Chinese restaurant processes and stick-breaking representations, for which complex semantic hierarchies are, well, a bit of a headache!
What “50 Years of Data Science” Leaves Out
November 28, 2015
This blog post is from Sean Owen (Director, Data Science @ Cloudera, London):
I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.
Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics. It is a superset, yes, but an important superset.
Some additional comments:
- Donoho, like Breiman before him, splits Statistics/Machine Learning into Generative versus Predictive modelling. I never really understood this split, because nearly 40% of published ML is generative modelling, as is the majority of my own work.
- Other important aspects of Data Science we cover in our Monash course are:
- data governance and data provenance
- the business processes and “operationalisation” (putting the results to work to achieve value)
- getting data, fusing different kinds of data, envisaging data science projects
- These are above and beyond the area of Greater Data Science (Donoho, section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what the industry calls a data scientist.
Also, as a Machine Learning guy who’s been doing Computational Statistics for 25 years, I think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks to do in areas like robotics, natural language processing and image processing. These require statistical ingenuity, domain understanding, and computational trickery. I maintain important contacts in both the Statistics community and the NLP community so I can do my work.
Basic tutorial: Oldie but a goody …
November 7, 2015
A student reminded me of Gregor Heinrich‘s excellent introduction to topic modelling, including a great introduction to the underlying foundations like Dirichlet distributions and multinomials. Great reading for all students! See
- G. Heinrich, Parameter estimation for text analysis, Technical report, Fraunhofer IGD, 15 September 2009 at his publication page.
Some diversions
July 4, 2015
Quora‘s answers on What are good ways to insult a Bayesian statistician? Excellent. I’m insulted 😉
Great Dice Data: How Tech Skills Connect. “Machine learning” is there (case sensitive) under the Data Science cluster in light green.
Data scientist payscales: “machine learning” raises your expected salary but “MS SQL server” lowers it!
Cheat sheets: Machine Learning Cheat Sheet and the Probability Cheat Sheet. Very handy!
Dirichlet-multinomial distribution versus Pitman-Yor-multinomial
March 18, 2015
If you’re familiar with the Dirichlet distribution as the workhorse for discrete Bayesian modelling, then you should know about the Dirichlet-multinomial. This is what happens when you combine a Dirichlet with a multinomial and integrate/marginalise out the common probability vector.
Graphically, you have a probability vector $\vec{\mu}$ that is the mean for a probability vector $\vec{\theta}$. You then have data, a count vector $\vec{n}$ with total count $N$, drawn using a multinomial distribution with mean $\vec{\theta}$. If we integrate out the vector $\vec{\theta}$ then we get the Dirichlet-multinomial.
The standard formulation is to have:
$$\vec{\theta} \sim \mbox{Dirichlet}(\alpha\vec{\mu}), \qquad \vec{n} \sim \mbox{Multinomial}(\vec{\theta}, N).$$
Manipulating this, we get the distribution for the Dirichlet-multinomial:
$$p(\vec{n}\,|\,\alpha,\vec{\mu},N) = {N \choose \vec{n}} \frac{\mbox{Beta}(\alpha\vec{\mu}+\vec{n})}{\mbox{Beta}(\alpha\vec{\mu})} = {N \choose \vec{n}} \frac{\Gamma(\alpha)}{\Gamma(\alpha+N)} \prod_k \frac{\Gamma(\alpha\mu_k+n_k)}{\Gamma(\alpha\mu_k)}.$$
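This marginal is easy to evaluate in log space and to sanity-check numerically. A minimal sketch using only the Python standard library (the function and variable names are mine, not from any paper):

```python
from math import lgamma, exp

def log_dirichlet_multinomial(n, alpha, mu):
    """Log probability of count vector n under a Dirichlet-multinomial
    with concentration alpha and mean vector mu (entries sum to 1)."""
    N = sum(n)
    # multinomial coefficient: N! / prod(n_k!)
    log_p = lgamma(N + 1) - sum(lgamma(nk + 1) for nk in n)
    # ratio of Beta functions: Beta(alpha*mu + n) / Beta(alpha*mu)
    log_p += lgamma(alpha) - lgamma(alpha + N)
    log_p += sum(lgamma(alpha * mk + nk) - lgamma(alpha * mk)
                 for nk, mk in zip(n, mu))
    return log_p

# sanity check: over all count vectors with N = 2 and two categories,
# the probabilities must sum to 1
mu, alpha = [0.3, 0.7], 2.0
total = sum(exp(log_dirichlet_multinomial([i, 2 - i], alpha, mu))
            for i in range(3))
print(round(total, 10))  # 1.0
```

Working with `lgamma` avoids overflow for large counts, which matters as soon as real document-scale count vectors appear.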
Now we will rewrite this in the Pochhammer symbol notation we use for Pitman–Yor marginals:
- $(x)_N$ is the Pochhammer symbol (a generalised version, and a different notation, is used here), a form of rising factorial, so it is given by $(x)_N = x(x+1)\cdots(x+N-1)$; it can be computed using Gamma functions as $(x)_N = \Gamma(x+N)/\Gamma(x)$.
- $(x|y)_N$ is a related generalised rising factorial given by $(x|y)_N = \prod_{n=0}^{N-1}(x+ny)$, so the plain symbol corresponds to the special case $(x)_N = (x|1)_N$.
The Beta functions simplify to yield the Dirichlet-multinomial in Pochhammer symbol notation:
$$p(\vec{n}\,|\,\alpha,\vec{\mu},N) = {N \choose \vec{n}} \frac{\prod_k (\alpha\mu_k)_{n_k}}{(\alpha)_N}.$$
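Both rising-factorial identities can be checked in a few lines; a small sketch (the helper names are my own):

```python
from math import lgamma, exp, prod

def pochhammer(x, N):
    """Rising factorial (x)_N = x (x+1) ... (x+N-1)."""
    return prod(x + n for n in range(N))

def gen_pochhammer(x, y, N):
    """Generalised rising factorial (x|y)_N = prod_{n<N} (x + n*y)."""
    return prod(x + n * y for n in range(N))

x, N = 1.5, 4
# (x)_N can equivalently be computed with Gamma functions
assert abs(pochhammer(x, N) - exp(lgamma(x + N) - lgamma(x))) < 1e-9
# the plain Pochhammer symbol is the special case (x|1)_N
assert pochhammer(x, N) == gen_pochhammer(x, 1.0, N)
print(pochhammer(x, N))  # 1.5 * 2.5 * 3.5 * 4.5 = 59.0625
```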
Now let’s do the same with the Pitman–Yor process (PYP).
The derivation of the combination is more detailed, but it is found in the Buntine–Hutter ArXiv report or my tutorials. For this, you have to introduce a new latent count vector $\vec{t}$, where $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{\theta}$-node to the $\vec{\mu}$-node to convey information about the data. Keep this phrase in mind; it will be explained a bit later. These are constrained as follows:
- constraint: $t_k \le n_k$ for all $k$
- constraint: $t_k \ge 1$ whenever $n_k > 0$
- total: $N = \sum_k n_k$
- total: $T = \sum_k t_k$
With these, we get the PYP-multinomial in Pochhammer symbol notation (discount $d$, concentration $\alpha$):
$$p(\vec{n},\vec{t}\,|\,d,\alpha,\vec{\mu},N) = {N \choose \vec{n}} \frac{(\alpha|d)_T}{(\alpha)_N} \prod_k S^{n_k}_{t_k,d}\,\mu_k^{t_k}.$$
You can see the three main differences. With the PYP:
- the terms in $\vec{\mu}$ appear as simple powers $\mu_k^{t_k}$, just like a multinomial likelihood, so this form is readily used in a hierarchy;
- you now have to work with the generalised Stirling numbers $S^{n_k}_{t_k,d}$; and
- you have to introduce the new latent vector $\vec{t}$.
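One workable way to compute the generalised Stirling numbers is a triangular recurrence, $S^{n+1}_{t,d} = S^{n}_{t-1,d} + (n - td)\,S^{n}_{t,d}$, which reduces to the recurrence for unsigned Stirling numbers of the first kind at $d = 0$. A sketch under that recurrence (names are mine; this is not code from the MLOSS library):

```python
def stirling_table(nmax, d):
    """Table S[n][t] of generalised Stirling numbers for discount d,
    built from the recurrence S^{n+1}_t = S^n_{t-1} + (n - t*d) * S^n_t."""
    S = [[0.0] * (nmax + 1) for _ in range(nmax + 1)]
    S[0][0] = 1.0
    for n in range(nmax):
        for t in range(1, n + 2):
            S[n + 1][t] = S[n][t - 1] + (n - t * d) * S[n][t]
    return S

# with d = 0 these are the unsigned Stirling numbers of the first kind,
# whose row n = 4 is 6, 11, 6, 1
S = stirling_table(4, 0.0)
print(S[4][1], S[4][2], S[4][3], S[4][4])  # 6.0 11.0 6.0 1.0
```

In practice these numbers grow very quickly, so production code stores the table in log space.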
The key thing is that $\vec{\mu}$ only appears in the expression $\prod_k \mu_k^{t_k}$, which is a multinomial likelihood. So, as said earlier, $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{\theta}$-node to the $\vec{\mu}$-node. If $\vec{t} = \vec{n}$ then all the data is passed up, and the likelihood looks like $\prod_k \mu_k^{n_k}$, which is what you would get if $\vec{\theta} = \vec{\mu}$.
If we use a Dirichlet process (DP) rather than a PYP, the discount $d$ is zero and the only simplification is that $(\alpha|d)_T$ simplifies to $\alpha^T$. This small change means that the above formula can be rewritten as:
$$p(\vec{n},\vec{t}\,|\,\alpha,\vec{\mu},N) = {N \choose \vec{n}} \frac{1}{(\alpha)_N} \prod_k S^{n_k}_{t_k,0}\,(\alpha\mu_k)^{t_k}.$$
This has quite broad implications, as the $t_k$ are now independent variables! In fact, their posterior distribution takes the form:
$$p(t_k\,|\,n_k,\alpha\mu_k) \propto S^{n_k}_{t_k,0}\,(\alpha\mu_k)^{t_k}.$$
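Because the DP form factorises, the posterior for a single $t_k$ given its count $n_k$ can be normalised by direct enumeration. A small sketch for the $d = 0$ case, proportional to $S^{n}_{t,0}(\alpha\mu)^{t}$ (all names are mine; real implementations work in log space):

```python
def t_posterior(n, alpha_mu):
    """Posterior p(t | n) proportional to S^n_{t,0} * (alpha*mu)^t
    for a single count n at one DP node."""
    # unsigned Stirling numbers of the first kind (the d = 0 case),
    # built by the recurrence S^{m+1}_t = S^m_{t-1} + m * S^m_t
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    S[0][0] = 1.0
    for m in range(n):
        for t in range(1, m + 2):
            S[m + 1][t] = S[m][t - 1] + m * S[m][t]
    weights = [S[n][t] * alpha_mu ** t for t in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

post = t_posterior(3, 0.5)
print([round(p, 4) for p in post])  # [0.0, 0.5333, 0.4, 0.0667]
```

Note that $t = 0$ correctly gets zero posterior mass whenever $n > 0$, matching the constraint above.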
The penalty for using the PYP or DP is that you now have to work with the generalised Stirling numbers and the latent vector $\vec{t}$. With the right software, the Stirling numbers are easy to handle. While my students have built their own Java packages for that, I use a C library available from MLOSS. The latent vectors at each node require different sampling algorithms.
Now one final note: this isn’t how we implement these. Sampling the full range of $t_k$ is too slow … there are too many possibilities. Moreover, if sampling $t_k$ on its own, you ignore the remainder of the hierarchy. For fast mixing of Markov chains, you want to sample a cluster of related variables. The hierarchical CRP does this implicitly as it resamples and samples up and down the parent hierarchy. So to achieve this same effect using the marginalised posteriors above, we have to Booleanise (turn into Booleans) the $t_k$’s and sample a cluster of related Booleans up and down the hierarchy. We figured out how to do this in 2011; the paper is at ECML.
How many clusters?
February 20, 2015
Sometimes people think that a Dirichlet Process (DP) can be used to pick the “right number of clusters”. The following plots, done by Monash Matlab whiz Mark Carman, show that this has to be done very carefully.
Given $N$ samples, Mark’s first plot shows the expected number of clusters, $K$, that one would get with a DP using concentration parameter $\alpha$. The thing to notice is that the number of clusters is moderately well determined by the concentration parameter. In fact the mean (expected value) of $K$ is given by:
$$\mathbb{E}[K] = \alpha\left(\psi_0(\alpha+N) - \psi_0(\alpha)\right),$$
where $\psi_0$ is the digamma function; for details see the ArXiv report by Marcus Hutter and me, Section 5.3. Moreover, the standard deviation of $K$ is approximately the square root of the mean (for larger concentrations).
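The mean formula is cheap to evaluate. A sketch with a hand-rolled, stdlib-only digamma (the asymptotic-series implementation is my own, not from the report); with $\alpha = 1$ the mean reduces to the $N$-th harmonic number, which gives a check:

```python
from math import log

def digamma(x):
    """Digamma function via upward recurrence plus a truncated
    asymptotic series, accurate to roughly 1e-9 for x > 0."""
    r = 0.0
    while x < 6.0:           # shift x up to where the series is accurate
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def expected_clusters(alpha, N):
    """E[K] = alpha * (psi(alpha + N) - psi(alpha)) for a DP
    with concentration alpha after N samples."""
    return alpha * (digamma(alpha + N) - digamma(alpha))

# with alpha = 1 this is the N-th harmonic number; the two printed
# values agree to about six decimal places
N = 100
print(round(expected_clusters(1.0, N), 6))
print(round(sum(1.0 / i for i in range(1, N + 1)), 6))
```

You can see directly that the growth in $N$ is only logarithmic, which is why a fixed concentration pins the cluster count so tightly.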
So fixing a given concentration parameter means roughly fixing the number of clusters, which increases as the sample size grows. So with a fixed concentration parameter you cannot really “estimate” the right number of clusters … roughly you are fixing the value ahead of time when setting the concentration.
Mark’s second plot shows what we do to overcome this. We have to estimate the concentration parameter as well. So if we put a prior distribution, an Exponential with rate parameter $\lambda$, on the concentration, we now smear out the previous plots. So now we show plots for different values of $\lambda$. As you can see, these plots have a much higher variance, which is what you want. With a given $\lambda$, you are still determining the broad range of the number of clusters, but you have a lot more latitude.
In implementation, this means we estimate the concentration (usually) by sampling it. If we use Chinese restaurant processes, there is a simple auxiliary variable sampling formula for the concentration (presented in the original HDP paper by Teh et al.). If we use our blocked table indicator sampling, the posterior on the concentration is log-concave, so we can use either slice sampling or adaptive rejection sampling (ARS). The implementation is moderately simple, and it works well. However, it does mean your algorithm will now expend more time, as it has to find the right concentration as well.
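For the CRP-style auxiliary-variable update mentioned above, the classic single-DP version (Escobar and West style; the scheme in the HDP paper is a generalisation) is simple to sketch. Everything here, the Gamma(a, b) prior and all function names, is my own illustration rather than any specific implementation:

```python
import math
import random

def sample_concentration(alpha, K, N, a=1.0, b=1.0, rng=random):
    """One Escobar-West auxiliary-variable update for the concentration
    of a single DP, under a Gamma(a, b) prior, given K current clusters
    and N data points."""
    eta = rng.betavariate(alpha + 1.0, N)    # auxiliary variable
    rate = b - math.log(eta)
    odds = (a + K - 1.0) / (N * rate)        # mixture weight odds
    pi = odds / (1.0 + odds)
    shape = a + K if rng.random() < pi else a + K - 1.0
    # random.gammavariate takes (shape, scale), so scale = 1 / rate
    return rng.gammavariate(shape, 1.0 / rate)

# run a short chain of updates with K and N held fixed for illustration
random.seed(0)
alpha, draws = 1.0, []
for _ in range(2000):
    alpha = sample_concentration(alpha, 10, 1000)
    draws.append(alpha)
print(round(sum(draws) / len(draws), 2))  # posterior mean of the concentration
```

Each update draws an auxiliary Beta variable and then samples the concentration from a two-component Gamma mixture; in a full Gibbs sampler this would be interleaved with resampling the cluster assignments, so K would change between updates.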