Archive for the ‘theory’ Category


Notes on Determinantal Point Processes

September 11, 2017

I’m giving a tutorial on these amazing processes while in Moscow.  The source “book” for this is of course Alex Kulesza and Ben Taskar’s, “Determinantal Point Processes for Machine Learning”, Foundations and Trends® in Machine Learning: Vol. 5: No. 2–3, pp 123-286, 2012.

If you have an undergraduate degree in mathematics with loads of multi-linear algebra and real analysis, this stuff really is music for the mind.  The connections and results are very cool.  In my view these guys don’t spend enough time in their intro on Gram matrices, which really are the starting point for everything.  In their online video tutorials they got this right, and led with these results.

There are also a few interesting connections they didn’t mention.  Anyway, I did some additional lecture notes to give some of the key results mentioned in the long article and elsewhere that didn’t make their tutorial slides.
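To make the Gram matrix starting point concrete, here is a minimal sketch in plain Python of an L-ensemble, where the probability of selecting a subset Y is det(L_Y) / det(L + I) and L is the Gram matrix of the item feature vectors.  The feature vectors and function names are made up for illustration and are not taken from the Kulesza and Taskar text.

```python
from itertools import combinations


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    result = 1.0  # the determinant of the empty (0 x 0) matrix is 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # treat a numerically zero pivot as a singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def gram(vectors):
    """Gram matrix of inner products between feature vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


def dpp_prob(L, subset):
    """L-ensemble probability of exactly `subset`: det(L_Y) / det(L + I)."""
    n = len(L)
    sub = [[L[i][j] for j in subset] for i in subset]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(sub) / det(L_plus_I)


# Hypothetical items: 0 and 1 are nearly parallel, 2 is orthogonal to 0.
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
L = gram(features)

all_subsets = [s for k in range(len(L) + 1)
               for s in combinations(range(len(L)), k)]
total = sum(dpp_prob(L, s) for s in all_subsets)

for s in all_subsets:
    print(s, round(dpp_prob(L, s), 4))
print("total probability:", total)
```

The similar pair (0, 1) gets a much smaller probability than the dissimilar pair (0, 2), because the determinant measures the squared volume spanned by the selected feature vectors.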


Advanced Methodologies for Bayesian Networks

August 22, 2017

The 3rd Workshop on Advanced Methodologies for Bayesian Networks is being run in Kyoto September 20-22, 2017.  I’ll be talking about our (with François Petitjean, Nayyar Zaidi and Geoff Webb) recent work with Bayesian Network Classifiers:

Backoff methods for estimating parameters of a Bayesian network

Various authors have highlighted inadequacies of BDeu type scores and this problem is shared in parameter estimation. Basically, Laplace estimates work poorly, at least because setting the prior concentration is challenging. In 1997, Friedman et al. suggested a simple backoff approach for Bayesian network classifiers (BNCs). Backoff methods dominate in n-gram language models, with modified Kneser-Ney smoothing being the best known, and a Bayesian variant exists in the form of Pitman-Yor process language models from Teh in 2006. In this talk we will present some results on using backoff methods for Bayesian network classifiers and Bayesian networks generally. For BNCs at least, the improvements are dramatic and alleviate some of the issues of choosing too dense a network.

It’s built on the amazing Chordalysis system of François Petitjean, and the code is available as HierarchicalDirichletProcessEstimation.  Boy, Nayyar and François really can do good empirical work!
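For readers who have not met backoff before, here is a minimal sketch contrasting a Laplace estimate with a one-level backoff estimate that shrinks a sparse, context-specific estimate towards the context-free marginal.  This is only an illustration of the general idea, not the hierarchical method in the released code; the toy data, the `strength` parameter and the function names are all assumptions.

```python
from collections import Counter


def laplace_estimate(count, total, num_values, alpha=1.0):
    """Additive (Laplace) smoothing towards the uniform distribution."""
    return (count + alpha) / (total + alpha * num_values)


def backoff_estimate(count, total, parent_prob, strength=5.0):
    """Shrink the context-specific estimate towards a less specific one."""
    return (count + strength * parent_prob) / (total + strength)


# Toy data: (context, value) pairs; context "b" is seen only once.
data = [("a", "yes")] * 8 + [("a", "no")] * 2 + [("b", "yes")]
values = ["yes", "no"]

pair_counts = Counter(data)
context_counts = Counter(ctx for ctx, _ in data)
value_counts = Counter(val for _, val in data)

# Top of the backoff chain: a smoothed marginal that ignores the context.
marginal = {v: laplace_estimate(value_counts[v], len(data), len(values))
            for v in values}


def estimates(context):
    """Return (laplace, backoff) distributions over values for one context."""
    total = context_counts[context]
    laplace = {v: laplace_estimate(pair_counts[(context, v)], total, len(values))
               for v in values}
    backoff = {v: backoff_estimate(pair_counts[(context, v)], total, marginal[v])
               for v in values}
    return laplace, backoff


for ctx in ["a", "b", "unseen"]:
    laplace, backoff = estimates(ctx)
    print(ctx, "laplace:", laplace, "backoff:", backoff)
```

With little or no data in a context, the Laplace estimate falls back to uniform, while the backoff estimate falls back to what the rest of the data says, which is usually the more sensible default.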


Lectures: Learning with Graphical Models

July 15, 2017



I’m giving a series of lectures this semester combining graphical models and some elements of nonparametric statistics.  The intent is to build up to the theory of discrete matrix factorisation and its many variations. The lectures start on 27th July and are mostly given weekly.  Weekly details are given in the calendar too.  The slides are on the Monash share drive under “Wray’s Slides” so if you are at Monash, do a search on Google drive to find them.  If you cannot find them, email me for access.

Motivating Probability and Decision Models, Lecture 1, 27/07/17, Wray Buntine
This is an introduction to the motivation for using Bayesian methods, these days called “full probability modelling” by the cognoscenti, to avoid prior cultish associations and implications. We will look at modelling, causality, probability as frequency, and axiomatic underpinnings for reasoning, decisions, and belief. The importance of priors and computation forms the basis of this.

No lectures 03/08 (writing for ACML) and 10/08 (attending ICML).

Information and working with Independence, Lecture 2, 17/08/17, Wray Buntine
This will continue with information (entropy) left over from the previous lecture.  Then we will look at the definition of independence and some independence models, including its relationship with causality.  Basic directed and undirected models will be introduced.  Some example problems will be presented (simply) to tie these together:  simple bucket search, bandits, graph colouring and causal reasoning.

Directed and Undirected Independence Models, Lecture 3, 31/08/17, Wray Buntine
We will develop the basic properties and results for directed and undirected graphical models.  This includes testing for independence, developing the corresponding functional form, and understanding probability operations such as marginalising and conditioning.  To complete this section, we will also investigate operations on clique trees, to illustrate the principles.  We will not do full graphical model inference.

Basic Distributions and Poisson Processes, Lecture 4, 07/09/17, Wray Buntine
We review the standard discrete distributions, relationships, properties and conjugate distributions.  This includes deriving the Poisson distribution as an infinitely divisible distribution on natural numbers with a fixed rate.  Then we introduce Poisson point processes as a model of stochastic processes.  We show how they behave in both the discrete and continuous case, and how they have both constructive and axiomatic definitions.  The same definitions can be extended to any infinitely divisible distributions, so we use this to introduce the gamma process.  We illustrate Bayesian operations for the gamma process:  data likelihoods, conditioning on discrete evidence and marginalising.
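As a small taste of the constructive view, here is a sketch in plain Python of a homogeneous Poisson point process on an interval (draw a Poisson count, then place that many points uniformly), together with a check of infinite divisibility (a sum of independent Poisson variables with rate λ/k behaves like a single Poisson with rate λ).  The function names, rates and sample sizes are illustrative assumptions, not taken from the lecture slides.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; fine for small rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_poisson_process(rate, start, end, rng):
    """Constructive definition: Poisson count, then i.i.d. uniform locations."""
    n = sample_poisson(rate * (end - start), rng)
    return sorted(rng.uniform(start, end) for _ in range(n))


def sample_by_division(lam, parts, rng):
    """Infinite divisibility: sum of `parts` independent Poisson(lam / parts)."""
    return sum(sample_poisson(lam / parts, rng) for _ in range(parts))


rng = random.Random(0)
points = sample_poisson_process(rate=2.0, start=0.0, end=3.0, rng=rng)
direct = [sample_poisson(6.0, rng) for _ in range(2000)]
divided = [sample_by_division(6.0, 10, rng) for _ in range(2000)]

print("points:", points)
print("mean of direct draws: ", sum(direct) / len(direct))
print("mean of divided draws:", sum(divided) / len(divided))
```

Both sample means should sit close to 6, the rate times the interval length.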

No lectures the following two weeks, 14th and 21st September, as I will be on travel.



On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process

July 30, 2016

Kar Wai Lim has just been told they “confirmed the approval” of his PhD (though it hasn’t been “conferred” yet, so he’s not officially a Dr., yet) and he spent the time post submission pumping out journal and conference papers.  Ahhh, the unencumbered life of the fresh PhD!

This one:

“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning 78 (2016) 172–191.

includes what I believe is the world’s best tweet clusterer.  Certainly blows away the state-of-the-art tweet pooling methods.  Main issue is that the current implementation only scales to a million or so tweets, and not the 100 million or so expected in some communities.  Easily addressed with a bit of coding work.

We did this to demonstrate the rich possibilities in terms of semantic hierarchies one has, largely unexplored, using simple Gibbs sampling with Pitman-Yor processes.   Lan Du (Monash) started this branch of research.  I challenge anyone to do this particular model with variational algorithms 😉   The machine learning community in the last decade unfortunately got lost on the complexities of Chinese restaurant processes and stick-breaking representations for which complex semantic hierarchies are, well, a bit of a headache!
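For anyone who has not met the Chinese restaurant process view, here is a minimal sketch in plain Python of the seating scheme for a single Pitman-Yor process: a new customer joins an existing table with weight (table size minus discount) and opens a new table with weight (concentration plus discount times the number of tables).  It is only the basic building block, not the hierarchical model or the Gibbs sampler from the paper, and the function name and parameter values are assumptions.

```python
import random


def pitman_yor_crp(num_customers, discount, concentration, rng):
    """Seat customers one at a time and return the list of table sizes."""
    tables = []
    for _ in range(num_customers):
        # Unnormalised weights: (size - discount) for each existing table and
        # (concentration + discount * K) for a new one, where K = len(tables).
        # They sum to concentration + (number of customers already seated).
        weights = [size - discount for size in tables]
        weights.append(concentration + discount * len(tables))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append(1)      # open a new table
        else:
            tables[choice] += 1   # join an existing table
    return tables


rng = random.Random(1)
tables = pitman_yor_crp(1000, discount=0.5, concentration=1.0, rng=rng)
print("number of tables:", len(tables))
print("largest tables:", sorted(tables, reverse=True)[:10])
```

Raising the discount towards 1 produces many more small tables, which is the power-law behaviour that makes the Pitman-Yor process a good fit for word frequencies.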


What “50 Years of Data Science” Leaves Out

November 28, 2015

This blog post is from Sean Owen, Director, Data Science @Cloudera / London:

I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.

Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics.  It is a superset, yes, but an important superset.

Some additional comments:

  • Donoho, like Breiman before him, splits Statistics/Machine Learning into Generative versus Predictive modelling.  I have never really understood this, because nearly 40% of published ML is generative modelling, as is the majority of my work.
  • Other important aspects of Data Science we cover in our Monash course are:
    • data governance and data provenance
    • the business processes and “operationalisation” (putting the results to work to achieve value)
    • getting data, fusing different kinds of data, envisaging data science projects
  • These are above and beyond the area of Greater Data Science (Donoho, section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what the industry calls a data scientist.

Also, as a Machine Learning guy who’s been doing Computational Statistics for 25 years, I think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks to do in areas like robotics, natural language processing and image processing.  These require statistical ingenuity, domain understanding, and computational trickery.  I have important contacts both in the Statistical community and the NLP community so I can do my work.



Basic tutorial: Oldie but a goody …

November 7, 2015

A student reminded me of Gregor Heinrich’s excellent introduction to topic modelling, including a great introduction to the underlying foundations like Dirichlet distributions and multinomials.  Great reading for all students!  See

  • G. Heinrich, Parameter estimation for text analysis, Technical report, Fraunhofer IGD, 15 September 2009 at his publication page.

Some diversions

July 4, 2015

Quora’s answers on “What are good ways to insult a Bayesian statistician?”  Excellent.  I’m insulted 😉

Great Dice Data: How Tech Skills Connect.  “Machine learning” is there (case sensitive) under the Data Science cluster in light green.

Data scientist payscales: “machine learning” raises your expected salary but “MS SQL server” lowers it!

Cheat sheets: the Machine Learning Cheat Sheet and the Probability Cheat Sheet.  Very handy!