
MDSS Seminar Series: Doing Bayesian Text Analysis

August 4, 2017

Giving a talk to the Monash Data Science Society on August 28th.  Details here.  It's a historical perspective and motivational talk about doing text and document analysis.  Slides are here.


Lectures: Learning with Graphical Models

July 15, 2017

I’m giving a series of lectures this semester combining graphical models and some elements of nonparametric statistics.  The intent is to build up to the theory of discrete matrix factorisation and its many variations. The lectures start on 27th July and are mostly given weekly.  Weekly details are given in the calendar too.  The slides are on the Monash share drive under “Wray’s Slides” so if you are at Monash, do a search on Google drive to find them.  If you cannot find them, email me for access.

OK, lectures are over as of 24/10/2017!  I have some other things to prepare.

Variational Algorithms and Expectation-Maximisation, Lecture 6, 19/10/17, Wray Buntine

This week picks up the material not covered in the last lecture.  For exponential family distributions, working with the mean of Gibbs samples sometimes corresponds to another algorithm called Expectation-Maximisation. We will look at this in terms of the Kullback-Leibler versions of variational algorithms. The general theory is quite involved, so we will work through it with some examples, like variational auto-encoders, Gaussian mixture models, and extensions to LDA.
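For concreteness, here is the standard decomposition that the KL view of variational algorithms rests on, written in my own notation rather than that of the slides:

```latex
% Log-evidence decomposition behind KL-based variational methods:
\[
\log p(x \mid \theta)
   = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\text{ELBO}(q,\,\theta)}
   + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big).
\]
% EM is the special case that alternates: the E-step sets q(z) = p(z | x, \theta)
% at the current \theta (driving the KL term to zero), and the M-step maximises
% the ELBO over \theta with q held fixed.
```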

No lectures this week, 12th October, as I will be catching up on reviews and completing a journal article. Next week we’ll work through some examples of variational algorithms, including LDA with an HDP, a model whose VA theory has been thoroughly botched up historically.

Gibbs Sampling, Variational Algorithms and Expectation-Maximisation, Lecture 5, 05/10/17, Wray Buntine

Gibbs sampling is the simplest of the Markov chain Monte Carlo (MCMC) methods, and the easiest to understand. For computer scientists, it is closely related to local search. We will look at the basic justification of Gibbs sampling and see examples of its variations: block Gibbs, augmentation and collapsing. Clever use of these techniques can dramatically improve performance. This gives a rich class of algorithms that, for smaller data sets at least, addresses most problems in learning. For exponential family distributions, taking the mean instead of sampling sometimes corresponds to another algorithm called Expectation-Maximisation. We will look at this in terms of the Kullback-Leibler versions of variational algorithms. We will look at the general theory and some examples, like variational auto-encoders and Gaussian mixture models.
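As a minimal sketch of what a Gibbs sampler looks like in code, here is a toy example of my own (a bivariate Gaussian, not something from the lecture), where each coordinate is resampled from its exact conditional given the other:

```python
import numpy as np

# Toy Gibbs sampler for a standard bivariate Gaussian with correlation rho.
# Each sweep resamples each coordinate from its exact conditional given the other.
rho = 0.8
rng = np.random.default_rng(0)
x, y = 0.0, 0.0
samples = []
for _ in range(10_000):
    # x | y ~ N(rho * y, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    # y | x ~ N(rho * x, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    samples.append((x, y))

# The empirical correlation of the chain should be close to rho.
print(np.corrcoef(np.array(samples).T))
```

In the lecture's terms, block Gibbs would resample (x, y) jointly, and collapsing would integrate one of the variables out analytically before sampling the other.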

ASIDE: Determinantal Point Processes, one off lecture, 28/09/17, Wray Buntine

Representing objects with feature vectors lets us measure similarity using dot products.  Using this notion, the determinantal point process (DPP) can be introduced as a distribution over objects that maximises diversity.  In this tutorial we will explore the DPP with the help of the visual analogies developed by Kulesza and Taskar in their tutorials and their 120-page Foundations and Trends article “Determinantal Point Processes for Machine Learning.” Topics covered are interpretations and definitions, probability operations such as marginalising and conditioning, and sampling.  The tutorial draws heavily on knowledge of matrices and determinants.
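For reference, the standard L-ensemble formulation from Kulesza and Taskar's material, stated here in my own notation:

```latex
% L-ensemble form of a DPP over subsets S of a ground set, with PSD kernel L:
\[
P(Y = S) = \frac{\det(L_S)}{\det(L + I)},
\qquad
P(A \subseteq Y) = \det(K_A)
\quad\text{where } K = L (L + I)^{-1}.
\]
% L_S and K_A are the submatrices indexed by S and A; the determinant shrinks
% when rows are similar, which is what rewards diversity.
```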

No lectures the following two weeks, 14th and 21st September, as I will be travelling.

Basic Distributions and Poisson Processes, Lecture 4, 07/09/17, Wray Buntine

We review the standard discrete distributions: their relationships, properties and conjugate distributions.  This includes deriving the Poisson distribution as an infinitely divisible distribution on the natural numbers with a fixed rate.  Then we introduce Poisson point processes as a model of stochastic processes.  We show how they behave in both the discrete and continuous cases, and how they have both constructive and axiomatic definitions.  The same definitions can be extended to any infinitely divisible distribution, so we use this to introduce the gamma process.  We illustrate Bayesian operations for the gamma process: data likelihoods, conditioning on discrete evidence, and marginalising.
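Two of the standard facts in play here, stated in my own notation (with the gamma in shape-rate form):

```latex
% Counting property of a Poisson point process with rate measure \mu:
\[
N(A) \sim \mathrm{Poisson}\big(\mu(A)\big),
\quad\text{with counts on disjoint sets independent.}
\]
% Gamma-Poisson conjugacy:
\[
\lambda \sim \mathrm{Gamma}(a, b),\;
x_1,\dots,x_n \mid \lambda \sim \mathrm{Poisson}(\lambda)
\;\Longrightarrow\;
\lambda \mid x_{1:n} \sim \mathrm{Gamma}\Big(a + \textstyle\sum_i x_i,\; b + n\Big).
\]
```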

Directed and Undirected Independence Models, Lecture 3, 31/08/17, Wray Buntine

We will develop the basic properties and results for directed and undirected graphical models.  This includes testing for independence, developing the corresponding functional form, and understanding probability operations such as marginalising and conditioning.  To complete this section, we will also investigate operations on clique trees, to illustrate the principles.  We will not do full graphical model inference.
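For reference, the two standard factorisations in question:

```latex
% Directed graphical model: one conditional per node given its parents.
\[
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid \mathrm{pa}(x_i)\big)
\]
% Undirected graphical model: potentials over cliques C, with normaliser Z.
\[
p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C} \psi_C(x_C)
\]
```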

Information and working with Independence, Lecture 2, 17/08/17, Wray Buntine

This will continue with the information (entropy) material left over from the previous lecture.  Then we will look at the definition of independence and some independence models, including the relationship with causality.  Basic directed and undirected models will be introduced.  Some example problems will be presented (simply) to tie these together:  simple bucket search, bandits, graph colouring and causal reasoning.
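The two basic definitions this builds on, for reference:

```latex
% Entropy of a discrete random variable X:
\[
H(X) = -\sum_{x} p(x) \log p(x)
\]
% Conditional independence X \perp Y \mid Z:
\[
p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \text{whenever } p(z) > 0.
\]
```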

No lectures 03/08 (writing for ACML) and 10/08 (attending ICML).

Motivating Probability and Decision Models, Lecture 1, 27/07/17, Wray Buntine

This is an introduction to, and motivation for, Bayesian methods, these days called “full probability modelling” by the cognoscenti to avoid prior cultish associations and implications. We will look at modelling, causality, probability as frequency, and axiomatic underpinnings for reasoning, decisions, and belief. The importance of priors and computation forms the basis of this.

Reviewing in Machine Learning

June 11, 2017

A common subject for mutual commiseration in the community is the quality of reviewing.  In huge and specialised conferences like NIPS and ICML, there are so many papers and so many reviewers that generally the match-up between reviewers and papers is quite good, as good as or better than for a journal article.  In smaller conferences, like ACML, and for grant applications in relatively small places like Australia (e.g., the ARC), the match-up can be a lot poorer.  This causes reviewer misunderstandings.

Of course, one needs to be aware of The Great NIPS Reviewing Experiment of 2014.  This is a grand applied statistical experiment that only machine learning folks could think of 😉  I’ll just mention this because it is important to understand that the reviewing process is very challenging, and we as a community are trying our hardest.

Now, I think it’s very reasonable for some reviewers to not be specialists in the subject of the paper, merely “knowledgeable”.  After all, we would like the paper to be readable by more than just the 20 people who focus on that very specific topic.  These non-specialist reviewers generally flag themselves, so the meta-reviewers and authors know to take their comments with a (respectful) grain of salt.  But they can still be excellent in related and broadly applicable areas like experimental methodology and mathematical definitions of models, so they are still an important part of the reviewing ecosystem.  This works when reviewers know their limitations.  Unfortunately, reviewers don’t always do so.

But I still find general aspects of reviewing enlightening.

A case in point is our recent ICML 2017 paper “Leveraging Node Attributes for Incomplete Relational Data”.   Two reviewers said strong accept and one a mild reject.  For the would-be rejecter, the method was too simple.  We knew this paper was not full of the usual theoretical complexities expected of an ICML one, of course, so we made sure the experimental work was rock solid.  It was a risk submitting to ICML anyway, as anyone with experience knows the experimental work at ICML can be patchy; it’s not something generally looked for by reviewers.  If you want quality experimental work in machine learning, go to the knowledge discovery conferences like SIGKDD, certainly not a machine learning conference!

The reason we submitted the paper to ICML was that this simple method beat all previous work handily, in predictive performance, speed, or both.  Simplicity, it seems, has its advantages, and people should find out about it when it happens.   But, if it was so damn simple, why didn’t someone try it already (in truth, it wasn’t that simple)?  And given it works so much better, shouldn’t people find out that, for this problem, all the ICML-ish model complexity of previous methods was unnecessary? 😉  Now we did add a tricky hierarchical part to our otherwise simple model, just to appease the “meta is better” crowd, and we’re now busy trying to figure out how to add a novel stochastic process (something I love to do).

But unnecessary complexity is something I’m not a big fan of.  My favourite example of this is papers starting off with two pages of stochastic process theory before, finally, getting to the model and implementation.  But the model they implement is a truncated one: it is completely finite and requires no stochastic process theory to analyse in any way.  In a longer journal format, linking the truncated version with the full stochastic process theory is important to round off the treatment.  In a short format paper with considerable experimental work, details of Lévy processes are unnecessary if real non-parametric methods are not actually used in the algorithmic theory.


ICML 2017 paper: Leveraging Node Attributes for Incomplete Relational Data

May 19, 2017

Here is a paper with Ethan Zhao and Lan Du, both of Monash, which we’ll present in Sydney.

Relational data are usually highly incomplete in practice, which inspires us to leverage side information to improve the performance of community detection and link prediction. This paper presents a Bayesian probabilistic approach that incorporates various kinds of node attributes encoded in binary form in relational models with Poisson likelihood. Our method works flexibly with both directed and undirected relational networks. The inference can be done by efficient Gibbs sampling which leverages sparsity of both networks and node attributes. Extensive experiments show that our models achieve state-of-the-art link prediction results, especially with highly incomplete relational data.
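As a rough sketch of the general family of Poisson-likelihood relational models the abstract refers to, in illustrative notation of my own (the paper's exact parameterisation, including how the binary node attributes enter, differs):

```latex
% Generic Poisson edge model for a relation matrix R over nodes i, j, with
% non-negative node factors \theta and community rates \lambda (illustrative
% notation only, not the paper's model):
\[
R_{ij} \sim \mathrm{Poisson}\Big(\sum_{k} \theta_{ik}\, \lambda_k\, \theta_{jk}\Big),
\qquad
P(\text{link } i\!-\!j) = 1 - \exp\Big(-\sum_{k} \theta_{ik}\, \lambda_k\, \theta_{jk}\Big).
\]
% The Poisson form is what lets Gibbs sampling exploit sparsity: only observed
% non-zero entries contribute latent counts.
```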

As usual, the reviews were entertaining, and there are some interesting results we didn’t get into the paper.  It’s always enlightening doing comparative experiments.


ALTA 2016 Tutorial: Simpler Non-parametric Bayesian Models

April 21, 2017

They recorded the tutorial I ran at ALTA late in 2016.

Part 1 and part 2 are up on YouTube, about an hour each.

Visiting SMU in Singapore

November 23, 2016

I’m currently visiting the LARC with Steven Hoi and will give a talk tomorrow on my latest work, based on Lancelot James’ extensive theories of Bayesian non-parametrics.

Simplifying Poisson Processes for Non-parametric Methods

The main thing is I’ve been looking at hierarchical models, which have all their own theory, though it is easily developed using James’ general results.  Not sure I’ll put the slides up because it’s not yet published!


On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process

July 30, 2016

Kar Wai Lim has just been told they “confirmed the approval” of his PhD (though it hasn’t been “conferred” yet, so he’s not officially a Dr yet) and he spent the time post-submission pumping out journal and conference papers.  Ahhh, the unencumbered life of the fresh PhD!

This one:

“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning 78 (2016) 172–191.

includes what I believe is the world’s best tweet clusterer.  Certainly blows away the state-of-the-art tweet pooling methods.  The main issue is that the current implementation only scales to a million or so tweets, and not the 100 million or more expected in some communities.  Easily addressed with a bit of coding work.

We did this to demonstrate the rich, largely unexplored possibilities for semantic hierarchies one has using simple Gibbs sampling with Pitman-Yor processes.   Lan Du (Monash) started this branch of research.  I challenge anyone to do this particular model with variational algorithms 😉   The machine learning community in the last decade unfortunately got lost in the complexities of Chinese restaurant processes and stick-breaking representations, for which complex semantic hierarchies are, well, a bit of a headache!
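For reference, the two-parameter Chinese restaurant process predictive rule that such Pitman-Yor Gibbs samplers build on (a standard result, with discount d and concentration θ; not the specific hierarchical construction of the paper):

```latex
% After n customers are seated at K tables with counts n_1, ..., n_K:
\[
P(\text{join existing table } k) = \frac{n_k - d}{n + \theta},
\qquad
P(\text{start a new table}) = \frac{\theta + d\,K}{n + \theta}.
\]
% Here 0 <= d < 1 and \theta > -d; d = 0 recovers the Dirichlet process CRP.
```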