Posts Tagged ‘HCA’


A small model of ArXiv abstracts

April 12, 2016

I’ve been working with Dorota Glowacka of the University of Helsinki on a search system built on ArXiv.org. We have a demo appearing in SIGIR 2016. Here is a model with 100 topics built using normalised gamma priors on topics (giving each topic a variance parameter as well as a mean parameter) on the 1.1 million abstracts to March 2016. The model took about 6 hours to run on my desktop.

This is a huge PNG file (2.8 MB). You will not be able to view it unless you:

  • load it up on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to the bottom right and click on “View full size” to bring it up,
  • and then zoom around to view.

Visualising a topic model

March 25, 2016

Finally decided to write a proper visualiser for topic models. I used the WordCloud Python tool from AMueller (on GitHub). I modified it because my input is words with precomputed scores rather than raw text. Moreover, I wanted two dimensions for the words displayed: size (word frequency in the topic) and lightness (the degree to which the word is characteristic of the topic, measured as frequency over document frequency). I also scale the final tag cloud depending on the size of the topic in the corpus. The correlation between topics is computed from the document-topic proportions. All of these then go into GraphViz, where nodes are displayed as images, with a lot of careful weighting and organising of the number of topic correlations to display, edge weights, etc.
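A minimal sketch of the scoring and graph steps described above, not the actual code behind the images. It assumes "document frequency" means the number of documents containing the word, uses Pearson correlation over the columns of the document-topic matrix, and keeps only the strongest positive correlations as edges. All function names are hypothetical, and the mapping from the specificity score to a lightness value is left out.

```python
import math


def word_display_scores(topic_word_counts, doc_freq):
    """Return {word: (size, specificity)} for one topic.

    size        = share of the topic's word count (drives font size)
    specificity = count in topic / document frequency (drives lightness);
                  this ratio is an assumed reading of the post, unnormalised.
    """
    total = sum(topic_word_counts.values())
    return {
        word: (count / total, count / doc_freq[word])
        for word, count in topic_word_counts.items()
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def topic_correlations(doc_topic):
    """Correlate every pair of topic columns in a document-topic matrix."""
    k = len(doc_topic[0])
    cols = [[row[t] for row in doc_topic] for t in range(k)]
    return {
        (a, b): pearson(cols[a], cols[b])
        for a in range(k)
        for b in range(a + 1, k)
    }


def to_dot(corr, image_for_topic, max_edges=2):
    """Emit Graphviz DOT text: one image node per topic, strongest edges only."""
    positive = [(pair, r) for pair, r in corr.items() if r > 0]
    strongest = sorted(positive, key=lambda kv: kv[1], reverse=True)[:max_edges]
    lines = ["graph topics {", '  node [shape=none, label=""];']
    for t, img in sorted(image_for_topic.items()):
        lines.append(f'  t{t} [image="{img}"];')
    for (a, b), r in strongest:
        lines.append(f"  t{a} -- t{b} [weight={r:.3f}];")
    lines.append("}")
    return "\n".join(lines)
```

The DOT text can then be laid out with a Graphviz engine such as fdp; limiting the number of edges per graph is what keeps the layout readable.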

Below are the results on ABC news articles from their website, 2003-2012, collected by Dr. Jinjing Li of NATSEM in Canberra. These images are about 4000 by 4000 pixels. You will not be able to view them unless you:

  • get on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to bottom right, click on “View full size” to bring it up,
  • and then zoom around to view.

To produce the banking one, I run the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1


Talk at Topic Models workshop at CIKM 2015

October 21, 2015

Attended a great workshop at CIKM 2015, Topic Models: Post-Processing and Applications, and gave a talk. The papers were of surprisingly good quality for a workshop of its kind, so I learnt a lot. My talk was about better motivating and explaining some of the features of our non-parametric system that let you diagnose topics: CIKM15 TM talk, Buntine.


Video of a Lecture

February 20, 2015

The talk I gave at JSI (Jozef Stefan Institute in Ljubljana) on 14th Jan 2015 was recorded. The group here, with Dunja Mladenić and Marko Grobelnik, are experts in areas like Data Science and Text Mining, but they’re not into Bayesian non-parametrics, so in this version of the talk I mostly avoided the statistical details and talked more about what we did and why. The talk is up on Video Lectures. The original PDF of the EU talk sequence was on this post.


Latent IBP compound Dirichlet Allocation

January 30, 2015

I visited Ralf Herbrich’s new Amazon offices in Berlin on 8th January and chatted there on topic modelling.  A pleasant result for me was meeting both Cédric Archambeau and Jan Gasthaus, both having been active in my area.

With Cédric I discussed the Latent IBP compound Dirichlet Allocation model (in the long awaited IEEE Trans PAMI 2015 special issue on Bayesian non-parametrics, PDF on Cédric’s webpage), which combines a 3-parameter Indian Buffet Process with Dirichlets. This was joint work with Balaji Lakshminarayanan and Guillaume Bouchard.

Their work improves on the earlier paper by Williamson, Wang, Heller, and Blei (2010) on the Focused Topic Model, which struggled with using the IBP theory. I’d originally ignored Williamson et al.’s work because they only tested on toy data sets, despite it being such a great model. The marginal posterior for the document-topic indicator matrix is given in Archambeau et al.’s Equation (43), which they attribute to Teh and Görür (NIPS 2009), but it’s easily derived using Dirichlet marginals and Lancelot James’ general formulas for IBPs. This is the so-called IBP compound Dirichlet. From there, it’s easy to derive a collapsed Gibbs sampler mirroring regular LDA. Theory and sampling hyper-parameters for the 3-parameter IBP I describe in my Helsinki talk and coming tutorial.
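For reference, here is a minimal sketch of the regular LDA collapsed Gibbs sampler that the derivation mirrors. It is plain LDA with symmetric Dirichlet priors, not the IBP compound Dirichlet sampler from the paper and not the HCA implementation. The function name, the default values of alpha and beta, and the number of sweeps are illustrative assumptions.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns (z, n_dk, n_kw): topic assignments,
    document-topic counts and topic-word counts.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    n_k = [0] * num_topics                                # total tokens in topic k

    # Random initial assignment of every token to a topic.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    topics = range(num_topics)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + W * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw
```

In the IBP compound Dirichlet setting, the document side of the conditional changes because each document only uses the topics switched on in its row of the indicator matrix; the word side stays the same shape.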

This is a great paper with quality empirical work and the best results I’ve seen for non-parametric LDA (other than ours). Their implementation isn’t performance tuned so their timing figures are not that indicative, but they ran on non-trivial data sets so it’s good enough.

Note for the curious, I ran our HCA code to duplicate their experimental results on the two larger data sets.  Details in the Helsinki talk.  Basically:

  • They compared against prior HDP-LDA implementations so of course beat them substantially.
  • Our version of HDP-LDA (without burstiness) works as well as their LIDA algorithm (their better one, with IBPs on both the topic and the word side) and is substantially faster.
  • Our fully non-parametric LDA (DPs for documents, PYPs for words, no burstiness) beats their LIDA substantially.

So while we beat them with our superior collapsed Gibbs sampling, their results are impressive so I’m excited by the possibility of trying their methods.


Experimental results for non-parametric LDA

November 12, 2014

Swapnil Mishra and I have been testing different software for HDP-LDA, the non-parametric version of LDA first published by Teh, Jordan, Beal and Blei in JASA 2006. Since then ever more complex and theoretical approaches have been published for this, and it’s a common topic in recent NIPS conferences. We’d noticed that the LDA implementation in Mallet has an asymmetric-symmetric version which is a truncated form of HDP-LDA, and David Mimno says this has been around since 2007, though Wallach, Mimno and McCallum published their results with it in NIPS 2009. Another fast implementation is by Sato, Kurihara, and Nakagawa in KDD 2011. The original version some people test against is Yee Whye Teh’s implementation from 2004, Nonparametric Bayesian Mixture Models – release 2.1. This is impressive because it does “slow” Gibbs sampling and still works OK!

We’ve had real trouble comparing different published work because everyone has different ways of measuring perplexity, their test sets are hard to reconstruct, and sometimes their code works really badly and we’ve been unable to get realistic looking results.
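To illustrate how the reported number depends on the convention, here is a small sketch with two hypothetical ways of turning per-token log probabilities into a perplexity. Neither is claimed to be the definition used in any particular paper, and the function names are made up for this example.

```python
import math


def token_level_perplexity(doc_log_probs):
    """exp of minus the mean log probability, averaged over all test tokens."""
    total = sum(sum(doc) for doc in doc_log_probs)
    n_tokens = sum(len(doc) for doc in doc_log_probs)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total / n_tokens)


def document_averaged_perplexity(doc_log_probs):
    """Mean of the per-document perplexities (every document weighted equally)."""
    per_doc = [math.exp(-sum(doc) / len(doc)) for doc in doc_log_probs if doc]
    if not per_doc:
        raise ValueError("need at least one non-empty document")
    return sum(per_doc) / len(per_doc)


# Two test documents with natural-log probabilities for each token:
# one "easy" (p = 1/2 per token), one "hard" (p = 1/8 per token).
example = [[math.log(0.5)] * 2, [math.log(0.125)] * 2]
```

On `example` the token-level figure is the geometric-mean style value, while the document-averaged figure is pulled up by the hard document, so the same model on the same test set gets two different scores. How the per-token log probabilities are estimated for held-out documents is a further choice not shown here.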

Our KDD paper did some comparisons. We’re re-running things more carefully now. To compare against Sato et al.’s results, Sato kindly sent us his original data sets. Their experimental work was thorough and precisely written up, so it’s a good one to compare with. We’ve then re-run Mallet with their ASYM-SYM option to compare, along with our own two versions of HDP-LDA and our fully non-parametric version NP-LDA (which puts a Pitman-Yor process on the word distributions and a Dirichlet process on the topic distributions). Results in the plot below; lower perplexity is better. Not sure about Teh’s original 2004 code, but our experience from other runs suggests it doesn’t match these others.

Most of these are with 300 topics. We’re impressed with how well the others performed. Mallet is a lot faster because they use some very clever sampling designed for LDA. Ours is the next fastest because our samplers are much better and it runs moderately well on 8 cores (cheating really). More details on the comparison in a forthcoming paper.


New release of HCA

September 10, 2014

Version 0.61 just posted.  I’ve been adding some integration to make the diagnostics more useful.  Plus a few corrections to sampling.  Really need to clean up the code though.   uuuggghhh!