Archive for the ‘results’ Category


A small model of ArXiv abstracts

April 12, 2016

I’ve been working with Dorota Glowacka of the University of Helsinki on a search system built on ArXiv.org.  We have a demo appearing in SIGIR 2016.  Here is a model with 100 topics, built using normalised gamma priors on topics (giving each topic a variance parameter as well as a mean parameter), on the 1.1 million abstracts up to March 2016.  The model took about 6 hours to run on my desktop.

This is a huge PNG file (2.8 MB).  You will not be able to view it unless you:

  • load it up on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to the bottom right and click on “View full size” to bring it up,
  • and then zoom around to view.

Visualising a topic model

March 25, 2016

I finally decided to write a proper visualiser for topic models.   I used the WordCloud Python tool from AMueller [GitHub], modified because I needed the input to be words with precomputed scores rather than raw text.  Moreover, I wanted two dimensions for the words displayed: size (word frequency in the topic) and lightness (the degree to which the word is characteristic of the topic, measured as frequency over document frequency).  I also scale the final tag cloud according to the size of the topic in the corpus.  The correlation between topics is computed from the document-topic proportions.  All of these then go into GraphViz, where nodes are displayed as images, after a lot of careful weighting and organising of the number of topic correlations to display, edge weights, etc.
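
To make the scoring concrete, here is a rough Python sketch of the quantities described above. It is not the code used for the images in this post. The function names, the rescaling of lightness to a maximum of 1.0, the use of Pearson correlation, the edge threshold and the `topicN.png` file names are all assumptions made for illustration.

```python
import math


def word_scores(topic_counts, doc_freq):
    """Return {word: (size, lightness)} for one topic.

    size      = the word's share of the topic's total count
    lightness = count in topic / document frequency, rescaled so the
                most topic-specific word gets 1.0 (rescaling is assumed)
    """
    total = sum(topic_counts.values())
    ratios = {w: c / doc_freq[w] for w, c in topic_counts.items()}
    top = max(ratios.values())
    return {w: (c / total, ratios[w] / top) for w, c in topic_counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def topic_correlations(doc_topic):
    """doc_topic[d][k] is the proportion of topic k in document d."""
    k = len(doc_topic[0])
    cols = [[row[j] for row in doc_topic] for j in range(k)]
    return {(i, j): pearson(cols[i], cols[j])
            for i in range(k) for j in range(i + 1, k)}


def to_dot(corr, threshold):
    """Emit an undirected GraphViz graph, keeping only edges above threshold."""
    lines = ["graph topics {", '  node [shape=none, label=""];']
    nodes = sorted({t for pair in corr for t in pair})
    for t in nodes:
        # Hypothetical file names: one tag-cloud image per topic.
        lines.append(f'  t{t} [image="topic{t}.png"];')
    for (i, j), r in sorted(corr.items()):
        if r > threshold:
            lines.append(f"  t{i} -- t{j} [weight={r:.3f}];")
    lines.append("}")
    return "\n".join(lines)
```

The `(size, lightness)` pairs would drive the font size and colour of each word in a tag cloud, and the text returned by `to_dot` could be saved to a file and laid out with a GraphViz engine such as `fdp`. Raising the threshold is one simple way to limit how many topic correlations are drawn.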

Below are the results on ABC news articles from their website, 2003-2012, collected by Dr. Jinjing Li of NATSEM in Canberra.   These images are about 4000 by 4000 pixels.  You will not be able to view them unless you:

  • get on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to bottom right, click on “View full size” to bring it up,
  • and then zoom around to view.

To produce the banking one, I ran the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1