Archive for the ‘software’ Category

Is C that bad of a programming language?

May 2, 2016

The following quote about C comes from a Quora answer to the above question:

I don’t think C gets enough credit. Sure, C doesn’t love you. C isn’t about love–C is about thrills. C hangs around in the bad part of town. C knows all the gang signs. C has a motorcycle, and wears the leathers everywhere, and never wears a helmet, because that would mess up C’s punked-out hair. C likes to give cops the finger and grin and speed away. Mention that you’d like something, and C will pretend to ignore you; the next day, C will bring you one, no questions asked, and toss it to you with a you-know-you-want-me smirk that makes your heart race. Where did C get it? “It fell off a truck,” C says, putting away the boltcutters. You start to feel like C doesn’t know the meaning of “private” or “protected”: what C wants, C takes. This excites you. C knows how to get you anything but safety. C will give you anything but commitment. In the end, you’ll leave C, not because you want something better, but because you can’t handle the intensity. C says “I’m gonna live fast, die young, and leave a good-looking corpse,” but you know that C can never die, not so long as C is still the fastest thing on the road.

I love it.  Still do most of my programming in C using a mix of vi, emacs, gdb, valgrind and all that good old stuff, resorting to Python/Perl scripts sometimes for automation.  Know I should be using a proper development environment, loading up my code with bulky libraries like Boost, and using complex build systems like CMake and Autoconf (hell, why not even Imake; done all these in the past).  I should also be using great inventions like multiple inheritance, operator overloading and recursive templates, but I find C’s simple approach to memory handling and functions just a lot safer.

Most of my students use Java, though.  Automatic garbage collection and the IDE support seem to be what they like, as well as the loads of good code out there to work with.

Visualising a topic model

March 25, 2016

Finally decided to write a proper visualiser for topic models.  I used amueller’s WordCloud Python tool (on GitHub).  I modified it because I needed the input to be words with precomputed scores, rather than raw text.  Moreover, I wanted two display dimensions for words: size (word frequency in the topic) and lightness (the degree to which the word is characterised by the topic, measured as frequency over document frequency).  I also scale the final tag cloud by the size of the topic in the corpus.  The correlation between topics is computed from the document-topic proportions.  All of this then goes into GraphViz, where nodes are displayed as images; there is a lot of careful weighting and organising of how many topic correlations to display, the edge weights, and so on.
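
The word-cloud half of this is easy to sketch.  A minimal illustration, assuming amueller’s wordcloud package; the words and scores below are made up for the example, whereas in the real pipeline they come out of the topic model:

# Sketch only: size each word by its frequency in the topic, and colour it by
# distinctiveness (topic frequency over document frequency), using wordcloud's
# generate_from_frequencies and color_func hooks.  Scores below are invented.
from wordcloud import WordCloud

freq = {"bank": 120, "loan": 80, "rate": 60}        # frequency of word in topic
distinct = {"bank": 0.9, "loan": 0.7, "rate": 0.3}  # freq over doc-freq, in [0, 1]

def lightness_color(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    # More characteristic words come out darker (lower HSL lightness).
    return "hsl(0, 0%%, %d%%)" % int(80 - 60 * distinct.get(word, 0.0))

wc = WordCloud(width=800, height=800, background_color="white",
               color_func=lightness_color)
wc.generate_from_frequencies(freq)  # sizes driven by the precomputed scores
wc.to_file("topic.png")

Each topic’s cloud then gets scaled by the topic’s share of the corpus before being handed to GraphViz as a node image.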

Below are the results on ABC news articles from their website, 2003-2012, collected by Dr. Jinjing Li of NATSEM in Canberra.  These images are about 4000 by 4000 pixels.  You will not be able to view them unless you:

  • get on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to the bottom right and click on “View full size” to bring it up,
  • and then zoom around to view.

To produce the banking one, I run the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1

GitHub activity

November 24, 2015

So I spent most of this year doing Introduction to Data Science (an introductory unit at Monash) and getting the Grad. Dip. of Data Science and the Master of Data Science up and running (some background here).

As a result, you can see the disastrous impact it has had on my GitHub activity, which is a measure of my coding productivity!

[Image: Wray’s activity on GitHub for 2015]

Experimental results for non-parametric LDA

November 12, 2014

Swapnil Mishra and I have been testing different software for HDP-LDA, the non-parametric version of LDA first published by Teh, Jordan, Beal and Blei in JASA 2006.  Since then, ever more complex and theoretical approaches have been published for it, and it’s a common topic at recent NIPS conferences.  We’d noticed that the LDA implementation in Mallet has an asymmetric-symmetric version which is a truncated form of HDP-LDA; David Mimno says this has been around since 2007, though Wallach, Mimno and McCallum published their results with it in NIPS 2009.  Another fast implementation is by Sato, Kurihara and Nakagawa in KDD 2011.  The original version some people test against is Yee Whye Teh’s implementation from 2004, Nonparametric Bayesian Mixture Models, release 2.1.  This is impressive because it does “slow” Gibbs sampling and still works OK!

We’ve had real trouble comparing different published work because everyone measures perplexity differently, their test sets are hard to reconstruct, and sometimes their code behaves really badly and we’ve been unable to get realistic-looking results.
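
For reference, the one quantity everyone is (differently) estimating is per-word test perplexity.  A minimal sketch of the final formula, assuming the per-token predictive probabilities have already been estimated somehow:

# Per-word perplexity from per-token log probabilities (natural log).
# The real differences between papers lie in how these predictive
# probabilities are estimated on held-out documents (document completion,
# left-to-right, importance sampling, ...), not in this final step.
import math

def perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))

# e.g. three held-out tokens with predictive probabilities 0.01, 0.02, 0.005
print(perplexity([math.log(p) for p in (0.01, 0.02, 0.005)]))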

Our KDD paper did some comparisons.  We’re re-running things more carefully now.  To compare against Sato et al.’s results, he kindly sent us his original data sets.  Their experimental work was thorough and precisely written up, so it’s a good one to compare with.  We’ve then re-run Mallet with their ASYM-SYM option, our own two versions of HDP-LDA, and our fully non-parametric version NP-LDA (which puts a Pitman-Yor process on the word distributions and a Dirichlet process on the topic distributions).  Results are in the plot below; lower perplexity is better.  Not sure about Teh’s original 2004 code, but our experience from other runs would be that it doesn’t match these others.
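
Spelling out that parenthetical in generative-model form (my notation, a sketch of the structure rather than our exact parameterisation):

% NP-LDA structure, sketched: a Dirichlet process over the document-topic
% distributions and a Pitman-Yor process over the topic-word distributions,
% each hanging off a shared global base.
\begin{align*}
\vec{\mu} &\sim \mathrm{GEM}(b_0)                     && \text{global topic proportions} \\
\vec{\theta}_d &\sim \mathrm{DP}(b_1, \vec{\mu})      && \text{topic distribution for document } d \\
\vec{\phi}_0 &\sim \mathrm{Dirichlet}(\beta \vec{1})  && \text{background word distribution} \\
\vec{\phi}_k &\sim \mathrm{PYP}(a, b_2, \vec{\phi}_0) && \text{word distribution for topic } k \\
z_{dn} &\sim \mathrm{Discrete}(\vec{\theta}_d), \quad
w_{dn} \sim \mathrm{Discrete}(\vec{\phi}_{z_{dn}})
\end{align*}

HDP-LDA keeps the DP hierarchy on the topic side but uses a plain symmetric Dirichlet for the word distributions.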

[Image: pcvb0 perplexity comparison plot]

Most of these are with 300 topics.  We’re impressed with how well the others performed.  Mallet is a lot faster because it uses some very clever sampling designed for LDA.  Ours is the next fastest because our samplers are much better, and it runs moderately well on 8 cores (cheating, really).  More details on the comparison in a forthcoming paper.

New release of HCA

September 10, 2014

Version 0.61 just posted.  I’ve been adding some integration to make the diagnostics more useful, plus a few corrections to the sampling.  Really need to clean up the code though.  uuuggghhh!

So Mallet’s asymmetric-symmetric LDA approximates HDP-LDA

June 10, 2014

So Swapnil and I were doing a stack of different comparisons with others’ software, and it dawned on me: the asymmetric-symmetric variant in Mallet’s LDA approximates HDP-LDA, first described in the hierarchical Dirichlet process article, the much-cited non-parametric version of LDA.  In fact, since 2008 it’s probably been the best implementation and no one knew, and it’s blindingly fast too.  Though we got better performance; see our KDD paper.
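
Why a truncated HDP looks like this (my sketch of the standard argument): cut the global topic distribution off at K atoms, and the per-document Dirichlet process collapses to a finite asymmetric Dirichlet whose base weights are learned, while the topic-word side keeps its symmetric Dirichlet.  That asymmetric-symmetric pairing is what Mallet’s hyperparameter optimisation fits.

% Truncated HDP-LDA: a DP with a K-atom base measure is a finite Dirichlet.
\begin{align*}
\vec{\theta}_d \sim \mathrm{DP}\Big(\alpha, \textstyle\sum_{k=1}^{K} \mu_k \delta_k\Big)
  &\;\Longrightarrow\;
  \vec{\theta}_d \sim \mathrm{Dirichlet}(\alpha\mu_1, \dots, \alpha\mu_K) \\
\vec{\phi}_k &\sim \mathrm{Dirichlet}(\beta, \dots, \beta)
\end{align*}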