
On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process

July 30, 2016

Kar Wai Lim has just been told they have “confirmed the approval” of his PhD (though it hasn’t been “conferred” yet, so he’s not officially a Dr. yet), and he has spent the time post-submission pumping out journal and conference papers. Ahhh, the unencumbered life of the fresh PhD!

This one:

“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning 78 (2016) 172–191.

includes what I believe is the world’s best tweet clusterer. It certainly blows away the state-of-the-art tweet pooling methods. The main issue is that the current implementation only scales to a million or so tweets, not the 100 million or more expected in some communities; that is easily addressed with a bit of coding work.

We did this to demonstrate the rich, largely unexplored possibilities for semantic hierarchies that one has using simple Gibbs sampling with Pitman–Yor processes. Lan Du (Monash) started this branch of research. I challenge anyone to do this particular model with variational algorithms 😉 The machine learning community in the last decade unfortunately got lost in the complexities of Chinese restaurant processes and stick-breaking representations, for which complex semantic hierarchies are, well, a bit of a headache!
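
For the curious, the core Gibbs step is simpler than the surrounding theory suggests. Here is a minimal sketch in Python (my own illustration, not the paper’s code) of the Pitman–Yor predictive rule that drives such a sampler: an item joins an existing cluster with weight proportional to its count minus the discount, or opens a new cluster with weight proportional to the concentration plus the discount times the number of clusters. In the hierarchical version, the base probability itself comes from a parent Pitman–Yor process, which is where the semantic hierarchy enters.

import random

def pyp_sample_cluster(counts, likelihoods, base_likelihood,
                       discount, concentration):
    # counts[k]: number of items already assigned to cluster k
    # likelihoods[k]: probability of the current item under cluster k
    # base_likelihood: probability of the item under the base distribution
    # discount, concentration: Pitman-Yor parameters (0 <= discount < 1)
    k_max = len(counts)
    # an existing cluster k has prior weight (n_k - discount) ...
    weights = [(n - discount) * lik for n, lik in zip(counts, likelihoods)]
    # ... and a new cluster has weight (concentration + discount * K)
    weights.append((concentration + discount * k_max) * base_likelihood)
    r = random.uniform(0.0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return k  # k == k_max means "open a new cluster"
    return k_max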


Is C that bad of a programming language?

May 2, 2016

The following quote about C comes from a Quora answer to the above question:

I don’t think C gets enough credit. Sure, C doesn’t love you. C isn’t about love–C is about thrills. C hangs around in the bad part of town. C knows all the gang signs. C has a motorcycle, and wears the leathers everywhere, and never wears a helmet, because that would mess up C’s punked-out hair. C likes to give cops the finger and grin and speed away. Mention that you’d like something, and C will pretend to ignore you; the next day, C will bring you one, no questions asked, and toss it to you with a you-know-you-want-me smirk that makes your heart race. Where did C get it? “It fell off a truck,” C says, putting away the boltcutters. You start to feel like C doesn’t know the meaning of “private” or “protected”: what C wants, C takes. This excites you. C knows how to get you anything but safety. C will give you anything but commitment. In the end, you’ll leave C, not because you want something better, but because you can’t handle the intensity. C says “I’m gonna live fast, die young, and leave a good-looking corpse,” but you know that C can never die, not so long as C is still the fastest thing on the road.

I love it. I still do most of my programming in C using a mix of vi, emacs, gdb, valgrind and all that good old stuff, resorting to Python/Perl scripts sometimes for automation. I know I should be using a proper development UI, loading up my code with bulky libraries like Boost, and using complex install systems like CMake and Autoconf, hell, why not even Imake (I’ve done all these in the past). I should also be using great inventions like multiple inheritance, operator overloading and recursive templates, but I find C’s simple approach to memory handling and functions just a lot safer.

Most of my students use Java though. Automatic garbage collection and the UIs seem to be what they like, as well as the loads of good code out there to work on.


A small model of ArXiv abstracts

April 12, 2016

I’ve been working with Dorota Glowacka of the University of Helsinki on a search system built on ArXiv.org. We have a demo appearing in SIGIR 2016. Here is a model with 100 topics built using normalised gamma priors on topics (giving each topic a variance parameter as well as a mean parameter), trained on the 1.1 million abstracts up to March 2016. The model took about 6 hours to run on my desktop.

This is a huge PNG file (2.8 MB). You will not be able to view it properly unless you:

  • load it up on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to the bottom right and click on “View full size” to bring it up,
  • and then zoom around to view.
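
To give a feel for what a normalised gamma prior does, here is an illustrative parameterisation of my own in Python (not necessarily the exact one in the demo): each topic’s word distribution is a vector of independent gamma draws, normalised to sum to one, where the shape and scale jointly control the mean and the variance of the unnormalised weights.

import numpy as np

rng = np.random.default_rng(42)

def sample_topic(mean_weights, variance_scale):
    # mean_weights: unnormalised mean weight per word (length = vocabulary size)
    # variance_scale: the per-topic variance parameter; larger = noisier topic
    # Each unnormalised weight is Gamma(shape=m/s, scale=s), giving mean m and
    # variance m*s; normalising the vector yields the topic's word distribution.
    shape = np.asarray(mean_weights) / variance_scale
    g = rng.gamma(shape, scale=variance_scale)
    return g / g.sum()

# example: a 5-word vocabulary; a small scale hugs the mean, a large one wanders
means = np.array([5.0, 2.0, 1.0, 1.0, 1.0])
print(sample_topic(means, 0.1))
print(sample_topic(means, 10.0))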

Talk at Data Science Meetup

April 4, 2016

Today I’ll be giving a version of my “document analysis” grand tour talk to the Data Science Meetup in Melbourne. The slides for the talk in PDF are here. I also did a smaller version of one of my new graphics, this one on obesity. It needs to be readable by general viewers from some distance away, so everything must be larger. The standard ones need a really big screen, or you need to be up close!


Visualising a topic model

March 25, 2016

I finally decided to write a proper visualiser for topic models. I used the WordCloud Python tool from AMueller [GitHub], modified because I needed the input to be words with precomputed scores rather than raw text. Moreover, I wanted two dimensions for the words displayed: size (the word’s frequency in the topic) and lightness (the degree to which the word is characteristic of the topic, measured as frequency over document frequency). I also scale the final tag cloud depending on the size of the topic in the corpus. The correlation between topics is computed from the document-topic proportions. All of these then go into GraphViz, where nodes are displayed as images, with a lot of careful weighting and organising of the number of topic correlations to display, edge weights, etc. A sketch of the word cloud step follows.
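
Here is a minimal sketch of the word cloud step using the current wordcloud package; the function and the exact lightness mapping are mine for illustration, not the modified code I actually ran:

from wordcloud import WordCloud

def topic_cloud(freq, distinct, topic_weight, out_path):
    # freq: word -> frequency within the topic (drives font size)
    # distinct: word -> topic frequency / document frequency (drives lightness)
    # topic_weight: the topic's share of the corpus (drives overall image size)
    def colour(word, font_size, position, orientation, random_state=None, **kwargs):
        # characteristic words come out darker, generic words lighter
        lightness = int(75 - 60 * min(distinct.get(word, 0.0), 1.0))
        return "hsl(210, 50%, {}%)".format(lightness)
    side = int(200 + 600 * topic_weight)
    wc = WordCloud(width=side, height=side, background_color="white",
                   color_func=colour)
    wc.generate_from_frequencies(freq)
    wc.to_file(out_path)

# hypothetical example: a small "banking" topic covering 15% of the corpus
topic_cloud({"bank": 40, "loan": 25, "rate": 20, "market": 10},
            {"bank": 0.9, "loan": 0.8, "rate": 0.4, "market": 0.2},
            0.15, "topic_bank.png")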

Below are the results on ABC news articles from their website 2003–2012, collected by Dr. Jinjing Li of NATSEM in Canberra. These images are about 4000 by 4000 pixels. You will not be able to view them properly unless you:

  • get on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to the bottom right and click on “View full size” to bring it up,
  • and then zoom around to view.

To produce the banking one, I run the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1
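
For the graph step, the topic correlations mentioned above come straight from the document-topic proportions. A sketch of that computation, assuming the proportions have been exported to a plain text matrix (the filename here is hypothetical, not hca’s actual output naming):

import numpy as np

# theta: documents x topics matrix of topic proportions, one row per document
theta = np.loadtxt("B1.theta.txt")
corr = np.corrcoef(theta.T)   # corr[j, k]: correlation of topics j and k
# keep only the strongest positive correlations so the GraphViz layout stays readable
K = corr.shape[0]
edges = [(j, k, corr[j, k]) for j in range(K) for k in range(j + 1, K)
         if corr[j, k] > 0.2]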


Introduction to Data Science Tutorial

December 27, 2015

Over the next two days, the 28th and 29th of December, I’ll be giving a tutorial at KAIST hosted by Alice Oh. We just flew in last night from visiting Chengdu and Xi’an in China. The tutorial is based on the Introduction to Data Science unit, FIT5145, at Monash.



What “50 Years of Data Science” Leaves Out

November 28, 2015

The following comes from a blog post by Sean Owen, Director, Data Science @Cloudera / London:

I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.

Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics: it is a superset, yes, but an important superset.

Some additional comments:

  • Donoho, like Breiman before him, splits Statistics/Machine Learning into Generative versus Predictive modelling. I never really understood this split, because nearly 40% of published ML is generative modelling, as is the majority of my own work.
  • Other important aspects of Data Science we cover in our Monash course are:
    • data governance and data provenance
    • the business processes and “operationalisation” (putting the results to work to achieve value)
    • getting data, fusing different kinds of data, envisaging data science projects
  • These are above and beyond the area of Greater Data Science (Donoho, section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what industry calls a data scientist.

Also, as a Machine Learning guy who’s been doing Computational Statistics for 25 years, I think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks to do in areas like robotics, natural language processing and image processing. These require statistical ingenuity, domain understanding, and computational trickery. I maintain important contacts in both the Statistical community and the NLP community so I can do my work.
