A small model of ArXiv abstracts

April 12, 2016

I’ve been working with Dorota Glowacka of the University of Helsinki on a search system built on ArXiv.org.  We have a demo appearing in SIGIR 2016.  Here is a model with 100 topics, built using normalised gamma priors on topics (giving each topic a variance parameter as well as a mean parameter), on the 1.1 million abstracts to March 2016.  The model took about 6 hours to run on my desktop.

This is a huge PNG file (2.8Mb).  You will not be able to view it unless you:

  • load it up on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to bottom right, click on “View full size” to bring it up,
  • and then zoom around to view.

Talk at Data Science Meetup

April 4, 2016

Today I’ll be giving a version of my “document analysis” grand tour talk to the Data Science Meetup in Melbourne. The slides for the talk in PDF are here.  I also did a smaller version of one of my new graphics, this one on obesity.  It needs to be readable by general viewers some distance away, so everything must be proportionally larger.   The standard ones need a really big screen or you need to be up close!


Visualising a topic model

March 25, 2016

Finally decided to write a proper visualiser for topic models.   I used the WordCloud Python tool from AMueller[GitHub].   I modified it because my input is words with precomputed scores, rather than raw text.  Moreover, I wanted two dimensions for the words displayed: size (word frequency in the topic) and lightness (the degree to which the word is characteristic of the topic, measured as frequency in the topic over document frequency).  I also scale the final tag cloud depending on the size of the topic in the corpus.  The correlation between topics is computed from the document-topic proportions.  All of these then go into GraphViz, where nodes are displayed as images, and it takes a lot of careful weighting and organising of the number of topic correlations to display, edge weights, etc.
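To make the scoring concrete, here is a minimal sketch in plain Python of the quantities described above. It is not the code in the modified WordCloud tool. The function names, the rescaling of lightness to the range 0 to 1, and the choice of Pearson correlation are all my assumptions for illustration; the post only says that correlation is computed from the document-topic proportions.

```python
import math


def word_scores(topic_counts, doc_freq):
    """Return {word: (size, lightness)} for one topic.

    topic_counts: word -> number of times the word is assigned to the topic
    doc_freq:     word -> number of documents in the corpus containing the word
    size      = share of the topic's word occurrences taken by the word
    lightness = topic count over document frequency, rescaled so the most
                topic-specific word gets 1.0 (the rescaling is an assumption)
    """
    total = sum(topic_counts.values())
    ratios = {w: c / doc_freq[w] for w, c in topic_counts.items()}
    top = max(ratios.values())
    return {w: (c / total, ratios[w] / top) for w, c in topic_counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def topic_correlations(doc_topic):
    """doc_topic: one row per document, each row a list of topic proportions.

    Returns a K x K matrix of correlations between topic columns.
    """
    k = len(doc_topic[0])
    cols = [[row[t] for row in doc_topic] for t in range(k)]
    return [[pearson(cols[a], cols[b]) for b in range(k)] for a in range(k)]
```

With scores in this form, a frequent but unspecific word comes out large and faint, while a rarer word that appears mostly in this topic comes out smaller but strongly coloured.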

Below are the results on ABC news articles from their website, 2003-2012, collected by Dr. Jinjing Li of NATSEM in Canberra.   These images are about 4000 by 4000 pixels.  You will not be able to view them unless you:

  • get on a big screen,
  • click on the image to enter image view mode,
  • then scroll down to bottom right, click on “View full size” to bring it up,
  • and then zoom around to view.

To produce the banking one, I run the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1
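
The last step hands a graph to GraphViz, with one image node per topic and edges for the stronger topic correlations. As a rough sketch of what such a graph description can look like (this is not what topset2word.pl actually emits; the function name, the threshold and the edge styling are hypothetical), the following Python builds a DOT file from a list of topic image files and a correlation matrix:

```python
def topics_to_dot(images, corr, threshold=0.2):
    """Build an undirected GraphViz graph with one image node per topic.

    images:    list of image file paths, one per topic (hypothetical names)
    corr:      K x K matrix of topic correlations
    threshold: only correlations at or above this value become edges
               (an assumed cut-off, chosen for illustration)
    """
    lines = ["graph topics {", '  node [shape=none, label=""];']
    for i, path in enumerate(images):
        # Each topic's tag cloud is shown as the node itself.
        lines.append(f'  t{i} [image="{path}"];')
    for a in range(len(images)):
        for b in range(a + 1, len(images)):
            c = corr[a][b]
            if c >= threshold:
                # Stronger correlations get a higher weight and a thicker edge.
                lines.append(
                    f"  t{a} -- t{b} [weight={c:.2f}, penwidth={1 + 4 * c:.2f}];"
                )
    lines.append("}")
    return "\n".join(lines)
```

Saving the returned string to a file such as topics.dot and running `fdp -Tpng topics.dot -o topics.png` would lay it out with the same fdp engine selected by "-Kfdp" above.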


Introduction to Data Science Tutorial

December 27, 2015

For the next two days, 28th and 29th December, I’ll be giving a tutorial at KAIST hosted by Alice Oh.   We just flew in last night from visiting Chengdu and Xi’an in China.  The tutorial is based on the Introduction to Data Science unit, FIT5145, at Monash.



What “50 Years of Data Science” Leaves Out

November 28, 2015

This blog post is from Sean Owen, Director, Data Science @Cloudera / London:

I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.

Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics: a superset, yes, but an important superset.

Some additional comments:

  • Donoho, like Breiman before him, splits Statistics/Machine Learning into Generative versus Predictive modelling.  I have never really understood this, because nearly 40% of published ML is generative modelling, as is the majority of my work.
  • Other important aspects of Data Science we cover in our Monash course are:
    • data governance and data provenance
    • the business processes and “operationalisation” (putting the results to work to achieve value)
    • getting data, fusing different kinds of data, envisaging data science projects
  • These are above and beyond the area of Greater Data Science (Donoho, section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what the industry calls a data scientist.

Also, as a Machine Learning guy who’s been doing Computational Statistics for 25 years, I think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks to do in areas like robotics, natural language processing and image processing.  These require statistical ingenuity, domain understanding, and computational trickery.   I have important contacts in both the Statistical community and the NLP community so I can do my work.



Github activity

November 24, 2015

So most of this year I spent doing the Introduction to Data Science (introductory unit at Monash) and getting the Grad. Dip. of Data Science and the Master of Data Science up and running (some background here).

As a result, you can see the disastrous impact it has had on my Github activity, which is a measure of my coding productivity!


Wray’s activity on Github for 2015


Introduction to Data Science

November 7, 2015

On 11th-14th January 2016 I’ll be visiting the School of IT at Monash University Malaysia, which is located within the Bandar Sunway township in Malaysia just outside Kuala Lumpur city.  My talk should be on the Monday (11th).  The slides are here (available temporarily).

Title:  Introduction to Data Science

This 2 hour seminar works through some of the emerging highlights of Data Science, reviewing major videos, blogs and articles that helped mold the field. This seminar looks at processes and case studies to understand the many facets of working with data, and the significant effort in Data Science over and above the core task of Data Analysis.  So the series is a broad introduction to working with data rather than a deep dive into the world of statistics. The seminar is aimed at those with an IT background who either want to start in Data Science or work with it, for instance in management or as a data engineer.  Attendees should have a knowledge of information technology and computer science.

The talk will be extracted from our FIT5145 unit given in the Master of Data Science.