Archive for November, 2015


What “50 Years of Data Science” Leaves Out

November 28, 2015

This blog post from Sean Owen, Director, Data Science @Cloudera / London

I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. … Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.

Much as I enjoyed reading Donoho’s work, I think it’s important for people to realise that Data Science isn’t just a new take on applied statistics. It is a superset, yes, but an important superset.

Some additional comments:

  • Donoho, like Breiman before him, splits Statistics/Machine Learning into Generative versus Predictive modelling.  I have never really understood this split, because nearly 40% of published ML is generative modelling, as is the majority of my own work.
  • Other important aspects of Data Science we cover in our Monash course are:
    • data governance and data provenance
    • the business processes and “operationalisation” (putting the results to work to achieve value)
    • getting data, fusing different kinds of data, envisaging data science projects
  • These are above and beyond the area of Greater Data Science (Donoho, section 8) that we refer to as the Data Analysis Process, which is probably the most in-demand skill for what the industry calls a data scientist.

As a Machine Learning guy who’s been doing Computational Statistics for 25 years, I also think it’s important to point out that Machine Learning exists as a separate field because there are so many amazing and challenging tasks to do in areas like robotics, natural language processing and image processing.  These require statistical ingenuity, domain understanding, and computational trickery.  I need important contacts in both the Statistical community and the NLP community to do my work.



GitHub activity

November 24, 2015

I spent most of this year on the Introduction to Data Science unit (the introductory unit at Monash) and on getting the Grad. Dip. of Data Science and the Master of Data Science up and running (some background here).

As a result, you can see the disastrous impact this has had on my GitHub activity, which is a measure of my coding productivity!


Wray’s activity on GitHub for 2015


Introduction to Data Science

November 7, 2015

On 11th-14th January 2016 I’ll be visiting the School of IT at Monash University Malaysia, which is located within the Bandar Sunway township in Malaysia just outside Kuala Lumpur city.  My talk should be on the Monday (11th).  The slides are here (available temporarily).

Title:  Introduction to Data Science

This 2-hour seminar works through some of the emerging highlights of Data Science, reviewing major videos, blogs and articles that helped mould the field.  It looks at processes and case studies to understand the many facets of working with data, and the significant effort in Data Science over and above the core task of Data Analysis.  So the seminar is a broad introduction to working with data rather than a deep dive into the world of statistics.  It is aimed at those with an IT background who either want to start in Data Science or work with it, for instance in management or as a data engineer.  Attendees should have a knowledge of information technology and computer science.

The talk will be extracted from our FIT5145 unit given in the Master of Data Science.


Basic tutorial: Oldie but a goody …

November 7, 2015

A student reminded me of Gregor Heinrich’s excellent introduction to topic modelling, which includes a great introduction to the underlying foundations such as Dirichlet distributions and multinomials.  Great reading for all students!  See

  • G. Heinrich, Parameter estimation for text analysis, Technical report, Fraunhofer IGD, 15 September 2009 at his publication page.
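
For students who want to get a feel for those foundations before reading the report, here is a minimal sketch in plain Python of how a Dirichlet prior and a multinomial fit together.  It is not taken from Heinrich's report: the function names, the 4-word vocabulary and the prior values are all hypothetical, chosen only for illustration.  It draws word probabilities from a Dirichlet, draws word counts from the resulting multinomial, and then uses the standard conjugacy result that the posterior mean is (prior pseudo-count + observed count) / (total of both).

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multinomial(probs, n, rng):
    """Draw n items from the categorical distribution `probs` and return the counts."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


def posterior_mean(alpha, counts):
    """Posterior mean of the probabilities, using Dirichlet-multinomial conjugacy."""
    total = sum(alpha) + sum(counts)
    return [(a + c) / total for a, c in zip(alpha, counts)]


rng = random.Random(0)                        # fixed seed so the run is repeatable
alpha = [0.5, 0.5, 0.5, 0.5]                  # hypothetical symmetric prior, 4-word vocabulary
theta = sample_dirichlet(alpha, rng)          # "true" word probabilities for one topic
counts = sample_multinomial(theta, 100, rng)  # 100 observed words
estimate = posterior_mean(alpha, counts)      # smoothed estimate of theta
```

With more observed words the estimate moves from the prior mean towards the observed word frequencies, which is the behaviour topic models rely on when smoothing sparse counts.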