Fabulous data science tag cloud

June 2, 2018

This comes from PhD student Caitlin Doogan.

Tag Cloud on Data by Caitlin Doogan


Graduating MDS students

May 25, 2018

Our first large batch of MDS students has graduated. Here are some who attended the ceremony. Really great students!


MDS Graduation May 2018


Some research papers on hierarchical models

May 15, 2018

“Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes”, by François Petitjean, Wray Buntine, Geoffrey I. Webb and Nayyar Zaidi, in Machine Learning, 18th May 2018, DOI 10.1007/s10994-018-5718-0. Available online at Springer Link. To be presented at ECML-PKDD 2018 in Dublin in September 2018.

Abstract This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest for both 0–1 loss and RMSE, albeit just on categorical datasets. As data assets become larger, entering the hyped world of “big”, efficient accurate classification requires three main elements: (1) classifiers with low bias that can capture the fine-detail of large datasets; (2) out-of-core learners that can learn from data without having to hold it all in main memory; and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is made lower to accurately model classification tasks, so is the accuracy of their parameters’ estimates, as each parameter is estimated from ever decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even with lower bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.
Keywords Bayesian network · Parameter estimation · Graphical models · Dirichlet processes · Smoothing · Classification
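The full hierarchical Dirichlet process machinery is beyond a blog post, but the core idea, shrinking each sparse conditional probability estimate towards the estimate from a broader parent context, can be sketched in a few lines. This is a toy back-off smoother for illustration only, not the paper's algorithm, and all names here are mine:

```python
from collections import Counter

def smoothed_cpt(counts, parent_counts, alpha=1.0):
    """Toy estimate of P(x | context) that shrinks a sparse child
    distribution towards its (better-estimated) parent distribution.

    counts:        Counter of x values seen in the child context.
    parent_counts: Counter of x values in the broader parent context.
    alpha:         concentration; larger means stronger shrinkage.
    """
    n = sum(counts.values())          # child sample size
    m = sum(parent_counts.values())   # parent sample size (assumed > 0)
    support = set(counts) | set(parent_counts)
    # Mix the child's raw counts with the parent's relative frequencies.
    return {x: (counts[x] + alpha * parent_counts[x] / m) / (n + alpha)
            for x in support}

# Example: the child context saw only 'a', but the parent says 'a' and 'b'
# are equally likely, so the estimate keeps some mass on 'b'.
p = smoothed_cpt(Counter({'a': 2}), Counter({'a': 5, 'b': 5}))
```

In the paper the amount of shrinkage is itself learned by the HDP, and the hierarchy can be many levels deep (one per parent attribute), but the back-off intuition is the same.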

“Leveraging external information in topic modelling”, by He Zhao, Lan Du, Wray Buntine & Gang Liu, in Knowledge and Information Systems, 12th May 2018, DOI 10.1007/s10115-018-1213-y.  Available online at Springer Link.  This is an update of our ICDM 2017 paper.

Abstract Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.
Keywords Latent Dirichlet allocation · Side information · Data augmentation · Gibbs sampling

“Experiments with Learning Graphical Models on Text”, by Joan Capdevila, He Zhao, François Petitjean and Wray Buntine, in Behaviormetrika, 8th May 2018, DOI 10.1007/s41237-018-0050-3.  Available online at Springer Link.  This is work done by Joan Capdevila during his visit to Monash in 2017.

Abstract A rich variety of models are now in use for unsupervised modelling of text documents, and, in particular, a rich variety of graphical models exist, with and without latent variables. To date, there is inadequate understanding about the comparative performance of these, partly because they are subtly different, and they have been proposed and evaluated in different contexts. This paper reports on our experiments with a representative set of state of the art models: chordal graphs, matrix factorisation, and hierarchical latent tree models. For the chordal graphs, we use different scoring functions. For matrix factorisation models, we use different hierarchical priors, asymmetric priors on components. We use Boolean matrix factorisation rather than topic models, so we can do comparable evaluations. The experiments perform a number of evaluations: probability for each document, omni-directional prediction which predicts different variables, and anomaly detection. We find that matrix factorisation performed well at anomaly detection but poorly on the prediction task. Chordal graph learning performed the best generally, and probably due to its lower bias, often out-performed hierarchical latent trees.
Keywords Graphical models · Document analysis · Unsupervised learning · Matrix factorisation · Latent variables · Evaluation





The Big Tech Healthcare Invasion

April 25, 2018

The Big Tech Healthcare Invasion Infographic


Facebook and Data Science

April 6, 2018

My favorite topics in teaching, other than Bayesian statistics (“of course”), are interesting applications, ethics and impact on society. One of the things I always do is point out that many of the big technology companies are fundamentally “data” companies selling their consumer data to advertisers. Lots of gnarly ethical issues here. But the huge sleeper issue in all this is medical informatics: medical research really needs consumer lifestyle data if it wants to make major breakthroughs in the lifestyle diseases that are gradually strangling the Western economies. Even gathering lifestyle data is difficult (think diet, for instance), let alone dealing with the ethics and privacy involved.

Anyway, great piece by Jennifer Duke, “You’re worth $2.54 to Facebook: Care to pay more?”, in the Australian press today (SMH, The Age). We had an insightful 15-minute discussion on the phone on Friday and I managed to get a worthy quote into her article. I was impressed by her broad knowledge of the topics. Good to see our journalists know their stuff.

Reuters has an extensive piece outlining details and the election influence, Cambridge Analytica CEO claims influence on U.S. election, Facebook questioned.

The Conversation seems to be surfing the media hubbub with a dozen or more articles from the academic community in the last week or so. Here are some that caught my eye.

Some other background articles are:

  • An older article in the Huffington Post, Didn’t Read Facebook’s Fine Print? Here’s Exactly What It Says, comments on an older terms of service, but a lot of it still applies.

  • The Australian Government has fairly strong privacy laws under the Privacy Act and its amendments. These are described in the Guide to securing personal information, which uses a broad definition of personal information that probably covers most of what Facebook keeps. A special class of information, sensitive information, which includes medical and financial details, silent phone numbers, etc., requires a higher level of protection.
  • In 2014 Cambridge researchers Kosinski, Stillwell and Graepel published an article in PNAS (Proc. National Academy of Sciences of the USA) showing that “Private traits and attributes are predictable from digital records of human behavior”. If that sounds too technical, the short version is:
    If you’re a frequent user, Facebook probably knows your religion, sexual preferences, major personality traits and any serious diseases you might have, even if you take great care not to expose them.
    Keep in mind this was the best known in a long line of research. When this information is inferred (i.e., predicted using a statistical algorithm) it is called implicit information.
  • Note Facebook has been relaxing their privacy default settings over the years, The Evolution of Privacy on Facebook, according to Matt McKeon.  This makes their job of monetising their users easier.
  • Note it is not clear what the Privacy Act says about implicit information, which can be very hard to extract and may require access to a fuller database to make inferences.

Personally, I believe online data privacy will evolve in fits and starts, but there are a lot of technical hurdles. Online advertising, for instance, needs to turn around impressions at great speed and doesn’t have time to work through complex APIs, so I suspect advertisers will need the personal data in some form on their own servers. Sounds like a perfect application for cryptosystems to me, if it can be made fast enough. As for data harvesting, well, I expect that will go on forever.





What’s Hot in IT

March 2, 2018

Attended the What’s Hot in IT event held by the Victorian ICT for Women group. See their tweets about the event. Prof. Maria Garcia De La Banda (2nd from left) gave a fabulous overview.



Trying out DataCamp this semester

February 21, 2018

Our Master of Data Science students explore and discuss a lot of things. I got a lot of requests to include the excellent material from DataCamp:

DataCamp – which supports data science education for free

So we’ll see how it goes.  Not sure how well I’ll get to integrate it, because this semester I’m working more on our introductory statistics class.