
Some research papers on hierarchical models

May 15, 2018

“Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes”, by François Petitjean, Wray Buntine, Geoffrey I. Webb and Nayyar Zaidi, in Machine Learning, 18th May 2018, DOI 10.1007/s10994-018-5718-0.  (Soon to be) Available online at Springer Link.  To be presented at ECML-PKDD 2018 in Dublin in September 2018.

Abstract This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs).  The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest for both 0–1 loss and RMSE, albeit just on categorical datasets. As data assets become larger, entering the hyped world of “big”, efficient accurate classification requires three main elements: (1) classifiers with low bias that can capture the fine detail of large datasets; (2) out-of-core learners that can learn from data without having to hold it all in main memory; and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is made lower to accurately model classification tasks, so is the accuracy of their parameters’ estimates, as each parameter is estimated from ever decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even with lower bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.
Keywords Bayesian network · Parameter estimation · Graphical models · Dirichlet processes · Smoothing · Classification

“Leveraging external information in topic modelling”, by He Zhao, Lan Du, Wray Buntine & Gang Liu, in Knowledge and Information Systems, 12th May 2018, DOI 10.1007/s10115-018-1213-y.  Available online at Springer Link.  This is an update of our ICDM 2017 paper.

Abstract Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.
Keywords Latent Dirichlet allocation · Side information · Data augmentation ·
Gibbs sampling

“Experiments with Learning Graphical Models on Text”, by Joan Capdevila, He Zhao, François Petitjean and Wray Buntine, in Behaviormetrika, 8th May 2018, DOI 10.1007/s41237-018-0050-3.  Available online at Springer Link.  This is work done by Joan Capdevila during his visit to Monash in 2017.

Abstract A rich variety of models are now in use for unsupervised modelling of text documents, and, in particular, a rich variety of graphical models exist, with and without latent variables. To date, there is inadequate understanding about the comparative performance of these, partly because they are subtly different, and they have been proposed and evaluated in different contexts. This paper reports on our experiments with a representative set of state of the art models: chordal graphs, matrix factorisation, and hierarchical latent tree models. For the chordal graphs, we use different scoring functions. For matrix factorisation models, we use different hierarchical priors, asymmetric priors on components. We use Boolean matrix factorisation rather than topic models, so we can do comparable evaluations. The experiments perform a number of evaluations: probability for each document, omni-directional prediction which predicts different variables, and anomaly detection. We find that matrix factorisation performed well at anomaly detection but poorly on the prediction task. Chordal graph learning performed the best generally, and probably due to its lower bias, often out-performed hierarchical latent trees.
Keywords Graphical models · Document analysis · Unsupervised learning ·
Matrix factorisation · Latent variables · Evaluation


The Big Tech Healthcare Invasion

April 25, 2018

The Big Tech Healthcare Invasion Infographic


Facebook and Data Science

April 6, 2018

My favorite topics in teaching, other than Bayesian statistics (“of course”), are interesting applications, ethics and impact on society.  One of the things I always do is point out that many of the big technology companies are fundamentally “data” companies, selling their consumers’ data to advertisers.  There are lots of gnarly ethical issues here.  But the huge sleeper issue in all this is medical informatics: medical research really needs consumer lifestyle data if it wants to make major breakthroughs on the lifestyle diseases that are gradually strangling Western economies.  Even gathering lifestyle data is difficult (think diet, for instance), let alone dealing with the ethics and privacy involved.

Anyway, great piece by Jennifer Duke “You’re worth $2.54 to Facebook: Care to pay more?” in the Australian press today (SMH, The Age).  We had an insightful 15 minute discussion on the phone on Friday and I managed to get a worthy quote in her article.  Impressed by her broad knowledge of the topics.  Good to see our journalists know their stuff.

Reuters has an extensive piece outlining details and the election influence, Cambridge Analytica CEO claims influence on U.S. election, Facebook questioned.

The Conversation seems to be surfing the media hubbub with a dozen or more articles from the academic community in the last week or so.  Here are some that caught my eye.

Some other background articles are:

  • An older article in the Huffington Post, Didn’t Read Facebook’s Fine Print? Here’s Exactly What It Says, which comments on an older terms of service, though a lot still applies.

  • The Australian Government has fairly strong privacy laws under the Privacy Act and its amendments.  These are described in the Guide to securing personal information, which has a broad definition of personal information that probably covers most of what Facebook keeps.  There is also a special class of information, sensitive information, which includes medical and financial details, silent phone numbers, etc., and which requires a higher level of protection.
  • In 2014 Cambridge researchers Kosinski, Stillwell and Graepel published an article in PNAS (Proc. National Academy of Sciences of the USA) showing that Private traits and attributes are predictable from digital records of human behavior.  If that sounds too technical, the short version is:
    If you’re a frequent user, Facebook probably knows your religion, sexual preferences, any serious diseases you might have, and your major personality traits, even if you take great care not to expose them.
    Keep in mind this was just the best known of a long line of research.  When this information is inferred (i.e., predicted using a statistical algorithm) it is called implicit information.
  • Note Facebook has been relaxing their privacy default settings over the years, The Evolution of Privacy on Facebook, according to Matt McKeon.  This makes their job of monetising their users easier.
  • It is not clear what the Privacy Act says about implicit information, which can be very hard to extract and can require access to a fuller database to make inferences.

Personally, I believe online data privacy will evolve in fits and starts, but there are a lot of technical hurdles.  Online advertising, for instance, needs to turn around impressions at great speed and doesn’t have time to work through complex APIs, so I suspect advertisers will need the personal data in some form on their own servers.  Sounds like a perfect application for cryptosystems to me, if it can be made fast enough.  As for data harvesting, well, I expect that will go on forever.



What’s Hot in IT

March 2, 2018

Attended the What’s Hot in IT event held by the Victorian ICT for Women group.  See their tweets about the event.  Prof. Maria Garcia De La Banda (2nd from left) gave a fabulous overview.

VICT4W


Trying out DataCamp this semester

February 21, 2018

Our Master of Data Science students explore and discuss a lot of resources.  I got a lot of requests to include the excellent material from DataCamp:


DataCamp – who support data science education for free

So we’ll see how it goes.  Not sure how well I’ll get to integrate it, because this semester I’m working more on our introductory statistics class.


Whither the Scientific Method

January 28, 2018

Long before the Industrial Age in Europe we had the Dark Ages. Popular culture tells us it was believed that the Earth was flat, witches caused the plague, and the ways of the world were decreed by kings, or God himself. While rationalist explanations of the world appeared independently in many ancient civilisations, the scientific method as we know it became prominent in the 19th century as a remarkable series of scientific and engineering discoveries propelled the world into the industrial age. Indeed, Karl Pearson stated that “the scientific method is the sole gateway to the whole region of knowledge”.

With the pre-eminence of science in our modern society, controversies about science often occur in the media and public discussion, and the list of such areas is large. It doesn’t help that aspects of society, politics or religion have been falsely dressed up as “science,” so-called scientism.  The expression “the science is settled” is a phrase used by global warming skeptics seeking to align global warming views with scientism (i.e., science is never settled, so how can global warming be settled?).  Note we can also view the statement “the science is settled” as a Socratic noble lie, thereby justifying its use in public discussion.

So apart from false applications of science, i.e., scientism, what flaws with the scientific method are there?

Flaws in the Science?

Medical science has suffered bad press in recent times. Best known through the popular work of John Ioannidis, provocatively titled “Why Most Published Research Findings Are False”, testaments from famous and authoritative medical researchers about the flaws of published medical research abound. As an empirical computer scientist, I can assure you flaws in research are not restricted to medical science; it’s just that medical science is perhaps our most societally important area of science.

Some of the discussed flaws in research are the misuse of p-values through a variety of means. For an entertaining example, see John Bohannon’s “I Fooled Millions Into Thinking Chocolate Helps Weight Loss.” Other flaws are so-called surrogate endpoints (a biomarker such as a blood test is used as a substitute for a clinical endpoint such as a heart attack), and others still are poorly matched motivations: for academics the imperative is “publish or perish”, but for industry it is “publish and profit”.  Many lists of flaws have been published.

In all, however, the scientific method holds up as a valid approach because the flaws invariably amount to corruption of the original method. One way the medical community addresses this is by adding an additional layer on top of the standard scientific method, often called the systematic review.  This is where unbiased experts review a series of scientific studies on a particular question, make judgments about the quality of the scientific method and the evidence, and develop recommendations for healthcare. The systematic review is, if you like, quality control for the scientific method.

The End of Theory?

Another seeming assault on the scientific method comes from data science. In 2008 Chris Anderson of Wired published a controversial blog about “The End of Theory”. The idea is that the deluge of data completely changes how we should progress with scientific discovery. We don’t need theory, he claims; we just extract information from the deluge of data. The responses, and there are many, came quickly: for instance, Massimo Pigliucci asked “But, if we stop looking for models and hypotheses, are we still really doing science?”, and others questioned the veracity and appropriateness of much observational data, and hence its suitability as a subject of analysis.

Anderson’s “end of theory,” like John Horgan’s “end of science”, is not so much wrong as more complex than it first seems. The relationship between data science and the scientific method is not simple. To understand this, consider that the poster child for 19th century science was physics.  Physics, a mathematical science, is fundamentally different from, say, modern medicine. In physics, Eugene Wigner’s notion of the “unreasonable effectiveness of mathematics” holds sway: from a concise theory we can derive enormous consequences. A relatively small number of well-chosen scientific hypotheses have uncovered vast regions of the engineering and physics universe. For instance, weather predictions are currently based on simulations built using the Newtonian laws of physics coupled with geophysical and weather data.

The imbalance (a small number of scientific hypotheses needed to justify a large area of science) indeed suits the scientific method. Peter Norvig, however, points out this is not feasible in areas such as biology and medical science, where the unreasonable effectiveness of mathematics does not hold. In these areas, the complexities of the underlying processes mean we cannot necessarily simulate the impact of eating raw cocoa or drinking red wine on heart health, because the simulation or derivations from fundamental properties of nature are just too complex.

Norvig’s colleagues at Google, some of the founders of data science, instead refer to the unreasonable effectiveness of data. That is, the fundamental complexity of some sciences means we should instead be using data-driven processes for discovery of scientific details.

Data Dredging

To understand how data science can change the scientific method, we need to look at how it should not change it. Statisticians like to talk derisively about data dredging, with p-hacking being the best known example. As in the chocolate study mentioned above, this is where studies are repeated (in some way) until a significant p-value is obtained. They argue data-driven discovery is dangerous. But this is the wrong viewpoint for data science. In complex areas like medical science, we have many possible hypotheses, and our intuitions can be poor in the complex world of biology.
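The arithmetic behind data dredging is easy to see with a toy simulation of my own (it is an illustration, not taken from any of the studies above): even when there is no effect at all, repeating a null experiment a modest number of times will usually turn up a “significant” p-value.

```python
import math
import random

def p_value(sample):
    # Two-sided z-test of "mean == 0", assuming known unit variance.
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def dredge(num_tries, n=30, alpha=0.05):
    # Repeat a null experiment (pure noise, no real effect) until
    # something "significant" appears, or we give up.
    for _ in range(num_tries):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]
        if p_value(sample) < alpha:
            return True   # a spurious "discovery"
    return False

random.seed(1)
trials = 2000
hits = sum(dredge(20) for _ in range(trials))
print(f"false 'discovery' rate with 20 tries: {hits / trials:.2f}")
# With 20 tries at alpha = 0.05, the chance of at least one spurious
# hit is 1 - 0.95**20, roughly 0.64 -- from pure noise.
```

Corrections for multiple testing exist, of course, but only if the analyst admits how many hypotheses were actually tried.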

Computer science has an elegant theory of complexity called NP-completeness, which is the notion that one may need to test an exponential number of things before finding one that works. This indeed is the situation we find ourselves with hypothesis testing in the broader scientific world where the unreasonable effectiveness of mathematics fails.

In the early days of machine learning I worked at Prof. Ross Quinlan’s lab in Sydney. We soon discovered our own version of Ioannidis’ flaws in medical science that applied to machine learning. We called it theory overfitting, in contrast to regular overfitting, which is an artifact of the bias-variance dilemma in statistics and machine learning. People would test a bunch of different theories on a small number of datasets, say five, until eventually one worked, and then write it up and publish it. In truth it’s just another variant of p-hacking.

In data science, if we’re applying machine learning or neural network algorithms to a body of data, we are invariably trying to solve an NP-complete problem and are thus subject to overfitting or p-hacking. Even if we employ careful statistical methods to try to overcome this, we may subsequently be doing theory overfitting. However, if we don’t employ machine learning methods, we may never uncover reasonable hypotheses in the exponential pool of candidates. This is the conundrum of data science for the scientific method when used in broader non-mathematical domains.

Powering the Scientific Method

Organisations and hard-nosed businesses have this conundrum effectively solved. At Kaggle, for instance, and in TREC competitions, a test data set is always hidden from the machine learners and only used for a final validation, which acts like a (final) cycle of the scientific method. The initial “develop a general theory” step of the scientific method has been done with machine learning, which can be considered millions of embedded hypothesise-test cycles. Thus we have an epicycle view of the scientific method.
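The hidden-test-set protocol can be sketched in a few lines (my own toy example with made-up constant-predictor “models”, not Kaggle’s or TREC’s actual machinery): the dev set may be consulted in as many hypothesise-test epicycles as you like, but the held-out test set is scored exactly once, at the end.

```python
import random
import statistics

def three_way_split(values, dev_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then carve off dev and test portions; the test
    # portion is hidden until the single final validation.
    items = list(values)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test, n_dev = int(n * test_frac), int(n * dev_frac)
    train = items[n_test + n_dev:]
    dev = items[n_test:n_test + n_dev]
    test = items[:n_test]
    return train, dev, test

def mse(prediction, data):
    return sum((x - prediction) ** 2 for x in data) / len(data)

# Toy "models": summary statistics used as constant predictors.
models = {"mean": statistics.mean, "median": statistics.median}

rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(500)]
train, dev, test = three_way_split(data)

# Model selection may loop over the dev set as often as it likes ...
best_name = min(models, key=lambda name: mse(models[name](train), dev))
# ... but the hidden test set is consulted exactly once, at the end.
final_error = mse(models[best_name](train), test)
print(best_name, round(final_error, 2))
```

The one-shot test score is an honest estimate precisely because none of the epicycles were allowed to peek at it.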

But applying this approach in the medical world is not straightforward.  The medical research world keeps data registries that feasibly can be used to obtain data for discovery purposes. However, to obtain data, one usually has to apply for ethics clearance or approval, demonstrating that the intended use of the data is sound. The ethics committees who oversee the approval are the gatekeepers of data, and oftentimes they expect to see a valid scientific plan, not an open-ended discovery proposal. With the epicycle view of the scientific method, registries, when they release data for discovery exercises, would need to withhold some data for a final validation step in order to preserve the validity of the scientific method.


To good health!

January 10, 2018

So enrolment sessions start soon for our incoming Master of Data Science students.  I know it’s a stressful time for some students in terms of “life”.  I usually talk briefly about staying healthy, and Monash offers various services to support this.  But for PhD students I think it’s important to take this on as a lifestyle objective.  They are undertaking a knowledge-intensive career path and brain health will be critical for their future career.

Disclaimer:  Now, this page is full of opinions and pointers to, in some cases, controversial material.  I’m just a little old computer science professor, so my opinions have no real backing, and I have no recognised expertise. All care but no responsibility for what I say!

The fact is, staying healthy, and understanding how to stay healthy, in the modern world is fraught with challenges.  To understand this, consider the following:

  1. The official Australian government position on colds and flu prevention, and the official USA government position:  hygiene and vaccines.  What’s missing:  discussion of healthy diet and exercise to strengthen and repair the immune system.
  2. Time magazine reports that extensive research shows vitamin D helps prevent colds and flu, so some sunshine is also important.  No mention of this in the official government positions above!
  3. Believe it or not, in the USA prescription drugs are the third leading cause of death!  There is a larger issue here in that most published medical research findings are false.  Note, I see this as a systemic thing not limited to medical research, and the medical research community, given its importance, has extensive, concerted efforts like systematic reviews to address the issue.
  4. Tobacco science is a term used to describe fake science protecting an industry.  Read about the Tobacco Institute and see the movie The Insider.  How much of this goes on in the food industry?
  5. Sugar is now known to be very damaging to health.  Here is a hard-hitting discussion about it, though note quite a few of these claims are considered controversial.  But it is known that sugar suppresses the immune system.  Figuring out your sugar consumption is challenging.  There are rumors (in movie form) of tobacco science going on here too.
  6. Energy drinks rot the teeth, like soft drinks.  It’s due to the high acid content.  It’s certainly not clear they give any energy.
  7. Artificial sweeteners are not a substitute; in fact evidence suggests they have poor health impacts, and they mess up the brain’s analysis of your food intake.
  8. Fats are the subject of a massive onslaught from advertisers.  For years we were told to avoid butter and use margarine instead, but now it seems butter is good.    The rather hilarious and utterly confusing history of health advice about butter is in this Butter Studies Roundup. Current conflicting advice is now being broadcast about the humble coconut.
  9. The healthiness of organic produce is currently a propaganda battleground.  None other than former tobacco scientist Henry I Miller (he was a founder of TASSC) has claimed it’s an expensive scam.  Hint:  organics are also lower in toxic pesticide residue, but no mention of that.
  10. The commercial world has taken on healthy eating big time, and it is the fastest growing segment of the food industry.  Monash University has done a wonderful job of getting really good fast food vendors at the Caulfield campus food court.

Summary:  There is lots of conflicting and bad advice out there!  Heck, even the government websites seem to have errors of omission.

Now, if we consider the specific position of someone who wants their brain to function well, then consider the following:

  1. Short term exercise is known to boost mental performance.
  2. Meditation and mindfulness is also known to boost performance in exams.
  3. Long term sitting is considered to be as bad for health as smoking!  Here is a poster of the dangers.
  4. There are also lifestyle recommendations about studying from scientists:  don’t cram for subjects, learn slowly over the semester.
  5. Recent studies show the brain can be encouraged to grow new cells.
  6. The brain is mostly fat, so we need healthy fats to work well.  Don’t believe a lot of what you read about fats!  Cholesterol is also important for the brain.
  7. Sugar consumption (e.g., soft drinks, commercial juices, commercial cereals, flavoured yogurts, etc. etc. etc.) is bad for the brain, as well as the immune system.
  8. Canola oil is bad for the brain.  This one is important because most cheap salad oils, margarines and many food products are loaded with it.
  9. All sorts of food and chemicals are bad for the brain.  Here’s a TEDx talk on details.  Note TEDx means not official (is this a reliable information source?).
  10. Deep sleep is the basis for memory, learning and health.   In particular, without deep sleep, your brain will not be functioning properly and your memory will be impaired.  Here is a disquieting Google talk on health and sleep (along the lines of the hideous anti-smoking adverts some countries have), but there are many more on this.
  11. Adult neurogenesis is the process by which we adults gain new brain cells.  Not surprisingly, this is very popular amongst the Silicon Valley crowd, and I suspect it is also a domain where snake-oil salesmen like to peddle.  Nonetheless, here is a video on it:  a TED talk.

Note, for each of these, there are tens to hundreds of good articles and scientific literature to back it up, though oftentimes conflicting scientific literature as well.  I’m just giving generally readable and somewhat respectable accounts.  A lot of these issues remain controversial, and possibly there is some tobacco science going on, but it’s hard for us non-experts to really know.

Anyway, I hope from this you understand the complexity of trying to stay healthy, and trying to keep your brain functioning well, in the modern world.

I’m probably a bit extreme but I say,

About eating and food:

  • Try and cook your own meals from real ingredients.  After a while, it becomes easy, and it’s a great way to wind down with friends.
  • If someone’s great grandmother (anyone’s: Fiji, Vietnam, Sweden, …) didn’t make the food 100 years ago, it’s probably not good for you.
  • Don’t take dietary or health advice from Big Food.  In fact, looking at the government advice (listed above) on the flu, I’d say theirs is missing some major points too on some issues.
  • Try and avoid packaged meals, fast food, and canned and bottled drinks.  Likewise, avoid most commercial fruit juices, which have way too much sugar and have lost too much of the fabulous nutrients in the original fruit.
  • Go low sugar, low refined carbohydrates and healthy fats.  It’s a lifestyle thing, not a diet.  Once you do, you’ll discover all the amazing subtle flavours you’ve missed from traditional foods, and realise how horrible standard breads, sweet desserts, snack bars and cakes really are: the sugar masks the real flavour and leaves a lingering bad aftertaste, and refining strips a lot of the flavour from carbohydrates.
    • Healthy fats are challenging to maintain because Big Food likes to put unhealthy canola oil in everything:  most salad oils, hummus and deli mixes are mostly canola oil, as is margarine.
  • Just avoid artificial sweeteners.  Once you’ve gone cold turkey and got off the sugar addiction you won’t be craving it, and you’ll feel better for it.
  • Health slogans on food products, like “low fat” and “low cholesterol”, often mean it’s bad for you!  Low fat usually means high sugar, for instance.

About other aspects of health:

  • Get exercise, and make it a lifestyle thing.  When you’re older, you’ll discover you cannot function well as a knowledge worker without it.
  • Don’t sit at your desk for long hours.  You need to get up and move around every hour!  Also, become aware of your posture.   Don’t become a hunchback!  Some 2nd years are already heading that way.
  • When you’re mentally worn out, a quick nap or a brisk walk does wonders, and both have scientific backing.
  • Make sure you are getting proper sleep.  That can mean organising your assignments and study properly so you don’t need to do a bunch of all-nighters to get through.  But it also means setting up the right environment at home for sleep.
  • I know of few cases where drugs and alcohol support good health or brain functioning, including so-called smart drugs or nootropics.  Most are dangerous to the liver, as are many medicines.  Headache and pain medicine is far more dangerous and damaging than many other things!
  • Routine … that’s what the body needs.  For sleep, for eating, for study, for exercise, routine is a critical part of making it all function well.

Anyway, I have said too much already.  In case you’re wondering, I am now on holidays.  No time for a Data Science professor to talk about this stuff during semester!  But keep in mind, I have no qualifications or expertise when it comes to health.  These are mere opinions!