Training a Pitman-Yor process tree with observed data at the leaves (part 2)

November 13, 2014

So here is a recap of the previous post:

A simple problem with hierarchical Pitman-Yor processes (PYPs) or Dirichlet processes (DP) is where you have a hierarchy of probability vectors and data only at the leaf nodes.  The task is to estimate the probabilities at the leaf nodes.  The hierarchy is used to allow sharing across children …

We have a tree of nodes which have probability vectors.  The formula used estimates a probability for node \theta from its parent \phi as follows:

\hat{p}^\theta_k ~=~ \frac{n^\theta_k - t^\theta_k d}{N^\theta+c} + \frac{c+T^\theta \,d}{N^\theta+c} \,\, \hat{p}^\phi_k               (1)

In this, the pair (d,c) are the parameters of the PYP called discount and concentration respectively, where 0 \le d < 1 and c \ge -d.  The t^\mu_k are magical auxiliary (latent) counts that make the PYP samplers work efficiently.  They are, importantly, constrained by their corresponding counts, and the counts are computed up from the leaves up (using \phi \in \mu to denote \phi are the children of \mu).   The Ns and Ts are totals. These relationships are summarised in the table below.

constraint t^\mu_k \le n^\mu_k
constraint t^\mu_k>0 whenever n^\mu_k>0
propagating counts n^\mu_k=\sum_{\phi \in \mu} t^\phi_k
total N^\mu=\sum_k n^\mu_k
total T^\mu=\sum_k t^\mu_k

It is important to note here how the counts propagate upwards. In some smoothing algorithms, propagation is n^\mu_k=\sum_{\phi \in \mu} n^\phi_k. This is not the case with us. Here the non-leaf nodes represent priors on the leaf nodes, they do not represent alternative models. PYP theory tells us how to make inference about these priors.

Now the theory of this configuration appears in many places but Marcus Hutter and I review it in Section 8 of an ArXiv paper, and its worked through for this case in my recent tutorial in the section “Working the n-gram Model”. The posterior distribution for the case with data at the leaf nodes is given by:

\prod_k \left(p_k^\top\right)^{n_k^\top} \prod_\mu\frac{(c_\mu|d_\mu)_{T^\mu}}{(c_\mu)_{N^\mu}} \prod_\mu\prod_k {\cal S}^{n^\mu_k}_{t^\mu_k,d_\mu}                 (2)

There are a lot of new terms here so let us review them:

(c|d)_{T} is the Pochhammer symbol (a generalised version is used, and a different notation) which is a form of rising factorial so its given by c (c+d) (c+2d) ... (c+(T-1)d); can be computed using Gamma functions
(c)_{T} a related rising factorial given by c (c+1) (c+2) ... (c+(T-1)) so it corresponds to (c_\mu|1)_{T}
\top the root node of the hierarchy
p^{\top}_k the initial/prior probability of the k-th outcome at the root
{\cal S}^{n}_{t,d} a generalised Stirling number of the second kind for discount d and counts (n,t)

Background theory for the Stirling number is reviewed in Section 5.2 of the ArXiv paper, and developed many decades ago by mathematicians and statisticians. We have C and Java libraries for this, for instance the C version is on MLOSS so its O(1) to evaluate, usually.

Now to turn this into a Gibbs sampler, we have to isolate the terms for each t^\mu_k individually. Let \theta be the parent of \mu. The constraints are:

constraint t^\mu_k \le n^\mu_k
constraint t^\mu_k>0 whenever n^\mu_k>0
propagating counts n^\theta_k = \sum_{\phi \in \theta} t^\phi_k
propagating counts t^\theta_k \le \sum_{\phi \in \theta} t^\phi_k

The propagating constraints do not apply when \mu is the root, and in this case, where $\vec{p}^\top$ is given, t_k^\top does not exist.  For \theta the parent of \mu, the constraints can be rearranged to become:

when n^\mu_k=0 t^\mu_k =0
when n^\mu_k=1 t^\mu_k =1
when n^\mu_k>1 1\le t^\mu_k \le n^\mu_k
t^\mu_k \ge t^\theta_k -\sum_{\phi \in \theta,\phi\ne \mu} t^\phi_k

The terms in Equation (2) involving t^\mu_k when \mu is not the root are:

\frac{(c_\mu|d_\mu)_{T^\mu}}{(c_\theta)_{N^\theta}} {\cal S}^{n^\mu_k}_{t^\mu_k,d_\mu}{\cal S}^{n^\theta_k}_{t^\theta_k,d_\theta} (3)

where note that t^\mu_k appears inside n^\theta_k, N^\theta and T^\mu. When \mu is the root, this becomes

\left(p_k^\top\right)^{n_k^\top} (c_\top|d_\top)_{T^\top} {\cal S}^{n^\top_k}_{t^\top_k,d_\top} (4)

The Gibbs sampler for the hierarchical model becomes as follows:

Gibbs sampler

For each node \mu do:

for each outcome k do:

sample t^\mu_k according to Equation (3) or (4) satisfying constraints in the last table above.



As noted by Teh, this can result in a very large range for t^\mu_k when n^\mu_k is large.  We overcome this by constraining t^\mu_k to be within plus or minus 10 of its last value, an approximation good enough for large n^\mu_k.

Moreover, we can initialise t^\mu_k to reasonable values by propagating up using the expected value formula from Section 5.3 of the ArXiv paper.  We only bother doing this when n^\mu_k>1.  The resultant initialisation procedure works in levels:  estimate \vec{t}^\mu at the leaves, propagate these to the \vec{n}^\mu at the next level, estimate \vec{t}^\mu at this next level, propagate these up, etc.


for each level, starting at the bottom do:

for each node \mu at the current level do:

for each outcome k do:

if n^\mu_k\le 1 then

t^\mu_k\leftarrow n^\mu_k

elseif  d_\mu>0 then # the PYP case

t^\mu_k\leftarrow max\left(1,floor\left(\frac{c_\mu}{d_\mu}\left(\frac{(c_\mu+d_\mu)_{n^\mu_k}}{(c_\mu)_{n^\mu_k}}-1\right)\right)\right)

else # the DP case

t^\mu_k \leftarrow max\left(1,floor\left(c_\mu\left(\psi_0 (c_\mu+n^\mu_k) - \psi_0 (c_\mu)\right)\right)\right)

\psi_0() is the digamma function




for each node \mu directly above current level do:

n^\mu_k = \sum_{\phi \in \mu} t^\phi_k



This represents a moderately efficient sampler for the hierarchical model.  To make this work efficiently you need to:

  • set up the data structures so the various counts, totals and parent nodes can be traversed and accessed quickly;
  • set up the sampling so leaf nodes and parents where n^\mu_k is guaranteed to be 0 or 1 can be efficiently ignored;
  • but more importantly, we also need to sample the discounts and concentrations.

Note sampling the discounts and concentrations is the most expensive part of the operation because each time you change them the counts \vec{t}^\mu also need to be resampled.  When the parameters are fixed, convergence is fast.  We will look at this final sampling task next.


One comment

  1. […] We will look at the first of these two tasks next. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: