# Stick-breaking construction for the Indian buffet process

发布时间:2021-12-09 17:15:42

Stick-breaking Construction for the Indian Buffet Process

Yee Whye Teh Gatsby Computational Neuroscience Unit University College London 17 Queen Square London WC1N 3AR, UK ywteh@gatsby.ucl.ac.uk

Dilan G¨ rur o ¨ MPI for Biological Cybernetics Dept. Sch¨ lkopf o Spemannstrasse 38 72076 T¨ bingen, Germany u dilan.gorur@tuebingen.mpg.de

Zoubin Ghahramani Department of Engineering University of Cambridge Trumpington Street Cambridge CB2 1PZ, UK zoubin@eng.cam.ac.uk

Abstract

The Indian buffet process (IBP) is a Bayesian nonparametric distribution whereby objects are modelled using an unbounded number of latent features. In this paper we derive a stick-breaking representation for the IBP. Based on this new representation, we develop slice samplers for the IBP that are ef?cient, easy to implement and are more generally applicable than the currently available Gibbs sampler. This representation, along with the work of Thibaux and Jordan [17], also illuminates interesting theoretical connections between the IBP, Chinese restaurant processes, Beta processes and Dirichlet processes.

1 INTRODUCTION

The Indian Buffet Process (IBP) is a distribution over binary matrices consisting of N > 0 rows and an unbounded number of columns [6]. These binary matrices can be interpreted as follows: each row corresponds to an object, each column to a feature, and a 1 in entry (i, k) indicates object i has feature k. For example, objects can be movies like “Terminator 2”, “Shrek” and “Shanghai Knights”, while features can be “action”, “comedy”, “stars Jackie Chan”, and the matrix can be [101; 010; 110] in Matlab notation. Like the Chinese Restaurant Process (CRP) [1], the IBP provides a tool for de?ning nonparametric Bayesian models with latent variables. However, unlike the CRP, in which each object belongs to one and only one of in?nitely many latent classes, the IBP allows each object to possess potentially any combination of in?nitely many latent features. This added ?exibility has resulted in a great deal of interest in the IBP, and the development of a range of interesting applications. These applications include models for choice behaviour [5], protein-protein interactions [2], the structure of causal graphs [19], dyadic data for collaborative ?ltering applications [10], and human similarity judgments [11].

In this paper, we derive a new, stick-breaking representation for the IBP, a development which is analogous to Sethuraman’s seminal stick-breaking representation for CRPs [15]. In this representation, as we will see in Section 3, the probability of each feature is represented explicitly by a stick of length between 0 and 1. Sethuraman’s representation paved the way for both novel samplers for and generalizations of CRPs [7]. Similarly, we show how our novel stick-breaking representation of the IBP can be used to develop new slice samplers for IBPs that are ef?cient, easy to implement and have better applicability to non-conjugate models (Sections 4, 5.2, 6). This new representation also suggests generalizations of the IBP (such as a Pitman-Yor variant, in Section 3.2). Moreover, although our stick-breaking representation of the IBP was derived from a very different model than the CRP, we demonstrate a surprising duality between the sticks in these two representations which suggests deeper connections between the two models (Section 3.2). The theoretical developments we describe here, which show a stick-breaking representation which is to the IBP what Sethuraman’s construction is to the CRP, along with the recent work of Thibaux and Jordan [17], showing that a particular subclass of Beta processes is to the IBP as the Dirichlet process is to the CRP, ?rmly establish the IBP in relation to the well-known classes of Bayesian nonparametric models.

2 INDIAN BUFFET PROCESSES

The IBP is de?ned as the limit of a corresponding distribution over matrices with K columns, as the number of columns K → ∞. Let Z be a random binary N × K matrix, and denote entry (i, k) in Z by zik . For each feature k let ?k be the prior probability that feature k is present in an α object. We place a Beta( K , 1) prior on ?k , with α being the strength parameter of the IBP. The full model is:

α ?k ? Beta( K , 1) zik |?k ? Bernoulli(?k )

indepedently ?k indepedently ?i, k

(1a) (1b)

Let us now consider integrating out the ?k ’s and taking the

limit of K → ∞ to obtain the IBP. For the ?rst object, the chance of it having each particular feature k is indeα pendently K once ?k is integrated out, thus the distribuα tion over the number of features it has is Binomial( K , K). As K → ∞, this approaches Poisson(α). For subsequent objects i = 2, . . . , N , the probability of it also having a feature k already belonging to a previous object α K +m<i k is α +1+i?1 → m<i k where m<i k = j<i zjk > 0 i K is the number of objects prior to i with feature k. Repeating the argument for the ?rst object, object i will also have Poisson( α ) new features not belonging to prei vious objects. Note that even though the total number of available features is unbounded, the actual number K + of used features is always ?nite (and in fact is distributed as Poisson(α N 1 )). i=1 i The above generative process can be understood using the metaphor of an Indian buffet restaurant. Customers (objects) come into the restaurant one at a time, and can sample an in?nite number of dishes (features) at the buffet counter. Each customer will try each dish that previous customers have tried with probabilities proportional to how popular each dish is; in addition the customer will try a number of new dishes that others have not tried before. To complete the model, let θk be parameters associated with feature k and xi be an observation associated with object i. Let θk ? H xi ? F (zi,: , θ: ) independently ?k independently ?i (2a) (2b)

the customers and taking customer i to be the last customer to enter the restaurant; the second term f (·|·) is the data likelihood of xi if zik = 1. It is possible to integrate θ1:K + out from the likelihood term if H is conjugate to F . In fact it is important for H to be conjugate to F when we consider the probabilities of new features being introduced, because all possible parameters for these new features have to be taken into account. If Li is the number of new features introduced, we have p(Li |rest) ∝

α ( N )Li e? N Li ! α

×

? ? ? f (xi |zi,1:K + , zi,1:Li = 1, θ1:K + , θ1:Li ) dh(θ1:Li ) (4) ? where zi,1:Li are occurrences for the new features and ? θ1:Li are their parameters, the superscript ? denoting currently unused (inactive) features. The fraction comes from the probability of introducing Li new features under α Poisson( N ) while the second term is the data likelihood, ? with the parameters θ1:Li integrated out with respect to the prior density h(·).

where H is the prior over parameters, F (zi,: , θ: ) is the data distribution given the features zi,: = {zik }∞ correspondk=1 ing to object i and feature parameters θ: = {θk }∞ . We k=1 assume that F (zi,: , θ: ) depends only on the parameters of the present features. 2.1 GIBBS SAMPLER The above generative process for the IBP can be used directly in a Gibbs sampler for posterior inference of Z and θ given data x = {xi } [6]. The representation consists of the number K + of used (active) features, the matrix Z1:N,1:K + of occurrences among the K + active features, and their parameters θ1:K + . The superscript + denotes active features. The sampler iterates through i = 1, . . . , N , for each object i it updates the feature occurrences for the currently used features, then considers adding new features to model the data xi . For the already used features k = 1, . . . , K + , the conditional probability of zik = 1 given other variables is just p(zik = 1|rest) ∝

m?i k N f (xi |zi,?k , zik

The need to integrate out the parameters for new features is similar to the need to integrate out parameters for new clusters in the Dirichlet process (DP) mixture model case (see [13]). To perform this integration ef?ciently, conjugacy is important, but the requirement for conjugacy limits the applicability of the IBP in more elaborate settings. It is possible to devise samplers in the non-conjugate case analogous to those developed for DP mixture models [10, 5]. In the next section we develop a different representation of the IBP in terms of a stick-breaking construction, which leads us to an easy to implement slice sampler for the nonconjugate case.

3 STICK BREAKING CONSTRUCTION

In this section, we describe an alternative representation of the IBP where the feature probabilities are not integrated out, and a speci?c ordering is imposed on the features. We call this the stick-breaking construction for the IBP. We will see that the new construction bears strong relationships with the standard stick-breaking construction for CRPs, paving the way to novel generalizations of and inference techniques for the IBP. 3.1 DERIVATION Let ?(1) > ?(2) > . . . > ?(K) be a decreasing ordering α of ?1:K = {?1 , . . . , ?K }, where each ?l is Beta( K , 1). We will show that in the limit K → ∞ the ?(k) ’s obey the following law, which we shall refer to as the stick-breaking construction for the IBP,

k

= 1, θ1:K + ) (3)

where m?i k = j=i zjk . The fraction is the conditional prior of zik = 1, obtained by using exchangeability among

ν(k) ? Beta(α, 1)

i.i.d.

?(k) = ν(k) ?(k?1) =

l=1

ν(l) (5)

We start by considering ?(1) . For ?nite K it is ?(1) = max ?l

l=1,...,K α where each ?l is Beta( K , 1) and has density:

(6)

Notice that the ?(k) ’s have a Markov structure, with ?(k+1) conditionally independent of ?(1:k?1) given ?(k) . Finally, instead of working with the variables ?(k) directly, ?(k) we introduce a new set of variables ν(k) = ?(k?1) with range [0, 1]. Using a change of variables, the density of ν(k) is derived to be,

α?1 p(ν(k) |?(1:k?1) ) = αν(k) I(0 ≤ ν(k) ≤ 1)

p(?l ) =

α K ?1 I(0 K ?l

α

≤ ?l ≤ 1)

(7)

where I(A) is the indicator function for a condition (measurable set) A: I(A) = 1 if A is true, and 0 otherwise. The cumulative distribution function (cdf) for ?l is then:

?l

(15)

F (?l ) =

?∞

α K

α α K ?1 I(0 Kt

Thus ν(k) are independent from ?(1:k?1) and are simply Beta(α, 1) distributed. Expanding ?(k) = ν(k) ?(k?1) = k l=1 ν(l) , we obtain the stick-breaking construction (5). The construction (5) can be understood metaphorically as follows. We start with a stick of length 1. At iteration k = 1, 2, . . ., we break off a piece at a point ν(k) relative to the current length of the stick. We record the length ?(k) of the stick we just broke off, and recurse on this piece, discarding the other piece of stick. 3.2 RELATION TO DP

≤ t ≤ 1) dt (8)

= ?l I(0 ≤ ?l ≤ 1) + I(1 < ?l )

Since the ?l ’s are independent, the cdf of ?(1) is just the product of the cdfs of each ?l , so

K F (?(1) ) = ?(1) I(0 ≤ ?(1) ≤ 1) + I(1 < ?(1) < ∞) α

K

=

?α I(0 (1)

≤ ?(1) ≤ 1) + I(1 < ?(1) )

(9)

Differentiating, we see that the density of ?(1) is

α?1 p(?(1) ) = α?(1) I(0 ≤ ?(1) ≤ 1)

(10)

In iteration k of the construction (5), after breaking the stick in two we always recurse on the stick whose length we denote by ?(k) . Let π(k) be the length of the other discarded stick. We have,

k?1

and therefore ?(1) ? Beta(α, 1). We now derive the densities for subsequent ?(k) ’s. For each k ≥ 1 let lk be such that ?lk = ?(k) and let Lk = {1, . . . , K}\{l1, . . . , lk }. Since ?(1:k) = {?(1) , . . . , ?(k) } are the k largest values among ?1:K , we have ?l ≤ min ?(k ) = ?(k)

k ≤k

π(k) = (1 ? ν(k) )?(k?1) = (1 ? ν(k) )

l=1

ν(l)

(16)

Making a change of variables v(k) = 1 ? ν(k) ,

k?1

v(k) ? Beta(1, α)

i.i.d.

π(k) = v(k)

l=1

(1 ? v(l) )

(17)

(11)

for each l ∈ Lk . Restricting the range of ?l to [0, ?(k) ], the cdf becomes F (?l |?(1:k) ) =

?α

α

thus π(1:∞) are the resulting stick lengths in a standard stick-breaking construction for DPs [15, 7]. In both constructions the ?nal weights of interest are the lengths of the sticks. In DPs, the weights π(k) are the lengths of sticks discarded, while in IBPs, the weights ?(k) are the lengths of sticks we have left. This difference leads to the different properties of the weights: for DPs, the stick lengths sum to a length of 1 and are not decreasing, while in IBPs the stick lengths need not sum to 1 but are decreasing. Both stick-breaking constructions are shown in Figure 1. In both the weights decrease exponentially quickly in expectation.

?l α α ?1 t K dt 0 K ?(k) α α ?1 K dt Kt 0

K =?(k) ?lK I(0 ≤ ?l ≤ ?(k) ) + I(?(k) < ?l )

(12)

Now ?(k+1) = maxl∈Lk ?l with each ?l independent given ?(1:k) . The cdf of ?(k+1) is again the product of the cdfs of ?l over l ∈ Lk , F (?(k+1) |?(1:k) )

? K?k α K

K?k K α

The direct correspondence to stick-breaking in DPs implies =?(k) ?(k+1) I(0 ≤ ?(k+1) ≤ ?(k) ) + I(?(k) < ?(k+1) ) that a range of techniques for and extensions to the DP can be adapted for the IBP. For example, we can generalize the →??α ?α IBP by replacing the Beta(α, 1) distribution on ν(k) ’s with (k+1) I(0 ≤ ?(k+1) ≤ ?(k) ) + I(?(k) < ?(k+1) ) (k) other distributions. One possibility is a Pitman-Yor [14] as K → ∞. Differentiating, the density of ?(k+1) is, extension of the IBP, de?ned as p(?(k+1) |?(1:k) )

α?1 =α??α ?(k+1) I(0 ≤ ?(k+1) ≤ ?(k) ) (k) k

(13)

(14)

ν(k) ? Beta(α + kd, 1 ? d)

?(k) =

l=1

ν(l)

(18)

?(1) ?(2) ?(3) ?(4) ?(5) ?(6)

π(6)

π(5)

π(4)

π(3)

π(2)

π(1)

Figure 1: Stick-breaking construction for the DP and IBP. The black stick at top has length 1. At each iteration the vertical black line represents the break point. The brown dotted stick on the right is the weight obtained for the DP, while the blue stick on the left is the weight obtained for the IBP. where d ∈ [0, 1) and α > ?d. The Pitman-Yor IBP 1 weights decrease in expectation as a O(k ? d ) power-law, and this may be a better ?t for some naturally occurring data which have a larger number of features with signi?cant but small weights [4]. An example technique for the DP which we could adapt to the IBP is to truncate the stick-breaking construction after a certain number of break points and to perform inference in the reduced space. [7] gave a bound for the error introduced by the truncation in the DP case which can be used here as well. Let K ? be the truncation level. We set ?(k) = 0 for each k > K ? , while the joint density of ?(1:K ? ) is,

K?

seen as adaptively choosing the truncation level at each iteration. Slice sampling is an auxiliary variable method that samples from a distribution by sampling uniformly from the region under its density function [12]. This turns the problem of sampling from an arbitrary distribution to sampling from uniform distributions. Slice sampling has been successfully applied to DP mixture models [8], and our application to the IBP follows a similar thread. In detail, we introduce an auxiliary slice variable, s|Z, ?(1:∞) ? Uniform[0, ?? ] (21)

where ?? is a function of ?(1:∞) and Z, and is chosen to be the length of the stick for the last active feature, ?? = min 1, min ?(k) . (22)

k: ?i,zik =1

The joint distribution of Z and the auxiliary variable s is p(s, ?(1:∞) , Z) = p(Z, ?(1:∞) ) p(s|Z, ?(1:∞) ) (23)

1 where p(s|Z, ?(1:∞) ) = ?? I(0 ≤ s ≤ ?? ). Clearly, integrating out s preserves the original distribution over ?(1:∞) and Z, while conditioned on Z and ?(1:∞) , s is simply drawn from (21). Given s, the distribution of Z becomes: 1 p(Z|x, s, ?(1:∞) ) ∝ p(Z|x, ?(1:∞) ) ?? I(0 ≤ s ≤ ?? ) (24)

p(?(1:K ? ) ) =

k=1 K?

p(?(k) |?(k?1) )

(19)

=α

K?

?α ? ) (K

k=1

??1 I(0 ≤ ?(K ? ) ≤ · · · ≤ ?(1) ≤ 1) (k)

The conditional distribution of Z given ?(1:K ? ) is simply1

N K?

which forces all columns k of Z for which ?(k) < s to be zero. Let K ? be the maximal feature index with ?(K ? ) > s. Thus zik = 0 for all k > K ? , and we need only consider updating those features k ≤ K ? . Notice that K ? serves as a truncation level insofar as it limits the computational costs to a ?nite amount without approximation. Let K ? be an index such that all active features have index k < K ? (note that K ? itself would be an inactive feature). The computational representation for the slice sampler consists of the slice variables and the ?rst K ? features: s, K ? , K ? , Z1:N,1:K ? , ?(1:K ? ) , θ1:K ? . The slice sampler proceeds by updating all variables in turn. Update s. The slice variable is drawn from (21). If the new value of s makes K ? ≥ K ? (equivalently, s < ?(K ? ) ), then we need to pad our representation with inactive features until K ? < K ? . In the appendix we show that the stick lengths ?(k) for new features k can be drawn iteratively from the following distribution: p(?(k) |?(k?1) , z:,>k = 0) ∝ exp(α

N 1 i=1 i (1

p(Z|?(1:K ? ) ) =

i=1 k=1

?zik (1 ? ?(k) )1?zik (k)

(20)

with zik = 0 for k > K ? . Gibbs sampling in this representation is straightforward, the only point to note being that adaptive rejection sampling (ARS) [3] should be used to sample each ?(k) given other variables (see next section).

4 SLICE SAMPLER

Gibbs sampling in the truncated stick-breaking construction is simple to implement, however the predetermined truncation level seems to be an arbitrary and unnecessary approximation. In this section, we propose a nonapproximate scheme based on slice sampling, which can be

Note that we are making a slight abuse of notation by using Z both to denote the original IBP matrix with arbitrarily ordered columns, and the equivalent matrix with the columns reordered to decreasing ?’s. Similarly for the feature parameters θ’s.

1

? ?(k) )i ) (25)

α?1 ?(k) (1 ? ?(k) )N I(0 ≤ ?(k) ≤ ?(k?1) )

We used ARS to draw samples from (25) since it is logconcave in log ?(k) . The columns for these new features are initialized to z:,k = 0 and their parameters drawn from their prior θk ? H.

Update Z. Given s, we only need to update zik for each i and k ≤ K ? . The conditional probabilities are: p(zik = 1|rest) ∝ ?(k) f (xi |zi,?k , zik = 1, θ1:K ? ) (26) ??

arbitrarily large ?nite model, then to either ignore the ordering and weights (to get IBP) or to enforce the decreasing weight ordering (to get stick-breaking). Changing from stick-breaking to the standard IBP representation is easy. We simply drop the stick lengths as well as the inactive features, leaving us with the K + active feature columns along with the corresponding parameters. To change from IBP back to the stick-breaking representation, we have to draw both the stick lengths and order the features in decreasing stick lengths, introducing inactive features into the representation if required. We may index the K + active features in the IBP representation as k = 1, . . . , K + in the ?nite model. Let Z1:N,1:K + be the feature occurrence matrix. Suppose that we have K K+ features in the ?nite model. For the active features, the posterior for the lengths are simply ?+ |z:,k ? Beta( k α + m·k , 1 + N ? m·k ) K → Beta(m·k , 1 + N ? m·k )

The ?? denominator is needed when different values of zik induces different values of ?? by changing the index of the last active feature. Update θk . For each k = 1, . . . , K ? , the conditional probability of θk is:

N

p(θk |rest) ∝ h(θk )

i=1

f (xi |zi,1:K ? , θ?k , θk )

(27)

Update ?(k) . For k = 1, . . . , K ? ? 1, combining (19) and (20), the conditional probability of ?(k) is p(?(k) |rest)

m·k ∝?(k) ?1 (1

? ?(k) )

N ?m·k

I(?(k+1) ≤ ?(k) ≤ ?(k?1) )

(28)

(29)

where m·k = N zik . For k = K ? , in addition to taki=1 ing into account the probability of features K ? is inactive, we also have to take into account the probability that all columns of Z beyond K ? are inactive as well. The appendix shows that the resulting conditional probability of ?(K ? ) is given by (25) with k = K ? . We draw from both (28) and (25) using ARS.

5 CHANGE OF REPRESENTATIONS

Both the stick-breaking construction and the standard IBP representation are different representations of the same nonparametric object. In this section we consider updates which change from one representation to the other. More precisely, given a posterior sample in the stick-breaking representation we wish to construct a posterior sample in the IBP representation and vice versa. Such changes of representation allow us to make use of ef?cient MCMC moves in both representations, e.g. interlacing split-merge moves in IBP representation [10] with the slice sampler in stick-breaking representation. Furthermore, since both stick lengths and the ordering of features are integrated out in the IBP representation, we can ef?ciently update both in the stick-breaking representation by changing to the IBP representation and back. We appeal to the in?nite limit formulation of both representations to derive the appropriate procedures. In particular, we note that the IBP is obtained by ignoring the ordering on features and integrating out the weights ?(1:K) in an arbitrarily large ?nite model, while the stick-breaking construction is obtained by enforcing an ordering with decreasing weights. Thus, given a sample in either representations, our approach is to construct a corresponding sample in an

as K → ∞. For the rest of the K ? K + inactive features, it is suf?cient to consider only those inactive features with stick lengths larger than mink ?+ . Thus we consider k a decreasing ordering ?? > ?? > · · · on these lengths. (1) (2) (25) gives their densities in the K → ∞ limit and ARS can be used to draw ?? ? ) until ?? ? ) < mink ?+ . Fik (K (1:K nally, the stick-breaking representation is obtained by reordering ?+ + , ?? ? ) in decreasing order, with the fea(1:K 1:K ture columns and parameters taking on the same ordering (columns and parameters corresponding to inactive features are set to 0 and drawn from their prior respectively), giving us K + + K ? features in the stick-breaking representation. 5.1 SEMI-ORDERED STICK-BREAKING In deriving the change of representations from the IBP to the stick-breaking representation, we made use of an intermediate representation whereby the active features are unordered, while the inactive ones have an ordering of decreasing stick lengths. It is in fact possible to directly work with this representation, which we shall call semi-ordered stick-breaking. The representation consists of K + active and unordered features, as well as an ordered sequence of inactive features. The stick lengths for the active features have conditional distributions: ?+ |z:,k ? Beta(m·,k , 1 + N ? m·,k ) k (30)

while for the inactive features we have a Markov property: p(?? |?? (k) (k?1) , z:,>k = 0) ∝ exp( (?? )α?1 (1 ? ?? )N I(0 ≤ (k) (k)

N 1 ? i i=1 i (1 ? ?(k) ) )) ?? ≤ ?? (k) (k?1) ) (31)

5.2 SLICE SAMPLER To use the semi-ordered stick-breaking construction as a representation for inference, we can again use the slice sampler to adaptively truncate the representation for inactive features. This gives an inference scheme which works in the non-conjugate case, is not approximate, has an adaptive truncation level, but without the restrictive ordering constraint of the stick-breaking construnction. The representation s, K + , Z1:N,1:K + , ?+ + , θ1:K + consists only 1:K of the K + active features and the slice variable s, s ? Uniform[0, ?? ] ?? = min 1,

1≤k≤K +

10 mixing time

3

10

2

10

1

Stick?Breaking Semi?Ordered

Gibbs Sampling

Figure 2: Autocorrelation times for K + for the slice sampler in decreasing stick lengths ordering, in semi-ordered stick-breaking representation, and for the Gibbs sampler. For each dataset and each sampler, we repeated 5 runs of 15, 000 iterations. We used the autocorrelation coef?cients of the number of represented features K + and α (with a maximum lag of 2500) as measures of mixing time. We found that mixing in K + is slower than in α for all datasets and report results only for K + here. We also found that in this regime the autocorrelation times do not vary with 2 dimensionality or with σA . In Figure 2 we report the auto+ correlation times of K over all runs, all datasets, and all three samplers. As expected, the slice sampler using the decreasing stick lengths ordering was always slower than the semi-ordered one. Surprisingly, we found that the semiordered slice sampler was just as fast as the Gibbs sampler which fully exploits conjugacy. This is about as well as we would expect a more generally applicable non-conjugate sampler to perform.

min

?+ k

(32)

Once a slice value is drawn, we generate K ? inactive features, with their stick lengths drawn from (31) until ? ?? ? +1) < s. The associated feature columns Z1:N,1:K ? (K ? are initialized to 0 and the parameters θ1:K ? drawn from their prior. Sampling for the feature entries and parameters for both the active and just generated inactive features proceed as before. Afterwards, we drop from the list of active features any that became inactive, while we add to the list any inactive feature that became active. Finally, the stick lengths for the new list of active features are drawn from their conditionals (30).

6 EXPERIMENT

In this section we compare the mixing performance of the two proposed slice samplers against Gibbs sampling. We chose a simple synthetic dataset so that we can be assured of convergence to the true posterior and that mixing times can be estimated reliably in a reasonable amount of computation time. We also chose to apply the three samplers on a conjugate model since Gibbs sampling requires conjugacy, although our implementation of the two slice samplers did not make use of this. In the next section we demonstrate the modelling performance of a non-conjugate model using the semi-ordered slice sampler on a dataset of MNIST handwritten digits. We used the conjugate linear-Gaussian binary latent feature model for comparing the performances of the different samplers [6]. Each data point xi is modelled using a spher2 ical Gaussian with mean zi,: A and variance σX , where zi,: is the row vector of feature occurrences corresponding to xi , and A is a matrix whose kth row forms the parameters for the kth feature. Entries of A are drawn i.i.d. from a zero 2 mean Gaussian with variance σA . We generated 1, 2 and 3 dimensional datasets from the model with data variance 2 ?xed at σX = 1, varying values of the strength parameter 2 α = 1, 2 and the latent feature variance σA = 1, 2, 4, 8. For each combination of parameters we produced ?ve datasets with 100 data points, giving a total of 120 datasets. For all 2 2 datasets, we ?xed σX and σA to the generating values and learned the feature matrix Z and α.

7 DEMONSTRATION

In this section we apply the semi-ordered slice sampler to 1000 examples of handwritten images of 3’s in the MNIST dataset. The model we worked with is a generalization of that in Section 6, where in addition to modelling feature occurrences, we also model per object features values [6]. In particular, let Y be a matrix of the same size as Z, with i.i.d. zero mean unit variance Gaussian entries. We model each xi as

2 xi |Z, Y, A, σX ? N ((zi,: 2 yi,: )A, σX I),

(33)

where is elementwise multiplication. Speci?cation for the rest of the model is as in Section 6. We can integrate Y or A out while maintaining tractability, but not both. The handwritten digit images are ?rst preprocessed by projecting on to the ?rst 64 PCA components, and the sampler ran for 10000 iterations. The trace plot of the log likelihood and the distribution of the number of active features are shown in Figure 3. The model succesfully ?nds latent features to reconstruct the images as shown in Figure 4. Some of the latent features found are shown in Figure 5. Most appear to model local shifts of small edge segments

×10

5

log likelihood

?4

400

# iterations

2500 5000 7500 10000

300 200 100 0 50 100 150

?4.1 ?4.2 ?4.3

iterations

m (# objects with feature k) 300 200 100 0

# active feats

150

Figure 5: Features that are shared between many digits. ments. A direct consequence of our stick-breaking construction is that a draw from such a Beta process has the ∞ form A = k=1 ?(k) δθk with ?(k) drawn from (5) and θk drawn i.i.d. from the base measure H. This is a particularly simply case of a more general construction called the inverse L? vy measure [18, 9]. Generalizations to use ing other stick-breaking constructions automatically lead to generalizations of the Beta process, and we are currently exploring a number of possibilities, including the PitmanYor extension. Finally, the duality observed in Section 3.2 seems to be a hitherto unknown connection between the Beta process and the DP which we are currently trying to understand. As an aside, it is interesting to note the importance of feature ordering in the development of the IBP. To make the derivation rigorous, [6] had to carefully ignore the feature ordering by considering permutation-invariant equivalence classes before taking the in?nite limit. In this paper, we derived the stick-breaking construction by imposing a feature ordering with decreasing feature weights. To conclude, our development of a stick-breaking construction for the IBP has lead to interesting insights and connections, as well as practical algorithms such as the new slice samplers. ACKNOWLEDGEMENTS

# objects

20 40 60 80 100 120

100 50 0

k

5

10

15

k (feature label)

# active feats

Figure 3: Top-left: the log likelihood trace plot. The sampler quickly ?nds a high likelihood region. Top-right: histogram of the number of active features over the 10000 iterations. Bottom-left: number of images sharing each feature during the last MCMC iteration. Bottom-right: histogram of the number of active features used by each input image. Note that about half of the features are used by only a few data points, and each data point is represented by a small subset of the active features.

Figure 4: Last column: original digits. Second last column: reconstructed digits. Other columns: features used for reconstruction. of the digits, and are reminiscent of the result of learning models with sparse priors (e.g. ICA) on such images [16].

We thank the reviewers for insightful comments. YWT thanks the Lee Kuan Yew Endowment Fund for funding. REFERENCES [1] D. Aldous. Exchangeability and related topics. In ? ?e Ecole d’Et? de Probabilit? s de Saint-Flour XIII– e 1983, pages 1–198. Springer, Berlin, 1985. [2] W. Chu, Z. Ghahramani, R. Krause, and D. L. Wild. Identifying protein complexes in high-throughput protein interaction screens using an in?nite latent feature model. In BIOCOMPUTING: Proceedings of the Paci?c Symposium, 2006. [3] W.R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41:337–348, 1992. [4] S. Goldwater, T.L. Grif?ths, and M. Johnson. Interpolating between types and tokens by estimating power-

8 DISCUSSION AND FUTURE WORK

We have derived novel stick-breaking representations of the Indian buffet process. Based on these representations new MCMC samplers are proposed that are easy to implement and work on more general models than Gibbs sampling. In experiments we showed that these samplers are just as ef?cient as Gibbs without using conjugacy. [17] have recently showed that the IBP is a distribution on matrices induced by the Beta process with a constant strength parameter of 1. This relation to the Beta process is proving to be a fertile ground for interesting develop-

law generators. In Advances in Neural Information Processing Systems, volume 18, 2006. [5] D. G¨ r¨ r, F. J¨ kel, and C. E. Rasmussen. A choice ou a model with in?nitely many latent features. In Proceedings of the International Conference on Machine Learning, volume 23, 2006. [6] T. L. Grif?ths and Z. Ghahramani. In?nite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18, 2006. [7] H. Ishwaran and L.F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001. [8] M. Kalli and S. G. Walker. Slice sampling for the Dirichlet process mixture model. Poster presented at the Eighth Valencia International Meeting on Bayesian Statistics, 2006. [9] P. L? vy. e Th? orie de L’Addition des Variables e Al? atoires. Paris: Gauthier-Villars, 1937. e [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19, to appear 2007. [11] D. J. Navarro and T. L. Grif?ths. A nonparametric Bayesian method for inferring features from similarity judgements. In Advances in Neural Information Processing Systems, volume 19, to appear 2007. [12] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003. [13] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000. [14] J. Pitman and M. Yor. The two-parameter PoissonDirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900, 1997. [15] J. Sethuraman. A constructive de?nition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994. [16] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, Dec 2003. [17] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. This volume, 2007. [18] R. L. Wolpert and K. Ickstadt. Simulations of l? vy e random ?elds. In Practical Nonparametric and Semiparametric Bayesian Statistics, pages 227–242. Springer-Verlag, 1998.

[19] F. Wood, T. L. Grif?ths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Arti?cial Intelligence, volume 22, 2006. APPENDIX Recall from the construction of ?(k+1:∞) that it is simply the K → ∞ limit of a decreasing ordering of ?Lk . Since reordering does not affect the probabilities of zil ’s given the corresponding ?l for each l ∈ Lk , p(z:,>k = 0|?(k) ) = lim

K→∞

p(?Lk |?(k) )p(z:,Lk = 0|?Lk ) d?Lk

Given ?(k) , ?l ’s and zil ’s are conditionally i.i.d. across different l’s, with cdf of ?l as given in (12). Thus we have

?(k)

= lim

K→∞

0

α K (1 ? ?)N K ?(k) ? K ?1 d?

?α

K?k

α

(34)

Applying change of variables ν = ?/?(k) to the integral,

1 0 1 α (1 ? ν?(k) )N K ν K ?1 dν α (1 ? ν + ν(1 ? ?(k) ))N K ν K ?1 dν 1 N α (N ) (1 ? ν)N ?i (ν(1 ? ?(k) ))i K ν K ?1 dν i i=0 α (N ) (1 ? ?(k) )i K i i=0 N

α Γ( K +i)Γ(N ?i+1) α Γ( K +i+N ?i+1) α α α

=

0

=

0 N

=

=

i=0

N! (N ?i)!i! (1 QN N !α

α ? ?(k) )i K α K

Qi?1

α j=0 ( K +j)(N ?i)! QN α j=0 ( K +j)

=

j=1 K

+j

1+

N i=1

Qi?1

α j=1 K

+j

i!

(1 ? ?(k) )i

Finally, plugging the above into (34) and taking K → ∞, p(z:,>k = 0|?(k) ) = exp ? αHN + α

N 1 i=1 i (1

? ?(k) )i

(35)

To obtain (25), we note that the conditional for ?(k) is the posterior conditioned on both z:,k = 0 and z:,>k = 0. The prior given ?(k?1) is (14), the probability of z:,k = 0 is just (1 ? ?(k) )N , while the probability of z:,>k = 0 is (35); multiplying all three gives (25).