a probability/combinatorics question Best answer on the web
January 9, 2009 on 7:26 am | In mybachcars.com |
a probability/combinatorics question Best answer on the web I am looking for a solution to the following problem:
Suppose we have N balls as well as N sticks. Each ball takes on one of the S available colors, with the number of balls having the i-th color being p_i (i=1,2,...,S). Clearly, p_1+p_2+...+p_S = N. The balls having the same color are not distinguishable. The same applies for the N sticks: each stick takes on one of the T available colors, with the number of sticks having the j-th color being q_j (j=1,2,...,T; q_1+...+q_T = N). Sticks having the same color are not distinguishable, either.
Now if we pair those balls and sticks up and form N "ball-stick" pairs, what is the expected number of distinct pairs? And what is the probability that there are r distinct pairs?
Thanks. A quick answer is highly appreciated.
I strongly suspect there are fast "greedy" algorithms to find the maximum and minimum values of R (number of distinguisable ball-stick pairs). Greedy algorithms exist where there is a "matroid" structure to the search space, and here the search space can be conceived of as a permutation of the balls keeping the sticks in some fixed order (or vice-versa).
Here's a fairly interesting case, s = t = 4 with N = 11:
p_1 = 5, p_2 = p_3 = p_4 = 2
q_1 = 4, q_2 = q_3 = 3, q_4 = 1
where min(R) = 6 and max(R) = 10.
It seems plausible that a deeper investigation of the matroid structure behind the problem might lead to an efficient (polynomial time) algorithm for the expected value of R (lower complexity than the complete distribution for R), but I have doubts about my own ability to work it out.
regards, mathtalk-ga
I gave some thought to your Question and posted a Comment below outlining a computational approach. If you are only interested in an analytic result (one that would take all the p's and q's into consideration), then this would be of little interest to you. However if you have a practical application in mind for which efficient and accurate calculations are useful, I could post as an Answer a more detailed outline of the coding, together with some optimizations of the procedure.
regards, mathtalk-ga
regards, mathtalk-ga
While I'm not entirely satisfied with the presentation of ideas below, in the interest of expedience I will outline what I believe to be polynomial-time algorithms for the minimum R and maximum R problems. I will shortly add some further details on the computation of expected R as well, and I'll be interested in doing some computations with larger problems as a check on my conjecture.
Alternative Descriptions of the Problem Data
============================================
The number of distinct ball-stick color combinations can be stated in terms familiar to a statistician. The counts of ball color vs. stick color become the entries of what would be called a two-way contingency table, and the given number p_i of balls of one color (resp. q_j of sticks of one color) becomes a "fixed marginal total" for a row (resp. column) of that table.
Another description of the problem data can be given in terms of a bipartite multigraph, with the ball colors being one set of the vertices and the stick colors the disjoint complement, and combinations of the two kinds of colors being edges of the graph. Restrictions on the available colors then correspond to restrictions on the degrees of the respective vertices.
The former description seems to be helpful in visualizing the search for a solution with minimum R, while the latter is a natural choice for seeking maximum R because the count of distinct outcomes corresponds to counting the edges of the underlying simple graph.
Let's begin however with "contingency table" description. More formally we have an s by t table of nonnegative integers with row sums:
p_1 + p_2 + ... + p_s = N
and column sums:
q_1 + q_2 + ... + q_t = N
We define the "test statistic" R to be the number of nonzero cell entries.
The exact distribution of R, assuming balls and sticks are randomly paired, eg. from a pair of proverbial urns, is likely to be a complex calculation. One can of course calculate the distribution of an such test statistic by exhaustive enumeration of the possible contingency tables, an enterprise which is taken up in this lengthy doctoral dissertation:
[Exact Tests via Complete Enumeration: A Distributed Computing Approach]
(Ph.D. dissertation by Danius Takis Michaelides, 1997)
http://eprints.ecs.soton.ac.uk/749/04/thesis.ps
Test statistic R is quite different from the usual test statistics X and G that are used to test "independence" of categorical variables (see above for definitions), both of which assume chi-squared distributions for "sufficiently large" sample sizes. However there is an important connection to be pointed out, in that the use of the chi-square approximation (rather than Fisher's Exact Test, which corresponds to the exactly determined probabilities) depends on having sufficiently many expected observations in each cell of the contingency table, conventionally given as at least 5. Contingency tables for which some cells have expected values below 5 are called "sparse", so an analysis of how many cells may be expected to be zero (or nonzero) is at least in this tangential way related to the usual analysis of contingency tables.
Proposition and Conjecture
==========================
Prop. Let C = {c_ij} be an s by t contingency table with fixed marginal totals:
-----
t
SUM c_ij = p_i for i = 1,..,s
j=1
s
SUM c_ij = q_j for j = 1,..,t
i=1
and consider subtable C', say of dimensions s' by t', whose own row and column sums, p'_i and q'_j, depend on the entries inherited from C.
(i) If C is arranged so that the number of nonzero entries R is as small as possible for any contingency table with the same marginal totals, then C' has also a minimal number of nonzero entries relative to the row and column sum constraints inherited as a subtable of C.
(ii) If C is arranged so that the number of nonzero entries R is as large as possible for any contingency table with the same marginal totals, then C' has also a maximal number of nonzero entries relative to the row and column sum constraints inherited as a subtable of C.
Sketch of proof:
----------------
Choosing an subset of rows and columns of C, the entries of the corresponding subtable C' may be freely rearranged subject to the conservation of its own row and column sums, without adjusting any entries in C outside of C' or affecting the overall row and column sums of C. Thus the optimality of C either with respect to minimum or maximum R implies that the same must be true of any subtable C', for otherwise the possibility of reducing or enlarging the number of nonzero entries in C' would imply that possibility for C.
We will be concerned hereafter only with applying this principle for 2 by 2 subtables, so let me recognize those cases for minimum and maximum R.
A 2 by 2 subtable with at least one nonzero entry in each row and each column has min(R) equal to either 2 or 3 and max(R) equal to 2, 3, or 4.
More particularly:
min(R) = 2 if and only if the p's and q's are the same up to permutation
max(R) = 2 if and only if some p_i = q_j = 1
max(R) = 3 if and only if some p_i = 1 or some q_j = 1 but not both
max(R) = 4 if and only if both p_i's and both q_j's are greater than 1
We have begun to omit discussion of the case when a row or column sum is zero, since this implies that all entries in that row or column resp. are zero. The only reason for not excluding it earlier is that subtable C' may have a zero row or column sum despite the fact that table C does not. However there is certainly nothing special to optimize one way or the other if a row or column sum is zero; the problems of min(R) and max(R) for such a table are equivalent to those for a table with the offending row or column omitted.
Conjecture
----------
Let fixed marginal totals p_i and q_j be prescribed for an s by t table. Subject to these constraints:
(i) Table C has minimum R if and only if every 2 by 2 subtable has minimum number of nonzero entries (relative to its row & column sums inherited from C).
(ii) Table C has maximum R if and only if every 2 by 2 subtable has maximum number of nonzero entries (relative to its row & column sums inherited from C).
The Conjecture is of course a strong converse to the Proposition, which says that optimality of a table implies the corresponding optimality in each 2 by 2 subtable. It's the sort of conjecture that I would expect to fail, if it did, with fairly small counterexamples, and thus I'm taking the absence of these in the less-than-exhaustive analysis carried out so far as a pretty strong hint that it is true. An illustration of this will turn up in discussing a method for finding min(R) below.
In principle the Conjecture implies we can start with an arbitrary contingency table C having the prescribed marginal totals, and proceed to optimize either for min(R) or max(R) by incremental modifications to its 2 by 2 subtables. Since each modification must either decrease or increase R by at least 1, and since trivially:
s,t ? R ? s*t
the number of such steps is polynomial in s and t. (Perhaps some discussion is needed about whether these are the right criteria for polynomial complexity; we will return to the topic later.)
Then we need only consider the complexity of searching a table for 2 by 2 subtables that are not optimal. As is clear from the details of min(R) and max(R) given above for these small subtables, it suffices to restrict attention to 2 by 2 subtables with at least two nonzero entries. Even without this restriction the number of 2 by 2 subtables is a simple polynomial in s and t:
# of 2 by 2 subtables of C = s*(s-1)*t*(t-1)/4
Note that a scanning step is "expensive" as the degree of polynomial is 4. It is therefore of practical value to finding starting configurations that are close to optimal to reduce the number of such steps.
With these preliminaries behind us, we are ready for the first algorithm, which seeks a minimum value for R. We begin by constructing a table C with the required row and column sums and a "generically minimum" R as is to be explained afterwards.
Method I
========
Let s,t ? 1 be given together with positive summands p_i, q_j such that both finite series sum to N, as described earlier under Problem Data:
p_1 + p_2 + ... + p_s = N
q_1 + q_2 + ... + q_t = N
If s,t = 1, then p_1 = q_1 = N is to be assigned to c_11, and the algorithm terminates.
Otherwise compare p_1 and q_1 and proceed by:
(Case 1) p_1 > q_1
Set the first column of C to (q_1,0,...,0)' and assign entries in the remaining (t-1) columns by recursively applying the method to the reduced series:
(p_1 - q_1) + p_2 + ... + p_s = N - q_1
q_2 + ... + q_t = N - q_1
(Case 2) p_1 = q_1
Set the first row and first column of C to zeroes except c_11 = p_1 = q_1, and assign entries in the remaining (s-1) by (t-1) subtable by applying this method to the reduced series:
p_2 + ... + p_s = N - p_1
q_2 + ... + q_t = N - q_1
(Case 3) p_1 < q_1
Set the first row of C to (p_1,0,...,0) and assign entries in the remaining (s-1) rows by recursively applying the method to the reduced series:
p_2 + ... + p_s = N - p_1
(q_1 - p_1) + q_2 + ... + q_t = N - p_1
--------------------------
It is not hard to verify that each step of Method I prior to termination defines a new problem of the same form and of reduced dimension s or t or both by 1. So the method requires at most (s-1) + (t-1) recursive steps before reaching the termination step. Each step including the termination step introduces a single nonzero entry to the final constructed C, so this method constructively gives:
min(R) ? s + t - 1
The actual R obtained by this method is potentially less than this bound by just so many times as Case 2 occurs during its execution. In fact if the termination step where p_1 = q_1 also holds is counted as an instance of Case 2, then:
R = s + t - (# of times Case 2 occurs)
We illustrate these ideas by recalling an example cited earlier:
p_1 = 5, p_2 = p_3 = p_4 = 2
q_1 = 4, q_2 = q_3 = 3, q_4 = 1
Executing Method I on the p's and q's so ordered produces this:
4 1 0 0
0 2 0 0
0 0 2 0
0 0 1 1
Case 2 occurs at step 2 because p_1 + p_2 = q_1 + q_2, so that R = s + t - 2 in this instance. Determining the arrangements of p's and q's apriori that will maximize the number of times that Case 2 occurs is a daunting task. While one can always safely pair off any single terms p_i = q_j before proceeding to the optimization of the remaining summands' order, finding the optimum groupings of terms is probably NP hard.
However if the earlier Conjecture is true, we can simply use the output of Method I as a convenient starting point for a final reduction of R. Suppose that the summands were instead taken in this order:
p_1 = p_2 = 2, p_3 = 5, p_4 = 2
q_1 = q_2 = 3, q_3 = 1, q_4 = 4
The resulting table has the "generic" value R = s + t - 1 = 7:
2 0 0 0
1 1 0 0
0 2 1 2
0 0 0 2
because that ordering of p's and q's did not occasion any Case 2 prior to the termination step.
But with a little effort we can spot a 2 by 2 subtable in the last two rows, second and fourth columns, that cries out for minimization:
2 2 0 4
0 2 modifies to 2 0
thereby introducing one additional zero entry:
2 0 0 0
1 1 0 0
0 0 1 4
0 2 0 0
and min(R) = 6, as appears from the absence of any further 2 by 2 subtable that is not minimal with regard to R.
This construction shows that s + t - 1 is an upper bound for min(R), and one that is generically sharp as treating the p's and q's as free parameters it is always possible to find s by t examples where no Case 2 occurs prior to the termination step.
Description in terms of bipartite graphs
========================================
Next we present an algorithm which seeks the maximum value of R. It will be convenient at this point to switch to using the bipartite graph terminology mentioned briefly above.
Our point of view will be to treat each color of balls as belonging to one set of nodes S and each color of sticks to a disjoint set of nodes T.
Note that S = s and T = t.
For each node i in S we define a "valence" p_i which is the maximum number of edges (or multi-edges, counted according to multiplicity) that can be drawn to this node. Similarly for node j in T we have valence q_j.
Provided that SUM p_i = SUM q_j as our problem definition guarantees, we can always construct a bipartite multigraph which attains the prescribed valences. However from the perspective of maximizing the distinct number of outcomes R, the only thing that counts is the number of simple edges underlying such a graph. Therefore we can identify max(R) with the maximum number of edges in a bipartite simple graph that satisfies the valence (maximum degree) restrictions above, noting that from such a simple graph one can always attain the prescribed degrees by adding multi-edges by turns until all counts p_i and q_j are exhausted.
Method II
=========
Let p_1 ? p_2 ? ... ? p_s.
For i = 1 to s, consider node i in S:
Draw edge from i in S to j in T if possible, q_j > 0,
up to the limit min(p_i, (# of q_j's > 0)), choosing
if p_i is less than (# of q_j's > 0) those nodes j
where q_j is largest.
Decrement q_j by 1 if an edge is drawn from i to j.
--------------------------
The collection of (simple) edges so drawn corresponds to a set of distinct outcomes in the ball/stick pairing, or to nonzero entries in a contingency table C as in the earlier treatment.
Note that the values q_j are modified during the algorithm, so that it might be helpful to write q_j(i-1) for the value of q_j at the beginning of step i and q_j(i) for the value at the end of step i.
The value of R constructed in this fashion could then be expressed as:
s
SUM min(p_i, (# of q_j(i-1) > 0) )
i=1
Again we illustrate the ideas by reference to the earlier example. Note that we here impose a descending order on the p's:
Step 1: p_1 = 5 and q(0) = (4,3,3,1)
Draw an edge from the first node in S to each node in T.
Decrement each valence q accordingly.
Step 2: p_2 = 2 and q(1) = (3,2,2,0)
Draw an edge from the second node in S to the first two in T.
Decrement those two valences in q accordingly.
Step 3: p_3 = 2 and q(2) = (2,1,2,0)
Draw an edge from the third node in S to the first and third in T.
Decrement those two valences in q accordingly.
Step 4: p_4 = 2 and q(3) = (1,1,1,0)
Draw an edge from the fourth node in S to the first two in T.
Decrement those two valences in q accordingly.
By the end of Method II we have constructed 10 = 4+2+2+2 distinct "outcomes" or edges in the bipartite graph G. There was one unsatisfied valence of p_1 remaining from Step 1, and the final vector q(4) = (0,0,1,0), so that we could complete the construction of contingency table C (or the pairings of balls and sticks) by adding a multi-edge from node 1 in S to node 3 in T, an edge that by construction duplicates one already added in Step 1.
That is, prior to adding that final edge we would have this partially populated contingency table corresponding to the simple graph G:
1 1 1 1
1 1 0 0
1 0 1 0
1 1 0 0
and after adding the multi-edge it becomes:
1 1 2 1
1 1 0 0
1 0 1 0
1 1 0 0
which may be seen to satisfy all the required row and column sums.
Furthermore, scanning this table for potential 2 by 2 subtables that are suboptimal in the sense of maximizing R, none are found.
Note on complexity measurement
==============================
It is easy to become careless about claims of "polynomial time" where the size of the problem is not of a simply defined size. Above I referenced polynomials in s and t as a justification for polynomial time, but I would invite critical consideration of this as the basis for measuring the size of the input.
To express numbers s and t requires only log(s) + log(t) space. My proposal to use s and t themselves as the measurements of problem size is therefore based on an additional consideration, namely the (effectively nonzero) summands p_i and q_j that are also required by the problem formulation. Given that at least a bit is required for the expression of each such summand, we are able to argue that the problem size is (up to some positive constant multiplier) at least s + t:
s < c SUM log(p_i)
t < c SUM log(q_j)
The arithmetic required by Methods I and II above is minimal, though subtraction is called for. Therefore it is helpful to anticipate that in a careful analysis the cost of such arithmetic will also entail terms like log(p_i) and log(q_j).
regards, mathtalk-ga
As a notational device, I prefer capital letter R for the (random variable) number of distinguishable pairs of balls and sticks, and lowercase letters s and t for the (deterministic) numbers of ball and stick colors, respectively. Of course the roles of the p's and the q's are symmetric in this problem, so we will omit repeating observations that could well be restated by swapping them.
Without loss of generality we may assume that all p_i, q_j are positive.
The case s = 1 is trivial, since Pr(R = t) is 1.
One approach to managing a computation of the distribution of R is to formulate the problem in terms of sampling from an urn without replacement. The contents of the urn, together with the number of distinguishable outcomes already attained, will define a state for the purposes of describing a system of transition probabilities.
Let an urn be filled with the N balls, having p_i of the corresponding color:
s
SUM p_i = N
i=1
We propose to conduct t withdrawals (without replacement) from the urn, taking q_j balls at the j-th step, joining the selected q_j balls with the respective q_j sticks of a common color:
t
SUM q_j = N
j=1
Let the initial state be "concentrated" with probability 1 on a state (p_1,p_2,...,p_s;0) that reflects the initial contents of the urn and the absence of any ball/stick combinations yet formed. Each step will "distribute" that probability among new states. The number of distinguishable outcomes for any consequent state will increase with each step, being the number of distinct colors of balls present in the sample withdrawn. Since each q_j > 0, there must be at least one color of ball in each step, so the final number of distinguishable outcomes is at least the number of steps t.
We illustrate the computation with s = t = 2 and equal numbers m of both colors for sticks and balls. The initial state is (m,m;0). The first withdrawal will produce a binomial probability distribution on m+1 new states, (k,m-k;r) where r = 1 if k = 0,m and r = 2 otherwise:
Pr(k,m-k;r) = C(m,k)/(2^m)
The second withdrawal leads to the "random" outcome (0,0;R) as all the remaining balls will be finally used up. The possible values for R are 2 and 4, with these respective probabilities:
Pr( R = 2 ) = 2^(1-m)
Pr( R = 4 ) = 1 - 2^(1-m)
The expected value for R is then:
E(R) = 2*Pr(R = 2) + 4*Pr(R = 4)
= 2^(2-m) + 4 - 2^(3-m)
= 4 - 2^(2-m)
Note in particular that for m = 1 (N = 2) we have E(R) = 4-2 = 2. But as m tends to infinity, E(R) rapidly approaches 4.
regards, mathtalk-ga
In your question (May 27th 2005) you asked for the expectation of the number of distinct colors of ball-stick pairs in a random matching of balls to sticks, given the data p_i and q_j describing the numbers of balls and sticks of each color. I think the discussion that followed was all about how to compute the distribution of this number, in particular the maximum and minimum. However, there is a simple way to compute the expectation, using a standard technique from probability theory.
Let x(i,j) be the random variable which takes the value 1 when the ball-stick combination (i,j) occurs, and 0 when it does not occur. So the number of distinct ball-stick color pairs is just the sum of all the x(i,j). The expectation of a sum of random variables is the sum of their expectations. The expectation of x(i,j) is just the probability that the pair (i,j) occurs, which is easy to compute: it is
P[(i,j) occurs] = 1 - P[(i,j) does not occur]
= 1 - P[ each of the q_j sticks gets a ball of some color other than i ]
= 1 - (N-p_i)(N-1-p_i)...(N-(q_j-1) - p_i)/(N(N-1)...(N-(q_j-1))
= 1 - (N-p_i)!(N-q_j)!/N!(N - q_j - p_i)!
Here the final expression in terms of factorials is only valid if q_j + p_i is less than or equal to N; in the case where it is not, the probability is 1, of course.
Now just sum over all pairs (i,j) to get the expectation you wanted. Of course this will only be practical if your problem is of a reasonably small size. If there are thousands of colors then you would be better off doing a Monte-Carlo simulation to estimate the expectation, or use the approximation that when all the p_i and q_j are small compared to N, then P[(i,j) occurs] is close to 1 - (1- (p_i/N))^q_j, which is well approximated by p_i q_j / N . So then the expectation is approximately the sum of p_i q_j /N, which is exactly N. You will notice that N is also the maximum possible number of pairs, so this approximation is saying that the expected number of repeats is small. In this case it might be of interest to know just how small, and you can work this out using an approximation to P[(i,j) occurs] correct up to second order in the p_i and up to second order in the q_j. The resulting approximation is valid when the q_j and p_i are small compared to the square root of N. It is:
E(number of distinct color pairs)
= N - (sum {p_i(p_i-1)})(sum {q_j(q_j-1)})/(2N^2)
To interpret this in another way, consider choosing a random stick and counting the number of other sticks of the same color. Let x be the expectation of this number. Do the same with the balls to define y. Then
E(number of distinct color pairs in a matching) = N - xy/2 + small error.
Compare this to a different model in which instead of choosing a matching, you start with blank sticks and balls and paint them, choosing the colors independently with the same probabilities as in your problem, i.e. the probability of choosing color i for a given ball is p_i/N and so on. You get the same first approximation for the expected number of distinct color pairs when all the probabilities are small. But in this model the exact formula for the expected number of distinct color pairs is the sum over all (i,j) of (1 - (1 - p_i q_j/N^2)^N).
Thanks very much for the detailed analysis. I was also doubting there exists a closed-form expression for the expected number of distinct pairs. Calculating it in a procedural fashion is possible, but seems to have very high complexity. Now let's relax the problem a little bit: is there an efficient way (polynomial time) to find the maximum and minimum number of distinct pairs obtainable under the constraints posed by those p_i, and q_j's?
thanks,
stephen
although I was initially looking for an analytic result, an efficient (polynomial time) algorithm will also suffice. Please post it as an Answer if you have such an algorithm.
Thanks,
stephen
Sadly my Conjecture is false, at least with regard to minimum values of R, as the following 2 by 3 contingency table shows:
2 0 0
1 1 1
Here p_1 = 2, p_2 = 3; q_1 = 3, q_2 = q_3 = 1.
Each of the three 2 by 2 subtables is minimal with respect to the number of nonzero entries, but the marginal totals would also be attained with one fewer nonzero entry overall in this way:
0 1 1
3 0 0
A pair of interchanges are needed to reduce the value of R.
It remains my unproven conjecture that R can be minimized by Method I if the p's and q's are taken in an order which maximizes the number of times that Case 2 occurs. However trying all possible arrangements is clearly a non-polynomial time algorithm, since there are s! * t! of these.
So, is there a polynomial time algorithm here, in effect to determine the optimal subgroupings of p's and q's with equal sums? In absolute terms I suspect not. Doing some searches on terms from the graph-theoretic formulation (bipartite, edge-packing), I came across a lengthy article, actually a book chapter, dealing with the classification of NP-hard problems that have polynomial time "near-optimal" approximations:
[Approximation algorithms for NP-hard optimization problems -- Klein and Young]
http://www.cs.ucr.edu/~neal/index/non_arxiv/crc-approx-algs/final.pdf
Even if the minimum R problem is NP-hard, it seems likely to be one with "nice" approximation algorithms. Indeed we already have bounds:
max(s,t) ? min(R) ? s+t-1
and a sketch of how to find a maximum number of subgroupings of p's and q's with equal sums is the following:
Note that if subset S' of S = {1,..,s} and subset T' of T = {1,..,t} are such that:
SUM p_i = SUM q_j
S' T'
then the same is true of SS' and TT'. Therefore we can restrict our search apriori to subtotals for p's and q's less than or equal to half of N.
Moreover if we are trying to form a maximum number of such subsets, it would make sense to search for subsets that are as small as possible, because the larger the subset the less flexibility that remains for finding additional "matching" subsets (other than the entire complements). Here we can note the special case of singletons p_i = q_j fits smoothly into the approach of looking for the smallest possible matching subsets, as pairing off two equal summands does not encumber at all the search for further matching subsets.
Consider then maintaining lists of sums less than a target, say SQRT(N), that can be attained by a subset of p's, resp. q's. As matching sums on the two lists are generated by incrementally increasing the number of terms allowed (eg. starting from single terms, adding pairs together, etc.), these are matched off.
Such a procedure of course makes quick work of the "counterexample" presented above, as in fact there is the singleton match p_2 = q_1 = 3 to remove.
regards, mathtalk-ga
edit