Guide

Conditional hypergeometric story

Consider two overlapping populations 1 and 2 with n1 people unique to population 1, n2 people unique to population 2, and s people in both populations (for a total population 1 size of n1 + s and total population 2 size of n2 + s). We then take a random sample of m1 people from population 1 (all combinations equally likely) and a random sample of m2 from population 2 (all combinations equally likely). Each sample is internally without replacement (i.e. the sample from population 1 can’t include the same person twice), but the same people can be chosen in both the sample from population 1 and the sample from population 2. We want the distribution of how many people will be in both the sample from population 1 and the sample from population 2, which we will call X2 (the subscript “2” will make sense shortly).

Conditional hypergeometric PMF for 2 populations

Looking first at the people sampled from population 1 (sample size m1), let X1 represent the number of people in the sample who are also in population 2. This is the same as the distribution of how many white balls are chosen when taking m1 balls from an urn of s white balls and n1 black balls, so X1 ∼ HGeom(s, n1, m1). Then, X2 is simply the number of people in the sample from the second population who were already sampled from the first. That is, conditional on X1, we can treat the X1 people already sampled as the white balls and everyone else in population 2 (n2 + s − X1 people) as the black balls, and we then draw m2 balls (the sample size from population 2). Therefore, the number of overlaps X2 is distributed as X2 ∼ HGeom(X1, n2 + s − X1, m2). We can then get the PMF of X2 by the law of total probability: P(X2 = k) = ∑x1P(X2 = k|X1 = x1)P(X1 = x1)$$=\sum_{x_1=k}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{k}}{{n_2+s-x_1}\choose{m_2-k}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$ Here, we have simply substituted in the hypergeometric PMFs, with a summation from k (at least k people must be sampled from the intersection in sample 1 to have k twice-sampled people) to whichever of m1 or s is lower (the number sampled from the intersection in sample 1 cannot exceed either the total number of people sampled from population 1 or the total number of people in the intersection).
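To make the two-population formula concrete, it can be evaluated directly with base R's dhyper (a minimal sketch for illustration; the helper name pmf2 is not part of the package, whose optimized dchyper is shown in the Examples section):

# P(X2 = k): sum over x1 of P(X2 = k | X1 = x1) * P(X1 = x1)
pmf2 <- function(k, s, n1, n2, m1, m2) {
  if (k > min(m1, m2, s)) return(0)
  x1 <- k:min(m1, s)                        # possible first-level overlaps
  inner <- dhyper(k, x1, n2 + s - x1, m2)   # P(X2 = k | X1 = x1)
  outer <- dhyper(x1, s, n1, m1)            # P(X1 = x1)
  sum(inner * outer)
}
pmf2(1, s = 50, n1 = 300, n2 = 500, m1 = 56, m2 = 14)  # should match dchyper(k = 1, ...) in the Examples section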

Conditional hypergeometric extension to arbitrarily many populations

This solution can be extended to the case of k overlapping populations with ni people in population i who are not in the intersection of all the populations and s people in the intersection of all the populations. Note that some people counted in ni might also be counted in nj for another population j; they just cannot be in every population, or they would be counted in s instead. As before, let mi people be sampled from population i, without replacement within each sample but with replacement between the samples from different populations. Let Xk be the number of people chosen in all k samples. Then, let X1 be the number of people from s who appear in the sample of m1, X2 the number of those X1 people who also appear in the sample of m2, X3 the number of those X2 people who also appear in the sample of m3, and so on. Then, extending the law of total probability from earlier,

$$\begin{split} P(X_2=x_2) =& \sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1) \\ P(X_3=x_3) =& \sum_{x_2}P(X_3=x_3|X_2=x_2)P(X_2=x_2)=\\ &\sum_{x_2}P(X_3=x_3|X_2=x_2)\Big[\sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1)\Big]\\ ...\\ P(X_k=k) =& \sum_{x_{k-1}}P(X_k=k|X_{k-1}=x_{k-1})\Big[\sum_{x_{k-2}}P(X_{k-1}=x_{k-1}|X_{k-2}=x_{k-2})\cdot\Big[...\cdot\\ &\Big[\sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1)\Big]...\Big]\Big]\\ =& \sum_{x_{k-1}=k}^{\textrm{min}(m_{k-1},s)}\frac{{{x_{k-1}}\choose{k}}{{n_k+s-x_{k-1}}\choose{m_k-k}}}{{{n_{k}+s}\choose{m_k}}}\cdot\Big[\sum_{x_{k-2}=k}^{\textrm{min}(m_{k-2},s)}\frac{{{x_{k-2}}\choose{x_{k-1}}}{{n_{k-1}+s-x_{k-2}}\choose{m_{k-1}-x_{k-1}}}}{{{n_{k-1}+s}\choose{m_{k-1}}}}\cdot\Big[...\cdot\\ &\Big[\sum_{x_1=k}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}\Big]...\Big]\Big] \end{split}$$

since P(Xk = k) = 0 if any of Xk − 1, ..., X1 are less than k.

Computational optimizations

While these sums of sums seem to pose an increasingly daunting computational task, two optimizations can speed this computation up significantly. First, the PMF calculation can be optimized by storing all the P(Xi = xi) at each level, because they are the same regardless of any level above them (i.e. P(X1 = x1) does not depend on the value of k in P(X2 = k)). Therefore, by storing P(X1 = x1) for x1 from 0 to min(m1, s), we can move to the P(X2 = x2) level without ever needing to recompute anything at the P(X1 = x1) level. This can be repeated with the P(X2 = x2) level versus the P(X3 = x3) level, and so on.
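In code, this optimization amounts to carrying a vector of level-i probabilities forward to level i + 1 instead of re-expanding the nested sums. A rough sketch of the idea (not the package's actual implementation), again leaning on base R's dhyper:

pmf_by_levels <- function(s, n, m) {
  # level 1: P(X_1 = x_1) is an ordinary hypergeometric
  lower_support <- 0:min(m[1], s)
  probs <- dhyper(lower_support, s, n[1], m[1])
  # each later level reuses the stored vector from the level below
  for (i in seq_along(m)[-1]) {
    upper_support <- 0:min(m[i], s)
    probs <- sapply(upper_support, function(xi) {
      sum(dhyper(xi, lower_support, n[i] + s - lower_support, m[i]) * probs)
    })
    lower_support <- upper_support
  }
  probs  # P(X_k = 0), ..., P(X_k = min(m_k, s))
}
pmf_by_levels(50, c(300, 500), c(56, 14))[2]  # P(X_2 = 1)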

Second, the calculation can be further optimized by storing all the P(Xi = xi|Xi − 1 = xi − 1) and updating from P(Xi = xi|Xi − 1 = xi − 1) to P(Xi = xi + 1|Xi − 1 = xi − 1). For example, looking at P(X2 = x2), we initially have the equation $$P(X_2=x_2)=\sum_{x_1=x_2}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$ that we want to update to $$P(X_2=x_2+1)=\sum_{x_1=x_2+1}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$ Looking element-wise, we can keep the $\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$ term from P(X1 = x1) and just shift that as necessary to keep proper indexing. However, we need to solve $$c\cdot\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}} = \frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{n_2+s}\choose{m_2}}}$$
The solution is $$c=\frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}=$$$$\frac{(x_1-x_2)!x_2!}{(x_1-(x_2+1))!(x_2+1)!}\cdot\frac{(n_2+s-x_1-(m_2-x_2))!(m_2-x_2)!}{(n_2+s-x_1-(m_2-(x_2+1)))!(m_2-(x_2+1))!}$$ $$=\frac{(x_1-x_2)(m_2-x_2)}{(x_2+1)(n_2+s-x_1-(m_2-(x_2+1)))}$$ This holds at any level; that is, to get P(Xi = xi + 1|Xi − 1 = xi − 1) from P(Xi = xi|Xi − 1 = xi − 1) we can always multiply P(Xi = xi|Xi − 1 = xi − 1) by $$\frac{(x_{i-1}-x_i)(m_i-x_i)}{(x_i+1)(n_i+s-x_{i-1}-(m_i-(x_i+1)))}$$ Storing P(Xi = xi|Xi − 1 = xi − 1) and P(Xi − 1 = xi − 1) and updating them element-wise in the sum, rather than recomputing the hypergeometric PMF each time, dramatically reduces computation time. In the rare cases where P(Xi = xi|Xi − 1 = xi − 1) is so low that R rounds it to 0, it is necessary to recompute the hypergeometric value to make sure we are not erroneously trying to multiply by 0 when updating the terms. However, once those terms become non-zero, the updating formula can be applied from then on. For more than two groups, this entire process is performed recursively (i.e. calculate P(X1 = x1), use that to find P(X2 = x2), use that to find P(X3 = x3), and so on).
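The update coefficient is cheap to compute and easy to sanity-check against the ratio of two dhyper calls (a quick sketch in the same notation; the point values are arbitrary):

# multiplicative step from P(X_i = x_i | X_{i-1} = x_prev) to P(X_i = x_i + 1 | X_{i-1} = x_prev)
update_coef <- function(x_prev, x_i, n_i, s, m_i) {
  (x_prev - x_i) * (m_i - x_i) /
    ((x_i + 1) * (n_i + s - x_prev - (m_i - (x_i + 1))))
}
x_prev <- 5; x_i <- 2; n_i <- 500; s <- 50; m_i <- 14
update_coef(x_prev, x_i, n_i, s, m_i)
dhyper(x_i + 1, x_prev, n_i + s - x_prev, m_i) /
  dhyper(x_i, x_prev, n_i + s - x_prev, m_i)      # should agree with the line above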

PMF, CDF, quantile function, random number generator

The PMF of a conditional hypergeometric distribution is implemented recursively with the optimizations described above, with the additional optimization that if only one value is requested, the top layer (P(Xk = k)) calculates only that value rather than calculating P(Xk = 0) and applying the updating coefficient up to P(Xk = k). The CDF of a conditional hypergeometric distribution is implemented by cumulatively summing the top layer (P(Xk = k)) as above up to the maximum input requested and then indexing into that stored vector to assign probabilities. The quantile function reverses the process of the CDF, calculating new probabilities up to the maximum input probability requested and then assigning outputs by indexing into that cumulative probability vector. The random number generator generates Unif(0, 1) random numbers and plugs them into the quantile function (inverse-transform sampling).
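For example, the random number generator's inversion step can be mimicked directly with the quantile function (a sketch using the same qchyper arguments as in the Examples section; rchyper itself is the supported way to do this):

set.seed(1)
u <- runif(5)                                         # Unif(0, 1) draws
sapply(u, function(p) qchyper(p, 50, c(300, 500), c(56, 14), F))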

Maximum likelihood estimation

The likelihood function of a conditional hypergeometric (the product of the observations’ PMFs) with respect to the unknown overlap population size s is either

  • Monotonically decreasing from the largest observed overlap if the overlaps are small because the overlapping population must be at least as large as the largest observed overlap but any larger makes it unlikely to see such small observed overlaps
  • Or unimodal because the optimal value of s is large enough that the samples pick from the overlapping population s rather than the unique populations but not so large that the samples pick different people in the overlapping population

The likelihood function with respect to ni is

  • Monotonically increasing if there are no observed overlaps (because an infinite population size minimizes the chance of drawing from the finite overlapping region)
  • Unimodal if the observed overlaps are positive and moderate because there is an optimal value between a large ni where large observed overlaps would be unlikely and a low ni where low observed overlaps would be unlikely
  • Or monotonically decreasing from ni = mi − s if the observed overlaps are very high, because that is the smallest unique population size that still allows a sample of size mi to be taken while maximizing the chance that the sample is drawn from the overlapping population rather than the unique population

The likelihood function with respect to mi is

  • Monotonically decreasing from the largest observed overlap if the observed overlaps are low because the sample size must be at least as big as the largest observed overlap but any bigger of a sample size makes the low observed overlaps unlikely
  • Unimodal if the observed overlaps are positive and moderate because there is an optimal value between a large mi, where the small observed overlaps would be unlikely, and a low mi (though still at least the maximum observed overlap), where the large observed overlaps would be unlikely
  • Or monotonically increasing up to mi = s + ni, beyond which the likelihood is 0 (a sample cannot exceed the population size), if the observed overlaps are very large, because the largest possible sample allows as many overlaps as possible to occur

However, in all of these cases, there is never a global maximum after the first decrease, so the maximum likelihood estimator functions do the following:

  • Deal with monotonically increasing likelihoods by returning infinity (the likelihood increases without bound) for the population size or s + ni (the maximum possible sample size) for the sample size
  • Deal with unimodal likelihoods by progressively calculating log likelihoods and returning the index before the first decrease (see the sketch after this list)
  • Deal with monotonically decreasing likelihoods by returning the first valid number (e.g. the overlapping population size cannot be less than 2 if one of the observed overlaps is 2)
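For the unimodal and monotonically decreasing cases, a bare-bones version of this scan could look like the following sketch (the helper name mle_s_scan is illustrative only, the monotonically increasing case would need separate handling, and the package's mleS covers all three shapes):

mle_s_scan <- function(obs, n, m) {
  # log likelihood of the observed overlaps for a candidate intersection size s
  loglik <- function(s) {
    sum(sapply(obs, function(k) log(dchyper(k = k, s = s, n = n, m = m, verbose = F))))
  }
  s <- max(obs)      # s can be no smaller than the largest observed overlap
  best <- loglik(s)
  repeat {
    nxt <- loglik(s + 1)
    if (nxt < best) return(s)   # stop at the first decrease
    s <- s + 1
    best <- nxt
  }
}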

Method of moments estimators

Using Adam’s law, we can calculate the mean of a conditional hypergeometric distribution. First, we condition the highest-level overlap on the prior overlap and use the mean of a hypergeometric distribution:

$$E(X_k) = E(E(X_k|X_{k-1})) = E\Big(\frac{m_k \cdot X_{k-1}}{n_k+s}\Big) = \frac{m_k}{n_k+s}E(X_{k-1})$$

From here, we can repeat the process by conditioning on the overlap prior to the prior overlap, and we repeat this all the way to X1, the number of items from the first population in the overlapping space.

$$=\frac{m_k}{n_k+s}E(E(X_{k-1}|X_{k-2})) = ... = \frac{\prod_{i=2}^km_i}{\prod_{i=2}^k(n_i+s)}E(X_1)$$

Finally, using the mean of a hypergeometric, we can substitute in:

$$ = \frac{\prod_{i=2}^km_i}{\prod_{i=2}^k(n_i+s)}\frac{m_1\cdot{s}}{n_1+s}=\frac{\prod_{i=1}^km_i}{\prod_{i=1}^k(n_i+s)}\cdot{}s$$
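As a quick sanity check, the closed-form mean can be compared with the empirical mean of simulated draws (a sketch using the same parameters as the Examples section):

s <- 50; n <- c(300, 500); m <- c(56, 14)
prod(m) / prod(n + s) * s            # theoretical E(X_k), roughly 0.204 here
set.seed(1)
mean(rchyper(10^4, s, n, m, F))      # should land close to the value above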

Rearranging the mean formula gives estimators for either an unknown mj or an unknown nj:

$$m_j=E(X_k)\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}\implies\hat{m_j}=\bar{X_k}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}$$ and

$$n_j+s=\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot{}s\cdot\frac{1}{E(X_k)}\implies\hat{n_j}=\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot\frac{s}{\bar{X_k}} - s$$

The first estimator (for mj) is unbiased because $$E(\hat{m_j})-m_j=E\Big(\bar{X_k}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}\Big) -m_j= E(\bar{X_k})\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j$$ $$=E(X_k)\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j = \frac{\prod_{i=1}^km_i}{\prod_{i=1}^k(n_i+s)}\cdot{}s\cdot\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j = m_j-m_j = 0$$ where the second equality is because all the ni, the rest of the mi, and s are constants and the third equality is by linearity of expectation.

However, the second estimator (for nj) is biased because $$E(\hat{n_j})-n_j=E\Big(\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot\frac{s}{\bar{X_k}} - s\Big)-n_j = \frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot{}s\cdot{}E\Big(\frac{1}{\bar{X_k}}\Big)-s-n_j\geq$$ $$\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot{}s\cdot{}\frac{1}{E(\bar{X_k})}-s-n_j=\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot{}s\cdot{}\frac{1}{E(X_k)}-s-n_j$$ $$=\frac{\prod_{i=1}^km_i}{\prod_{i\neq{j}}^k(n_i+s)}\cdot{}s\cdot{}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot\prod_{i=1}^km_i}-s-n_j=n_j+s-s-n_j=0$$ where the second equality is because all the mi, the rest of the ni, and s are constants, the inequality is by Jensen’s inequality with $\frac{1}{x}$ as a convex function, and the third equality is by the linearity of expectation.

The bias in $\hat{n_j}$ seems to suggest using $\sum_{i=1}^n\frac{1}{X_{ki}}$ in the estimator rather than $\frac{1}{\sum_{i=1}^n{X_{ki}}}$, where the Xki are the individual observations of Xk. However, that expression will evaluate to infinity if any of the observed overlaps are 0, which is commonly the case. Thus, caution is advised when using either estimator.
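To make the estimator formulas concrete, both can be computed by hand from simulated overlaps (a sketch; the package's momM and momN wrap the same algebra):

s <- 50; n <- c(300, 500); m <- c(56, 14)
set.seed(1)
draws <- rchyper(10^5, s, n, m, F)
xbar <- mean(draws)
m1_hat <- xbar * prod(n + s) / (s * m[2])         # method-of-moments estimate of m_1
n1_hat <- prod(m) * s / ((n[2] + s) * xbar) - s   # method-of-moments estimate of n_1
c(m1_hat, n1_hat)                                 # should be near 56 and 300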

Examples

Suppose there were n1 = 300 people unique to population 1, n2 = 500 people unique to population 2, s = 50 people in both population 1 and population 2, m1 = 56 people sampled from population 1, and m2 = 14 people sampled from population 2. We can find the probability of getting 1 observed overlap as follows:

dchyper(k = 1, s = 50, n = c(300, 500), m = c(56, 14))
#> Finished 1 level...
#> [1] 0.1692956

(Large sample sizes (over 10,000 or so) can take a long time, so “Finished 1 level…” lets you know every time the function finishes finding all the probabilities for a certain probability level. You can suppress this with “verbose = F” as we will do for the rest.) If we wanted to visualize drawing from these populations with the same sample sizes 10,000 times and see what proportion of the draws were actually 1, we could run:

set.seed(1)
draws <- rchyper(10000, 50, c(300, 500), c(56, 14), verbose = F)
hist(draws, breaks = seq(0,5,1) - 0.5)

print(sum(draws==1)/length(draws))
#> [1] 0.1772

Likewise, if we want the probability we see 1 or fewer overlaps, we can run

pchyper(1, 50, c(300, 500), c(56, 14), F)
#> [1] 0.9833006

which lines up with what we observe by simulation:

print(sum(draws<=1)/length(draws))
#> [1] 0.9836

Maybe we want to compare simulated and exact medians:

print(sort(draws)[length(draws)/2])
#> [1] 0
qchyper(0.5, 50, c(300, 500), c(56, 14), F)
#> [1] 0

Looks like more than half the values are 0! Maybe we observe 1 overlap and want to check the chance that we would see 1 or more overlaps from this distribution just due to chance. We can find this with:

pvalchyper(1, 50, c(300, 500), c(56, 14), "upper", F)
#> [1] 0.1859951

If we only saw the outputs and didn’t know the size of the overlapping population, we could find the maximum likelihood estimate with

mleS(draws, c(300, 500), c(56, 14), F)
#> [1] 52

Not bad for 10,000 samples! (The original was 50.) If we bump up the number of draws taken, we can get an even more precise estimate:

set.seed(1)
draws <- rchyper(10^5, 50, c(300, 500), c(56, 14), F)
mleS(draws, c(300, 500), c(56, 14), F)
#> [1] 50

Spot on! Now, if we didn’t know the size of sample 1, we could run

mleM(1, draws, 50, c(300, 500), c(0, 14), F)
#> [1] 56

where the 1 specifies we want the first sample and the 0 in the last parameter could be any placeholder integer. Maybe instead we wanted a method of moments estimate of the sample size:

momM(1, draws, 50, c(300, 500), c(0, 14))
#> [1] 56.16325

Now, say we wanted to know the unique size of population 1. To find this, we can run:

mleN(1, draws, 50, c(0, 500), c(56, 14), F)
#> [1] 299
momN(1, draws, 50, c(0, 500), c(56, 14))
#> [1] 298.9827

where the 1 specifies the first population and the 0 in the population parameter could be any placeholder integer.

Applications

The most common use for this package will likely be in null distributions: finding the probability that you would see the observed number of overlaps or more just due to chance. For example, if you had two programs that were supposed to identify all the colors in a painting and very few of the colors they identified overlapped, you might use this to check whether the observed number of overlapping colors was more than would be expected if each program were pulling random colors out of its database.

However, this package could also be used with census statistics. For example, imagine there were two overlapping school districts such that administrators knew how many children lived in both and how many children lived only in one or the other. Then imagine each district had a random lottery admissions process for their magnet schools. This package could give the distribution of how many children are likely to be admitted into both districts’ magnet schools.

The estimators implemented in this package could be useful for estimating unknown initial population parameters or sample sizes, but in practice it is unlikely that every parameter except a single unique population size, or a single sample size, would be known. In reality, the most likely use for these is in making statements like, “So many overlaps were observed that the data would actually best fit a model with a sample size of [insert some big number],” when all population parameters and sample sizes are known and a null hypothesis has been rejected.

Thank-yous

I’d like to thank Srihari Ganesh and Skyler Wu for their help in reviewing and improving the two-population case; Max Li for his help in optimizing the PMF calculation; and Meghan Short and Kelsey Thompson for their help in reviewing the math behind the arbitrary-population-number extension.