Consider two overlapping populations 1 and 2 with n1 people unique to population 1, n2 people unique to population 2, and s people in both populations (for a total population 1 size of n1 + s and total population 2 size of n2 + s). We then take a random sample of m1 people from population 1 (all combinations equally likely) and a random sample of m2 from population 2 (all combinations equally likely). Each sample is internally without replacement (i.e. the sample from population 1 can’t include the same person twice), but the same people can be chosen in both the sample from population 1 and the sample from population 2. We want the distribution of how many people will be in both the sample from population 1 and the sample from population 2, which we will call X2 (the subscript “2” will make sense shortly).
Looking first at the people sampled from population 1 (sample size m1), let X1 represent the number of people in the sample who are also in population 2. This is analogous to the distribution of how many white balls are drawn when taking m1 balls from an urn of s white balls and n1 black balls, so X1 ∼ HGeom(s, n1, m1). Then, X2 is just the number of people in the sample from the second population who were already sampled. That is, conditional on X1, we can treat the people already sampled (X1 of them) as the white balls and everyone else in population 2 (n2 + s − X1 people) as the black balls, and we then draw m2 balls (the sample size from population 2). Therefore, the number of overlaps is conditionally distributed as X2 | X1 ∼ HGeom(X1, n2 + s − X1, m2). We can then get the PMF of X2 by the law of total probability:
$$P(X_2=k)=\sum_{x_1}P(X_2=k|X_1=x_1)P(X_1=x_1)=\sum_{x_1=k}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{k}}{{n_2+s-x_1}\choose{m_2-k}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$
Here, we have just substituted in the hypergeometric PMFs, with a summation running from k (at least k people must be sampled from the intersection in sample 1 to have k twice-sampled people) to whichever of m1 or s is lower (the number sampled from the intersection in sample 1 cannot exceed either the total number of people sampled from population 1 or the total number of people in the intersection).
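As a quick illustration, here is a minimal sketch of this sum using base R's dhyper (the helper name p_x2 is ours, not part of the package):

# P(X2 = k) via the double-hypergeometric sum above.
# dhyper(x, w, b, n) gives the probability of drawing x white balls
# when sampling n without replacement from w white and b black balls.
p_x2 <- function(k, s, n1, n2, m1, m2) {
  x1 <- k:min(m1, s)
  sum(dhyper(k, x1, n2 + s - x1, m2) * dhyper(x1, s, n1, m1))
}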
This solution can be extended to the case of k overlapping populations with $n_i$ people in population i but not in the intersection of all the populations, and s people in the intersection of all the populations. Note that some people counted in $n_i$ might also be counted in $n_{i+1}$, etc.; they just cannot be in all of the populations, or they would be counted in s. As before, let $m_i$ people be sampled from population i, without replacement within each sample but with replacement between samples. Let Xk be the number of people chosen in all k samples. To build up to Xk, let X1 be the number of people from the intersection who appear in sample 1, X2 the number of those X1 people who appear in sample 2, X3 the number of those X2 people who appear in sample 3, and so on. Then, extending the law of total probability from earlier,
$$\begin{split} P(X_2=x_2) =& \sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1) \\ P(X_3=x_3) =& \sum_{x_2}P(X_3=x_3|X_2=x_2)P(X_2=x_2)\\ =&\sum_{x_2}P(X_3=x_3|X_2=x_2)\Big[\sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1)\Big]\\ &\vdots\\ P(X_k=x_k) =& \sum_{x_{k-1}}P(X_k=x_k|X_{k-1}=x_{k-1})\Big[\sum_{x_{k-2}}P(X_{k-1}=x_{k-1}|X_{k-2}=x_{k-2})\cdot\Big[\cdots\cdot\\ &\Big[\sum_{x_1}P(X_2=x_2|X_1=x_1)P(X_1=x_1)\Big]\cdots\Big]\Big]\\ =& \sum_{x_{k-1}=x_k}^{\textrm{min}(m_{k-1},s)}\frac{{{x_{k-1}}\choose{x_k}}{{n_k+s-x_{k-1}}\choose{m_k-x_k}}}{{{n_{k}+s}\choose{m_k}}}\cdot\Big[\sum_{x_{k-2}=x_k}^{\textrm{min}(m_{k-2},s)}\frac{{{x_{k-2}}\choose{x_{k-1}}}{{n_{k-1}+s-x_{k-2}}\choose{m_{k-1}-x_{k-1}}}}{{{n_{k-1}+s}\choose{m_{k-1}}}}\cdot\Big[\cdots\cdot\\ &\Big[\sum_{x_1=x_k}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}\Big]\cdots\Big]\Big] \end{split}$$
since $P(X_k=x_k)=0$ if any of $X_{k-1},\dots,X_1$ is less than $x_k$, which is why every lower summation limit can start at $x_k$.
While these sums of sums seem to pose an increasingly daunting computational task, two optimizations can speed the computation up significantly. First, the PMF calculation can be optimized by storing all the P(Xi = xi) at each level, because they are the same regardless of any level above them (i.e. P(X1 = x1) does not depend on the value x2 at which we evaluate P(X2 = x2)). Therefore, by storing P(X1 = x1) for x1 from 0 to min(m1, s), we can move to the P(X2 = x2) level without ever recomputing anything at the P(X1 = x1) level. The same applies between the P(X2 = x2) level and the P(X3 = x3) level, and so on.
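A minimal sketch of this level-by-level scheme, again using base R's dhyper (pmf_levels is a hypothetical helper, not the package's internal code; for brevity it stores each level's probabilities but omits the updating trick described next):

# Returns the vector of P(X_k = 0), ..., P(X_k = top), where n and m are the
# vectors of unique population sizes and sample sizes.
pmf_levels <- function(s, n, m) {
  top <- min(m[1], s)
  probs <- dhyper(0:top, s, n[1], m[1])  # level 1: X1 ~ HGeom(s, n1, m1)
  for (i in seq_along(n)[-1]) {
    new_top <- min(m[i], top)
    new_probs <- numeric(new_top + 1)
    for (x in 0:new_top) {
      prev <- x:top                      # X_{i-1} must be at least x
      new_probs[x + 1] <- sum(dhyper(x, prev, n[i] + s - prev, m[i]) * probs[prev + 1])
    }
    probs <- new_probs
    top <- new_top
  }
  probs
}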
Second, the calculation can be further optimized by storing all the $P(X_i=x_i|X_{i-1}=x_{i-1})$ and updating from $P(X_i=x_i|X_{i-1}=x_{i-1})$ to $P(X_i=x_i+1|X_{i-1}=x_{i-1})$. For example, looking at P(X2 = x2), we initially have the equation
$$P(X_2=x_2)=\sum_{x_1=x_2}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$
that we want to update to
$$P(X_2=x_2+1)=\sum_{x_1=x_2+1}^{\textrm{min}(m_1,s)}\frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{n_2+s}\choose{m_2}}}\cdot\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$$
Looking element-wise, we can keep the $\frac{{{s}\choose{x_1}}{{n_1}\choose{m_1-x_1}}}{{{n_1+s}\choose{m_1}}}$ term from P(X1 = x1) and just shift it as necessary to keep proper indexing. However, we need to solve
$$c\cdot\frac{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}{{{n_2+s}\choose{m_2}}}=\frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{n_2+s}\choose{m_2}}}$$
The solution is
$$c=\frac{{{x_1}\choose{x_2+1}}{{n_2+s-x_1}\choose{m_2-(x_2+1)}}}{{{x_1}\choose{x_2}}{{n_2+s-x_1}\choose{m_2-x_2}}}=\frac{(x_1-x_2)!x_2!}{(x_1-(x_2+1))!(x_2+1)!}\cdot\frac{(n_2+s-x_1-(m_2-x_2))!(m_2-x_2)!}{(n_2+s-x_1-(m_2-(x_2+1)))!(m_2-(x_2+1))!}$$
$$=\frac{(x_1-x_2)(m_2-x_2)}{(x_2+1)(n_2+s-x_1-(m_2-(x_2+1)))}$$
This holds for any level; that is, to get $P(X_i=x_i+1|X_{i-1}=x_{i-1})$ from $P(X_i=x_i|X_{i-1}=x_{i-1})$, we can always multiply $P(X_i=x_i|X_{i-1}=x_{i-1})$ by
$$\frac{(x_{i-1}-x_i)(m_i-x_i)}{(x_i+1)(n_i+s-x_{i-1}-(m_i-(x_i+1)))}$$
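As a sketch of that update (update_coef is a hypothetical helper; the package's internals may differ):

# Multiplying P(X_i = x | X_{i-1} = x_prev) by this coefficient steps it up to
# P(X_i = x + 1 | X_{i-1} = x_prev) without recomputing dhyper().
update_coef <- function(x_prev, x, n_i, s, m_i) {
  ((x_prev - x) * (m_i - x)) / ((x + 1) * (n_i + s - x_prev - (m_i - (x + 1))))
}

# e.g. starting from the exact value at x = 0 and walking upward:
# p <- dhyper(0, x_prev, n_i + s - x_prev, m_i)
# p <- p * update_coef(x_prev, 0, n_i, s, m_i)  # now equals dhyper(1, ...)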
Storing $P(X_i=x_i|X_{i-1}=x_{i-1})$ and $P(X_{i-1}=x_{i-1})$ and updating them element-wise in the sum rather than recomputing the hypergeometric distribution dramatically speeds up computation time. In the rare cases where $P(X_i=x_i|X_{i-1}=x_{i-1})$ is so low that R rounds it to 0, it is necessary to recompute the hypergeometric value directly to make sure we are not erroneously multiplying by 0 when updating the terms. However, once those terms become non-zero, the updating formula can again be applied. For more than two groups, this entire process is performed recursively (i.e. calculate P(X1 = x1), use that to find P(X2 = x2), use that to find P(X3 = x3), and so on).
The PMF of a conditional hypergeometric distribution is implemented recursively with the optimizations described above, plus one more: if only a single value is requested, the top layer ($P(X_k=x_k)$) calculates only that value rather than calculating $P(X_k=0)$ and applying the updating coefficient up to the requested value. The CDF is implemented by cumulatively summing the top layer up to the maximum input requested and then indexing into that stored vector to assign probabilities. The quantile function reverses the process of the CDF, calculating cumulative probabilities up to the maximum input probability requested and then assigning outputs by indexing into that cumulative probability vector. The random number generator generates Unif(0, 1) random numbers and plugs them into the quantile function.
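For example, a minimal sketch of that inversion step (r_from_cdf and its cdf argument are illustrative, not package internals):

# cdf[j] holds P(X_k <= j - 1); for each Unif(0, 1) draw, findInterval picks
# the smallest value whose cumulative probability covers the draw.
r_from_cdf <- function(nDraws, cdf) {
  u <- runif(nDraws)
  findInterval(u, cdf)
}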
The likelihood function of a conditional hypergeometric (the product of the observations' PMFs) with respect to the unknown overlap population size s is either monotonically decreasing or rises to a single peak and then decreases, and the same shapes arise for the likelihood with respect to ni or with respect to mi. In all of these cases, there is never a new global maximum after the first decrease, so the maximum likelihood estimator functions simply step the unknown parameter upward from its smallest feasible value and return the value reached just before the likelihood first decreases (a sketch of this search follows).
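A minimal sketch of that search under the stated shape assumption (mle_hillclimb and logLik are illustrative names, not package internals):

# Walk an integer parameter upward and stop at the first decrease in the
# (log-)likelihood; the previous value is then the MLE. Assumes the
# likelihood eventually decreases, per the shape claim above.
mle_hillclimb <- function(logLik, start = 0) {
  par <- start
  best <- logLik(par)
  repeat {
    nxt <- logLik(par + 1)
    if (nxt < best) return(par)  # first decrease: previous value is the MLE
    par <- par + 1
    best <- nxt
  }
}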
Using Adam's law, we can calculate the mean of a conditional hypergeometric distribution. First, we condition the highest-level overlap on the overlap one level below it and use the mean of a hypergeometric distribution:
$$E(X_k) = E(E(X_k|X_{k-1})) = E\Big(\frac{m_k \cdot X_{k-1}}{n_k+s}\Big) = \frac{m_k}{n_k+s}E(X_{k-1})$$
From here, we can repeat the process by conditioning on the overlap one further level down, continuing all the way to X1, the number of people from the intersection who appear in the first sample.
$$=\frac{m_k}{n_k+s}E(E(X_{k-1}|X_{k-2})) = ... = \frac{\prod_{i=2}^km_i}{\prod_{i=2}^k(n_i+s)}E(X_1)$$
Finally, using the mean of a hypergeometric, we can substitute in:
$$ = \frac{\prod_{i=2}^km_i}{\prod_{i=2}^k(n_i+s)}\frac{m_1\cdot{s}}{n_1+s}=\frac{\prod_{i=1}^km_i}{\prod_{i=1}^k(n_i+s)}\cdot{}s$$
Rearranging this can give estimators for either unknown mj or nj:
$$m_j=E(X_k)\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}\implies\hat{m_j}=\bar{X_k}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}$$ and
$$n_j+s=\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot{}s\cdot\frac{1}{E(X_k)}\implies\hat{n_j}=\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot\frac{s}{\bar{X_k}} - s$$
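In R, these rearrangements become one-liners (mom_m and mom_n are hypothetical helpers written for illustration; the package's momN, used later, follows the same idea):

# xbar is the sample mean of the observed overlaps X_k; j indexes the
# unknown sample size (mom_m) or unknown unique population size (mom_n).
mom_m <- function(xbar, j, s, n, m) xbar * prod(n + s) / (s * prod(m[-j]))
mom_n <- function(xbar, j, s, n, m) prod(m) / prod(n[-j] + s) * s / xbar - s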
The first estimator (for mj) is unbiased because $$E(\hat{m_j})-m_j=E\Big(\bar{X_k}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i}\Big) -m_j= E(\bar{X_k})\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j$$ $$=E(X_k)\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j = \frac{\prod_{i=1}^km_i}{\prod_{i=1}^k(n_i+s)}\cdot{}s\cdot\frac{\prod_{i=1}^k(n_i+s)}{s\cdot{}\prod_{i\neq{j}}m_i} - m_j = m_j-m_j = 0$$ where the second equality is because all the ni, the rest of the mi, and s are constants and the third equality is by linearity of expectation.
However, the second estimator (for nj) is biased because $$E(\hat{n_j})-n_j=E\Big(\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot\frac{s}{\bar{X_k}} - s\Big)-n_j = \frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot{}s\cdot{}E\Big(\frac{1}{\bar{X_k}}\Big)-s-n_j\geq$$ $$\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot{}s\cdot{}\frac{1}{E(\bar{X_k})}-s-n_j=\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot{}s\cdot{}\frac{1}{E(X_k)}-s-n_j$$ $$=\frac{\prod_{i=1}^km_i}{\prod_{i\neq j}(n_i+s)}\cdot{}s\cdot{}\frac{\prod_{i=1}^k(n_i+s)}{s\cdot\prod_{i=1}^km_i}-s-n_j=n_j+s-s-n_j=0$$ where the second equality is because all the mi, the rest of the ni, and s are constants, the inequality is by Jensen’s inequality with $\frac{1}{x}$ as a convex function, and the third equality is by the linearity of expectation.
The bias in $\hat{n_j}$ might seem to suggest using $\sum_{i=1}^n\frac{1}{X_{ki}}$ in the estimator rather than $\frac{1}{\sum_{i=1}^n{X_{ki}}}$, where $X_{ki}$ is the i-th observation of Xk. However, that expression evaluates to ∞ if any of the observed overlaps are 0, which is commonly the case. Thus, caution is advised when using either estimator.
Suppose there were n1 = 300 people unique to population 1, n2 = 500 people unique to population 2, s = 50 people in both population 1 and population 2, m1 = 56 people sampled from population 1, and m2 = 14 people sampled from population 2. We can find the probability of getting 1 observed overlap as follows:
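The call would presumably look like the following, assuming the PMF function dchyper shares the argument order of rchyper shown below:

dchyper(1, 50, c(300, 500), c(56, 14))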
(Large sample sizes (over 10,000 or so) can take a long time, so “Finished 1 level…” lets you know every time the function finishes finding all the probabilities for a given level. You can suppress this with verbose = F, as we will do for the rest.) If we wanted to visualize drawing from these populations with the same sample sizes 10,000 times and see what proportion of the draws were actually 1, we could run:
set.seed(1)
draws <- rchyper(10000, 50, c(300, 500), c(56, 14), verbose = F)
hist(draws, breaks = seq(0,5,1) - 0.5)
Likewise, if we want the probability we see 1 or fewer overlaps, we can run
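presumably via the CDF function, assuming pchyper mirrors rchyper's argument order:

pchyper(1, 50, c(300, 500), c(56, 14), verbose = F)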
which lines up with what we observe by simulation:
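for instance, the empirical proportion computed from the stored draws:

mean(draws <= 1)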
Maybe we want to compare simulated and exact medians:
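e.g., assuming a quantile function qchyper with the same argument order:

median(draws)
qchyper(0.5, 50, c(300, 500), c(56, 14), verbose = F)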
Looks like more than half the values are 0! Maybe we observe 1 overlap and want to check the chance that we would see 1 or more overlaps from this distribution just due to chance. We can find this with:
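presumably via the complement of the CDF at 0, something like:

1 - pchyper(0, 50, c(300, 500), c(56, 14), verbose = F)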
If we only saw the outputs and didn’t know what our overlap size was, we could find the maximum likelihood estimator with
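mirroring the mleS call shown below (output omitted here):

mleS(draws, c(300, 500), c(56, 14), F)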
Not bad for 10,000 samples! (The original was 50.) If we bump up the number of draws taken, we can get even more precise:
set.seed(1)
draws <- rchyper(10^5, 50, c(300, 500), c(56, 14), F)
mleS(draws, c(300, 500), c(56, 14), F)
#> [1] 50
Spot on! Now, if we didn’t know the size of sample 1, we could run
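presumably something like the following, assuming mleM parallels the mleN call shown later:

mleM(1, draws, 50, c(300, 500), c(0, 14), F)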
where the 1 specifies we want the first sample and the 0 in the last parameter could be any placeholder integer. Maybe instead we wanted a method of moments estimate of the sample size:
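again assuming momM parallels the momN call shown later:

momM(1, draws, 50, c(300, 500), c(0, 14))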
Now, say we wanted to know the unique size of population 1. To find this, we can run:
mleN(1, draws, 50, c(0, 500), c(56, 14), F)
#> [1] 299
momN(1, draws, 50, c(0, 500), c(56, 14))
#> [1] 298.9827
where the 1 specifies the first population and the 0 in the population parameter could be any placeholder integer.
The most common use for this package will likely be in null distributions: finding the probability that you would see the observed number of overlaps or more just due to chance. For example, if you had two programs that were supposed to identify all the colors in a painting and very few of the colors they identified overlapped, you might use this to check whether the observed number of colors that did overlap was greater than would be expected if each program were pulling random colors out of its database.
However, this package could also be used with census statistics. For example, imagine there were two overlapping school districts such that administrators knew how many children lived in both and how many children lived only in one or the other. Then imagine each district had a random lottery admissions process for their magnet schools. This package could give the distribution of how many children are likely to be admitted into both districts’ magnet schools.
The estimators implemented in this package could be useful for estimating unknown initial population parameters or sample sizes, but in practice it is unlikely that every unique population size except one would be known, or that exactly one sample size would be unknown. In reality, the most likely use for these estimators is in making statements like, “So many overlaps were observed that the data would actually best fit a model with a sample size of [insert some big number],” when all population parameters and sample sizes are known and a null hypothesis has been rejected.
I’d like to thank Srihari Ganesh and Skyler Wu for their help in reviewing and improving the two-population case; Max Li for his help in optimizing the PMF calculation; and Meghan Short and Kelsey Thompson for their help in reviewing the math behind the arbitrary-population-number extension.