$\begingroup$

I am looking for a practical(!) way to create many different random partitions of a large set, and then to efficiently identify which part of a given partition some elements of the set belong to.

Details

$\Omega$ is a Cartesian product of $i=1,\ldots,d$ finite sets $\Omega_i=\{1,\ldots,k_i\}$ and $ S\subset\Omega$ a subset. I would like to

  1. Create random partitions of $\Omega$ into parts $A_1, \ldots,A_m$ (i.e. the sets are mutually disjoint and their union is $\Omega$), where each part has roughly the same number of elements.
  2. Determine for each element $x\in S$ the number $j$ such that $x\in A_j.$

A natural way to implement this is to find functions $f_t:\Omega\rightarrow \{1,\ldots,m\}$ which are easy to evaluate and have the property that the pre-images $f_t^{-1}(j)$ have roughly equal size. One example of functions with exactly these properties is the projections $\pi_i:\Omega\rightarrow \Omega_i.$ My problem is that there are only $d$ of those projections, but I need many $f_t$.

I am not looking for theoretical solutions but for practical, implementable ones. The big issue is the large size of $\Omega$ due to the combinatorial explosion of elements. To give a rough idea of the orders of magnitude: the number of factors is maybe in the double digits, as is the cardinality of the factors $\Omega_i$; so $\Omega$ is huge (say $>10^{20}$), but $S$ is much smaller, e.g. in the hundreds of thousands. Partitions with $m=2$ are completely sufficient; larger $m$ is just nice to have. I would like to be able to generate maybe 1,000 to 10,000 functions $f_t$. Functions with nearby indices $t$ should not be strongly correlated.

All randomness is understood with respect to the counting measure, i.e. where probability of $ A\subset\Omega$ is the number of elements in $A$ divided by the number of elements in $\Omega$.

Questions

  • Are hash functions good candidates for my $f_t$?
  • What family of hash functions would be best suited?
  • And how would I generate those hash functions?

Or is there another, better approach to the problem?

$\endgroup$
  • $\begingroup$ While iterating over $S$, why not just randomly assign to the $A_i$? $\endgroup$ Commented Jul 11 at 14:51
  • $\begingroup$ In brief, because I do not know what probabilities to use for the random assignment. The intersections of $S$ with general equal-sized partitions are supposed to reveal structural features of $S$ and should enable me to compare various sets $S_1$, $S_2$, ... . Roughly speaking, if the intersections of many equal-sized partitions with two sets $S_1$ and $S_2$ are very similar, then the sets are very similar. $\endgroup$ Commented Jul 11 at 15:13

2 Answers

$\begingroup$

Update: Map every element $\omega = [w_1, w_2, \ldots, w_d] \in \Omega$ to a random vector by taking, for each of the $d$ components, the $w_i$-th value of an efficient pseudorandom function. The same starting seed value can be used for each application.

The random vector can take any integral value in a hypercube with sides of length, say, 255 for one random byte. Now sum the components. If the sum is even, assign to $A_1$; otherwise to $A_2$. This extends to $m$ classes by checking what the sum is congruent to modulo $m$.

We want to be sure that, given distinct $\omega_1, \omega_2 \in \Omega$, sharing one (or more) components is not correlated with sharing a class. This holds because, conditional on the value of one component, the element can still land in any class, depending on whether the sum of the remaining components is even or odd.

This requires on the order of $d \cdot k_i$ operations, which is feasible. By picking different seed values we get many distinct functions.
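A minimal Python sketch of this idea, in which `blake2b` stands in for the pseudorandom function and the seed handling and function names are illustrative, not part of the answer:

```python
import hashlib

def make_partition_fn(seed, m=2):
    """Return an f_t: Omega -> {0, ..., m-1} built from a keyed hash.

    Each component value w_i is hashed together with its position i and
    the seed, the per-component hashes are summed, and the sum is taken
    modulo m. Distinct seeds give practically independent partitions.
    """
    def f(omega):
        total = 0
        for i, w in enumerate(omega):
            h = hashlib.blake2b(
                f"{seed}:{i}:{w}".encode(), digest_size=8
            ).digest()
            total += int.from_bytes(h, "big")
        return total % m
    return f

# Generate many partition functions from distinct seeds.
fs = [make_partition_fn(seed=t, m=2) for t in range(1000)]
```

Evaluating one $f_t$ on one element costs $d$ hash calls, so classifying a set $S$ of a few hundred thousand elements under thousands of functions stays practical.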

The original answer involved indexing all of $\Omega$ with an index $j$ and using the $j$-th value of a random sequence, but as noted in the comments this is computationally infeasible.

$\endgroup$
  • $\begingroup$ I do not think it is that easy. The size of $\Omega$ can easily be larger than $10^{20}$. There is no way to assign the elements of $\Omega$. You need a way to select the elements of $S$. $\endgroup$ Commented Jul 10 at 22:09
  • $\begingroup$ I see, I have a clarifying question now though. You mention the projections $\pi_i$ would work if there were many of them. Can we just extend that idea by using a PRNG to pick partitions of $\Omega_1$ only and then assign using that? And do the same for other $\Omega_i$ as needed? Or is it important that the random partition be random with respect to all the $\Omega_i$? $\endgroup$ Commented Jul 11 at 2:27
  • $\begingroup$ I fear the point is exactly the randomness with respect to all possible partitions. $\endgroup$ Commented Jul 11 at 7:48
  • $\begingroup$ I edited my question to make the aspect of practicability more clear. $\endgroup$ Commented Jul 11 at 7:53
  • $\begingroup$ I updated my answer and believe it addresses both these issues. $\endgroup$ Commented Jul 12 at 2:54
$\begingroup$

For each of the $d$ factors, choose $n$ random permutations of $\Omega_i=\{1,\ldots,k_i\}$. Wikipedia describes good algorithms (e.g. Fisher–Yates shuffling) for generating such random permutations efficiently. We will construct $dn$ functions $f_t$ of the kind that you desire. Since $n$ can be as large as $k_i!$, there is no shortage of functions.

We construct the $dn$ random partitions as follows: choose a factor $j$, take the projection $\pi_j:\Omega\rightarrow \Omega_j$, choose a permutation $\phi_\ell$ of $\Omega_j=\{1,\ldots,k_j\}$ and apply $\phi_\ell$ to this projection. In other words, for $\omega \in \Omega$ we compute $k=\phi_\ell(\pi_j(\omega))$. Clearly $k \in \{1,\ldots,k_j\}$. If $k<k_j/m$, we assign $\omega$ to $A_1$; if not but $k<2k_j/m$, we assign $\omega$ to $A_2$; and so on. All these computations are quite easy, and the $dn$ permutations do not require excessive storage.
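A short Python sketch of this single-factor procedure; the function name is illustrative, and a floor-based block formula is used so that the $m$ parts come out as balanced as possible when $m$ does not divide $k_j$ evenly:

```python
import random

def make_projection_partition(k, j, m, rng):
    """Build one f_t from factor j: permute Omega_j = {1, ..., k[j]}
    and split the permuted values into m roughly equal blocks.

    k   -- list of factor sizes [k_1, ..., k_d]
    j   -- 0-based index of the chosen factor
    m   -- number of parts A_1, ..., A_m (returned as 1, ..., m)
    rng -- a random.Random instance (different seeds -> different f_t)
    """
    phi = list(range(1, k[j] + 1))
    rng.shuffle(phi)                       # random permutation phi_l
    perm = {v: phi[v - 1] for v in range(1, k[j] + 1)}

    def f(omega):
        kk = perm[omega[j]]                # phi_l(pi_j(omega))
        return (kk - 1) * m // k[j] + 1    # part index in {1, ..., m}
    return f
```

Storing one permutation costs $k_j$ integers, so thousands of such functions fit easily in memory.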

If you want to use the projections on all the $\Omega_j$ as indicated in your comment, we can modify the procedure as follows for the case $m=2$. For each factor $j$, take the projection $\pi_j:\Omega\rightarrow \Omega_j$, choose a permutation $\phi_{\ell,j}$ of $\Omega_j=\{1,\ldots,k_j\}$ and apply $\phi_{\ell,j}$ to this projection. In other words, for $\omega \in \Omega$ we compute $K_j=\phi_{\ell,j}(\pi_j(\omega))$. Clearly $K_j \in \{1,\ldots,k_j\}$. Let $\alpha=(1/m)^{1/d}=(1/2)^{1/d}$. If $K_j<\alpha k_j \;\forall j$, we assign $\omega$ to $A_1$; if not, we assign $\omega$ to $A_2$. The probability of assignment to $A_1$ is $\left((1/2)^{1/d}\right)^d = 1/2$ (up to rounding, since $\alpha k_j$ is generally not an integer), and $A_2$ receives the rest. I think this procedure can be modified to handle $m>2$, but I have not worked out the details.
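The all-factors variant for $m=2$ can be sketched as follows; again the function name is illustrative, and because $\alpha k_j$ is generally not an integer the two parts are only approximately equal in size:

```python
import random

def make_all_factor_partition(k, rng):
    """m = 2 partition using all d factors: omega goes to A_1 iff every
    permuted component K_j falls below alpha * k_j, alpha = 2**(-1/d)."""
    d = len(k)
    alpha = 0.5 ** (1.0 / d)
    perms = []
    for kj in k:
        phi = list(range(1, kj + 1))
        rng.shuffle(phi)                   # permutation phi_{l,j}
        perms.append({v: phi[v - 1] for v in range(1, kj + 1)})

    def f(omega):
        for j, w in enumerate(omega):
            if perms[j][w] >= alpha * k[j]:
                return 2                   # some K_j too large -> A_2
        return 1                           # all K_j < alpha * k_j -> A_1
    return f
```

For example, with $d=2$ and $k_j=10$ the threshold is $\alpha k_j \approx 7.07$, so $7$ of the $10$ permuted values pass per factor and $A_1$ receives $49/100$ of the elements rather than exactly half.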

$\endgroup$
  • $\begingroup$ I think this was suggested already by @Nick above in a comment. But this gives me only very specific random partitions. The set $S$, which I am ultimately interested in, may be very special. Such as e.g. all $\pi_j(\omega)=0$. Then a permutation with respect to the j-th factor would not be very interesting. One could also think up less easily spotted issues involving dependencies between pairs or triplets of components. To avoid the need to specify the potential structure in $S$ I want uniform randomness on all partitions. $\endgroup$ Commented Jul 11 at 15:14
  • $\begingroup$ Have modified my answer to deal with this issue. Any procedure that allows easy assignment will use some rule and will introduce some issues. If you use a hash function, the issues are hidden inside that black box and you do not know what they are. If you use a transparent algorithm, the issues are clearly revealed and you can choose the least evil. $\endgroup$ Commented Jul 11 at 15:31
