Probability Theory: Maximizing the difference between distribution functions

Question

Given a sample of observations $X$, by changing a parameter $p$ we can divide $X$ into two subsamples $X_1$ and $X_2$ (the division is done in a non-trivial way, which is nonetheless irrelevant to the problem at hand). We would like to find a value of $p$ for which the subsamples $X_1$ and $X_2$ are as "distinct" as possible. To this end, we can consider how "far" apart the empirical distribution functions (EDFs) of $X_1$ and $X_2$ are. I am deciding between two objective functions:

1- $\max_p \{\max_x \mid F_{X_1}(x) - F_{X_2}(x)\mid\}$

2- $\max_p \{\sum_x \mid F_{X_1}(x) - F_{X_2}(x)\mid\}$,

i.e., the first maximization problem searches for a value of $p$ that maximizes the biggest absolute difference between the EDFs, while the second one chooses a value of $p$ for which the area between the graphs of EDFs is maximized.
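For concreteness, here is a minimal sketch of how the two objectives could be evaluated for one fixed split. The function names `edf`, `sup_distance` and `area_distance` are made up for illustration, and the second objective is read as the area between the two step functions (an assumption, since the sum over $x$ is not fully specified above).

```python
import numpy as np

def edf(sample, grid):
    """Empirical distribution function of `sample`, evaluated at each point of `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def sup_distance(x1, x2):
    """Objective 1: largest absolute gap between the two EDFs."""
    grid = np.sort(np.concatenate([x1, x2]))
    return float(np.max(np.abs(edf(x1, grid) - edf(x2, grid))))

def area_distance(x1, x2):
    """Objective 2, read as an area: integral of |F1 - F2| over the pooled range."""
    grid = np.sort(np.concatenate([x1, x2]))
    # Both EDFs are constant on [grid[i], grid[i+1]), so the integral is a finite sum.
    gaps = np.abs(edf(x1, grid) - edf(x2, grid))[:-1]
    return float(np.sum(gaps * np.diff(grid)))

# Toy split (made-up numbers): the subsamples do not overlap.
print(sup_distance([0, 1], [2, 3]))   # largest gap between the EDFs
print(area_distance([0, 1], [2, 3]))  # area between the EDFs
```

One practical difference this makes visible: the first quantity is the two-sample Kolmogorov–Smirnov statistic, which always lies in $[0,1]$ and does not change if the data are rescaled, whereas the area scales with the units of $X$ (in one dimension it coincides with the 1-Wasserstein distance between the two empirical distributions).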

I acknowledge that the notion of distinctness might be ambiguous. However, I am looking for the pros and cons of each objective function, and I would like to know whether any similar problem has been studied in the literature.

krkeane · Accepted Answer · 2024-01-06 15:04:41Z

The most useful thing I can suggest is to move this post to Cross Validated.

You are focusing on maximizing the difference between empirical distribution functions. If you sort your data and assign the lower values to $X_1$ and the higher values to $X_2$, I believe the difference between the empirical distributions in the given sample would be maximized: the two subsamples do not overlap, so $\max_x \mid F_{X_1}(x) - F_{X_2}(x)\mid$ reaches its upper bound of 1.
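A small simulation can illustrate the sorted-split claim. This is only a sketch: the helper `ks_gap`, the sample size and the seed are arbitrary choices for illustration, not part of the question.

```python
import numpy as np

def ks_gap(x1, x2):
    """Largest absolute difference between the EDFs of x1 and x2."""
    grid = np.sort(np.concatenate([x1, x2]))
    f1 = np.searchsorted(np.sort(x1), grid, side="right") / len(x1)
    f2 = np.searchsorted(np.sort(x2), grid, side="right") / len(x2)
    return float(np.max(np.abs(f1 - f2)))

rng = np.random.default_rng(0)
x = rng.normal(size=200)          # one sample from a single distribution

# Split by sorting: the 100 lowest values versus the 100 highest values.
x_sorted = np.sort(x)
threshold_gap = ks_gap(x_sorted[:100], x_sorted[100:])

# Split at random, ignoring the values.
shuffled = rng.permutation(x)
random_gap = ks_gap(shuffled[:100], shuffled[100:])

print(threshold_gap, random_gap)
```

The sorted split attains the largest possible gap even though every observation came from the same distribution, which is why a large EDF difference in the sample does not by itself indicate two distinct generating processes.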

If the objective is to infer two distinct generating processes, for example a "typical" generating process $X_1 \sim N(0,1)$ and an "outlier" generating process $X_2 \sim N(0,100)$, focusing on the empirical distribution differences would not tease these two processes apart. A more typical analytic step would be to determine which distribution ($X_1$ or $X_2$) was more likely to have generated a given observation $x_i$. The statistical distance between these two parameterized generating distributions (and prior for "typical" vs "outlier" in a Bayesian framework) would determine the difficulty of the labeling task $x_i \in X_1$ or $x_i \in X_2$.
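The labeling step described above can be sketched as follows. Two things here are assumptions rather than part of the answer: $N(0,100)$ is read as variance 100 (standard deviation 10), and the prior probability of an "outlier" is set to a made-up value of 0.1. The function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (np.asarray(x, dtype=float) - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_outlier(x, prior_outlier=0.1, sd_typical=1.0, sd_outlier=10.0):
    """Posterior probability that x came from the 'outlier' process (Bayes' rule)."""
    weight_typical = (1.0 - prior_outlier) * normal_pdf(x, 0.0, sd_typical)
    weight_outlier = prior_outlier * normal_pdf(x, 0.0, sd_outlier)
    return weight_outlier / (weight_typical + weight_outlier)

# Observations near 0 look "typical"; observations far from 0 look like "outliers".
for value in (0.0, 2.0, 5.0):
    print(value, float(posterior_outlier(value)))
```

Because both processes are centred at 0, the label depends only on $\mid x_i \mid$; a split by threshold on $x_i$ itself, which is what maximizing the EDF difference favours, would not recover these labels.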

Given your model assumptions, and inference from the sample data, the distance between the parameterized generating distributions is fixed; it doesn't make sense to talk about maximizing the distance between probability distributions given the sample data and labels $x_i \in X_1$ or $x_i \in X_2$.

If you are allowed to select different features, or transformations of the sample data, you may be able to find very distinct distributions ("good features" for your classification task).

A common measure of the distance between distributions is the Kullback–Leibler divergence.
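For reference, the divergence has a closed form when both distributions are univariate Gaussians. The worked numbers below use the "typical" and "outlier" example from earlier in this answer and assume that $N(0,100)$ denotes variance 100, which the answer does not state explicitly.

$$
D_{\mathrm{KL}}\big(N(\mu_1,\sigma_1^2)\,\|\,N(\mu_2,\sigma_2^2)\big) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
$$

With $P = N(0,1)$ and $Q = N(0,100)$:

$$
D_{\mathrm{KL}}(P\,\|\,Q) = \ln 10 + \frac{1}{200} - \frac{1}{2} \approx 1.81, \qquad
D_{\mathrm{KL}}(Q\,\|\,P) = \ln\frac{1}{10} + \frac{100}{2} - \frac{1}{2} \approx 47.20 .
$$

The two directions differ, so the divergence is not symmetric and is not a metric in the strict sense; the order of the arguments has to be chosen deliberately.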

The expectation–maximization algorithm ("EM algorithm") is commonly used in machine learning for similar tasks (attributing an observation to a generating distribution and estimating the parameters of the generating distributions).
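As a rough illustration of the idea, here is a minimal EM sketch for a two-component Gaussian mixture with both means fixed at 0. The model, the starting values, the simulated data (20% of points drawn with standard deviation 10) and the function name `em_two_scales` are all assumptions chosen for illustration; a real analysis would more likely use an established mixture-model implementation.

```python
import numpy as np

def em_two_scales(x, n_iter=200):
    """EM for the mixture (1 - w) * N(0, s1^2) + w * N(0, s2^2), means fixed at 0."""
    x = np.asarray(x, dtype=float)
    w, s1, s2 = 0.5, 0.5 * x.std(), 2.0 * x.std()   # crude starting values
    for _ in range(n_iter):
        # E-step: responsibility of the wide component for each observation.
        # The common factor 1 / sqrt(2 * pi) cancels in the ratio.
        p1 = (1.0 - w) * np.exp(-0.5 * (x / s1) ** 2) / s1
        p2 = w * np.exp(-0.5 * (x / s2) ** 2) / s2
        r = p2 / (p1 + p2)
        # M-step: responsibility-weighted updates of the weight and the two scales.
        w = r.mean()
        s1 = np.sqrt(np.sum((1.0 - r) * x**2) / np.sum(1.0 - r))
        s2 = np.sqrt(np.sum(r * x**2) / np.sum(r))
    return w, s1, s2, r

# Simulated data: "typical" points with sd 1, "outlier" points with sd 10.
rng = np.random.default_rng(0)
n = 5000
is_outlier = rng.random(n) < 0.2
x = np.where(is_outlier, rng.normal(0.0, 10.0, n), rng.normal(0.0, 1.0, n))

w, s1, s2, r = em_two_scales(x)
print(w, s1, s2)   # estimated outlier weight and the two standard deviations
```

The array `r` holds, for each observation, the estimated probability that it came from the wide component, which is the soft version of the labels $x_i \in X_1$ or $x_i \in X_2$.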

mehman · Accepted Answer · 2025-10-01 12:09:38Z

I am not sure what the goal is here, but you could look into the Kullback–Leibler divergence.
