Some duality bounds for nonconvex problems, I
Introduction
We reproduce the main result of the famous paper [1]. Consider a real-valued lower semi-continuous function $f$ on a convex set $\mathcal{X} \subset \mathbb{E}$, where $\mathbb{E}$ is a Hausdorff linear space. We measure the following "convexity error" of $f$,
$$\rho(f) = \sup_{K < \infty, \; x_k \in \mathcal{X}, \; a \in \Delta^K} \left\{f\left(\textstyle\sum_{k \in [K]} a_k x_k\right) - \textstyle\sum_{k \in [K]} a_k f(x_k)\right\},$$
where the supremum runs over all finite families of points $x_1, \dots, x_K$, $K < \infty$, and $\Delta^{K}$ is the probability simplex of $\mathbb{R}_+^K$. If we use the Dirac measure $\delta$, we can define
$$\mathcal{M}(\mathcal{X}) = \left\{ \textstyle\sum_{k \in [K]} a_k \delta_{x_k} \mid x_k \in \mathcal{X}, \; a \in \Delta^K, \; K < \infty \right\},$$
the set of finitely supported probability distributions $\mu$ on $\mathcal{X}$. Then, $\rho(f)$ can be written as,
$$\rho(f) = \sup_{\mu \in \mathcal{M}(\mathcal{X})} f\left(\int x \, \mathrm{d} \mu\right) - \int f(x) \, \mathrm{d} \mu.$$
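As a sanity check, here is a minimal numerical sketch (ours, not from [1]) that lower-bounds $\rho(f)$ for a function on an interval by sampling random two-point convex combinations; finite sampling can only certify a lower bound on the supremum.

```python
import numpy as np

# Minimal sketch: a sampled lower bound on rho(f) for f on [lo, hi],
# using random two-point convex combinations (K = 2). Finite sampling
# only ever certifies a lower bound on the supremum defining rho.
def rho_lower_bound(f, lo, hi, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(lo, hi, n_samples)
    x2 = rng.uniform(lo, hi, n_samples)
    a = rng.uniform(0.0, 1.0, n_samples)
    # Jensen-type gap f(a x1 + (1-a) x2) - [a f(x1) + (1-a) f(x2)]
    gap = f(a * x1 + (1 - a) * x2) - (a * f(x1) + (1 - a) * f(x2))
    return max(gap.max(), 0.0)  # rho is always nonnegative

# A convex f gives (numerically) zero; a concave f does not.
print(rho_lower_bound(lambda x: x ** 2, 0.0, 1.0))   # ~0.0
print(rho_lower_bound(lambda x: -x ** 2, 0.0, 1.0))  # ~0.25
```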
Minkowski sum. The Minkowski sum of two sets $A, B \subseteq \mathbb{E}$ is defined as
$$A + B = \lbrace a + b \mid a \in A, b \in B \rbrace.$$
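For example, in $\mathbb{R}$, $[0,1] + [2,3] = [2,4]$. We will use the Minkowski sum in Step I of the proof to enlarge an image set.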
Fenchel conjugate. The Fenchel conjugate of $f$ (assumed to be at least lower semi-continuous) is defined as,
$$f_\ast(y) = \sup_{x \in \mathcal{X}} \left\{\langle y, x \rangle - f(x)\right\}, \quad y \in \mathbb{E}_\ast.$$
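For example, for $f(x) = \tfrac{1}{2}x^2$ on $\mathcal{X} = \mathbb{R}$, the supremum is attained at $x = y$ and $f_\ast(y) = \tfrac{1}{2}y^2$.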
Consider the indicator function of a set $\mathcal{Y}$,
$$\sigma_{\mathcal{Y}}(y) = \begin{cases} 0, & y \in \mathcal{Y}, \\ \infty, & y \notin \mathcal{Y}. \end{cases}$$
Then, its conjugate is called the support function of $\mathcal{Y}$,
$$\sigma_{\ast,\mathcal{Y}}(y) = \sup_{x \in \mathbb{E}} \left\{\langle y, x \rangle - \sigma_{\mathcal{Y}}(x)\right\} = \sup_{x \in \mathcal{Y}} \langle y, x \rangle.$$
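For example, if $\mathcal{Y} = \{c\}$ is a singleton, then $\sigma_{\ast,\mathcal{Y}}(y) = \langle y, c \rangle$; if $\mathcal{Y}$ is the Euclidean unit ball, then $\sigma_{\ast,\mathcal{Y}}(y) = \|y\|_2$.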
Basic properties of $\rho$. The authors of [1] proved that,
$$\begin{align*} &0 \le \rho(f) \le \infty \\ &\rho(f) = 0 \iff f \text{ is convex}. \end{align*}$$
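For example, for the concave function $f(x) = -x^2$ on $\mathcal{X} = [0,1]$, a direct computation gives
$$f\left(\textstyle\sum_k a_k x_k\right) - \textstyle\sum_k a_k f(x_k) = \textstyle\sum_k a_k x_k^2 - \left(\textstyle\sum_k a_k x_k\right)^2 \le \tfrac{1}{4},$$
the variance of the points $x_k$ under weights $a$, which is at most $\tfrac14$ on $[0,1]$, with equality for $x_1 = 0$, $x_2 = 1$, $a = (\tfrac12, \tfrac12)$. Hence $\rho(f) = \tfrac14$; we reuse this instance below.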
Main result. Consider, for lsc $f, g$, the primal problem
$$z^\ast = \inf ~ f(x) + g(Ax), ~\text{s.t.} ~ x \in \mathcal{X}, Ax \in \mathcal{Y}. \tag{P}$$
Here,
- We assume $g$ is convex; the proof below in fact yields an extra $\rho(g)$ term, which vanishes for convex $g$.
- $\mathcal{X} \subseteq \mathbb{E}$ is a convex subset of a Hausdorff linear space $\mathbb{E}$.
- $\mathcal{Y} \subseteq \mathbb{R}^m$ and $A \in \mathcal{L}(\mathbb{E}, \mathbb{R}^m)$ is a linear mapping to Euclidean space.
- We assume the Slater-type regularity condition $0 \in \mathrm{int}(A(\mathcal{X}) - \mathcal{Y})$; it is used in Step II of the proof.
Equivalently, one can read,
$$\text{(P)} \Leftrightarrow \inf_{x \in \mathcal{X}, y \in \mathcal{Y}} \left\{f(x) + g(y) \right\}, ~\text{s.t.} ~ Ax = y$$
So the dual function is,
$$\begin{align*} d(p) &= \inf_{x \in \mathcal{X}, y \in \mathcal{Y}} \left\{\langle p, Ax - y \rangle + f(x) + g(y) \right\} \\ &= - \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \left\{\langle -A^\ast p, x \rangle - f(x) + \langle p, y \rangle - g(y) \right\} \\ &= - f_\ast(-A^\ast p) - g_\ast(p). \end{align*}$$
and
$$d^\ast= \sup_{p \in \mathbb{R}^m} d(p). \tag{D}$$
If $g \equiv 0$, then the problem is,
$$z^\ast = \inf_{x \in \mathcal{X}, Ax \in \mathcal{Y}} ~ f(x) = \inf_{x \in \mathcal{X}} ~ f(x) + \sigma_{\mathcal{Y}}(Ax).$$
It is easy to see the dual problem is,
$$d^\ast = \sup_{p \in \mathbb{R}^m} d(p) = \sup_{p \in \mathbb{R}^m} \left\{- f_\ast(-A^\ast p) - \sigma_{\ast,\mathcal{Y}}(p)\right\}.$$
Here $A^\ast$ is the adjoint of $A$,
$$A^\ast: \mathbb{R}^m \to \mathbb{E}_\ast,$$
(recall $\mathbb{E}_\ast$ is the dual space of $\mathbb{E}$; for $\mathbb{E} = \mathbb{R}^n$, $A^\ast$ is simply the matrix transpose $A^\top$). The goal here is to prove that,
$$0 \le z^\ast - d^\ast \le \rho(f).$$
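Before the proof, a small worked instance (ours, not from [1]) showing the bound can be tight. Take $f(x) = -x^2$ on $\mathcal{X} = [0,1]$, $g \equiv 0$, $m = 1$, $A = \mathrm{id}$, and $\mathcal{Y} = \{\tfrac12\}$. Then $z^\ast = f(\tfrac12) = -\tfrac14$, while
$$d(p) = \inf_{x \in [0,1]} \left\{p x - x^2\right\} - \tfrac{p}{2} = \min(0, p - 1) - \tfrac{p}{2},$$
since the infimum of a concave function over an interval is attained at an endpoint. This is maximized at $p = 1$, giving $d^\ast = -\tfrac12$. Hence $z^\ast - d^\ast = \tfrac14 = \rho(f)$ (computed in the earlier example), and the bound holds with equality.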
Weak duality. By definition,
$$\begin{align*} \forall p \in \mathbb{R}^m, \quad d(p) &= \inf_{x \in \mathcal{X}, y \in \mathcal{Y}} \left\{\langle p, Ax - y \rangle + f(x) + g(y) \right\} \\ &\le \langle p, Ax - y \rangle + f(x) + g(y), ~ \forall x \in \mathcal{X}, y \in \mathcal{Y} \\ &\le f(x) + g(y), \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}, Ax = y. \end{align*}$$
Hence,
$$d^\ast = \sup_{p \in \mathbb{R}^m} d(p) \le \inf_{x \in \mathcal{X}, Ax \in \mathcal{Y}} \left\{f(x) + g(Ax) \right\} = z^\ast.$$
Proof
Consider a mapping, $\Phi: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}\times\mathbb{R}^m$,
$$\Phi(x, y) = (f(x) + g(y), Ax - y).$$
and use discrete probability distributions over $\mathcal{X}\times\mathcal{Y}$: for $\mu \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})$, consider the relaxed problem
$$\begin{align*} \mathbb{R} \ni w &= \inf_{\mu \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})} \int_{\mathcal{X}\times\mathcal{Y}} f(x) + g(y) \, \mathrm{d} \mu, ~\text{s.t.}~ \int A(x) \, \mathrm{d}\mu = \int y \, \mathrm{d}\mu \\ &= \inf \textstyle\sum_{i} a_i \left(f(x_i) + g(y_i)\right), ~\text{s.t.}~ \textstyle\sum_{i} a_i A(x_i) = \textstyle\sum_{i} a_i y_i, ~ a \in \Delta^I. \end{align*}$$
The last equality simply writes out an explicit probability vector $a$. Restricting to Dirac measures $\mu = \delta_{(x,y)}$ recovers (P), so,
$$w \le \inf_{x\in \mathcal{X}, y\in\mathcal{Y}} \left\{ f(x) + g(y), ~\text{s.t.}~ Ax = y\right\} = z^\ast.$$
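The inequality can be strict: in the toy instance above ($f(x) = -x^2$ on $[0,1]$, $\mathcal{Y} = \{\tfrac12\}$), the measure $\mu$ putting mass $\tfrac12$ on $(x, y) = (0, \tfrac12)$ and $\tfrac12$ on $(1, \tfrac12)$ satisfies $\int A(x) \, \mathrm{d}\mu = \tfrac12 = \int y \, \mathrm{d}\mu$ and gives $\int f + g \, \mathrm{d}\mu = -\tfrac12$, so $w \le -\tfrac12 < -\tfrac14 = z^\ast$ (matching $d^\ast = -\tfrac12$ there).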
Step I. Show that $\mathbb{R} \times \mathbb{R}^m \ni (w, 0)$ is not in the convex hull of
$$\Phi(\mathcal{X}\times\mathcal{Y}) + \mathbb{R}_{>0} \times \{0\}^m,$$
where the sum is in the Minkowski sense. It enlarges the image by allowing the objective to strictly increase ($\mathbb{R}_{>0}$) while keeping the constraint residual fixed ($\{0\}^m$):
$$(w, 0) \notin \mathrm{conv}\left(\Phi(\mathcal{X}\times\mathcal{Y}) + \mathbb{R}_{>0} \times \{0\}^m\right).$$
Otherwise, there exists $\mu \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})$ such that,
$$w > \int_{\mathcal{X}\times\mathcal{Y}} f(x) + g(y) \, \mathrm{d} \mu, \quad \int A(x) \, \mathrm{d}\mu = \int y \, \mathrm{d}\mu,$$
which contradicts the definition of $w$.
Step II. Because $(w, 0) \notin \mathrm{conv}(\Phi + \mathbb{R}_{>0} \times \{0\}^m)$ and the ambient space $\mathbb{R} \times \mathbb{R}^m$ is finite-dimensional, there is a separating hyperplane: there exists $(a, p) \in \mathbb{R} \times \mathbb{R}^m$, $(a, p) \neq 0$, such that,
$$a w = a w + \langle p, 0 \rangle \le a (f(x) + g(y) + c) + \langle p, Ax - y \rangle, \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}, c > 0.$$
Letting $c \to \infty$ shows $a \ge 0$.
First, $a$ cannot be zero. If so, then
$$\langle p, Ax - y \rangle \ge 0, \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}.$$
But the regularity condition $0 \in \mathrm{int}(A(\mathcal{X}) - \mathcal{Y})$ means $A(\mathcal{X}) - \mathcal{Y}$ contains a neighborhood of the origin, so for any $p \neq 0$,
$$\inf_{x\in \mathcal{X}, y\in\mathcal{Y}} \langle p, Ax - y \rangle < 0.$$
Hence $p = 0$, contradicting $(a, p) \neq 0$.
(This mimics the proof of strong duality in convex optimization under Slater's condition.) So we must have $a > 0$. Dividing both sides by $a$ and rescaling $p \leftarrow \tfrac{1}{a}p$, we have,
$$w \le f(x) + g(y) + \langle p, Ax - y \rangle + c, \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}, c > 0.$$
Taking the infimum over $x$, $y$, and $c > 0$,
$$w \le \inf_{x\in \mathcal{X}, y\in\mathcal{Y}} \left\{f(x) + g(y) + \langle p, Ax - y \rangle\right\} = - f_\ast(-A^\ast p) - g_\ast(p) = d(p) \le d^\ast.$$
Step III. By definition of $w$ as an infimum, for any $\epsilon > 0$ there exists a feasible $\mu \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})$ such that,
$$\begin{align*} \int f(x) + g(y) \, \mathrm{d} \mu &\le w + \epsilon \\ \int A(x) \, \mathrm{d} \mu &= \int y \, \mathrm{d} \mu. \end{align*}$$
Since $A$ is linear, taking the mean point $\bar x = \int x \, \mathrm{d} \mu$ (a finite convex combination of points of $\mathcal{X}$, hence in $\mathcal{X}$ by convexity),
$$A \bar x = A \int x \, \mathrm{d} \mu = \int A(x) \, \mathrm{d} \mu,$$
and analogously $\bar y = \int y \, \mathrm{d} \mu \in \mathcal{Y}$, so that $A \bar x = \bar y$ and we have,
$$\begin{align*} w + \epsilon & \ge \int f(x) + g(y) \, \mathrm{d} \mu = f(\bar x) + g(\bar y) + \left(\int f(x) + g(y) \, \mathrm{d} \mu - f(\bar x) - g(\bar y)\right) \\ & \ge f(\bar x) + g(\bar y) - \sup_{\mu \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})} \left(f\left(\int x \, \mathrm{d} \mu\right) + g\left(\int y \, \mathrm{d} \mu\right) - \int f(x) + g(y) \, \mathrm{d} \mu\right) \\ & \ge f(\bar x) + g(\bar y) - \rho(f) - \rho(g). \end{align*}$$
Hence, combining with Step II ($w \le - f_\ast(-A^\ast p) - g_\ast(p)$), we have,
$$\forall \epsilon > 0, \quad - f_\ast(-A^\ast p) - g_\ast(p) + \epsilon \ge f(\bar x) + g(\bar y) - \rho(f) - \rho(g).$$
Since $(\bar x, \bar y)$ is feasible for (P) ($\bar x \in \mathcal{X}$, $\bar y \in \mathcal{Y}$, $A \bar x = \bar y$), bounding the right-hand side from below and letting $\epsilon \to 0$,
$$d(p) = - f_\ast(-A^\ast p) - g_\ast(p) \ge \inf_{\bar x \in \mathcal{X}, \bar y \in \mathcal{Y}, A\bar x = \bar y} \left\{f(\bar x) + g(\bar y)\right\} - \rho(f) - \rho(g) = z^\ast - \rho(f) - \rho(g).$$
Hence, since $d^\ast \ge d(p)$ and $g$ is convex (so $\rho(g) = 0$), we conclude,
$$\boxed{z^\ast - d^\ast \le \rho(f) + \rho(g) = \rho(f).}$$
This completes the proof.
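As a numerical sanity check (ours, not from [1]), the following sketch brute-forces the toy instance above, $f(x) = -x^2$ on $[0,1]$, $g \equiv 0$, $A = \mathrm{id}$, $\mathcal{Y} = \{\tfrac12\}$, over grids, and confirms $0 \le z^\ast - d^\ast \le \rho(f)$ (with equality here).

```python
import numpy as np

# Brute-force check of 0 <= z* - d* <= rho(f) on the toy instance:
# f(x) = -x^2 on X = [0, 1], g = 0, A = id (m = 1), Y = {1/2}.
f = lambda x: -x ** 2
xs = np.linspace(0.0, 1.0, 2001)   # grid on X (contains 1/2)
ps = np.linspace(-5.0, 5.0, 2001)  # grid on the dual variable p

# Primal: z* = inf {f(x) : x in X, Ax in Y} = f(1/2).
z_star = f(0.5)

# Dual: d(p) = inf_x {f(x) + p (x - 1/2)}, and d* = sup_p d(p).
d_star = max(np.min(f(xs) + p * (xs - 0.5)) for p in ps)

# rho(f) via two-point combinations on a coarse grid (for this concave
# f, two points suffice, and the grid contains the maximizers 0, 1).
a = np.linspace(0.0, 1.0, 201)[:, None, None]
x1 = xs[None, ::50, None]
x2 = xs[None, None, ::50]
rho = np.max(f(a * x1 + (1 - a) * x2) - (a * f(x1) + (1 - a) * f(x2)))

print(z_star - d_star, rho)  # both ~0.25, so 0 <= gap <= rho(f) holds
```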
Application
An interesting application arises when the primal variable has a decomposition (block) structure,
$$\mathcal{X} = \prod_{i \in [I]} \mathcal{X}_i, \quad f(x) = \tfrac{1}{I}\textstyle\sum_{i \in [I]} f_i(x_i), \quad Ax = \tfrac{1}{I}\textstyle\sum_{i \in [I]} A_i x_i,$$
where each $\mathcal{X}_i \subseteq \mathbb{E}_i$ is a convex subset of a corresponding lower-dimensional linear space $\mathbb{E}_i$, and $A_i \in \mathcal{L}(\mathbb{E}_i, \mathbb{R}^m)$. Then, the problem reads,
$$\begin{align*} z^\ast = \inf ~&\left\{\tfrac{1}{I}\textstyle\sum_{i \in [I]} f_i(x_i) + g\left(\tfrac{1}{I}\textstyle\sum_{i \in [I]} A_i x_i\right)\right\}, \\ \text{s.t.}~ &\tfrac{1}{I}\textstyle\sum_{i \in [I]} A_i x_i \in \mathcal{Y}, ~ x_i \in \mathcal{X}_i, \forall i \in [I]. \end{align*}$$
With some algebra, the dual function is,
$$d(p) = - \tfrac{1}{I} \textstyle\sum_{i \in [I]} f_{i,\ast}(-A_i^\ast p) - g_\ast(p).$$
Applying the general theorem above, together with the fact that $\rho$ is subadditive over independent blocks, $\rho\left(\tfrac{1}{I}\textstyle\sum_{i} f_i\right) \le \tfrac{1}{I}\textstyle\sum_{i} \rho(f_i)$, we have,
$$z^\ast - d^\ast \le \tfrac{1}{I} \textstyle\sum_{i \in [I]} \rho(f_i) \le \max_{i \in [I]} \rho(f_i).$$
The whole point, however, is that a much sharper bound is available,
$$z^\ast - d^\ast \le {\color{#BE5845}{(m+1)I^{-1}}} \max_{i \in [I]} \rho(f_i),$$
which depends on the dimension $m$ of the image of the linear mapping $A$ and diminishes as the number of blocks $I$ grows. We leave its proof to the next note.
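To see the decay numerically, here is a small brute-force sketch (ours, not from [1]) on a block instance built from the toy example: $f_i(x) = -x^2$ on $\mathcal{X}_i = [0,1]$, $A_i = \mathrm{id}$ (so $m = 1$), $g \equiv 0$, $\mathcal{Y} = \{\tfrac12\}$, for which $\rho(f_i) = \tfrac14$ for every block.

```python
import itertools
import numpy as np

# Numerical sketch: duality-gap decay in the number of blocks I, on the
# instance f_i(x) = -x^2 on X_i = [0, 1], A_i = id (m = 1), g = 0,
# Y = {1/2}, where rho(f_i) = 1/4 for every i.
grid = np.linspace(0.0, 1.0, 11)   # discretization of each block X_i
ps = np.linspace(-3.0, 3.0, 601)   # discretization of the dual variable

# Each block contributes inf_x {-x^2 + p x} = min(0, p - 1), hence
# d(p) = min(0, p - 1) - p/2 for every I, and d* = -1/2.
d_star = np.max(np.minimum(0.0, ps - 1.0) - ps / 2.0)

for I in (1, 2, 3, 4):
    z_star = np.inf  # brute-force primal value on the grid
    for xs in itertools.product(grid, repeat=I):
        xs = np.asarray(xs)
        if np.isclose(xs.mean(), 0.5):  # constraint (1/I) sum_i x_i in Y
            z_star = min(z_star, np.mean(-xs ** 2))
    print(f"I={I}: gap={z_star - d_star:.4f}, "
          f"(m+1)/I * max rho = {2 / I * 0.25:.4f}")
# Gaps: 0.25, 0.0, ~0.083, 0.0 -- each below the (m+1)/I bound.
```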
- [1] Aubin, J.-P., & Ekeland, I. (1976). Estimates of the duality gap in nonconvex optimization. Mathematics of Operations Research, 1(3), 225–245.