Entropic Regularization and Sinkhorn

Author

Jack Pfaffinger

Sinkhorn’s Algorithm is an iterative numerical method used to obtain an optimal transport plan \(\pi \in \Gamma(\alpha,\beta)\) for the Kantorovich Problem with entropic regularization in the case of finitely supported positive measures \(\alpha, \beta \in \mathcal M_{+}(X)\). For further reading see (Peyré and Cuturi 2018) (pg. 62-73).

Continuous Problem Formulation

Entropic regularization modifies the Kantorovich problem by adding a Kullback-Leibler divergence term to the optimization goal. Specifically, the general form of the problem is now to determine

\[ L_c^\epsilon (\alpha, \beta) = \inf_{\pi \in \Gamma(\alpha, \beta)} \underbrace{\int_{X \times Y} c(x, y) \, d\pi(x, y)}_{\text{Kantorovich functional } \mathbb{K}(\pi)} + \epsilon \underbrace{\text{KL}(\pi \mid \alpha \otimes \beta)}_{\text{entropic term}} \]

where \(\alpha\otimes\beta\) is the product measure of \(\alpha\) and \(\beta\), and where

\[ \operatorname{KL}(\mu\mid\nu) = \int_X \log\left(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \right) \mathrm{d}\mu(x) + \int_X (\mathrm{d}\nu(x) - \mathrm{d}\mu(x)) \]

whenever the Radon-Nikodym derivative \(\tfrac{\mathrm{d}\mu}{\mathrm{d}\nu}\) exists (i.e. when \(\mu\) is absolutely continuous w.r.t. \(\nu\)) and \(+\infty\) otherwise. This form of the KL divergence is applicable even when \(\mu,\nu \in \mathcal{M}_+(X)\) differ in total mass and it reduces to the standard definition whenever \(\mu\) and \(\nu\) have equal total mass. From this definition it immediately follows that for \(\epsilon > 0\) an optimal coupling \(\pi^*\) must be absolutely continuous w.r.t \(\alpha \otimes \beta\). As a result, the optimal plan is in some sense less singular and hence “smoothed out.”

Discrete Problem Formulation

To apply Sinkhorn’s algorithm to approximate \(L^{\epsilon}_c (\alpha,\beta)\), it will be necessary to assume finite support so let \(\alpha = \textstyle\sum_{i=1}^{n} a_i \delta_{x_i}\) and \(\beta = \textstyle\sum_{j=1}^m b_j \delta_{y_j}\) and denote the corresponding vector of weights by \(\mathbf{a} \in \mathbb{R}_{+}^n\) and \(\mathbf{b} \in \mathbb{R}_{+}^m\). Additionally let \(C_{ij} = c(x_i, y_j)\) and denote the discrete version of \(\Gamma(\alpha,\beta)\) by \(U(a,b)= \{ P \in \mathbb{R}^{n \times m} \mid \textstyle\sum_{j} P_{ij}=a_i, \textstyle\sum_{i} P_{ij} = b_j \}\).

The discrete entropy of a coupling matrix is defined as:

\[ H(P) := -\sum_{i,j}P_{i,j}(\log(P_{i,j}) - 1) \]

with an analogous definition for vectors.

This lets us write the entropically regularized Kantorovich problem as:

\[L^{\epsilon}_{C}(\mathbf{a},\mathbf{b}) = \inf_{P\in U(\mathbf{a},\mathbf{b})} \sum_{i,j} C_{i,j} P_{i,j} -\epsilon H(P).\]

Adding the entropy regularization guarantees uniqueness of the solution, and makes the solutions “less sparse” as \(\epsilon\) increases.

The unique solution \(P^{\epsilon}\) of the entropically regularized Kantorovich problem converges to the optimal solution of the unregularized Kantorovich problem which has maximal entropy as \(\epsilon \rightarrow 0\). In other words,

\[ P_{\epsilon} \xrightarrow{\varepsilon \to 0} \underset{P \in U(\mathbf{a}, \mathbf{b})}{\operatorname{argmax}} \{ H(P) : \sum_{i, j} P_{i,j}C_{i,j} = L_{C}(\mathbf{a}, \mathbf{b}) \} \]

where \(L_{C}(\mathbf{a}, \mathbf{b})\) is the minimal cost of the unregularized Kantorovich problem. Thus we can see

\[ L^{\epsilon}_C(\mathbf{a}, \mathbf{b}) \xrightarrow{\varepsilon \to 0} L_{C}(\mathbf{a}, \mathbf{b}). \]

and

\[ P_{\epsilon} \xrightarrow{\varepsilon \to 0} a \otimes b = ab^{T}. \]

See here for more on Discrete Optimal Transport.

Characterizing the Solution

The solution to the discrete entropically regularized problem formulation is unique and has a special form. We can define the Kullback-Leibler divergence between couplings as

\[\operatorname{KL}(P\mid K) := \sum_{i,j} P_{ij} \log\left(\frac{P_{i,j}}{K_{i.j}}\right) + - P_{i,j} + K_{i, j}.\]

Theorem (Peyré and Cuturi (2018), Proposition 4.3 on pg. 63):

The solution \(P \in \mathbb{R}^{n\times m}\) to discrete regularized Kantorovich problem is unique and has the form \(P_{ij} = u_iK_{ij}v_j\) for some \(\mathbf{u} \in \mathbb{R}^n, \mathbf{v} \in \mathbb{R}^m\) where \(K_{ij}=e^{-C_{ij}/\epsilon}\). Moreover, \(\mathbf u\) and \(\mathbf v\) are unique up to multiplication and division by some scaling factor.

In fact, the optimal solution \(P_{\epsilon}\) of \(L_{c}^{\epsilon}(\mathbf{a}, \mathbf{b})\) satisfies:

\[ P_{\epsilon} = \underset{P \in U(\mathbf{a}, \mathbf{b})}{\operatorname{argmin}} KL(P| K) \]

i.e. \(P^{\epsilon}\) is the admissible coupling which has smallest KL-divergence with respect to \(K\).

Sinkhorn’s Algorithm

Sinkhorn’s algorithm takes advantage of the aforementioned characterization result to iteratively approximate the scaling factors \(\mathbf u\) and \(\mathbf v\). The procedure is simple and only involves matrix-vector multiplication and entrywise division as follows

\[ u^{(\ell +1)} = \frac{\textbf a}{K\textbf v^{(\ell)}} \quad\text{and}\quad v^{(\ell +1)} = \frac{\textbf a}{K^T\textbf u^{(\ell+1)}} \]

Where \(\mathbf{v}^{(0)}\) is initialized arbitrarily as a vector of all \(1's\). This problem is known known in the numerical analysis community as a matrix scaling problem.

Once a sufficient number of iterations \(L\) have been taken, we let \(\hat P_{ij}=u^{(L)}_i K_{i,j} v^{(L)}_j\) be our approximation of the optimal plan.

References

Peyré, Gabriel, and Marco Cuturi. 2018. “Computational Optimal Transport.” arXiv Preprint. https://arxiv.org/abs/1803.00567.