Another approach is this: if you view each index $i \in \{1, \ldots, N\}$ as a node of a connected graph, you can define a local variable $y_i$ for each $i \in \{1, \ldots, N\}$ and then enforce the constraint $y_i=y_j$ whenever $i$ and $j$ are neighbors. If you want, you can do the same thing for the $x_i$ variables: define $x_i^{(k)}$ as node $i$'s estimate of $x_k$. The resulting problem is:
Minimize:
$$ \sum_{i=1}^N f_i(x_i^{(i)}, y_i) $$
Subject to:
$$ y_i = y_j \: \mbox{ whenever $i$ and $j$ are neighbors} $$
$$ \sum_{k=1}^N x_i^{(k)} = y_i \: \mbox{ for all $i$} $$
$$ x_i^{(k)} = x_j^{(k)} \: \mbox{ for all $k$, whenever $i$ and $j$ are neighbors} $$
$$ y_i \in \mathcal{Y}, \: x_i^{(k)} \in \mathcal{X}_k $$
This is a brute-force approach that introduces many variables; you can reduce the number of variables by using a tree (a "minimalist" connected graph). I did this in the following paper, which may be of interest: "Distributed and secure computation of convex programs over a network of connected processors":
http://www-bcf.usc.edu/~mjneely/pdf_papers/dist_comp_dcdis2.pdf
The key property of this formulation is that the objective function is a separable sum of terms, where each term involves only functions and variables associated with a particular node. Likewise, all constraints are separable sums whose terms involve only variables at neighboring nodes, so that message passing between neighbors can enforce these constraints. The paper above develops a distributed algorithm that solves this, where each node knows only its own $f_i(\cdot)$ function.
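For concreteness, here is a small sketch of this reformulation on a toy instance, solved centrally just to show its structure. The quadratic $f_i(x_i, y) = (x_i - a_i)^2 + (y - b_i)^2$, the path graph, the box sets $\mathcal{X}_k = [0, 1]$ and $\mathcal{Y} = [0, N]$, and the use of cvxpy are assumptions made purely for illustration:

```python
# A minimal sketch of the "local copies + neighbor equality" reformulation.
# The functions f_i, the sets X_k = [0, 1] and Y = [0, N], and the path graph
# are hypothetical choices made just to have something concrete to solve.
import cvxpy as cp
import numpy as np

N = 4
a = np.array([0.2, 0.5, 0.9, 0.3])          # hypothetical data defining f_i
b = np.array([1.0, 2.0, 1.5, 0.5])
edges = [(i, i + 1) for i in range(N - 1)]   # path graph: 0-1-2-3

y = cp.Variable(N)            # y_i: node i's copy of the shared variable y
X = cp.Variable((N, N))       # X[i, k]: node i's estimate of x_k

# f_i(x_i, y_i) = (x_i - a_i)^2 + (y_i - b_i)^2, evaluated at node i's own copies
objective = cp.Minimize(cp.sum_squares(cp.diag(X) - a) + cp.sum_squares(y - b))

constraints = [cp.sum(X, axis=1) == y,       # sum_k x_i^(k) = y_i at every node
               X >= 0, X <= 1,               # X_k = [0, 1] (hypothetical)
               y >= 0, y <= N]               # Y  = [0, N]  (hypothetical)
for (i, j) in edges:                         # consensus between neighboring copies
    constraints += [y[i] == y[j], X[i, :] == X[j, :]]

prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value)
print("consensus y:", y.value)
print("node 0's estimates of all x_k:", X.value[0])
```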
An alternative use of a tree structure
If you don't like all the $x_i^{(k)}$ variables (which I don't), you could use a tree structure rooted at some node, say node 1. Then you could define $s_i$ as a variable equal to the sum of $x_i$ and the $s_j$ values of node $i$'s children, and add the constraint $s_1 = y_1$. This enforces the desired constraint $\sum_{i=1}^N x_i = y$. I give an algorithm for this in the last section below.
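As a quick sanity check, here is a tiny sketch (with a made-up tree and made-up $x_i$ values) showing that the recursively defined $s_i$ make $s_1$ equal the total sum, so the single constraint $s_1 = y_1$ recovers the coupling constraint:

```python
# Partial-sum idea on a small hypothetical tree rooted at node 1: each s_i is
# x_i plus the s_j of node i's children, so s_1 equals the total sum of the x_i.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}   # made-up tree
x = {1: 0.5, 2: 1.0, 3: 0.25, 4: 2.0, 5: 0.75}           # made-up x values

def partial_sum(i):
    """s_i = x_i + sum of s_j over children j of node i (computed recursively)."""
    return x[i] + sum(partial_sum(j) for j in children[i])

s1 = partial_sum(1)
assert abs(s1 - sum(x.values())) < 1e-12   # s_1 is exactly sum_i x_i
print("s_1 =", s1)
```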
A delayed update approach
Alternatively, you could designate node 1 as the "keeper of the constraints" and assume all nodes can communicate with node 1 after a fixed delay of $D\geq 0$ time slots. The resulting problem and an algorithm for it are as follows:
Minimize:
$$ \sum_{i=1}^N f_i(x_i, y_i) $$
Subject to:
$$ y_i = y_1 \: \mbox{ for all $i \in \{2, 3, \ldots, N\}$} $$
$$\sum_{i=1}^N x_i = y_1 $$
$$ x_i \in \mathcal{X}_i \: , \: y_i \in \mathcal{Y} $$
You can then use a "delayed drift-plus-penalty approach" as I did in a more recent paper "Distributed stochastic optimization via correlated scheduling" (see equation (24) there): http://ee.usc.edu/stochastic-nets/docs/distributed-opt-infocom2014.pdf
Specifically, you can enforce the constraints with "virtual queues" $Q_i(t)$ and $Z(t)$ (related to Lagrange multiplier updates) that use delayed versions of the $x_i$ variables. These are updated every time step $t \in \{0, 1, 2, \ldots\}$ by:
$$ Q_i(t+1) = Q_i(t) + y_i(t-D) - y_1(t-D) \: \mbox{ for all $i \in \{2, 3, \ldots, N\}$}$$
$$ Z(t+1) = Z(t) + \sum_{i=1}^{N}x_i(t-D) - y_1(t-D) $$
Assume the $\mathcal{X}_i$ and $\mathcal{Y}$ sets are compact, so the virtual queues change by at most a bounded constant every slot.
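To make the delay concrete, here is a minimal sketch of how node 1 might implement these delayed updates by buffering the last $D+1$ reported decisions. The dimensions, the value of $D$, and the placeholder decisions are arbitrary illustrations, not part of the formulation:

```python
# Delayed virtual-queue updates: node 1 buffers each node's last D+1 decisions
# and updates Q_i(t) and Z(t) using the entries that are D slots old.
from collections import deque

N, D = 4, 3
Q = [0.0] * (N + 1)          # Q[2..N] are used; indices 0 and 1 are placeholders
Z = 0.0
# history[i] holds (x_i(tau), y_i(tau)) for the last D+1 slots, oldest first;
# decisions before time 0 are taken to be zero.
history = [deque([(0.0, 0.0)] * (D + 1), maxlen=D + 1) for _ in range(N + 1)]

def update_queues():
    """Apply the queue updates using the (t - D)-slot decisions."""
    global Z
    x_del = [history[i][0][0] for i in range(1, N + 1)]   # x_i(t - D)
    y_del = [history[i][0][1] for i in range(1, N + 1)]   # y_i(t - D)
    for i in range(2, N + 1):
        Q[i] += y_del[i - 1] - y_del[0]                   # y_i(t-D) - y_1(t-D)
    Z += sum(x_del) - y_del[0]                            # sum_i x_i(t-D) - y_1(t-D)

for t in range(5):
    for i in range(1, N + 1):
        history[i].append((0.1 * i, 0.2))   # placeholder decisions x_i(t), y_i(t)
    update_queues()
print(Q[2:], Z)
```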
Algorithm:
Fix $\epsilon>0$. Every time step $t\in\{0,1,2,\ldots\}$, each node $i$ observes $\tilde{Q}_i(t)$ and $\tilde{Z}(t)$, which can be any values that differ from the true $Q_i(t)$ and $Z(t)$ values by at most an additive constant $C$ (where $C$ does not depend on $t$ or on the $Q_i(t)$, $Z(t)$ values; using $\tilde{Q}_i(t) = Q_i(t-D)$ and $\tilde{Z}(t) = Z(t-D)$ works). Then:
(i) Nodes $i \in \{2, \ldots, N\}$ choose $x_i(t), y_i(t)$ by:
Minimize: $\frac{1}{\epsilon} f_i(x_i(t), y_i(t)) + \tilde{Q}_i(t)y_i(t) + \tilde{Z}(t)x_i(t)$
Subject to: $x_i(t) \in \mathcal{X}_i, y_i(t) \in \mathcal{Y}$
(ii) Node $1$ chooses $x_1(t), y_1(t)$ by:
Minimize: $\frac{1}{\epsilon} f_1(x_1(t), y_1(t)) + Z(t)x_1(t) - y_1(t)\left(Z(t) + \sum_{i=2}^NQ_i(t)\right)$
Subject to: $x_1(t) \in \mathcal{X}_1, y_1(t) \in \mathcal{Y}$.
(iii) Node 1 updates the virtual queues $Q_i(t)$ and $Z(t)$ using delayed information $x_i(t-D)$ and $y_i(t-D)$, as specified in the queueing equations above.
The resulting algorithm yields time averages $\overline{x}_i(t)$ and $\overline{y}_i(t)$ that are an $O(\epsilon)$-approximation to the convex program solution, where:
$$ \overline{x}_i(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x_i(\tau) \: \: , \: \: \overline{y}_i(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} y_i(\tau) $$
This is true regardless of the approximation constant $C$, although a large $C$ value will affect the coefficient of the $O(\epsilon)$ term, as well as the convergence time. The functions $f_i(\cdot)$ need to be convex, but not necessarily strictly convex. For example, it works for linear programs.
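For illustration, here is a runnable sketch of the whole iteration on a toy instance. The quadratic $f_i(x, y) = (x - a_i)^2 + (y - b_i)^2$, the box sets $\mathcal{X}_i = [0,1]$ and $\mathcal{Y} = [0, y_{max}]$, and the choice $C = 0$, $D = 0$ (so $\tilde{Q}_i(t) = Q_i(t)$ and $\tilde{Z}(t) = Z(t)$) are assumptions made to keep the code short; with quadratics, the per-slot minimizations in steps (i) and (ii) reduce to the clipped closed forms used below:

```python
# Sketch of the delayed drift-plus-penalty iteration with D = 0 and quadratic f_i.
import numpy as np

N, T, eps, y_max = 4, 20000, 0.01, 4.0
a = np.array([0.2, 0.5, 0.9, 0.3])       # hypothetical data defining f_i
b = np.array([1.0, 2.0, 1.5, 0.5])       # array index 0 corresponds to node 1

Q = np.zeros(N)                          # Q[1:] are Q_2, ..., Q_N; Q[0] is unused
Z = 0.0
x_sum = np.zeros(N)                      # running sums for the time averages
y_sum = np.zeros(N)

for t in range(T):
    x = np.empty(N)
    y = np.empty(N)
    # step (i): nodes i = 2..N minimize (1/eps) f_i + Q_i y_i + Z x_i over the box
    x[1:] = np.clip(a[1:] - eps * Z / 2.0, 0.0, 1.0)
    y[1:] = np.clip(b[1:] - eps * Q[1:] / 2.0, 0.0, y_max)
    # step (ii): node 1 minimizes (1/eps) f_1 + Z x_1 - y_1 (Z + sum_i Q_i)
    x[0] = np.clip(a[0] - eps * Z / 2.0, 0.0, 1.0)
    y[0] = np.clip(b[0] + eps * (Z + Q[1:].sum()) / 2.0, 0.0, y_max)
    # step (iii): virtual-queue updates (no delay here since D = 0)
    Q[1:] += y[1:] - y[0]
    Z += x.sum() - y[0]
    x_sum += x
    y_sum += y

# time averages should be an O(eps)-approximation of the convex program solution
print("time-averaged x:", x_sum / T)
print("time-averaged y:", y_sum / T)
print("sum of x averages:", (x_sum / T).sum(), "vs y_1 average:", (y_sum / T)[0])
```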
The tree-based algorithm
Let's assume the tree structure, with node 1 the "root." For each node $i \in \{1, \ldots, N\}$, define $C(i)$ as the set of nodes that are children of node $i$. The set $C(i)$ is empty if node $i$ is a leaf. For each $i \in \{2, 3, \ldots, N\}$, define $P(i)$ as the parent of node $i$. For simplicity, assume the sets $\mathcal{X}_i$ are sets of nonnegative numbers.
Minimize:
$$ \sum_{i=1}^N f_i(x_i, y_i) $$
Subject to:
$$ y_i = y_{P(i)} \: \: \mbox{ for all $i \in \{2, \ldots, N\}$} $$
$$ y_1 = s_1 $$
$$ s_i = x_i + \sum_{j\in C(i)} s_j \: \: \mbox{ for all $i \in \{1, \ldots, N\}$} $$
$$ y_i \in \mathcal{Y}, x_i \in \mathcal{X}_i, s_i \in [0, s_{max}] $$
where $s_{max}$ is a known upper bound on the sum of the $x_i$ values (such a bound exists since the $\mathcal{X}_i$ sets are compact).
Drift-plus-penalty method:
For each constraint, define a virtual queue with update equations given by:
$$ Q_i(t+1) = Q_i(t) + y_i(t) - y_{P(i)}(t) \: \: \mbox{ for all $i \in \{2, \ldots, N\}$} $$
$$ Z(t+1) = Z(t) + y_1(t) - s_1(t) $$
$$ H_i(t+1) = H_i(t) + s_i(t) - x_i(t) - \sum_{j\in C(i)} s_j(t) \: \: \mbox{ for all $i \in \{1, \ldots, N\}$} $$
The values $Q_i(t)$ and $H_i(t)$ are kept at node $i$. The $Z(t)$ value is kept at node 1.
Every timeslot $t$ do:
(i) Nodes $i\in\{2, \ldots, N\}$ observe their own queue $H_i(t)$ and the queue $H_{P(i)}(t)$ of their parent, and
choose $s_i(t) \in [0, s_{max}]$ to minimize:
$$ s_i(t)[H_i(t) - H_{P(i)}(t)] $$
This reduces to choosing $s_i(t)=0$ if $H_i(t) \geq H_{P(i)}(t)$, and $s_i(t)=s_{max}$ otherwise.
(ii) Nodes $i \in \{2, \ldots, N\}$ observe their own queues $Q_i(t), H_i(t)$ and the queues $Q_j(t)$ for their children $j \in C(i)$, and choose $x_i(t) \in \mathcal{X}_i$, $y_i(t)\in\mathcal{Y}$ to minimize:
$$ \frac{1}{\epsilon}f_i(x_i(t), y_i(t)) + y_i(t)\left[Q_i(t) - \sum_{j\in C(i)} Q_j(t)\right] -x_i(t)H_i(t) $$
(iii) Node 1 observes its own queues $Z(t), H_1(t)$ and the queues $Q_j(t)$ for its children $j \in C(1)$, and chooses $y_1(t)\in \mathcal{Y}$, $x_1(t) \in \mathcal{X}_1$ to minimize:
$$ \frac{1}{\epsilon}f_1(x_1(t), y_1(t)) + y_1(t)\left[Z(t) -\sum_{j\in C(1)}Q_j(t)\right] - x_1(t)H_1(t) $$
(iv) Node 1 observes its own $Z(t), H_1(t)$ and chooses $s_1(t) \in [0, s_{max}]$ to minimize:
$$ s_1(t)[-Z(t) + H_1(t)] $$
(v) Each node updates its own queues according to the update equations given above.
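For illustration, here is a runnable sketch of this tree-based algorithm on a toy instance. The tree, the quadratic $f_i(x, y) = (x - a_i)^2 + (y - b_i)^2$, the box sets $\mathcal{X}_i = [0,1]$ and $\mathcal{Y} = [0, y_{max}]$, and the value of $s_{max}$ are all assumptions for illustration; steps (i) and (iv) are the threshold rules described above, and with quadratics the step (ii)-(iii) minimizations reduce to clipped closed forms:

```python
# Sketch of the tree-based drift-plus-penalty algorithm with quadratic f_i.
import numpy as np

children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}   # made-up tree, root = 1
parent = {2: 1, 3: 1, 4: 2, 5: 2}
nodes = sorted(children)
T, eps, y_max, s_max = 50000, 0.005, 6.0, 5.0
a = dict(zip(nodes, [0.2, 0.5, 0.9, 0.3, 0.7]))          # hypothetical f_i data
b = dict(zip(nodes, [1.0, 2.0, 1.5, 0.5, 2.5]))

Q = {i: 0.0 for i in nodes if i != 1}      # one Q_i per non-root node
H = {i: 0.0 for i in nodes}                # one H_i per node
Z = 0.0
x_avg = {i: 0.0 for i in nodes}
y_avg = {i: 0.0 for i in nodes}

for t in range(T):
    x, y, s = {}, {}, {}
    # step (i): s_i for i >= 2 by comparing H_i with the parent's H
    for i in nodes:
        if i != 1:
            s[i] = 0.0 if H[i] >= H[parent[i]] else s_max
    # step (ii): x_i, y_i for i >= 2 (closed forms for the quadratic f_i)
    for i in nodes:
        if i != 1:
            w = Q[i] - sum(Q[j] for j in children[i])
            y[i] = np.clip(b[i] - eps * w / 2.0, 0.0, y_max)
            x[i] = np.clip(a[i] + eps * H[i] / 2.0, 0.0, 1.0)
    # step (iii): node 1 chooses x_1, y_1
    w1 = Z - sum(Q[j] for j in children[1])
    y[1] = np.clip(b[1] - eps * w1 / 2.0, 0.0, y_max)
    x[1] = np.clip(a[1] + eps * H[1] / 2.0, 0.0, 1.0)
    # step (iv): s_1 from the sign of H_1 - Z
    s[1] = 0.0 if H[1] >= Z else s_max
    # step (v): queue updates
    for i in nodes:
        if i != 1:
            Q[i] += y[i] - y[parent[i]]
        H[i] += s[i] - x[i] - sum(s[j] for j in children[i])
    Z += y[1] - s[1]
    for i in nodes:
        x_avg[i] += x[i] / T
        y_avg[i] += y[i] / T

# time averages should approach the convex program solution as eps shrinks
print("time-averaged x:", x_avg)
print("time-averaged y:", y_avg)           # the y_i averages approach a common value
print("sum of x averages:", sum(x_avg.values()), "vs y_1 average:", y_avg[1])
```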