Another approach is this: if you view each index $i \in \{1, \ldots, N\}$ as a node of a connected graph, you can define a local variable $y_i$ for each $i \in \{1, \ldots, N\}$ and then enforce the constraint $y_i=y_j$ whenever $i$ and $j$ are neighbors. If you want, you can do the same thing for the $x_i$ variables: define $x_i^{(k)}$ as node $i$'s estimate of $x_k$. The resulting problem is:
Minimize:
$$ \sum_{i=1}^N f_i(x_i^{(i)}, y_i) $$
Subject to:
$$ y_i = y_j \: \mbox{ whenever $i$ and $j$ are neighbors} $$
$$ \sum_{k=1}^N x_i^{(k)} = y_i \: \mbox{ for all $i$} $$
$$ x_i^{(k)} = x_j^{(k)} \: \mbox{ for all $k$, whenever $i$ and $j$ are neighbors} $$
$$ y_i \in \mathcal{Y}, \: x_i^{(k)} \in \mathcal{X}_k $$
This is a brute-force approach that introduces many variables; you can reduce the number of variables by using a tree (a "minimalist" connected graph). I did this in the following paper, which may be of interest: "Distributed and secure computation of convex programs over a network of connected processors":
http://www-bcf.usc.edu/~mjneely/pdf_papers/dist_comp_dcdis2.pdf
The key property of this formulation is that the objective function is a separable sum of terms, where each term involves only functions and variables associated with a particular node. Likewise, all constraints are separable sums whose terms involve only variables at neighboring nodes, so that message passing between neighbors can enforce these constraints. The paper above develops a distributed algorithm that solves this, where each node knows only its own $f_i(\cdot)$ function.
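For concreteness, here is a small sketch of this reformulation on a toy instance, solved centrally just to show its structure. The quadratic $f_i(x_i, y) = (x_i - a_i)^2 + (y - b_i)^2$, the path graph, the box sets $\mathcal{X}_k = [0, 1]$ and $\mathcal{Y} = [0, N]$, and the use of cvxpy are assumptions made purely for illustration:

```python
# A minimal sketch of the "local copies + neighbor equality" reformulation.
# The functions f_i, the sets X_k = [0, 1] and Y = [0, N], and the path graph
# are hypothetical choices made just to have something concrete to solve.
import cvxpy as cp
import numpy as np

N = 4
a = np.array([0.2, 0.5, 0.9, 0.3])          # hypothetical data defining f_i
b = np.array([1.0, 2.0, 1.5, 0.5])
edges = [(i, i + 1) for i in range(N - 1)]   # path graph: 0-1-2-3

y = cp.Variable(N)            # y_i: node i's copy of the shared variable y
X = cp.Variable((N, N))       # X[i, k]: node i's estimate of x_k

# f_i(x_i, y_i) = (x_i - a_i)^2 + (y_i - b_i)^2, evaluated at node i's own copies
objective = cp.Minimize(cp.sum_squares(cp.diag(X) - a) + cp.sum_squares(y - b))

constraints = [cp.sum(X, axis=1) == y,       # sum_k x_i^(k) = y_i at every node
               X >= 0, X <= 1,               # X_k = [0, 1] (hypothetical)
               y >= 0, y <= N]               # Y  = [0, N]  (hypothetical)
for (i, j) in edges:                         # consensus between neighboring copies
    constraints += [y[i] == y[j], X[i, :] == X[j, :]]

prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value)
print("consensus y:", y.value)
print("node 0's estimates of all x_k:", X.value[0])
```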
An alternative use of a tree structure
If you don't like all the $x_i^{(k)}$ variables (which I don't), you could use a tree structure rooted at some node, say node 1. Then you could define $s_i$ as a variable equal to the sum of $x_i$ and the $s_j$ values of node $i$'s children, and add the constraint $s_1 = y_1$. This enforces the desired constraint $\sum_{i=1}^N x_i = y$. I give an algorithm for this in the last section below.
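As a quick sanity check, here is a tiny sketch (with a made-up tree and made-up $x_i$ values) showing that the recursively defined $s_i$ make $s_1$ equal the total sum, so the single constraint $s_1 = y_1$ recovers the coupling constraint:

```python
# Partial-sum idea on a small hypothetical tree rooted at node 1: each s_i is
# x_i plus the s_j of node i's children, so s_1 equals the total sum of the x_i.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}   # made-up tree
x = {1: 0.5, 2: 1.0, 3: 0.25, 4: 2.0, 5: 0.75}           # made-up x values

def partial_sum(i):
    """s_i = x_i + sum of s_j over children j of node i (computed recursively)."""
    return x[i] + sum(partial_sum(j) for j in children[i])

s1 = partial_sum(1)
assert abs(s1 - sum(x.values())) < 1e-12   # s_1 is exactly sum_i x_i
print("s_1 =", s1)
```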
A delayed update approach
Alternatively, you could designate node 1 as the "keeper of the constraints" and assume all nodes can communicate with node 1 after a fixed delay of $D\geq 0$ time slots. The resulting problem and an algorithm for it are as follows:
Minimize:
$$ \sum_{i=1}^N f_i(x_i, y_i) $$
Subject to:
$$ y_i = y_1 \: \mbox{ for all $i \in \{2, 3, \ldots, N\}$} $$
$$\sum_{i=1}^N x_i = y_1 $$
$$ x_i \in \mathcal{X}_i \: , \: y_i \in \mathcal{Y} $$
You can then use a "delayed drift-plus-penalty approach" as I did in a more recent paper "Distributed stochastic optimization via correlated scheduling" (see equation (24) there): http://ee.usc.edu/stochastic-nets/docs/distributed-opt-infocom2014.pdf
Specifically, you can enforce the constraints with "virtual queues" $Q_i(t)$ and $Z(t)$ (related to Lagrange multiplier updates) that use delayed versions of the $x_i$ variables. These are updated every time step $t \in \{0, 1, 2, \ldots\}$ by:
$$ Q_i(t+1) = Q_i(t) + y_i(t-D) - y_1(t-D) \: \mbox{ for all $i \in \{2, 3, \ldots, N\}$}$$
$$ Z(t+1) = Z(t) + \sum_{i=1}^{N}x_i(t-D) - y_1(t-D) $$
Assume the $\mathcal{X}_i$ and $\mathcal{Y}$ sets are compact, so the virtual queues change by at most a bounded constant every slot.
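To make the delay concrete, here is a minimal sketch of how node 1 might implement these delayed updates by buffering the last $D+1$ reported decisions. The dimensions, the value of $D$, and the placeholder decisions are arbitrary illustrations, not part of the formulation:

```python
# Delayed virtual-queue updates: node 1 buffers each node's last D+1 decisions
# and updates Q_i(t) and Z(t) using the entries that are D slots old.
from collections import deque

N, D = 4, 3
Q = [0.0] * (N + 1)          # Q[2..N] are used; indices 0 and 1 are placeholders
Z = 0.0
# history[i] holds (x_i(tau), y_i(tau)) for the last D+1 slots, oldest first;
# decisions before time 0 are taken to be zero.
history = [deque([(0.0, 0.0)] * (D + 1), maxlen=D + 1) for _ in range(N + 1)]

def update_queues():
    """Apply the queue updates using the (t - D)-slot decisions."""
    global Z
    x_del = [history[i][0][0] for i in range(1, N + 1)]   # x_i(t - D)
    y_del = [history[i][0][1] for i in range(1, N + 1)]   # y_i(t - D)
    for i in range(2, N + 1):
        Q[i] += y_del[i - 1] - y_del[0]                   # y_i(t-D) - y_1(t-D)
    Z += sum(x_del) - y_del[0]                            # sum_i x_i(t-D) - y_1(t-D)

for t in range(5):
    for i in range(1, N + 1):
        history[i].append((0.1 * i, 0.2))   # placeholder decisions x_i(t), y_i(t)
    update_queues()
print(Q[2:], Z)
```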
Algorithm:
Fix $\epsilon>0$. Every time step $t\in\{0,1,2,\ldots\}$, each node $i$ observes $\tilde{Q}_i(t)$ and $\tilde{Z}(t)$, which can be any values that differ from the true $Q_i(t)$ and $Z(t)$ values by at most an additive constant $C$ (where $C$ does not depend on $t$ or on the $Q_i(t)$, $Z(t)$ values; using $\tilde{Q}_i(t) = Q_i(t-D)$ and $\tilde{Z}(t) = Z(t-D)$ works). Then:
(i) Nodes $i \in \{2, \ldots, N\}$ choose $x_i(t), y_i(t)$ by:
Minimize: $\frac{1}{\epsilon} f_i(x_i(t), y_i(t)) + \tilde{Q}_i(t)y_i(t) + \tilde{Z}(t)x_i(t)$
Subject to: $x_i(t) \in \mathcal{X}_i, y_i(t) \in \mathcal{Y}$
(ii) Node $1$ chooses $x_1(t), y_1(t)$ by:
Minimize: $\frac{1}{\epsilon} f_1(x_1(t), y_1(t)) + Z(t)x_1(t) - y_1(t)\left(Z(t) + \sum_{i=2}^NQ_i(t)\right)$
Subject to: $x_1(t) \in \mathcal{X}_1, y_1(t) \in \mathcal{Y}$.
(iii) Node 1 updates the virtual queues $Q_i(t)$ and $Z(t)$ using delayed information $x_i(t-D)$ and $y_i(t-D)$, as specified in the queueing equations above.
The resulting algorithm yields time averages $\overline{x}_i(t)$ and $\overline{y}_i(t)$ that are an $O(\epsilon)$-approximation to the convex program solution, where:
$$ \overline{x}_i(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x_i(\tau) \: \: , \: \: \overline{y}_i(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} y_i(\tau) $$
This is true regardless of the approximation constant $C$, although a large $C$ value will affect the coefficient of the $O(\epsilon)$ term, as well as the convergence time. The functions $f_i(\cdot)$ need to be convex, but not necessarily strictly convex. For example, it works for linear programs.
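For illustration, here is a runnable sketch of the whole iteration on a toy instance. The quadratic $f_i(x, y) = (x - a_i)^2 + (y - b_i)^2$, the box sets $\mathcal{X}_i = [0,1]$ and $\mathcal{Y} = [0, y_{max}]$, and the choice $C = 0$, $D = 0$ (so $\tilde{Q}_i(t) = Q_i(t)$ and $\tilde{Z}(t) = Z(t)$) are assumptions made to keep the code short; with quadratics, the per-slot minimizations in steps (i) and (ii) reduce to the clipped closed forms used below:

```python
# Sketch of the delayed drift-plus-penalty iteration with D = 0 and quadratic f_i.
import numpy as np

N, T, eps, y_max = 4, 20000, 0.01, 4.0
a = np.array([0.2, 0.5, 0.9, 0.3])       # hypothetical data defining f_i
b = np.array([1.0, 2.0, 1.5, 0.5])       # array index 0 corresponds to node 1

Q = np.zeros(N)                          # Q[1:] are Q_2, ..., Q_N; Q[0] is unused
Z = 0.0
x_sum = np.zeros(N)                      # running sums for the time averages
y_sum = np.zeros(N)

for t in range(T):
    x = np.empty(N)
    y = np.empty(N)
    # step (i): nodes i = 2..N minimize (1/eps) f_i + Q_i y_i + Z x_i over the box
    x[1:] = np.clip(a[1:] - eps * Z / 2.0, 0.0, 1.0)
    y[1:] = np.clip(b[1:] - eps * Q[1:] / 2.0, 0.0, y_max)
    # step (ii): node 1 minimizes (1/eps) f_1 + Z x_1 - y_1 (Z + sum_i Q_i)
    x[0] = np.clip(a[0] - eps * Z / 2.0, 0.0, 1.0)
    y[0] = np.clip(b[0] + eps * (Z + Q[1:].sum()) / 2.0, 0.0, y_max)
    # step (iii): virtual-queue updates (no delay here since D = 0)
    Q[1:] += y[1:] - y[0]
    Z += x.sum() - y[0]
    x_sum += x
    y_sum += y

# time averages should be an O(eps)-approximation of the convex program solution
print("time-averaged x:", x_sum / T)
print("time-averaged y:", y_sum / T)
print("sum of x averages:", (x_sum / T).sum(), "vs y_1 average:", (y_sum / T)[0])
```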
The tree-based algorithm
Let's assume the tree structure, with node 1 the "root." For each node $i \in \{1, \ldots, N\}$, define $C(i)$ as the set of nodes that are children of node $i$. The set $C(i)$ is empty if node $i$ is a leaf. For each $i \in \{2, 3, \ldots, N\}$, define $P(i)$ as the parent of node $i$. For simplicity, assume the sets $\mathcal{X}_i$ are sets of nonnegative numbers.
Minimize:
$$ \sum_{i=1}^N f_i(x_i, y_i) $$
Subject to:
$$ y_i = y_{P(i)} \: \: \mbox{ for all $i \in \{2, \ldots, N\}$} $$
$$ y_1 = s_1 $$
$$ s_i = x_i + \sum_{j\in C(i)} s_j \: \: \mbox{ for all $i \in \{1, \ldots, N\}$} $$
$$ y_i \in \mathcal{Y}, x_i \in \mathcal{X}_i, s_i \in [0, s_{max}] $$
where $s_{max}$ is a known upper bound on the sum of the $x_i$ values (such a bound exists since the $\mathcal{X}_i$ sets are compact).
Drift-plus-penalty method:
For each constraint, define a virtual queue with update equations given by:
$$ Q_i(t+1) = Q_i(t) + y_i(t) - y_{P(i)}(t) \: \: \mbox{ for all $i \in \{2, \ldots, N\}$} $$
$$ Z(t+1) = Z(t) + y_1(t) - s_1(t) $$
$$ H_i(t+1) = H_i(t) + s_i(t) - x_i(t) - \sum_{j\in C(i)} s_j(t) \: \: \mbox{ for all $i \in \{1, \ldots, N\}$} $$
The values $Q_i(t)$ and $H_i(t)$ are kept at node $i$. The $Z(t)$ value is kept at node 1.
Every timeslot $t$ do:
(i) Nodes $i\in\{2, \ldots, N\}$ observe their own queue $H_i(t)$ and the queue $H_{P(i)}(t)$ of their parent, and
choose $s_i(t) \in [0, s_{max}]$ to minimize:
$$ s_i(t)[H_i(t) - H_{P(i)}(t)] $$
This reduces to choosing $s_i(t)=0$ if $H_i(t) \geq H_{P(i)}(t)$, and $s_i(t)=s_{max}$ otherwise.
(ii) Nodes $i \in \{2, \ldots, N\}$ observe their own queues $Q_i(t), H_i(t)$ and the queues $Q_j(t)$ for their children $j \in C(i)$, and choose $x_i(t) \in \mathcal{X}_i$, $y_i(t)\in\mathcal{Y}$ to minimize:
$$ \frac{1}{\epsilon}f_i(x_i(t), y_i(t)) + y_i(t)\left[Q_i(t) - \sum_{j\in C(i)} Q_j(t)\right] -x_i(t)H_i(t) $$
(iii) Node 1 observes its own queues $Z(t), H_1(t)$ and the queues $Q_j(t)$ for its children $j \in C(1)$, and chooses $y_1(t)\in \mathcal{Y}$, $x_1(t) \in \mathcal{X}_1$ to minimize:
$$ \frac{1}{\epsilon}f_1(x_1(t), y_1(t)) + y_1(t)\left[Z(t) -\sum_{j\in C(1)}Q_j(t)\right] - x_1(t)H_1(t) $$
(iv) Node 1 observes its own $Z(t), H_1(t)$ and chooses $s_1(t) \in [0, s_{max}]$ to minimize:
$$ s_1(t)[-Z(t) + H_1(t)] $$
(v) Each node updates its own queues according to the update equations given above.
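For illustration, here is a runnable sketch of this tree-based algorithm on a toy instance. The tree, the quadratic $f_i(x, y) = (x - a_i)^2 + (y - b_i)^2$, the box sets $\mathcal{X}_i = [0,1]$ and $\mathcal{Y} = [0, y_{max}]$, and the value of $s_{max}$ are all assumptions for illustration; steps (i) and (iv) are the threshold rules described above, and with quadratics the step (ii)-(iii) minimizations reduce to clipped closed forms:

```python
# Sketch of the tree-based drift-plus-penalty algorithm with quadratic f_i.
import numpy as np

children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}   # made-up tree, root = 1
parent = {2: 1, 3: 1, 4: 2, 5: 2}
nodes = sorted(children)
T, eps, y_max, s_max = 50000, 0.005, 6.0, 5.0
a = dict(zip(nodes, [0.2, 0.5, 0.9, 0.3, 0.7]))          # hypothetical f_i data
b = dict(zip(nodes, [1.0, 2.0, 1.5, 0.5, 2.5]))

Q = {i: 0.0 for i in nodes if i != 1}      # one Q_i per non-root node
H = {i: 0.0 for i in nodes}                # one H_i per node
Z = 0.0
x_avg = {i: 0.0 for i in nodes}
y_avg = {i: 0.0 for i in nodes}

for t in range(T):
    x, y, s = {}, {}, {}
    # step (i): s_i for i >= 2 by comparing H_i with the parent's H
    for i in nodes:
        if i != 1:
            s[i] = 0.0 if H[i] >= H[parent[i]] else s_max
    # step (ii): x_i, y_i for i >= 2 (closed forms for the quadratic f_i)
    for i in nodes:
        if i != 1:
            w = Q[i] - sum(Q[j] for j in children[i])
            y[i] = np.clip(b[i] - eps * w / 2.0, 0.0, y_max)
            x[i] = np.clip(a[i] + eps * H[i] / 2.0, 0.0, 1.0)
    # step (iii): node 1 chooses x_1, y_1
    w1 = Z - sum(Q[j] for j in children[1])
    y[1] = np.clip(b[1] - eps * w1 / 2.0, 0.0, y_max)
    x[1] = np.clip(a[1] + eps * H[1] / 2.0, 0.0, 1.0)
    # step (iv): s_1 from the sign of H_1 - Z
    s[1] = 0.0 if H[1] >= Z else s_max
    # step (v): queue updates
    for i in nodes:
        if i != 1:
            Q[i] += y[i] - y[parent[i]]
        H[i] += s[i] - x[i] - sum(s[j] for j in children[i])
    Z += y[1] - s[1]
    for i in nodes:
        x_avg[i] += x[i] / T
        y_avg[i] += y[i] / T

# time averages should approach the convex program solution as eps shrinks
print("time-averaged x:", x_avg)
print("time-averaged y:", y_avg)           # the y_i averages approach a common value
print("sum of x averages:", sum(x_avg.values()), "vs y_1 average:", y_avg[1])
```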