Parallel algorithm is a subject that interests me a lot because it has includes both theoretical aspect and numerical aspect, because on the one hand, we are interested in designing a parallel algorithm which has fast Spanning time and low Work. On the other hand, the parallel algorithms are more sensitive to other factors such as work balancing, cache efficiency and communication cost.
I am currently reading a paper about Parallel SGD with compression and I would like to briefly discuss this paper a little bit. The paper doesn't have very hard theoretical analysis comparing to other optimization paper, but it is very useful because it deals with the issue that when you are paralleling the SGD you suffer from the communication cost. And it provides a convergence analysis about it. Now I am going to talk about the detail of this paper.
In this paper, they present a parallel algorithm which deals with distributed learning under the data parallelism setting, where we assume that data are distributed over multiple nodes on a network, with a shared model that needs to be jointly optimized. Mathematically, the underlying problem can be posed as the following distributed optimization problem
\begin{equation}\label{eq1} \min_{\textbf{x} \in X} f(\textbf{x}) \equiv \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi_i}[F(\textbf{x} ,\xi_i)] \end{equation}
To deal with (1) , the most common synchronized approach is parallel SGD, so each machine calculate the local stochastic gradient with respect to the shared parameter $\textbf{x}_t$:
\begin{align*}\textbf{g}_t^i = \nabla F(\textbf{x}_t,\xi^i_t) \end{align*}
Then send the local stochastic gradient to the master processor and aggregate the data:
\begin{align*} \textbf{g}_t = \frac{1}{n}\sum_{i=1}^{n}\textbf{g} ^i_t \end{align*}
Next update $ \textbf{x}^{t+1}$ by using $\textbf{g}_t$ Then send the updated $\textbf{x}^{t+1}$ to the other processor;However, notice that when the data size is large, the communication cost will dominate the running time. Then the running time will be essentially the same as the serial SGD...
As a result, in order to alleviate the cost, each processor can send the compressed local gradient such as quantization or sparsification.We will define this 2 operators later. However, there is an issue. It is observed that such compression methods slow down the convergence due to the loss of information under compression. To deal with this issue, we will record the loss due to compression from the last iteration and add this loss to the stochastic gradient on the next iteration. We will do similar thing in the master processor as well. We will talk about more details when we see the detail of the algorithm...
Now we need to talk about how to do the compression. We will mainly talk about 2 ways here.
Let $Q_w(.)$ be the compression operator.This operator can either be biased or unbiased.The operator can be either random or deterministic. We list several common compression operator here:
1 . Random Quantization: For any random number $z \in [a,b]$ where $a$ and $b$ are predesigned low bit number. The compression operator will equal $a$ with probability $\frac{b-z}{b-a}$ and equal $b$ with probability $\frac{a-z}{b-a}$ \item
2. Top k sparsification :For any vector $\textbf{x}$, compress $\textbf{x}$ by retaining the top k largest elements of this vector and set the others to zero.(Note that there is a random version of this operator) \end{enumerate}
Now we give the pseudo code of the algorithm :
Input: Initialize $\textbf{x}_0$ , learning rate $\gamma$ , initial error $\boldsymbol{\delta}^i = 0$ and $\boldsymbol{\delta} = 0$, and number of total iterations T
FOR{$t = 0,1,\cdots,T $}
On processor for local gradient
Compute the error-compensated stochastic gradient:
$\textbf{v}^i_{t} = \nabla F(\textbf{x}_t,\xi^i_t) + \boldsymbol{\delta}^i_{t-1}$
Compress $\textbf{v}^i_t$ into $Q_w(\textbf{v}^i_t)$
update error $\boldsymbol{ \delta}_t^i = \textbf{v}^i_t - Q_w(\textbf{v}^i_t)$
Send $Q_w(\textbf{x}^i_t)$ to the parameter server
On Master processor
Average all gradients received from workers:
$\textbf{v}_t = \frac{1}{n}\sum_{i=1}^{n} Q_w[\textbf{v}^i_t] + \boldsymbol{\delta}_{t-1} $
Compress $\textbf{v}_t$ into $Q_w(\textbf{v}_t)$
Update the error $\boldsymbol{\delta}_t = \textbf{v}_t - Q_w(\textbf{v}_t)$
Send $Q_w(\textbf{v}_t)$ to processor for local gradient
On processor for local gradient
Update the local model $\textbf{x}_{t+1} = \textbf{x}_{t} - \gamma Q_w(\textbf{v}_t)$
ENDFOR
return $\textbf{x}_T$
The above algorithm has the following convergence result:
Theorem 1 Under appropriate assumption, for DOUBLESQUEEZE, choosing $\gamma = \frac{1}{4L + \sigma \sqrt{\frac{T}{n}}+\epsilon^{\frac{2}{3}}T^{\frac{1}{3}}}$, we have the following convergence rate: \begin{align*} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[\Vert \nabla f(\textbf{x}_t)\Vert^2] = O(\frac{\sigma}{\sqrt{nT}}+ \frac{\epsilon^{\frac{2}{3}}}{T^{\frac{2}{3}}}+\frac{1}{T}) \end{align*} Here we treat $f(\textbf{x}^1)-f(\textbf{x}^*)$ and $L$ as a constants
What a Mathematicians!
ReplyDelete