Parallel algorithms are a subject that interests me a lot because the area has both a theoretical aspect and a numerical aspect: on the one hand, we are interested in designing parallel algorithms with small span and low work; on the other hand, parallel algorithms are more sensitive to other factors such as load balancing, cache efficiency, and communication cost.
I am currently reading a paper about parallel SGD with compression, and I would like to briefly discuss it here. The paper does not have very hard theoretical analysis compared to other optimization papers, but it is very useful because it deals with the communication cost you suffer when parallelizing SGD, and it provides a convergence analysis for the resulting method. Now I am going to talk about the details of this paper.
In this paper, they present a parallel algorithm for distributed learning under the data-parallel setting, where we assume that the data are distributed over multiple nodes of a network, with a shared model that needs to be jointly optimized. Mathematically, the underlying problem can be posed as the following distributed optimization problem:
$$\min_{x\in X} f(x) \equiv \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\xi_i}\big[F(x,\xi_i)\big] \qquad (1)$$
To deal with (1), the most common synchronous approach is parallel SGD: each machine calculates its local stochastic gradient with respect to the shared parameter $x_t$:
$$g^i_t = \nabla F(x_t, \xi^i_t)$$
Each machine then sends its local stochastic gradient to the master processor, which aggregates them:
$$g_t = \frac{1}{n}\sum_{i=1}^{n} g^i_t$$
Next, the master updates $x_{t+1}$ using $g_t$ and sends the updated $x_{t+1}$ back to the other processors. However, notice that when the gradient (model) size is large, the communication cost dominates the running time, and the overall running time becomes essentially the same as that of serial SGD.
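To make the baseline concrete, here is a minimal single-process sketch of one synchronous parallel SGD step. This is only an illustration under my own assumptions: `local_gradient` and `worker_batches` are hypothetical stand-ins for the workers' sampling and gradient computation, not anything from the paper.

```python
import numpy as np

def parallel_sgd_step(x, worker_batches, local_gradient, lr=0.1):
    """One step of synchronous parallel SGD (no compression).

    Each "worker" computes a stochastic gradient at the shared iterate x,
    and the master averages them and takes a step. In a real deployment
    the gather and broadcast below are network round-trips, which is
    exactly the communication cost discussed above.
    """
    grads = [local_gradient(x, batch) for batch in worker_batches]  # workers compute g_t^i
    g = np.mean(grads, axis=0)                                      # master aggregates g_t
    return x - lr * g                                               # x_{t+1}, then broadcast
```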
As a result, in order to alleviate this cost, each processor can send a compressed local gradient, obtained for example by quantization or sparsification. We will define these two operators later. However, there is an issue: it has been observed that such compression methods slow down convergence due to the loss of information under compression. To deal with this, we record the loss due to compression from the last iteration and add this loss to the stochastic gradient at the next iteration. We do a similar thing on the master processor as well. We will give more details when we present the algorithm.
Now we need to talk about how to do the compression. We will mainly discuss two approaches here.
Let $Q_\omega(\cdot)$ be the compression operator. This operator can be either biased or unbiased, and it can be either random or deterministic. We list two common compression operators here (a short code sketch of both follows the list):
1. Random quantization: for any number $z \in [a,b]$, where $a$ and $b$ are predesigned low-bit numbers, the compression operator equals $a$ with probability $\frac{b-z}{b-a}$ and equals $b$ with probability $\frac{z-a}{b-a}$.
2. Top-$k$ sparsification: for any vector $x$, compress $x$ by retaining the $k$ largest-magnitude elements of the vector and setting the others to zero. (Note that there is also a random version of this operator.)
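As a concrete illustration, here is a small NumPy sketch of both operators; the function names and the level grid are my own choices rather than notation from the paper.

```python
import numpy as np

def random_quantize(x, levels):
    """Stochastic (unbiased) quantization onto a fixed grid of levels.

    For z in [a, b] with a, b consecutive levels, return a with probability
    (b - z) / (b - a) and b with probability (z - a) / (b - a), so E[Q(z)] = z.
    """
    levels = np.sort(np.asarray(levels, dtype=float))
    x = np.clip(np.asarray(x, dtype=float), levels[0], levels[-1])
    idx = np.minimum(np.searchsorted(levels, x, side="right") - 1, len(levels) - 2)
    a, b = levels[idx], levels[idx + 1]
    p_up = (x - a) / (b - a)                      # probability of rounding up to b
    return np.where(np.random.rand(*x.shape) < p_up, b, a)

def top_k_sparsify(x, k):
    """Deterministic (biased) top-k: keep the k largest-magnitude entries."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.argpartition(np.abs(x), -k)[-k:]    # indices of the k largest |x_j|
    out[keep] = x[keep]
    return out
```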
Now we give the pseudocode of the algorithm:
Input: initial point $x_0$, learning rate $\gamma$, initial errors $\delta^i_{-1}=0$ and $\delta_{-1}=0$, and number of total iterations $T$
FOR $t = 0, 1, \cdots, T-1$
On each worker processor:
Compute the error-compensated stochastic gradient:
$v^i_t = \nabla F(x_t, \xi^i_t) + \delta^i_{t-1}$
Compress $v^i_t$ into $Q_\omega(v^i_t)$
Update the error $\delta^i_t = v^i_t - Q_\omega(v^i_t)$
Send $Q_\omega(v^i_t)$ to the parameter server
On the master processor:
Average all compressed gradients received from the workers and compensate with the master-side error:
$v_t = \frac{1}{n}\sum_{i=1}^{n} Q_\omega(v^i_t) + \delta_{t-1}$
Compress $v_t$ into $Q_\omega(v_t)$
Update the error $\delta_t = v_t - Q_\omega(v_t)$
Send $Q_\omega(v_t)$ back to the worker processors
On each worker processor:
Update the local model $x_{t+1} = x_t - \gamma Q_\omega(v_t)$
ENDFOR
return $x_T$
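To see how the two error buffers interact, here is a single-process simulation of the loop above on a toy least-squares problem. This is only a sketch under my own assumptions: the workers are simulated by a Python loop, `compress` stands in for $Q_\omega$ (e.g. `lambda v: top_k_sparsify(v, 3)` from the earlier sketch), and none of the names come from the paper's code.

```python
import numpy as np

def double_squeeze_sgd(data, x0, compress, lr=0.05, T=200, batch=8):
    """Simulated error-compensated, doubly-compressed parallel SGD.

    data: list of (A_i, b_i) least-squares datasets, one per simulated worker.
    compress: any compression operator Q (e.g. top-k sparsification).
    """
    x = x0.copy()
    n = len(data)
    worker_err = [np.zeros_like(x) for _ in range(n)]  # delta^i, per-worker error
    master_err = np.zeros_like(x)                      # delta, master-side error
    for _ in range(T):
        compressed = []
        for i, (A, b) in enumerate(data):
            idx = np.random.choice(len(b), size=batch, replace=False)
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch  # stochastic gradient
            v = grad + worker_err[i]                         # add back what was lost last round
            q = compress(v)
            worker_err[i] = v - q                            # record the new compression error
            compressed.append(q)                             # "send" the compressed gradient
        v = np.mean(compressed, axis=0) + master_err         # master-side error compensation
        q = compress(v)
        master_err = v - q
        x = x - lr * q                                       # broadcast and apply the update
    return x
```

Setting `compress` to the identity recovers plain parallel SGD, which is a convenient sanity check: in that case both error buffers stay at zero.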
The above algorithm has the following convergence result:
Theorem 1. Under appropriate assumptions (here $L$ is the smoothness constant, $\sigma^2$ bounds the variance of the stochastic gradients, and $\epsilon$ bounds the compression error), for DoubleSqueeze, choosing $\gamma = \frac{1}{4L + \sigma\sqrt{T/n} + \epsilon^{2/3}T^{1/3}}$, we have the following convergence rate:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] = O\left(\frac{\sigma}{\sqrt{nT}} + \frac{\epsilon^{2/3}}{T^{2/3}} + \frac{1}{T}\right)$$