

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

Phase Retrieval via Wirtinger Flow:

Theory and Algorithms
Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi

Abstract—We study the problem of recovering the phase from magnitude measurements; specifically, we wish to reconstruct a complex-valued signal x ∈ ℂ^n about which we have phaseless samples of the form yr = |⟨ar, x⟩|², r = 1, …, m (knowledge of the phase of these samples would yield a linear system). This paper develops a nonconvex formulation of the phase retrieval problem as well as a concrete solution algorithm. In a nutshell, this algorithm starts with a careful initialization obtained by means of a spectral method, and then refines this initial estimate by iteratively applying novel update rules, which have low computational complexity, much like in a gradient descent scheme. The main contribution is that this algorithm is shown to rigorously allow the exact retrieval of phase information from a nearly minimal number of random measurements. Indeed, the sequence of successive iterates provably converges to the solution at a geometric rate so that the proposed scheme is efficient both in terms of computational and data resources. In theory, a variation on this scheme leads to a near-linear time algorithm for a physically realizable model based on coded diffraction patterns. We illustrate the effectiveness of our methods with various experiments on image data. Underlying our analysis are insights for the analysis of nonconvex optimization schemes that may have implications for computational problems beyond phase retrieval.

Index Terms—Non-convex optimization, convergence to global optimum, phase retrieval, Wirtinger derivatives.

I. INTRODUCTION

WE ARE interested in solving quadratic equations of the form

    yr = |⟨ar, z⟩|², r = 1, 2, …, m,    (I.1)

where z ∈ ℂ^n is the decision variable, ar ∈ ℂ^n are known sampling vectors, and yr ∈ ℝ are observed measurements. This problem is a general instance of a nonconvex quadratic program (QP). Nonconvex QPs have been observed to occur frequently in science and engineering and, consequently, their study is of importance. For example, a class of combinatorial optimization problems with Boolean decision variables may be cast as QPs [54, Sec. 4.3.1]. Focusing on the literature on physical sciences, the problem (I.1) is generally referred to as the phase retrieval problem. To understand this connection, recall that most detectors can only record the intensity of the light field and not its phase. Thus, when a small object is illuminated by a quasi-monochromatic wave, detectors measure the magnitude of the diffracted light. In the far field, the diffraction pattern happens to be the Fourier transform of the object of interest—this is called Fraunhofer diffraction—so that in discrete space, (I.1) models the data acquisition mechanism in a coherent diffraction imaging setup; one can identify z with the object of interest, ar with complex sinusoids, and yr with the recorded data. Hence, we can think of (I.1) as a generalized phase retrieval problem. As is well known, the phase retrieval problem arises in many areas of science and engineering such as X-ray crystallography [33], [50], microscopy [47]–[49], astronomy [21], diffraction and array imaging [12], [18], and optics [68]. Other fields of application include acoustics [4], [5], blind channel estimation in wireless communications [3], [60], interferometry [22], quantum mechanics [20], [61] and quantum information [35]. We refer the reader to the tutorial paper [64] for a recent review of the theory and practice of phase retrieval.

Because of the practical significance of the phase retrieval problem in imaging science, the community has developed methods for recovering a signal x ∈ ℂ^n from data of the form yr = |⟨ar, x⟩|² in the special case where one samples the (square) modulus of its Fourier transform. In this setup, the most widely used method is perhaps the error reduction algorithm and its generalizations, which were derived from the pioneering research of Gerchberg and Saxton [29] and Fienup [24], [26]. The Gerchberg-Saxton algorithm starts from a random initial estimate and proceeds by iteratively applying a pair of 'projections': at each iteration, the current guess is projected in data space so that the magnitude of its frequency spectrum matches the observations; the signal is then projected in signal space to conform to some a-priori knowledge about its structure. In a typical instance, our knowledge may be that the signal is real-valued, nonnegative and spatially limited.

These methods, however, suffer from two notable drawbacks. First, while error reduction methods often work well in practice, the algorithms seem to rely heavily on a priori information about the signals, see [25], [27], [34], [62]. Second, since these algorithms can be cast as alternating projections onto nonconvex sets [8] (the constraint in Fourier space is not convex), fundamental mathematical questions concerning their convergence remain, for the most part, unresolved; we refer to Section III-B for further discussion.

Manuscript received July 19, 2014; accepted December 27, 2014. Date of publication February 3, 2015; date of current version March 13, 2015.
E. J. Candès is with the Department of Mathematics and Statistics, Stanford University, Stanford, CA 94305 USA.
X. Li is with the Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104-6340 USA.
M. Soltanolkotabi is with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2560 USA.
Communicated by Y. Ma, Associate Editor for Signal Processing.
Color versions of one or more of the figures in this paper are available online.
Digital Object Identifier 10.1109/TIT.2015.2399924
0018-9448 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
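The quadratic measurement model (I.1) is easy to simulate. The following sketch (not from the paper; the sizes, seed, and variable names are illustrative) generates phaseless measurements under a complex Gaussian sampling model and verifies the abstract's parenthetical remark that knowledge of the phases would reduce (I.1) to a linear system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 6 * 64  # illustrative sizes

# Unknown signal x in C^n and complex Gaussian sampling vectors a_r,
# stacked as the rows of the m x n matrix A.
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)

# Phaseless measurements y_r = |<a_r, x>|^2, r = 1, ..., m, as in (I.1).
y = np.abs(A @ x) ** 2

# If the phases of A @ x were known, (I.1) would become the linear system
# A z = phases * sqrt(y), which an ordinary least-squares solve inverts.
phases = np.exp(1j * np.angle(A @ x))
x_ls, *_ = np.linalg.lstsq(A, phases * np.sqrt(y), rcond=None)
assert np.allclose(x_ls, x)
```

The difficulty of phase retrieval is precisely that the `phases` vector is unavailable; only the magnitudes `y` are recorded.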

On the theoretical side, several combinatorial optimization problems—optimization programs with discrete design variables which take on integer or Boolean values—can be cast as solving quadratic equations or as minimizing a linear objective subject to quadratic inequalities. In their most general form these problems are known to be notoriously difficult (NP-hard) [54, Sec. 4.3]. Nevertheless, many heuristics have been developed for addressing such problems.¹ One popular heuristic is based on a class of convex relaxations known as Shor's relaxations [54, Sec. 4.3.1] which can be solved using tractable semi-definite programming (SDP). For certain random models, some recent SDP relaxations such as PhaseLift [14] are known to provide exact solutions (up to global phase) to the generalized phase retrieval problem using a near minimal number of sampling vectors [16], [17]. While in principle SDP based relaxations offer tractable solutions, they become computationally prohibitive as the dimension of the signal increases. Indeed, for a large number of unknowns in the tens of thousands, say, the memory requirements are far out of reach of desktop computers so that these SDP relaxations are de facto impractical.

II. ALGORITHM: WIRTINGER FLOW

This paper introduces an approach to phase retrieval based on non-convex optimization as well as a solution algorithm, which has two components: (1) a careful initialization obtained by means of a spectral method, and (2) a series of updates refining this initial estimate by iteratively applying a novel update rule, much like in a gradient descent scheme. We refer to the combination of these two steps, introduced in reverse order below, as the Wirtinger flow (WF) algorithm.

A. Minimization of a Non-Convex Objective

Let ℓ(x, y) be a loss function measuring the misfit between its two scalar arguments. If the loss function is non-negative and vanishes only when x = y, then a solution to the generalized phase retrieval problem (I.1) is any solution to

    minimize f(z) := (1/2m) Σ_{r=1}^m ℓ(yr, |ar* z|²), z ∈ ℂ^n.    (II.1)

Although one could study many loss functions, we shall focus in this paper on the simple quadratic loss ℓ(x, y) = (x − y)². Admittedly, the formulation (II.1) does not make the problem any easier since the function f is not convex. Minimizing non-convex objectives, which may have very many stationary points, is known to be NP-hard in general. In fact, even establishing convergence to a local minimum or stationary point can be quite challenging; please see [53] for an example where convergence to a local minimum of a degree-four polynomial is known to be NP-hard.² As a side remark, deciding whether a stationary point of a polynomial of degree four is a local minimizer is already known to be NP-hard.

Our approach to (II.1) is simply stated: start with an initialization z0, and for τ = 0, 1, 2, …, inductively define

    z_{τ+1} = z_τ − (μ_{τ+1}/‖z0‖²) · (1/m) Σ_{r=1}^m (|ar* z_τ|² − yr)(ar ar*) z_τ
            := z_τ − (μ_{τ+1}/‖z0‖²) ∇f(z_τ).    (II.2)

If the decision variable z and the sampling vectors were all real valued, the term between parentheses would be the gradient of f divided by two, as our notation suggests. However, since f(z) is a mapping from ℂ^n to ℝ, it is not holomorphic and hence not complex-differentiable. Nonetheless, this term can still be viewed as a gradient based on Wirtinger derivatives reviewed in Section VI. Hence, (II.2) is a form of steepest descent and the parameter μ_{τ+1} can be interpreted as a step size (note nonetheless that the effective step size is also inversely proportional to the magnitude of the initial guess).

B. Initialization via a Spectral Method

Our main result states that for a certain random model, if the initialization z0 is sufficiently accurate, then the sequence {z_τ} will converge toward a solution to the generalized phase retrieval problem (I.1). In this paper, we propose computing the initial guess z0 via a spectral method, detailed in Algorithm 1.

Algorithm 1 Wirtinger Flow: Initialization
Input: Observations {yr} ∈ ℝ^m.
Set
    λ² = n · (Σ_r yr) / (Σ_r ‖ar‖²).
Set z0, normalized to ‖z0‖ = λ, to be the eigenvector corresponding to the largest eigenvalue of
    Y = (1/m) Σ_{r=1}^m yr ar ar*.
Output: Initial guess z0.

In words, z0 is the leading eigenvector of the positive semidefinite Hermitian matrix (1/m) Σ_r yr ar ar* constructed from the knowledge of the sampling vectors and observations. (As usual, ar* is the adjoint of ar.) Letting A be the m × n matrix whose r-th row is ar*, so that with obvious notation y = |Ax|², z0 is the leading eigenvector of A* diag{y} A and can be computed via the power method by repeatedly applying A, entrywise multiplication by y, and A*. In the theoretical framework we study below, a constant number of power iterations would give machine accuracy because of an eigenvalue gap between the top two eigenvalues; please see Appendix B for additional details.

¹ For a partial review of some of these heuristics as well as some recent theoretical advances in related problems we refer to our companion paper [16, Sec. 1.6] and references therein [7], [13], [30], [31], [36], [55], [56], [65].
² Observe that if all the sampling vectors are real valued, our objective is also a degree-four polynomial.

C. Wirtinger Flow as a Stochastic Gradient Scheme

We would like to motivate the Wirtinger flow algorithm and provide some insight as to why we expect it to work in a model where the sampling vectors are random. First, we emphasize that our statements in this section are heuristic in nature; as will become clear in the proofs of Section VII, a correct mathematical formalization of these ideas is far more complicated than our heuristic development here may suggest. Second, although our ideas are broadly applicable, it makes sense to begin understanding the algorithm in a setting where everything is real valued, and in which the vectors ar are i.i.d. N(0, I). Also, without any loss in generality and to simplify exposition, in this section we shall assume that the signal has unit Euclidean norm, i.e., ‖x‖ = 1.
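As a concrete illustration, the spectral initialization of Algorithm 1 and the Wirtinger flow update (II.2) with the step-size heuristic (II.5) can be sketched in NumPy as follows. This is a minimal sketch under a complex Gaussian sampling model; the dimensions, iteration counts, seed, and variable names are illustrative choices rather than the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 10 * 64  # illustrative; Theorem 3.3 asks for m >= c0 * n * log(n)

# Planted signal and complex Gaussian sampling vectors (rows of A).
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
y = np.abs(A @ x) ** 2  # phaseless data, as in (I.1)

def dist(z, x):
    """Distance up to a global phase: min over phi of ||z - exp(i*phi) x||."""
    phi = np.angle(np.vdot(x, z))
    return np.linalg.norm(z - np.exp(1j * phi) * x)

# Spectral initialization (Algorithm 1): leading eigenvector of
# Y = (1/m) sum_r y_r a_r a_r^*, computed by the power method, i.e. by
# repeatedly applying A, entrywise multiplication by y, and A^*.
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
for _ in range(50):
    z = A.conj().T @ (y * (A @ z)) / m
    z /= np.linalg.norm(z)
lam = np.sqrt(n * y.sum() / np.linalg.norm(A, "fro") ** 2)
z *= lam  # normalize so that ||z_0|| equals lambda

# Wirtinger flow updates (II.2) with the step-size schedule (II.5).
z0_sq = np.linalg.norm(z) ** 2
tau0, mu_max = 330, 0.2  # in the spirit of the paper's reported choices
for tau in range(1, 2001):
    mu = min(1 - np.exp(-tau / tau0), mu_max)
    grad = A.conj().T @ ((np.abs(A @ z) ** 2 - y) * (A @ z)) / m
    z -= (mu / z0_sq) * grad

rel_err = dist(z, x) / np.linalg.norm(x)  # small when recovery succeeds
```

Note that the whole computation reduces to matrix-vector products with A and A*, which is what makes each WF step cheap compared with an SDP relaxation or a per-iteration least-squares solve.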
Let x be a solution to (I.1) so that yr = |⟨ar, x⟩|², and consider the initialization step first. In the Gaussian model, a simple moment calculation gives

    𝔼[yr ar ar*] = I + 2xx*.

By the strong law of large numbers, the matrix Y in Algorithm 1 is equal to the right-hand side in the limit of large samples. Since any leading eigenvector of I + 2xx* is of the form λx for some scalar λ ∈ ℝ, we see that the initialization step would recover x perfectly, up to a global sign or phase factor, had we infinitely many samples. Indeed, the chosen normalization would guarantee that the recovered signal is of the form ±x. As an aside, we would like to note that the top two eigenvalues of I + 2xx* are well separated unless ‖x‖ is very small, and that their ratio is equal to 1 + 2‖x‖². Now with a finite amount of data, the leading eigenvector of Y will of course not be perfectly correlated with x but we hope that it is sufficiently correlated to point us in the right direction.

We now turn our attention to the gradient update (II.2) and define

    F(z) = 2z*(I − xx*)z + (3/2)(‖z‖² − 1)²,

where here and below, x is once again our planted solution. The first term ensures that the direction of z matches the direction of x and the second term penalizes the deviation of the Euclidean norm of z from that of x. Obviously, the minimizers of this function are ±x. Now consider the gradient scheme

    z_{τ+1} = z_τ − (μ_{τ+1}/‖z0‖²) ∇F(z_τ).    (II.3)

In Section VII-I, we show that if min_± ‖z0 ± x‖ ≤ (1/8)‖x‖, then {z_τ} converges to x up to a global sign. However, this is all ideal as we would need knowledge of x itself to compute the gradient of F; we simply cannot run this algorithm.

Consider now the WF update and assume for a moment that z_τ is fixed and independent of the sampling vectors. We are well aware that this is a false assumption but nevertheless wish to explore some of its consequences. In the Gaussian model, if z is independent of the sampling vectors, then a modification of Lemma 7.2 for real-valued z shows that 𝔼[∇f(z)] = ∇F(z) and, therefore,

    𝔼[z_{τ+1}] = 𝔼[z_τ] − (μ_{τ+1}/‖z0‖²) 𝔼[∇f(z_τ)]
    ⇒ 𝔼[z_{τ+1}] = z_τ − (μ_{τ+1}/‖z0‖²) ∇F(z_τ).

Hence, the average WF update is the same as that in (II.3) so that we can interpret the Wirtinger flow algorithm as a stochastic gradient scheme in which we only get to observe an unbiased estimate ∇f(z) of the "true" gradient ∇F(z). Regarding WF as a stochastic gradient scheme helps us in choosing the learning parameter or step size μτ. Lemma 7.7 asserts that

    ‖∇f(z) − ∇F(z)‖ ≤ ‖x‖² · min_± ‖z ± x‖    (II.4)

holds with high probability. Looking at the right-hand side, this says that the uncertainty about the gradient estimate depends on how far we are from the actual solution x. The further away, the larger the uncertainty or the noisier the estimate. This suggests that in the early iterations we should use a small learning parameter as the noise is large since we are not yet close to the solution. However, as the iteration count increases and we make progress, the size of the noise also decreases and we can pick larger values for the learning parameter. This heuristic together with experimentation led us to consider

    μτ = min(1 − e^{−τ/τ0}, μmax)    (II.5)

shown in Figure 1. Values of τ0 around 330 and of μmax around 0.4 worked well in our simulations. This makes sure that μτ is rather small at the beginning (e.g. μ1 ≈ 0.003) but quickly increases and reaches a maximum value of about 0.4 after 200 iterations or so.

[Fig. 1. Learning parameter μτ from (II.5) as a function of the iteration count τ; here, τ0 ≈ 330 and μmax = 0.4.]

III. MAIN RESULTS

A. Exact Phase Retrieval via Wirtinger Flow

Our main result establishes the correctness of the Wirtinger flow algorithm in the Gaussian model defined below. Later, in Section V, we shall also develop exact recovery results for a physically inspired diffraction model.

Definition 3.1: We say that the sampling vectors follow the Gaussian model if ar ∈ ℂ^n ∼ N(0, I/2) + iN(0, I/2). In the real-valued case, they are i.i.d. N(0, I).

We also need to define the distance to the solution set.

Definition 3.2: Let x ∈ ℂ^n be any solution to the quadratic system (I.1) (the signal we wish to recover). For each z ∈ ℂ^n,

define

    dist(z, x) = min_{φ∈[0,2π]} ‖z − e^{iφ} x‖.

Theorem 3.3: Let x be an arbitrary vector in ℂ^n and y = |Ax|² ∈ ℝ^m be m quadratic samples with m ≥ c0 · n log n, where c0 is a sufficiently large numerical constant. Then the Wirtinger flow initial estimate z0, normalized to have squared Euclidean norm equal to m⁻¹ Σ_r yr,³ obeys

    dist(z0, x) ≤ (1/8)‖x‖    (III.1)

with probability at least 1 − 10e^{−γn} − 8/n² (γ is a fixed positive numerical constant). Further, take a constant learning parameter sequence, μτ = μ for all τ = 1, 2, …, and assume μ ≤ c1/n for some fixed numerical constant c1. Then there is an event of probability at least 1 − 13e^{−γn} − me^{−1.5n} − 8/n², such that on this event, starting from any initial solution z0 obeying (III.1), we have

    dist(z_τ, x) ≤ (1/8)(1 − μ/4)^{τ/2} · ‖x‖.

Clearly, one would need 2n quadratic measurements to have any hope of recovering x ∈ ℂ^n. It is also known that in our sampling model, the mapping z → |Az|² is injective for m ≥ 4n [5] and that this property holds for generic sampling vectors [19].⁴ Hence, the Wirtinger flow algorithm loses at most a logarithmic factor in the sampling complexity. In comparison, the SDP relaxation only needs a sampling complexity proportional to n (no logarithmic factor) [15], and it is an open question whether Theorem 3.3 holds in this regime.

Setting μ = c1/n yields ε accuracy in a relative sense, namely, dist(z, x) ≤ ε‖x‖, in O(n log 1/ε) iterations. The computational work at each iteration is dominated by two matrix-vector products of the form Az and A*v, see Appendix B. It follows that the overall computational complexity of the WF algorithm is O(mn² log 1/ε). Later in the paper, we will exhibit a modification to the WF algorithm of mere theoretical interest, which also yields exact recovery under the same sampling complexity and an O(mn log 1/ε) computational complexity; that is to say, the computational workload is then just linear in the problem size.

³ The same result holds with the initialization from Algorithm 1 because Σ_r ‖ar‖² ≈ m · n with a standard deviation of about the square root of this quantity.
⁴ It is not within the scope of this paper to explain the meaning of generic vectors and, instead, refer the interested reader to [19].

B. Comparison With Other Non-Convex Schemes

We now pause to comment on a few other non-convex schemes in the literature. Other comparisons may be found in our companion paper [16].

Earlier, we discussed the Gerchberg-Saxton and Fienup algorithms. These formulations assume that A is a Fourier transform and can be described as follows: suppose z_τ is the current guess, then one computes the image of z_τ through A and adjusts its modulus so that it matches that of the observed data vector: with obvious notation,

    v̂_{τ+1} = b ⊙ (Az_τ / |Az_τ|),    (III.2)

where ⊙ is elementwise multiplication, and b = |Ax| so that br² = yr for all r = 1, …, m. Then

    v_{τ+1} = arg min_{v ∈ ℂ^n} ‖v̂_{τ+1} − Av‖.    (III.3)

(In the case of Fourier data, the step (III.2)–(III.3) essentially adjusts the modulus of the Fourier transform of the current guess so that it fits the measured data.) Finally, if we know that the solution belongs to a convex set C (as in the case where the signal is known to be real-valued, possibly non-negative and of finite support), then the next iterate is

    z_{τ+1} = P_C(v_{τ+1}),    (III.4)

where P_C is the projection onto the convex set C. If no such information is available, then z_{τ+1} = v_{τ+1}. The first step (III.3) is not a projection onto a convex set and, therefore, it is in general completely unclear whether the Gerchberg-Saxton algorithm actually converges. (And if it were to converge, at what speed?) It is also unclear how the procedure should be initialized to yield accurate final estimates. This is in contrast to the Wirtinger flow algorithm, which in the Gaussian model is shown to exhibit geometric convergence to the solution to the phase retrieval problem. Another benefit is that the Wirtinger flow algorithm does not require solving a least-squares problem (III.3) at each iteration; each step enjoys a reduced computational complexity.

A recent contribution related to ours is the interesting paper [56], which proposes an alternating minimization scheme named AltMinPhase for the general phase retrieval problem. AltMinPhase is inspired by the Gerchberg-Saxton update (III.2)–(III.3) as well as other established alternating projection heuristics [26], [42], [43], [45], [46], [69]. We describe the algorithm in the setup of Theorem 3.3 for which [56] gives theoretical guarantees. To begin with, AltMinPhase partitions the sampling vectors ar (the rows of the matrix A) and corresponding observations yr into B + 1 disjoint blocks (y(0), A(0)), (y(1), A(1)), …, (y(B), A(B)) of roughly equal size. Hence, distinct blocks are stochastically independent from each other. The first block (y(0), A(0)) is used to compute an initial estimate z0. After initialization, AltMinPhase goes through a series of iterations of the form (III.2)–(III.3), however, with the key difference that each iteration uses a fresh set of sampling vectors and observations; in details,

    z_{τ+1} = arg min_{z ∈ ℂ^n} ‖v̂_{τ+1} − A(τ+1) z‖,
    v̂_{τ+1} = b ⊙ (A(τ+1) z_τ / |A(τ+1) z_τ|).    (III.5)

As for the Gerchberg-Saxton algorithm, each iteration requires solving a least-squares problem. Now assume a real-valued Gaussian model as well as a real valued solution x ∈ ℝ^n. The main result in [56] states that if the first block (y(0), A(0)) contains at least c · n log³ n samples and each consecutive block contains at least c · n log n samples—c here denotes a

positive numerical constant whose value may change at each occurrence—then it is possible to initialize the algorithm via data from the first block in such a way that each consecutive iterate (III.5) decreases the error ‖z_τ − x‖ by 50%; naturally, all of this holds in a probabilistic sense. Hence, one can get ε accuracy in the sense introduced earlier from a total of c · n log n · (log² n + log 1/ε) samples. Whereas the Wirtinger flow algorithm achieves arbitrary accuracy from just c · n log n samples, these theoretical results would require an infinite number of samples. This is, however, not the main point.

The main point is that in practice, it is not realistic to imagine (1) that we will divide the samples in distinct blocks (how many blocks should we form a priori? of which sizes?) and (2) that we will use measured data only once. With respect to the latter, observe that the Gerchberg-Saxton procedure (III.2)–(III.3) uses all the samples at each iteration. This is the reason why AltMinPhase is of little practical value, and of theoretical interest only. As a matter of fact, its design and study seem merely to stem from analytical considerations: since one uses an independent set of measurements at each iteration, A(τ+1) and z_τ are stochastically independent, a fact which considerably simplifies the convergence analysis. In stark contrast, the WF iterate uses all the samples at each iteration and thus introduces some dependencies, which makes for some delicate analysis. Overcoming these difficulties is crucial because the community is preoccupied with convergence properties of algorithms one actually runs, like Gerchberg-Saxton (III.2)–(III.3), or would actually want to run. Interestingly, it may be possible to use some of the ideas developed in this paper to develop a rigorous theory of convergence for algorithms in the style of Gerchberg-Saxton and Fienup; please see [65].

In a recent paper [44], which appeared on the arXiv preprint server as the final version of this paper was under preparation, the authors explore necessary and sufficient conditions for the global convergence of an alternating minimization scheme with generic sampling vectors. The issue is that we do not know when these conditions hold. Further, even when the algorithm converges, it does not come with an explicit convergence rate so that it is not known whether the algorithm converges in polynomial time. As before, some of our methods as well as those from our companion paper [16] may have some bearing upon the analysis of this algorithm. Similarly, another class of nonconvex algorithms that have recently been proposed in the literature are iterative algorithms based on Generalized Approximate Message Passing (GAMP), see [59], [63] as well as [9], [10], [23] for some background literature on AMP. In [63], the authors demonstrate a favorable runtime for an algorithm of this nature. However, this does not come with any theoretical guarantees.

Moving away from the phase retrieval problem, we would like to mention some very interesting work on the matrix completion problem using non-convex schemes by Montanari and coauthors [38]–[40], see also [2], [6], [32], [37], [41], [51], [52]. Although the problems and models are quite different, there are some general similarities between the algorithm named OptSpace in [39] and ours. Indeed, OptSpace operates by computing an initial guess of the solution to a low-rank matrix completion problem by means of a spectral method. It then sets up a nonconvex problem, and proposes an iterative algorithm for solving it. Under suitable assumptions, [39] demonstrates the correctness of this method in the sense that OptSpace will eventually converge to a low-rank solution, although it is not shown to converge in polynomial time.

IV. NUMERICAL EXPERIMENTS

We present some numerical experiments to assess the empirical performance of the Wirtinger flow algorithm. Here, we mostly consider a model of coded diffraction patterns reviewed below.

A. The Coded Diffraction Model

We consider an acquisition model, where we collect data of the form

    yr = | Σ_{t=0}^{n−1} x[t] d̄ℓ(t) e^{−i2πkt/n} |²,  r = (ℓ, k),  0 ≤ k ≤ n − 1,  1 ≤ ℓ ≤ L;    (IV.1)

thus for a fixed ℓ, we collect the magnitude of the diffraction pattern of the signal {x(t)} modulated by the waveform/code {dℓ(t)}. By varying ℓ and changing the modulation pattern dℓ, we generate several views thereby creating a series of coded diffraction patterns (CDPs).

In this paper, we are mostly interested in the situation where the modulation patterns are random; in particular, we study a model in which the dℓ's are i.i.d. distributed, each having i.i.d. entries sampled from a distribution d. Our theoretical results from Section V assume that d is symmetric, obeys |d| ≤ M as well as the moment conditions

    𝔼[d] = 0,  𝔼[d²] = 0,  𝔼[|d|⁴] = 2(𝔼[|d|²])².    (IV.2)

A random variable obeying these assumptions is said to be admissible. Since d is complex valued we can have 𝔼[d²] = 0 while d ≠ 0. An example of an admissible random variable is d = b1 b2, where b1 and b2 are independent and distributed as

    b1 = +1 with prob. 1/4; −1 with prob. 1/4; −i with prob. 1/4; +i with prob. 1/4,    (IV.3)

and

    b2 = √2/2 with prob. 4/5; √3 with prob. 1/5.    (IV.4)

We shall refer to this distribution as an octanary pattern since d can take on eight distinct values. The condition 𝔼[d²] = 0 is here to avoid unnecessarily complicated calculations in our

[Fig. 2. Empirical probability of success based on 100 random trials for different signal/measurement models and a varied number of measurements. The coded diffraction model uses octanary patterns; the number of patterns L = m/n only takes on integral values.]

theoretical analysis. In particular, we can also work with a ternary pattern in which d is distributed as

    d = +1 with prob. 1/4; 0 with prob. 1/2; −1 with prob. 1/4.    (IV.5)

We emphasize that the random coded diffraction patterns mentioned above are physically realizable in optical applications, especially those that arise in microscopy. However, we should note that phase retrieval has many different applications and in some cases other CDP models may be more suitable for that particular application. We refer to our companion paper [16, Sec. 2.2] for a discussion of other practically relevant models.

B. The Gaussian and Coded Diffraction Models

We begin by examining the performance of the Wirtinger flow algorithm for recovering random signals x ∈ ℂ^n under the Gaussian and coded diffraction models. We are interested in signals of two different types:

• Random Low-Pass Signals: Here, x is given by

    x[t] = Σ_{k=−(M/2−1)}^{M/2} (Xk + iYk) e^{2πi(k−1)(t−1)/n},

with M = n/8 and Xk and Yk i.i.d. N(0, 1).

• Random Gaussian Signals: In this model, x ∈ ℂ^n is a random complex Gaussian vector with i.i.d. entries of the form x[t] = X + iY with X and Y distributed as N(0, 1); this can be expressed as

    x[t] = Σ_{k=−(n/2−1)}^{n/2} (Xk + iYk) e^{2πi(k−1)(t−1)/n},

where Xk and Yk are i.i.d. N(0, 1/8) so that the low-pass model is a 'bandlimited' version of this high-pass random model (variances are adjusted so that the expected signal power is the same).

Below, we set n = 128, and generate one signal of each type which will be used in all the experiments.

The initialization step of the Wirtinger flow algorithm is run by applying 50 iterations of the power method outlined in Algorithm 3 from Appendix B. In the iteration (II.2), we use the parameter value μτ = min(1 − exp(−τ/τ0), 0.2) where τ0 ≈ 330. We stop after 2,500 iterations, and report the empirical probability of success for the two different signal models. The empirical probability of success is an average over 100 trials, where in each instance, we generate new random sampling vectors according to the Gaussian or CDP models. We declare a trial successful if the relative error of the reconstruction dist(x̂, x)/‖x‖ falls below 10⁻⁵.

Figure 2 shows that around 4.5n Gaussian phaseless measurements suffice for exact recovery with high probability via the Wirtinger flow algorithm. We also see that about six octanary patterns are sufficient.

C. Performance on Natural Images

We move on to testing the Wirtinger flow algorithm on various images of different sizes; these are photographs of the Naqsh-e Jahan Square in the central Iranian city of Esfahan, the Stanford main quad, and the Milky Way galaxy. Since each photograph is in color, we run the WF algorithm on each of the three RGB images that make up the photograph. Color images are viewed as n1 × n2 × 3 arrays, where the first two indices encode the pixel location, and the last the color band.

We generate L = 20 random octanary patterns and gather the coded diffraction patterns for each color band using these 20 samples. As before, we run 50 iterations of the power method as the initialization step. The updates use the sequence μτ = min(1 − exp(−τ/τ0), 0.4) where τ0 ≈ 330 as before. In all cases we run 300 iterations and record the relative error as well as the running time. If x and x̂ are the original and recovered images, the relative error is equal to ‖x̂ − x‖/‖x‖, where ‖·‖ is the Euclidean norm ‖x‖² = Σ_{i,j,k} |x(i, j, k)|². The computational time we report

[Fig. 3. Performance of the WF algorithm on three scenic images. Image size, computational time in seconds and in units of FFTs are reported, as well as the relative error after 300 WF iterations. (a) Naqsh-e Jahan Square, Esfahan. Image size is 189 × 768 pixels; timing is 61.4 sec or about 21,200 FFT units. The relative error is 6.2 × 10⁻¹⁶. (b) Stanford main quad. Image size is 320 × 1280 pixels; timing is 181.8 sec or about 20,700 FFT units. The relative error is 3.5 × 10⁻¹⁴. (c) Milky Way galaxy. Image size is 1080 × 1920 pixels; timing is 1318.1 sec or 41,900 FFT units. The relative error is 9.3 × 10⁻¹⁶.]

is the computational time averaged over the three RGB images. All experiments were carried out on a MacBook Pro with a 2.4 GHz Intel Core i7 processor and 8 GB 1600 MHz DDR3 memory.

Figure 3 shows the images recovered via the Wirtinger flow algorithm. In all cases, WF gets 12 or 13 digits of precision in a matter of minutes. To convey an idea of timing that is platform-independent, we also report time in units of FFTs; one FFT unit is the amount of time it takes to perform a single FFT on an image of the same size. Now all the workload is dominated by matrix vector products of the form Az and A*v. In details, each iteration of the power method in

Fig. 4. An illustrative setup of diffraction patterns.

Fig. 5. Schematic representation and electron density map of the Caffeine molecule. (a) Schematic representation. (b) Electron density map.
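In the CDP model, applying A and A* reduces to FFTs: each pattern contributes one FFT (forward) and one inverse FFT (adjoint), which is the source of the 40-FFTs-per-update count discussed in the text. A minimal 1D sketch (the pattern values and sizes below are illustrative only, not the admissible octanary distribution of the paper):

```python
import numpy as np

def make_cdp_ops(D):
    """Given patterns D of shape (L, n), return the sensing map A and its
    adjoint A*:  (A z)[l] = fft(conj(D[l]) * z)  -- one FFT per pattern,
    (A* v) = sum_l D[l] * (n * ifft(v[l]))       -- one IFFT per pattern."""
    L, n = D.shape
    def A(z):
        return np.fft.fft(np.conj(D) * z, axis=1)              # L FFTs
    def A_star(v):
        return (D * (n * np.fft.ifft(v, axis=1))).sum(axis=0)  # L IFFTs
    return A, A_star

rng = np.random.default_rng(1)
L, n = 20, 64
D = rng.choice(np.array([1, -1, 1j, -1j]), size=(L, n))  # toy pattern values
A, A_star = make_cdp_ops(D)
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v = rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n))
lhs = np.vdot(A(z), v)       # <A z, v>
rhs = np.vdot(z, A_star(v))  # <z, A* v>: equal by the adjoint property
```

With L = 20 patterns, one application of A plus one of A* costs 40 FFTs, matching the per-iteration count in the surrounding discussion.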

the initialization step, or each update (II.2), requires 40 FFTs; the factor of 40 comes from the fact that we have 20 patterns and that each iteration involves one FFT and one adjoint or inverse FFT. Hence, the total number of FFTs is equal to

20 patterns × 2 (one FFT and one IFFT) × (300 gradient steps + 50 power iterations) = 14,000.

Another way to state this is that the total workload of our algorithm is roughly equal to 350 applications of the sensing matrix A and its adjoint A*. For about 13 digits of accuracy (relative error of about 10⁻¹³), Figure 3 shows that we need between 21,000 and 42,000 FFT units. This is within a factor between 1.5 and 3 of the optimal number computed above. This increase has to do with the fact that in our implementation, certain variables are copied into other temporary variables, and these types of operations cause some overhead. This overhead is non-linear and becomes more prominent as the size of the signal increases.

For comparison, SDP-based solutions such as PhaseLift [14], [17] and PhaseCut [67] would be prohibitive on a laptop computer as the lifted signal would not fit into memory. In the SDP approach, an n-pixel image becomes an n²/2 array, so that storing the lifted signal even for the smallest image requires (189 × 768)² × 1/2 × 8 bytes, which is approximately 85 GB of space. (For the image of the Milky Way, storage would be about 17 TB.) These large memory requirements prevent the application of full-blown SDP solvers on desktop computers.

D. 3D Molecules

Understanding molecular structure is a great contemporary scientific challenge, and several techniques are currently employed to produce 3D images of molecules; these include electron microscopy and X-ray imaging. In X-ray imaging, for instance, the experimentalist illuminates an object of interest, e.g. a molecule, and then collects the intensity of the diffracted rays; please see Figure 4 for an illustrative setup. Figures 5 and 6 show the schematic representation and the corresponding electron density maps for the Caffeine and Nicotine molecules: the density map ρ(x₁, x₂, x₃) is the 3D object we seek to infer. In this paper, we do not go as far as 3D reconstruction but demonstrate the performance of the Wirtinger flow algorithm for recovering projections of 3D molecule density maps from simulated data. For related simulations using convex schemes we refer the reader to [28].

To derive signal equations, consider an experimental apparatus as in Figure 4. If we imagine that light propagates in the direction of the x₃-axis, an approximate model for the collected data reads

I(f₁, f₂) = | ∫∫ ( ∫ ρ(x₁, x₂, x₃) dx₃ ) e^{−2iπ(f₁x₁ + f₂x₂)} dx₁ dx₂ |².

In other words, we collect the intensity of the diffraction pattern of the projection ∫ ρ(x₁, x₂, x₃) dx₃. The 2D image

Fig. 6. Schematic representation and electron density map of the Nicotine molecule. (a) Schematic representation. (b) Electron density map.
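The projection identity used below, I(f₁, f₂) = |R(f₁, f₂, 0)|², is a Fourier-slice fact: the 2D transform of the projection ∫ρ dx₃ equals the slice of the 3D transform at f₃ = 0. A quick discrete sanity check (the volume below is a toy stand-in, not a molecular density map):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = rng.random((8, 8, 8))          # toy density map rho(x1, x2, x3)
proj = rho.sum(axis=2)               # projection: line integral along x3
R = np.fft.fftn(rho)                 # 3D discrete Fourier transform
slice0 = R[:, :, 0]                  # slice at f3 = 0
lhs = np.abs(np.fft.fft2(proj))**2   # |FT of projection|^2 = I(f1, f2)
rhs = np.abs(slice0)**2              # |R(f1, f2, 0)|^2
```

The two arrays agree exactly in the discrete setting, since summing over x₃ before transforming is the same as evaluating the 3D transform at zero frequency in x₃.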

we wish to recover is thus the line integral of the density map along a given direction. As an example, the Caffeine molecule along with its projection on the x₁x₂-plane (line integral in the x₃ direction) is shown in Figure 7. Now, if we let R be the Fourier transform of the density ρ, one can re-express the identity above as

I(f₁, f₂) = |R(f₁, f₂, 0)|².

Therefore, by imputing the missing phase using phase retrieval algorithms, one can recover a slice of the 3D Fourier transform of the electron density map, i.e. R(f₁, f₂, 0). Viewing the object from different angles or directions gives us different slices. In a second step, which we do not perform in this paper, one can presumably recover the 3D Fourier transform of the electron density map from all these slices (this is the tomography or blind tomography problem depending upon whether or not the projection angles are known) and, in turn, the 3D electron density map.

We now generate 51 observation planes by rotating the x₁x₂-plane around the x₁-axis by equally spaced angles in the interval [0, 2π]. Each of these planes is associated with a 2D projection of size 1024 × 1024, giving us 20 coded diffraction octanary patterns (we use the same patterns for all 51 projections). We run the Wirtinger flow algorithm with exactly the same parameters as in the previous section, and stop after 150 gradient iterations. Figure 8 reports the average relative error over the 51 projections and the total computational time required for reconstructing all 51 images.

Fig. 7. Electron density ρ(x₁, x₂, x₃) of the Caffeine molecule along with its projection onto the x₁x₂-plane.

We complement our study with theoretical results applying to the model of coded diffraction patterns. These results concern a variation of the Wirtinger flow algorithm: whereas the iterations are exactly the same as (II.2), the initialization applies an iterative scheme which uses a fresh set of samples at each iteration. This is described in Algorithm 2. In the CDP model, the partitioning assigns to the same group all the observations and sampling vectors corresponding to a given realization of the random code. This is equivalent to partitioning the random patterns into B + 1 groups. As a result, sampling vectors in distinct groups are stochastically independent.

Theorem 5.1: Let x be an arbitrary vector in ℂⁿ and assume we collect L admissible coded diffraction patterns with L ≥ c₀ · (log n)⁴, where c₀ is a sufficiently large numerical constant. Then the initial solution z₀ of Algorithm 2⁵ obeys

dist(z₀, x) ≤ (1/(8√n)) ‖x‖ (V.1)

with probability at least 1 − (4L + 2)/n³. Further, take a constant learning parameter sequence, μ_τ = μ for all τ = 1, 2, ..., and assume μ ≤ c₁ for some fixed numerical constant c₁. Then there is an event of probability at least 1 − (2L + 1)/n³ − 1/n², such that on this event, starting from any initial solution z₀ obeying (V.1), we have

dist(z_τ, x) ≤ (1/(8√n)) (1 − μ/3)^{τ/2} ‖x‖. (V.2)

⁵We choose the number of partitions B in Algorithm 2 to obey B ≥ c log n for c a sufficiently large numerical constant.
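The distance in Theorem 5.1 has a closed form: minimizing ‖z − e^{iφ}x‖ over φ gives e^{iφ(z)} = x*z/|x*z| (when x*z ≠ 0), so dist²(z, x) = ‖z‖² + ‖x‖² − 2|⟨x, z⟩|. A small sketch:

```python
import numpy as np

def dist(z, x):
    """dist(z, x) = min over phi of ||z - exp(i*phi) x||, in closed form."""
    val = np.linalg.norm(z)**2 + np.linalg.norm(x)**2 - 2 * abs(np.vdot(x, z))
    return np.sqrt(max(val, 0.0))  # clip tiny negative round-off

rng = np.random.default_rng(3)
n = 16
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
z = np.exp(1j * 0.7) * x                    # same point up to global phase
d0 = dist(z, x)                             # essentially zero
d1 = dist(x + 0.1 * rng.standard_normal(n), x)  # genuinely positive
```

Note that `np.vdot` conjugates its first argument, so `np.vdot(x, z)` is exactly the inner product x*z appearing in the phase formula.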

Fig. 8. Reconstruction sequence of the projection of the Caffeine and Nicotine molecules along different directions. To see the videos, please download and open the PDF file using Acrobat Reader. (a) Caffeine molecule: mean rel. error is 9.6 × 10⁻⁶; total time is 5.4 hours. (b) Nicotine molecule: mean rel. error is 1.7 × 10⁻⁵; total time is 5.4 hours.

Algorithm 2 Initialization via Resampled Wirtinger Flow
Input: Observations {y_r} ∈ ℝᵐ and number of blocks B.
Partition the observations and sampling vectors {y_r}_{r=1}^m and {a_r}_{r=1}^m into B + 1 groups of size m′ = m/(B + 1). For each group b = 0, 1, ..., B, set

f(z; b) = (1/2m′) Σ_r ( y_r^{(b)} − |⟨a_r^{(b)}, z⟩|² )²,

where {a_r^{(b)}} are those sampling vectors belonging to the bth group (and likewise for {y_r^{(b)}}).
Initialize ũ₀ to be the eigenvector corresponding to the largest eigenvalue of

Y = (1/m′) Σ_r y_r^{(0)} a_r^{(0)} (a_r^{(0)})*

normalized as in Algorithm 1.
Loop:
for b = 0 to B − 1 do
u_{b+1} = u_b − (μ/‖u₀‖²) ∇f(u_b; b)
end for
Output: z₀ = u_B.

In the Gaussian model, both statements also hold with high probability provided that m ≥ c₀ · n(log n)², where c₀ is a sufficiently large numerical constant.

Hence, we achieve perfect recovery from on the order of n(log n)⁴ samples arising from a coded diffraction experiment. Our companion paper [16] established that PhaseLift—the SDP relaxation—is also exact with a sampling complexity on the order of n(log n)⁴ (this has recently been improved to n(log n)² [31]). We believe that the sampling complexity of both approaches (WF and SDP) can be further reduced to n log n (or even n for certain kinds of patterns). We leave this to future research.

Setting μ = c₁ yields ε accuracy in O(log(1/ε)) iterations. As the computational work at each iteration is dominated by two matrix-vector products of the form Az and A*v, it follows that the overall computational complexity is at most O(nL log n log(1/ε)). In particular, this approach yields a near-linear time algorithm in the CDP model (linear in the dimension of the signal n). In the Gaussian model, the complexity scales like O(mn log(1/ε)).

VI. WIRTINGER DERIVATIVES

Our gradient step (II.2) uses a notion of derivative which can be interpreted as a Wirtinger derivative. The purpose of this section is thus to gather some results concerning Wirtinger derivatives of real-valued functions over complex variables. Here and below, Mᵀ is the transpose of the matrix M, and c̄ denotes the complex conjugate of a scalar c ∈ ℂ. Similarly, the matrix M̄ is obtained by taking complex conjugates of the elements of M.

Any complex- or real-valued function

f(z) = f(x, y) = u(x, y) + iv(x, y)

of several complex variables can be written in the form f(z, z̄), where f is holomorphic in z = x + iy for fixed z̄ and holomorphic in z̄ = x − iy for fixed z. This holds as long as the real-valued functions u and v are differentiable as functions of the real variables x and y. As an example, consider

f(z) = (y − |a*z|²)² = (y − z̄ᵀaa*z)² = f(z, z̄),

with z, a ∈ ℂⁿ and y ∈ ℝ. While f(z) is not holomorphic in z, f(z, z̄) is holomorphic in z for a fixed z̄, and vice versa. This fact underlies the development of the Wirtinger calculus. In essence, the conjugate coordinates

[z; z̄] ∈ ℂⁿ × ℂⁿ, z = x + iy and z̄ = x − iy,

can serve as a formal substitute for the representation (x, y) ∈ ℝ²ⁿ. This leads to the following derivatives

∂f/∂z := ∂f(z, z̄)/∂z |_{z̄ = constant} = [∂f/∂z₁, ∂f/∂z₂, ..., ∂f/∂z_n] |_{z̄ = constant},

∂f/∂z̄ := ∂f(z, z̄)/∂z̄ |_{z = constant} = [∂f/∂z̄₁, ∂f/∂z̄₂, ..., ∂f/∂z̄_n] |_{z = constant}.

Our definitions follow standard notation from multivariate calculus so that derivatives are row vectors and gradients are column vectors. In this new coordinate system the complex gradient is given by

∇_c f = [∂f/∂z, ∂f/∂z̄]*.

Similarly, we define

H_zz := (∂/∂z)(∂f/∂z)*, H_z̄z := (∂/∂z̄)(∂f/∂z)*,
H_zz̄ := (∂/∂z)(∂f/∂z̄)*, H_z̄z̄ := (∂/∂z̄)(∂f/∂z̄)*.

In this coordinate system the complex Hessian is given by

∇²f := [ H_zz H_z̄z ; H_zz̄ H_z̄z̄ ].

Given vectors z and Δz ∈ ℂⁿ, we have defined the gradient and Hessian in a manner such that Taylor's approximation takes the form

f(z + Δz) ≈ f(z) + (∇_c f(z))* [Δz; Δz̄] + (1/2) [Δz; Δz̄]* ∇²f(z) [Δz; Δz̄].

If we were to run gradient descent in this new coordinate system, the iterates would be

[z_{τ+1}; z̄_{τ+1}] = [z_τ; z̄_τ] − μ [ (∂f/∂z)*|_{z=z_τ} ; (∂f/∂z̄)*|_{z=z_τ} ]. (VI.1)

Note that when f is a real-valued function (as in this paper) we have ∂f/∂z = conj(∂f/∂z̄). Therefore, the second set of updates in (VI.1) is just the conjugate of the first. Thus, it is sufficient to keep track of the first update, namely,

z_{τ+1} = z_τ − μ (∂f/∂z)*.

For real-valued functions of complex variables, setting

∇f(z) := (∂f/∂z)*

gives the gradient update

z_{τ+1} = z_τ − μ∇f(z_τ).

The reader may wonder why we choose to work with conjugate coordinates as there are alternatives: in particular, we could view the complex variable z = x + iy ∈ ℂⁿ as a vector in ℝ²ⁿ and just run gradient descent in the x, y coordinate system. The main reason why conjugate coordinates are particularly attractive is that expressions for derivatives become significantly simpler and resemble those we obtain in the real case, where f: ℝⁿ → ℝ is a function of real variables.

VII. PROOFS

A. Preliminaries

We first note that in the CDP model with admissible CDPs, ‖a_r‖ ≤ √(6n) for all r = 1, 2, ..., m, as the entries of the CDPs obey |d| ≤ √3 < √6. In the Gaussian model the measurement vectors also obey ‖a_r‖ ≤ √(6n) for all r = 1, 2, ..., m with probability at least 1 − me^{−1.5n}. Throughout the proofs, we assume we are on this event without explicitly mentioning it each time.

Before we begin with the proofs we should mention that we will prove our result using the update

z_{τ+1} = z_τ − (μ/‖x‖²) ∇f(z_τ), (VII.1)

in lieu of the WF update

z_{τ+1} = z_τ − (μ_WF/‖z₀‖²) ∇f(z_τ). (VII.2)

Since |‖z₀‖² − ‖x‖²| ≤ (1/64)‖x‖² holds with high probability as proven in Section VII-H, we have

‖z₀‖² ≥ (63/64) ‖x‖². (VII.3)

Therefore, the results for the update (VII.1) automatically carry over to the WF update with a simple rescaling of the upper bound on the learning parameter. More precisely, if we prove that the update (VII.1) converges to a global optimum as long as μ ≤ μ₀, then the convergence of the WF update to a global optimum is guaranteed as long as μ_WF ≤ (63/64)μ₀. Also, the update in (VII.1) is invariant to the Euclidean norm of x. Therefore, without loss of generality we will assume throughout the proofs that ‖x‖ = 1.

We remind the reader that throughout, x is a solution to our quadratic equations, i.e. obeys y = |Ax|², and that the sampling vectors are independent from x. Define

P := {xe^{iφ}: φ ∈ [0, 2π]}

to be the set of all vectors that differ from the planted solution x only by a global phase factor. We also introduce the set of all points that are close to P,

E(ε) := {z ∈ ℂⁿ: dist(z, P) ≤ ε}. (VII.4)

Finally, for any vector z ∈ ℂⁿ we define the phase φ(z) as

φ(z) := arg min_{φ ∈ [0, 2π]} ‖z − e^{iφ}x‖,

so that

dist(z, x) = ‖z − e^{iφ(z)}x‖.
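Under the convention above, the Wirtinger gradient relates to ordinary partial derivatives through ∂f/∂x_j + i ∂f/∂y_j = 2∇f(z)_j, which can be checked numerically. The sketch below uses the objective f(z) = (1/2m)Σ(y_r − |a_r*z|²)² and its gradient ∇f(z) = (1/m)Σ(|a_r*z|² − y_r)(a_r a_r*)z; the sizes are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 40
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = np.abs(A @ x) ** 2

def f(z):
    return np.sum((y - np.abs(A @ z) ** 2) ** 2) / (2 * m)

def grad(z):                       # Wirtinger-flow gradient
    Az = A @ z
    return A.conj().T @ ((np.abs(Az) ** 2 - y) * Az) / m

z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
g = grad(z)
eps = 1e-5
num = np.zeros(n, dtype=complex)
for j in range(n):
    e = np.zeros(n, dtype=complex)
    e[j] = 1.0
    dfx = (f(z + eps * e) - f(z - eps * e)) / (2 * eps)        # d f / d x_j
    dfy = (f(z + 1j * eps * e) - f(z - 1j * eps * e)) / (2 * eps)  # d f / d y_j
    num[j] = dfx + 1j * dfy        # should match 2 * g[j]
```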

B. Formulas for the Complex Gradient and Hessian

We gather some useful gradient and Hessian calculations that will be used repeatedly. Starting with

f(z) = (1/2m) Σ_{r=1}^m ( y_r − z̄ᵀ(a_r a_r*)z )( y_r − zᵀ(a_r a_r*)ᵀz̄ ),

we establish

∂/∂z f(z) = (1/m) Σ_{r=1}^m ( zᵀ(a_r a_r*)ᵀz̄ − y_r ) z̄ᵀ(a_r a_r*).

This gives

∇f(z) = [∂/∂z f(z)]* = (1/m) Σ_{r=1}^m ( z̄ᵀ(a_r a_r*)z − y_r )(a_r a_r*)z.

For the second derivative, we write

H_zz = (∂/∂z)(∂f/∂z)* = (1/m) Σ_{r=1}^m ( 2|a_r*z|² − y_r ) a_r a_r*

and

H_z̄z = (∂/∂z̄)(∂f/∂z)* = (1/m) Σ_{r=1}^m (a_r*z)² a_r a_rᵀ.

In this coordinate system the complex Hessian reads

∇²f(z) = (1/m) Σ_{r=1}^m [ (2|a_r*z|² − y_r) a_r a_r*   (a_r*z)² a_r a_rᵀ ; (z*a_r)² ā_r a_r*   (2|a_r*z|² − y_r) ā_r a_rᵀ ].

C. Expectation and Concentration

This section gathers some useful intermediate results whose proofs are deferred to Appendix A. The first two lemmas establish the expectation of the Hessian, gradient and a related random variable in both the Gaussian and admissible CDP models.⁶

Lemma 7.1: Recall that x is a solution obeying ‖x‖ = 1, which is independent from the sampling vectors. Furthermore, assume the sampling vectors a_r are distributed according to either the Gaussian or admissible CDP model. Then

E[∇²f(x)] = I_{2n} + (3/2) [x; x̄][x*, xᵀ] − (1/2) [x; −x̄][x*, −xᵀ].

Lemma 7.2: In the setup of Lemma 7.1, let z ∈ ℂⁿ be a fixed vector independent of the sampling vectors. We have

E[∇f(z)] = (I − xx*)z + 2(‖z‖² − 1)z.

The next lemma gathers some useful identities in the Gaussian model.

Lemma 7.3: Assume u, v ∈ ℂⁿ are fixed vectors obeying ‖u‖ = ‖v‖ = 1 which are independent of the sampling vectors. Furthermore, assume the measurement vectors a_r are distributed according to the Gaussian model. Then

E[ Re(u*a_r a_r*v)² ] = 1/2 + (3/2)(Re(u*v))² − (1/2)(Im(u*v))², (VII.5)
E[ Re(u*a_r a_r*v) |a_r*v|² ] = 2 Re(u*v), (VII.6)
E[ |a_r*v|^{2k} ] = k!. (VII.7)

The next lemma establishes the concentration of the Hessian around its mean for both the Gaussian and the CDP model.

Lemma 7.4: In the setup of Lemma 7.1, assume the vectors a_r are distributed according to either the Gaussian or admissible CDP model with a sufficiently large number of measurements. This means that the number of samples obeys m ≥ c(δ) · n log n in the Gaussian model and the number of patterns obeys L ≥ c(δ) · log³n in the CDP model. Then

‖∇²f(x) − E[∇²f(x)]‖ ≤ δ (VII.8)

holds with probability at least 1 − 10e^{−γn} − 8/n² and 1 − (2L + 1)/n³ for the Gaussian and CDP models, respectively.

We will also make use of the two results below, which are corollaries of the three lemmas above. These corollaries are also proven in Appendix A.

Corollary 7.5: Suppose ‖∇²f(x) − E[∇²f(x)]‖ ≤ δ. Then for all h ∈ ℂⁿ obeying ‖h‖ = 1, we have

(1/m) Σ_{r=1}^m Re(h*a_r a_r*x)² = (1/4) [h; h̄]* ∇²f(x) [h; h̄]
≤ (1/2)‖h‖² + (3/2) Re(x*h)² − (1/2) Im(x*h)² + δ/2.

In the other direction,

(1/m) Σ_{r=1}^m Re(h*a_r a_r*x)² ≥ (1/2)‖h‖² + (3/2) Re(x*h)² − (1/2) Im(x*h)² − δ/2.

Corollary 7.6: Suppose ‖∇²f(x) − E[∇²f(x)]‖ ≤ δ. Then for all h ∈ ℂⁿ obeying ‖h‖ = 1, we have

(1/m) Σ_{r=1}^m |a_r*x|² |a_r*h|² = h* ( (1/m) Σ_{r=1}^m |a_r*x|² a_r a_r* ) h
≥ (1 − δ)‖h‖² + |h*x|²
≥ (1 − δ)‖h‖²,

and

(1/m) Σ_{r=1}^m |a_r*x|² |a_r*h|² ≤ (1 + δ)‖h‖² + |h*x|² ≤ (2 + δ)‖h‖².

The next lemma establishes the concentration of the gradient around its mean for both Gaussian and admissible CDP models.

⁶In the CDP model the expectation is with respect to the random modulation pattern.
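The moment identity (VII.7), E|a_r*v|^{2k} = k!, holds because a_r*v is standard complex Gaussian for unit v, so |a_r*v|² is exponentially distributed with mean one. A quick Monte Carlo check (the sample size is an arbitrary choice; only the first two moments are tested here):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 8, 200_000
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v /= np.linalg.norm(v)
# a ~ N(0, I/2) + i N(0, I/2), so that a^* v is CN(0, 1) for unit v
a = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)
s = np.abs(a @ v.conj()) ** 2       # samples of |a_r^* v|^2 ~ Exp(1)
m1 = s.mean()                       # should be close to 1! = 1
m2 = (s ** 2).mean()                # should be close to 2! = 2
```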

Lemma 7.7: In the setup of Lemma 7.4, let z ∈ ℂⁿ be a fixed vector independent of the sampling vectors obeying dist(z, x) ≤ 1/2. Then

‖∇f(z) − E[∇f(z)]‖ ≤ δ · dist(z, x)

holds with probability at least 1 − 20e^{−γm} − 4m/n⁴ in the Gaussian model and 1 − (4L + 2)/n³ in the CDP model.

We finish with a result concerning the concentration of the sample covariance matrix.

Lemma 7.8: In the setup of Lemma 7.4,

‖I_n − m⁻¹ Σ_{r=1}^m a_r a_r*‖ ≤ δ

holds with probability at least 1 − 2e^{−γm} for the Gaussian model and 1 − 1/n² in the CDP model. On this event,

(1 − δ)‖h‖² ≤ (1/m) Σ_{r=1}^m |a_r*h|² ≤ (1 + δ)‖h‖², ∀h ∈ ℂⁿ. (VII.9)

D. General Convergence Analysis

We will assume that the function f satisfies a regularity condition on E(ε), which essentially states that the gradient of the function is well behaved. We remind the reader that E(ε), as defined in (VII.4), is the set of points that are close to the path of global minimizers.

Condition 7.9 (Regularity Condition): We say that the function f satisfies the regularity condition or RC(α, β, ε) if for all vectors z ∈ E(ε) we have

Re(⟨∇f(z), z − xe^{iφ(z)}⟩) ≥ (1/α) dist²(z, x) + (1/β) ‖∇f(z)‖². (VII.10)

In the lemma below we show that as long as the regularity condition holds on E(ε), then Wirtinger Flow starting from an initial solution in E(ε) converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.

Lemma 7.10: Assume that f obeys RC(α, β, ε) for all z ∈ E(ε). Furthermore, suppose z₀ ∈ E(ε), and assume 0 < μ ≤ 2/β. Consider the following update

z_{τ+1} = z_τ − μ∇f(z_τ).

Then for all τ we have z_τ ∈ E(ε) and

dist²(z_τ, x) ≤ (1 − 2μ/α)^τ dist²(z₀, x).

We note that for αβ < 4, (VII.10) holds with the direction of the inequality reversed.⁷ Thus, if RC(α, β, ε) holds, α and β must obey αβ ≥ 4. As a result, under the stated assumptions of Lemma 7.10 above, the factor 1 − 2μ/α ≥ 1 − 4/(αβ) is non-negative.

Proof: The proof follows a structure similar to related results in the convex optimization literature; see [55, Th. 2.1.15]. However, unlike these classical results where the goal is often to prove convergence to a unique global optimum, the objective function f does not have a unique global optimum. Indeed, in our problem, if x is a solution, then e^{iφ}x is also a solution. Hence, proper modification is required to prove convergence results.

We prove that if z ∈ E(ε), then for all 0 < μ ≤ 2/β,

z₊ = z − μ∇f(z)

obeys

dist²(z₊, x) ≤ (1 − 2μ/α) dist²(z, x). (VII.11)

Therefore, if z ∈ E(ε), then we also have z₊ ∈ E(ε). The lemma follows by inductively applying (VII.11). Now simple algebraic manipulations together with the regularity condition (VII.10) give

‖z₊ − xe^{iφ(z)}‖² = ‖z − xe^{iφ(z)} − μ∇f(z)‖²
= ‖z − xe^{iφ(z)}‖² − 2μ Re(⟨∇f(z), z − xe^{iφ(z)}⟩) + μ²‖∇f(z)‖²
≤ ‖z − xe^{iφ(z)}‖² − 2μ ( (1/α)‖z − xe^{iφ(z)}‖² + (1/β)‖∇f(z)‖² ) + μ²‖∇f(z)‖²
= (1 − 2μ/α) ‖z − xe^{iφ(z)}‖² + μ(μ − 2/β) ‖∇f(z)‖²
≤ (1 − 2μ/α) ‖z − xe^{iφ(z)}‖²,

where the last line follows from μ ≤ 2/β. The definition of φ(z₊) gives

‖z₊ − xe^{iφ(z₊)}‖² ≤ ‖z₊ − xe^{iφ(z)}‖²,

which concludes the proof.

⁷One can see this by applying Cauchy-Schwarz and calculating the determinant of the resulting quadratic form.

E. Proof of the Regularity Condition

For any z ∈ E(ε), we need to show that

Re(⟨∇f(z), z − xe^{iφ(z)}⟩) ≥ (1/α) dist²(z, x) + (1/β) ‖∇f(z)‖². (VII.12)

We prove this with δ = 0.01 by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining these two properties gives (VII.12).

Condition 7.11 (Local Curvature Condition): We say that the function f satisfies the local curvature condition or LCC(α, ε, δ) if for all vectors z ∈ E(ε),

Re(⟨∇f(z), z − xe^{iφ(z)}⟩) ≥ ( (1/α) + (1 − δ)/4 ) dist²(z, x) + (1/10m) Σ_{r=1}^m |a_r*(z − e^{iφ(z)}x)|⁴. (VII.13)

This condition essentially states that the function curves sufficiently upwards (along most directions) near the curve of global optimizers.

Condition 7.12 (Local Smoothness Condition): We say that the function f satisfies the local smoothness condition or LSC(β, ε, δ) if for all vectors z ∈ E(ε) we have

‖∇f(z)‖² ≤ β ( ((1 − δ)/4) dist²(z, x) + (1/10m) Σ_{r=1}^m |a_r*(z − e^{iφ(z)}x)|⁴ ). (VII.14)

This condition essentially states that the gradient of the function is well behaved (the function does not vary too much) near the curve of global optimizers.

F. Proof of the Local Curvature Condition

For any z ∈ E(ε), we want to prove the local curvature condition (VII.13). Recall that

∇f(z) = (1/m) Σ_{r=1}^m ( |⟨a_r, z⟩|² − y_r )(a_r a_r*)z,

and define h := e^{−iφ(z)}z − x. To establish (VII.13) it suffices to prove that

(1/m) Σ_{r=1}^m ( 2 Re(h*a_r a_r*x)² + 3 Re(h*a_r a_r*x)|a_r*h|² + |a_r*h|⁴ )
≥ (1/10m) Σ_{r=1}^m |a_r*h|⁴ + ( (1/α) + (1 − δ)/4 ) ‖h‖² (VII.15)

holds for all h satisfying Im(h*x) = 0, ‖h‖ ≤ ε. Equivalently, we only need to prove that for all h satisfying Im(h*x) = 0, ‖h‖ = 1 and for all s with 0 ≤ s ≤ ε,

(1/m) Σ_{r=1}^m ( 2 Re(h*a_r a_r*x)² + 3s Re(h*a_r a_r*x)|a_r*h|² + (9/10)s²|a_r*h|⁴ ) ≥ (1/α) + (1 − δ)/4. (VII.16)

By Corollary 7.5, with high probability,

(1/m) Σ_{r=1}^m Re(h*a_r a_r*x)² ≤ (1 + δ)/2 + (3/2) Re(x*h)²

holds for all h obeying ‖h‖ = 1. Therefore, to establish the local curvature condition (VII.13) it suffices to show that

(1/m) Σ_{r=1}^m ( (5/2) Re(h*a_r a_r*x)² + 3s Re(h*a_r a_r*x)|a_r*h|² + (9/10)s²|a_r*h|⁴ ) ≥ (1/α) + 1/2 + (3/4) Re(x*h)². (VII.17)

We will establish (VII.17) for different measurement models and different values of ε. Below, it shall be convenient to use the shorthand

Y_r(h, s) := (5/2) Re(h*a_r a_r*x)² + 3s Re(h*a_r a_r*x)|a_r*h|² + (9/10)s²|a_r*h|⁴,
Y(h, s) := (1/m) Σ_{r=1}^m Y_r(h, s).

1) Proof of (VII.17) With ε = 1/(8√n) in the Gaussian and CDP Models: Set ε = 1/(8√n). We show that with high probability, (VII.17) holds for all h satisfying Im(h*x) = 0, ‖h‖ = 1, 0 ≤ s ≤ ε, δ ≤ 0.01, and α ≥ 30. First, note that by the Cauchy-Schwarz inequality,

Y(h, s) ≥ (5/2m) Σ_{r=1}^m Re(h*a_r a_r*x)²
− 3s ( (1/m) Σ_{r=1}^m Re(h*a_r a_r*x)² )^{1/2} ( (1/m) Σ_{r=1}^m |a_r*h|⁴ )^{1/2}
+ (9s²/10m) Σ_{r=1}^m |a_r*h|⁴
= ( ( (5/2m) Σ_{r=1}^m Re(h*a_r a_r*x)² )^{1/2} − s ( (9/10m) Σ_{r=1}^m |a_r*h|⁴ )^{1/2} )²
≥ (5/4m) Σ_{r=1}^m Re(h*a_r a_r*x)² − (9s²/10m) Σ_{r=1}^m |a_r*h|⁴. (VII.18)

The last inequality follows from (a − b)² ≥ a²/2 − b². By Corollary 7.5,

(1/m) Σ_{r=1}^m Re(h*a_r a_r*x)² ≥ (1 − δ)/2 + (3/2) Re(x*h)² (VII.19)

holds with high probability for all h obeying ‖h‖ = 1. Furthermore, by applying Lemma 7.8,

(1/m) Σ_{r=1}^m |a_r*h|⁴ ≤ ( max_r ‖a_r‖² ) (1/m) Σ_{r=1}^m |a_r*h|² ≤ 6(1 + δ)n (VII.20)

holds with high probability. Plugging (VII.19) and (VII.20) in (VII.18) yields

Y(h, s) ≥ (15/8) Re(x*h)² + (5/8)(1 − δ) − (27/5) s²(1 + δ)n.

(VII.17) follows by using α ≥ 30, ε = 1/(8√n) and δ = 0.01.

2) Proof of (VII.17) With ε = 1/8 in the Gaussian Model: Set ε = 1/8. We show that with high probability, (VII.17) holds for all h satisfying Im(h*x) = 0, ‖h‖ = 1, 0 ≤ s ≤ ε, δ ≤ 2, and α ≥ 8. To this end, we first state a result about the tail of a sum of i.i.d. random variables. Below, Φ is the cumulative distribution function of a standard normal variable.

Lemma 7.13 [11]: Suppose X₁, X₂, ..., X_m are i.i.d. real-valued random variables obeying X_r ≤ b for some nonrandom b > 0, E[X_r] = 0, and E[X_r²] = v². Setting σ² = m · max(b², v²),

P(X₁ + ... + X_m ≥ y) ≤ min( exp(−y²/(2σ²)), c₀(1 − Φ(y/σ)) ),

where one can take c₀ = 25.

To establish (VII.17) we first prove it for a fixed h, and then use a covering argument. Observe that

Y_r := Y_r(h, s) = ( √(5/2) Re(h*a_r a_r*x) + √(9/10) s|a_r*h|² )².

By Lemma 7.3,

E[ Re(h*a_r a_r*x)² ] = 1/2 + (3/2)(Re(x*h))²,
E[ Re(h*a_r a_r*x)|a_r*h|² ] = 2 Re(x*h);

compare (VII.5) and (VII.6). Therefore, using s ≤ 1/8,

μ_r := E[Y_r] = (5/4)(1 + 3 Re(x*h)²) + 6s Re(x*h) + (27/10)s² < 6.

Now define X_r = μ_r − Y_r. First, since Y_r ≥ 0, X_r ≤ μ_r < 6. Second, we bound E[X_r²] using Lemma 7.3 and Hölder's inequality with s ≤ 1/8:

E[X_r²] ≤ E[Y_r²] = (25/4) E[Re(h*a_r a_r*x)⁴] + (81/100) s⁴ E[|a_r*h|⁸]
+ (27/2) s² E[Re(h*a_r a_r*x)²|a_r*h|⁴] + 15s E[Re(h*a_r a_r*x)³|a_r*h|²] + (27/5) s³ E[Re(h*a_r a_r*x)|a_r*h|⁶]
≤ (25/4) √( E[|a_r*h|⁸] E[|a_r*x|⁸] ) + (81/100) s⁴ E[|a_r*h|⁸]
+ (27/2) s² √( E[|a_r*h|¹²] E[|a_r*x|⁴] ) + 15s √( E[|a_r*h|¹⁰] E[|a_r*x|⁶] ) + (27/5) s³ √( E[|a_r*h|¹⁴] E[|a_r*x|²] )
< 20s⁴ + 543s³ + 513s² + 403s + 150
< 210.

We have all the elements to apply Lemma 7.13 with σ² = m · max(6², 210) = 210m and y = m/4:

P( Σ_{r=1}^m (μ_r − Y_r) ≥ m/4 ) ≤ e^{−2γm},

with γ = 1/840. Therefore, with probability at least 1 − e^{−2γm}, we have

Y(h, s) ≥ (5/4)(1 + 3 Re(x*h)²) + 6s Re(x*h) + 2.7s² − 1/4
= 3/4 + (3/4) Re(x*h)² + 3(Re(x*h) + s)² + (1/4 − (3/10)s²)
≥ 3/4 + (3/4) Re(x*h)², (VII.21)

provided that s ≤ √(5/6). The inequality above holds for a fixed vector h. To prove (VII.17) for all h ∈ ℂⁿ with ‖h‖ = 1, define

p_r(h) := √(5/2) Re(h*a_r a_r*x) + √(9/10) s|a_r*h|².

Using the fact that max_r ‖a_r‖ ≤ √(6n) and s ≤ 1/8, we have |p_r(h)| ≤ 2|a_r*h||a_r*x| + s|a_r*h|² ≤ 13n. Moreover, for any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1,

|p_r(u) − p_r(v)| ≤ √(5/2) |Re((u − v)*a_r a_r*x)| + √(9/10) s |a_r*(u + v)| |a_r*(u − v)| ≤ (27/2) n ‖u − v‖.

Now set

q(h) := (1/m) Σ_{r=1}^m p_r(h)² − (3/4) Re(x*h)² = Y(h, s) − (3/4) Re(x*h)².

For any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1,

|q(u) − q(v)| = | (1/m) Σ_{r=1}^m (p_r(u) − p_r(v))(p_r(u) + p_r(v)) − (3/4) Re(x*(u − v)) Re(x*(u + v)) |
≤ (27n/2) × 2 × 13n ‖u − v‖ + (3/2) ‖u − v‖
= (351n² + 3/2) ‖u − v‖. (VII.22)

Therefore, for any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1 and ‖u − v‖ ≤ η := 1/(6000n²), we have

q(v) ≥ q(u) − 1/16. (VII.23)

Let N_η be an η-net for the unit sphere of ℂⁿ with cardinality obeying |N_η| ≤ (1 + 2/η)^{2n}. Applying (VII.21) together with the union bound we conclude that

P( q(u) ≥ 3/4 for all u ∈ N_η ) ≥ 1 − |N_η| e^{−2γm}
≥ 1 − (1 + 12000n²)^{2n} e^{−2γm}
≥ 1 − e^{−γm}. (VII.24)

The last line follows by choosing m such that m ≥ c · n log n, where c is a sufficiently large constant. Now for any h on the unit sphere of ℂⁿ, there exists a vector u ∈ N_η such that ‖h − u‖ ≤ η. By combining (VII.23) and (VII.24),

q(h) ≥ 3/4 − 1/16 > 5/8
⟹ Y(h, s) ≥ 1/8 + 1/2 + (3/4) Re(x*h)²,

holds with probability at least 1 − e^{−γm}. This concludes the proof of (VII.17) with α ≥ 8.

G. Proof of the Local Smoothness Condition

For any z ∈ E(ε), we want to prove (VII.14), which is equivalent to proving that for all u ∈ ℂⁿ obeying ‖u‖ = 1, we have

|(∇f(z))*u|² ≤ β ( ((1 − δ)/4) dist²(z, x) + (1/10m) Σ_{r=1}^m |a_r*(z − e^{iφ(z)}x)|⁴ ).

Recall that

∇f(z) = (1/m) Σ_{r=1}^m ( |⟨a_r, z⟩|² − y_r )(a_r a_r*)z

and define

g(h, w, s) = (1/m) Σ_{r=1}^m ( 2 Re(h*a_r a_r*x) Re(w*a_r a_r*x) + s|a_r*h|² Re(w*a_r a_r*x)
+ 2s Re(h*a_r a_r*x) Re(w*a_r a_r*h) + s²|a_r*h|² Re(w*a_r a_r*h) ).

Define h := e^{−iφ(z)}z − x and w := e^{−iφ(z)}u; to establish (VII.14) it suffices to prove that

|g(h, w, 1)| ≤ β ( ((1 − δ)/4) ‖h‖² + (1/10m) Σ_{r=1}^m |a_r*h|⁴ ) (VII.25)

holds for all h and w satisfying Im(h*x) = 0, ‖h‖ ≤ ε and ‖w‖ = 1. Equivalently, we only need to prove that for all h and w satisfying Im(h*x) = 0, ‖h‖ = ‖w‖ = 1 and for all s with 0 ≤ s ≤ ε,

|g(h, w, s)| ≤ β ( (1 − δ)/4 + (s²/10m) Σ_{r=1}^m |a_r*h|⁴ ). (VII.26)

Note that since (a + b + c)² ≤ 3(a² + b² + c²),

|g(h, w, s)|² ≤ ( (1/m) Σ_{r=1}^m ( 2|h*a_r||w*a_r||a_r*x|² + 3s|h*a_r|²|a_r*x||w*a_r| + s²|a_r*h|³|w*a_r| ) )²
≤ 3 ( (2/m) Σ_{r=1}^m |h*a_r||w*a_r||a_r*x|² )² + 3 ( (3s/m) Σ_{r=1}^m |h*a_r|²|a_r*x||w*a_r| )² + 3 ( (s²/m) Σ_{r=1}^m |a_r*h|³|w*a_r| )²
:= 3(I₁ + I₂ + I₃). (VII.27)

We now bound each of the terms on the right-hand side. For the first term we use Cauchy-Schwarz and Corollary 7.6, which give

I₁ ≤ 4 ( (1/m) Σ_{r=1}^m |a_r*x|²|a_r*w|² ) ( (1/m) Σ_{r=1}^m |a_r*x|²|a_r*h|² ) ≤ 4(2 + δ)². (VII.28)

Similarly, for the second term, we have

I₂ ≤ 9s² ( (1/m) Σ_{r=1}^m |a_r*h|⁴ ) ( (1/m) Σ_{r=1}^m |a_r*w|²|a_r*x|² ) ≤ 9(2 + δ) s² (1/m) Σ_{r=1}^m |a_r*h|⁴. (VII.29)

Finally, for the third term we use the Cauchy-Schwarz inequality together with Lemma 7.8 (inequality (VII.9)) to derive

(1/m) Σ_{r=1}^m |a_r*h|³|w*a_r| ≤ ( max_r ‖a_r‖ ) (1/m) Σ_{r=1}^m |a_r*h|³
≤ √(6n) ( (1/m) Σ_{r=1}^m |a_r*h|⁴ )^{1/2} ( (1/m) Σ_{r=1}^m |a_r*h|² )^{1/2}
≤ √(6n(1 + δ)) ( (1/m) Σ_{r=1}^m |a_r*h|⁴ )^{1/2},

so that

I₃ ≤ 6s⁴ n(1 + δ) (1/m) Σ_{r=1}^m |a_r*h|⁴. (VII.30)

We now plug these inequalities into (VII.27) and get

|g(h, w, s)|² ≤ 12(2 + δ)² + (27s²(2 + δ)/m) Σ_{r=1}^m |a_r*h|⁴ + (18s⁴n(1 + δ)/m) Σ_{r=1}^m |a_r*h|⁴
≤ β ( (1 − δ)/4 + (s²/10m) Σ_{r=1}^m |a_r*h|⁴ ), (VII.31)

which completes the proof of (VII.26) and, in turn, establishes the local smoothness condition in (VII.14). However, the last line of (VII.31) holds as long as

β ≥ max( 48(2 + δ)²/(1 − δ), 270(2 + δ) + 180 ε²n(1 + δ) ). (VII.32)

In our theorems we use two different values of ε: ε = 1/(8√n) and ε = 1/8. Using δ ≤ 0.01 in (VII.32) we conclude that the local smoothness condition (VII.14) holds as long as

β ≥ 550 for ε = 1/(8√n),
β ≥ 3n + 550 for ε = 1/8.

H. Wirtinger Flow Initialization

In this section, we prove that the WF initialization obeys (III.1) from Theorem 3.3. Recall that

Y := (1/m) Σ_{r=1}^m |a_r*x|² a_r a_r*,

and that Lemma 7.4 gives

‖Y − (xx* + ‖x‖²I)‖ ≤ δ := 0.001.

Let z̃₀ be the eigenvector corresponding to the top eigenvalue λ₀ of Y obeying ‖z̃₀‖ = 1. It is easy to see that

|λ₀ − (|z̃₀*x|² + 1)| = |z̃₀*Y z̃₀ − z̃₀*(xx* + I)z̃₀|
= |z̃₀*(Y − (xx* + I))z̃₀|
≤ ‖Y − (xx* + I)‖
≤ δ.

Hence,

|z̃₀*x|² ≥ λ₀ − 1 − δ.

Also, since λ₀ is the top eigenvalue of Y and ‖x‖ = 1, we have

λ₀ ≥ x*Yx = x*(Y − (I + xx*))x + 2 ≥ 2 − δ.

Combining the above two inequalities together, we have

|z̃₀*x|² ≥ 1 − 2δ
⟹ dist²(z̃₀, x) ≤ 2 − 2√(1 − 2δ) < 1/256
⟹ dist(z̃₀, x) ≤ 1/16.

Recall that z₀ = √( (1/m) Σ_{r=1}^m |a_r*x|² ) z̃₀. By Lemma 7.8, equation (VII.9), with high probability we have

|‖z₀‖² − 1| = | (1/m) Σ_{r=1}^m |a_r*x|² − 1 | ≤ 1/256
⟹ |‖z₀‖ − 1| ≤ 1/16.

Therefore, we have

dist(z₀, x) ≤ ‖z₀ − z̃₀‖ + dist(z̃₀, x) = |‖z₀‖ − 1| + dist(z̃₀, x) ≤ 1/8.

I. Initialization via Resampled Wirtinger Flow

In this section, we prove that the output of Algorithm 2 obeys (V.1) from Theorem 5.1. Introduce

F(z) = (1/2) z*(I − xx*)z + (1/2)(‖z‖² − 1)².

By Lemma 7.2, if z ∈ ℂⁿ is a vector independent from the measurements, then

E[∇f(z; b)] = ∇F(z).

We prove that F obeys a regularization condition in E(1/8), namely,

Re(⟨∇F(z), z − xe^{iφ(z)}⟩) ≥ (1/α′) dist²(z, x) + (1/β′) ‖∇F(z)‖². (VII.33)

Lemma 7.7 implies that for a fixed vector z,

Re(⟨∇f(z; b), z − xe^{iφ(z)}⟩) = Re(⟨∇F(z), z − xe^{iφ(z)}⟩) + Re(⟨∇f(z; b) − ∇F(z), z − xe^{iφ(z)}⟩)
≥ Re(⟨∇F(z), z − xe^{iφ(z)}⟩) − ‖∇f(z; b) − ∇F(z)‖ dist(z, x)
≥ Re(⟨∇F(z), z − xe^{iφ(z)}⟩) − δ dist²(z, x)
≥ (1/α′ − δ) dist²(z, x) + (1/β′) ‖∇F(z)‖² (VII.34)

holds with high probability. The last inequality follows from (VII.33). Applying Lemma 7.7, we also have

‖∇F(z)‖² ≥ (1/2)‖∇f(z; b)‖² − ‖∇f(z; b) − ∇F(z)‖² ≥ (1/2)‖∇f(z; b)‖² − δ² dist²(z, x).

Plugging the latter into (VII.34) yields

Re(⟨∇f(z; b), z − xe^{iφ(z)}⟩) ≥ ( 1/α′ − δ²/β′ − δ ) dist²(z, x) + (1/2β′) ‖∇f(z; b)‖²
:= (1/α̃) dist²(z, x) + (1/β̃) ‖∇f(z; b)‖².

Therefore, using the general convergence analysis of gradient descent discussed in Section VII-D we conclude that for all μ̃ ≤ μ̃₀ := 2/β̃,

dist²(u_{b+1}, x) ≤ (1 − 2μ̃/α̃) dist²(u_b, x).

Finally,
\[
B \ge -\frac{\log n}{\log\bigl(1 - \tfrac{2\tilde{\mu}}{\tilde{\alpha}}\bigr)}
\]
implies
\[
\mathrm{dist}(u_B, x) \le \Bigl(1 - \frac{2\tilde{\mu}}{\tilde{\alpha}}\Bigr)^{B/2}\mathrm{dist}(u_0, x)
\le \Bigl(1 - \frac{2\tilde{\mu}}{\tilde{\alpha}}\Bigr)^{B/2}\,\frac{1}{8}
\le \frac{1}{8\sqrt{n}}.
\]

It only remains to establish (VII.33). First, without loss of generality, we can assume that $\phi(z) = 0$, which implies $\mathrm{Re}(z^* x) = |z^* x|$, and use $\|z - x\|$ in lieu of $\mathrm{dist}(z, x)$. Set $h := z - x$ so that $\mathrm{Im}(x^* h) = 0$. This implies
\[
\nabla F(z) = (I - xx^*)z + 2\bigl(\|z\|^2 - 1\bigr)z
= (I - xx^*)(x + h) + 2\bigl(\|x + h\|^2 - 1\bigr)(x + h)
\]
\[
= (I - xx^*)h + 2\bigl(2\,\mathrm{Re}(x^* h) + \|h\|^2\bigr)(x + h)
= \bigl(1 + 4(x^* h) + 2\|h\|^2\bigr)h + \bigl(3(x^* h) + 2\|h\|^2\bigr)x.
\]
In particular,
\[
\|\nabla F(z)\| \le 4\|h\| + 6\|h\|^2 + 2\|h\|^3 \le 5\|h\|, \qquad \text{(VII.35)}
\]
where the last inequality is due to $\|h\| \le \epsilon \le 1/8$. Furthermore,
\[
\mathrm{Re}\,\langle \nabla F(z), z - x \rangle
= \mathrm{Re}\,\bigl\langle \bigl(1 + 4(x^* h) + 2\|h\|^2\bigr)h + \bigl(3(x^* h) + 2\|h\|^2\bigr)x,\; h \bigr\rangle
\]
\[
= \|h\|^2 + 2\|h\|^4 + 6\|h\|^2 (x^* h) + 3(x^* h)^2
\ge \frac{1}{4}\|h\|^2, \qquad \text{(VII.36)}
\]
where the last inequality also holds because $\|h\| \le \epsilon \le 1/8$. Finally, (VII.35) and (VII.36) imply
\[
\mathrm{Re}\,\langle \nabla F(z), z - x \rangle \ge \frac{1}{4}\|h\|^2
\ge \frac{1}{8}\|h\|^2 + \frac{1}{200}\|\nabla F(z)\|^2
:= \frac{1}{\alpha}\|h\|^2 + \frac{1}{\beta}\|\nabla F(z)\|^2,
\]
where $\alpha = 8$ and $\beta = 200$.

APPENDIX A

We provide here the proofs of our intermediate results. Throughout this section we use $D \in \mathbb{C}^{n \times n}$ to denote a diagonal random matrix with diagonal elements being i.i.d. samples from an admissible distribution $d$ (recall the definition (IV.2) of an admissible random variable). For ease of exposition, we shall rewrite (IV.1) in the form
\[
y_r = \Bigl| \sum_{t=0}^{n-1} x[t]\,\bar{d}_\ell(t)\, e^{-i 2\pi k t / n} \Bigr|^2
= \bigl| f_k^* D_\ell^* x \bigr|^2,
\qquad r = (\ell, k), \quad 0 \le k \le n-1, \quad 1 \le \ell \le L,
\]
where $f_k^*$ is the $k$th row of the $n \times n$ DFT matrix and $D_\ell$ is a diagonal matrix with the diagonal entries given by $d_\ell(0), d_\ell(1), \ldots, d_\ell(n-1)$. In our model, the matrices $D_\ell$ are i.i.d. copies of $D$.

A. Proof of Lemma 7.1

The proof for admissible coded diffraction patterns follows from [16, Lemmas 3.1 and 3.2]. For the Gaussian model, it is a consequence of the two lemmas below, whose proofs are omitted.

Lemma A.1: Suppose the sequence $\{a_r\}$ follows the Gaussian model. Then for any fixed vector $x \in \mathbb{C}^n$,
\[
\mathbb{E}\Bigl[\frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^*\Bigr] = xx^* + \|x\|^2 I.
\]

Lemma A.2: Suppose the sequence $\{a_r\}$ follows the Gaussian model. Then for any fixed vector $x \in \mathbb{C}^n$,
\[
\mathbb{E}\Bigl[\frac{1}{m}\sum_{r=1}^m (a_r^* x)^2\, a_r a_r^T\Bigr] = 2xx^T.
\]

B. Proof of Lemma 7.2

Recall that
\[
\nabla f(z) = \frac{1}{m}\sum_{r=1}^m \bigl(|\langle a_r, z\rangle|^2 - y_r\bigr)(a_r a_r^*)z
= \frac{1}{m}\sum_{r=1}^m \bigl(|\langle a_r, z\rangle|^2 - |\langle a_r, x\rangle|^2\bigr)(a_r a_r^*)z.
\]
Thus by applying [16, Lemma 3.1] (for the CDP model) and Lemma A.1 above (for the Gaussian model) we have
\[
\mathbb{E}[\nabla f(z)]
= \mathbb{E}\Bigl[\frac{1}{m}\sum_{r=1}^m \bigl(|a_r^* z|^2\, a_r a_r^* z - |a_r^* x|^2\, a_r a_r^* z\bigr)\Bigr]
= (zz^* + \|z\|^2 I)z - (xx^* + I)z
= 2(\|z\|^2 - 1)z + (I - xx^*)z.
\]
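The two moment identities of Lemmas A.1 and A.2 are easy to confirm by simulation. The following is a hedged sketch (not part of the paper; sizes and tolerances are illustrative) comparing the empirical averages against $xx^* + \|x\|^2 I$ and $2xx^T$ under the Gaussian model:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 200000
x = rng.normal(size=n) + 1j*rng.normal(size=n)
x /= np.linalg.norm(x)
A = (rng.normal(size=(m, n)) + 1j*rng.normal(size=(m, n))) / np.sqrt(2)  # row r plays a_r

# Lemma A.1: E[(1/m) sum |a_r^* x|^2 a_r a_r^*] = x x^* + ||x||^2 I
c = np.abs(A.conj() @ x)**2                     # |a_r^* x|^2
E1 = (A.T * c) @ A.conj() / m                   # empirical average of |a_r^* x|^2 a_r a_r^*
T1 = np.outer(x, x.conj()) + np.eye(n)          # ||x|| = 1 here
assert np.linalg.norm(E1 - T1) < 0.2 * np.linalg.norm(T1)

# Lemma A.2: E[(1/m) sum (a_r^* x)^2 a_r a_r^T] = 2 x x^T
b = (A.conj() @ x)**2                           # (a_r^* x)^2
E2 = (A.T * b) @ A / m                          # empirical average of (a_r^* x)^2 a_r a_r^T
T2 = 2 * np.outer(x, x)
assert np.linalg.norm(E2 - T2) < 0.2 * np.linalg.norm(T2)
```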
C. Proof of Lemma 7.3

Suppose $a \in \mathbb{C}^n \sim \mathcal{N}(0, I/2) + i\,\mathcal{N}(0, I/2)$. Since the law of $a$ is invariant by unitary transformation, we may just as well take $v = e_1$ and $u = s_1 e^{i\phi_1} e_1 + s_2 e^{i\phi_2} e_2$, where $s_1, s_2$ are positive real numbers obeying $s_1^2 + s_2^2 = 1$. We have
\[
\mathbb{E}\,\bigl|\mathrm{Re}(u^* a a^* v)\bigr|^2
= \mathbb{E}\bigl[\mathrm{Re}\bigl(s_1 e^{i\phi_1}|a_1|^2 + s_2 e^{i\phi_2} a_1 \bar{a}_2\bigr)\bigr]^2
= 2 s_1^2 \cos^2(\phi_1) + \frac{1}{2} s_2^2
\]
\[
= \frac{1}{2} + \frac{3}{2} s_1^2 \cos^2(\phi_1) - \frac{1}{2} s_1^2 \sin^2(\phi_1)
= \frac{1}{2} + \frac{3}{2}\bigl(\mathrm{Re}(u^* v)\bigr)^2 - \frac{1}{2}\bigl(\mathrm{Im}(u^* v)\bigr)^2,
\]
and
\[
\mathbb{E}\bigl[\mathrm{Re}(u^* a a^* v)\,|a^* v|^2\bigr]
= \mathbb{E}\bigl[\mathrm{Re}\bigl(s_1 e^{-i\phi_1}|a_1|^2 + s_2 e^{-i\phi_2} \bar{a}_1 a_2\bigr)\,|a_1|^2\bigr]
= 2 s_1 \cos(\phi_1)
= 2\,\mathrm{Re}(u^* v).
\]
The identity (VII.7) follows from standard normal moment calculations.
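The first identity above can be confirmed by a quick simulation. This hedged sketch (illustrative sizes, not the paper's code) checks $\mathbb{E}|\mathrm{Re}(u^* a a^* v)|^2 = \tfrac{1}{2} + \tfrac{3}{2}\mathrm{Re}(u^*v)^2 - \tfrac{1}{2}\mathrm{Im}(u^*v)^2$ for random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 200000
a = (rng.normal(size=(m, n)) + 1j*rng.normal(size=(m, n))) / np.sqrt(2)  # rows ~ a
u = rng.normal(size=n) + 1j*rng.normal(size=n); u /= np.linalg.norm(u)
v = rng.normal(size=n) + 1j*rng.normal(size=n); v /= np.linalg.norm(v)

s = (a @ u.conj()) * (a.conj() @ v)             # u^* a_r a_r^* v for every sample r
emp = np.mean(np.real(s)**2)                    # empirical E |Re(u^* a a^* v)|^2
uv = np.vdot(u, v)                              # u^* v
theory = 0.5 + 1.5*np.real(uv)**2 - 0.5*np.imag(uv)**2
assert abs(emp - theory) < 0.05
```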

D. Proof of Lemma 7.4

1) The CDP Model: Write the Hessian as
\[
Y := \nabla^2 f(x) = \frac{1}{nL} \sum_{\ell=1}^{L} \sum_{k=1}^{n} W_k(D_\ell),
\]
where
\[
W_k(D) := \begin{pmatrix} D & 0 \\ 0 & D^* \end{pmatrix}
\begin{pmatrix} A_k(D) & B_k(D) \\ \bar{B}_k(D) & \bar{A}_k(D) \end{pmatrix}
\begin{pmatrix} D^* & 0 \\ 0 & D \end{pmatrix}
\]
and
\[
A_k(D) = \bigl| f_k^* D^* x \bigr|^2 f_k f_k^*, \qquad
B_k(D) = \bigl( f_k^* D^* x \bigr)^2 f_k f_k^T.
\]
It is useful to recall that
\[
\mathbb{E}[Y] = \begin{pmatrix} I + xx^* & 2xx^T \\ 2\bar{x}x^* & I + \bar{x}x^T \end{pmatrix}.
\]
Now set
\[
\widehat{Y} = \frac{1}{nL} \sum_{\ell=1}^{L} \sum_{k=1}^{n} W_k(D_\ell)\,
\mathbf{1}_{\{| f_k^* D_\ell^* x | \le \sqrt{2R\log n}\}},
\]
where $R$ is a positive scalar whose value will be determined shortly, and define the events
\[
E_1(R) = \bigl\{ \|\widehat{Y} - \mathbb{E}[Y]\| \le \epsilon \bigr\}, \qquad
E_2(R) = \bigl\{ \widehat{Y} = Y \bigr\},
\]
\[
E_3(R) = \bigl\{ | f_k^* D_\ell^* x | \le \sqrt{2R\log n} \text{ for all } (k, \ell) \bigr\}, \qquad
E = \bigl\{ \|Y - \mathbb{E}[Y]\| \le \epsilon \bigr\}.
\]
Note that $E_1 \cap E_2 \subset E$. Also, if $| f_k^* D_\ell^* x | \le \sqrt{2R\log n}$ for all pairs $(k, \ell)$, then $\widehat{Y} = Y$ and thus $E_3 \subset E_2$. Putting all of this together gives
\[
\mathbb{P}(E^c) \le \mathbb{P}(E_1^c \cup E_2^c)
\le \mathbb{P}(E_1^c) + \mathbb{P}(E_2^c)
\le \mathbb{P}(E_1^c) + \mathbb{P}(E_3^c)
\le \mathbb{P}(E_1^c) + \sum_{\ell=1}^{L} \sum_{k=1}^{n} \mathbb{P}\bigl(| f_k^* D_\ell^* x | > \sqrt{2R\log n}\bigr). \qquad \text{(A.1)}
\]
A slight modification to [16, Lemma 3.9] gives $\mathbb{P}(E_1^c) \le 1/n^3$ provided $L \ge c(R)\log^3 n$ for a sufficiently large numerical constant $c(R)$. Since Hoeffding's inequality yields $\mathbb{P}(| f_k^* D_\ell^* x | > \sqrt{2R\log n}) \le 2n^{-R}$, we have
\[
\mathbb{P}(E^c) \le n^{-3} + 2(nL)\, n^{-R}.
\]
Setting $R = 4$ completes the proof.

2) The Gaussian Model: By unitary invariance, we may take $x = e_1$. Letting $z(1)$ be the first coordinate of a vector $z$, to prove Lemma 7.4 for the Gaussian model it suffices to prove the two inequalities
\[
\Bigl\| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, a_r a_r^* - \bigl(I + e_1 e_1^T\bigr) \Bigr\| \le \frac{\delta}{4}, \qquad \text{(A.2)}
\]
\[
\Bigl\| \frac{1}{m}\sum_{r=1}^m \bar{a}_r(1)^2\, a_r a_r^T - 2\, e_1 e_1^T \Bigr\| \le \frac{\delta}{4}. \qquad \text{(A.3)}
\]
For any $\epsilon > 0$, there is a constant $C > 0$ with the property that $m \ge C \cdot n$ implies
\[
\Bigl| \frac{1}{m}\sum_{r=1}^m \bigl(|a_r(1)|^2 - 1\bigr) \Bigr| \le \epsilon, \qquad
\Bigl| \frac{1}{m}\sum_{r=1}^m \bigl(|a_r(1)|^4 - 2\bigr) \Bigr| < \epsilon, \qquad
\frac{1}{m}\sum_{r=1}^m |a_r(1)|^6 < 10,
\]
with probability at least $1 - 3n^{-2}$; this is a consequence of Chebyshev's inequality. Moreover a union bound gives
\[
\max_{1 \le r \le m} |a_r(1)| \le \sqrt{10 \log m}
\]
with probability at least $1 - n^{-2}$. Denote by $E_0$ the event on which the above inequalities hold. We show that there is another event $E_1$ of high probability such that (A.2) and (A.3) hold on $E_0 \cap E_1$. Our proof strategy is similar to that of [64, Th. 39]. To prove (A.2), we will show that with high probability, for any $y \in \mathbb{C}^n$ obeying $\|y\| = 1$, we have
\[
I_0(y) := \Bigl| y^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, a_r a_r^* - \bigl(I + e_1 e_1^T\bigr) \Bigr) y \Bigr|
= \Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, |a_r^* y|^2 - \bigl(1 + |y(1)|^2\bigr) \Bigr| \le \frac{\delta}{4}. \qquad \text{(A.4)}
\]
For this purpose, partition $y$ in the form $y = (y(1), \tilde{y})$ with $\tilde{y} \in \mathbb{C}^{n-1}$, and decompose the inner product as
\[
|a_r^* y|^2 = |a_r(1)|^2\, |y(1)|^2
+ 2\,\mathrm{Re}\bigl( a_r(1)\,\overline{y(1)}\; \tilde{a}_r^* \tilde{y} \bigr)
+ |\tilde{a}_r^* \tilde{y}|^2.
\]
This gives
\[
I_0(y) = \Bigl| \frac{1}{m}\sum_{r=1}^m \bigl(|a_r(1)|^4 - 2\bigr)|y(1)|^2
+ 2\,\mathrm{Re}\Bigl( \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, a_r(1)\,\overline{y(1)}\; \tilde{a}_r^* \tilde{y} \Bigr)
+ \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, |\tilde{a}_r^* \tilde{y}|^2 - \|\tilde{y}\|^2 \Bigr) \Bigr|,
\]
which follows from $|y(1)|^2 + \|\tilde{y}\|^2 = 1$ since $y$ has unit norm.
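As an aside, the concentration claimed in (A.2) is easy to observe numerically. The following is a minimal sketch under the Gaussian model (not from the paper; the tolerance is loose and illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 100000
a = (rng.normal(size=(m, n)) + 1j*rng.normal(size=(m, n))) / np.sqrt(2)  # rows ~ a_r
w = np.abs(a[:, 0])**2                          # |a_r(1)|^2
M = (a.T * w) @ a.conj() / m                    # (1/m) sum |a_r(1)|^2 a_r a_r^*
T = np.eye(n); T[0, 0] = 2.0                    # I + e1 e1^T
assert np.linalg.norm(M - T, 2) < 0.25          # operator-norm deviation is small
```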

This gives
\[
I_0(y) \le \Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2 - 1 \Bigr| \|\tilde{y}\|^2
+ \Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^4 - 2 \Bigr| |y(1)|^2
+ 2\Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, a_r(1)\,\overline{y(1)}\; \tilde{a}_r^* \tilde{y} \Bigr|
+ \Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2 \bigl( |\tilde{a}_r^* \tilde{y}|^2 - \|\tilde{y}\|^2 \bigr) \Bigr|
\]
\[
\le 2\epsilon
+ 2\Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2\, a_r(1)\,\overline{y(1)}\; \tilde{a}_r^* \tilde{y} \Bigr|
+ \Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2 \bigl( |\tilde{a}_r^* \tilde{y}|^2 - \|\tilde{y}\|^2 \bigr) \Bigr|. \qquad \text{(A.5)}
\]
We now turn our attention to the last two terms of (A.5). For the second term, the ordinary Hoeffding inequality ([64, Proposition 10]) gives that for any constants $\delta_0$ and $\gamma$, there exists a constant $C(\delta_0, \gamma)$, such that for
\[
m \ge C(\delta_0, \gamma)\, \sqrt{n \sum_{r=1}^m |a_r(1)|^6},
\]
\[
\Bigl| \frac{1}{m}\sum_{r=1}^m \bigl(|a_r(1)|^2\, a_r(1)\,\overline{y(1)}\bigr)\, \tilde{a}_r^* \tilde{y} \Bigr|
\le \delta_0\, |y(1)|\, \|\tilde{y}\| \le \frac{\delta_0}{2}
\]
holds with probability at least $1 - 3e^{-2\gamma n}$. To control the final term, we apply the Bernstein-type inequality ([64, Proposition 16]) to assert the following: for any positive constants $\delta_0$ and $\gamma$, there exists a constant $C(\delta_0, \gamma)$, such that for
\[
m \ge C(\delta_0, \gamma)\Bigl( \sqrt{n \sum_{r=1}^m |a_r(1)|^4} + n \max_{1 \le r \le m} |a_r(1)|^2 \Bigr),
\]
\[
\Bigl| \frac{1}{m}\sum_{r=1}^m |a_r(1)|^2 \bigl( |\tilde{a}_r^* \tilde{y}|^2 - \|\tilde{y}\|^2 \bigr) \Bigr|
\le \delta_0\, \|\tilde{y}\|^2 \le \delta_0
\]
holds with probability at least $1 - 2e^{-2\gamma n}$.

Therefore, for any unit norm vector $y$,
\[
I_0(y) \le 2\epsilon + 2\delta_0 \qquad \text{(A.6)}
\]
holds with probability at least $1 - 5e^{-2\gamma n}$. By [64, Lemma 5.4], we can bound the operator norm via an $\epsilon$-net argument:
\[
\max_{\|y\| = 1} I_0(y) \le 2 \max_{y \in \mathcal{N}} I_0(y) \le 4\epsilon + 4\delta_0,
\]
where $\mathcal{N}$ is a $1/4$-net of the unit sphere in $\mathbb{C}^n$. By applying the union bound and choosing appropriate $\delta_0$, $\epsilon$ and $\gamma$, (A.2) holds with probability at least $1 - 5e^{-\gamma n}$, as long as
\[
m \ge C' \Bigl( \sqrt{n \sum_{r=1}^m |a_r(1)|^6} + \sqrt{n \sum_{r=1}^m |a_r(1)|^4} + n \max_{1 \le r \le m} |a_r(1)|^2 \Bigr).
\]
On $E_0$ this inequality follows from $m \ge C \cdot n \log n$ provided $C$ is sufficiently large. In conclusion, (A.2) holds with probability at least $1 - 5e^{-\gamma n} - 4n^{-2}$.

The proof of (A.3) is similar. The only difference is that the random matrix is not Hermitian, so we work with
\[
I_0(u, v) = \Bigl| u^* \Bigl( \frac{1}{m}\sum_{r=1}^m \bar{a}_r(1)^2\, a_r a_r^T - 2\, e_1 e_1^T \Bigr) v \Bigr|,
\]
where $u$ and $v$ are unit vectors.

E. Proof of Corollary 7.5

It follows from $\|\nabla^2 f(x) - \mathbb{E}[\nabla^2 f(x)]\| \le \delta$ that $\nabla^2 f(x) \preceq \mathbb{E}[\nabla^2 f(x)] + \delta I$. Therefore, using the fact that for any complex scalar $c$, $\mathrm{Re}(c)^2 = \frac{1}{2}|c|^2 + \frac{1}{2}\mathrm{Re}(c^2)$, we have
\[
\frac{1}{m}\sum_{r=1}^m \mathrm{Re}\bigl(h^* a_r a_r^* x\bigr)^2
= \frac{1}{4} \begin{pmatrix} h \\ \bar{h} \end{pmatrix}^*
\Biggl[ \frac{1}{m}\sum_{r=1}^m
\begin{pmatrix} |a_r^* x|^2\, a_r a_r^* & (a_r^* x)^2\, a_r a_r^T \\
\overline{(a_r^* x)^2}\, \bar{a}_r a_r^* & |a_r^* x|^2\, \bar{a}_r a_r^T \end{pmatrix}
\Biggr] \begin{pmatrix} h \\ \bar{h} \end{pmatrix}
\]
\[
\le \frac{1}{4} \begin{pmatrix} h \\ \bar{h} \end{pmatrix}^*
\Biggl[ I_{2n} + \frac{3}{2}\begin{pmatrix} x \\ \bar{x} \end{pmatrix}\begin{pmatrix} x \\ \bar{x} \end{pmatrix}^*
- \frac{1}{2}\begin{pmatrix} x \\ -\bar{x} \end{pmatrix}\begin{pmatrix} x \\ -\bar{x} \end{pmatrix}^* \Biggr]
\begin{pmatrix} h \\ \bar{h} \end{pmatrix}
+ \frac{\delta}{4} \begin{pmatrix} h \\ \bar{h} \end{pmatrix}^* \begin{pmatrix} h \\ \bar{h} \end{pmatrix}
\]
\[
= \frac{1}{2}\|h\|^2 + \frac{3}{2}\,\mathrm{Re}(x^* h)^2 - \frac{1}{2}\,\mathrm{Im}(x^* h)^2 + \frac{\delta}{2}\|h\|^2.
\]
The other inequality is established in a similar fashion.

F. Proof of Corollary 7.6

In the proof of Lemma 7.4, we established that with high probability
\[
\Bigl\| \frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* - \bigl(xx^* + \|x\|^2 I\bigr) \Bigr\| \le \delta.
\]
Therefore,
\[
\frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* \succeq xx^* + \|x\|^2 I - \delta I.
\]
This concludes the proof of one side. The other side is similar.

G. Proof of Lemma 7.7

Note that
\[
\|\nabla f(z) - \mathbb{E}[\nabla f(z)]\|
= \max_{u \in \mathbb{C}^n,\ \|u\| = 1} \bigl| \langle u,\, \nabla f(z) - \mathbb{E}[\nabla f(z)] \rangle \bigr|.
\]
Therefore, to establish the concentration of $\nabla f(z)$ around its mean we proceed by bounding $|\langle u, \nabla f(z) - \mathbb{E}[\nabla f(z)]\rangle|$. From Section VII-B,
\[
\nabla f(z) = \frac{1}{m}\sum_{r=1}^m \bigl(|\langle a_r, z\rangle|^2 - y_r\bigr)(a_r a_r^*)z.
\]
Define $h := e^{-i\phi(z)} z - x$ and $w := e^{-i\phi(z)} u$; we have
\[
\langle u, \nabla f(z) \rangle
= w^* \Bigl( \frac{1}{m}\sum_{r=1}^m (a_r^* x)^2\, a_r a_r^T \Bigr) \bar{h}
+ w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* \Bigr) h
+ 2\, w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* \Bigr) x
\]
\[
+ w^* \Bigl( \frac{1}{m}\sum_{r=1}^m (a_r^* h)^2\, a_r a_r^T \Bigr) \bar{x}
+ w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* \Bigr) h. \qquad \text{(A.7)}
\]
By Lemma 7.2 we also have
\[
\langle u, \mathbb{E}[\nabla f(z)] \rangle
= w^* \bigl( 2xx^T \bigr) \bar{h}
+ w^* \bigl( xx^* + \|x\|^2 I \bigr) h
+ 2\, w^* \bigl( hh^* + \|h\|^2 I \bigr) x
+ w^* \bigl( 2hh^T \bigr) \bar{x}
+ w^* \bigl( hh^* + \|h\|^2 I \bigr) h. \qquad \text{(A.8)}
\]
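The concentration being established here can be checked numerically. The sketch below (illustrative sizes and tolerance, not the paper's code) compares the empirical gradient with its mean $\mathbb{E}[\nabla f(z)] = 2(\|z\|^2 - 1)z + (I - xx^*)z$ from Lemma 7.2, under the Gaussian model with $\|x\| = 1$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 8, 200000
x = rng.normal(size=n) + 1j*rng.normal(size=n); x /= np.linalg.norm(x)
h = 0.1*(rng.normal(size=n) + 1j*rng.normal(size=n))
z = x + h                                        # a point near x
a = (rng.normal(size=(m, n)) + 1j*rng.normal(size=(m, n))) / np.sqrt(2)
az, ax = a.conj() @ z, a.conj() @ x              # a_r^* z and a_r^* x
c = (np.abs(az)**2 - np.abs(ax)**2) * az         # (|a_r^* z|^2 - y_r)(a_r^* z)
grad = a.T @ c / m                               # (1/m) sum (|a_r^* z|^2 - y_r) a_r a_r^* z
expected = 2*(np.linalg.norm(z)**2 - 1)*z + (z - x*np.vdot(x, z))
assert np.linalg.norm(grad - expected) < 0.1
```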

Combining (A.7) and (A.8) together with the triangle inequality and Lemma 7.4 gives
\[
\bigl| \langle u,\, \nabla f(z) - \mathbb{E}[\nabla f(z)] \rangle \bigr|
\le \Bigl| w^* \Bigl( \frac{1}{m}\sum_{r=1}^m (a_r^* x)^2\, a_r a_r^T - 2xx^T \Bigr) \bar{h} \Bigr|
+ \Bigl| w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* - \bigl(xx^* + \|x\|^2 I\bigr) \Bigr) h \Bigr|
\]
\[
+ 2\, \Bigl| w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* - \bigl(hh^* + \|h\|^2 I\bigr) \Bigr) x \Bigr|
+ \Bigl| w^* \Bigl( \frac{1}{m}\sum_{r=1}^m (a_r^* h)^2\, a_r a_r^T - 2hh^T \Bigr) \bar{x} \Bigr|
+ \Bigl| w^* \Bigl( \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* - \bigl(hh^* + \|h\|^2 I\bigr) \Bigr) h \Bigr|
\]
\[
\le \Bigl\| \frac{1}{m}\sum_{r=1}^m (a_r^* x)^2\, a_r a_r^T - 2xx^T \Bigr\| \|h\|
+ \Bigl\| \frac{1}{m}\sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* - \bigl(xx^* + \|x\|^2 I\bigr) \Bigr\| \|h\|
+ 2\, \Bigl\| \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* - \bigl(hh^* + \|h\|^2 I\bigr) \Bigr\|
\]
\[
+ \Bigl\| \frac{1}{m}\sum_{r=1}^m (a_r^* h)^2\, a_r a_r^T - 2hh^T \Bigr\|
+ \Bigl\| \frac{1}{m}\sum_{r=1}^m |a_r^* h|^2\, a_r a_r^* - \bigl(hh^* + \|h\|^2 I\bigr) \Bigr\| \|h\|
\le 3\delta\, \|h\| \bigl(1 + \|h\|\bigr) \le \frac{9}{2}\,\delta\, \|h\|.
\]

H. Proof of Lemma 7.8

The result for the CDP model follows from [16, Lemma 3.3]. For the Gaussian model, it is a consequence of standard results concerning the deviation of the sample covariance matrix from its mean; see [64, Th. 5.39].

APPENDIX B
THE POWER METHOD

We use the power method (Algorithm 3) with a random initialization to compute the first eigenvector of $Y = A\,\mathrm{diag}\{y\}\,A^*$. Since each iteration of the power method asks to compute the matrix-vector product $Yz = A\,\mathrm{diag}\{y\}\,A^* z$, we simply need to apply $A$ and $A^*$ to an arbitrary vector. In the Gaussian model, this costs $2mn$ multiplications while in the CDP model the cost is that of $2L$ $n$-point FFTs. We now turn our attention to the number of iterations required to achieve a sufficiently accurate initialization.

Algorithm 3 Power Method
Input: matrix $Y$.
Initialize $v_0$, a random vector on the unit sphere of $\mathbb{C}^n$.
for $\tau = 1$ to $T$ do
  $v_\tau = Y v_{\tau-1} / \|Y v_{\tau-1}\|$
end for
Output: $\tilde{z}_0 = v_T$.

Standard results from numerical linear algebra show that after $k$ iterations of the power method, the accuracy of the eigenvector is $O(\tan\theta_0\,(\lambda_2/\lambda_1)^k)$, where $\lambda_1$ and $\lambda_2$ are the top two eigenvalues of the positive semidefinite matrix $Y$, and $\theta_0$ is the angle between the initial guess and the top eigenvector. Hence, we would need on the order of $\log(n/\epsilon)/\log(\lambda_1/\lambda_2)$ iterations for $\epsilon$ accuracy. Under the stated assumptions, Lemma 7.4 bounds the eigenvalue gap from below by a numerical constant, so a few iterations of the power method yield accurate estimates.
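The power method described above can be sketched in a few lines of NumPy. This is a minimal illustration (not the paper's code); the test matrix mimics the population matrix $\mathbb{E}[Y] = I + xx^*$ of the Gaussian model, whose eigenvalue gap $\lambda_1/\lambda_2 = 2$:

```python
import numpy as np

def power_method(Y, T=60, seed=0):
    # Algorithm 3: start from a random unit vector, iterate v <- Yv / ||Yv||
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    v = rng.normal(size=n) + 1j*rng.normal(size=n)
    v /= np.linalg.norm(v)                       # v_0 on the unit sphere of C^n
    for _ in range(T):
        v = Y @ v
        v /= np.linalg.norm(v)                   # v_tau = Y v_{tau-1} / ||Y v_{tau-1}||
    return v                                     # z~_0 = v_T

# Sanity check on a matrix shaped like E[Y] = I + x x^*
rng = np.random.default_rng(1)
u = rng.normal(size=6) + 1j*rng.normal(size=6)
u /= np.linalg.norm(u)
Y = np.eye(6) + np.outer(u, u.conj())
v = power_method(Y)
assert abs(np.vdot(u, v)) > 0.999                # aligned with the top eigenvector
```

In the actual algorithm one would never form $Y$ explicitly; only the products $A z$ and $A^* z$ are needed, as noted above.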
ACKNOWLEDGMENT

M. S. would like to thank Andrea Montanari for helpful discussions and for the class [1] which inspired him to explore provable algorithms based on non-convex schemes. He would also like to thank Alexandre d'Aspremont, Fajwel Fogel and Filipe Maia for sharing some useful code regarding 3D molecule reconstructions, Ben Recht for introducing him to reference [53], and Amit Singer for bringing the paper [44] to his attention.

REFERENCES

[1] EE 378B—Inference, Estimation, and Information Processing. [Online]. Available:
[2] A. Agarwal, S. Negahban, and M. J. Wainwright, “Fast global convergence of gradient methods for high-dimensional statistical recovery,” Ann. Statist., vol. 40, no. 5, pp. 2452–2482, 2012.
[3] A. Ahmed, B. Recht, and J. Romberg. (2012). “Blind deconvolution using convex programming.” [Online]. Available:
[4] R. Balan, “On signal reconstruction from its spectrogram,” in Proc. 44th Annu. Conf. Inf. Sci. Syst. (CISS), Mar. 2010, pp. 1–4.
[5] R. Balan, P. Casazza, and D. Edidin, “On signal reconstruction without phase,” Appl. Comput. Harmonic Anal., vol. 20, no. 3, pp. 345–356, 2006.
[6] L. Balzano and S. J. Wright. (2013). “Local convergence of an algorithm for subspace identification from partial data.” [Online]. Available:
[7] A. S. Bandeira, J. Cahill, D. G. Mixon, and A. A. Nelson. (2013). “Saving phase: Injectivity and stability for phase retrieval.” [Online]. Available:
[8] H. H. Bauschke, P. L. Combettes, and D. R. Luke, “Hybrid projection–reflection method for phase retrieval,” J. Opt. Soc. Amer. A, vol. 20, no. 6, pp. 1025–1034, 2003.
[9] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[10] M. Bayati and A. Montanari, “The LASSO risk for Gaussian matrices,” IEEE Trans. Inf. Theory, vol. 58, no. 4, pp. 1997–2017, Apr. 2012.
[11] V. Bentkus, “An inequality for tail probabilities of martingales with differences bounded from one side,” J. Theoretical Probab., vol. 16, no. 1, pp. 161–173, 2003.
[12] O. Bunk et al., “Diffractive imaging for periodic samples: Retrieving one-dimensional concentration profiles across microfluidic channels,” Acta Crystallograph. A, Found. Crystallogr., vol. 63, pp. 306–314, Jul. 2007.
[13] J. Cahill, P. G. Casazza, J. Peterson, and L. Woodland. (2013). “Phase retrieval by projections.” [Online]. Available: abs/1305.6226
[14] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix completion,” SIAM J. Imag. Sci., vol. 6, no. 1, pp. 199–225, 2013.

[15] E. J. Candès and X. Li, “Solving quadratic equations via PhaseLift when there are about as many equations as unknowns,” Found. Comput. Math., vol. 14, no. 5, pp. 1017–1026, 2014.
[16] E. J. Candès, X. Li, and M. Soltanolkotabi. (2013). “Phase retrieval from coded diffraction patterns.” [Online]. Available:
[17] E. J. Candès, T. Strohmer, and V. Voroninski, “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, 2013.
[18] A. Chai, M. Moscoso, and G. Papanicolaou, “Array imaging using intensity-only measurements,” Inverse Problems, vol. 27, no. 1, p. 015005, 2011.
[19] A. Conca, D. Edidin, M. Hering, and C. Vinzant. (2013). “An algebraic characterization of injectivity in phase retrieval.” [Online]. Available:
[20] J. V. Corbett, “The Pauli problem, state reconstruction and quantum-real numbers,” Rep. Math. Phys., vol. 57, no. 1, pp. 53–68, 2006.
[21] J. C. Dainty and J. R. Fienup, “Phase retrieval and image reconstruction for astronomy,” in Image Recovery: Theory and Application, H. Stark, Ed. San Diego, CA, USA: Academic, 1987, pp. 231–275.
[22] L. Demanet and V. Jugnon. (2013). “Convex recovery from interferometric measurements.” [Online]. Available:
[23] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, 2009.
[24] J. R. Fienup, “Reconstruction of an object from the modulus of its Fourier transform,” Opt. Lett., vol. 3, no. 1, pp. 27–29, 1978.
[25] J. R. Fienup, “Fine resolution imaging of space objects,” Radar Opt. Division, Environ. Res. Inst. Michigan, Ann Arbor, MI, USA, Final Sci. Rep. 01/1982-1, 1982.
[26] J. R. Fienup, “Phase retrieval algorithms: A comparison,” Appl. Opt., vol. 21, no. 15, pp. 2758–2769, 1982.
[27] J. R. Fienup, “Comments on ‘The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform,’” IEEE Trans. Acoust., Speech, Signal Process., vol. 31, no. 3, pp. 738–739, Jun. 1983.
[28] F. Fogel, I. Waldspurger, and A. d’Aspremont. (2013). “Phase retrieval for imaging problems.” [Online]. Available: abs/1304.7735
[29] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of the phase from image and diffraction plane pictures,” Optik, vol. 35, pp. 237–246, 1972.
[30] D. Gross, F. Krahmer, and R. Kueng. (2013). “A partial derandomization of PhaseLift using spherical designs.” [Online]. Available:
[31] D. Gross, F. Krahmer, and R. Kueng. (2014). “Improved recovery guarantees for phase retrieval from coded diffraction patterns.” [Online]. Available:
[32] M. Hardt. (2013). “On the provable convergence of alternating minimization for matrix completion.” [Online]. Available:
[33] R. W. Harrison, “Phase problem in crystallography,” J. Opt. Soc. Amer. A, vol. 10, no. 5, pp. 1046–1055, 1993.
[34] M. H. Hayes, “The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 2, pp. 140–154, Apr. 1982.
[35] T. Heinosaari, L. Mazzarella, and M. M. Wolf, “Quantum tomography under prior information,” Commun. Math. Phys., vol. 318, no. 2, pp. 355–374, 2013.
[36] K. Jaganathan, S. Oymak, and B. Hassibi, “On robust phase retrieval for sparse signals,” in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Oct. 2012, pp. 794–799.
[37] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” in Proc. 45th Annu. ACM Symp. Theory Comput., 2013, pp. 665–674.
[38] R. H. Keshavan, “Efficient algorithms for collaborative filtering,” Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, 2012.
[39] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2980–2998, Jun. 2010.
[40] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” J. Mach. Learn. Res., vol. 11, pp. 2057–2078, Mar. 2010.
[41] K. Lee, Y. Wu, and Y. Bresler. (2013). “Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization.” [Online].
[42] S. Marchesini, “Invited article: A unified evaluation of iterative projection algorithms for phase retrieval,” Rev. Sci. Instrum., vol. 78, no. 1, p. 011301, 2007.
[43] S. Marchesini, “Phase retrieval and saddle-point optimization,” J. Opt. Soc. Amer. A, vol. 24, no. 10, pp. 3289–3296, 2007.
[44] S. Marchesini, Y.-C. Tu, and H.-T. Wu. (2014). “Alternating projection, ptychographic imaging and phase synchronization.” [Online]. Available:
[45] J. Miao, P. Charalambous, J. Kirz, and D. Sayre, “Extending the methodology of X-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens,” Nature, vol. 400, no. 6742, pp. 342–344, 1999.
[46] J. Miao, T. Ishikawa, B. Johnson, E. H. Anderson, B. Lai, and K. O. Hodgson, “High resolution 3D X-ray diffraction microscopy,” Phys. Rev. Lett., vol. 89, no. 8, p. 088303, Aug. 2002.
[47] J. Miao, T. Ishikawa, Q. Shen, and T. Earnest, “Extending X-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes,” Annu. Rev. Phys. Chem., vol. 59, pp. 387–410, May 2008.
[48] L. Waller, L. Tian, and G. Barbastathis, “Transport of intensity phase-amplitude imaging with higher order intensity derivatives,” Opt. Exp., vol. 18, no. 12, pp. 12552–12561, 2010.
[49] L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for Fourier ptychography with an LED array microscope,” Biomed. Opt. Exp., vol. 5, pp. 2376–2389, 2014.
[50] R. P. Millane, “Phase retrieval in crystallography and optics,” J. Opt. Soc. Amer. A, vol. 7, no. 3, pp. 394–411, 1990.
[51] Y. Mroueh. (2014). “Robust phase retrieval and super-resolution from one bit coded diffraction patterns.” [Online]. Available:
[52] Y. Mroueh and L. Rosasco. (2013). “Quantization and greed are good: One bit phase retrieval, robustness and greedy refinements.” [Online]. Available:
[53] K. G. Murty and S. N. Kabadi, “Some NP-complete problems in quadratic and nonlinear programming,” Math. Program., vol. 39, no. 2, pp. 117–129, 1987.
[54] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Philadelphia, PA, USA: SIAM, 2001.
[55] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization), vol. 87. Boston, MA, USA: Kluwer, 2004.
[56] P. Netrapalli, P. Jain, and S. Sanghavi. (2013). “Phase retrieval using alternating minimization.” [Online]. Available:
[57] H. Ohlsson, A. Y. Yang, R. Dong, and S. S. Sastry. (2011). “Compressive phase retrieval from squared output measurements via semidefinite programming.” [Online]. Available:
[58] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. (2012). “Simultaneously structured models with application to sparse and low-rank matrices.” [Online]. Available:
[59] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul./Aug. 2011, pp. 2168–2172.
[60] J. Ranieri, A. Chebira, Y. M. Lu, and M. Vetterli. (2013). “Phase retrieval for sparse signals: Uniqueness conditions.” [Online]. Available:
[61] H. Reichenbach, Philosophic Foundations of Quantum Mechanics. Berkeley, CA, USA: Univ. California Press, 1965.
[62] J. L. C. Sanz, T. S. Huang, and T. Wu, “A note on iterative Fourier transform phase reconstruction from magnitude,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1251–1254, Dec. 1984.
[63] P. Schniter and S. Rangan, “Compressive phase retrieval via generalized approximate message passing,” in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), 2012, pp. 815–822.
[64] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev. (2014). “Phase retrieval with application to optical imaging.” [Online]. Available:
[65] M. Soltanolkotabi, “Algorithms and theory for clustering and nonconvex quadratic programming,” Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, 2014.
[66] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” in Compressed Sensing: Theory and Applications, Y. Eldar and G. Kutyniok, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[67] I. Waldspurger, A. d’Aspremont, and S. Mallat. (2012). “Phase recovery, maxcut and complex semidefinite programming.” [Online]. Available:
[68] A. Walther, “The question of phase retrieval in optics,” Opt. Acta, Int. J. Opt., vol. 10, no. 1, pp. 41–49, 1963.
[69] G.-Z. Yang, B.-Z. Dong, B.-Y. Gu, J.-Y. Zhuang, and O. K. Ersoy, “Gerchberg–Saxton and Yang–Gu algorithms for phase retrieval in a nonunitary transform system: A comparison,” Appl. Opt., vol. 33, no. 2, pp. 209–218, 1994.

Emmanuel J. Candès is the Barnum-Simons Chair in Mathematics and Statistics, and professor of electrical engineering (by courtesy) at Stanford University. Up until 2009, he was the Ronald and Maxine Linde Professor of Applied and Computational Mathematics at the California Institute of Technology. His research interests are in applied mathematics, statistics, information theory, signal processing and mathematical optimization with applications to the imaging sciences, scientific computing and inverse problems. Candès graduated from the Ecole Polytechnique in 1993 with a degree in science and engineering, and received his Ph.D. in statistics from Stanford University in 1998. Emmanuel received the 2006 Alan T. Waterman Award from NSF, which recognizes the achievements of early-career scientists. Other honors include the 2013 Dannie Heineman Prize presented by the Academy of Sciences at Göttingen, the 2010 George Polya Prize awarded by the Society of Industrial and Applied Mathematics (SIAM), and the 2015 AMS-SIAM George David Birkhoff Prize in Applied Mathematics. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences.

Xiaodong Li is a postdoctoral research associate in the Wharton Statistics Department, University of Pennsylvania. He received his B.S. in mathematics from Peking University, Beijing, China in 2008. He obtained his Ph.D. in mathematics from Stanford University in 2013. His research interests are in statistics, mathematical signal processing, machine learning and optimization.

Mahdi Soltanolkotabi obtained his B.S. in electrical engineering at Sharif University of Technology, Tehran, Iran in 2009. He completed his M.S. and Ph.D. in electrical engineering at Stanford University in 2011 and 2014. He was a postdoctoral researcher in the Electrical Engineering and Computer Science and Statistics departments at the University of California, Berkeley during the 2014-2015 academic year. He is currently an assistant professor in the Ming Hsieh Department of Electrical Engineering at the University of Southern California. His research interests include mathematical optimization, machine learning, signal processing, high-dimensional statistics, and geometry with emphasis on applications in information and physical sciences.