
Phase Retrieval via Wirtinger Flow: Theory and Algorithms

Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi

Abstract—We study the problem of recovering the phase from magnitude measurements; specifically, we wish to reconstruct a complex-valued signal x ∈ ℂⁿ about which we have phaseless samples of the form yr = |⟨ar, x⟩|², r = 1, . . . , m (knowledge of the phase of these samples would yield a linear system). This paper develops a nonconvex formulation of the phase retrieval problem as well as a concrete solution algorithm. In a nutshell, this algorithm starts with a careful initialization obtained by means of a spectral method, and then refines this initial estimate by iteratively applying novel update rules, which have low computational complexity, much like in a gradient descent scheme. The main contribution is that this algorithm is shown to rigorously allow the exact retrieval of phase information from a nearly minimal number of random measurements. Indeed, the sequence of successive iterates provably converges to the solution at a geometric rate so that the proposed scheme is efficient both in terms of computational and data resources. In theory, a variation on this scheme leads to a near-linear time algorithm for a physically realizable model based on coded diffraction patterns. We illustrate the effectiveness of our methods with various experiments on image data. Underlying our analysis are insights for the analysis of nonconvex optimization schemes that may have implications for computational problems beyond phase retrieval.

Index Terms—Non-convex optimization, convergence to global optimum, phase retrieval, Wirtinger derivatives.

I. INTRODUCTION

WE ARE interested in solving quadratic equations of the form

    yr = |⟨ar, z⟩|²,  r = 1, 2, . . . , m,   (I.1)

where z ∈ ℂⁿ is the decision variable, the ar ∈ ℂⁿ are known sampling vectors, and the yr ∈ ℝ are observed measurements. This problem is a general instance of a nonconvex quadratic program (QP). Nonconvex QPs have been observed to occur frequently in science and engineering and, consequently, their study is of importance. For example, a class of combinatorial optimization problems with Boolean decision variables may be cast as QPs [54, Sec. 4.3.1]. Focusing on the literature on physical sciences, the problem (I.1) is generally referred to as the phase retrieval problem. To understand this connection, recall that most detectors can only record the intensity of the light field and not its phase. Thus, when a small object is illuminated by a quasi-monochromatic wave, detectors measure the magnitude of the diffracted light. In the far field, the diffraction pattern happens to be the Fourier transform of the object of interest—this is called Fraunhofer diffraction—so that in discrete space, (I.1) models the data acquisition mechanism in a coherent diffraction imaging setup; one can identify z with the object of interest, ar with complex sinusoids, and yr with the recorded data. Hence, we can think of (I.1) as a generalized phase retrieval problem. As is well known, the phase retrieval problem arises in many areas of science and engineering such as X-ray crystallography [33], [50], microscopy [47]–[49], astronomy [21], diffraction and array imaging [12], [18], and optics [68]. Other fields of application include acoustics [4], [5], blind channel estimation in wireless communications [3], [60], interferometry [22], quantum mechanics [20], [61] and quantum information [35]. We refer the reader to the tutorial paper [64] for a recent review of the theory and practice of phase retrieval.

Because of the practical significance of the phase retrieval problem in imaging science, the community has developed methods for recovering a signal x ∈ ℂⁿ from data of the form yr = |⟨ar, x⟩|² in the special case where one samples the (square) modulus of its Fourier transform. In this setup, the most widely used method is perhaps the error reduction algorithm and its generalizations, which were derived from the pioneering research of Gerchberg and Saxton [29] and Fienup [24], [26]. The Gerchberg-Saxton algorithm starts from a random initial estimate and proceeds by iteratively applying a pair of 'projections': at each iteration, the current guess is projected in data space so that the magnitude of its frequency spectrum matches the observations; the signal is then projected in signal space to conform to some a-priori knowledge about its structure. In a typical instance, our knowledge may be that the signal is real-valued, nonnegative and spatially limited. First, while error reduction methods often work well in practice, the algorithms seem to rely heavily on a priori information about the signals, see [25], [27], [34], [62]. Second, since these algorithms can be cast as alternating projections onto nonconvex sets [8] (the constraint in Fourier space is not convex), fundamental mathematical questions concerning their convergence remain, for the most part, unresolved; we refer to Section III-B for further discussion.

Manuscript received July 19, 2014; accepted December 27, 2014. Date of publication February 3, 2015; date of current version March 13, 2015.
E. J. Candès is with the Department of Mathematics and Statistics, Stanford University, Stanford, CA 94305 USA (e-mail: candes@stanford.edu).
X. Li is with the Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104-6340 USA (e-mail: xiaodli@wharton.upenn.edu).
M. Soltanolkotabi is with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2560 USA (e-mail: soltanol@usc.edu).
Communicated by Y. Ma, Associate Editor for Signal Processing.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2015.2399924

0018-9448 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1986 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

On the theoretical side, several combinatorial optimization problems—optimization programs with discrete design variables which take on integer or Boolean values—can be cast as solving quadratic equations or as minimizing a linear objective subject to quadratic inequalities. In their most general form these problems are known to be notoriously difficult (NP-hard) [54, Sec. 4.3]. Nevertheless, many heuristics have been developed for addressing such problems.¹ One popular heuristic is based on a class of convex relaxations known as Shor's relaxations [54, Sec. 4.3.1], which can be solved using tractable semi-definite programming (SDP). For certain random models, some recent SDP relaxations such as PhaseLift [14] are known to provide exact solutions (up to global phase) to the generalized phase retrieval problem using a near minimal number of sampling vectors [16], [17]. While in principle SDP based relaxations offer tractable solutions, they become computationally prohibitive as the dimension of the signal increases. Indeed, for a large number of unknowns in the tens of thousands, say, the memory requirements are far out of reach of desktop computers so that these SDP relaxations are de facto impractical.

II. ALGORITHM: WIRTINGER FLOW

This paper introduces an approach to phase retrieval based on non-convex optimization as well as a solution algorithm, which has two components: (1) a careful initialization obtained by means of a spectral method, and (2) a series of updates refining this initial estimate by iteratively applying a novel update rule, much like in a gradient descent scheme. We refer to the combination of these two steps, introduced in reverse order below, as the Wirtinger flow (WF) algorithm.

A. Minimization of a Non-Convex Objective

Let ℓ(x, y) be a loss function measuring the misfit between both its scalar arguments. If the loss function is non-negative and vanishes only when x = y, then a solution to the generalized phase retrieval problem (I.1) is any solution to

    minimize  f(z) := (1/2m) Σ_{r=1}^{m} ℓ(yr, |ar∗ z|²),   z ∈ ℂⁿ.   (II.1)

Although one could study many loss functions, we shall focus in this paper on the simple quadratic loss ℓ(x, y) = (x − y)². Admittedly, the formulation (II.1) does not make the problem any easier since the function f is not convex. Minimizing non-convex objectives, which may have very many stationary points, is known to be NP-hard in general. In fact, even establishing convergence to a local minimum or stationary point can be quite challenging; please see [53] for an example where convergence to a local minimum of a degree-four polynomial is known to be NP-hard.² As a side remark, deciding whether a stationary point of a polynomial of degree four is a local minimizer is already known to be NP-hard.

Our approach to (II.1) is simply stated: start with an initialization z0, and for τ = 0, 1, 2, . . ., inductively define

    zτ+1 = zτ − (μτ+1/‖z0‖²) · (1/m) Σ_{r=1}^{m} (|ar∗ zτ|² − yr)(ar ar∗) zτ
         := zτ − (μτ+1/‖z0‖²) ∇f(zτ).   (II.2)

If the decision variable z and the sampling vectors were all real valued, the term between parentheses would be the gradient of f divided by two, as our notation suggests. However, since f(z) is a mapping from ℂⁿ to ℝ, it is not holomorphic and hence not complex-differentiable. Nevertheless, this term can still be viewed as a gradient based on Wirtinger derivatives reviewed in Section VI. Hence, (II.2) is a form of steepest descent and the parameter μτ+1 can be interpreted as a step size (note nonetheless that the effective step size is also inversely proportional to the magnitude of the initial guess).

Algorithm 1 Wirtinger Flow: Initialization
  Input: Observations {yr} ∈ ℝᵐ.
  Set
      λ² = n · (Σ_r yr) / (Σ_r ‖ar‖²).
  Set z0, normalized to ‖z0‖ = λ, to be the eigenvector corresponding to the largest eigenvalue of
      Y = (1/m) Σ_{r=1}^{m} yr ar ar∗.
  Output: Initial guess z0.

B. Initialization via a Spectral Method

Our main result states that for a certain random model, if the initialization z0 is sufficiently accurate, then the sequence {zτ} will converge toward a solution to the generalized phase problem (I.1). In this paper, we propose computing the initial guess z0 via a spectral method, detailed in Algorithm 1. In words, z0 is the leading eigenvector of the positive semidefinite Hermitian matrix Σ_r yr ar ar∗ constructed from the knowledge of the sampling vectors and observations. (As usual, ar∗ is the adjoint of ar.) Letting A be the m × n matrix whose r-th row is ar∗, so that with obvious notation y = |Ax|², z0 is the leading eigenvector of A∗ diag{y} A and can be computed via the power method by repeatedly applying A, entrywise multiplication by y, and A∗. In the theoretical framework we study below, a constant number of power iterations would give machine accuracy because of an eigenvalue gap between the top two eigenvalues; please see Appendix B for additional information.

C. Wirtinger Flow as a Stochastic Gradient Scheme

We would like to motivate the Wirtinger flow algorithm and provide some insight as to why we expect it to work in the Gaussian model.

¹ For a partial review of some of these heuristics as well as some recent theoretical advances in related problems we refer to our companion paper [16, Sec. 1.6] and references therein [7], [13], [30], [31], [36], [55], [56], [65].
² Observe that if all the sampling vectors are real valued, our objective is also a degree-four polynomial.
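Before turning to the heuristics, the two components of the method—the spectral initialization of Algorithm 1 and the update (II.2)—can be condensed into a few lines. The following is a minimal NumPy sketch (function and variable names are ours, not from any released code), for illustration only:

```python
import numpy as np

def wf_initialize(A, y, n_iter=50):
    """Spectral initialization (Algorithm 1): power method for the leading
    eigenvector of Y = (1/m) sum_r y_r a_r a_r^*, i.e. Y z = A^* (y * (A z)) / m."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(n_iter):
        z = A.conj().T @ (y * (A @ z)) / m  # apply A, entrywise y, then A^*
        z /= np.linalg.norm(z)
    # scale so that ||z0||^2 = n * sum_r y_r / sum_r ||a_r||^2
    lam = np.sqrt(n * y.sum() / (np.abs(A) ** 2).sum())
    return lam * z

def wf_update(z, A, y, mu, z0_norm):
    """One Wirtinger flow step (II.2) with step size mu / ||z0||^2."""
    m = len(y)
    grad = A.conj().T @ ((np.abs(A @ z) ** 2 - y) * (A @ z)) / m
    return z - (mu / z0_norm ** 2) * grad
```

A usage sketch: draw A with i.i.d. N(0, I/2) + iN(0, I/2) rows, set y = |Ax|², call `wf_initialize`, then iterate `wf_update`.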


First, we emphasize that our statements in this section are heuristic in nature; as will become clear in the proof Section VII, a correct mathematical formalization of these ideas is far more complicated than our heuristic development here may suggest. Second, although our ideas are broadly applicable, it makes sense to begin understanding the algorithm in a setting where everything is real valued, and in which the vectors ar are i.i.d. N(0, I). Also, without any loss in generality and to simplify exposition, in this section we shall assume that the signal has unit Euclidean norm, i.e. ‖x‖ = 1.

Let x be a solution to (I.1) so that yr = |⟨ar, x⟩|², and consider the initialization step first. In the Gaussian model, a simple moment calculation gives

    E[(1/m) Σ_{r=1}^{m} yr ar ar∗] = I + 2xx∗.

By the strong law of large numbers, the matrix Y in Algorithm 1 is equal to the right-hand side in the limit of large samples. Since any leading eigenvector of I + 2xx∗ is of the form λx for some scalar λ ∈ ℝ, we see that the initialization step would recover x perfectly, up to a global sign or phase factor, had we infinitely many samples. Indeed, the chosen normalization would guarantee that the recovered signal is of the form ±x. As an aside, we would like to note that the top two eigenvalues of I + 2xx∗ are well separated unless ‖x‖ is very small, and that their ratio is equal to 1 + 2‖x‖². Now, with a finite amount of data, the leading eigenvector of Y will of course not be perfectly correlated with x, but we hope that it is sufficiently correlated to point us in the right direction.

We now turn our attention to the gradient update (II.2) and define

    F(z) := E[f(z)] = 2 z∗(I − xx∗)z + (3/2)(‖z‖² − 1)²,

where here and below, x is once again our planted solution. The first term ensures that the direction of z matches the direction of x and the second term penalizes the deviation of the Euclidean norm of z from that of x. Obviously, the minimizers of this function are ±x. Now consider the gradient scheme

    zτ+1 = zτ − (μτ+1/‖z0‖²) ∇F(zτ).   (II.3)

In Section VII-I, we show that if min ‖z0 ± x‖ ≤ (1/8)‖x‖, then {zτ} converges to x up to a global sign. However, this is all ideal as we would need knowledge of x itself to compute the gradient of F; we simply cannot run this algorithm.

Consider now the WF update and assume for a moment that zτ is fixed and independent of the sampling vectors. We are well aware that this is a false assumption but nevertheless wish to explore some of its consequences. In the Gaussian model, if z is independent of the sampling vectors, then a modification of Lemma 7.2 for real-valued z shows that E[∇f(z)] = ∇F(z) and, therefore,

    E[zτ+1] = E[zτ] − (μτ+1/‖z0‖²) E[∇f(zτ)]
    ⇒ E[zτ+1] = zτ − (μτ+1/‖z0‖²) ∇F(zτ).

Hence, the average WF update is the same as that in (II.3), so that we can interpret the Wirtinger flow algorithm as a stochastic gradient scheme in which we only get to observe an unbiased estimate ∇f(z) of the "true" gradient ∇F(z).

Regarding WF as a stochastic gradient scheme helps us in choosing the learning parameter or step size μτ. Lemma 7.7 asserts that

    ‖∇f(z) − ∇F(z)‖ ≤ ‖x‖² · min ‖z ± x‖   (II.4)

holds with high probability. Looking at the right-hand side, this says that the uncertainty about the gradient estimate depends on how far we are from the actual solution x. The further away, the larger the uncertainty or the noisier the estimate. This suggests that in the early iterations we should use a small learning parameter as the noise is large, since we are not yet close to the solution. However, as the iteration count increases and we make progress, the size of the noise also decreases and we can pick larger values for the learning parameter. This heuristic together with experimentation lead us to consider

    μτ = min(1 − e^{−τ/τ0}, μmax),   (II.5)

shown in Figure 1. Values of τ0 around 330 and of μmax around 0.4 worked well in our simulations. This makes sure that μτ is rather small at the beginning (e.g. μ1 ≈ 0.003) but quickly increases and reaches a maximum value of about 0.4 after 200 iterations or so.

Fig. 1. Learning parameter μτ from (II.5) as a function of the iteration count τ; here, τ0 ≈ 330 and μmax = 0.4.

III. MAIN RESULTS

A. Exact Phase Retrieval via Wirtinger Flow

Our main result establishes the correctness of the Wirtinger flow algorithm in the Gaussian model defined below. Later, in Section V, we shall also develop exact recovery results for a physically inspired diffraction model.

Definition 3.1: We say that the sampling vectors follow the Gaussian model if the ar ∈ ℂⁿ are i.i.d. ∼ N(0, I/2) + iN(0, I/2). In the real-valued case, they are i.i.d. N(0, I).

We also need to define the distance to the solution set.

Definition 3.2: Let x ∈ ℂⁿ be any solution to the quadratic system (I.1) (the signal we wish to recover). For each z ∈ ℂⁿ, define

    dist(z, x) = min_{φ∈[0,2π]} ‖z − e^{iφ} x‖.


Theorem 3.3: Let x be an arbitrary vector in ℂⁿ and y = |Ax|² ∈ ℝᵐ be m quadratic samples with m ≥ c0 · n log n, where c0 is a sufficiently large numerical constant. Then the Wirtinger flow initial estimate z0, normalized to have squared Euclidean norm equal to m⁻¹ Σ_r yr,³ obeys

    dist(z0, x) ≤ (1/8) ‖x‖   (III.1)

with probability at least 1 − 10e^{−γn} − 8/n² (γ is a fixed positive numerical constant). Further, take a constant learning parameter sequence, μτ = μ for all τ = 1, 2, . . ., and assume μ ≤ c1/n for some fixed numerical constant c1. Then there is an event of probability at least 1 − 13e^{−γn} − me^{−1.5m} − 8/n², such that on this event, starting from any initial solution z0 obeying (III.1), we have

    dist(zτ, x) ≤ (1/8) (1 − μ/4)^{τ/2} · ‖x‖.

Clearly, one would need 2n quadratic measurements to have any hope of recovering x ∈ ℂⁿ. It is also known that in our sampling model, the mapping z → |Az|² is injective for m ≥ 4n [5] and that this property holds for generic sampling vectors [19].⁴ Hence, the Wirtinger flow algorithm loses at most a logarithmic factor in the sampling complexity. In comparison, the SDP relaxation only needs a sampling complexity proportional to n (no logarithmic factor) [15], and it is an open question whether Theorem 3.3 holds in this regime.

Setting μ = c1/n yields ε accuracy in a relative sense, namely, dist(z, x) ≤ ε‖x‖, in O(n log 1/ε) iterations. The computational work at each iteration is dominated by two matrix-vector products of the form Az and A∗v, see Appendix B. It follows that the overall computational complexity of the WF algorithm is O(mn² log 1/ε). Later in the paper, we will exhibit a modification to the WF algorithm of mere theoretical interest, which also yields exact recovery under the same sampling complexity and an O(mn log 1/ε) computational complexity; that is to say, the computational workload is then just linear in the problem size.

B. Comparison With Other Non-Convex Schemes

We now pause to comment on a few other non-convex schemes in the literature. Other comparisons may be found in our companion paper [16].

Earlier, we discussed the Gerchberg-Saxton and Fienup algorithms. These formulations assume that A is a Fourier transform and can be described as follows: suppose zτ is the current guess; then one computes the image of zτ through A and adjusts its modulus so that it matches that of the observed data:

    v̂τ+1 = b ⊙ (A zτ / |A zτ|),   (III.2)

where ⊙ is elementwise multiplication, and b = |Ax| so that br² = yr for all r = 1, . . . , m. Then

    vτ+1 = arg min_{v ∈ ℂⁿ} ‖v̂τ+1 − Av‖.   (III.3)

(In the case of Fourier data, the step (III.2)–(III.3) essentially adjusts the modulus of the Fourier transform of the current guess so that it fits the measured data.) Finally, if we know that the solution belongs to a convex set C (as in the case where the signal is known to be real-valued, possibly non-negative and of finite support), then the next iterate is

    zτ+1 = P_C(vτ+1),   (III.4)

where P_C is the projection onto the convex set C. If no such information is available, then zτ+1 = vτ+1. The first step (III.3) is not a projection onto a convex set and, therefore, it is in general completely unclear whether the Gerchberg-Saxton algorithm actually converges. (And if it were to converge, at what speed?) It is also unclear how the procedure should be initialized to yield accurate final estimates. This is in contrast to the Wirtinger flow algorithm, which in the Gaussian model is shown to exhibit geometric convergence to the solution to the phase retrieval problem. Another benefit is that the Wirtinger flow algorithm does not require solving a least-squares problem (III.3) at each iteration; each step enjoys a reduced computational complexity.

A recent contribution related to ours is the interesting paper [56], which proposes an alternating minimization scheme named AltMinPhase for the general phase retrieval problem. AltMinPhase is inspired by the Gerchberg-Saxton update (III.2)–(III.3) as well as other established alternating projection heuristics [26], [42], [43], [45], [46], [69]. We describe the algorithm in the setup of Theorem 3.3, for which [56] gives theoretical guarantees. To begin with, AltMinPhase partitions the sampling vectors ar (the rows of the matrix A) and corresponding observations yr into B + 1 disjoint blocks (y⁽⁰⁾, A⁽⁰⁾), (y⁽¹⁾, A⁽¹⁾), . . ., (y⁽ᴮ⁾, A⁽ᴮ⁾) of roughly equal size. Hence, distinct blocks are stochastically independent from each other. The first block (y⁽⁰⁾, A⁽⁰⁾) is used to compute an initial estimate z0. After initialization, AltMinPhase goes through a series of iterations of the form (III.2)–(III.3), however, with the key difference that each iteration uses a fresh set of sampling vectors and observations; in details,

    v̂τ+1 = b ⊙ (A⁽τ+1⁾ zτ / |A⁽τ+1⁾ zτ|),
    zτ+1 = arg min_{z ∈ ℂⁿ} ‖v̂τ+1 − A⁽τ+1⁾ z‖.   (III.5)

As for the Gerchberg-Saxton algorithm, each iteration requires solving a least-squares problem. Now assume a real-valued Gaussian model as well as a real-valued solution x ∈ ℝⁿ. The main result in [56] states that if the first block (y⁽⁰⁾, A⁽⁰⁾) contains at least c · n log³ n samples and each consecutive block contains at least c · n log n samples—c here denotes a positive numerical constant whose value may change at each occurrence—then it is possible to initialize the algorithm via data from the first block in such a way that each consecutive iterate (III.5) decreases the error ‖zτ − x‖ by 50%; naturally, all of this holds in a probabilistic sense. Hence, one can get ε accuracy in the sense introduced earlier from a total of c · n log n · (log² n + log 1/ε) samples.

³ The same results hold with the initialization from Algorithm 1 because Σ_r ‖ar‖² ≈ m · n with a standard deviation of about the square root of this quantity.
⁴ It is not within the scope of this paper to explain the meaning of generic vectors; we instead refer the interested reader to [19].
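One sweep of the Gerchberg-Saxton update (III.2)–(III.3) is easy to sketch for a generic measurement matrix A; the NumPy illustration below is ours (the classical algorithm takes A to be a Fourier transform, and the projection (III.4) is omitted):

```python
import numpy as np

def gs_step(z, A, b):
    """One Gerchberg-Saxton sweep, (III.2)-(III.3): impose the measured
    moduli b = |Ax| in data space, then map back to signal space by
    least squares."""
    Az = A @ z
    phase = Az / np.maximum(np.abs(Az), 1e-12)     # guard against zeros
    v_hat = b * phase                              # (III.2)
    v, *_ = np.linalg.lstsq(A, v_hat, rcond=None)  # (III.3)
    return v
```

If the signal is known to be real-valued and nonnegative, one could follow this with a projection such as `np.maximum(v.real, 0)` in the spirit of (III.4); that post-processing step is our own illustration, not part of (III.2)–(III.3).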


Whereas the Wirtinger flow algorithm achieves arbitrary accuracy from just c · n log n samples, these theoretical results would require an infinite number of samples. This is, however, not the main point.

The main point is that in practice, it is not realistic to imagine (1) that we will divide the samples into distinct blocks (how many blocks should we form a priori? of which sizes?) and (2) that we will use measured data only once. With respect to the latter, observe that the Gerchberg-Saxton procedure (III.2)–(III.3) uses all the samples at each iteration. This is the reason why AltMinPhase is of little practical value, and of theoretical interest only. As a matter of fact, its design and study seem merely to stem from analytical considerations: since one uses an independent set of measurements at each iteration, A⁽τ+1⁾ and zτ are stochastically independent, a fact which considerably simplifies the convergence analysis. In stark contrast, the WF iterate uses all the samples at each iteration and thus introduces some dependencies, which makes for some delicate analysis. Overcoming these difficulties is crucial because the community is preoccupied with convergence properties of algorithms one actually runs, like Gerchberg-Saxton (III.2)–(III.3), or would actually want to run. Interestingly, it may be possible to use some of the ideas developed in this paper to develop a rigorous theory of convergence for algorithms in the style of Gerchberg-Saxton and Fienup; please see [65].

In a recent paper [44], which appeared on the arXiv preprint server as the final version of this paper was under preparation, the authors explore necessary and sufficient conditions for the global convergence of an alternating minimization scheme with generic sampling vectors. The issue is that we do not know when these conditions hold. Further, even when the algorithm converges, it does not come with an explicit convergence rate, so it is not known whether the algorithm converges in polynomial time. As before, some of our methods as well as those from our companion paper [16] may have some bearing upon the analysis of this algorithm. Similarly, another class of nonconvex algorithms that have recently been proposed in the literature are iterative algorithms based on Generalized Approximate Message Passing (GAMP), see [59], [63] as well as [9], [10], [23] for some background literature on AMP. In [63], the authors demonstrate a favorable runtime for an algorithm of this nature. However, this does not come with any theoretical guarantees.

Moving away from the phase retrieval problem, we would like to mention some very interesting work on the matrix completion problem using non-convex schemes by Montanari and coauthors [38]–[40], see also [2], [6], [32], [37], [41], [51], [52]. Although the problems and models are quite different, there are some general similarities between the algorithm named OptSpace in [39] and ours. Indeed, OptSpace operates by computing an initial guess of the solution to a low-rank matrix completion problem by means of a spectral method. It then sets up a nonconvex problem, and proposes an iterative algorithm for solving it. Under suitable assumptions, [39] demonstrates the correctness of this method in the sense that OptSpace will eventually converge to a low-rank solution, although it is not shown to converge in polynomial time.

IV. NUMERICAL EXPERIMENTS

We present some numerical experiments to assess the empirical performance of the Wirtinger flow algorithm. Here, we mostly consider a model of coded diffraction patterns reviewed below.

A. The Coded Diffraction Model

We consider an acquisition model where we collect data of the form

    yr = |Σ_{t=0}^{n−1} x[t] d̄ℓ(t) e^{−i2πkt/n}|²,   r = (ℓ, k),  0 ≤ k ≤ n − 1,  1 ≤ ℓ ≤ L;   (IV.1)

thus for a fixed ℓ, we collect the magnitude of the diffraction pattern of the signal {x(t)} modulated by the waveform/code {dℓ(t)}. By varying ℓ and changing the modulation pattern dℓ, we generate several views, thereby creating a series of coded diffraction patterns (CDPs).

In this paper, we are mostly interested in the situation where the modulation patterns are random; in particular, we study a model in which the dℓ's are i.i.d. distributed, each having i.i.d. entries sampled from a distribution d. Our theoretical results from Section V assume that d is symmetric, obeys |d| ≤ M as well as the moment conditions

    E[d] = 0,   E[d²] = 0,   E[|d|⁴] = 2(E[|d|²])².   (IV.2)

A random variable obeying these assumptions is said to be admissible. Since d is complex valued, we can have E[d²] = 0 while d ≠ 0. An example of an admissible random variable is d = b1 b2, where b1 and b2 are independent and distributed as

    b1 = +1 with prob. 1/4,  −1 with prob. 1/4,  −i with prob. 1/4,  +i with prob. 1/4,   (IV.3)

and

    b2 = √2/2 with prob. 4/5,  √3 with prob. 1/5.   (IV.4)

We shall refer to this distribution as an octanary pattern since d can take on eight distinct values. The condition E[d²] = 0 is here to avoid unnecessarily complicated calculations in our theoretical analysis.
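Admissibility of the octanary distribution can be verified exactly by enumerating the eight values of d = b1 · b2; a small sketch of ours:

```python
import numpy as np
from itertools import product

# Support and probabilities of d = b1 * b2 from (IV.3)-(IV.4).
b1_vals = [1, -1, -1j, 1j]
b1_probs = [0.25] * 4
b2_vals = [np.sqrt(2) / 2, np.sqrt(3)]
b2_probs = [0.8, 0.2]

support = [(v1 * v2, p1 * p2)
           for (v1, p1), (v2, p2) in product(zip(b1_vals, b1_probs),
                                             zip(b2_vals, b2_probs))]

def moment(fn):
    """Exact expectation E[fn(d)] over the eight-point support."""
    return sum(p * fn(v) for v, p in support)

E_d = moment(lambda d: d)               # 0 by symmetry of b1
E_d2 = moment(lambda d: d ** 2)         # 0 since b1^2 averages to 0
E_abs2 = moment(lambda d: abs(d) ** 2)  # equals 1 for this pattern
E_abs4 = moment(lambda d: abs(d) ** 4)  # equals 2 = 2 * (E|d|^2)^2
```

The enumeration confirms (IV.2): E[d] = E[d²] = 0 and E[|d|⁴] = 2(E[|d|²])², with E[|d|²] = 1 here.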


Fig. 2. Empirical probability of success based on 100 random trials for different signal/measurement models and a varied number of measurements. The coded diffraction model uses octanary patterns; the number of patterns L = m/n only takes on integral values.

In particular, we can also work with a ternary pattern in which d is distributed as

    d = +1 with prob. 1/4,  0 with prob. 1/2,  −1 with prob. 1/4.   (IV.5)

We emphasize that the random coded diffraction patterns mentioned above are physically realizable in optical applications, especially those that arise in microscopy. However, we should note that phase retrieval has many different applications and in some cases other CDP models may be more suitable for a particular application. We refer to our companion paper [16, Sec. 2.2] for a discussion of other practically relevant models.

B. The Gaussian and Coded Diffraction Models

We begin by examining the performance of the Wirtinger flow algorithm for recovering random signals x ∈ ℂⁿ under the Gaussian and coded diffraction models. We are interested in signals of two different types:

• Random Low-Pass Signals: Here, x is given by

    x[t] = Σ_{k=−(M/2−1)}^{M/2} (Xk + iYk) e^{2πi(k−1)(t−1)/n},

  with M = n/8, and Xk and Yk are i.i.d. N(0, 1).

• Random Gaussian Signals: In this model, x ∈ ℂⁿ is a random complex Gaussian vector with i.i.d. entries of the form x[t] = X + iY, with X and Y distributed as N(0, 1); this can be expressed as

    x[t] = Σ_{k=−(n/2−1)}^{n/2} (Xk + iYk) e^{2πi(k−1)(t−1)/n},

  where Xk and Yk are i.i.d. N(0, 1/8), so that the low-pass model is a 'bandlimited' version of this high-pass random model (variances are adjusted so that the expected signal power is the same).

Below, we set n = 128, and generate one signal of each type which will be used in all the experiments.

The initialization step of the Wirtinger flow algorithm is run by applying 50 iterations of the power method outlined in Algorithm 3 from Appendix B. In the iteration (II.2), we use the parameter value μτ = min(1 − exp(−τ/τ0), 0.2), where τ0 ≈ 330. We stop after 2,500 iterations, and report the empirical probability of success for the two different signal models. The empirical probability of success is an average over 100 trials, where in each instance, we generate new random sampling vectors according to the Gaussian or CDP models. We declare a trial successful if the relative error of the reconstruction dist(x̂, x)/‖x‖ falls below 10⁻⁵.

Figure 2 shows that around 4.5n Gaussian phaseless measurements suffice for exact recovery with high probability via the Wirtinger flow algorithm. We also see that about six octanary patterns are sufficient.

C. Performance on Natural Images

We move on to testing the Wirtinger flow algorithm on various images of different sizes; these are photographs of the Naqsh-e Jahan Square in the central Iranian city of Esfahan, the Stanford main quad, and the Milky Way galaxy. Since each photograph is in color, we run the WF algorithm on each of the three RGB images that make up the photograph. Color images are viewed as n1 × n2 × 3 arrays, where the first two indices encode the pixel location, and the last the color band.

We generate L = 20 random octanary patterns and gather the coded diffraction patterns for each color band using these 20 samples. As before, we run 50 iterations of the power method as the initialization step. The updates use the sequence μτ = min(1 − exp(−τ/τ0), 0.4), where τ0 ≈ 330 as before. In all cases, we run 300 iterations and record the relative error as well as the running time. If x and x̂ are the original and recovered images, the relative error is equal to ‖x̂ − x‖/‖x‖, where ‖·‖ is the Euclidean norm: ‖x‖² = Σ_{i,j,k} |x(i, j, k)|². The computational time we report is averaged over the three RGB images.

CANDÈS et al.: PHASE RETRIEVAL VIA WIRTINGER FLOW: THEORY AND ALGORITHMS 1991

Fig. 3. Performance of the WF algorithm on three scenic images. Image size, computational time in seconds and in units of FFTs are reported, as well as the relative error after 300 WF iterations. (a) Naqsh-e Jahan Square, Esfahan. Image size is 189 × 768 pixels; timing is 61.4 sec or about 21,200 FFT units. The relative error is 6.2 × 10⁻¹⁶. (b) Stanford main quad. Image size is 320 × 1280 pixels; timing is 181.8 sec or about 20,700 FFT units. The relative error is 3.5 × 10⁻¹⁴. (c) Milky Way galaxy. Image size is 1080 × 1920 pixels; timing is 1318.1 sec or 41,900 FFT units. The relative error is 9.3 × 10⁻¹⁶.

is the computational time averaged over the three RGB images. All experiments were carried out on a MacBook Pro with a 2.4 GHz Intel Core i7 processor and 8 GB of 1600 MHz DDR3 memory.

Figure 3 shows the images recovered via the Wirtinger flow algorithm. In all cases, WF gets 12 or 13 digits of precision in a matter of minutes. To convey an idea of timing that is platform-independent, we also report time in units of FFTs; one FFT unit is the amount of time it takes to perform a single FFT on an image of the same size. Now all the workload is dominated by matrix-vector products of the form Az and A*v. In detail, each iteration of the power method in

1992 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

Fig. 5. Schematic representation and electron density map of the Caffeine molecule. (a) Schematic representation (b) Electron density map.

the initialization step, or each update (II.2), requires 40 FFTs; the factor of 40 comes from the fact that we have 20 patterns and that each iteration involves one FFT and one adjoint or inverse FFT. Hence, the total number of FFTs is equal to

20 patterns × 2 (one FFT and one IFFT) × (300 gradient steps + 50 power iterations) = 14,000.

Another way to state this is that the total workload of our algorithm is roughly equal to 350 applications of the sensing matrix A and its adjoint A*. For about 13 digits of accuracy (relative error of about 10⁻¹³), Figure 3 shows that we need between 21,000 and 42,000 FFT units. This is within a factor between 1.5 and 3 of the optimal number computed above. This increase has to do with the fact that in our implementation, certain variables are copied into other temporary variables, and these types of operations cause some overhead. This overhead is non-linear and becomes more prominent as the size of the signal increases.

For comparison, SDP-based solutions such as PhaseLift [14], [17] and PhaseCut [67] would be prohibitive on a laptop computer, as the lifted signal would not fit into memory. In the SDP approach, an n-pixel image becomes an n²/2 array; storing the lifted signal even for the smallest image thus requires (189 × 768)² × 1/2 × 8 bytes, which is approximately 85 GB of space. (For the image of the Milky Way, storage would be about 17 TB.) These large memory requirements prevent the application of full-blown SDP solvers on desktop computers.

D. 3D Molecules

Understanding molecular structure is a great contemporary scientific challenge, and several techniques are currently employed to produce 3D images of molecules; these include electron microscopy and X-ray imaging. In X-ray imaging, for instance, the experimentalist illuminates an object of interest, e.g. a molecule, and then collects the intensity of the diffracted rays; please see Figure 4 for an illustrative setup. Figures 5 and 6 show the schematic representations and the corresponding electron density maps for the Caffeine and Nicotine molecules: the density map ρ(x₁, x₂, x₃) is the 3D object we seek to infer. In this paper, we do not go as far as 3D reconstruction, but demonstrate the performance of the Wirtinger flow algorithm for recovering projections of 3D molecule density maps from simulated data. For related simulations using convex schemes we refer the reader to [28].

To derive signal equations, consider an experimental apparatus as in Figure 4. If we imagine that light propagates in the direction of the x₃-axis, an approximate model for the collected data reads
$$I(f_1, f_2) = \left| \int \left( \int \rho(x_1, x_2, x_3)\, dx_3 \right) e^{-2i\pi(f_1 x_1 + f_2 x_2)}\, dx_1\, dx_2 \right|^2.$$
In other words, we collect the intensity of the diffraction pattern of the projection ∫ρ(x₁, x₂, x₃) dx₃. The 2D image

Fig. 6. Schematic representation and electron density map of the Nicotine molecule. (a) Schematic representation. (b) Electron density map.

Fig. 7. Electron density ρ(x₁, x₂, x₃) of the Caffeine molecule along with its projection onto the x₁x₂-plane.

we wish to recover is thus the line integral of the density map along a given direction. As an example, the Caffeine molecule along with its projection on the x₁x₂-plane (line integral in the x₃ direction) is shown in Figure 7. Now, if we let R be the Fourier transform of the density ρ, one can re-express the identity above as
$$I(f_1, f_2) = |R(f_1, f_2, 0)|^2.$$
Therefore, by imputing the missing phase using phase retrieval algorithms, one can recover a slice of the 3D Fourier transform of the electron density map, i.e. R(f₁, f₂, 0). Viewing the object from different angles or directions gives us different slices. In a second step, which we do not perform in this paper, one can presumably recover the 3D Fourier transform of the electron density map from all these slices (this is the tomography or blind tomography problem, depending upon whether or not the projection angles are known) and, in turn, the 3D electron density map.

We now generate 51 observation planes by rotating the x₁x₂-plane around the x₁-axis by equally spaced angles in the interval [0, 2π]. Each of these planes is associated with a 2D projection of size 1024 × 1024, giving us patterns for all 51 projections. We run the Wirtinger flow algorithm with exactly the same parameters as in the previous section, and stop after 150 gradient iterations. Figure 8 reports the average relative error over the 51 projections and the total computational time required for reconstructing all 51 images.

We complement our study with theoretical results applying to the model of coded diffraction patterns. These results concern a variation of the Wirtinger flow algorithm: whereas the iterations are exactly the same as (II.2), the initialization applies an iterative scheme which uses fresh sets of samples at each iteration. This is described in Algorithm 2. In the CDP model, the partitioning assigns to the same group all the observations and sampling vectors corresponding to a given realization of the random code. This is equivalent to partitioning the random patterns into B + 1 groups. As a result, sampling vectors in distinct groups are stochastically independent.

Theorem 5.1: Let x be an arbitrary vector in ℂⁿ and assume we collect L admissible coded diffraction patterns with L ≥ c₀ · (log n)⁴, where c₀ is a sufficiently large numerical constant. Then the initial solution z₀ of Algorithm 2⁵ obeys
$$\mathrm{dist}(z_0, x) \le \frac{1}{8\sqrt{n}}\, \|x\| \qquad (V.1)$$
with probability at least 1 − (4L + 2)/n³. Further, take a constant learning parameter sequence, μ_τ = μ for all τ = 1, 2, ..., and assume μ ≤ c₁ for some fixed numerical constant c₁. Then there is an event of probability at least 1 − (2L + 1)/n³ − 1/n², such that on this event, starting from any initial solution z₀ obeying (V.1), we have
$$\mathrm{dist}(z_\tau, x) \le \frac{1}{8\sqrt{n}} \left(1 - \frac{\mu}{3}\right)^{\tau/2} \|x\|. \qquad (V.2)$$

⁵We choose the number of partitions B in Algorithm 2 to obey B ≥ c₁ log n for c₁ a sufficiently large numerical constant.
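The slice identity I(f₁, f₂) = |R(f₁, f₂, 0)|² used above also holds for discrete transforms: the 2D DFT of the x₃-projection equals the f₃ = 0 slice of the 3D DFT. A minimal numerical check of this fact (the random array below is merely a stand-in for a density map ρ, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = rng.random((8, 8, 8))          # stand-in for the density rho(x1, x2, x3)

# 2D DFT of the projection (discrete line integral along x3) ...
F_proj = np.fft.fft2(rho.sum(axis=2))

# ... equals the f3 = 0 slice of the 3D DFT of rho
slice0 = np.fft.fftn(rho)[:, :, 0]
assert np.allclose(F_proj, slice0)

# the detector records only the intensity |R(f1, f2, 0)|^2
intensity = np.abs(slice0) ** 2
```

A phase retrieval solver applied to `intensity` would thus recover one Fourier slice of ρ per viewing direction, which is what the 51-projection experiment exploits.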

Fig. 8. Reconstruction sequence of the projection of the Caffeine and Nicotine molecules along different directions. To see the videos, please download and open the PDF file using Acrobat Reader. (a) Caffeine molecule. Mean rel. error is 9.6 × 10⁻⁶; total time is 5.4 hours. (b) Nicotine molecule. Mean rel. error is 1.7 × 10⁻⁵; total time is 5.4 hours.

Algorithm 2 Initialization via Resampled Wirtinger Flow
Input: Observations {y_r} ∈ ℝᵐ and number of blocks B.
Partition the observations and sampling vectors {y_r}_{r=1}^m and {a_r}_{r=1}^m into B + 1 groups of size m′ = m/(B + 1). For each group b = 0, 1, ..., B, set
$$f(z; b) = \frac{1}{2m'} \sum_{r=1}^{m'} \left( y_r^{(b)} - \left|\langle a_r^{(b)}, z\rangle\right|^2 \right)^2,$$
where {a_r^{(b)}} are those sampling vectors belonging to the bth group (and likewise for {y_r^{(b)}}).
Initialize ũ₀ to be the eigenvector corresponding to the largest eigenvalue of
$$Y = \frac{1}{m'} \sum_{r=1}^{m'} y_r^{(0)}\, a_r^{(0)} a_r^{(0)*}$$
normalized as in Algorithm 1.
Loop:
for b = 0 to B − 1 do
  u_{b+1} = u_b − (μ̃/‖u₀‖²) ∇f(u_b; b)
end for
Output: z₀ = u_B.

In the Gaussian model, both statements also hold with high probability provided that m ≥ c₀ · n(log n)², where c₀ is a sufficiently large numerical constant. Hence, we achieve perfect recovery from on the order of n(log n)⁴ samples arising from a coded diffraction experiment. Our companion paper [16] established that PhaseLift—the SDP relaxation—is also exact with a sampling complexity on the order of n(log n)⁴ (this has recently been improved to n(log n)² [31]). We believe that the sampling complexity of both approaches (WF and SDP) can be further reduced to n log n (or even n for certain kinds of patterns). We leave this to future research.

Setting μ = c₁ yields ε accuracy in O(log 1/ε) iterations. As the computational work at each iteration is dominated by two matrix-vector products of the form Az and A*v, it follows that the overall computational complexity is at most O(nL log n · log 1/ε). In particular, this approach yields a near-linear time algorithm in the CDP model (linear in the dimension of the signal n). In the Gaussian model, the complexity scales like O(mn log 1/ε).

VI. WIRTINGER DERIVATIVES

Our gradient step (II.2) uses a notion of derivative, which can be interpreted as a Wirtinger derivative. The purpose of this section is thus to gather some results concerning Wirtinger derivatives of real-valued functions over complex variables. Here and below, Mᵀ is the transpose of the matrix M, and c̄ denotes the complex conjugate of a scalar c ∈ ℂ. Similarly, the matrix M̄ is obtained by taking complex conjugates of the elements of M.

Any complex- or real-valued function f(z) = f(x, y) = u(x, y) + iv(x, y) of several complex variables can be written in the form f(z, z̄), where f is holomorphic in z = x + iy for fixed z̄ and holomorphic in z̄ = x − iy for fixed z. This holds as long as the real-valued functions u and v are differentiable as functions of the real variables x and y. As an example, consider
$$f(z) = \left( y - |a^* z|^2 \right)^2 = \left( y - \bar z^T a a^* z \right)^2 = f(z, \bar z),$$
with z, a ∈ ℂⁿ and y ∈ ℝ. While f(z) is not holomorphic in z, f(z, z̄) is holomorphic in z for a fixed z̄, and vice versa. This fact underlies the development of the Wirtinger calculus. In essence, the conjugate coordinates
$$\begin{bmatrix} z \\ \bar z \end{bmatrix} \in \mathbb{C}^n \times \mathbb{C}^n, \qquad z = x + iy \ \text{ and } \ \bar z = x - iy,$$

can serve as a formal substitute for the representation (x, y) ∈ ℝ²ⁿ. This leads to the following derivatives
$$\frac{\partial f}{\partial z} := \frac{\partial f(z, \bar z)}{\partial z}\Big|_{\bar z = \text{constant}} = \left[ \frac{\partial f}{\partial z_1}, \frac{\partial f}{\partial z_2}, \ldots, \frac{\partial f}{\partial z_n} \right]_{\bar z = \text{constant}},$$
$$\frac{\partial f}{\partial \bar z} := \frac{\partial f(z, \bar z)}{\partial \bar z}\Big|_{z = \text{constant}} = \left[ \frac{\partial f}{\partial \bar z_1}, \frac{\partial f}{\partial \bar z_2}, \ldots, \frac{\partial f}{\partial \bar z_n} \right]_{z = \text{constant}}.$$
Our definitions follow standard notation from multivariate calculus, so that derivatives are row vectors and gradients are column vectors. In this new coordinate system the complex gradient is given by
$$\nabla_c f = \left[ \frac{\partial f}{\partial z}, \frac{\partial f}{\partial \bar z} \right]^*.$$
Similarly, we define
$$H_{zz} := \frac{\partial}{\partial z}\left( \frac{\partial f}{\partial z} \right)^*, \quad H_{\bar z z} := \frac{\partial}{\partial \bar z}\left( \frac{\partial f}{\partial z} \right)^*, \quad H_{z \bar z} := \frac{\partial}{\partial z}\left( \frac{\partial f}{\partial \bar z} \right)^*, \quad H_{\bar z \bar z} := \frac{\partial}{\partial \bar z}\left( \frac{\partial f}{\partial \bar z} \right)^*.$$
In this coordinate system the complex Hessian is given by
$$\nabla^2 f := \begin{bmatrix} H_{zz} & H_{\bar z z} \\ H_{z \bar z} & H_{\bar z \bar z} \end{bmatrix}.$$
Given vectors z and Δz ∈ ℂⁿ, we have defined the gradient and Hessian in a manner such that Taylor's approximation takes the form
$$f(z + \Delta z) \approx f(z) + (\nabla_c f(z))^* \begin{bmatrix} \Delta z \\ \Delta \bar z \end{bmatrix} + \frac{1}{2} \begin{bmatrix} \Delta z \\ \Delta \bar z \end{bmatrix}^* \nabla^2 f(z) \begin{bmatrix} \Delta z \\ \Delta \bar z \end{bmatrix}.$$
If we were to run gradient descent in this new coordinate system, the iterates would be
$$\begin{bmatrix} z_{\tau+1} \\ \bar z_{\tau+1} \end{bmatrix} = \begin{bmatrix} z_\tau \\ \bar z_\tau \end{bmatrix} - \mu \begin{bmatrix} (\partial f/\partial z)^*\big|_{z = z_\tau} \\ (\partial f/\partial \bar z)^*\big|_{z = z_\tau} \end{bmatrix}. \qquad (VI.1)$$
Note that when f is a real-valued function (as in this paper) we have
$$\overline{\left( \frac{\partial f}{\partial z} \right)} = \frac{\partial f}{\partial \bar z}.$$
Therefore, the second set of updates in (VI.1) is just the conjugate of the first. Thus, it is sufficient to keep track of the first update, namely, z_{τ+1} = z_τ − μ(∂f/∂z)*. For real-valued functions of complex variables, setting
$$\nabla f(z) = \left( \frac{\partial f}{\partial z} \right)^*$$
gives the gradient update z_{τ+1} = z_τ − μ∇f(z_τ).

The reader may wonder why we choose to work with conjugate coordinates, as there are alternatives: in particular, we could view the complex variable z = x + iy ∈ ℂⁿ as a vector in ℝ²ⁿ and just run gradient descent in the (x, y) coordinate system. The main reason why conjugate coordinates are particularly attractive is that expressions for derivatives become significantly simpler and resemble those we obtain in the real case, where f: ℝⁿ → ℝ is a function of real variables.

VII. PROOFS

A. Preliminaries

We first note that in the CDP model with admissible CDPs, ‖a_r‖ ≤ √(6n) for all r = 1, 2, ..., m, as the entries of the CDPs obey |d| ≤ √3 < √6. In the Gaussian model the measurement vectors also obey ‖a_r‖ ≤ √(6n) for all r = 1, 2, ..., m with probability at least 1 − m e^{−1.5n}. Throughout the proofs, we assume we are on this event without explicitly mentioning it each time.

Before we begin with the proofs, we should mention that we will prove our result using the update
$$z_{\tau+1} = z_\tau - \frac{\mu}{\|x\|^2} \nabla f(z_\tau), \qquad (VII.1)$$
in lieu of the WF update
$$z_{\tau+1} = z_\tau - \frac{\mu_{WF}}{\|z_0\|^2} \nabla f(z_\tau). \qquad (VII.2)$$
Since |‖z₀‖² − ‖x‖²| ≤ (1/64)‖x‖² holds with high probability, as proven in Section VII-H, we have
$$\|z_0\|^2 \ge \frac{63}{64} \|x\|^2. \qquad (VII.3)$$
Therefore, the results for the update (VII.1) automatically carry over to the WF update with a simple rescaling of the upper bound on the learning parameter. More precisely, if we prove that the update (VII.1) converges to a global optimum as long as μ ≤ μ₀, then the convergence of the WF update to a global optimum is guaranteed as long as μ_WF ≤ (63/64)μ₀. Also, the update in (VII.1) is invariant to the Euclidean norm of x. Therefore, without loss of generality, we will assume throughout the proofs that ‖x‖ = 1.

We remind the reader that throughout, x is a solution to our quadratic equations, i.e. obeys y = |Ax|², and that the sampling vectors are independent from x. Define
$$P := \{ x e^{i\phi} : \phi \in [0, 2\pi] \}$$
to be the set of all vectors that differ from the planted solution x only by a global phase factor. We also introduce the set of all points that are close to P,
$$E(\epsilon) := \{ z \in \mathbb{C}^n : \mathrm{dist}(z, P) \le \epsilon \}. \qquad (VII.4)$$
Finally, for any vector z ∈ ℂⁿ we define the phase φ(z) as
$$\phi(z) := \arg\min_{\phi \in [0, 2\pi]} \left\| z - e^{i\phi} x \right\|,$$
so that dist(z, x) = ‖z − e^{iφ(z)}x‖.
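In code, the minimizing phase has the closed form e^{iφ(z)} = x*z/|x*z| whenever x*z ≠ 0 (a standard fact, not spelled out in the text); a minimal sketch:

```python
import numpy as np

def dist(z, x):
    """dist(z, x) = min over phi of ||z - e^{i phi} x||."""
    inner = np.vdot(x, z)                       # x^* z (vdot conjugates x)
    e_iphi = inner / abs(inner) if inner != 0 else 1.0
    return np.linalg.norm(z - e_iphi * x)

rng = np.random.default_rng(1)
x = rng.normal(size=8) + 1j * rng.normal(size=8)

# a global phase rotation of x lies on the path P, so its distance is ~0,
# even though the plain Euclidean distance is far from zero
z = np.exp(0.7j) * x
assert dist(z, x) < 1e-10
assert np.linalg.norm(z - x) > 1.0
```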

B. Formulas for the Complex Gradient and Hessian

We gather some useful gradient and Hessian calculations that will be used repeatedly. Starting with
$$f(z) = \frac{1}{2m} \sum_{r=1}^m \left( y_r - \bar z^T (a_r a_r^*) z \right)^2 = \frac{1}{2m} \sum_{r=1}^m \left( y_r - z^T (a_r a_r^*)^T \bar z \right)^2, \qquad (VII.5)$$
we establish
$$\frac{\partial f}{\partial z} = \frac{1}{m} \sum_{r=1}^m \left( z^T (a_r a_r^*)^T \bar z - y_r \right) \bar z^T (a_r a_r^*).$$
This gives
$$\nabla f(z) = \left( \frac{\partial f}{\partial z} \right)^* = \frac{1}{m} \sum_{r=1}^m \left( \bar z^T (a_r a_r^*) z - y_r \right) (a_r a_r^*) z.$$
For the second derivative, we write
$$H_{zz} = \frac{\partial}{\partial z} \left( \frac{\partial f}{\partial z} \right)^* = \frac{1}{m} \sum_{r=1}^m \left( 2 |a_r^* z|^2 - y_r \right) a_r a_r^*$$
and
$$H_{\bar z z} = \frac{\partial}{\partial \bar z} \left( \frac{\partial f}{\partial z} \right)^* = \frac{1}{m} \sum_{r=1}^m (a_r^* z)^2\, a_r a_r^T.$$
Therefore,
$$\nabla^2 f(z) = \frac{1}{m} \sum_{r=1}^m \begin{bmatrix} (2|a_r^* z|^2 - y_r)\, a_r a_r^* & (a_r^* z)^2\, a_r a_r^T \\ \overline{(a_r^* z)^2}\, \bar a_r a_r^* & (2|a_r^* z|^2 - y_r)\, \bar a_r a_r^T \end{bmatrix}.$$

C. Expectation and Concentration

This section gathers some useful intermediate results whose proofs are deferred to Appendix A. The first two lemmas establish the expectation of the Hessian, the gradient, and a related random variable in both the Gaussian and admissible CDP models.⁶

Lemma 7.1: Recall that x is a solution obeying ‖x‖ = 1, which is independent from the sampling vectors. Furthermore, assume the sampling vectors a_r are distributed according to either the Gaussian or admissible CDP model. Then
$$\mathbb{E}[\nabla^2 f(x)] = I_{2n} + \frac{3}{2} \begin{bmatrix} x \\ \bar x \end{bmatrix} [x^*, x^T] - \frac{1}{2} \begin{bmatrix} x \\ -\bar x \end{bmatrix} [x^*, -x^T].$$

Lemma 7.2: In the setup of Lemma 7.1, let z ∈ ℂⁿ be a fixed vector independent of the sampling vectors. We have
$$\mathbb{E}[\nabla f(z)] = (I - x x^*) z + 2 \left( \|z\|^2 - 1 \right) z.$$

The next lemma gathers some useful identities in the Gaussian model.

Lemma 7.3: Assume u, v ∈ ℂⁿ are fixed vectors obeying ‖u‖ = ‖v‖ = 1 which are independent of the sampling vectors. Furthermore, assume the measurement vectors a_r are distributed according to the Gaussian model. Then
$$\mathbb{E}\left[ \mathrm{Re}(u^* a_r a_r^* v)^2 \right] = \frac{1}{2} + \frac{3}{2} \left( \mathrm{Re}(u^* v) \right)^2 - \frac{1}{2} \left( \mathrm{Im}(u^* v) \right)^2,$$
$$\mathbb{E}\left[ \mathrm{Re}(u^* a_r a_r^* v)\, |a_r^* v|^2 \right] = 2\, \mathrm{Re}(u^* v), \qquad (VII.6)$$
$$\mathbb{E}\left[ |a_r^* v|^{2k} \right] = k!. \qquad (VII.7)$$

The next lemma establishes the concentration of the Hessian around its mean for both the Gaussian and the CDP model.

Lemma 7.4: In the setup of Lemma 7.1, assume the vectors a_r are distributed according to either the Gaussian or admissible CDP model with a sufficiently large number of measurements. This means that the number of samples obeys m ≥ c(δ) · n log n in the Gaussian model and the number of patterns obeys L ≥ c(δ) · log³ n in the CDP model. Then
$$\left\| \nabla^2 f(x) - \mathbb{E}[\nabla^2 f(x)] \right\| \le \delta \qquad (VII.8)$$
holds with probability at least 1 − 10e^{−γn} − 8/n² and 1 − (2L + 1)/n³ for the Gaussian and CDP models, respectively.

We will also make use of the two results below, which are corollaries of the three lemmas above. These corollaries are also proven in Appendix A.

Corollary 7.5: Suppose ‖∇²f(x) − E[∇²f(x)]‖ ≤ δ. Then for all h ∈ ℂⁿ obeying ‖h‖ = 1, we have
$$\frac{1}{m} \sum_{r=1}^m \mathrm{Re}(h^* a_r a_r^* x)^2 = \frac{1}{4} \begin{bmatrix} h \\ \bar h \end{bmatrix}^* \nabla^2 f(x) \begin{bmatrix} h \\ \bar h \end{bmatrix} \le \frac{1}{2} \|h\|^2 + \frac{3}{2} \mathrm{Re}(x^* h)^2 - \frac{1}{2} \mathrm{Im}(x^* h)^2 + \frac{\delta}{2}.$$
In the other direction,
$$\frac{1}{m} \sum_{r=1}^m \mathrm{Re}(h^* a_r a_r^* x)^2 \ge \frac{1}{2} \|h\|^2 + \frac{3}{2} \mathrm{Re}(x^* h)^2 - \frac{1}{2} \mathrm{Im}(x^* h)^2 - \frac{\delta}{2}.$$

Corollary 7.6: Suppose ‖∇²f(x) − E[∇²f(x)]‖ ≤ δ. Then for all h ∈ ℂⁿ obeying ‖h‖ = 1, we have
$$\frac{1}{m} \sum_{r=1}^m |a_r^* x|^2 |a_r^* h|^2 = h^* \left( \frac{1}{m} \sum_{r=1}^m |a_r^* x|^2\, a_r a_r^* \right) h \ge (1 - \delta) \|h\|^2 + |h^* x|^2 \ge (1 - \delta) \|h\|^2,$$
and
$$\frac{1}{m} \sum_{r=1}^m |a_r^* x|^2 |a_r^* h|^2 \le (1 + \delta) \|h\|^2 + |h^* x|^2 \le (2 + \delta) \|h\|^2.$$

The next lemma establishes the concentration of the gradient around its mean for both Gaussian and admissible CDP models.

⁶In the CDP model the expectation is with respect to the random modulation pattern.
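As a quick numerical sanity check on the gradient formula of Section VII-B (a sketch, not part of the paper's argument): for real-valued f, the first-order expansion f(z + Δ) ≈ f(z) + 2 Re(⟨∇f(z), Δ⟩) from Section VI can be compared against a central finite difference; all sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 40
A = (rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))) / np.sqrt(2)  # rows play a_r^*
x = rng.normal(size=n) + 1j * rng.normal(size=n)
y = np.abs(A @ x) ** 2                      # y_r = |a_r^* x|^2

def f(z):                                   # (1/2m) sum_r (y_r - |a_r^* z|^2)^2
    return np.sum((y - np.abs(A @ z) ** 2) ** 2) / (2 * m)

def grad(z):                                # (1/m) sum_r (|a_r^* z|^2 - y_r) a_r a_r^* z
    Az = A @ z
    return A.conj().T @ ((np.abs(Az) ** 2 - y) * Az) / m

z = rng.normal(size=n) + 1j * rng.normal(size=n)
d = rng.normal(size=n) + 1j * rng.normal(size=n)
eps = 1e-6
fd = (f(z + eps * d) - f(z - eps * d)) / (2 * eps)   # directional derivative
assert abs(fd - 2 * np.real(np.vdot(grad(z), d))) < 1e-4 * (1 + abs(fd))
```

Note also that grad(x) vanishes identically, since y was generated from x; x is a global minimizer of f.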

Lemma 7.7: In the setup of Lemma 7.4, let z ∈ ℂⁿ be a fixed vector independent of the sampling vectors obeying dist(z, x) ≤ 1/2. Then
$$\left\| \nabla f(z) - \mathbb{E}[\nabla f(z)] \right\| \le \delta \cdot \mathrm{dist}(z, x)$$
holds with probability at least 1 − 20e^{−γm} − 4m/n⁴ in the Gaussian model and 1 − (4L + 2)/n³ in the CDP model.

We finish with a result concerning the concentration of the sample covariance matrix.

Lemma 7.8: In the setup of Lemma 7.4,
$$\left\| I_n - m^{-1} \sum_{r=1}^m a_r a_r^* \right\| \le \delta$$
holds with probability at least 1 − 2e^{−γm} for the Gaussian model and 1 − 1/n² in the CDP model. On this event,
$$(1 - \delta) \|h\|^2 \le \frac{1}{m} \sum_{r=1}^m |a_r^* h|^2 \le (1 + \delta) \|h\|^2, \quad \forall h \in \mathbb{C}^n. \qquad (VII.9)$$

D. General Convergence Analysis

We will assume that the function f satisfies a regularity condition on E(ε), which essentially states that the gradient of the function is well behaved. We remind the reader that E(ε), as defined in (VII.4), is the set of points that are close to the path of global minimizers.

Condition 7.9 (Regularity Condition): We say that the function f satisfies the regularity condition or RC(α, β, ε) if for all vectors z ∈ E(ε) we have
$$\mathrm{Re}\left( \langle \nabla f(z), z - x e^{i\phi(z)} \rangle \right) \ge \frac{1}{\alpha} \mathrm{dist}^2(z, x) + \frac{1}{\beta} \| \nabla f(z) \|^2. \qquad (VII.10)$$

In the lemma below we show that as long as the regularity condition holds on E(ε), then Wirtinger flow starting from an initial solution in E(ε) converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.

Lemma 7.10: Assume that f obeys RC(α, β, ε) for all z ∈ E(ε). Furthermore, suppose z₀ ∈ E(ε), and assume 0 < μ ≤ 2/β. Consider the update z_{τ+1} = z_τ − μ∇f(z_τ). Then for all τ we have z_τ ∈ E(ε) and
$$\mathrm{dist}^2(z_\tau, x) \le \left( 1 - \frac{2\mu}{\alpha} \right)^\tau \mathrm{dist}^2(z_0, x).$$

We note that for αβ < 4, (VII.10) holds with the direction of the inequality reversed.⁷ Thus, if RC(α, β, ε) holds, α and β must obey αβ ≥ 4. As a result, under the stated assumptions of Lemma 7.10 above, the factor 1 − 2μ/α ≥ 1 − 4/(αβ) is non-negative.

Proof: The proof follows a structure similar to related results in the convex optimization literature; see [55, Th. 2.1.15]. However, unlike these classical results, where the goal is often to prove convergence to a unique global optimum, the objective function f does not have a unique global optimum. Indeed, in our problem, if x is a solution, then e^{iφ}x is also a solution. Hence, proper modification is required to prove convergence results.

We prove that if z ∈ E(ε), then for all 0 < μ ≤ 2/β,
$$z_+ = z - \mu \nabla f(z)$$
obeys
$$\mathrm{dist}^2(z_+, x) \le \left( 1 - \frac{2\mu}{\alpha} \right) \mathrm{dist}^2(z, x). \qquad (VII.11)$$
Therefore, if z ∈ E(ε), then we also have z₊ ∈ E(ε). The lemma follows by inductively applying (VII.11). Now simple algebraic manipulations together with the regularity condition (VII.10) give
$$\begin{aligned} \left\|z_+ - x e^{i\phi(z)}\right\|^2 &= \left\|z - x e^{i\phi(z)} - \mu \nabla f(z)\right\|^2 \\ &= \left\|z - x e^{i\phi(z)}\right\|^2 - 2\mu\, \mathrm{Re}\left( \langle \nabla f(z), z - x e^{i\phi(z)} \rangle \right) + \mu^2 \|\nabla f(z)\|^2 \\ &\le \left\|z - x e^{i\phi(z)}\right\|^2 - 2\mu \left( \frac{1}{\alpha} \left\|z - x e^{i\phi(z)}\right\|^2 + \frac{1}{\beta} \|\nabla f(z)\|^2 \right) + \mu^2 \|\nabla f(z)\|^2 \\ &= \left( 1 - \frac{2\mu}{\alpha} \right) \left\|z - x e^{i\phi(z)}\right\|^2 + \mu \left( \mu - \frac{2}{\beta} \right) \|\nabla f(z)\|^2 \\ &\le \left( 1 - \frac{2\mu}{\alpha} \right) \left\|z - x e^{i\phi(z)}\right\|^2, \end{aligned}$$
where the last line follows from μ ≤ 2/β. The definition of φ(z₊) gives
$$\left\|z_+ - x e^{i\phi(z_+)}\right\|^2 \le \left\|z_+ - x e^{i\phi(z)}\right\|^2,$$
which concludes the proof.

E. Proof of the Regularity Condition

For any z ∈ E(ε), we need to show that
$$\mathrm{Re}\left( \langle \nabla f(z), z - x e^{i\phi(z)} \rangle \right) \ge \frac{1}{\alpha} \mathrm{dist}^2(z, x) + \frac{1}{\beta} \|\nabla f(z)\|^2. \qquad (VII.12)$$
We prove this with δ = 0.01 by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining these two properties gives (VII.12).
7 One can see this by applying Cauchy-Schwarz and calculating the deter- Condition 7.11 (Local Curvature Condition): We say that

minant of the resulting quadratic form. the function f satisfies the local curvature condition or

1998 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

LCC(α, , δ) if for all vectors z ∈ E(), We will establish (VII.17) for different measurement models

1 (1 − δ) and different values of . Below, it shall be convenient to use

Re ∇ f (z), z − xe iφ(z)

≥ + dist2 (z, x) the shorthand

α 4

1 ∗ 4

m 5

Yr (h, s) := Re(h∗ ar ar∗ x)2 + 3s Re(h∗ ar ar∗ x)|ar∗ h|2

+ ar (z − eiφ(z) x) . 2

10m 9

+ s 2 |ar∗ h|4 ,

r=1

(VII.13) 10

1

This condition essentially states that the function curves suf- m

ficiently upwards (along most directions) near the curve of Yr (h, s) := Yr (h, s).

m

global optimizers. r=1

Condition 7.12 (Local Smoothness Condition): We say √

1) Proof of (VII.17) With √= 1/8 n in the Gaussian

that the function f satisfies the local smoothness condition and CDP Models: Set = 1/8 n. We show that with high

or L SC(β, , δ) if for all vectors z ∈ E() we have probability, (VII.17) holds for all h satisfying Im(h∗ x) = 0,

∇ f (z)2 h2 = 1, 0 ≤ s ≤ , δ ≤ 0.01, and α ≥ 30. First, note that

1 ∗ 4

m by Cauchy-Schwarz inequality,

(1 − δ) iφ(z)

≤β dist (z, x) +

2

ar (z − e x) .

4 10m

r=1 Yr (h, s)

5

m

(VII.14)

This condition essentially states that the gradient of the func- ≥ Re(h∗ ar ar∗ x)2

2m

r=1

tion is well behaved (the function does not vary too much)

m m

near the curve of global optimizers. 3s

− Re(h ar ar x)

∗ ∗ 2 |ar∗ h|4

m

r=1 r=1

F. Proof of the Local Curvature Condition

s

2 m

For any z ∈ E(), we want to prove the local curvature 9

+ |ar∗ h|4

condition (VII.13). Recall that 10 m

r=1

1

m 2

5 m 9 m

∇ f (z) = |ar , z|2 − yr (ar ar∗ )z, = Re(h ar ar x) − s

∗ ∗ 2 |ar∗ h|4

m 2m 10m

r=1 r=1 r=1

and define h := e−iφ(z) z −

x. To establish (VII.13) it suffices 5

m

9s 2

m

4m 10m

r=1 r=1

1

m

2 Re(h∗ ar ar∗ x)2 + 3 Re(h∗ ar ar∗ x)|ar∗ h|2 + |ar∗ h|4 The last inequality follows from (a − b)2 ≥ a2

− b2 .

m 2

r=1

By Corollary 7.5,

1 ∗ 4

m

1 (1 − δ)

≥ ar h + + h2 , (VII.15) 1

m

1−δ 3

10m α 4 Re(h∗ ar ar∗ x)2 ≥ + Re(x ∗ h)2 (VII.19)

r=1

m 2 2

holds for all h satisfying Im(h∗ x) = 0, h2 ≤ . Equiv- r=1

alently, we only need to prove that for all h satisfying holds with high probability for all h obeying h = 1.

Im(h∗ x) = 0, h2 = 1 and for all s with 0 ≤ s ≤ , Furthermore, by applying Lemma 7.8,

1

m

1 ∗ 4

m m

2 Re(h∗ ar ar∗ x)2 + 3s Re(h∗ ar ar∗ x)|ar∗ h|2 |ar h| ≤ (max ar 2 )

1

|ar∗ h|2 ≤ 6(1 + δ)n

m m r m

r=1

4 1

r=1 r=1

9 (1 − δ)

+ s 2 ar∗ h ≥ + . (VII.16) (VII.20)

10 α 4

By Corollary 7.5, with high probability, holds with high probability. Plugging (VII.19) and (VII.20)

in (VII.18) yields

1

m

1+δ 3

Re(h∗ ar ar∗ x)2 ≤ + Re(x ∗ h)2 , 15 5 27

m

r=1

2 2 Yr (h, s) ≥ Re(x ∗ h)2 + (1 − δ) − s 2 (1 + δ)n.

8 8 5

holds for all h obeying h = 1. Therefore, to establish the

(VII.17) follows by using α ≥ 30, = 8√1 n and δ = 0.01.

local curvature condition (VII.13) it suffices to show that

2) Proof of (VII.17) With = 1/8 in the Gaussian Model:

m

1 5 Set = 1/8. We show that with high probability, (VII.17)

Re(h∗ ar ar∗ x)2 + 3s Re(h∗ ar ar∗ x)|ar∗ h|2

m 2 holds for all h satisfying Im(h∗ x) = 0, h2 = 1, 0 ≤ s ≤ ,

r=1

δ ≤ 2, and α ≥ 8. To this end, we first state a result about

9 2 ∗ 4 1 1 3

+ s ar h ≥ + + Re(x ∗ h)2 . (VII.17) the tail of a sum of i.i.d. random variables. Below, is the

10 α 2 4 cumulative distribution function of a standard normal variable.

Lemma 7.13 [11]: Suppose X₁, X₂, ..., X_m are i.i.d. real-valued random variables obeying X_r ≤ b for some nonrandom b > 0, E[X_r] = 0, and E[X_r²] = v². Setting σ² = m·max(b², v²),
$$\mathbb{P}(X_1 + \ldots + X_m \ge y) \le \min\left( \exp\left( -\frac{y^2}{2\sigma^2} \right),\ c_0 \left( 1 - \Phi(y/\sigma) \right) \right),$$
where one can take c₀ = 25.

To establish (VII.17) we first prove it for a fixed h, and then use a covering argument. Observe that
$$Y_r := Y_r(h, s) = \left( \sqrt{\tfrac{5}{2}}\, \mathrm{Re}(h^* a_r a_r^* x) + s \sqrt{\tfrac{9}{10}}\, |a_r^* h|^2 \right)^2.$$
By Lemma 7.3,
$$\mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x)^2 \right] = \frac{1}{2} + \frac{3}{2} \left( \mathrm{Re}(x^* h) \right)^2 \quad \text{and} \quad \mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x)\, |a_r^* h|^2 \right] = 2\, \mathrm{Re}(x^* h);$$
compare (VII.5) and (VII.6). Therefore, using s ≤ 1/8,
$$\mu_r := \mathbb{E}[Y_r] = \frac{5}{4} \left( 1 + 3\, \mathrm{Re}(x^* h)^2 \right) + 6s\, \mathrm{Re}(x^* h) + \frac{27}{10} s^2 < 6.$$
Now define X_r = μ_r − Y_r. First, since Y_r ≥ 0, X_r ≤ μ_r < 6. Second, we bound E[X_r²] using Lemma 7.3 and Hölder's inequality with s ≤ 1/8:
$$\begin{aligned} \mathbb{E}[X_r^2] \le \mathbb{E}[Y_r^2] &= \frac{25}{4} \mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x)^4 \right] + \frac{81}{100} s^4\, \mathbb{E}\left[ |a_r^* h|^8 \right] + \frac{27}{2} s^2\, \mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x)^2 |a_r^* h|^4 \right] \\ &\quad + 15 s\, \mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x)^3 |a_r^* h|^2 \right] + \frac{27}{5} s^3\, \mathbb{E}\left[ \mathrm{Re}(h^* a_r a_r^* x) |a_r^* h|^6 \right] \\ &\le \frac{25}{4} \sqrt{ \mathbb{E}\left[ |a_r^* h|^8 \right] \mathbb{E}\left[ |a_r^* x|^8 \right] } + \frac{81}{100} s^4\, \mathbb{E}\left[ |a_r^* h|^8 \right] + \frac{27}{2} s^2 \sqrt{ \mathbb{E}\left[ |a_r^* h|^{12} \right] \mathbb{E}\left[ |a_r^* x|^4 \right] } \\ &\quad + 15 s \sqrt{ \mathbb{E}\left[ |a_r^* h|^{10} \right] \mathbb{E}\left[ |a_r^* x|^6 \right] } + \frac{27}{5} s^3 \sqrt{ \mathbb{E}\left[ |a_r^* h|^{14} \right] \mathbb{E}\left[ |a_r^* x|^2 \right] } \\ &< 20 s^4 + 543 s^3 + 513 s^2 + 403 s + 150 < 210. \end{aligned}$$
We have all the elements to apply Lemma 7.13 with σ² = m·max(6², 210) = 210m and y = m/4:
$$\mathbb{P}\left( m \mu_r - \sum_{r=1}^m Y_r \ge \frac{m}{4} \right) \le e^{-2\gamma m},$$
with γ = 1/840. Therefore, with probability at least 1 − e^{−2γm}, we have
$$\begin{aligned} \bar Y(h, s) &\ge \frac{5}{4} \left( 1 + 3\, \mathrm{Re}(x^* h)^2 \right) + 6s\, \mathrm{Re}(x^* h) + 2.7 s^2 - \frac{1}{4} \\ &= \frac{3}{4} + \frac{3}{4} \mathrm{Re}(x^* h)^2 + 3 \left( \mathrm{Re}(x^* h) + s \right)^2 + \left( \frac{1}{4} - \frac{3}{10} s^2 \right) \\ &\ge \frac{3}{4} + \frac{3}{4} \mathrm{Re}(x^* h)^2, \qquad (VII.21) \end{aligned}$$
provided that s ≤ √(5/6). The inequality above holds for a fixed vector h. To prove (VII.17) for all h ∈ ℂⁿ with ‖h‖ = 1, define
$$p_r(h) := \sqrt{\tfrac{5}{2}}\, \mathrm{Re}(h^* a_r a_r^* x) + s \sqrt{\tfrac{9}{10}}\, |a_r^* h|^2.$$
Using the fact that max_r ‖a_r‖ ≤ √(6n) and s ≤ 1/8, we have |p_r(h)| ≤ 2|a_r*h||a_r*x| + s|a_r*h|² ≤ 13n. Moreover, for any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1,
$$|p_r(u) - p_r(v)| \le \sqrt{\tfrac{5}{2}}\, \left| \mathrm{Re}\left( (u - v)^* a_r a_r^* x \right) \right| + s \sqrt{\tfrac{9}{10}}\, |a_r^* (u + v)|\, |a_r^* (u - v)| \le \frac{27}{2}\, n\, \|u - v\|.$$
Introduce
$$q(h) := \frac{1}{m} \sum_{r=1}^m p_r(h)^2 - \frac{3}{4} \mathrm{Re}(x^* h)^2 = \bar Y(h, s) - \frac{3}{4} \mathrm{Re}(x^* h)^2.$$
For any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1,
$$\begin{aligned} |q(u) - q(v)| &= \left| \frac{1}{m} \sum_{r=1}^m \left( p_r(u) - p_r(v) \right)\left( p_r(u) + p_r(v) \right) - \frac{3}{4} \mathrm{Re}\left( x^* (u - v) \right) \mathrm{Re}\left( x^* (u + v) \right) \right| \\ &\le \frac{27n}{2} \times 2 \times 13n\, \|u - v\| + \frac{3}{2} \|u - v\| = \left( 351 n^2 + \frac{3}{2} \right) \|u - v\|. \qquad (VII.22) \end{aligned}$$
Therefore, for any u, v ∈ ℂⁿ obeying ‖u‖ = ‖v‖ = 1 and ‖u − v‖ ≤ η := 1/(6000n²), we have
$$q(v) \ge q(u) - \frac{1}{16}. \qquad (VII.23)$$
Let N_η be an η-net for the unit sphere of ℂⁿ with cardinality obeying |N_η| ≤ (1 + 2/η)^{2n}. Applying (VII.21) together with the union bound, we conclude that
$$\mathbb{P}\left( q(u) \ge \frac{3}{4} \ \text{for all} \ u \in N_\eta \right) \ge 1 - |N_\eta|\, e^{-2\gamma m} \ge 1 - \left( 1 + 12000 n^2 \right)^{2n} e^{-2\gamma m} \ge 1 - e^{-\gamma m}. \qquad (VII.24)$$

The last line follows by choosing m such that m ≥ c · n log n, where c is a sufficiently large constant. Now for any h on the unit sphere of ℂⁿ, there exists a vector u ∈ N_η such that ‖h − u‖ ≤ η. By combining (VII.23) and (VII.24),
$$q(h) \ge \frac{3}{4} - \frac{1}{16} > \frac{5}{8} \ \Longrightarrow\ \bar Y(h, s) \ge \frac{1}{8} + \frac{1}{2} + \frac{3}{4} \mathrm{Re}(x^* h)^2$$
holds with probability at least 1 − e^{−γm}. This concludes the proof of (VII.17) with α ≥ 8.

G. Proof of the Local Smoothness Condition

For any z ∈ E(ε), we want to prove (VII.14), which is equivalent to proving that for all u ∈ ℂⁿ obeying ‖u‖ = 1, we have
$$\left| (\nabla f(z))^* u \right|^2 \le \beta \left( \frac{1 - \delta}{4} \mathrm{dist}^2(z, x) + \frac{1}{10m} \sum_{r=1}^m \left| a_r^* (z - e^{i\phi(z)} x) \right|^4 \right).$$
Recall that
$$\nabla f(z) = \frac{1}{m} \sum_{r=1}^m \left( |\langle a_r, z \rangle|^2 - y_r \right) (a_r a_r^*) z,$$
and define
$$\begin{aligned} g(h, w, s) &= \frac{1}{m} \sum_{r=1}^m 2\, \mathrm{Re}(h^* a_r a_r^* x)\, \mathrm{Re}(w^* a_r a_r^* x) + \frac{1}{m} \sum_{r=1}^m s\, |a_r^* h|^2\, \mathrm{Re}(w^* a_r a_r^* x) \\ &\quad + \frac{1}{m} \sum_{r=1}^m 2s\, \mathrm{Re}(h^* a_r a_r^* x)\, \mathrm{Re}(w^* a_r a_r^* h) + \frac{1}{m} \sum_{r=1}^m s^2 |a_r^* h|^2\, \mathrm{Re}(w^* a_r a_r^* h). \end{aligned}$$
Define h := e^{−iφ(z)}z − x and w := e^{−iφ(z)}u. To establish (VII.14) it suffices to prove that
$$|g(h, w, 1)|^2 \le \beta \left( \frac{1 - \delta}{4} \|h\|^2 + \frac{1}{10m} \sum_{r=1}^m |a_r^* h|^4 \right) \qquad (VII.25)$$
holds for all h and w satisfying Im(h*x) = 0, ‖h‖ ≤ ε and ‖w‖ = 1. Equivalently, we only need to prove that for all h and w satisfying Im(h*x) = 0, ‖h‖ = ‖w‖ = 1 and for all s with 0 ≤ s ≤ ε,
$$|g(h, w, s)|^2 \le \beta \left( \frac{1 - \delta}{4} + \frac{s^2}{10m} \sum_{r=1}^m |a_r^* h|^4 \right). \qquad (VII.26)$$
Note that since (a + b + c)² ≤ 3(a² + b² + c²),
$$\begin{aligned} |g(h, w, s)|^2 &\le \left( \frac{1}{m} \sum_{r=1}^m 2 |h^* a_r| |w^* a_r| |a_r^* x|^2 + \frac{1}{m} \sum_{r=1}^m 3s |h^* a_r|^2 |a_r^* x| |w^* a_r| + \frac{1}{m} \sum_{r=1}^m s^2 |a_r^* h|^3 |w^* a_r| \right)^2 \\ &\le 3 \left( \frac{2}{m} \sum_{r=1}^m |h^* a_r| |w^* a_r| |a_r^* x|^2 \right)^2 + 3 \left( \frac{3s}{m} \sum_{r=1}^m |h^* a_r|^2 |a_r^* x| |w^* a_r| \right)^2 + 3 \left( \frac{s^2}{m} \sum_{r=1}^m |a_r^* h|^3 |w^* a_r| \right)^2 \\ &:= 3 (I_1 + I_2 + I_3). \qquad (VII.27) \end{aligned}$$
We now bound each of the terms on the right-hand side. For the first term we use Cauchy-Schwarz and Corollary 7.6, which give
$$I_1 \le 4 \left( \frac{1}{m} \sum_{r=1}^m |a_r^* x|^2 |a_r^* w|^2 \right) \left( \frac{1}{m} \sum_{r=1}^m |a_r^* x|^2 |a_r^* h|^2 \right) \le 4 (2 + \delta)^2. \qquad (VII.28)$$
Similarly, for the second term, we have
$$I_2 \le 9 s^2 \left( \frac{1}{m} \sum_{r=1}^m |a_r^* h|^4 \right) \left( \frac{1}{m} \sum_{r=1}^m |a_r^* w|^2 |a_r^* x|^2 \right) \le \frac{9 s^2 (2 + \delta)}{m} \sum_{r=1}^m |a_r^* h|^4. \qquad (VII.29)$$
Finally, for the third term we use the Cauchy-Schwarz inequality together with Lemma 7.8 (inequality (VII.9)) to derive
$$\begin{aligned} I_3 &\le s^4 \left( \max_r \|a_r\| \right)^2 \left( \frac{1}{m} \sum_{r=1}^m |a_r^* h|^3 \right)^2 \le 6n\, s^4 \left( \frac{1}{m} \sum_{r=1}^m |a_r^* h|^4 \right) \left( \frac{1}{m} \sum_{r=1}^m |a_r^* h|^2 \right) \\ &\le \frac{6n(1 + \delta) s^4}{m} \sum_{r=1}^m |a_r^* h|^4. \qquad (VII.30) \end{aligned}$$
We now plug these inequalities into (VII.27) and get
$$|g(h, w, s)|^2 \le 12 (2 + \delta)^2 + \frac{27 s^2 (2 + \delta)}{m} \sum_{r=1}^m |a_r^* h|^4 + \frac{18 s^4 n (1 + \delta)}{m} \sum_{r=1}^m |a_r^* h|^4 \le \beta \left( \frac{1 - \delta}{4} + \frac{s^2}{10m} \sum_{r=1}^m |a_r^* h|^4 \right), \qquad (VII.31)$$

CANDÈS et al.: PHASE RETRIEVAL VIA WIRTINGER FLOW: THEORY AND ALGORITHMS 2001

which completes the proof of (VII.26) and, in turn, establishes the local smoothness condition in (VII.14). However, the last line of (VII.31) holds as long as

\[
\beta \ge \max\left\{48\,\frac{(2+\delta)^2}{(1-\delta)^2},\; 270(2+\delta) + 180\,\epsilon^2 n(1+\delta)\right\}. \tag{VII.32}
\]

In our theorems we use two different values of ε: ε = 1/(8√n) and ε = 1/8. Using δ ≤ 0.01 in (VII.32) we conclude that the local smoothness condition (VII.31) holds as long as

β ≥ 550 for ε = 1/(8√n),
β ≥ 3n + 550 for ε = 1/8.

H. Wirtinger Flow Initialization

In this section, we prove that the WF initialization obeys (III.1) from Theorem 3.3. Recall that

\[
Y := \frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^*,
\]

and that Lemma 7.4 gives

\[
\bigl\|Y - (xx^* + \|x\|^2 I)\bigr\| \le \delta := 0.001.
\]

Let z̃_0 be the eigenvector corresponding to the top eigenvalue λ_0 of Y obeying ‖z̃_0‖ = 1. It is easy to see that

\[
\bigl|\lambda_0 - (|\tilde z_0^* x|^2 + 1)\bigr| = \bigl|\tilde z_0^* Y \tilde z_0 - \tilde z_0^*(xx^* + I)\tilde z_0\bigr| = \bigl|\tilde z_0^*\bigl(Y - (xx^* + I)\bigr)\tilde z_0\bigr| \le \bigl\|Y - (xx^* + I)\bigr\| \le \delta.
\]

Therefore,

\[
|\tilde z_0^* x|^2 \ge \lambda_0 - 1 - \delta.
\]

Also, since λ_0 is the top eigenvalue of Y and ‖x‖ = 1, we have

\[
\lambda_0 \ge x^* Y x = x^*\bigl(Y - (I + xx^*)\bigr)x + 2 \ge 2 - \delta.
\]

Combining the last two inequalities gives

\[
|\tilde z_0^* x|^2 \ge 1 - 2\delta \;\Rightarrow\; \mathrm{dist}^2(\tilde z_0, x) \le 2 - 2\sqrt{1 - 2\delta} < \frac{1}{256} \;\Rightarrow\; \mathrm{dist}(\tilde z_0, x) \le \frac{1}{16}.
\]

Recall that z_0 = \sqrt{\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2}\; \tilde z_0. By Lemma 7.8, equation (VII.9), with high probability we have

\[
\bigl|\|z_0\|^2 - 1\bigr| = \left|\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2 - 1\right| \le \frac{31}{256} \;\Rightarrow\; \bigl|\|z_0\| - 1\bigr| \le \frac{1}{16}.
\]

Therefore, we have

\[
\mathrm{dist}(z_0, x) \le \|z_0 - \tilde z_0\| + \mathrm{dist}(\tilde z_0, x) = \bigl|\|z_0\| - 1\bigr| + \mathrm{dist}(\tilde z_0, x) \le \frac{1}{8}.
\]

I. Initialization via Resampled Wirtinger Flow

In this section, we prove that the output of Algorithm 2 obeys (V.1) from Theorem 5.1. Introduce

\[
F(z) = \frac{1}{2}\, z^*(I - xx^*)z + \frac{1}{2}\bigl(\|z\|^2 - 1\bigr)^2.
\]

By Lemma 7.2, if z ∈ C^n is a vector independent from the measurements, then

\[
\mathbb{E}[\nabla f(z; b)] = \nabla F(z).
\]

We prove that F obeys a regularization condition in E(1/8), namely,

\[
\mathrm{Re}\bigl(\langle \nabla F(z),\, z - x e^{i\phi(z)}\rangle\bigr) \ge \frac{1}{\alpha}\,\mathrm{dist}^2(z, x) + \frac{1}{\beta}\,\|\nabla F(z)\|^2. \tag{VII.33}
\]

Lemma 7.7 implies that for a fixed vector z,

\[
\mathrm{Re}\bigl(\langle \nabla f(z; b),\, z - x e^{i\phi(z)}\rangle\bigr) = \mathrm{Re}\bigl(\langle \nabla F(z),\, z - x e^{i\phi(z)}\rangle\bigr) + \mathrm{Re}\bigl(\langle \nabla f(z; b) - \nabla F(z),\, z - x e^{i\phi(z)}\rangle\bigr)
\ge \mathrm{Re}\bigl(\langle \nabla F(z),\, z - x e^{i\phi(z)}\rangle\bigr) - \|\nabla f(z; b) - \nabla F(z)\|\,\mathrm{dist}(z, x)
\ge \mathrm{Re}\bigl(\langle \nabla F(z),\, z - x e^{i\phi(z)}\rangle\bigr) - \delta\,\mathrm{dist}(z, x)^2
\ge \left(\frac{1}{\alpha} - \delta\right)\mathrm{dist}(z, x)^2 + \frac{1}{\beta}\,\|\nabla F(z)\|^2, \tag{VII.34}
\]

holds with high probability. The last inequality follows from (VII.33). Applying Lemma 7.7, we also have

\[
\|\nabla F(z)\|^2 \ge \frac{1}{2}\,\|\nabla f(z; b)\|^2 - \|\nabla f(z; b) - \nabla F(z)\|^2 \ge \frac{1}{2}\,\|\nabla f(z; b)\|^2 - \delta^2\,\mathrm{dist}(z, x)^2.
\]

Plugging the latter into (VII.34) yields

\[
\mathrm{Re}\bigl(\langle \nabla f(z; b),\, z - x e^{i\phi(z)}\rangle\bigr) \ge \left(\frac{1}{\alpha} - \frac{\delta^2}{\beta} - \delta\right)\mathrm{dist}^2(z, x) + \frac{1}{2\beta}\,\|\nabla f(z; b)\|^2 := \frac{1}{\tilde\alpha}\,\mathrm{dist}^2(z, x) + \frac{1}{\tilde\beta}\,\|\nabla f(z; b)\|^2.
\]

Therefore, using the general convergence analysis of gradient descent discussed in Section VII-D we conclude that for all μ̃ ≤ μ̃_0 := 2/β̃,

\[
\mathrm{dist}^2(u_{b+1}, x) \le \left(1 - \frac{2\tilde\mu}{\tilde\alpha}\right)\mathrm{dist}^2(u_b, x).
\]
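The spectral initialization analyzed in Section H above can be illustrated numerically. The sketch below is not part of the paper's proof; the dimensions, random seed, and the 0.4 acceptance threshold are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 10_000

# Rows of A play the role of a_r^*, so (A @ x)[r] = a_r^* x.
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)                       # ground truth with ||x|| = 1
y = np.abs(A @ x) ** 2                       # phaseless data y_r = |a_r^* x|^2

# Y = (1/m) sum_r y_r a_r a_r^*, which concentrates around x x^* + I.
Y = (A.conj().T * y) @ A / m
z_tilde = np.linalg.eigh(Y)[1][:, -1]        # unit-norm top eigenvector
z0 = np.sqrt(np.mean(y)) * z_tilde           # rescale as in Section H

def dist(z, x):
    """dist(z, x) = min over phi of ||z - exp(i*phi) x||."""
    return np.sqrt(np.linalg.norm(z) ** 2 + np.linalg.norm(x) ** 2
                   - 2 * np.abs(np.vdot(x, z)))

print(dist(z0, x))
```

With m/n this large, the printed distance is small compared with ‖x‖ = 1, in line with (though far from a proof of) the 1/8 bound established above.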

2002 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

Thus, for

\[
B \ge -\frac{\log n}{\log\left(1 - \frac{2\tilde\mu}{\tilde\alpha}\right)}
\]

we obtain

\[
\mathrm{dist}(u_B, x) \le \left(1 - \frac{2\tilde\mu}{\tilde\alpha}\right)^{B/2}\mathrm{dist}(u_0, x) \le \left(1 - \frac{2\tilde\mu}{\tilde\alpha}\right)^{B/2}\cdot\frac{1}{8} \le \frac{1}{8\sqrt n}.
\]

It only remains to establish (VII.33). First, without loss of generality, we can assume that φ(z) = 0, which implies Re(z^* x) = |z^* x|, and use ‖z − x‖ in lieu of dist(z, x). Set h := z − x so that Im(x^* h) = 0. This implies

\[
\nabla F(z) = (I - xx^*)z + 2\bigl(\|z\|^2 - 1\bigr)z = (I - xx^*)(x + h) + 2\bigl(\|x + h\|^2 - 1\bigr)(x + h) = (I - xx^*)h + 2\bigl(2\,\mathrm{Re}(x^* h) + \|h\|^2\bigr)(x + h) = \bigl(1 + 4(x^* h) + 2\|h\|^2\bigr)h + \bigl(3(x^* h) + 2\|h\|^2\bigr)x.
\]

Therefore,

\[
\|\nabla F(z)\| \le 4\|h\| + 6\|h\|^2 + 2\|h\|^3 \le 5\|h\|, \tag{VII.35}
\]

where the last inequality is due to ‖h‖ ≤ ε ≤ 1/8. Furthermore,

\[
\mathrm{Re}\bigl(\langle \nabla F(z),\, z - x\rangle\bigr) = \mathrm{Re}\bigl(\langle (1 + 4(x^* h) + 2\|h\|^2)h + (3(x^* h) + 2\|h\|^2)x,\; h\rangle\bigr) = \|h\|^2 + 2\|h\|^4 + 6\|h\|^2 (x^* h) + 3(x^* h)^2 \ge \frac{1}{4}\|h\|^2, \tag{VII.36}
\]

where the last inequality also holds because ‖h‖ ≤ ε ≤ 1/8. Finally, (VII.35) and (VII.36) imply

\[
\mathrm{Re}\bigl(\langle \nabla F(z),\, z - x\rangle\bigr) \ge \frac{1}{4}\|h\|^2 \ge \frac{1}{8}\|h\|^2 + \frac{1}{200}\|\nabla F(z)\|^2 := \frac{1}{\alpha}\|h\|^2 + \frac{1}{\beta}\|\nabla F(z)\|^2,
\]

where α = 8 and β = 200.

APPENDIX A
EXPECTATIONS AND DEVIATIONS

We provide here the proofs of our intermediate results. Throughout this section we use D ∈ C^{n×n} to denote a diagonal random matrix with diagonal elements being i.i.d. samples from an admissible distribution d (recall the definition (IV.2)), and we shall rewrite (IV.1) in the form

\[
y_r = \Bigl|\sum_{t=0}^{n-1} x[t]\,\bar d_\ell(t)\, e^{-i2\pi k t/n}\Bigr|^2 = \bigl|f_k^* D_\ell^* x\bigr|^2, \qquad r = (\ell, k),\quad 0 \le k \le n-1,\quad 1 \le \ell \le L,
\]

where f_k^* is the kth row of the n × n DFT matrix and D_ℓ is a diagonal matrix with diagonal entries given by d_ℓ(0), d_ℓ(1), …, d_ℓ(n−1). In our model, the matrices D_ℓ are i.i.d. copies of D.

A. Proof of Lemma 7.1

The proof for admissible coded diffraction patterns follows from [16, Lemmas 3.1 and 3.2]. For the Gaussian model, it is a consequence of the two lemmas below, whose proofs are omitted.

Lemma A.1: Suppose the sequence {a_r} follows the Gaussian model. Then for any fixed vector x ∈ C^n,

\[
\mathbb{E}\left[\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^*\right] = xx^* + \|x\|^2 I.
\]

Lemma A.2: Suppose the sequence {a_r} follows the Gaussian model. Then for any fixed vector x ∈ C^n,

\[
\mathbb{E}\left[\frac{1}{m}\sum_{r=1}^{m}(a_r^* x)^2\, a_r a_r^{T}\right] = 2\,xx^{T}.
\]

B. Proof of Lemma 7.2

Recall that

\[
\nabla f(z) = \frac{1}{m}\sum_{r=1}^{m}\bigl(|\langle a_r, z\rangle|^2 - y_r\bigr)(a_r a_r^*)z = \frac{1}{m}\sum_{r=1}^{m}\bigl(|\langle a_r, z\rangle|^2 - |\langle a_r, x\rangle|^2\bigr)(a_r a_r^*)z.
\]

Thus, by applying [16, Lemma 3.1] (for the CDP model) and Lemma A.1 above (for the Gaussian model), we have

\[
\mathbb{E}[\nabla f(z)] = \mathbb{E}\left[\frac{1}{m}\sum_{r=1}^{m}\bigl(|a_r^* z|^2\, a_r a_r^* z - |a_r^* x|^2\, a_r a_r^* z\bigr)\right] = (zz^* + \|z\|^2 I)z - (xx^* + I)z = 2\bigl(\|z\|^2 - 1\bigr)z + (I - xx^*)z.
\]
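The two expectation identities of Lemma A.1 and Lemma A.2 are easy to check by Monte Carlo simulation. The sketch below is purely illustrative (sizes, seed, and the 0.1 tolerance are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 400_000

# a_r ~ N(0, I/2) + i N(0, I/2), stored one vector per row.
a = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)

ax = a.conj() @ x                     # ax[r] = a_r^* x

# Lemma A.1: E (1/m) sum_r |a_r^* x|^2 a_r a_r^*  =  x x^* + ||x||^2 I
S1 = (a.T * np.abs(ax) ** 2) @ a.conj() / m
T1 = np.outer(x, x.conj()) + np.eye(n)

# Lemma A.2: E (1/m) sum_r (a_r^* x)^2 a_r a_r^T  =  2 x x^T
S2 = (a.T * ax ** 2) @ a / m
T2 = 2 * np.outer(x, x)

dev1 = np.linalg.norm(S1 - T1, 2)
dev2 = np.linalg.norm(S2 - T2, 2)
print(dev1, dev2)
```

Both spectral-norm deviations shrink like O(√(n/m)), consistent with the concentration statements used throughout this appendix.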


s_1, s_2 are positive real numbers obeying s_1^2 + s_2^2 = 1. We have

\[
\mathbb{E}\bigl[\mathrm{Re}(u^* a a^* v)^2\bigr] = \mathbb{E}\Bigl[\bigl(\mathrm{Re}(s_1 e^{i\phi_1}|a_1|^2 + s_2 e^{i\phi_2} a_1 \bar a_2)\bigr)^2\Bigr] = 2 s_1^2 \cos^2(\phi_1) + \frac{1}{2}\, s_2^2 = \frac{1}{2} + \frac{3}{2}\, s_1^2 \cos^2(\phi_1) - \frac{1}{2}\, s_1^2 \sin^2(\phi_1) = \frac{1}{2} + \frac{3}{2}\bigl(\mathrm{Re}(u^* v)\bigr)^2 - \frac{1}{2}\bigl(\mathrm{Im}(u^* v)\bigr)^2,
\]

and

\[
\mathbb{E}\bigl[\mathrm{Re}(u^* a a^* v)\,|a^* v|^2\bigr] = \mathbb{E}\bigl[\mathrm{Re}(s_1 e^{-i\phi_1}|a_1|^2 + s_2 e^{-i\phi_2} \bar a_1 a_2)\,|a_1|^2\bigr] = 2 s_1 \cos(\phi_1) = 2\,\mathrm{Re}(u^* v).
\]

The identity (VII.7) follows from standard normal moment calculations.

D. Proof of Lemma 7.4

1) The CDP Model: Write the Hessian as

\[
Y := \nabla^2 f(x) = \frac{1}{nL}\sum_{\ell=1}^{L}\sum_{k=1}^{n} W_k(D_\ell),
\]

where

\[
W_k(D) := \begin{bmatrix} D & 0 \\ 0 & D^* \end{bmatrix}\begin{bmatrix} A_k(D) & B_k(D) \\ \bar B_k(D) & \bar A_k(D) \end{bmatrix}\begin{bmatrix} D^* & 0 \\ 0 & D \end{bmatrix}
\]

and

\[
A_k(D) = |f_k^* D^* x|^2\, f_k f_k^*, \qquad B_k(D) = (f_k^* D^* x)^2\, f_k f_k^{T}.
\]

It is useful to recall that

\[
\mathbb{E}[Y] = \begin{bmatrix} I + xx^* & 2xx^{T} \\ 2\bar x x^* & I + \bar x x^{T} \end{bmatrix}.
\]

Now set

\[
\hat Y := \frac{1}{nL}\sum_{\ell=1}^{L}\sum_{k=1}^{n} W_k(D_\ell)\,\mathbf{1}_{\{|f_k^* D_\ell^* x| \le \sqrt{2R\log n}\}},
\]

where R is a positive scalar whose value will be determined shortly, and define the events

E_1(R) = {‖Ŷ − E[Y]‖ ≤ δ},
E_2(R) = {Ŷ = Y},
E_3(R) = ∩_{k,ℓ} {|f_k^* D_ℓ^* x| ≤ √(2R log n)},
E = {‖Y − E[Y]‖ ≤ δ}.

Note that E_1 ∩ E_2 ⊂ E. Also, if |f_k^* D_ℓ^* x| ≤ √(2R log n) for all pairs (k, ℓ), then Ŷ = Y and thus E_3 ⊂ E_2. Putting all of this together gives

\[
\mathbb{P}(E^c) \le \mathbb{P}(E_1^c \cup E_2^c) \le \mathbb{P}(E_1^c) + \mathbb{P}(E_2^c) \le \mathbb{P}(E_1^c) + \mathbb{P}(E_3^c) \le \mathbb{P}(E_1^c) + \sum_{\ell=1}^{L}\sum_{k=1}^{n}\mathbb{P}\bigl(|f_k^* D_\ell^* x| > \sqrt{2R\log n}\bigr). \tag{A.1}
\]

A slight modification to [16, Lemma 3.9] gives P(E_1^c) ≤ 1/n^3 provided L ≥ c(R) log^3 n for a sufficiently large numerical constant c(R). Since Hoeffding's inequality yields P(|f_k^* D_ℓ^* x| > √(2R log n)) ≤ 2n^{−R}, we have

\[
\mathbb{P}(E^c) \le n^{-3} + 2(nL)\, n^{-R}.
\]

Setting R = 4 completes the proof.

It is useful to recall that

∗

with probability at least 1 − n −2 . Denote by E 0 the event

Y = I 2+x̄xx∗x 2x x T

I + x̄ x T

. on which the above inequalities hold. We show that there is

another event E 1 of high probability such that (A.2) and (A.3)

Now set hold on E 0 ∩ E 1 . Our proof strategy is similar to that

1 of [64, Th. 39]. To prove (A.2), we will show that with high

L n

$

Y= W k ( D ){| f ∗ D∗ x|≤√2R log n} , probability, for any y ∈ n obeying y = 1, we have

nL k

=1 k=1

∗ 1

m

∗

where R is a positive scalar whose value will be determined I0 ( y) := y |ar (1)| ar ar − I + e1 e1

2 T

y

m

shortly, and define the events r=1

1 m

2

Y − Y ≤ },

E 1 (R) = {$ δ

= |ar (1)|2 ar∗ y − (1+| y(1)|2 ) ≤ . (A.4)

m 4

E 2 (R) = {$

Y = Y }, r=1

%& ' (

E 3 (R) = f ∗ D ∗ x ≤ 2R log n , For this purpose, partition y in the form y = ( y(1), ỹ) with

k

k, ỹ ∈ n−1 , and decompose the inner product as

E = {Y − Y ≤ }. ∗ 2

√ a y = |ar (1)|2 | y(1)|2 +2 Re ã∗ ỹar (1) y(1) + ã∗ ỹ2 .

r r r

Note that E 1 ∩ E 2 ⊂ E. Also, if f ∗k D ∗ x ≤ 2R log n for

all pairs (k, ), then $

Y = Y and thus E 3 ⊂ E 2 . Putting all of This gives

this together gives 1 m

) I0 ( y) = (|ar (1)|4 − 2) | y(1)|2

(E c ) ≤ (E1c E2c ) m

r=1

≤ (E 1c ) + (E 2c ) 1

m

∗

+ 2 Re |ar (1)| ar (1) y(1) ãr ỹ

2

≤ (E 1c ) + (E 3c ) m

r=1

L0

' 1

n m

≤ (E 1c )+ ( f ∗k D∗ x > 2R log n). (A.1) + 2 ∗ 2 2

|ar (1)| ãr ỹ − ỹ ,

m

=1 k=1 r=1

2004 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 4, APRIL 2015

which follows from | y(1)|2 + ỹ2 = 1 since y has unit norm. E. Proof of Corollary 7.5
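A Monte Carlo illustration of the concentration claimed in (A.2) is straightforward (this is illustrative only; the dimensions, sample size, and 0.3 tolerance are arbitrary and much looser than the δ/4 of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 50_000

# a_r ~ N(0, I/2) + i N(0, I/2), stored one vector per row.
a = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)

# (1/m) sum_r |a_r(1)|^2 a_r a_r^* should be close to I + e_1 e_1^T.
w = np.abs(a[:, 0]) ** 2
S = (a.T * w) @ a.conj() / m
T = np.eye(n)
T[0, 0] = 2.0

dev = np.linalg.norm(S - T, 2)
print(dev)
```

Increasing m drives the printed spectral-norm deviation toward zero, mirroring the m ≳ n log n requirement of the lemma.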

This gives

\[
I_0(y) \le \left|\frac{1}{m}\sum_{r=1}^{m}\bigl(|a_r(1)|^2 - 1\bigr)\right|\|\tilde y\|^2 + \left|\frac{1}{m}\sum_{r=1}^{m}\bigl(|a_r(1)|^4 - 2\bigr)\right||y(1)|^2 + 2\left|\frac{1}{m}\sum_{r=1}^{m}|a_r(1)|^2\, a_r(1)\,\bar y(1)\;\tilde a_r^*\tilde y\right| + \left|\frac{1}{m}\sum_{r=1}^{m}|a_r(1)|^2\bigl(|\tilde a_r^*\tilde y|^2 - \|\tilde y\|^2\bigr)\right|
\le 2\epsilon + 2\left|\frac{1}{m}\sum_{r=1}^{m}\bigl(|a_r(1)|^2\, a_r(1)\,\bar y(1)\bigr)\tilde a_r^*\tilde y\right| + \left|\frac{1}{m}\sum_{r=1}^{m}|a_r(1)|^2\bigl(|\tilde a_r^*\tilde y|^2 - \|\tilde y\|^2\bigr)\right|, \tag{A.5}
\]

where the last inequality holds on the event E_0. We now turn our attention to the last two terms of (A.5). For the second term, the ordinary Hoeffding inequality ([64], Proposition 10) gives that for any constants δ_0 and γ, there exists a constant C(δ_0, γ) such that for

\[
m \ge C(\delta_0, \gamma)\sqrt{n\sum_{r=1}^{m}|a_r(1)|^6},
\]

the bound

\[
\left|\frac{1}{m}\sum_{r=1}^{m}\bigl(|a_r(1)|^2\, a_r(1)\,\bar y(1)\bigr)\tilde a_r^*\tilde y\right| \le \delta_0\,|y(1)|\,\|\tilde y\| \le \delta_0
\]

holds with probability at least 1 − 3e^{−2γn}. To control the final term, we apply the Bernstein-type inequality ([64], Proposition 16) to assert the following: for any positive constants δ_0 and γ, there exists a constant C(δ_0, γ) such that for

\[
m \ge C(\delta_0, \gamma)\left(\sqrt{n\sum_{r=1}^{m}|a_r(1)|^4} + n\max_{1\le r\le m}|a_r(1)|^2\right),
\]

the bound

\[
\left|\frac{1}{m}\sum_{r=1}^{m}|a_r(1)|^2\bigl(|\tilde a_r^*\tilde y|^2 - \|\tilde y\|^2\bigr)\right| \le \delta_0\,\|\tilde y\|^2 \le \delta_0
\]

holds with probability at least 1 − 2e^{−2γn}. Therefore, for any unit-norm vector y,

\[
I_0(y) \le 2\epsilon + 2\delta_0 \tag{A.6}
\]

holds with probability at least 1 − 5e^{−2γn}. By [64, Lemma 5.4], we can bound the operator norm via a net argument:

\[
\max_{y\in\mathbb{C}^n,\,\|y\|=1} I_0(y) \le 2\max_{y\in\mathcal N} I_0(y) \le 4\epsilon + 4\delta_0,
\]

where N is a 1/4-net of the unit sphere of C^n. For suitable choices of δ_0 and γ, (A.2) holds with probability at least 1 − 5e^{−γn}, as long as

\[
m \ge C\left(\sqrt{n\sum_{r=1}^{m}|a_r(1)|^6} + \sqrt{n\sum_{r=1}^{m}|a_r(1)|^4} + n\max_{1\le r\le m}|a_r(1)|^2\right).
\]

On E_0 this inequality follows from m ≥ C · n log n provided C is sufficiently large. In conclusion, (A.2) holds with probability at least 1 − 5e^{−γn} − 4n^{−2}.

The proof of (A.3) is similar. The only difference is that the random matrix is not Hermitian, so we work with

\[
I_0(u, v) = \left|u^*\left(\frac{1}{m}\sum_{r=1}^{m}\bar a_r(1)^2\, a_r a_r^{T} - 2\, e_1 e_1^{T}\right)\bar v\right|,
\]

where u and v are unit vectors.

E. Proof of Corollary 7.5

It follows from ‖∇^2 f(x) − E[∇^2 f(x)]‖ ≤ δ that ∇^2 f(x) ⪯ E[∇^2 f(x)] + δI. Therefore, using the fact that for any complex scalar c, Re(c)^2 = (1/2)|c|^2 + (1/2)Re(c^2), we have

\[
\frac{1}{m}\sum_{r=1}^{m}\mathrm{Re}(h^* a_r a_r^* x)^2 = \frac{1}{4}\begin{bmatrix} h \\ \bar h\end{bmatrix}^*\left(\frac{1}{m}\sum_{r=1}^{m}\begin{bmatrix} |a_r^* x|^2\, a_r a_r^* & (a_r^* x)^2\, a_r a_r^{T} \\ \overline{(a_r^* x)^2}\,\bar a_r a_r^* & |a_r^* x|^2\,\bar a_r a_r^{T}\end{bmatrix}\right)\begin{bmatrix} h \\ \bar h\end{bmatrix}
\le \frac{1}{4}\begin{bmatrix} h \\ \bar h\end{bmatrix}^*\left(I_{2n} + \frac{3}{2}\begin{bmatrix} x \\ \bar x\end{bmatrix}\begin{bmatrix} x \\ \bar x\end{bmatrix}^* - \frac{1}{2}\begin{bmatrix} x \\ -\bar x\end{bmatrix}\begin{bmatrix} x \\ -\bar x\end{bmatrix}^*\right)\begin{bmatrix} h \\ \bar h\end{bmatrix} + \frac{\delta}{4}\begin{bmatrix} h \\ \bar h\end{bmatrix}^*\begin{bmatrix} h \\ \bar h\end{bmatrix}
= \frac{1}{2}\|h\|^2 + \frac{3}{2}\,\mathrm{Re}(x^* h)^2 - \frac{1}{2}\,\mathrm{Im}(x^* h)^2 + \frac{\delta}{2}\|h\|^2.
\]

The other inequality is established in a similar fashion.

F. Proof of Corollary 7.6

In the proof of Lemma 7.4, we established that with high probability,

\[
\left\|\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^* - \bigl(xx^* + \|x\|^2 I\bigr)\right\| \le \delta.
\]

Therefore,

\[
\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^* \succeq xx^* + \|x\|^2 I - \delta I.
\]

This concludes the proof of one side. The other side is similar.

G. Proof of Lemma 7.7

Note that

\[
\|\nabla f(z) - \mathbb{E}[\nabla f(z)]\| = \max_{u\in\mathbb{C}^n,\,\|u\|=1}\;\langle u,\, \nabla f(z) - \mathbb{E}[\nabla f(z)]\rangle.
\]

Therefore, to establish the concentration of ∇f(z) around its mean we proceed by bounding |⟨u, ∇f(z) − E[∇f(z)]⟩|. From Section VII-B,

\[
\nabla f(z) = \frac{1}{m}\sum_{r=1}^{m}\bigl(|\langle a_r, z\rangle|^2 - y_r\bigr)(a_r a_r^*)z.
\]

Defining h := e^{−iφ(z)} z − x and w := e^{−iφ(z)} u, we have

\[
\langle u,\, \nabla f(z)\rangle = \frac{1}{m}\sum_{r=1}^{m}\Bigl[\, w^*(a_r^* x)^2\, a_r a_r^{T}\bar h + w^*|a_r^* x|^2\, a_r a_r^* h + 2\, w^*|a_r^* h|^2\, a_r a_r^* x + w^*(a_r^* h)^2\, a_r a_r^{T}\bar x + w^*|a_r^* h|^2\, a_r a_r^* h\,\Bigr]. \tag{A.7}
\]

By Lemma 7.2 we also have

\[
\langle u,\, \mathbb{E}[\nabla f(z)]\rangle = 2\, w^* xx^{T}\bar h + w^*\bigl(xx^* + \|x\|^2 I\bigr)h + 2\, w^*\bigl(hh^* + \|h\|^2 I\bigr)x + 2\, w^* hh^{T}\bar x + w^*\bigl(hh^* + \|h\|^2 I\bigr)h. \tag{A.8}
\]
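The expansion (A.7) is a purely algebraic identity, so it can be verified numerically for arbitrary data. The sketch below uses random inputs and an arbitrary phase φ in the definitions of h and w (nothing here is specific to the paper's random models):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 11

def cvec(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

a = cvec(m, n)                      # rows are the vectors a_r
x, u = cvec(n), cvec(n)
phi = 0.7                           # any phase works
z = np.exp(1j * phi) * (x + 0.1 * cvec(n))
h = np.exp(-1j * phi) * z - x       # h := e^{-i phi} z - x
w = np.exp(-1j * phi) * u           # w := e^{-i phi} u

def ip(p, q):                       # p^* q
    return np.vdot(p, q)

# Left-hand side: <u, grad f(z)> with y_r = |a_r^* x|^2
lhs = 0
for r in range(m):
    ar = a[r]
    lhs += (abs(ip(ar, z)) ** 2 - abs(ip(ar, x)) ** 2) * ip(u, ar) * ip(ar, z)
lhs /= m

# Right-hand side: the five-term expansion in (A.7)
rhs = 0
for r in range(m):
    ar = a[r]
    ax, ah, wa = ip(ar, x), ip(ar, h), ip(w, ar)
    rhs += (ax ** 2 * wa * np.conj(ah)      # w^* (a_r^* x)^2 a_r a_r^T h-bar
            + abs(ax) ** 2 * wa * ah        # w^* |a_r^* x|^2 a_r a_r^* h
            + 2 * abs(ah) ** 2 * wa * ax    # 2 w^* |a_r^* h|^2 a_r a_r^* x
            + ah ** 2 * wa * np.conj(ax)    # w^* (a_r^* h)^2 a_r a_r^T x-bar
            + abs(ah) ** 2 * wa * ah)       # w^* |a_r^* h|^2 a_r a_r^* h
rhs /= m

print(abs(lhs - rhs))
```

The two sides agree to machine precision, confirming that the phase factors cancel exactly as the substitution z = e^{iφ}(x + h) suggests.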


Combining (A.7) and (A.8) together with the triangle inequality and Lemma 7.4 gives

\[
\bigl|\langle u,\, \nabla f(z) - \mathbb{E}[\nabla f(z)]\rangle\bigr|
\le \left|w^*\left(\frac{1}{m}\sum_{r=1}^{m}(a_r^* x)^2\, a_r a_r^{T} - 2xx^{T}\right)\bar h\right| + \left|w^*\left(\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^* - (xx^* + \|x\|^2 I)\right)h\right| + 2\left|w^*\left(\frac{1}{m}\sum_{r=1}^{m}|a_r^* h|^2\, a_r a_r^* - (hh^* + \|h\|^2 I)\right)x\right| + \left|w^*\left(\frac{1}{m}\sum_{r=1}^{m}(a_r^* h)^2\, a_r a_r^{T} - 2hh^{T}\right)\bar x\right| + \left|w^*\left(\frac{1}{m}\sum_{r=1}^{m}|a_r^* h|^2\, a_r a_r^* - (hh^* + \|h\|^2 I)\right)h\right|
\le \left\|\frac{1}{m}\sum_{r=1}^{m}(a_r^* x)^2\, a_r a_r^{T} - 2xx^{T}\right\|\|h\| + \left\|\frac{1}{m}\sum_{r=1}^{m}|a_r^* x|^2\, a_r a_r^* - (xx^* + \|x\|^2 I)\right\|\|h\| + 2\left\|\frac{1}{m}\sum_{r=1}^{m}|a_r^* h|^2\, a_r a_r^* - (hh^* + \|h\|^2 I)\right\| + \left\|\frac{1}{m}\sum_{r=1}^{m}(a_r^* h)^2\, a_r a_r^{T} - 2hh^{T}\right\| + \left\|\frac{1}{m}\sum_{r=1}^{m}|a_r^* h|^2\, a_r a_r^* - (hh^* + \|h\|^2 I)\right\|\|h\|
\le 3\delta\,\|h\|(1 + \|h\|) \le \frac{9}{2}\,\delta\,\|h\|.
\]

H. Proof of Lemma 7.8

The result for the CDP model follows from [16, Lemma 3.3]. For the Gaussian model, it is a consequence of standard results concerning the deviation of the sample covariance matrix from its mean; see [64, Th. 5.39].

APPENDIX B
THE POWER METHOD

We use the power method (Algorithm 3) with a random initialization to compute the first eigenvector of Y = A diag{y} A^*.

Algorithm 3 Power Method
Input: Matrix Y.
  v_0 is a random vector on the unit sphere of C^n
  for τ = 1 to T do
    v_τ = Y v_{τ−1} / ‖Y v_{τ−1}‖
  end for
Output: z̃_0 = v_T.

Since each iteration of the power method asks to compute the matrix-vector product Y z = A diag{y} A^* z, we simply need to apply A and A^* to an arbitrary vector. In the Gaussian model, this costs 2mn multiplications, while in the CDP model the cost is that of 2L n-point FFTs. We now turn our attention to the number of iterations required to achieve a sufficiently accurate initialization. Standard results from numerical linear algebra show that after k iterations of the power method, the accuracy of the eigenvector is O(tan θ_0 (λ_2/λ_1)^k), where λ_1 and λ_2 are the top two eigenvalues of the positive semidefinite matrix Y, and θ_0 is the angle between the initial guess and the top eigenvector. Hence, we would need on the order of T = log(n/ε)/log(λ_1/λ_2) iterations for ε accuracy. Under the stated assumptions, Lemma 7.4 bounds the eigenvalue gap from below by a numerical constant, so that a few iterations of the power method yield accurate estimates.
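A minimal sketch of Algorithm 3 in plain NumPy follows. The matrix sizes and iteration count are arbitrary choices, and the comparison against a dense eigendecomposition is only for illustration; as described above, Y is never formed explicitly inside the iteration, only applications of A and A^*:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, T = 15, 1500, 50

A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)
y = np.abs(A @ x) ** 2

def apply_Y(v):
    # Y v = (1/m) sum_r y_r a_r (a_r^* v): one pass with A and A^*, no n x n matrix.
    return (A.conj().T * y) @ (A @ v) / m

v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v /= np.linalg.norm(v)
for _ in range(T):                  # v_tau = Y v_{tau-1} / ||Y v_{tau-1}||
    v = apply_Y(v)
    v /= np.linalg.norm(v)

# Reference: top eigenvector from a dense eigendecomposition of Y.
Y = (A.conj().T * y) @ A / m
top = np.linalg.eigh(Y)[1][:, -1]
print(abs(np.vdot(top, v)))         # |<top, v>| should be close to 1
```

Because the eigenvalue gap of Y is bounded below by a constant in this regime (roughly λ_1 ≈ 2, λ_2 ≈ 1), the iterate aligns with the top eigenvector after only a few iterations, as the discussion above predicts.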
ACKNOWLEDGMENT

M. S. would like to thank Andrea Montanari for helpful discussions and for the class [1] which inspired him to explore provable algorithms based on non-convex schemes. He would also like to thank Alexandre d'Aspremont, Fajwel Fogel and Filipe Maia for sharing some useful code regarding 3D molecule reconstructions, Ben Recht for introducing him to reference [53], and Amit Singer for bringing the paper [44] to his attention.

REFERENCES

[1] EE 378B—Inference, Estimation, and Information Processing. [Online]. Available: http://web.stanford.edu/class/ee378b/
[2] A. Agarwal, S. Negahban, and M. J. Wainwright, "Fast global convergence of gradient methods for high-dimensional statistical recovery," Ann. Statist., vol. 40, no. 5, pp. 2452–2482, 2012.
[3] A. Ahmed, B. Recht, and J. Romberg. (2012). "Blind deconvolution using convex programming." [Online]. Available: http://arxiv.org/abs/1211.5608
[4] R. Balan, "On signal reconstruction from its spectrogram," in Proc. 44th Annu. Conf. Inf. Sci. Syst. (CISS), Mar. 2010, pp. 1–4.
[5] R. Balan, P. Casazza, and D. Edidin, "On signal reconstruction without phase," Appl. Comput. Harmonic Anal., vol. 20, no. 3, pp. 345–356, 2006.
[6] L. Balzano and S. J. Wright. (2013). "Local convergence of an algorithm for subspace identification from partial data." [Online]. Available: http://arxiv.org/abs/1306.3391
[7] A. S. Bandeira, J. Cahill, D. G. Mixon, and A. A. Nelson. (2013). "Saving phase: Injectivity and stability for phase retrieval." [Online]. Available: http://arxiv.org/abs/1302.4618
[8] H. H. Bauschke, P. L. Combettes, and D. R. Luke, "Hybrid projection–reflection method for phase retrieval," J. Opt. Soc. Amer. A, vol. 20, no. 6, pp. 1025–1034, 2003.
[9] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[10] M. Bayati and A. Montanari, "The LASSO risk for Gaussian matrices," IEEE Trans. Inf. Theory, vol. 58, no. 4, pp. 1997–2017, Apr. 2012.
[11] V. Bentkus, "An inequality for tail probabilities of martingales with differences bounded from one side," J. Theoretical Probab., vol. 16, no. 1, pp. 161–173, 2003.
[12] O. Bunk et al., "Diffractive imaging for periodic samples: Retrieving one-dimensional concentration profiles across microfluidic channels," Acta Crystallograph. A, Found. Crystallogr., vol. 63, pp. 306–314, Jul. 2007.
[13] J. Cahill, P. G. Casazza, J. Peterson, and L. Woodland. (2013). "Phase retrieval by projections." [Online]. Available: http://arxiv.org/abs/1305.6226
[14] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, "Phase retrieval via matrix completion," SIAM J. Imag. Sci., vol. 6, no. 1, pp. 199–225, 2013.


[15] E. J. Candès and X. Li, "Solving quadratic equations via PhaseLift when there are about as many equations as unknowns," Found. Comput. Math., vol. 14, no. 5, pp. 1017–1026, 2014.
[16] E. J. Candès, X. Li, and M. Soltanolkotabi. (2013). "Phase retrieval from coded diffraction patterns." [Online]. Available: http://arxiv.org/abs/1310.3240
[17] E. J. Candès, T. Strohmer, and V. Voroninski, "PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming," Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, 2013.
[18] A. Chai, M. Moscoso, and G. Papanicolaou, "Array imaging using intensity-only measurements," Inverse Problems, vol. 27, no. 1, p. 015005, 2011.
[19] A. Conca, D. Edidin, M. Hering, and C. Vinzant. (2013). "An algebraic characterization of injectivity in phase retrieval." [Online]. Available: http://arxiv.org/abs/1312.0158
[20] J. V. Corbett, "The Pauli problem, state reconstruction and quantum-real numbers," Rep. Math. Phys., vol. 57, no. 1, pp. 53–68, 2006.
[21] J. C. Dainty and J. R. Fienup, "Phase retrieval and image reconstruction for astronomy," in Image Recovery: Theory and Application, H. Stark, Ed. San Diego, CA, USA: Academic, 1987, pp. 231–275.
[22] L. Demanet and V. Jugnon. (2013). "Convex recovery from interferometric measurements." [Online]. Available: http://arxiv.org/abs/1307.6864
[23] D. L. Donoho, A. Maleki, and A. Montanari, "Message-passing algorithms for compressed sensing," Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, 2009.
[24] J. R. Fienup, "Reconstruction of an object from the modulus of its Fourier transform," Opt. Lett., vol. 3, no. 1, pp. 27–29, 1978.
[25] J. R. Fienup, "Fine resolution imaging of space objects," Radar Opt. Division, Environ. Res. Inst. Michigan, Ann Arbor, MI, USA, Final Sci. Rep. 01/1982-1, 1982.
[26] J. R. Fienup, "Phase retrieval algorithms: A comparison," Appl. Opt., vol. 21, no. 15, pp. 2758–2769, 1982.
[27] J. R. Fienup, "Comments on 'The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform,'" IEEE Trans. Acoust., Speech, Signal Process., vol. 31, no. 3, pp. 738–739, Jun. 1983.
[28] F. Fogel, I. Waldspurger, and A. d'Aspremont. (2013). "Phase retrieval for imaging problems." [Online]. Available: http://arxiv.org/abs/1304.7735
[29] R. W. Gerchberg and W. O. Saxton, "A practical algorithm for the determination of the phase from image and diffraction plane pictures," Optik, vol. 35, pp. 237–246, 1972.
[30] D. Gross, F. Krahmer, and R. Kueng. (2013). "A partial derandomization of PhaseLift using spherical designs." [Online]. Available: http://arxiv.org/abs/1310.2267
[31] D. Gross, F. Krahmer, and R. Kueng. (2014). "Improved recovery guarantees for phase retrieval from coded diffraction patterns." [Online]. Available: http://arxiv.org/abs/1402.6286
[32] M. Hardt. (2013). "On the provable convergence of alternating minimization for matrix completion." [Online]. Available: http://arxiv-web3.library.cornell.edu/abs/1312.0925v1
[33] R. W. Harrison, "Phase problem in crystallography," J. Opt. Soc. Amer. A, vol. 10, no. 5, pp. 1046–1055, 1993.
[34] M. H. Hayes, "The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 2, pp. 140–154, Apr. 1982.
[35] T. Heinosaari, L. Mazzarella, and M. M. Wolf, "Quantum tomography under prior information," Commun. Math. Phys., vol. 318, no. 2, pp. 355–374, 2013.
[36] K. Jaganathan, S. Oymak, and B. Hassibi, "On robust phase retrieval for sparse signals," in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Oct. 2012, pp. 794–799.
[37] P. Jain, P. Netrapalli, and S. Sanghavi, "Low-rank matrix completion using alternating minimization," in Proc. 45th Annu. ACM Symp. Theory Comput., 2013, pp. 665–674.
[38] R. H. Keshavan, "Efficient algorithms for collaborative filtering," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, 2012.
[39] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2980–2998, Jun. 2010.
[40] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from noisy entries," J. Mach. Learn. Res., vol. 11, pp. 2057–2078, Mar. 2010.
[41] K. Lee, Y. Wu, and Y. Bresler. (2013). "Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization." [Online]. Available: http://arxiv.org/abs/1312.0525
[42] S. Marchesini, "Invited article: A unified evaluation of iterative projection algorithms for phase retrieval," Rev. Sci. Instrum., vol. 78, no. 1, p. 011301, 2007.
[43] S. Marchesini, "Phase retrieval and saddle-point optimization," J. Opt. Soc. Amer. A, vol. 24, no. 10, pp. 3289–3296, 2007.
[44] S. Marchesini, Y.-C. Tu, and H.-T. Wu. (2014). "Alternating projection, ptychographic imaging and phase synchronization." [Online]. Available: http://arxiv.org/abs/1402.0550
[45] J. Miao, P. Charalambous, J. Kirz, and D. Sayre, "Extending the methodology of X-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens," Nature, vol. 400, no. 6742, pp. 342–344, 1999.
[46] J. Miao, T. Ishikawa, B. Johnson, E. H. Anderson, B. Lai, and K. O. Hodgson, "High resolution 3D X-ray diffraction microscopy," Phys. Rev. Lett., vol. 89, no. 8, p. 088303, Aug. 2002.
[47] J. Miao, T. Ishikawa, Q. Shen, and T. Earnest, "Extending X-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes," Annu. Rev. Phys. Chem., vol. 59, pp. 387–410, May 2008.
[48] L. Waller, L. Tian, and G. Barbastathis, "Transport of intensity phase-amplitude imaging with higher order intensity derivatives," Opt. Exp., vol. 18, no. 12, pp. 12552–12561, 2010.
[49] L. Tian, X. Li, K. Ramchandran, and L. Waller, "Multiplexed coded illumination for Fourier ptychography with an LED array microscope," Biomed. Opt. Exp., vol. 5, pp. 2376–2389, 2014.
[50] R. P. Millane, "Phase retrieval in crystallography and optics," J. Opt. Soc. Amer. A, vol. 7, no. 3, pp. 394–411, 1990.
[51] Y. Mroueh. (2014). "Robust phase retrieval and super-resolution from one bit coded diffraction patterns." [Online]. Available: http://arxiv.org/abs/1402.2255
[52] Y. Mroueh and L. Rosasco. (2013). "Quantization and greed are good: One bit phase retrieval, robustness and greedy refinements." [Online]. Available: http://arxiv.org/abs/1312.1830
[53] K. G. Murty and S. N. Kabadi, "Some NP-complete problems in quadratic and nonlinear programming," Math. Program., vol. 39, no. 2, pp. 117–129, 1987.
[54] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Philadelphia, PA, USA: SIAM, 2001.
[55] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization), vol. 87. Boston, MA, USA: Kluwer, 2004.
[56] P. Netrapalli, P. Jain, and S. Sanghavi. (2013). "Phase retrieval using alternating minimization." [Online]. Available: http://arxiv.org/abs/1306.0160
[57] H. Ohlsson, A. Y. Yang, R. Dong, and S. S. Sastry. (2011). "Compressive phase retrieval from squared output measurements via semidefinite programming." [Online]. Available: http://arxiv.org/abs/1111.6323
[58] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. (2012). "Simultaneously structured models with application to sparse and low-rank matrices." [Online]. Available: http://arxiv.org/abs/1212.3753
[59] S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul./Aug. 2011, pp. 2168–2172.
[60] J. Ranieri, A. Chebira, Y. M. Lu, and M. Vetterli. (2013). "Phase retrieval for sparse signals: Uniqueness conditions." [Online]. Available: http://arxiv.org/abs/1308.3058
[61] H. Reichenbach, Philosophic Foundations of Quantum Mechanics. Berkeley, CA, USA: Univ. California Press, 1965.
[62] J. L. C. Sanz, T. S. Huang, and T. Wu, "A note on iterative Fourier transform phase reconstruction from magnitude," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1251–1254, Dec. 1984.
[63] P. Schniter and S. Rangan, "Compressive phase retrieval via generalized approximate message passing," in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), 2012, pp. 815–822.
[64] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev. (2014). "Phase retrieval with application to optical imaging." [Online]. Available: http://arxiv.org/abs/1402.7350
[65] M. Soltanolkotabi, "Algorithms and theory for clustering and nonconvex quadratic programming," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, 2014.
[66] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," in Compressed Sensing: Theory and Applications, Y. Eldar and G. Kutyniok, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[67] I. Waldspurger, A. d'Aspremont, and S. Mallat. (2012). "Phase recovery, maxcut and complex semidefinite programming." [Online]. Available: http://arxiv.org/abs/1206.0102


[68] A. Walther, "The question of phase retrieval in optics," Opt. Acta, Int. J. Opt., vol. 10, no. 1, pp. 41–49, 1963.
[69] G.-Z. Yang, B.-Z. Dong, B.-Y. Gu, J.-Y. Zhuang, and O. K. Ersoy, "Gerchberg–Saxton and Yang–Gu algorithms for phase retrieval in a nonunitary transform system: A comparison," Appl. Opt., vol. 33, no. 2, pp. 209–218, 1994.

Emmanuel J. Candès is the Barnum-Simons Chair in Mathematics and Statistics, and professor of electrical engineering (by courtesy) at Stanford University. Up until 2009, he was the Ronald and Maxine Linde Professor of Applied and Computational Mathematics at the California Institute of Technology. His research interests are in applied mathematics, statistics, information theory, signal processing and mathematical optimization with applications to the imaging sciences, scientific computing and inverse problems. Candès graduated from the Ecole Polytechnique in 1993 with a degree in science and engineering, and received his Ph.D. in statistics from Stanford University in 1998. Emmanuel received the 2006 Alan T. Waterman Award from NSF, which recognizes the achievements of early-career scientists. Other honors include the 2013 Dannie Heineman Prize presented by the Academy of Sciences at Göttingen, the 2010 George Pólya Prize awarded by the Society of Industrial and Applied Mathematics (SIAM), and the 2015 AMS-SIAM George David Birkhoff Prize in Applied Mathematics. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences.

Xiaodong Li is a postdoctoral research associate in the Wharton Statistics Department, University of Pennsylvania. He received his B.S. in mathematics from Peking University, Beijing, China in 2008. He obtained his Ph.D. in mathematics from Stanford University in 2013. His research interests are in statistics, mathematical signal processing, machine learning and optimization.

Mahdi Soltanolkotabi obtained his B.S. in electrical engineering at Sharif University of Technology, Tehran, Iran in 2009. He completed his M.S. and Ph.D. in electrical engineering at Stanford University in 2011 and 2014. He was a postdoctoral researcher in the Electrical Engineering and Computer Science and Statistics departments at the University of California, Berkeley during the 2014–2015 academic year. He is currently an assistant professor in the Ming Hsieh Department of Electrical Engineering at the University of Southern California. His research interests include mathematical optimization, machine learning, signal processing, high-dimensional statistics, and geometry with emphasis on applications in information and physical sciences.
