PARTIAL LEAST SQUARES PREDICTION IN
HIGH-DIMENSIONAL REGRESSION
BY R. DENNIS COOK AND LILIANA FORZANI


Vol. 0, No. 00, 1–25
https://doi.org/10.1214/18-AOS1681
© Institute of Mathematical Statistics

University of Minnesota and Facultad de Ingeniería Química, UNL, Researcher of CONICET

We study the asymptotic behavior of predictions from partial least squares (PLS) regression as the sample size and number of predictors diverge in various alignments. We show that there is a range of regression scenarios where PLS predictions have the usual root-n convergence rate, even when the sample size is substantially smaller than the number of predictors, and an even wider range where the rate is slower but may still produce practically useful results. We show also that PLS predictions achieve their best asymptotic behavior in abundant regressions, where many predictors contribute information about the response. Their asymptotic behavior tends to be undesirable in sparse regressions, where few predictors contribute information about the response.

1. Introduction. Partial least squares (PLS) regression is one of the first methods for prediction in high-dimensional linear regressions in which the sample size n may not be large relative to the number of predictors p. It was set in motion by Wold, Martens and Wold [35]. Since then the development of PLS regression has taken place mainly within the Chemometrics community, where empirical prediction is the main issue and PLS regression is now a core method. Chemometricians tended not to address population models or regression coefficients, but instead dealt directly with predictions resulting from PLS algorithms. This custom of forgoing population considerations, asymptotic approximations and other widely accepted statistical constructs placed PLS at odds with statistical tradition, with the consequence that it has been slow to be recognized within the statistics community. There is now a vast Chemometrics literature on PLS regression, some of it refining and extending the methodology and some of it affirming basic methodology [4]. Martens and Næs' 1992 book [28] is a classical reference for PLS within the Chemometrics community.

Studies of PLS regression have appeared in the mainline statistics literature from time to time. Helland [21] was perhaps the first to define a PLS regression model, and a first attempt at maximum likelihood estimation was made by Helland [22]; see also [23, 29]. Frank and Friedman [18] gave an informative discussion of PLS

Received January 2017; revised December 2017.
MSC2010 subject classifications. Primary 62J05; secondary 62F12.
Key words and phrases. Abundant regressions, dimension reduction, sparse regressions.

AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 2

regression from various statistical views, and Garthwaite [19] attempted a statistical interpretation of PLS algorithms. Naik and Tsai [30] demonstrated that PLS regression provides a consistent estimator of the central subspace [7, 8] when the distribution of the response given the predictors follows a single-index model and n → ∞ with p fixed. Delaigle and Hall [16] extended this result to functional data. Cook, Helland and Su [12] established a population connection between PLS regression and envelopes [14] in the context of multivariate linear regression, provided the first firm PLS model and showed that envelope estimation leads to root-n consistent estimators whose performance dominates that of PLS in traditional fixed-p contexts.

PLS regression also has a substantial following outside of the Chemometrics and Statistics communities. Boulesteix and Strimmer [3] studied the advantages of PLS regression for the analysis of high-dimensional genomic data, and Nguyen and Rocke [31, 32] proposed it for microarray-based classification. Worsley [36] considered PLS regression for the analysis of data from PET and fMRI studies. Application of PLS for the analysis of spatiotemporal data was proposed by Lobaugh et al. [27], and Schwartz et al. [33] used PLS in image analysis. Because of these and many other applications, it seems clear that PLS regression is widely used across the applied sciences. All subsequent references to PLS in this article should be understood to mean PLS regression.

In view of the apparent success that PLS has had in Chemometrics and elsewhere, we might anticipate that it has reasonable statistical properties in high-dimensional regression. However, the algorithmic nature of PLS evidently made it difficult to study using traditional statistical measures, with the consequence that PLS was long regarded as a technique that is useful, but whose core statistical properties are elusive. Chun and Keleş [6] provided a piece of the puzzle by showing that, within a certain modeling framework, the PLS estimator of the coefficient vector in linear regression is inconsistent unless p/n → 0. They then used this as motivation for their development of a sparse version of PLS. The Chun–Keleş result poses a little dilemma. On the one hand, decades of experience support PLS as a useful method, but its inconsistency when p/n → c > 0 casts doubt on its usefulness in high-dimensional regression, which is one of the contexts in which PLS undeniably stands out by virtue of its widespread application. There are several possible explanations for this conflict, including (a) consistency does not always signal the value of a method in practice, (b) the Chemometrics literature is largely wrong about the value of PLS and (c) the modeling construct used by Chun and Keleş does not reflect the range of applications in which PLS is employed.

Cook and Forzani [9] studied single-component PLS regressions and found that in some reasonable settings PLS predictions can converge at the root-n rate as n, p → ∞, regardless of the alignment between n and p, a result that stands in contrast to the finding of Chun and Keleş [6]. Single-component regressions do occur in practice, but our impression is that multiple-component regressions are the rule. Recent studies that used multiple PLS components include studies of seasonal

streamflow forecasting [1], Italian craft beer [2], the metabolomics of meat exudate [5], the prediction of biogas yield [24], quantification in bioprocesses [25] and the Japanese honeysuckle [26].

In this article, we follow the general setup of Cook and Forzani [9] and use traditional (n, p)-asymptotic arguments to provide insights into PLS predictions in multiple-component regressions. We also give bounds on the rates of convergence for PLS predictions as n, p → ∞, and in doing so we conjecture about the value of PLS in various regression scenarios. Section 2 contains a review of PLS regression, along with comments on its connection to envelopes and sufficient dimension reduction. The specific objective of our study is described in Section 3. In Section 4, we introduce and provide intuition for various quantities that influence the (n, p)-asymptotic behavior of PLS predictions. Our main results are given as two theorems in Section 5. There we also describe connections with the results of Cook and Forzani [9] for single-component regressions and offer a different view of the Chun–Keleş result [6]. Supporting simulations and an illustrative data analysis are given in Section 6. We focus solely on predictive consistency until Section 7.1, where we address estimative consistency. Proofs and other supporting material are given in an online supplement to this article [10].

Our results show that there is a range of regression scenarios where PLS predictions have the usual root-n convergence rate, even when n ≪ p, and an even wider range where the rate is slower but may still produce practically useful results, the Chun–Keleş result notwithstanding.

2. PLS review. There are several different PLS algorithms for the multivariate (multi-response) linear regression of r responses on p predictors. These algorithms may not be presented as model-based, but instead are often regarded as methods for prediction. It is known that they give the same result for univariate responses but give distinct sample results for multivariate responses. We restrict attention to univariate regression so that the methodology is clear. See Section 7.2 for further discussion related to this choice.

The context for our study is the typical linear regression model with univariate response y and random predictor vector X ∈ R^p,

(2.1)  y = μ + β^T(X − E(X)) + ε,

where the regression coefficients β ∈ R^p are unknown, and the error ε has mean 0, variance τ² and is independent of X. We assume that (y, X) follows a nonsingular multivariate normal distribution and that the data (y_i, X_i), i = 1, …, n, arise as independent copies of (y, X). We use the normality assumption to facilitate asymptotic calculations and to connect with the results of Chun and Keleş [6]; nevertheless, simulations and experience in practice indicate that it is not essential for the methodology itself. Further discussion of this assumption is given in Section 7.4. To avoid trivial cases, we assume throughout that β ≠ 0.


Let Y = (y_1, …, y_n)^T and let F denote the p × n matrix with columns (X_i − X̄), i = 1, …, n. Then the model for the full sample can be represented also in vector form as

Y = α 1_n + F^T β + ε,

where 1_n represents the n × 1 vector of ones, α = E(y) and ε = (ε_i). Let Σ = var(X) > 0 and σ = cov(X, y). We use W_q(Σ) to denote the Wishart distribution with q degrees of freedom and scale matrix Σ. Let P_{A(Δ)} denote the projection in the Δ > 0 inner product onto span(A) if A is a matrix, or onto A itself if it is a subspace. We use P_A := P_{A(I)} to denote projections in the usual inner product, and Q_A = I − P_A. The Euclidean norm is denoted by ‖·‖. Turning to notation for a sample, let σ̂ = n^{-1} F Y and Σ̂ = n^{-1} F F^T ≥ 0 denote the usual moment estimators of σ and Σ using n as the divisor. With W = F F^T ∼ W_{n−1}(Σ), we can represent Σ̂ = W/n and σ̂ = n^{-1}(Wβ + Fε).
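As a numerical aside (our own illustration, not part of the original development; all variable names are ours), the sample identity σ̂ = n^{-1}(Wβ + Fε) follows because F 1_n = 0 kills the intercept term, and it can be checked directly on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8

# Simulate model (2.1) with E(X) = 0: y = mu + X beta + eps.
mu = 1.0
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
eps = rng.normal(scale=0.5, size=n)
y = mu + X @ beta + eps

F = (X - X.mean(axis=0)).T          # p x n matrix, columns X_i - X_bar
sigma_hat = F @ y / n               # moment estimator of sigma = cov(X, y)
Sigma_hat = F @ F.T / n             # moment estimator of Sigma, divisor n
W = F @ F.T                         # W = F F^T

# Since F 1_n = 0, the intercept drops out and
# sigma_hat = n^{-1} F Y = n^{-1} (W beta + F eps) exactly.
rhs = (W @ beta + F @ eps) / n
```

The check holds to floating-point precision, since the identity is exact algebra rather than an approximation.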

The PLS estimator of β hinges fundamentally on the notion that we can identify a dimension reduction subspace H ⊆ R^p so that y ⊥⊥ X | P_H X and d := dim(H) < p (and hopefully d ≪ p). This driving condition is the same as that encountered in the literature on sufficient dimension reduction (see [8] for an introduction), but PLS operates in the context of model (2.1), while sufficient dimension reduction is largely model-free. We assume that d is known in all technical results stated in this article. In Chemometrics and elsewhere, d is often chosen by using predictive cross-validation or a holdout sample. See Section 7.3 for discussion of the choice of d.

Assume momentarily that a basis matrix H ∈ R^{p×d} of H is known and that Σ̂ > 0. Let B = Σ̂^{-1} σ̂ denote the ordinary least squares estimator of β. Then, following the reduction X ↦ H^T X, ordinary least squares is used to estimate the coefficient vector β_{y|H^T X} for the regression of y on H^T X, giving the estimated coefficient vector β̃_{y|H^T X} = (H^T Σ̂ H)^{-1} H^T σ̂. The known-H estimator β̃_H of β is then

(2.2)  β̃_H = H β̃_{y|H^T X} = P_{H(Σ̂)} B.

Equation (2.2) describes β̃_H as a projection of B onto H and shows that β̃_H depends on H only via H. It also shows that β̃_H requires H^T Σ̂ H > 0, but does not actually require Σ̂ > 0. This is essentially how PLS handles n < p regressions: by reducing the predictors to H^T X while requiring n ≫ d, PLS is able to deal with high-dimensional regressions in a relatively straightforward manner. The unique and essential ingredient supplied by PLS is an algorithm for estimating H.

The following is the population statement, developed by Cook et al. [12], of the SIMPLS algorithm [15] for estimating H in univariate regressions. Set w_0 = 0 and W_0 = w_0. For k = 0, …, d − 1, set

S_k = span(Σ W_k),


w_{k+1} = Q_{S_k} σ / (σ^T Q_{S_k} σ)^{1/2},
W_{k+1} = (w_0, …, w_k, w_{k+1}).

At termination, span(W_d) is a dimension reduction subspace H. Since d is assumed to be known and effectively fixed, SIMPLS depends on only two population quantities—σ and Σ—that must be estimated. The sample version of SIMPLS is constructed by replacing σ and Σ with their sample counterparts and terminating after d steps, even if Σ̂ is singular. In particular, SIMPLS does not make use of Σ̂^{-1} and so does not require Σ̂ to be nonsingular, but it does require d ≤ min(p, n − 1). If d = p, then span(W_p) = R^p and PLS reduces to the ordinary least squares estimator. Let G = (σ, Σσ, …, Σ^{d−1}σ) and Ĝ = (σ̂, Σ̂σ̂, …, Σ̂^{d−1}σ̂) denote the population and sample Krylov matrices. Helland [21] showed that span(G) = span(W_d), giving a closed-form expression for a basis of the population PLS subspace, and that the sample version of the SIMPLS algorithm gives span(Ĝ).

PLS can be seen as an envelope method as follows [12]. A subspace R ⊆ R^p is a reducing subspace of Σ if R decomposes Σ = P_R Σ P_R + Q_R Σ Q_R, and then we say that R reduces Σ. The intersection of all reducing subspaces of Σ that contain a specified subspace S ⊆ R^p is called the Σ-envelope of S and denoted E_Σ(S). Let P_k denote the projection onto the kth eigenspace of Σ, k = 1, …, q ≤ p. Then the Σ-envelope of S can be constructed by projecting onto the eigenspaces of Σ [14]: E_Σ(S) = Σ_{k=1}^q P_k S. Cook et al. [12] showed that the population SIMPLS algorithm produces E_Σ(B), the Σ-envelope of B := span(β), so H = span(W_d) = span(G) = E_Σ(B).

From this point, we use H ∈ R^{p×d} to denote any semi-orthogonal basis matrix for E_Σ(B) and let (H, H_0) ∈ R^{p×p} denote an orthogonal matrix. The connection with envelopes led Cook et al. [12] to the following envelope model for PLS:

(2.3)  y = μ + β_{y|H^T X}^T H^T(X − E(X)) + ε,
       Σ = H Σ_H H^T + H_0 Σ_{H_0} H_0^T,

where

Σ_H = var(H^T X) = H^T Σ H ∈ R^{d×d},
Σ_{H_0} = var(H_0^T X) = H_0^T Σ H_0 ∈ R^{(p−d)×(p−d)},

and β_{y|H^T X} can be interpreted as the coordinates of β relative to the basis H. In terms of the parameters in model (2.1), this model makes use of the basis H of E_Σ(B) to achieve a parsimonious re-parameterization of β and Σ: Σ is as given in the model and

(2.4)  β = P_{H(Σ)} β = H β_{y|H^T X} = H(H^T Σ H)^{-1} H^T σ = G(G^T Σ G)^{-1} G^T σ,

where the last step follows because, as noted previously, E_Σ(B) = span(H) = span(G). This re-parameterization has no impact on the predictors or the error in model (2.1).
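The SIMPLS recursion and Helland's Krylov characterization can be illustrated numerically. The sketch below is our own (it is not the authors' code): it runs the population SIMPLS iteration on an arbitrary positive definite Σ and verifies that span(W_d) equals the span of the Krylov matrix G.

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 6, 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)          # any positive definite Sigma
sigma = rng.normal(size=p)

def simpls_weights(sigma, Sigma, d):
    """Population SIMPLS: w_{k+1} is sigma projected onto the orthogonal
    complement of S_k = span(Sigma W_k), then normalized.  Note that
    (sigma^T Q_{S_k} sigma)^{1/2} = ||Q_{S_k} sigma|| for a projection."""
    W = np.empty((len(sigma), 0))
    for _ in range(d):
        S = Sigma @ W
        if S.size:
            Qs, _ = np.linalg.qr(S)
            w = sigma - Qs @ (Qs.T @ sigma)   # Q_{S_k} sigma
        else:
            w = sigma.copy()                  # first step: S_0 = span(0)
        W = np.column_stack([W, w / np.linalg.norm(w)])
    return W

W = simpls_weights(sigma, Sigma, d)
G = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma
                     for k in range(d)])      # Krylov matrix

# Helland's result: span(W_d) = span(G); compare orthogonal projections.
P_W = W @ np.linalg.solve(W.T @ W, W.T)
Qg, _ = np.linalg.qr(G)
P_G = Qg @ Qg.T
```

The same function applied to (σ̂, Σ̂) gives the sample version, since SIMPLS only ever uses these two inputs.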

Beginning with model (2.3), Cook et al. [12] developed likelihood-based estimators whose performance dominates that of SIMPLS in the traditional fixed-p context. It follows from (2.3) that y ⊥⊥ X | H^T X and H^T X ⊥⊥ H_0^T X, which together imply that (y, H^T X) ⊥⊥ H_0^T X. Model (2.3) and the condition H^T X ⊥⊥ H_0^T X are what set the PLS framework apart from that of sufficient dimension reduction. As a consequence of this structure, the distribution of y can respond to changes in H^T X, but changes in H_0^T X affect neither the distribution of y nor the distribution of H^T X. For this reason, we refer to H_0^T X as the noise in X. As will be seen later, the predictive success of PLS depends crucially on the relative sizes of Σ_{H_0}, the variability of the noise in X, and Σ_H, the variability in the part of X that affects y.

3. Objective. Let β̂ denote the estimator of β produced by the SIMPLS algorithm: from (2.4),

β = G(G^T Σ G)^{-1} G^T σ,
β̂ = Ĝ(Ĝ^T Σ̂ Ĝ)^{-1} Ĝ^T σ̂,

where Ĝ = (σ̂, Σ̂σ̂, …, Σ̂^{d−1}σ̂), as defined previously. Our interest lies in studying the predictive performance of β̂ as n and p grow in various alignments.

Let y_N = μ + β^T(X_N − E(X)) + ε_N denote a new observation on y at a new independent observation X_N of X. The PLS predicted value of y_N at X_N is ŷ_N = ȳ + β̂^T(X_N − X̄), giving a difference of

ŷ_N − y_N = ȳ − μ + (β̂ − β)^T(X_N − E(X)) − (β̂ − β)^T(X̄ − E(X)) − β^T(X̄ − E(X)) + ε_N.

The first term ȳ − μ = O_p(n^{-1/2}). Since var(y) = β^T Σ β + τ² must remain constant as p grows, β ≠ 0 and Σ > 0, we see that β^T Σ β ≍ 1 as p → ∞, where "a_k ≍ b_k" means that, as k → ∞, a_k = O(b_k) and b_k = O(a_k). Thus the fourth term β^T(X̄ − E(X)) = O_p(n^{-1/2}) by Chebyshev's inequality: var(β^T(X̄ − E(X))) = β^T Σ β/n → 0 as n, p → ∞. The term (β̂ − β)^T(X̄ − E(X)) must have order smaller than or equal to the order of (β̂ − β)^T(X_N − E(X)), which will be at least O_p(n^{-1/2}).

Consequently, we have the essential asymptotic representation

ŷ_N − y_N = O_p((β̂ − β)^T(X_N − E(X))) + ε_N  as n, p → ∞.

Since ε_N is the intrinsic error in the new observation, the (n, p)-asymptotic behavior of the prediction ŷ_N is governed by the estimative performance of β̂ as measured by

(3.1)  D_N := (β̂ − β)^T ω_N = [σ̂^T Ĝ(Ĝ^T Σ̂ Ĝ)^{-1} Ĝ^T − σ^T G(G^T Σ G)^{-1} G^T] ω_N,

where ω_N = X_N − E(X), and we study the properties of D_N as n, p → ∞. Because var(D_N | β̂) = (β̂ − β)^T Σ(β̂ − β), results for D_N also tell us about the asymptotic behavior of β̂ in the Σ inner product. Consistency of β̂ is discussed in Section 7.1. Until then, we focus exclusively on predictions via D_N.
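For concreteness, the Krylov form of β̂ is straightforward to compute. The sketch below is our own illustration on simulated data (the choice of predictor scales and coefficients is ours); it also confirms the remark in Section 2 that PLS reduces to ordinary least squares when d = p.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
d = p                                     # with d = p, PLS should equal OLS

scales = np.array([1.0, 1.5, 2.0, 2.5])   # well-separated predictor scales
X = rng.normal(size=(n, p)) * scales
beta_true = np.array([1.0, -1.0, 0.5, 2.0])
y = 2.0 + X @ beta_true + rng.normal(scale=0.5, size=n)

Xbar, ybar = X.mean(axis=0), y.mean()
F = (X - Xbar).T
sigma_hat = F @ y / n
Sigma_hat = F @ F.T / n

# Sample Krylov matrix and the SIMPLS estimator of beta from (2.4)/(3.1).
Ghat = np.column_stack([np.linalg.matrix_power(Sigma_hat, k) @ sigma_hat
                        for k in range(d)])
beta_pls = Ghat @ np.linalg.solve(Ghat.T @ Sigma_hat @ Ghat,
                                  Ghat.T @ sigma_hat)

# PLS prediction at a new observation X_N.
XN = rng.normal(size=p) * scales
yhat_N = ybar + beta_pls @ (XN - Xbar)

beta_ols = np.linalg.solve(Sigma_hat, sigma_hat)   # B = Sigma_hat^{-1} sigma_hat
```

When d < p, only the first d Krylov columns are used and Σ̂ is never inverted, which is how the estimator remains computable with n < p.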

4. Overarching considerations. In this section, we introduce and discuss various population constructs that play key roles in the asymptotic results of Section 5.

4.1. Dimension d of E_Σ(B). As mentioned in Section 2, we assume throughout this article that the dimension d = dim{E_Σ(B)} is known and constant for all finite p ≥ d. Technically, this dimension may increase for a time with p (e.g., while p < d), but we assume that it remains constant after a certain point.

4.2. Signal and noise in X. Although we are pursuing asymptotic properties of PLS predictions via (3.1), the envelope model (2.3) guides aspects of the study. Under this envelope construction, B ⊆ E_Σ(B) and, for any nonnegative integer k,

(4.1)  Σ^k = H Σ_H^k H^T + H_0 Σ_{H_0}^k H_0^T.

Our asymptotic results depend fundamentally on the sizes of Σ_H and Σ_{H_0}. Define η(p): R ↦ R and κ(p): R ↦ R by

(4.2)  tr(Σ_H) ≍ η(p) ≥ 1,
(4.3)  tr(Σ_{H_0}) ≍ κ(p),

where we impose the condition η(p) ≥ 1 without loss of generality. In what follows, we will typically suppress the argument and refer to η(p) and κ(p) as η and κ. If finitely many of the eigenvalues of Σ_{H_0} are O(p) and the rest are all bounded away from 0 and ∞, then we could take κ = p. Otherwise, it is technically possible that p = o(κ), although we would not normally expect that in practice.

To gain intuition about η(p), let λ_i denote the ith eigenvalue of Σ_H, i = 1, …, d, and assume without loss of generality that the columns of H = (h_1, …, h_d) are orthogonal eigenvectors of Σ. Then, using (4.1) and the facts that σ = P_H σ and Σ_H = diag(λ_1, …, λ_d),

(4.4)  β^T Σ β = σ^T Σ^{-1} σ = σ^T H Σ_H^{-1} H^T σ = ‖σ‖² (σ^T H Σ_H^{-1} H^T σ)/(σ^T P_H σ) = Σ_{i=1}^d w_i ‖σ‖²/λ_i,

where the weights w_i = σ^T P_{h_i} σ/(σ^T P_H σ), P_{h_i} denotes the projection onto span(h_i), and Σ_{i=1}^d w_i = 1. Consequently, if the w_i are bounded away from 0 and if many predictors are correlated with y so that ‖σ‖² → ∞, then the eigenvalues of Σ_H must diverge to ensure that β^T Σ β remains bounded. We could in this case take η(p) = ‖σ‖².

Suppose that the first k eigenvalues λ_i, i = 1, …, k, diverge with p, that λ_i ≍ λ_j, i, j = 1, …, k, and that the remaining d − k eigenvalues are of lower order, λ_j = o(λ_i), i = 1, …, k, j = k + 1, …, d. Then, if ‖σ‖² ≍ λ_i, i = 1, …, k, we must have w_i → 0 for i = k + 1, …, d for β^T Σ β to remain bounded.

It is possible also that the eigenvalues λ_i are bounded. This happens in sparse regressions when only d predictors are relevant. For instance, if H = (I_d, 0)^T, then Σ_H is the dth-order leading principal submatrix of Σ, and thus it is fixed with bounded eigenvalues. Bounded eigenvalues are possible also when many predictors are related weakly to the response, so that ‖σ‖ is bounded. If the eigenvalues λ_i are bounded, then η ≍ 1.

From the discussion so far, we see that κ, being the trace of a (p − d) × (p − d) positive definite matrix, would normally be at least of order p, but might have a larger order. η, being the trace of a d × d matrix, will in practice have order at most p and can achieve that order in abundant regressions where ‖σ‖² ≍ p. We can contrive cases where p = o(η), but they seem impractical. For these reasons, we limit our consideration to regressions in which η = O(κ).
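Both the envelope decomposition (4.1) and the weighted-eigenvalue form (4.4) are exact algebraic identities, which can be verified numerically. The sketch below is our own construction of a Σ with known envelope structure (the specific dimensions and eigenvalues are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 7, 3

# Build Sigma with known envelope structure (2.3): H spans the material
# part, H0 the immaterial part, and (H, H0) is orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
H, H0 = Q[:, :d], Q[:, d:]
lam = np.array([5.0, 2.0, 1.0])                 # eigenvalues of Sigma_H
A = rng.normal(size=(p - d, p - d))
Sigma_H0 = A @ A.T + np.eye(p - d)              # any SPD matrix
Sigma = H @ np.diag(lam) @ H.T + H0 @ Sigma_H0 @ H0.T

# (4.1): Sigma^k decomposes the same way; checked here for k = 2.
Sigma2 = (H @ np.diag(lam**2) @ H.T
          + H0 @ (Sigma_H0 @ Sigma_H0) @ H0.T)

# (4.4): with sigma = P_H sigma, beta^T Sigma beta is a weighted sum
# of ||sigma||^2 / lambda_i with weights summing to 1.
sigma = H @ rng.normal(size=d)                  # forces sigma = P_H sigma
beta = np.linalg.solve(Sigma, sigma)
w = (H.T @ sigma) ** 2 / (sigma @ sigma)        # weights w_i
lhs = beta @ Sigma @ beta
rhs = np.sum(w * (sigma @ sigma) / lam)
```

Because H and H_0 are orthogonal complements, every step holds to machine precision; no asymptotics are involved at finite p.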

The measures κ and η are frequently joined naturally in our asymptotic expansions into the combined measure

(4.5)  φ(n, p) = κ(p)/(n η(p)).

As will be seen later, a good scenario for prediction occurs when φ(n, p) → 0 as n, p → ∞. This implies a synergy between the signal η and the sample size n, with the product nη being required to dominate the variation of the noise in X as measured by κ. This is similar to the signal rate found by Cook, Forzani and Rothman [11] in their study of abundant high-dimensional linear regression. We typically drop the arguments (n, p) when referring to φ(n, p).

4.3. Coefficients β_{y|H^T X}. The coefficients for the regression of y on the reduced predictors H^T X can be represented as β_{y|H^T X} = Σ_H^{-1} σ_H, where σ_H = H^T σ ∈ R^{d×1}. Population predictions based on the reduced predictors involve the product β_{y|H^T X}^T H^T X. If var(H^T X) = Σ_H diverges along certain directions, then we must have the corresponding parts of β_{y|H^T X} converge to 0 to balance the increases in H^T X, or otherwise the form β_{y|H^T X}^T H^T X will not make sense asymptotically. This essential behavior can be seen also from

var(β_{y|H^T X}^T H^T X) = β_{y|H^T X}^T Σ_H β_{y|H^T X} = σ_H^T Σ_H^{-1} σ_H = β^T Σ β.


Elements of σ_H may correspondingly increase to compensate for the convergence of Σ_H^{-1} to 0 in those same directions. By construction, var(H^T X/η^{1/2}) = Σ_H/η → V ≥ 0. Also, normalizing Σ_H by η forces a corresponding normalization of σ_H by η^{1/2}.

4.4. Error variance τ². The quadratic form β^T Σ β is a monotonically increasing function of p. Since var(y) = β^T Σ β + τ² is constant, as β^T Σ β increases with p, τ² must correspondingly decrease with p. Although it is technically possible to have τ → 0, we assume throughout that τ is bounded away from 0 as p → ∞, since this is likely relevant in nearly all applications.

4.5. Asymptotic dependence. In the envelope model (2.3), H represents a semi-orthogonal basis matrix for E_Σ(B). However, the SIMPLS method for estimating E_Σ(B) involves Ĝ. While span(G) = E_Σ(B), G is not semi-orthogonal, and thus we need to keep track of any asymptotic linear dependencies among the reduced variables G^T X ∈ R^d. Let

C = diag^{-1/2}(G^T Σ G) (G^T Σ G) diag^{-1/2}(G^T Σ G) ∈ R^{d×d}

denote the correlation matrix for G^T X, and define the function ρ(p) so that, as p → ∞,

(4.6)  tr(C^{-1}) ≍ ρ(p).

As with other constructions, we typically drop the argument and refer to ρ(p) as ρ. Let R_i² denote the squared multiple correlation coefficient from the linear regression of the ith coordinate of G^T X on the rest. Then tr(C^{-1}) = Σ_{i=1}^d (1 − R_i²)^{-1}, so ρ basically describes the rate of increase of the sum of variance inflation factors. It may be appropriate for many applications to assume that ρ is bounded, but it turns out that we might still obtain useful results when ρ → ∞ if its rate of increase is sufficiently slow, in particular slower than √n.

31

In high-dimensional regressions, the eigenvalues of 6 are often assumed to be 31

32

bounded away from 0 and ∞ as p → ∞, which rules out any exact asymptotic 32

33

dependence among the predictors. In the context of PLS, y ⊥ ⊥ X | GT X and so the 33

T

variables G X are the only ones that are relevant to the regression. We use ρ to

34 34

35

measure asymptotic dependencies among the variables in GT X. For instance, it 35

36

will be seen in the two theorems of Section 5 that the sample size required for con- 36

37

sistency when ρ → ∞ can be much larger than that required when ρ is bounded. 37

38

Our context allows for exact asymptotic dependencies in the complementary set of 38

39

variables H0T X, so our conclusions stand even if the smallest eigenvalue of 6H0 39

40

converges to zero. Since the eigenvalues of 6H0 are also eigenvalues of 6, the 40

41

smallest eigenvalue of 6 may converge to 0 without impacting our results. 41

42

The following proposition gives necessary and sufficient conditions for tr(C −1 ) 42

43

to be bounded. In preparation, consider the regression of y on the reduced and 43

scaled predictors H^T X/√η, where the scaling is as discussed in Section 4.3. The Krylov matrix for this regression is

G_H = (σ_H/η^{1/2}, Σ_H σ_H/η^{3/2}, Σ_H² σ_H/η^{5/2}, …, Σ_H^{d−1} σ_H/η^{d−1/2}).

Let a_H = lim_{p→∞} σ_H/η^{1/2}. Then the limiting form of G_H can be expressed as

G_∞ = lim_{p→∞} G_H = (a_H, V a_H, V² a_H, …, V^{d−1} a_H) ∈ R^{d×d},

where V = lim_{p→∞} Σ_H/η, as defined in Section 4.3. By construction, rank(G_H) = d for all finite p, but rank(G_∞) could be less than d if, for example, V is singular or some of its eigenvalues are equal.


PROPOSITION 1. V > 0 and rank(G_∞) = d if and only if tr(C^{-1}) is bounded as p → ∞.

The next two corollaries describe related implications.

COROLLARY 1. If V > 0 with distinct eigenvalues, then rank(G_∞) = d if and only if E_V(span(a_H)) = R^d.

This corollary, which follows from Cook, Li and Chiaromonte [13], Theorem 1, says in effect that rank(G_∞) = d if and only if a_H has a nonzero projection onto each of the d eigenspaces of V. If V > 0 but has fewer than d eigenspaces, then rank(G_∞) < d. This partly explains the need for the two conditions of Proposition 1.

COROLLARY 2. Assume that V > 0. Then:
(i) rank(G_∞) = d implies that V has distinct eigenvalues and that E_V(span(a_H)) = R^d.
(ii) E_V(span(a_H)) = R^d implies that V has distinct eigenvalues and that rank(G_∞) = d.

The next corollary describes what happens when the eigenvalues of Σ are bounded away from 0 and ∞ as p → ∞.

COROLLARY 3. If the eigenvalues of Σ are bounded away from 0 and ∞ as p → ∞, then V > 0. Additionally, tr(C^{-1}) is bounded if and only if rank(G_∞) = d.


4.6. An example. Suppose that Σ > 0 has compound symmetry, with diagonal elements all 1 and constant off-diagonal element ψ ∈ (0, 1),

(4.7)  Σ = (1 − ψ + pψ)P_1 + (1 − ψ)Q_1,

where P_1 is the projection onto the p × 1 vector of ones, 1_p. In this case, Σ has two eigenspaces, and the performance of PLS depends on where β falls relative to these spaces.

4.6.1. Constant covariances with y. Suppose σ = 1_p. Then β = (1 − ψ + pψ)^{-1} 1_p, H = 1_p/√p, Σ_H = (1 − ψ + pψ), Σ_{H_0} = (1 − ψ)I_{p−d}, η ≍ p, and κ ≍ p. Additionally, d = 1, w_1 = 1, ‖σ‖² = p, C = 1, λ_1 = (1 − ψ + pψ),

β^T Σ β = Σ_{i=1}^d w_i ‖σ‖²/λ_i = p/(1 − ψ + pψ) → ψ^{-1},

and G_∞ = lim_{p→∞}(H^T σ/√η) = 1 with η = p.

4.6.2. Contrasts. Suppose that 1_p^T σ = 0. Then β = (1 − ψ)^{-1} σ, H = σ/‖σ‖, Σ_H = (1 − ψ), Σ_{H_0} = (1 − ψ + pψ)P_1 + (1 − ψ)Q_{(1_p, σ)}, κ ≍ p and η ≍ 1. Also, d = 1, w_1 = 1, λ_1 = (1 − ψ) and

β^T Σ β = Σ_{i=1}^d w_i ‖σ‖²/λ_i = ‖σ‖²/(1 − ψ),

so ‖σ‖ must be bounded. Additionally, G_∞ = ‖σ‖ with η = 1.

4.6.3. Arbitrary σ. Decompose σ = P_1 σ + Q_1 σ = σ̄ 1_p + c_p, where σ̄ = 1_p^T σ/p is assumed to be bounded away from 0 and c_p = σ − 1_p σ̄ is a residual vector, 1_p^T c_p = 0. Then β = σ̄(1 − ψ + pψ)^{-1} 1_p + (1 − ψ)^{-1} c_p,

H = (h_1, h_2) = (1_p/√p, c_p/‖c_p‖),

Σ_H = diag{(1 − ψ + pψ), (1 − ψ)}, Σ_{H_0} = (1 − ψ)Q_{(1_p, c_p)}, κ ≍ p and η ≍ p. Further, d = 2,

‖σ‖² = σ^T H H^T σ = σ^T P_{h_1} σ + σ^T P_{h_2} σ = σ̄² p + ‖c_p‖²,
w_1 = σ̄² p/(σ̄² p + ‖c_p‖²),
w_2 = ‖c_p‖²/(σ̄² p + ‖c_p‖²),
β^T Σ β = σ̄² p/(1 − ψ + pψ) + (1 − ψ)^{-1} ‖c_p‖².


1 We see as a consequence of this structure that both σ̄ and kcp k must be bounded 1

2 and that w1 → 1 and w2 → 0. Additionally, with η = p and σ̄∞ = limp→∞ σ̄ , 2

3 √ √ √ 3

4

aH = lim p σ̄ / η, kcp k/ η T = (σ̄∞ , 0)T , 4

p→∞

5 5

V = lim diag (1 − ψ + pψ), (1 − ψ) /p = diag(ψ, 0),

6 p→∞ 6

7 7

σ̄∞ ψ σ̄∞

8 G∞ = . 8

0 0

9 9

10 In this case, V and G∞ both have rank 1, and so by Proposition 1 tr(C −1 ) is 10

11 unbounded as p → ∞. 11

12 To find an order for tr(C −1 ), we have 12

13 13

6σ = σ̄ b(p, ψ)1p + (1 − ψ)cp ,

14 14

15 G = σ̄ 1p + cp , σ̄ b(p, ψ)1p + (1 − ψ)cp , 15

16 2 2 2 2 2 2

! 16

σ̄ pb(p, ψ) + (1 − ψ)kcp k σ̄ pb (p, ψ) + (1 − ψ) kcp k

17 GT 6G = , 17

18

σ̄ 2 pb2 (p, ψ) + (1 − ψ)2 kcp k2 σ̄ 2 pb3 (p, ψ)3 + (1 − ψ)3 kcp k2 18

19 where b(p, ψ) = 1 − ψ + pψ. From this, it can be verified that tr(C −1 )

≍ p2 , 19

20 2 −1

so ρ = p . The behavior of tr(C ) in this example is due to the different orders 20

21 of magnitude of the eigenvalues of 6H , λ1 ≍ p and λ2 ≍ 1. As will be seen later 21

22 in Theorems 1 and 2, a consequence of this structure is that we would need sam- 22

23 ple size n ≫ p4 to keep the direction in span(1) from swamping the direction in 23

24 span⊥ (1). 24
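The closed form for G^T ΣG above can be checked numerically. In the sketch below, p = 40, ψ = 0.3, σ̄ = 1 and the contrast vector c_p are illustrative choices.

```python
import numpy as np

p, psi = 40, 0.3
ones = np.ones(p)
Sigma = (1 - psi) * np.eye(p) + psi * np.outer(ones, ones)
sbar = 1.0
cp = np.zeros(p); cp[0], cp[1] = 1.0, -1.0    # residual vector with 1_p' c_p = 0
sigma = sbar * ones + cp
b = 1 - psi + p * psi                          # b(p, psi)
G = np.column_stack([sigma, Sigma @ sigma])    # G = (sigma, Sigma sigma)
GSG = G.T @ Sigma @ G
nc2 = cp @ cp                                  # ||c_p||^2
expected = np.array([
    [sbar**2 * p * b    + (1 - psi)    * nc2,  sbar**2 * p * b**2 + (1 - psi)**2 * nc2],
    [sbar**2 * p * b**2 + (1 - psi)**2 * nc2,  sbar**2 * p * b**3 + (1 - psi)**3 * nc2],
])
assert np.allclose(GSG, expected)
```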

4.7. Universal conditions. Before discussing asymptotic results in the next section, we summarize the conditions that we assume throughout this article. We require that:

C1. Model (2.1) holds, where (y, X) follows a nonsingular multivariate normal distribution, and the data (y_i, X_i), i = 1, …, n, arise as independent copies of (y, X). To avoid the trivial case, we assume that the coefficient vector β ≠ 0, which implies that the dimension of the envelope d ≥ 1. We also assume that the error standard deviation τ is bounded away from 0 as p → ∞.
C2. φ → 0 and ρ/√n → 0 as n, p → ∞, where φ and ρ are defined at (4.5) and (4.6).
C3. η = O(κ) as p → ∞, where η ≥ 1, and η and κ are defined at (4.2) and (4.3).
C4. The dimension d of the envelope is known and constant for all finite p.
C5. Σ > 0 for all finite p. This restriction allows Σ̂ to be singular, which is a scenario PLS was designed to handle. We do not require as a universal condition that the eigenvalues of Σ be bounded as p → ∞.

Additional conditions will be needed for various results.

The asymptotic behavior of PLS predictions can depend crucially on all of the quantities described in Section 4: n, d, η, κ and ρ. In this section, we summarize our main results along with a few special scenarios that may provide useful intuition in practice. Additional results, along with proofs for those given here, are available in the supplement [10]. All of the asymptotic results in this section should be understood to hold as n, p → ∞.

5.1. Orders of D_N. The results of Theorem 1 are the most general, requiring for potentially good results in practice only that C1–C5 hold and that the terms characterizing the orders go to zero as n, p → ∞. In particular, the eigenvalues of Σ need not be bounded. Its proof is given as Supplement Theorem S1.

THEOREM 1. As n, p → ∞,

D_N = O_p(ρ/√n) + O_p{ρ^{1/2} n^{-1/2} (κ/η)^d}.

In particular,
I. If ρ ≍ 1, then D_N = O_p{n^{-1/2}(κ/η)^d}.
II. If κ ≍ η, then D_N = O_p(ρ/√n).
III. If d = 1, then D_N = O_p(√n φ).

21 21

22 We see from this that the asymptotic behavior of PLS depends crucially on the 22

23 relative sizes of signal η and noise κ in X. It follows from the general result that if 23

24 κ ≍ p, as likely occurs in Chemometrics

√ applications, and η ≍ p, so the regression 24

25 is abundant, then DN = Op (ρ/ n). This may be one of the reasons for the success 25

26 of PLS in spectrometric prediction in Chemometrics. 26

27 On the other hand, if the signal in X is small relative to the noise in X, so η = 27

28 o(κ), then it may take a very large sample size for PLS prediction to be consistent. 28

29 For instance, suppose that the regression is sparse so only d predictors matter, 29

30 and thus η ≍ 1. Then it follows reasonably that ρ ≍ 1 and, from part I, DN = 30

31 Op {n−1/2 κ d }. If, in addition, κ ≍ p then DN = Op {pd n−1/2 }. Clearly, if d is not 31

32 small, then it could take a huge sample size for PLS prediction to be consistent. 32

33 Cook and Forzani [9] showed using the same setup as employed here that for 33

34 single-component regressions (d = 1) 34

2 ) 3 )

35

∗ −1/2

tr1/2 (6H 0 tr(6H0 ) tr1/2 (6H 0

35

36 (5.1) DN = Op n + √ + + , 36

37

nkσ k2 nkσ k2 nkσ k3 37

38 where the superscript ∗ is meant as a reminder that this order of DN for d = 1 is 38

39 from Cook and Forzani [9]. To connect (5.1) with Theorem 1.III„ first substitute 39

j

40 the bound tr(6H0 ) ≤ κ j into (5.1) to obtain 40

41 1/2 41

∗ −1/2 κ κ κ

42 DN = Op n +√ +√ . 42

43 nkσ k2 nkσ k nkσ k2

2

43

In terms of φ, this becomes

D*_N = O_p(n^{-1/2} + √n φ + √n φ^{3/2}) = O_p(√n φ).

Consequently, D*_N as given in (5.1) provides a sharper result than that given in Theorem 1.III. We used the bound tr(Σ_{H0}^j) ≤ κ^j consistently when deriving the conclusions of Theorem 1 because otherwise the conclusions are complicated to the point that extracting a useful message is problematic. In some cases, (5.1) and Theorem 1.III agree. For instance, consider the compound symmetry example of Section 4.6 with σ = 1_p. Then d = 1, Σ_{H0} = (1 − ψ)I_{p−d}, κ ≍ p, η ≍ p, D*_N = O_p(1/√n) and, from part III of Theorem 1, D_N = O_p(1/√n).

Theorem 1 places no constraints on the rate of increase in the eigenvalues of Σ_{H0}. In some regressions, it may be reasonable to assume that the eigenvalues of Σ_{H0} are bounded so that tr(Σ_{H0}^h) ≍ p as p → ∞. This is what happens in the compound symmetry example. In the next theorem, we describe the asymptotic behavior of PLS predictions when tr(Σ_{H0}^h) = O(κ). Its proof follows from Supplement Theorems S2 and S3.

THEOREM 2. If tr(Σ_{H0}^h) = O(κ), h = 1, …, 4d − 1, then

D_N = O_p(ρ/√n) + O_p(√(ρφ)).

In particular,
I. If ρ ≍ 1, then D_N = O_p(√φ).
II. If η ≍ κ, then D_N = O_p(ρ/√n).
III. If d = 1, then D_N = O_p(√φ).

The order of D_N now depends on a balance between the sample size n, the variance inflation factors as measured through ρ and the noise-to-signal ratio in φ, but it no longer depends on the dimension d. Contrasting the results of Theorems 1 and 2, we see a much better rate for case I in Theorem 2 and the same rates for case II. The rate for case III in Theorem 2 is no worse than that in Theorem 1 since √φ = O(√n φ).

In the next two sections, we discuss the asymptotic behavior of PLS under models for X that may be plausible for some data. We connect with the results of Chun and Keleş [6] in Section 5.2.

5.2. Isotropic predictor variation. The compound symmetry example of Section 4.6 was used primarily to help fix ideas as the theory was developed. In that example, we specified a particular eigenstructure for Σ and then discussed outcomes depending on where σ fell relative to that eigenstructure. We next discuss an alternate way of structuring Σ that takes y into account and that may be more reflective of chemometrics applications of PLS.


(5.2) X = µ_X + Θν + ω,

where ν ∈ R^d is a vector of latent variables that is normally distributed with mean 0 and variance I_d, Θ ∈ R^{p×d} has rank d ≤ p, ω ∈ R^p is normally distributed with mean 0 and variance π^2 I_p, and ω ⊥⊥ (ν, y). Since Θ is unknown and unconstrained, there is no loss of generality in the restriction that var(ν) = I_d.
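Data from model (5.2) are easy to simulate, and doing so confirms the implied covariance Σ = ΘΘ^T + π^2 I_p. In the sketch below, the dimensions and the particular Θ are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d, pi2 = 5000, 10, 2, 1.0
Theta = np.zeros((p, d)); Theta[0, 0] = 2.0; Theta[1, 1] = 1.0  # orthogonal columns
nu = rng.standard_normal((n, d))                 # var(nu) = I_d
omega = np.sqrt(pi2) * rng.standard_normal((n, p))
X = nu @ Theta.T + omega                         # model (5.2) with mu_X = 0
Sigma = Theta @ Theta.T + pi2 * np.eye(p)        # implied covariance of X
Shat = np.cov(X, rowvar=False)
assert np.max(np.abs(Shat - Sigma)) < 0.5        # sample covariance close for large n
```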

We further assume that cov(ν, y) has no zero elements, so the dependence between X and y arises fully via ν. It follows as a consequence of this model that y ⊥⊥ X | Θ^T X, and thus the d linear combinations Θ^T X carry all of the information that X has about y. The variance of X can be expressed as

Σ = ΘΘ^T + π^2 I_p = H(Θ^T Θ + π^2 I_d)H^T + π^2 Q_H,

where H = Θ(Θ^T Θ)^{-1/2} is a semi-orthogonal basis matrix for span(Θ). Since σ = Θ cov(ν, y) and cov(ν, y) has no zero elements, it follows that E_Σ(B) = span(Θ) = span(H), Σ_H = Θ^T Θ + π^2 I_d and Σ_{H0} = π^2 I_{p−d}. We can now appeal to Theorems 1 and 2 to gain information about the asymptotic behavior of PLS under (5.2).

Since the eigenvalues of Σ_{H0} are bounded, κ ≍ p. The signal in X is measured by

tr(Σ_H) = tr(Θ^T Θ) + π^2 d ≍ Σ_{i=1}^p ‖θ_i‖^2,

where θ_i^T is the ith row of Θ. If the signal is sparse, so for example only d rows of Θ are nonzero, then tr(Σ_H) is bounded, η ≍ 1 and V = lim_{p→∞} Θ^T Θ + π^2 I_d > 0. On the other extreme, if the signal is abundant so that many rows of Θ are nonzero and tr(Σ_H) diverges, we can take η = tr(Θ^T Θ) and reasonably assume V = lim_{p→∞} Θ^T Θ/η > 0. For instance, in spectroscopy data it seems entirely plausible that notable signal comes from many wavelengths, not just a few.

It remains to address ρ. Since V > 0 with a sparse signal, and we assume V > 0 with an abundant signal, it follows from Proposition 1 that ρ ≍ 1 if and only if rank(G_∞) = d. To evaluate the rank of G_∞, we need a_H = V^{1/2} cov(ν, y), V and E_V(span(a_H)) = E_V(span(cov(ν, y))). Then, by Corollaries 1 and 2, rank(G_∞) = d if and only if V has distinct eigenvalues and cov(ν, y) has a nonzero projection onto every eigenspace of V. Although we might contrive cases where rank(V) < d, or where rank(V) = d and cov(ν, y) is orthogonal to an eigenspace of V, those would seem to be unusual in practice, and consequently it may be reasonable to assume that rank(G_∞) = d, and thus that ρ ≍ 1.

With this background, we next turn to application of Theorems 1 and 2 with κ ≍ p and ρ ≍ 1. Under conclusion II of Theorem 1, if η ≍ p then D_N = O_p(n^{-1/2}) and we expect reasonable performance from PLS predictions. From the general conclusion of Theorem 2, D_N = O_p(√φ). If in addition η ≍ p, then again

D_N = O_p(n^{-1/2}), and D_N = O_p(p^{1/4}/n^{1/2}) if η ≍ √p. These rates suggest again that PLS predictions could be useful in high-dimensional regressions.

The predictor model employed by Chun and Keleş [6], Assumption 1, in their treatment of PLS is the same as (5.2) with the added constraint that the columns of Θ are orthogonal with bounded norms that converge as sequences. As a result, Θ^T Θ is a convergent diagonal matrix, which effectively imposes sparsity and has several additional simplifying consequences:

1. The eigenvalues of Σ_{H0} must be bounded away from 0 and ∞, which implies that κ ≍ p.
2. The eigenvalues of the now diagonal matrix V = lim_{p→∞} Θ^T Θ + π^2 I_d must be distinct [6], Condition 1, and bounded away from 0 and ∞, so the signal is bounded and η ≍ 1.
3. Since cov(ν, y) has no zero elements, E_V(span(cov(ν, y))) = R^d, and thus ρ ≍ 1 by Corollaries 2 and 3. This means that ρ will not appear in the conclusions of Theorems 1 and 2.

Our results for the setting considered by Chun and Keleş can be found by setting φ = p/n and ρ = 1 in the main conclusion of Theorem 2, which gives D_N = O_p((p/n)^{1/2}). Since this requires p/n → 0, it agrees with the Chun–Keleş result. By asking that the eigenvalues of Σ be bounded, Chun and Keleş in effect assumed sparsity to motivate a sparse solution, and their requirement that the columns of Θ be orthogonal effectively forced ρ ≍ 1. In contrast, as seen in Theorems 1 and 2, PLS can in some settings achieve a convergence rate that is near √n.

5.3. Anisotropic predictor variation. Model (5.2) is restrictive because it postulates that the elements of X − µ_X − Θν are independent and identically distributed. In effect, all of the extrinsic anisotropic variation in X is due to its association with y. One extension of (5.2) allows for anisotropic variation in (X − µ_X − Θν), so its elements can be correlated:

(5.3) X = µ_X + Θν + Δ^{1/2}ω,

where Δ ∈ R^{p×p} is positive definite, the elements of ω are independent copies of a standard normal random variable, and all other quantities are as defined for (5.2), so again the elements of cov(ν, y) are all nonzero. Under this model in combination with (2.1), it can be verified that Σ = ΘΘ^T + Δ, σ = Θ cov(ν, y) and

E_Σ(B) = E_Σ(span(σ)) = E_Σ(span(Θ)) = E_Δ(span(Θ)).

Let u = dim(E_Σ(B)), let H ∈ R^{p×u} denote a semi-orthogonal basis matrix for E_Σ(B), and let (H, H_0) ∈ R^{p×p} denote an orthogonal matrix. Then for some positive definite matrices Ω ∈ R^{u×u} and Ω_0 ∈ R^{(p−u)×(p−u)}, we have Δ = HΩH^T + H_0Ω_0H_0^T and Θ = HU, where U ∈ R^{u×d} has rank d, Σ_H = UU^T + Ω, Σ_{H0} = Ω_0 and, as before, Σ = HΣ_H H^T + H_0Σ_{H0}H_0^T. We are now in a position to consider application of Theorems 1 and 2.

5.3.1. span(Θ) reduces Δ. In this case,

E_Σ(B) = E_Δ(span(Θ)) = span(Θ),

u = d, U = (Θ^T Θ)^{1/2}, Σ_H = Θ^T Θ + Ω and Σ = H(Θ^T Θ + Ω)H^T + H_0Ω_0H_0^T. Except for Ω and Ω_0, the structure that follows from this setup is just like that associated with (5.2). In particular, if Δ has bounded eigenvalues, which may be a reasonable assumption when y accounts substantially for the extrinsic variation in X, then all of the essential asymptotic results of Section 5.2 hold.
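The reducing-subspace structure above can be illustrated numerically: when Δ is reduced by span(Θ), Σ maps span(Θ) into itself, so (H, H_0) block-diagonalizes Σ. The dimensions and the particular Δ below are illustrative choices of ours (with orthonormal Θ, so H = Θ and U = I).

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 8, 2
Theta, _ = np.linalg.qr(rng.standard_normal((p, d)))  # orthonormal columns
PH = Theta @ Theta.T                                  # projection onto span(Theta)
QH = np.eye(p) - PH
Delta = 2.0 * PH + 0.5 * QH       # Delta reduced by span(Theta): Omega = 2I, Omega0 = 0.5I
Sigma = Theta @ Theta.T + Delta
# Sigma maps span(Theta) into itself and its complement into itself.
assert np.allclose(QH @ Sigma @ PH, 0)
assert np.allclose(PH @ Sigma @ QH, 0)
```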

9 9

5.3.2. span(Θ) does not reduce Δ. The situation becomes more complicated when span(Θ) does not reduce Δ. Suppose that the eigenvalues of Δ are bounded and that η is unbounded. Then, as in previous cases, κ ≍ p. But, since the eigenvalues of Ω are bounded, lim_{p→∞} Σ_H/η = lim_{p→∞} UU^T/η must be singular. This means that ρ is unbounded, and so it may still have an important impact on the conclusions of Theorems 1 and 2. On the other hand, if the eigenvalues of Ω_0 are bounded but the eigenvalues of Ω are unbounded, then we may still have κ ≍ p and η ≍ p. Going further, if ρ is bounded then we will again have D_N = O_p(1/√n).

20 6. Simulations and data analysis. 20

21 21

22 6.1. Simulations. In this section, we give simulation results in support of our 22

23 asymptotic conclusions. We use the isotropic model (5.2) and compound symmetry 23

24 (4.7) as the basis for our simulation models. 24

25 25

26

6.1.1. Isotropic model (5.2). Our simulations for the isotropic model were 26

27

all conducted with µX = 0, d = 2, π 2 = 1 and (y, ν T ) ∼ N3 (0, U ), where the 27

28

elements of U were U11 = 4, U12 = U13 = 0.8, U22 = U33 = 1 and U23 = 0. 28

29

The columns of 2 were constructed to be orthogonal with the diagonal elements 29

30

diag(2T 2) = (t1 (p), t2 (p)) of 2T 2 being increasing functions of p, and al- 30

31

ways V > 0. If V has distinct eigenvalues, then we know from the discussion 31

32

of Section 5.2 that ρ ≍ 1. To provide more details on ρ, we next give tr(C −1 ). 32

33

Let R1 (p) = (t2 (p) + π 2 )/(t1 (p) + π 2 ), R2 (p) = t2 (p)/t1 (p) and cov(y, ν) = 33

34

(v1 , v2 ). Then 34

35

−1 (v12 + v22 R1 R2 )(v12 + v22 R13 R2 ) 35

tr C =2 .

36

v12 v22 R1 (R1 − 1)2 R2 36

37 37

38

Both v1 and v2 are nonzero and do not depend on n or p. Consequently, the asymp- 38

39

totic behavior of tr(C −1 ) depends only on R1 and R2 , which both converge to finite 39

40

nonzero constants by construction. However, if R1 → 1 then tr(C −1 ) will diverge 40

41

which may have a serious impact on the rate of convergence. 41

42

Figure 1 shows results from data generated under this setup with diag(2T 2) = 42

43

(4pa , pa ), 0 < a ≤ 1, and diag(2T 2) = (4c, c) where c is constant. Consequently, 43

AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 18
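The displayed expression for tr(C^{-1}) can be explored directly. In the sketch below, v_1 = v_2 = 0.8 matches cov(y, ν) in the simulation, while the R_1 and R_2 values are illustrative; the point is the blow-up as R_1 → 1, the equal-eigenvalue regime behind Figure 2.

```python
def tr_C_inv(R1, R2, v1=0.8, v2=0.8):
    # Displayed formula for tr(C^{-1}) as a function of R1 and R2.
    num = (v1**2 + v2**2 * R1 * R2) * (v1**2 + v2**2 * R1**3 * R2)
    den = v1**2 * v2**2 * R1 * (R1 - 1)**2 * R2
    return 2 * num / den

# Moderate when the eigenvalues of V are well separated (R1 far from 1),
# and diverging as R1 -> 1.
assert tr_C_inv(0.99, 0.25) > tr_C_inv(0.9, 0.25) > tr_C_inv(0.5, 0.25)
assert tr_C_inv(0.99, 0.25) > 1e4
```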

FIG. 1. Simulation results from the isotropic model (5.2): listing from the top at log_2(p) = 6, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.

16 16

17 17

Consequently, for each Θ we can take the corresponding η = p^a, 0 ≤ a ≤ 1. It follows from the discussion of Section 5.2 and from the above calculations that ρ ≍ 1. Since κ ≍ p, the asymptotic behavior of the simulation is governed by Theorem 2.I, giving D_N = O_p(√φ) with φ = p/(nη). A data set of size n = p/2 was obtained by using n independently generated observations on (y, ν^T) and ω in model (5.2) to obtain n independent observations on X. Then n additional observations on X were generated, and D_N^2 was computed for each and averaged. The vertical axis D̂^2 of Figure 1 is the average over 100 replications of this whole process. Reading from top to bottom at log_2(p) = 6, the lines in Figure 1 correspond to η equal to a constant, p^{1/2}, p^{3/4} and p. Since n = p/2, we have φ = 2/η. Thus, in reference to Figure 1, our theoretical results predict convergence of the curves for η equal to p^{1/2}, p^{3/4} and p, but no convergence for η equal to a constant. The curves shown in Figure 1 seem to support this prediction, with the best results being achieved for η = p, followed by η = p^{3/4} and η = p^{1/2}.

32

Figure 2 was constructed like Figure 1, except that diag(Θ^T Θ) = (p^a, p^a), 0 < a ≤ 1, and diag(Θ^T Θ) = (c, c), where c is constant. This seemingly small change has the potential to have a big impact on the results because now the eigenvalues of V are no longer distinct and R_1 = 1, with the consequence that ρ may slow the rate of convergence as indicated in Theorem 2. Indeed, the results in Figure 2 seem uniformly worse than those in Figure 1. While it seems clear that the curve for η = p is convergent, it is not clear whether the curves for η = p^{1/2} or η = p^{3/4} are.

FIG. 2. Simulation results from the isotropic model (5.2): listing from the top at log_2(p) = 12, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.

The influence of U on the results of this example is controlled largely by the correlations c_{yν} = cov(y, ν)/var^{1/2}(y) between y and the elements of ν. The condition var(ν) = I_2 was imposed without loss of generality since we can always achieve it by rescaling. In Figures 1 and 2, c_{yν} = (0.4, 0.4). If we had set the correlations to be larger, say c_{yν} = (0.8, 0.8), D̂^2 would have decreased faster as a function of p. If we had set the correlations to be weaker, say c_{yν} = (0.2, 0.2), D̂^2 would have decreased more slowly. In either case, though, the general conclusions from Figures 1 and 2 would still be discernible. We selected correlations of 0.4 because we felt that they represent modest correlations that illustrate the theory nicely without giving an optimistic impression, as might happen if we had used large correlations.

23 23

6.1.2. Compound symmetry (4.7). For this simulation, we used model (2.1) with the compound symmetry structure (4.7) for Σ, constructed with σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p, and in each case ψ = 0.8. With σ, ψ and p set, we generated a single observation on X ∼ N_p(0, Σ) and then generated the corresponding y according to model (2.1) with error standard deviation τ = 1. This process was repeated n = p/2 times to get β̂. Then n additional observations on X were generated, and D_N^2 was computed for each and averaged. The vertical axis D̂^2 of Figure 3 is the average over 100 replications of this whole process.

In this simulation, we have κ ≍ p, η ≍ κ and tr(Σ_{H0}^h) ≍ κ. It follows that Theorem 2.II is applicable for σ = 1_p + c_p and σ = 1_p + 0.5c_p, giving, from the discussion in Section 4.6, D_N = O_p(p^2/√n). Since we used n = p/2, we do not expect convergence, which seems consistent with the results shown in Figure 3. Theorem 2.III applies for σ = 1_p since then d = 1. In that case, D_N = O_p(p^{-1/2}), which again seems consistent with the results of Figure 3.

38 38

39

6.2. Tetracycline data. Goicoechea and Olivieri [20] used PLS to develop a 39

40

predictor of tetracycline concentration in human blood. The 50 training samples 40

41

were constructed by spiking blank sera with various amounts of tetracycline in the 41

42

range 0–4 µg mL−1 . A validation set of 57 samples was constructed in the same 42

43

way. For each sample, the values of the predictors were determined by measuring 43

AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 20

FIG. 3. Simulation results using compound symmetry (4.7). Reading from top to bottom, the lines correspond to σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p.

15 15

16 16

fluorescence intensity at p = 101 equally spaced points in the range 450–550 nm. The authors determined using leave-one-out cross-validation that the best predictions of the training data were obtained with d = 4 linear combinations of the original 101 predictors.

We use these data to illustrate the behavior of PLS predictions in chemometrics as the number of predictors increases. We used PLS with d = 4 to predict the validation data based on p equally spaced spectra, with p ranging between 10 and 101. The root mean squared error (MSE) is shown in Figure 4 for five values of p. PLS fits were computed with the pls package in R (library(pls)). We see a relatively steep drop in MSE for small p, say less than 30, and a slow but steady decrease

FIG. 4. Tetracycline data: validation MSE from 10, 20, 33, 50 and 101 equally spaced spectra.

in MSE thereafter. Since we are dealing with actual prediction, the root-MSE will not converge to 0 with increasing p, as it seems to do in some of the simulations.

3 3

4 4

7. Discussion. In this section, we give results on the convergence of β̂ and describe our rationale for some of the restrictions that we imposed.

6 6

7 7

7.1. Convergence of β̂. The focus of this article has been on the rate of convergence of predictions as measured by D_N. In this section, we consider for completeness the rate of convergence of β̂ in the Σ inner product. Let

V_{n,p} = var^{1/2}(D_N | β̂) = {(β̂ − β)^T Σ(β̂ − β)}^{1/2}.

Then, as shown in Appendix Section S8, V_{n,p} and D_N have the same order as n, p → ∞. To be clear, we state this in the following theorem.

THEOREM 3. As n, p → ∞:
I. Under the conditions of Theorem 1, V_{n,p} = O_p(ρ/√n) + O_p{ρ^{1/2} n^{-1/2} (κ/η)^d}.
II. Under the conditions of Theorem 2, V_{n,p} = O_p(ρ/√n) + O_p(√(ρφ)).

24 24

It follows from this theorem that the special cases of Theorems 1 and 2 and the subsequent discussions apply to V_{n,p} as well. In particular, estimative convergence as measured in the Σ inner product will be at or near the root-n rate under the same conditions as predictive convergence.

29 29

30 30

7.2. Multivariate Y. Recall that we confined our study to regressions with a univariate response. An extension to multivariate Y ∈ R^r seems elusive because there are numerous PLS algorithms for multivariate Y and they can all produce different results. The two most common algorithms, NIPALS and SIMPLS, are known to produce different results when r > 1 but give the same results when r = 1 [12, 15, 34]. The multivariate version of the Krylov construction Ĝ provides another PLS algorithm. Some prefer to standardize the elements of Y to have sample variance equal to 1, while others do not standardize. Some PLS algorithms reduce Y and X simultaneously, while others reduce X alone. These various algorithms can produce different results when r > 1 but the same or equivalent results when r = 1. It seems to us that any extension to allow for a multivariate response would first need to address the multiplicity of methods, which is outside the scope of this report.

7.3. Choice of the dimension, d. We assumed throughout this article that the dimension d of the envelope is effectively fixed and known, as did Chun and Keleş [6]. In practice, d will not normally be known, so a data-dependent estimate d_{n,p} will often be used in its stead. If d_{n,p} > d, the (nonasymptotic) results of a PLS analysis will still be based on a true model, albeit one with more variation than necessary. If d_{n,p} < d, then PLS will incur some bias in estimation. The bias can be sizable if d_{n,p} is substantially less than d, an event that we judge to be unlikely because values of d_{n,p} far from d should be ruled out by standard PLS methodology. Extensions of the asymptotic results of this article that allow for using d_{n,p} instead of d will depend on the rate at which d_{n,p} converges to d. If that rate is sufficiently fast, then the results of this article will still hold. Otherwise, the rates presented here will be optimistic. We chose to assume d known so that the results might reflect the core behavior of PLS while keeping an important link with the work of Chun and Keleş [6]. This view avoided the task of studying selection methods, which is outside the scope of this article but still an important next step. Eck and Cook [17] proposed an estimator of β as a weighted average of the envelope estimators over the possible dimensions of the envelope, the weights being functions of the Bayes information criterion for each envelope model. This weighted estimator avoids the need to estimate the dimension and might be adaptable for asymptotic studies of PLS.

Another desirable extension is to allow d → ∞ as p → ∞. In such a case, we expect PLS to still yield consistent results provided d grows at a rate that is sufficiently slow relative to p.

24 24

7.4. Importance of normality. As mentioned previously, simulations and our experience in practice suggest that normality is not an essential assumption, particularly if a holdout sample is used to assess performance of the final predictive model. Theoretically, we expect that our asymptotic results are indicative for sub-normal variables, but they may not be so for sur-normal variables, depending on the tail behavior. We relied extensively on the behavior of higher-order moments of normals. Extending these results to broader classes of distributions would require bounds that would likely be quite loose for normals. Assuming normality allowed us to get relatively sharp bounds, which we feel is useful for a first look at PLS asymptotics. The same normality assumption was used by Naik and Tsai [30] in their asymptotic study of the fixed-p case and by Chun and Keleş [6] for the case in which p and n both diverge.

37 37

38

7.5. Impact of the results. Our asymptotic results are intended to provide a qualitative understanding of various plausible PLS scenarios. For instance, if it is thought that nearly all predictors contribute information about the response, so η ≍ p, then we may have D_N = O_p(n^{-1/2}) without regard to the relationship between n and p. On the other extreme, if the regression is viewed as likely sparse, so η ≍ 1, then we may have D_N = O_p((p/n)^{1/2}) and we now need n to be large relative to p.

Consider in this light the example of Section 6.2, where we observed a steady decrease in mean squared error, suggesting that the regression is abundant so η ≍ p.

Our results also serve to place the findings of Chun and Keleş [6] in a broader context by demonstrating that it is possible in some scenarios for PLS to have root-n or near root-n convergence rates as n and p diverge.

7 7

Acknowledgements. The authors thank the Associate Editor and referees for helpful comments on an earlier version of this article, and are grateful to H. C. Goicoechea and A. C. Olivieri for allowing the use of their tetracycline data.

11 11

SUPPLEMENTARY MATERIAL

Supplement to "Partial least squares prediction in high-dimensional regression" (DOI: 10.1214/18-AOS1681SUPP; .pdf).

15 15

REFERENCES

[1] ABUDU, S., KING, P. and PAGANO, T. C. (2010). Application of partial least-squares regression in seasonal streamflow forecasting. J. Hydrol. Eng. 15 612–623.
[2] BIANCOLILLO, A., BUCCI, R., MAGRÌ, A. L., MAGRÌ, A. D. and MARINI, F. (2014). Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication. Anal. Chim. Acta 820 23–31.
[3] BOULESTEIX, A.-L. and STRIMMER, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform. 8 32–44.
[4] BRO, R. and ELDÉN, L. (2009). PLS works. J. Chemom. 23 69–71.
[5] CASTEJÓN, D., GARCÍA-SEGURA, J. M., ESCUDERO, R., HERRERA, A. and CAMBERO, M. I. (2015). Metabolomics of meat exudate: Its potential to evaluate beef meat conservation and aging. Anal. Chim. Acta 901 1–11.
[6] CHUN, H. and KELEŞ, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 3–25. MR2751241
[7] COOK, R. D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the Section on Engineering and Physical Sciences 18–25. American Statistical Association, Alexandria, VA.
[8] COOK, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York. MR1645673
[9] COOK, R. D. and FORZANI, L. (2017). Big data and partial least squares prediction. Canad. J. Statist. To appear.
[10] COOK, R. D. and FORZANI, L. (2018). Supplement to "Partial least squares prediction in high-dimensional regression." DOI:10.1214/18-AOS1681SUPP.
[11] COOK, R. D., FORZANI, L. and ROTHMAN, A. J. (2013). Prediction in abundant high-dimensional linear regression. Electron. J. Stat. 7 3059–3088. MR3151762
[12] COOK, R. D., HELLAND, I. S. and SU, Z. (2013). Envelopes and partial least squares regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 851–877. MR3124794
[13] COOK, R. D., LI, B. and CHIAROMONTE, F. (2007). Dimension reduction in regression without matrix inversion. Biometrika 94 569–584. MR2410009
[14] COOK, R. D., LI, B. and CHIAROMONTE, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression. Statist. Sinica 20 927–960. MR2729839
[15] DE JONG, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18 251–263. DOI:10.1016/0169-7439(93)85002-X
[16] DELAIGLE, A. and HALL, P. (2012). Methodology and theory for partial least squares applied to functional data. Ann. Statist. 40 322–352. MR3014309
[17] ECK, D. J. and COOK, R. D. (2017). Weighted envelope estimation to handle variability in model selection. Biometrika 104 743–749. MR3694595
[18] FRANK, I. E. and FRIEDMAN, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35 102–246. DOI:10.1080/00401706.1993.10485033
[19] GARTHWAITE, P. H. (1994). An interpretation of partial least squares. J. Amer. Statist. Assoc. 89 122–127. MR1266290
[20] GOICOECHEA, H. C. and OLIVIERI, A. C. (1999). Enhanced synchronous spectrofluorometric determination of tetracycline in blood serum by chemometric analysis. Comparison of partial least-squares and hybrid linear analysis calibrations. Anal. Chem. 71 4361–4368.
[21] HELLAND, I. S. (1990). Partial least squares regression and statistical models. Scand. J. Stat. 17 97–114. MR1085924
[22] HELLAND, I. S. (1992). Maximum likelihood regression on relevant components. J. Roy.

14 14

Statist. Soc. Ser. B 54 637–647. MR1160488 <mr>

15 [23] H ELLAND , I. S. (2001). Some theoretical aspects of partial least squares regression. Chemom. 15

16 Intell. Lab. Syst. 58 97–107. 16 <author>

17 [24] K ANDEL , T. A., G ISLUM , R., J ØRGENSEN , U. and L ÆRKE , P. E. (2013). Prediction of biogas 17

18

yield and its kinetics in reed canary grass using near infrared reflectance spectroscopy and 18

chemometrics. Bioresour. Technol. 146 282–287. <author>

19 19

[25] KOCH , C., P OSCH , A. E., G OICOECHEA , H. C., H ERWIG , C. and L ENDLA , B. (2013). Multi-

20 analyte quantification in bioprocesses by Fourier-transform-infrared spectroscopy by par- 20

21 tial least squares regression and multivariate curve resolution. Anal. Chim. Acta 807 103– 21

22 110. 22 <author>

23

[26] L I , W., C HENG , Z., WANG , Y. and Q U , H. (2013). Quality control of Lonicerae Japonicae 23

Flos using near infrared spectroscopy and chemometrics. J. Pharm. Biomed. Anal. 72

24 24

33–39. <author>

25 [27] L OBAUGH , N. J., W EST, R. and M C I NTOSH , A. R. (2001). Spatiotemporal analysis of exper- 25

26 imental differences in event-related potential data with partial least squares. Psychophys- 26

27 iology 38 517–530. 27 <author>

28

[28] M ARTENS , H. and N ÆS , T. (1992). Multivariate Calibration. Wiley, Chichester. MR1029523 28

<mr>

[29] N ÆS , T. and H ELLAND , I. S. (1993). Relevant components in regression. Scand. J. Stat. 20

29 29

239–250. MR1241390 <mr>

30 [30] NAIK , P. and T SAI , C.-L. (2000). Partial least squares estimator for single-index models. J. R. 30

31 Stat. Soc. Ser. B. Stat. Methodol. 62 763–771. MR1796290 31 <mr>

32 [31] N GUYEN , D. V. and ROCKE , D. M. (2002). Tumor classification by partial least squares using 32

33

microarray gene expression data. Bioinformatics 18 39–50. 33

<author>

[32] N GUYEN , D. V. and ROCKE , D. M. (2004). On partial least squares dimension reduction for

34 34

microarray-based classification: A simulation study. Comput. Statist. Data Anal. 46 407–

35 425. MR2067030 35 <mr>

36 [33] S CHWARTZ , R. W., K EMBHAVI , A., H ARWOOD , D. and DAVIS , L. S. (2009). Human detec- 36

37 tion using partial least squares analysis. In 2009 IEEE 12th International Conference on 37

38

Computer Vision 24–31. 38

<author>

[34] TER B RAAK , C. J. F. and DE J ONG , S. (1998). The objective function of partial least squares

39 39

regression. J. Chemom. 12 41–54. <author>

40 [35] W OLD , S., M ARTENS , H. and W OLD , H. (1983). The multivariate calibration problem in 40

41 chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils 41

42 (A. Ruhe and B. Kågström, eds.). Lecture Notes in Mathematics 973 286–293. Springer, 42

43

Heidelberg. 43

<author>

AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 25

PLS PREDICTION 25

1 [36] W ORSLEY, K. J. (1997). An overview and some new developments in the statistical analysis 1

2 of PET and fMRI data. Hum. Brain Mapp. 5 254–258. 2 <author>

School of Statistics
University of Minnesota
313 Ford Hall, 224 Church St. SE
Minneapolis, Minnesota 55455, USA
E-mail: dennis@stat.umn.edu

Facultad de Ingeniería Química, UNL
Santiago del Estero 2819
Santa Fe, Argentina
E-mail: liliana.forzani@gmail.com


PDF document properties:

Title: Partial least squares prediction in high-dimensional regression
Author: R. Dennis Cook, Liliana Forzani
Subject: The Annals of Statistics
Keywords: 62J05, 62F12, abundant regressions, dimension reduction, sparse regressions
