
AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p.

The Annals of Statistics
Vol. 0, No. 0, 1–25
https://doi.org/10.1214/18-AOS1681
© Institute of Mathematical Statistics

PARTIAL LEAST SQUARES PREDICTION IN HIGH-DIMENSIONAL REGRESSION

BY R. DENNIS COOK AND LILIANA FORZANI

University of Minnesota and Facultad de Ingeniería Química, UNL, Researcher of CONICET
We study the asymptotic behavior of predictions from partial least squares (PLS) regression as the sample size and number of predictors diverge in various alignments. We show that there is a range of regression scenarios where PLS predictions have the usual root-n convergence rate, even when the sample size is substantially smaller than the number of predictors, and an even wider range where the rate is slower but may still produce practically useful results. We show also that PLS predictions achieve their best asymptotic behavior in abundant regressions, where many predictors contribute information about the response. Their asymptotic behavior tends to be undesirable in sparse regressions, where few predictors contribute information about the response.
1. Introduction. Partial least squares (PLS) regression is one of the first methods for prediction in high-dimensional linear regressions in which the sample size n may not be large relative to the number of predictors p. It was set in motion by Wold, Martens and Wold [35]. Since then, the development of PLS regression has taken place mainly within the Chemometrics community, where empirical prediction is the main issue and PLS regression is now a core method. Chemometricians tended not to address population models or regression coefficients, but instead dealt directly with predictions resulting from PLS algorithms. This custom of forgoing population considerations, asymptotic approximations and other widely accepted statistical constructs placed PLS at odds with statistical tradition, with the consequence that it has been slow to be recognized within the statistics community. There is now a vast Chemometrics literature on PLS regression, some of it refining and extending the methodology and some of it affirming basic methodology [4]. Martens and Næs' 1992 book [28] is a classical reference for PLS within the Chemometrics community.

Studies of PLS regression have appeared in the mainline statistics literature from time to time. Helland [21] was perhaps the first to define a PLS regression model, and a first attempt at maximum likelihood estimation was made by Helland [22]; see also [23, 29]. Frank and Friedman [18] gave an informative discussion of PLS
Received January 2017; revised December 2017.
MSC2010 subject classifications. Primary 62J05; secondary 62F12.
Key words and phrases. Abundant regressions, dimension reduction, sparse regressions.
regression from various statistical views, and Garthwaite [19] attempted a statistical interpretation of PLS algorithms. Naik and Tsai [30] demonstrated that PLS regression provides a consistent estimator of the central subspace [7, 8] when the distribution of the response given the predictors follows a single-index model and n → ∞ with p fixed. Delaigle and Hall [16] extended it to functional data. Cook, Helland and Su [12] established a population connection between PLS regression and envelopes [14] in the context of multivariate linear regression, provided the first firm PLS model and showed that envelope estimation leads to root-n consistent estimators whose performance dominates that of PLS in traditional fixed-p contexts.

PLS regression also has a substantial following outside of the Chemometrics and Statistics communities. Boulesteix and Strimmer [3] studied the advantages of PLS regression for the analysis of high-dimensional genomic data, and Nguyen and Rocke [31, 32] proposed it for microarray-based classification. Worsley [36] considered PLS regression for the analysis of data from PET and fMRI studies. Application of PLS for the analysis of spatiotemporal data was proposed by Lobaugh et al. [27], and Schwartz et al. [33] used PLS in image analysis. Because of these and many other applications, it seems clear that PLS regression is widely used across the applied sciences. All subsequent references to PLS in this article should be understood to mean PLS regression.

In view of the apparent success that PLS has had in Chemometrics and elsewhere, we might anticipate that it has reasonable statistical properties in high-dimensional regression. However, the algorithmic nature of PLS evidently made it difficult to study using traditional statistical measures, with the consequence that PLS was long regarded as a technique that is useful, but whose core statistical properties are elusive. Chun and Keleş [6] provided a piece of the puzzle by showing that, within a certain modeling framework, the PLS estimator of the coefficient vector in linear regression is inconsistent unless p/n → 0. They then used this as motivation for their development of a sparse version of PLS. The Chun–Keleş result poses a little dilemma. On the one hand, decades of experience support PLS as a useful method, but its inconsistency when p/n → c > 0 casts doubt on its usefulness in high-dimensional regression, which is one of the contexts in which PLS undeniably stands out by virtue of its widespread application. There are several possible explanations for this conflict, including (a) consistency does not always signal the value of a method in practice, (b) the Chemometrics literature is largely wrong about the value of PLS and (c) the modeling construct used by Chun and Keleş does not reflect the range of applications in which PLS is employed.

Cook and Forzani [9] studied single-component PLS regressions and found that in some reasonable settings PLS predictions can converge at the root-n rate as n, p → ∞, regardless of the alignment between n and p, a result that stands in contrast to the finding of Chun and Keleş [6]. Single-component regressions do occur in practice, but our impression is that multiple-component regressions are the rule. Recent studies that used multiple PLS components include studies of seasonal
streamflow forecasting [1], Italian craft beer [2], the metabolomics of meat exudate [5], the prediction of biogas yield [24], quantification in bioprocesses [25] and the Japanese honeysuckle [26].

In this article, we follow the general setup of Cook and Forzani [9] and use traditional (n, p)-asymptotic arguments to provide insights into PLS predictions in multiple-component regressions. We also give bounds on the rates of convergence for PLS predictions as n, p → ∞, and in doing so we conjecture about the value of PLS in various regression scenarios. Section 2 contains a review of PLS regression, along with comments on its connection to envelopes and sufficient dimension reduction. The specific objective of our study is described in Section 3. In Section 4, we introduce and provide intuition for various quantities that influence the (n, p)-asymptotic behavior of PLS predictions. Our main results are given as two theorems in Section 5. There we also describe connections with the results of Cook and Forzani [9] for single-component regressions and offer a different view of the Chun–Keleş result [6]. Supporting simulations and an illustrative data analysis are given in Section 6. We focus solely on predictive consistency until Section 7.1, where we address estimative consistency. Proofs and other supporting material are given in an online supplement to this article [10].

Our results show that there is a range of regression scenarios where PLS predictions have the usual root-n convergence rate, even when n ≪ p, and an even wider range where the rate is slower but may still produce practically useful results, the Chun–Keleş result notwithstanding.

2. PLS review. There are several different PLS algorithms for the multivariate (multi-response) linear regression of r responses on p predictors. These algorithms may not be presented as model-based, but instead are often regarded as methods for prediction. It is known that they give the same result for univariate responses but give distinct sample results for multivariate responses. We restrict attention to univariate regression so that the methodology is clear. See Section 7.2 for further discussion related to this choice.

The context for our study is the typical linear regression model with univariate response y and random predictor vector X ∈ R^p,

(2.1)  y = μ + βᵀ(X − E(X)) + ε,

where the regression coefficients β ∈ R^p are unknown, and the error ε has mean 0, variance τ² and is independent of X. We assume that (y, X) follows a nonsingular multivariate normal distribution and that the data (y_i, X_i), i = 1, …, n, arise as independent copies of (y, X). We use the normality assumption to facilitate asymptotic calculations and to connect with the results of Chun and Keleş [6]; nevertheless, simulations and experience in practice indicate that it is not essential for the methodology itself. Further discussion of this assumption is given in Section 7.4. To avoid trivial cases, we assume throughout that β ≠ 0.
Continuing with notation, let Y = (y₁, …, y_n)ᵀ and let F denote the p × n matrix with columns (X_i − X̄), i = 1, …, n. Then the model for the full sample can be represented also in vector form as

Y = α1_n + Fᵀβ + ε,

where 1_n represents the n × 1 vector of ones, α = E(y) and ε = (ε_i). Let Σ = var(X) > 0 and σ = cov(X, y). We use W_q(Δ) to denote the Wishart distribution with q degrees of freedom and scale matrix Δ. Let P_{A(Δ)} denote the projection in the Δ > 0 inner product onto span(A) if A is a matrix, or onto A itself if it is a subspace. We use P_A := P_{A(I)} to denote projections in the usual inner product and Q_A = I − P_A. The Euclidean norm is denoted as ‖·‖. Turning to notation for a sample, let σ̂ = n⁻¹FY and Σ̂ = n⁻¹FFᵀ ≥ 0 denote the usual moment estimators of σ and Σ using n for the divisor. With W = FFᵀ ∼ W_{n−1}(Σ), we can represent Σ̂ = W/n and σ̂ = n⁻¹(Wβ + Fε).

The PLS estimator of β hinges fundamentally on the notion that we can identify a dimension reduction subspace H ⊆ R^p so that y ⊥⊥ X | P_H X and d := dim(H) < p (and hopefully d ≪ p). This driving condition is the same as that encountered in the literature on sufficient dimension reduction (see [8] for an introduction), but PLS operates in the context of model (2.1), while sufficient dimension reduction is largely model-free. We assume that d is known in all technical results stated in this article. In Chemometrics and elsewhere, d is often chosen by using predictive cross validation or a holdout sample. See Section 7.3 for discussion on the choice of d.

Assume momentarily that a basis matrix H ∈ R^{p×d} of H is known and that Σ̂ > 0. Let B = Σ̂⁻¹σ̂ denote the ordinary least squares estimator of β. Then, following the reduction X ↦ HᵀX, ordinary least squares is used to estimate the coefficient vector β_{y|HᵀX} for the regression of y on HᵀX, giving the estimated coefficient vector β̃_{y|HᵀX} = (HᵀΣ̂H)⁻¹Hᵀσ̂. The known-H estimator β̃_H of β is then

(2.2)  β̃_H = H β̃_{y|HᵀX} = P_{H(Σ̂)} B.

Equation (2.2) describes β̃_H as a projection of B onto H and shows that β̃_H depends on H only via H. It also shows that β̃_H requires HᵀΣ̂H > 0, but does not actually require Σ̂ > 0. This is essentially how PLS handles n < p regressions: by reducing the predictors to HᵀX while requiring n ≫ d, PLS is able to deal with high-dimensional regressions in a relatively straightforward manner. The unique and essential ingredient supplied by PLS is an algorithm for estimating H.
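As a concrete sketch of (2.2) (assuming NumPy; the basis H, the dimensions and the noise level below are illustrative choices, not settings from the paper), the known-H estimator can be computed even when n < p makes Σ̂ singular, because only the d × d matrix HᵀΣ̂H is inverted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 200, 2            # n < p, so Sigma-hat below is singular

# Illustrative reduction subspace and a beta lying in it (assumptions of this sketch).
H = np.linalg.qr(rng.standard_normal((p, d)))[0]
beta = H @ np.array([1.0, -0.5])

X = rng.standard_normal((n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)

F = (X - X.mean(0)).T           # p x n matrix of centered predictors
sigma_hat = F @ (y - y.mean()) / n
Sigma_hat = F @ F.T / n         # rank at most n - 1 < p

# Known-H estimator (2.2): project B onto span(H) in the Sigma-hat inner product,
# which requires inverting only H^T Sigma-hat H.
beta_H = H @ np.linalg.solve(H.T @ Sigma_hat @ H, H.T @ sigma_hat)

print(np.linalg.matrix_rank(Sigma_hat) < p)   # True: Sigma-hat is singular
print(np.linalg.norm(beta_H - beta) < 0.2)    # True: beta is still recovered well
```

Only d parameters are estimated after the reduction, which is why n ≫ d suffices here.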
The following is the population statement, developed by Cook et al. [12], of the SIMPLS algorithm [15] for estimating H in univariate regressions. Set w₀ = 0 and W₀ = w₀. For k = 0, …, d − 1, set

S_k = span(ΣW_k),
w_{k+1} = Q_{S_k}σ/(σᵀQ_{S_k}σ)^{1/2},
W_{k+1} = (w₀, …, w_k, w_{k+1}).

At termination, span(W_d) is a dimension reduction subspace H. Since d is assumed to be known and effectively fixed, SIMPLS depends on only two population quantities—σ and Σ—that must be estimated. The sample version of SIMPLS is constructed by replacing σ and Σ by their sample counterparts and terminating after d steps, even if Σ̂ is singular. In particular, SIMPLS does not make use of Σ̂⁻¹ and so does not require Σ̂ to be nonsingular, but it does require d ≤ min(p, n − 1). If d = p, then span(W_p) = R^p and PLS reduces to the ordinary least squares estimator. Let G = (σ, Σσ, …, Σ^{d−1}σ) and Ĝ = (σ̂, Σ̂σ̂, …, Σ̂^{d−1}σ̂) denote the population and sample Krylov matrices. Helland [21] showed that span(G) = span(W_d), giving a closed-form expression for a basis of the population PLS subspace, and that the sample version of the SIMPLS algorithm gives span(Ĝ).
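Helland's equivalence span(W_d) = span(G) can be checked numerically. A sketch under illustrative choices of Σ and σ (NumPy assumed) that runs the population SIMPLS recursion and compares the resulting subspace with the Krylov subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 8, 3

A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # an illustrative positive definite Sigma
sigma = rng.standard_normal(p)         # plays the role of cov(X, y)

# Population SIMPLS recursion: w_{k+1} = Q_{S_k} sigma (normalized), S_k = span(Sigma W_k).
W = (sigma / np.linalg.norm(sigma))[:, None]       # w_1 = sigma / ||sigma||
for _ in range(d - 1):
    S = Sigma @ W                                  # basis of S_k
    Q = np.eye(p) - S @ np.linalg.pinv(S)          # projection onto orthocomplement of S_k
    w = Q @ sigma
    W = np.hstack([W, (w / np.linalg.norm(w))[:, None]])

# Krylov matrix G = (sigma, Sigma sigma, ..., Sigma^{d-1} sigma).
G = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma for k in range(d)])

# Equal subspaces <=> equal orthogonal projection matrices.
P_W = W @ np.linalg.pinv(W)
P_G = G @ np.linalg.pinv(G)
print(np.allclose(P_W, P_G))           # True: span(W_d) = span(G)
```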
PLS can be seen as an envelope method as follows [12]. A subspace R ⊆ R^p is a reducing subspace of Σ if R decomposes Σ = P_R ΣP_R + Q_R ΣQ_R, and then we say that R reduces Σ. The intersection of all reducing subspaces of Σ that contain a specified subspace S ⊆ R^p is called the Σ-envelope of S and denoted as E_Σ(S). Let P_k denote the projection onto the kth eigenspace of Σ, k = 1, …, q ≤ p. Then the Σ-envelope of S can be constructed by projecting onto the eigenspaces of Σ [14]: E_Σ(S) = Σ_{k=1}^{q} P_k S. Cook et al. [12] showed that the population SIMPLS algorithm produces E_Σ(B), the Σ-envelope of B := span(β), so H = span(W_d) = span(G) = E_Σ(B).
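The eigenspace construction E_Σ(S) = Σ_k P_k S can be sketched as follows (NumPy assumed; Σ and β are illustrative): project β onto each eigenspace of Σ and collect the nonzero projections.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6

lam = np.array([5.0, 5.0, 3.0, 3.0, 1.0, 1.0])     # three two-dimensional eigenspaces
U = np.linalg.qr(rng.standard_normal((p, p)))[0]    # eigenvectors of Sigma
Sigma = U @ np.diag(lam) @ U.T

beta = U[:, 0] + U[:, 4]        # beta touches only the eigenspaces for 5.0 and 1.0

# Envelope of span(beta): sum of the projections of beta onto each eigenspace.
parts = []
for val in np.unique(lam):
    Uk = U[:, lam == val]       # eigenvectors for this eigenvalue
    v = Uk @ (Uk.T @ beta)      # P_k beta
    if np.linalg.norm(v) > 1e-10:
        parts.append(v)
E = np.column_stack(parts)      # basis of the Sigma-envelope of span(beta)

print(E.shape[1])               # 2: one direction per eigenspace touched by beta
P_E = E @ np.linalg.pinv(E)
print(np.allclose(P_E @ Sigma @ (np.eye(p) - P_E), 0))   # True: the envelope reduces Sigma
```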
From this point, we use H ∈ R^{p×d} to denote any semi-orthogonal basis matrix for E_Σ(B) and let (H, H0) ∈ R^{p×p} denote an orthogonal matrix. The connection with envelopes led Cook et al. [12] to the following envelope model for PLS:

(2.3)  y = μ + β_{y|HᵀX}ᵀ Hᵀ(X − E(X)) + ε,
       Σ = H Σ_H Hᵀ + H0 Σ_{H0} H0ᵀ,

where

Σ_H = var(HᵀX) = HᵀΣH ∈ R^{d×d},
Σ_{H0} = var(H0ᵀX) = H0ᵀΣH0 ∈ R^{(p−d)×(p−d)},

and β_{y|HᵀX} can be interpreted as the coordinates of β relative to the basis H. In terms of the parameters in model (2.1), this model makes use of the basis H of E_Σ(B) to achieve a parsimonious re-parameterization of β and Σ: Σ is as given in the model and

(2.4)  β = P_{H(Σ)}β = Hβ_{y|HᵀX} = H(HᵀΣH)⁻¹Hᵀσ = G(GᵀΣG)⁻¹Gᵀσ,

where the last step follows because, as noted previously, E_Σ(B) = span(H) = span(G). This re-parameterization has no impact on the predictors or the error
and in consequence we still have that X is independent of ε, as assumed for model (2.1).

Beginning with model (2.3), Cook et al. [12] developed likelihood-based estimators whose performance dominates that of SIMPLS in the traditional fixed-p context. It follows from (2.3) that y ⊥⊥ X | HᵀX and HᵀX ⊥⊥ H0ᵀX, which together imply that (y, HᵀX) ⊥⊥ H0ᵀX. Model (2.3) and the condition HᵀX ⊥⊥ H0ᵀX are what set the PLS framework apart from that of sufficient dimension reduction. As a consequence of this structure, the distribution of y can respond to changes in HᵀX, but changes in H0ᵀX affect neither the distribution of y nor the distribution of HᵀX. For this reason, we refer to H0ᵀX as the noise in X. As will be seen later, the predictive success of PLS depends crucially on the relative sizes of Σ_{H0}, the variability of the noise in X, and Σ_H, the variability in the part of X that affects y.

3. Objective. Let β̂ denote the estimator of β produced by the SIMPLS algorithm: from (2.4),

β = G(GᵀΣG)⁻¹Gᵀσ,
β̂ = Ĝ(ĜᵀΣ̂Ĝ)⁻¹Ĝᵀσ̂,

where Ĝ = (σ̂, Σ̂σ̂, …, Σ̂^{d−1}σ̂), as defined previously. Our interest lies in studying the predictive performance of β̂ as n and p grow in various alignments.

Let y_N = μ + βᵀ(X_N − E(X)) + ε_N denote a new observation on y at a new independent observation X_N of X. The PLS predicted value of y_N at X_N is ŷ_N = ȳ + β̂ᵀ(X_N − X̄), giving a difference of

ŷ_N − y_N = ȳ − μ + (β̂ − β)ᵀ(X_N − E(X)) − (β̂ − β)ᵀ(X̄ − E(X)) − βᵀ(X̄ − E(X)) − ε_N.

The first term satisfies ȳ − μ = O_p(n^{−1/2}). Since var(y) = βᵀΣβ + τ² must remain constant as p grows, β ≠ 0 and Σ > 0, we see that βᵀΣβ ≍ 1 as p → ∞, where "a_k ≍ b_k" means that, as k → ∞, a_k = O(b_k) and b_k = O(a_k). Thus the fourth term βᵀ(X̄ − E(X)) = O_p(n^{−1/2}) by Chebyshev's inequality: var(βᵀ(X̄ − E(X))) = βᵀΣβ/n → 0 as n, p → ∞. The term (β̂ − β)ᵀ(X̄ − E(X)) must have order smaller than or equal to the order of (β̂ − β)ᵀ(X_N − E(X)), which will be at least O_p(n^{−1/2}).

Consequently, we have the essential asymptotic representation

ŷ_N − y_N = O_p((β̂ − β)ᵀ(X_N − E(X))) − ε_N as n, p → ∞.

Since ε_N is the intrinsic error in the new observation, the (n, p)-asymptotic behavior of the prediction ŷ_N is governed by the estimative performance of β̂ as measured by

(3.1)  D_N := (β̂ − β)ᵀω_N = (σ̂ᵀĜ(ĜᵀΣ̂Ĝ)⁻¹Ĝᵀ − σᵀG(GᵀΣG)⁻¹Gᵀ)ω_N,
where ω_N = X_N − E(X) ∼ N(0, Σ). Our goal now is to determine asymptotic properties of D_N as n, p → ∞. Because var(D_N | β̂) = (β̂ − β)ᵀΣ(β̂ − β), results for D_N also tell us about the asymptotic behavior of β̂ in the Σ inner product. Consistency of β̂ is discussed in Section 7.1. Until then, we focus exclusively on predictions via D_N.
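A minimal end-to-end sketch of the SIMPLS prediction studied here, β̂ = Ĝ(ĜᵀΣ̂Ĝ)⁻¹Ĝᵀσ̂ (NumPy assumed; the abundant compound-symmetry regression below is one illustrative setting, not an example taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, psi, tau = 100, 400, 0.5, 0.5

# Abundant regression: compound-symmetry Sigma = (1 - psi) I + psi 1 1^T and
# beta = 1_p / p, so every predictor carries a little information about y.
beta = np.ones(p) / p

def draw_X(m):
    return (np.sqrt(psi) * rng.standard_normal((m, 1))
            + np.sqrt(1 - psi) * rng.standard_normal((m, p)))

X = draw_X(n)
y = X @ beta + tau * rng.standard_normal(n)

def pls_beta(X, y, d):
    """SIMPLS estimator via the sample Krylov matrix G-hat."""
    F = (X - X.mean(0)).T                 # p x n centered predictors
    s = F @ (y - y.mean()) / len(y)       # sigma-hat
    S = F @ F.T / len(y)                  # Sigma-hat; singular since n < p
    K = [s]
    for _ in range(d - 1):
        K.append(S @ K[-1])               # columns sigma-hat, Sigma-hat sigma-hat, ...
    G = np.column_stack(K)
    return G @ np.linalg.solve(G.T @ S @ G, G.T @ s)

b = pls_beta(X, y, d=1)                   # the envelope dimension is d = 1 here

# Predict at fresh observations and compare with the intrinsic noise level tau^2.
X_new = draw_X(200)
y_new = X_new @ beta + tau * rng.standard_normal(200)
y_hat = y.mean() + (X_new - X.mean(0)) @ b
mspe = np.mean((y_hat - y_new) ** 2)
print(mspe < 2 * tau**2)                  # True: near the intrinsic error despite n << p
```

The favorable behavior here reflects the abundant-signal regime discussed in Section 4; in sparse settings the same code would predict far less accurately.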
4. Overarching considerations. In this section, we introduce and discuss various population constructs that play key roles in the asymptotic results of Section 5.

4.1. Dimension d of E_Σ(B). As mentioned in Section 2, we assume throughout this article that the dimension d = dim{E_Σ(B)} is known and constant for all finite p ≥ d. Technically, this dimension may increase for a time with p (e.g., while p < d), but we assume that it remains constant after a certain point.
4.2. Signal and noise in X. Although we are pursuing asymptotic properties of PLS predictions via (3.1), the envelope model (2.3) guides aspects of the study. Under this envelope construction, B ⊆ E_Σ(B) and, for any nonnegative integer k,

(4.1)  Σᵏ = H Σ_Hᵏ Hᵀ + H0 Σ_{H0}ᵏ H0ᵀ.

Our asymptotic results depend fundamentally on the sizes of Σ_H and Σ_{H0}. Define η(p) : R ↦ R and κ(p) : R ↦ R as

(4.2)  tr(Σ_H) ≍ η(p) ≥ 1,
(4.3)  tr(Σ_{H0}) ≍ κ(p),

where we imposed the condition η(p) ≥ 1 without loss of generality. In what follows, we will typically suppress the argument and refer to η(p) and κ(p) as η and κ. If finitely many of the eigenvalues of Σ_{H0} are O(p) and the rest are all bounded away from 0 and ∞, then we could take κ = p. Otherwise, it is technically possible that p = o(κ), although we would not normally expect that in practice.

To gain intuition about η(p), let λ_i denote the ith eigenvalue of Σ_H, i = 1, …, d, and assume without loss of generality that the columns of H = (h₁, …, h_d) are orthogonal eigenvectors of Σ. Then, using (4.1) and the facts that σ = P_H σ and Σ_H = diag(λ₁, …, λ_d),

(4.4)  βᵀΣβ = σᵀΣ⁻¹σ = σᵀH Σ_H⁻¹ Hᵀσ
             = ‖σ‖² (σᵀH Σ_H⁻¹ Hᵀσ)/(σᵀP_H σ)
             = Σ_{i=1}^{d} w_i ‖σ‖²/λ_i,
where the weights w_i = σᵀP_{h_i}σ/σᵀP_H σ, P_{h_i} denotes the projection onto span(h_i) and Σ_{i=1}^{d} w_i = 1. Consequently, if the w_i are bounded away from 0 and if many predictors are correlated with y so that ‖σ‖² → ∞, then the eigenvalues of Σ_H must diverge to ensure that βᵀΣβ remains bounded. We could in this case take η(p) = ‖σ‖².
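Identity (4.4) is easy to confirm numerically. A sketch with illustrative Σ and σ (NumPy assumed) that builds Σ from chosen eigenvectors, places σ in span(H), and checks βᵀΣβ = Σᵢ wᵢ‖σ‖²/λᵢ:

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 10, 3

U = np.linalg.qr(rng.standard_normal((p, p)))[0]   # eigenvectors of Sigma
H = U[:, :d]                                       # h_1, ..., h_d span the envelope
lam = np.array([8.0, 4.0, 1.0])                    # eigenvalues of Sigma_H
lam0 = np.linspace(0.5, 2.0, p - d)                # eigenvalues of Sigma_H0
Sigma = U @ np.diag(np.concatenate([lam, lam0])) @ U.T

sigma = H @ np.array([3.0, 1.0, 2.0])              # sigma = P_H sigma
beta = np.linalg.solve(Sigma, sigma)

# w_i = sigma^T P_{h_i} sigma / sigma^T P_H sigma; here sigma^T P_H sigma = ||sigma||^2.
w = (H.T @ sigma) ** 2 / (sigma @ sigma)

lhs = beta @ Sigma @ beta
rhs = np.sum(w * (sigma @ sigma) / lam)
print(np.isclose(lhs, rhs), np.isclose(w.sum(), 1.0))   # True True
```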
Suppose that the first k eigenvalues λ_i, i = 1, …, k, diverge with p, that λ_i ≍ λ_j, i, j = 1, …, k, and that the remaining d − k eigenvalues are of lower order, λ_j = o(λ_i), i = 1, …, k, j = k + 1, …, d. Then, if ‖σ‖² ≍ λ_i, i = 1, …, k, we must have w_i → 0 for i = k + 1, …, d for βᵀΣβ to remain bounded.

It is possible also that the eigenvalues λ_i are bounded. This happens in sparse regressions when only d predictors are relevant. For instance, if H = (I_d, 0)ᵀ, then Σ_H is the dth order leading principal submatrix of Σ, and thus it is fixed with bounded eigenvalues. Bounded eigenvalues are possible also when many predictors are related weakly with the response, so that ‖σ‖ is bounded. If the eigenvalues λ_i are bounded, then η ≍ 1.

From the discussion so far, we see that κ, being the trace of a (p − d) × (p − d) positive definite matrix, would normally be at least of order p, but might have a larger order; η, being the trace of a d × d matrix, will in practice have order at most p and can achieve that order in abundant regressions where ‖σ‖² ≍ p. We can contrive cases where p = o(η), but they seem impractical. For these reasons, we limit our consideration to regressions in which η = O(κ).
The measures κ and η are frequently joined naturally in our asymptotic expansions into the combined measure

(4.5)  φ(n, p) = κ(p)/(nη(p)).

As will be seen later, a good scenario for prediction occurs when φ(n, p) → 0 as n, p → ∞. This implies a synergy between the signal η and the sample size n, with the product nη being required to dominate the variation of the noise in X as measured by κ. This is similar to the signal rate found by Cook, Forzani and Rothman [11] in their study of abundant high-dimensional linear regression. We typically drop the arguments (n, p) when referring to φ(n, p).
4.3. Coefficients β_{y|HᵀX}. The coefficients for the regression of y on the reduced predictors HᵀX can be represented as β_{y|HᵀX} = Σ_H⁻¹σ_H, where σ_H = Hᵀσ ∈ R^{d×1}. Population predictions based on the reduced predictor involve the product β_{y|HᵀX}ᵀHᵀX. If var(HᵀX) = Σ_H diverges along certain directions, then we must have corresponding parts of β_{y|HᵀX} converge to 0 to balance the increases in HᵀX, or otherwise the form β_{y|HᵀX}ᵀHᵀX will not make sense asymptotically. This essential behavior can be seen also from

var(β_{y|HᵀX}ᵀHᵀX) = β_{y|HᵀX}ᵀ Σ_H β_{y|HᵀX} = σ_Hᵀ Σ_H⁻¹ σ_H = βᵀΣβ.

Since βᵀΣβ is bounded, if Σ_H diverges along certain directions then σ_H must correspondingly increase to compensate for the convergence of Σ_H⁻¹ to 0 in those same directions. By construction, var(HᵀX/η^{1/2}) = Σ_H/η → V ≥ 0. Also, normalizing Σ_H by η forces a corresponding normalization of σ_H by η^{1/2}.
4.4. Error variance τ². The quadratic form βᵀΣβ is a monotonically increasing function of p. Since var(y) = βᵀΣβ + τ² is constant, as βᵀΣβ increases with p, τ² must correspondingly decrease with p. Although it is technically possible to have τ → 0, we assume throughout that τ is bounded away from 0 as p → ∞, since this is likely relevant in nearly all applications.
4.5. Asymptotic dependence. In the envelope model (2.3), H represents a semi-orthogonal basis matrix for E_Σ(B). However, the SIMPLS method for estimating E_Σ(B) involves Ĝ. While span(G) = E_Σ(B), G is not semi-orthogonal, and thus we need to keep track of any asymptotic linear dependencies among the reduced variables GᵀX ∈ R^d. Let

C = diag^{−1/2}(GᵀΣG) (GᵀΣG) diag^{−1/2}(GᵀΣG) ∈ R^{d×d}

denote the correlation matrix for GᵀX, and define the function ρ(p) so that, as p → ∞,

(4.6)  tr(C⁻¹) ≍ ρ(p).

As with other constructions, we typically drop the argument and refer to ρ(p) as ρ. Let R_i² denote the squared multiple correlation coefficient from the linear regression of the ith coordinate of GᵀX onto the rest. Then tr(C⁻¹) = Σ_{i=1}^{d}(1 − R_i²)⁻¹, so ρ basically describes the rate of increase in the sum of variance inflation factors. It may be appropriate for many applications to assume that ρ is bounded, but it turns out that we might still obtain useful results when ρ → ∞ if its rate of increase is sufficiently slow, in particular slower than √n.
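The link between tr(C⁻¹) and variance inflation factors can be verified directly on any correlation matrix (NumPy assumed; C below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

# A random d x d correlation matrix C.
A = rng.standard_normal((d, d + 2))
S = A @ A.T
Dinv = np.diag(1 / np.sqrt(np.diag(S)))
C = Dinv @ S @ Dinv

tr_Cinv = np.trace(np.linalg.inv(C))

# Sum of variance inflation factors 1 / (1 - R_i^2), where R_i^2 is the squared
# multiple correlation of coordinate i on the remaining coordinates.
vif_sum = 0.0
for i in range(d):
    rest = [j for j in range(d) if j != i]
    c_ir = C[np.ix_([i], rest)]
    R2 = (c_ir @ np.linalg.solve(C[np.ix_(rest, rest)], c_ir.T)).item()
    vif_sum += 1 / (1 - R2)

print(np.isclose(tr_Cinv, vif_sum))   # True
```

This is the standard fact that the ith diagonal entry of the inverse of a correlation matrix equals (1 − R_i²)⁻¹.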
In high-dimensional regressions, the eigenvalues of Σ are often assumed to be bounded away from 0 and ∞ as p → ∞, which rules out any exact asymptotic dependence among the predictors. In the context of PLS, y ⊥⊥ X | GᵀX, and so the variables GᵀX are the only ones that are relevant to the regression. We use ρ to measure asymptotic dependencies among the variables in GᵀX. For instance, it will be seen in the two theorems of Section 5 that the sample size required for consistency when ρ → ∞ can be much larger than that required when ρ is bounded. Our context allows for exact asymptotic dependencies in the complementary set of variables H0ᵀX, so our conclusions stand even if the smallest eigenvalue of Σ_{H0} converges to zero. Since the eigenvalues of Σ_{H0} are also eigenvalues of Σ, the smallest eigenvalue of Σ may converge to 0 without impacting our results.

The following proposition gives necessary and sufficient conditions for tr(C⁻¹) to be bounded. In preparation, consider the regression of y on the reduced and


scaled predictors HᵀX/√η, where the scaling is as discussed in Section 4.3. The Krylov matrix for this regression is

G_H = (σ_H/√η, (Σ_H/η)(σ_H/√η), (Σ_H/η)²(σ_H/√η), …, (Σ_H/η)^{d−1}(σ_H/√η)).

Let a_H = lim_{p→∞} σ_H/√η. Then the limiting form of G_H can be expressed as

G∞ = lim_{p→∞} G_H = (a_H, V a_H, V²a_H, …, V^{d−1}a_H) ∈ R^{d×d},

where V = lim_{p→∞} Σ_H/η, as defined in Section 4.3. By construction, rank(G_H) = d for all finite p, but rank(G∞) could be less than d if, for example, V is singular or some of its eigenvalues are equal.

PROPOSITION 1. V > 0 and rank(G∞) = d if and only if tr(C⁻¹) is bounded as p → ∞.

The next two corollaries describe related implications.

COROLLARY 1. If V > 0 with distinct eigenvalues, then rank(G∞) = d if and only if E_V(span(a_H)) = R^d.
This corollary, which follows from Cook, Li and Chiaromonte [13], Theorem 1, says in effect that rank(G∞) = d if and only if a_H has a nonzero projection onto each of the d eigenspaces of V. If V > 0, but has fewer than d eigenspaces, then rank(G∞) < d. This partly explains the need for the two conditions of Proposition 1.

COROLLARY 2. Assume that V > 0. Then:

(i) rank(G∞) = d implies that V has distinct eigenvalues and that E_V(span(a_H)) = R^d.
(ii) E_V(span(a_H)) = R^d implies that V has distinct eigenvalues and that rank(G∞) = d.

The next corollary describes what happens when the eigenvalues of Σ are bounded away from 0 and ∞ as p → ∞.

COROLLARY 3. If the eigenvalues of Σ are bounded away from 0 and ∞ as p → ∞, then V > 0. Additionally, tr(C⁻¹) is bounded if and only if rank(G∞) = d.
4.6. Compound symmetry. To help fix ideas, consider a regression in which Σ > 0 has compound symmetry, with diagonal elements all 1 and constant off-diagonal element ψ ∈ (0, 1):

(4.7)  Σ = (1 − ψ + pψ)P₁ + (1 − ψ)Q₁,

where P₁ is the projection onto the p × 1 vector of ones 1_p. In this case, Σ has two eigenspaces, and the performance of PLS depends on where β falls relative to these spaces.
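The two-eigenspace structure in (4.7) can be verified numerically (NumPy assumed; p and ψ are illustrative):

```python
import numpy as np

p, psi = 50, 0.3
Sigma = (1 - psi) * np.eye(p) + psi * np.ones((p, p))   # compound symmetry

evals = np.linalg.eigvalsh(Sigma)                       # ascending order
# One eigenvalue 1 - psi + p psi (eigenvector 1_p) and 1 - psi repeated p - 1 times.
print(np.isclose(evals[-1], 1 - psi + p * psi))         # True
print(np.allclose(evals[:-1], 1 - psi))                 # True
```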
4.6.1. Constant covariances with y. Suppose σ = 1_p. Then β = (1 − ψ + pψ)⁻¹1_p, H = 1_p/√p, Σ_H = (1 − ψ + pψ), Σ_{H0} = (1 − ψ)I_{p−d}, η ≍ p, and κ ≍ p. Additionally, d = 1, w₁ = 1, ‖σ‖² = p, C = 1, λ₁ = (1 − ψ + pψ),

βᵀΣβ = Σ_{i=1}^{d} w_i‖σ‖²/λ_i = p/(1 − ψ + pψ) → ψ⁻¹,

and G∞ = lim_{p→∞}(Hᵀσ/√η) = 1 with η = p.

4.6.2. Contrasts. Suppose that 1_pᵀσ = 0. Then β = (1 − ψ)⁻¹σ, H = σ/‖σ‖, Σ_H = (1 − ψ), Σ_{H0} = (1 − ψ + pψ)P₁ + (1 − ψ)Q_{1,σ}, κ ≍ p and η ≍ 1. Also, d = 1, w₁ = 1, λ₁ = (1 − ψ) and

βᵀΣβ = Σ_{i=1}^{d} w_i‖σ‖²/λ_i = ‖σ‖²/(1 − ψ),

so ‖σ‖ must be bounded. Additionally, G∞ = ‖σ‖ with η = 1.
4.6.3. Arbitrary σ. Decompose σ = P₁σ + Q₁σ = σ̄1_p + c_p, where σ̄ = 1_pᵀσ/p is assumed to be bounded away from 0 and c_p = σ − 1_pσ̄ is a residual vector, 1_pᵀc_p = 0. Then β = σ̄(1 − ψ + pψ)⁻¹1_p + (1 − ψ)⁻¹c_p,

H = (h₁, h₂) = (1_p/√p, c_p/‖c_p‖),

Σ_H = diag{(1 − ψ + pψ), (1 − ψ)}, Σ_{H0} = (1 − ψ)Q_{1,c_p}, κ ≍ p and η ≍ p. Further, d = 2,

‖σ‖² = σᵀHHᵀσ = σᵀP_{h₁}σ + σᵀP_{h₂}σ = σ̄²p + ‖c_p‖²,
w₁ = σ̄²p/(σ̄²p + ‖c_p‖²),
w₂ = ‖c_p‖²/(σ̄²p + ‖c_p‖²),
βᵀΣβ = σ̄²p/(1 − ψ + pψ) + (1 − ψ)⁻¹‖c_p‖².
1 We see as a consequence of this structure that both σ̄ and kcp k must be bounded 1
2 and that w1 → 1 and w2 → 0. Additionally, with η = p and σ̄∞ = limp→∞ σ̄ , 2
3 √ √ √  3
4
aH = lim p σ̄ / η, kcp k/ η T = (σ̄∞ , 0)T , 4
p→∞
5  5
V = lim diag (1 − ψ + pψ), (1 − ψ) /p = diag(ψ, 0),
6 p→∞ 6
7   7
σ̄∞ ψ σ̄∞
8 G∞ = . 8
0 0
9 9
10 In this case, V and G∞ both have rank 1, and so by Proposition 1 tr(C −1 ) is 10
11 unbounded as p → ∞. 11
12 To find an order for tr(C −1 ), we have 12
13 13
6σ = σ̄ b(p, ψ)1p + (1 − ψ)cp ,
14  14
15 G = σ̄ 1p + cp , σ̄ b(p, ψ)1p + (1 − ψ)cp , 15
16 2 2 2 2 2 2
! 16
σ̄ pb(p, ψ) + (1 − ψ)kcp k σ̄ pb (p, ψ) + (1 − ψ) kcp k
17 GT 6G = , 17
18
σ̄ 2 pb2 (p, ψ) + (1 − ψ)2 kcp k2 σ̄ 2 pb3 (p, ψ)3 + (1 − ψ)3 kcp k2 18
19 where b(p, ψ) = 1 − ψ + pψ. From this, it can be verified that tr(C −1 )
≍ p2 , 19
20 2 −1
so ρ = p . The behavior of tr(C ) in this example is due to the different orders 20
21 of magnitude of the eigenvalues of 6H , λ1 ≍ p and λ2 ≍ 1. As will be seen later 21
22 in Theorems 1 and 2, a consequence of this structure is that we would need sam- 22
23 ple size n ≫ p4 to keep the direction in span(1) from swamping the direction in 23
24 span⊥ (1). 24
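The algebra in these examples is easy to check numerically. The sketch below is a hypothetical illustration in Python: it assumes the compound symmetry structure (4.7) is Σ = (1 − ψ)I_p + ψ1_p1_p^T, which is consistent with the eigenvalues 1 − ψ + pψ and 1 − ψ used above, and verifies the action of Σ on 1_p and on contrasts, together with the resulting decompositions of β and β^T Σβ.

```python
import numpy as np

# Assumed compound symmetry form (4.7): Sigma = (1 - psi) I_p + psi 1_p 1_p^T.
p, psi = 50, 0.8
ones = np.ones(p)
Sigma = (1 - psi) * np.eye(p) + psi * np.outer(ones, ones)
b = 1 - psi + p * psi                      # b(p, psi)

rng = np.random.default_rng(0)
sigma = ones + rng.normal(size=p)          # an arbitrary sigma
sbar = ones @ sigma / p
c = sigma - sbar * ones                    # contrast part, 1_p^T c = 0

# Sigma multiplies 1_p by b(p, psi) and any contrast by (1 - psi).
assert np.allclose(Sigma @ ones, b * ones)
assert np.allclose(Sigma @ c, (1 - psi) * c)

# beta = Sigma^{-1} sigma = sbar b^{-1} 1_p + (1 - psi)^{-1} c_p.
beta = np.linalg.solve(Sigma, sigma)
assert np.allclose(beta, sbar / b * ones + c / (1 - psi))

# beta^T Sigma beta = sbar^2 p / b + ||c_p||^2 / (1 - psi).
assert np.isclose(beta @ Sigma @ beta, sbar**2 * p / b + (c @ c) / (1 - psi))
```

The same two eigen-actions drive all three cases of Section 4.6: σ = 1_p picks up only the large eigenvalue, a contrast only the small one, and an arbitrary σ a mixture of both.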
4.7. Universal conditions. Before discussing asymptotic results in the next section, we summarize the conditions that we assume throughout this article. We require that:

C1. Model (2.1) holds, where (y, X) follows a nonsingular multivariate normal distribution and the data (y_i, X_i), i = 1, . . . , n, arise as independent copies of (y, X). To avoid the trivial case, we assume that the coefficient vector β ≠ 0, which implies that the dimension of the envelope d ≥ 1. We also assume that the error standard deviation τ is bounded away from 0 as p → ∞.
C2. φ and ρ/√n → 0 as n, p → ∞, where φ and ρ are defined at (4.5) and (4.6).
C3. η = O(κ) as p → ∞, where η ≥ 1, and η and κ are defined at (4.2) and (4.3).
C4. The dimension d of the envelope is known and constant for all finite p.
C5. Σ > 0 for all finite p. This restriction allows Σ̂ to be singular, which is a scenario PLS was designed to handle. We do not require as a universal condition that the eigenvalues of Σ be bounded as p → ∞.

Additional conditions will be needed for various results.
5. Asymptotic results. Depending on properties of the regression, the asymptotic behavior of PLS predictions can depend crucially on all of the quantities described in Section 4: n, d, η, κ and ρ. In this section, we summarize our main results along with a few special scenarios that may provide useful intuition in practice. Additional results, along with proofs for those given here, are available in the supplement [10]. All of the asymptotic results in this section should be understood to hold as n, p → ∞.

5.1. Orders of D_N. The results of Theorem 1 are the most general, requiring for potentially good results in practice only that C1–C5 hold and that the terms characterizing the orders go to zero as n, p → ∞. In particular, the eigenvalues of Σ need not be bounded. Its proof is given as Supplement Theorem S1.

THEOREM 1. As n, p → ∞,

D_N = O_p(ρ/√n) + O_p{ρ^{1/2} n^{−1/2} (κ/η)^d}.

In particular,

I. If ρ ≍ 1, then D_N = O_p{n^{−1/2}(κ/η)^d}.
II. If κ ≍ η, then D_N = O_p(ρ/√n).
III. If d = 1, then D_N = O_p(√n φ).

We see from this that the asymptotic behavior of PLS depends crucially on the relative sizes of the signal η and the noise κ in X. It follows from the general result that if κ ≍ p, as likely occurs in Chemometrics applications, and η ≍ p, so the regression is abundant, then D_N = O_p(ρ/√n). This may be one of the reasons for the success of PLS in spectrometric prediction in Chemometrics.

On the other hand, if the signal in X is small relative to the noise in X, so η = o(κ), then it may take a very large sample size for PLS prediction to be consistent. For instance, suppose that the regression is sparse, so only d predictors matter and thus η ≍ 1. Then it follows reasonably that ρ ≍ 1 and, from part I, D_N = O_p{n^{−1/2}κ^d}. If, in addition, κ ≍ p, then D_N = O_p{p^d n^{−1/2}}. Clearly, if d is not small, then it could take a huge sample size for PLS prediction to be consistent.
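The gap between the sparse and abundant regimes can be conveyed with a back-of-envelope computation. The numbers below are purely illustrative: we treat the order statements of Theorem 1 as equalities with constant 1, which is not a bound from this article, and ask how large n must be for the leading term to fall below 0.1.

```python
# Back-of-envelope reading of Theorem 1 with kappa ~ p (hypothetical
# constants; the order symbols are treated as exact equalities here).
p, d = 1000, 2

# Abundant regression (eta ~ p, so kappa/eta ~ 1): D_N ~ n^{-1/2},
# and n^{-1/2} <= 0.1 needs only n >= 100, regardless of p.
n_abundant = 10 ** 2
assert n_abundant == 100

# Sparse regression (eta ~ 1): D_N ~ p^d n^{-1/2}, so n >= (10 p^d)^2.
n_sparse = (10 * p ** d) ** 2
assert n_sparse == 10 ** 14      # hopeless for d = 2 and p = 1000
```

Even at this crude level, the contrast between n ≈ 10² and n ≈ 10¹⁴ makes the qualitative message of Theorem 1 concrete.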
Cook and Forzani [9] showed, using the same setup as employed here, that for single-component regressions (d = 1)

(5.1) D*_N = O_p{ n^{−1/2} + tr^{1/2}(Σ²_H0)/(√n ‖σ‖²) + tr(Σ_H0)/(n ‖σ‖²) + tr^{1/2}(Σ³_H0)/(n ‖σ‖³) },

where the superscript * is meant as a reminder that this order of D_N for d = 1 is from Cook and Forzani [9]. To connect (5.1) with Theorem 1.III, first substitute the bound tr(Σ^j_H0) ≤ κ^j into (5.1) to obtain

D*_N = O_p{ n^{−1/2} + κ/(√n ‖σ‖²) + κ/(n ‖σ‖²) + κ^{3/2}/(n ‖σ‖³) }.

Next, it follows immediately from (4.4) that, when d = 1, η ≍ ‖σ‖² and so

D*_N = O_p( n^{−1/2} + √n φ + φ + √n φ^{3/2} ) = O_p(√n φ).

Consequently, D*_N as given in (5.1) provides a sharper result than that given in Theorem 1.III. We used the bound tr(Σ^j_H0) ≤ κ^j consistently when deriving the conclusions of Theorem 1 because otherwise the conclusions are complicated to the point that extracting a useful message is problematic. In some cases, (5.1) and Theorem 1.III agree. For instance, consider the compound symmetry example of Section 4.6 with σ = 1_p. Then d = 1, Σ_H0 = (1 − ψ)I_{p−d}, κ ≍ p, η ≍ p, D*_N = O_p(1/√n) and, from part III of Theorem 1, D_N = O_p(1/√n).
Theorem 1 places no constraints on the rate of increase of the eigenvalues of Σ_H0. In some regressions, it may be reasonable to assume that the eigenvalues of Σ_H0 are bounded, so that tr(Σ^h_H0) ≍ p as p → ∞. This is what happens in the compound symmetry example. In the next theorem, we describe the asymptotic behavior of PLS predictions when tr(Σ^h_H0) = O(κ). Its proof follows from Supplement Theorems S2 and S3.

THEOREM 2. If tr(Σ^h_H0) = O(κ), h = 1, . . . , 4d − 1, then

D_N = O_p(ρ/√n) + O_p(√(ρφ)).

In particular,

I. If ρ ≍ 1, then D_N = O_p(√φ).
II. If η ≍ κ, then D_N = O_p(ρ/√n).
III. If d = 1, then D_N = O_p(√φ).

The order of D_N now depends on a balance between the sample size n, the variance inflation factors as measured through ρ, and the noise-to-signal ratio φ, but it no longer depends on the dimension d. Contrasting the results of Theorems 1 and 2, we see a much better rate for case I in Theorem 2, and the same rates for case II. The rate for case III in Theorem 2 is no worse than that in Theorem 1 since √φ = O(√n φ).

In the next two sections, we discuss the asymptotic behavior of PLS under models for X that may be plausible for some data. We connect with the results of Chun and Keleş [6] in Section 5.2.

5.2. Isotropic predictor variation. The compound symmetry example of Section 4.6 was used primarily to help fix ideas as the theory was developed. In that example, we specified a particular eigenstructure for Σ and then discussed outcomes depending on where σ fell relative to that eigenstructure. We next discuss an alternate way of structuring Σ that takes y into account and that may be more reflective of Chemometrics applications of PLS.
We suppose that X can be modeled as

(5.2) X = μ_X + Θν + ω,

where ν ∈ R^d is a vector of latent variables that is normally distributed with mean 0 and variance I_d, Θ ∈ R^{p×d} has rank d ≤ p, ω ∈ R^p is normally distributed with mean 0 and variance π²I_p, and ω ⊥⊥ (ν, y). Since Θ is unknown and unconstrained, there is no loss of generality in the restriction that var(ν) = I_d.

We further assume that cov(ν, y) has no 0 elements, so the dependence between X and y arises fully via ν. It follows as a consequence of this model that X ⊥⊥ ν | Θ^T X, and thus the d linear combinations Θ^T X carry all of the information that X has about y. The variance of X can be expressed as

Σ = ΘΘ^T + π²I_p = H(Θ^T Θ + π²I_d)H^T + π²Q_H,

where H = Θ(Θ^T Θ)^{−1/2} is a semi-orthogonal basis matrix for span(Θ). Since σ = Θ cov(ν, y) and cov(ν, y) has no zero elements, it follows that E_Σ(B) = span(Θ) = span(H), Σ_H = Θ^T Θ + π²I_d and Σ_H0 = π²I_{p−d}. We can now appeal to Theorems 1 and 2 to gain information about the asymptotic behavior of PLS under (5.2).
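This variance decomposition can be checked numerically. The following sketch uses an arbitrary, hypothetical Θ (none of these values come from the paper) and verifies that Σ = ΘΘ^T + π²I_p equals H(Θ^T Θ + π²I_d)H^T + π²Q_H with H = Θ(Θ^T Θ)^{−1/2}.

```python
import numpy as np

# Hypothetical instance of model (5.2): arbitrary Theta, pi^2 = 1.
rng = np.random.default_rng(1)
p, d, pi2 = 30, 2, 1.0
Theta = rng.normal(size=(p, d))

Sigma = Theta @ Theta.T + pi2 * np.eye(p)

# H = Theta (Theta^T Theta)^{-1/2}: a semi-orthogonal basis for span(Theta).
M = Theta.T @ Theta
evals, evecs = np.linalg.eigh(M)
M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
H = Theta @ M_inv_sqrt
assert np.allclose(H.T @ H, np.eye(d))     # semi-orthogonality

# Sigma = H (Theta^T Theta + pi^2 I_d) H^T + pi^2 Q_H.
Q_H = np.eye(p) - H @ H.T
Sigma_rebuilt = H @ (M + pi2 * np.eye(d)) @ H.T + pi2 * Q_H
assert np.allclose(Sigma, Sigma_rebuilt)
```

In this parameterization Σ_H = Θ^T Θ + π²I_d is the d × d block carrying the signal and Σ_H0 = π²I_{p−d} is the isotropic remainder, exactly as used in the rate arguments that follow.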
Since the eigenvalues of Σ_H0 are bounded, κ ≍ p. The signal in X is measured by

tr(Σ_H) = tr(Θ^T Θ) + π²d ≍ Σ_{i=1}^p ‖θ_i‖²,

where θ_i^T is the ith row of Θ. If the signal is sparse, so for example only d rows of Θ are nonzero, then tr(Σ_H) is bounded, η ≍ 1 and V = lim_{p→∞} Θ^T Θ + π²I_d > 0. On the other extreme, if the signal is abundant, so many rows of Θ are nonzero and tr(Σ_H) diverges, we can take η = tr(Θ^T Θ) and reasonably assume V = lim_{p→∞} Θ^T Θ/η > 0. For instance, in spectroscopy data it seems entirely plausible that notable signal comes from many wavelengths, not just a few.
It remains to address ρ. Since V > 0 with a sparse signal, and we assume V > 0 with an abundant signal, it follows from Proposition 1 that ρ ≍ 1 if and only if rank(G_∞) = d. To evaluate the rank of G_∞, we need a_H = V^{1/2} cov(ν, y), V and E_V(span(a_H)) = E_V(span(cov(ν, y))). Then, by Corollaries 1 and 2, rank(G_∞) = d if and only if V has distinct eigenvalues and cov(ν, y) has a nonzero projection onto every eigenspace of V. Although we might contrive cases where rank(V) < d or where rank(V) = d and cov(ν, y) is orthogonal to an eigenspace of V, those would seem to be unusual in practice, and consequently it may be reasonable to assume that rank(G_∞) = d, and thus that ρ ≍ 1.
With this background, we next turn to application of Theorems 1 and 2 with κ ≍ p and ρ ≍ 1. Under conclusion II of Theorem 1, if η ≍ p then D_N = O_p(n^{−1/2}) and we expect reasonable performance from PLS predictions. From the general conclusion of Theorem 2, D_N = O_p(√φ). If in addition η ≍ p, then again D_N = O_p(n^{−1/2}), and D_N = O_p(p^{1/4}/n^{1/2}) if η ≍ p^{1/2}. These rates suggest again that PLS predictions could be useful in high-dimensional regressions.
The predictor model employed by Chun and Keleş ([6], Assumption 1) in their treatment of PLS is the same as (5.2) with the added constraint that the columns of Θ are orthogonal with bounded norms that converge as sequences. As a result, Θ^T Θ is a convergent diagonal matrix, which effectively imposes sparsity and has several additional simplifying consequences:

1. The eigenvalues of Σ_H0 must be bounded away from 0 and ∞, which implies that κ ≍ p.
2. The eigenvalues of the now diagonal matrix V = lim_{p→∞} Θ^T Θ + π²I_d must be distinct ([6], Condition 1) and bounded away from 0 and ∞, so the signal is bounded and η ≍ 1.
3. Since cov(ν, y) has no zero elements, E_V(span(cov(ν, y))) = R^d, and thus ρ ≍ 1 by Corollaries 2 and 3. This means that ρ will not appear in the conclusions of Theorems 1 and 2.

Our results for the setting considered by Chun and Keleş can be found by setting φ = p/n and ρ = 1 in the main conclusion of Theorem 2, which gives D_N = O_p((p/n)^{1/2}). Since this requires p/n → 0, it agrees with the Chun–Keleş result. By asking that the eigenvalues of Σ be bounded, Chun and Keleş in effect assumed sparsity to motivate a sparse solution, and their requirement that the columns of Θ be orthogonal effectively forced ρ ≍ 1. In contrast, as seen in Theorems 1 and 2, PLS can in some settings achieve a convergence rate that is near √n.

5.3. Anisotropic predictor variation. Model (5.2) is restrictive because it postulates that the elements of X − μ_X − Θν are independent and identically distributed. In effect, all of the extrinsic anisotropic variation in X is due to its association with y. One extension of (5.2) allows for anisotropic variation in (X − μ_X − Θν), so its elements can be correlated:

(5.3) X = μ_X + Θν + Δ^{1/2}ω,

where Δ ∈ R^{p×p} is positive definite, the elements of ω are independent copies of a standard normal random variable, and all other quantities are as defined for (5.2), so again the elements of cov(ν, y) are all nonzero. Under this model in combination with (2.1), it can be verified that Σ = ΘΘ^T + Δ, σ = Θ cov(ν, y) and

E_Σ(B) = E_Σ(span(σ)) = E_Σ(span(Θ)) = E_Δ(span(Θ)).

Let u = dim(E_Σ(B)), let H ∈ R^{p×u} denote a semi-orthogonal basis matrix for E_Σ(B), and let (H, H_0) ∈ R^{p×p} denote an orthogonal matrix. Then for some positive definite matrices Ω ∈ R^{u×u} and Ω_0 ∈ R^{(p−u)×(p−u)}, we have Δ = HΩH^T + H_0Ω_0H_0^T and Θ = HU, where U ∈ R^{u×d} has rank d, Σ_H = UU^T + Ω, Σ_H0 = Ω_0 and, as before, Σ = HΣ_H H^T + H_0Σ_H0 H_0^T. We are now in a position to consider application of Theorems 1 and 2.
5.3.1. span(Θ) reduces Δ. If span(Θ) reduces Δ, then

E_Σ(B) = E_Δ(span(Θ)) = span(Θ),

u = d, U = (Θ^T Θ)^{1/2}, Σ_H = Θ^T Θ + Ω and Σ = H(Θ^T Θ + Ω)H^T + H_0Ω_0H_0^T. Except for Ω and Ω_0, the structure that follows from this setup is just like that associated with (5.2). In particular, if Δ has bounded eigenvalues, which may be a reasonable assumption when y accounts substantially for the extrinsic variation in X, then all of the essential asymptotic results of Section 5.2 hold.

5.3.2. span(Θ) does not reduce Δ. The situation becomes more complicated when span(Θ) does not reduce Δ. Suppose that the eigenvalues of Δ are bounded and that η is unbounded. Then, as in previous cases, κ ≍ p. But, since the eigenvalues of Ω are bounded, lim_{p→∞} Σ_H/η = lim_{p→∞} UU^T/η must be singular. This means that ρ is unbounded, and so it may still have an important impact on the conclusions of Theorems 1 and 2. On the other hand, if the eigenvalues of Ω_0 are bounded but the eigenvalues of Ω are unbounded, then we may still have κ ≍ p and η ≍ p. Going further, if ρ is bounded then we will again have D_N = O_p(1/√n).

6. Simulations and data analysis.

6.1. Simulations. In this section, we give simulation results in support of our asymptotic conclusions. We use the isotropic model (5.2) and compound symmetry (4.7) as the basis for our simulation models.

6.1.1. Isotropic model (5.2). Our simulations for the isotropic model were all conducted with μ_X = 0, d = 2, π² = 1 and (y, ν^T) ∼ N_3(0, U), where the elements of U were U_11 = 4, U_12 = U_13 = 0.8, U_22 = U_33 = 1 and U_23 = 0. The columns of Θ were constructed to be orthogonal, with the diagonal elements diag(Θ^T Θ) = (t_1(p), t_2(p)) of Θ^T Θ being increasing functions of p, and always V > 0. If V has distinct eigenvalues, then we know from the discussion of Section 5.2 that ρ ≍ 1. To provide more details on ρ, we next give tr(C^{−1}). Let R_1(p) = (t_2(p) + π²)/(t_1(p) + π²), R_2(p) = t_2(p)/t_1(p) and cov(y, ν) = (v_1, v_2). Then

tr(C^{−1}) = 2 (v_1² + v_2²R_1R_2)(v_1² + v_2²R_1³R_2) / {v_1²v_2²R_1(R_1 − 1)²R_2}.

Both v_1 and v_2 are nonzero and do not depend on n or p. Consequently, the asymptotic behavior of tr(C^{−1}) depends only on R_1 and R_2, which both converge to finite nonzero constants by construction. However, if R_1 → 1 then tr(C^{−1}) will diverge, which may have a serious impact on the rate of convergence.
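The displayed formula is easy to explore numerically. In the sketch below, tr_C_inv evaluates the formula exactly as given; the inputs are chosen to mimic the two simulation designs (v_1 = v_2 = 0.8 as in cov(y, ν) above, R_1 → 1/4 for the Figure 1 design, R_1 near 1 for the Figure 2 design), but the specific values are only illustrative.

```python
# tr(C^{-1}) as displayed above, as a function of v1, v2, R1, R2.
def tr_C_inv(v1, v2, R1, R2):
    num = (v1**2 + v2**2 * R1 * R2) * (v1**2 + v2**2 * R1**3 * R2)
    den = v1**2 * v2**2 * R1 * (R1 - 1)**2 * R2
    return 2 * num / den

# Distinct eigenvalues (Figure 1 design): R1 -> 1/4, tr(C^{-1}) stays bounded.
stable = tr_C_inv(0.8, 0.8, R1=0.25, R2=0.25)
assert stable < 100

# Equal eigenvalues (Figure 2 design): tr(C^{-1}) blows up as R1 -> 1.
assert tr_C_inv(0.8, 0.8, R1=0.99, R2=1.0) > tr_C_inv(0.8, 0.8, R1=0.9, R2=1.0) > stable
```

The divergence as R_1 → 1 is exactly the mechanism that separates the behavior of Figures 1 and 2 below.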
FIG. 1. Simulation results from the isotropic model (5.2): listing from the top at log_2(p) = 6, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.

Figure 1 shows results from data generated under this setup with diag(Θ^T Θ) = (4p^a, p^a), 0 < a ≤ 1, and diag(Θ^T Θ) = (4c, c), where c is constant. Consequently, for each Θ we can take the corresponding η = p^a, 0 ≤ a ≤ 1. It follows from the discussion of Section 5.2 and from the above calculations that ρ ≍ 1. Since κ ≍ p, the asymptotic behavior of the simulation is governed by Theorem 2.I, giving D_N = O_p(√φ) with φ = p/(nη). A data set of size n = p/2 was obtained by using n independently generated observations on (y, ν^T) and ω in model (5.2) to obtain n independent observations on X. Then n additional observations on X were generated and D²_N was computed for each and averaged. The vertical axis D̂² of Figure 1 is the average over 100 replications of this whole process. Reading from top to bottom at log_2(p) = 6, the lines in Figure 1 correspond to η equal to a constant, p^{1/2}, p^{3/4} and p. Since n = p/2, we have φ = 2/η. Thus, in reference to Figure 1, our theoretical results predict convergence of the curves for η equal to p^{1/2}, p^{3/4} and p, but no convergence for η equal to a constant. The curves shown in Figure 1 seem to support this prediction, with the best results being achieved for η = p, followed by η = p^{3/4} and η = p^{1/2}.
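A small-scale version of this protocol can be sketched as follows. We use the Krylov form of the PLS coefficient, β_PLS = G(G^T ΣG)^{−1}G^T σ with G = (σ, Σσ, . . . , Σ^{d−1}σ), matching the construction of G in Section 4.6.3, and replace population quantities by sample moments for the fitted version. The sizes and the particular Θ below are illustrative, not those used for Figure 1.

```python
import numpy as np

# Krylov form of the PLS estimator: beta = G (G^T S G)^{-1} G^T s,
# with G = (s, S s, ..., S^{d-1} s).
def pls_beta(S, s, d):
    G = np.column_stack([np.linalg.matrix_power(S, j) @ s for j in range(d)])
    return G @ np.linalg.solve(G.T @ S @ G, G.T @ s)

rng = np.random.default_rng(2)
p, d, pi2 = 32, 2, 1.0
t1, t2 = 4 * p, p                          # diag(Theta^T Theta) = (4p^a, p^a), a = 1
Theta = np.zeros((p, d))
Theta[: p // 2, 0] = np.sqrt(t1 / (p // 2))
Theta[p // 2 :, 1] = np.sqrt(t2 / (p - p // 2))  # orthogonal columns
v = np.array([0.8, 0.8])                   # cov(y, nu), as in U above

Sigma = Theta @ Theta.T + pi2 * np.eye(p)
sigma = Theta @ v                          # cov(X, y)
beta = np.linalg.solve(Sigma, sigma)
# Population quantities recover beta exactly (Fisher consistency).
assert np.allclose(pls_beta(Sigma, sigma, d), beta)

# One sample replicate: n = p/2 observations, beta_hat from (S_hat, s_hat).
n = p // 2
nu = rng.normal(size=(n, d))
y = nu @ v + rng.normal(scale=np.sqrt(4 - v @ v), size=n)   # var(y) = 4
X = nu @ Theta.T + rng.normal(scale=np.sqrt(pi2), size=(n, p))
Xc, yc = X - X.mean(0), y - y.mean()
S_hat, s_hat = Xc.T @ Xc / n, Xc.T @ yc / n
beta_hat = pls_beta(S_hat, s_hat, d)
D2 = (beta_hat - beta) @ Sigma @ (beta_hat - beta)   # squared Sigma-norm error
assert np.isfinite(D2) and D2 >= 0
```

The population check confirms that the Krylov construction is exact under (5.2); the sample quantity D2 plays the role of one replicate contributing to D̂² in Figures 1 and 2.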
Figure 2 was constructed like Figure 1, except diag(Θ^T Θ) = (p^a, p^a), 0 < a ≤ 1, and diag(Θ^T Θ) = (c, c), where c is constant. This seemingly small change has the potential to have a big impact on the results because now the eigenvalues of V are no longer distinct and R_1 = 1, with the consequence that ρ may slow the rate of convergence as indicated in Theorem 2. Indeed, the results in Figure 2 seem uniformly worse than those in Figure 1. While it seems clear that the curve for η = p is convergent, it is not clear whether the curves for η = p^{1/2} or η = p^{3/4} are.

FIG. 2. Simulation results from the isotropic model (5.2): listing from the top at log_2(p) = 12, the lines correspond to η equal to a constant, p^{1/2}, p^{3/4} and p.

The influence of U on the results of this example is controlled largely by the correlations c_yν = cov(y, ν)/var^{1/2}(y) between y and the elements of ν. The condition var(ν) = I_2 was imposed without loss of generality since we can always achieve it by rescaling. In Figures 1 and 2, c_yν = (0.4, 0.4). If we had set the correlations to be larger, say c_yν = (0.8, 0.8), D̂² would have decreased faster as a function of p. If we had set the correlations to be weaker, say c_yν = (0.2, 0.2), D̂² would have decreased more slowly. In either case, however, the general conclusions from Figures 1 and 2 would still be discernible. We selected correlations of 0.4 because we felt that they represent modest correlations that illustrate the theory nicely without giving an optimistic impression, as might happen if we had used large correlations.

6.1.2. Compound symmetry (4.7). For this simulation, we used model (2.1) with the compound symmetry structure (4.7) for Σ, constructed with σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p, and in each case ψ = 0.8. With σ, ψ and p set, we generated a single observation on X ∼ N_p(0, Σ) and then generated the corresponding y according to model (2.1) with error standard deviation τ = 1. This process was repeated n = p/2 times to get β̂. Then n additional observations on X were generated and D²_N was computed for each and averaged. The vertical axis D̂² of Figure 3 is the average over 100 replications of this whole process.

In this simulation, we have κ ≍ p, η ≍ κ and tr(Σ^h_H0) ≍ κ. It follows that Theorem 2.II is applicable for σ = 1_p + c_p and σ = 1_p + 0.5c_p giving, from the discussion in Section 4.6, D_N = O_p(p²/√n). Since we used n = p/2, we do not expect convergence, which seems consistent with the results shown in Figure 3. Theorem 2.III applies for σ = 1_p since then d = 1. In that case, D_N = O_p(p^{−1/2}), which again seems consistent with the results of Figure 3.

FIG. 3. Simulation results using compound symmetry (4.7). Reading from top to bottom, the lines correspond to σ = 1_p + c_p, σ = 1_p + 0.5c_p and σ = 1_p.

6.2. Tetracycline data. Goicoechea and Olivieri [20] used PLS to develop a predictor of tetracycline concentration in human blood. The 50 training samples were constructed by spiking blank sera with various amounts of tetracycline in the range 0–4 μg mL^{−1}. A validation set of 57 samples was constructed in the same way. For each sample, the values of the predictors were determined by measuring fluorescence intensity at p = 101 equally spaced points in the range 450–550 nm. The authors determined using leave-one-out cross-validation that the best predictions of the training data were obtained with d = 4 linear combinations of the original 101 predictors.

We use these data to illustrate the behavior of PLS predictions in Chemometrics as the number of predictors increases. We used PLS with d = 4 to predict the validation data based on p equally spaced spectra, with p ranging between 10 and 101. The root mean squared error (MSE) is shown in Figure 4 for five values of p. PLS fits were obtained using library(pls) in R. We see a relatively steep drop in MSE for small p, say less than 30, and a slow but steady decrease in MSE thereafter. Since we are dealing with actual prediction, the root-MSE will not converge to 0 with increasing p as it seems to do in some of the simulations.

FIG. 4. Tetracycline data: Validation MSE from 10, 20, 33, 50 and 101 equally spaced spectra.

7. Discussion. In this section, we give results on the convergence of β̂ and describe our rationale for some of the restrictions that we imposed.

7.1. Convergence of β̂. The focus of this article has been on the rate of convergence of predictions as measured by D_N. In this section, we consider for completeness the rate of convergence of β̂ in the Σ inner product. Let

V_{n,p} = var^{1/2}(D_N | β̂) = {(β̂ − β)^T Σ(β̂ − β)}^{1/2}.

Then, as shown in Appendix Section S8, V_{n,p} and D_N have the same order as n, p → ∞. To be clear, we state this in the following theorem.

THEOREM 3. As n, p → ∞,

I. Under the conditions of Theorem 1,

V_{n,p} = O_p(ρ/√n) + O_p{ρ^{1/2} n^{−1/2} (κ/η)^d}.

II. Under the conditions of Theorem 2,

V_{n,p} = O_p(ρ/√n) + O_p(√(ρφ)).

It follows from this theorem that the special cases of Theorems 1 and 2 and the subsequent discussions apply to V_{n,p} as well. In particular, estimative convergence as measured in the Σ inner product will be at or near the root-n rate under the same conditions as predictive convergence.
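In code, the Σ inner-product norm behind V_{n,p} is a one-line helper; the 2 × 2 Σ and the coefficient vectors below are hypothetical placeholders, chosen only so the arithmetic can be checked by hand.

```python
import numpy as np

# Sigma-norm of an estimation error: {(beta_hat - beta)^T Sigma (beta_hat - beta)}^{1/2}.
def sigma_norm(diff, Sigma):
    return float(np.sqrt(diff @ Sigma @ diff))

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical Sigma > 0
beta = np.array([1.0, -1.0])
beta_hat = np.array([1.1, -0.8])

V = sigma_norm(beta_hat - beta, Sigma)
# By hand: diff = (0.1, 0.2), diff^T Sigma diff = 0.03 + 0.05 = 0.08.
assert np.isclose(V ** 2, 0.08)
```

Because Σ > 0, this quantity is a genuine norm of β̂ − β, which is why Theorem 3 can transfer the predictive rates to estimative ones.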
7.2. Multivariate Y. Recall that we confined our study to regressions with a univariate response. An extension to multivariate Y ∈ R^r seems elusive because there are numerous PLS algorithms for multivariate Y and they can all produce different results. The two most common algorithms, NIPALS and SIMPLS, are known to produce different results when r > 1 but give the same results when r = 1 [12, 15, 34]. The multivariate version of the Krylov construction Ĝ provides another PLS algorithm. Some prefer to standardize the elements of Y to have sample variance equal to 1, while others do not standardize. Some PLS algorithms reduce Y and X simultaneously, while others reduce X alone. These various algorithms can produce different results when r > 1 but produce the same or equivalent results when r = 1. It seems to us that any extension to allow for a multivariate response would first need to address this multiplicity of methods, which is outside the scope of this report.
7.3. Choice of the dimension, d. We assumed throughout this article that the dimension d of the envelope is effectively fixed and known, as did Chun and Keleş [6]. In practice, d will not normally be known, so a data-dependent estimate d_{n,p} will often be used in its stead. If d_{n,p} > d, the (nonasymptotic) results of a PLS analysis will still be based on a true model, albeit one with more variation than necessary. If d_{n,p} < d, then PLS will incur some bias in estimation. The bias can be sizable if d_{n,p} is substantially less than d, an event that we judge to be unlikely because such far values of d_{n,p} should be ruled out by standard PLS methodology. Extensions of the asymptotic results of this article that allow for using d_{n,p} instead of d will depend on the rate at which d_{n,p} converges to d. If that rate is sufficiently fast, then the results of this article will still hold. Otherwise, the rates presented here will be optimistic. We chose to assume d known so that the results might reflect the core behavior of PLS while keeping an important link with the work of Chun and Keleş [6]. This view avoided the task of studying selection methods, which is outside the scope of this article but still an important next step. Eck and Cook [17] proposed an estimator of β as a weighted average of the envelope estimators over the possible dimensions of the envelope, the weights being functions of the Bayes information criterion for each envelope model. This weighted estimator avoids the need to estimate the dimension and might be adaptable for asymptotic studies of PLS.

Another desirable extension is to allow d → ∞ as p → ∞. In such a case, we expect PLS to still yield consistent results provided d grows at a rate that is sufficiently slow relative to p.
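One data-driven selection scheme of the kind alluded to above is hold-out validation over candidate dimensions. The sketch below is our own illustration, not a method from this article: it fits the Krylov form of PLS for each candidate d on a training split and picks the dimension minimizing prediction error on a validation split. The data-generating choices are hypothetical.

```python
import numpy as np

# Krylov form of the PLS estimator, as elsewhere in our sketches.
def pls_beta(S, s, d):
    G = np.column_stack([np.linalg.matrix_power(S, j) @ s for j in range(d)])
    return G @ np.linalg.solve(G.T @ S @ G, G.T @ s)

# Choose d by hold-out prediction error over candidates 1..d_max.
def select_d(X, y, d_max=5, frac=0.5):
    m = int(len(y) * frac)
    Xt, yt, Xv, yv = X[:m], y[:m], X[m:], y[m:]
    Xc, yc = Xt - Xt.mean(0), yt - yt.mean()
    S, s = Xc.T @ Xc / m, Xc.T @ yc / m
    errs = []
    for d in range(1, d_max + 1):
        b = pls_beta(S, s, d)
        pred = (Xv - Xt.mean(0)) @ b + yt.mean()
        errs.append(np.mean((yv - pred) ** 2))
    return int(np.argmin(errs)) + 1

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)     # one relevant direction
d_hat = select_d(X, y)
assert 1 <= d_hat <= 5
```

If the selected d_{n,p} concentrates on the true d quickly enough, the rates of Theorems 1–3 would carry over, as discussed above.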
7.4. Importance of normality. As mentioned previously, simulations and our experience suggest that normality is not an essential assumption in practice, particularly if a holdout sample is used to assess performance of the final predictive model. Theoretically, we expect that our asymptotic results are indicative for sub-normal variables, but may not be so for sur-normals, depending on the tail behavior. We relied extensively on the behavior of higher-order moments of normals. Extending these results to broader classes of distributions would require bounds that would likely be quite loose for normals. Assuming normality allowed us to get relatively sharp bounds, which we feel is useful for a first look at PLS asymptotics. The same normality assumption was used by Naik and Tsai [30] in their asymptotic study of the fixed p case and by Chun and Keleş [6] for the case in which p and n both diverge.

38
7.5. Impact of the results. Our asymptotic results are intended to provide a 38
39
qualitative understanding of various plausible PLS scenarios. For instance, if it is 39
40
thought that nearly all predictors contribute information about the response, so η ≍ 40
41
p, then we may have DN = Op (n−1/2 ) without regard to the relationship between 41
42
n and p. On the other extreme, if the regression is viewed as likely sparse, so η ≍ 1, 42
43
then we may have DN = Op ((p/n)1/2 ) and we now need n to be large relative 43
AOS imspdf v.2018/02/08 Prn:2018/02/28; 14:33 F:aos1681.tex; (G) p. 23

PLS PREDICTION 23

1 to p. Increasing p in the context of Chemometrics applications was illustrated in 1


2 the example of Section 6.2 where we observed a steady decrease in mean squared 2
3 error, suggesting that the regression is abundant so η ≍ p. 3
4 Our results also serve to place the findings by Chun and Keleş [6] in a broader 4
5 context by demonstrating that it is possible in some scenarios for PLS to have 5
6 root-n or near root-n convergence rates as n and p diverge. 6
7 7
Acknowledgements. The authors thank the Associate Editor and referees for helpful comments on an earlier version of this article, and are grateful to H. C. Goicoechea and A. C. Olivieri for allowing the use of their Tetracycline data.

SUPPLEMENTARY MATERIAL

Supplement to "Partial least squares prediction in high-dimensional regression" (DOI: 10.1214/18-AOS1681SUPP; .pdf).

REFERENCES

[1] ABUDU, S., KING, P. and PAGANO, T. C. (2010). Application of partial least-squares regression in seasonal streamflow forecasting. J. Hydrol. Eng. 15 612–623.
[2] BIANCOLILLO, A., BUCCI, R., MAGRÌ, A. L., MAGRÌ, A. D. and MARINI, F. (2014). Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication. Anal. Chim. Acta 820 23–31.
[3] BOULESTEIX, A.-L. and STRIMMER, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform. 8 32–44.
[4] BRO, R. and ELDÉN, L. (2009). PLS works. J. Chemom. 23 69–71.
[5] CASTEJÒN, D., GARCÌA-SEGURA, J. M., ESCUDERO, R., HERRERA, A. and CAMBERO, M. I. (2015). Metabolomics of meat exudate: Its potential to evaluate beef meat conservation and aging. Anal. Chim. Acta 901 1–11.
[6] CHUN, H. and KELEŞ, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 3–25. MR2751241
[7] COOK, R. D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the Section on Engineering and Physical Sciences 18–25. American Statistical Association, Alexandria, VA.
[8] COOK, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York. MR1645673
[9] COOK, R. D. and FORZANI, L. (2017). Big data and partial least squares prediction. Canad. J. Statist. To appear.
[10] COOK, R. D. and FORZANI, L. (2018). Supplement to "Partial least squares prediction in high-dimensional regression." DOI:10.1214/18-AOS1681SUPP.
[11] COOK, R. D., FORZANI, L. and ROTHMAN, A. J. (2013). Prediction in abundant high-dimensional linear regression. Electron. J. Stat. 7 3059–3088. MR3151762
[12] COOK, R. D., HELLAND, I. S. and SU, Z. (2013). Envelopes and partial least squares regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 851–877. MR3124794
[13] COOK, R. D., LI, B. and CHIAROMONTE, F. (2007). Dimension reduction in regression without matrix inversion. Biometrika 94 569–584. MR2410009
[14] COOK, R. D., LI, B. and CHIAROMONTE, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression. Statist. Sinica 20 927–960. MR2729839
[15] DE JONG, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18 251–263. DOI:10.1016/0169-7439(93)85002-X.
[16] DELAIGLE, A. and HALL, P. (2012). Methodology and theory for partial least squares applied to functional data. Ann. Statist. 40 322–352. MR3014309
[17] ECK, D. J. and COOK, R. D. (2017). Weighted envelope estimation to handle variability in model selection. Biometrika 104 743–749. MR3694595
[18] FRANK, I. E. and FRIEDMAN, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35 109–135. DOI:10.1080/00401706.1993.10485033.
[19] GARTHWAITE, P. H. (1994). An interpretation of partial least squares. J. Amer. Statist. Assoc. 89 122–127. MR1266290
[20] GOICOECHEA, H. C. and OLIVIERI, A. C. (1999). Enhanced synchronous spectrofluorometric determination of tetracycline in blood serum by chemometric analysis. Comparison of partial least-squares and hybrid linear analysis calibrations. Anal. Chem. 71 4361–4368.
[21] HELLAND, I. S. (1990). Partial least squares regression and statistical models. Scand. J. Stat. 17 97–114. MR1085924
[22] HELLAND, I. S. (1992). Maximum likelihood regression on relevant components. J. Roy. Statist. Soc. Ser. B 54 637–647. MR1160488
15 [23] H ELLAND , I. S. (2001). Some theoretical aspects of partial least squares regression. Chemom. 15
16 Intell. Lab. Syst. 58 97–107. 16 <author>
17 [24] K ANDEL , T. A., G ISLUM , R., J ØRGENSEN , U. and L ÆRKE , P. E. (2013). Prediction of biogas 17
18
yield and its kinetics in reed canary grass using near infrared reflectance spectroscopy and 18
chemometrics. Bioresour. Technol. 146 282–287. <author>
19 19
[25] KOCH , C., P OSCH , A. E., G OICOECHEA , H. C., H ERWIG , C. and L ENDLA , B. (2013). Multi-
20 analyte quantification in bioprocesses by Fourier-transform-infrared spectroscopy by par- 20
21 tial least squares regression and multivariate curve resolution. Anal. Chim. Acta 807 103– 21
22 110. 22 <author>
23
[26] L I , W., C HENG , Z., WANG , Y. and Q U , H. (2013). Quality control of Lonicerae Japonicae 23
Flos using near infrared spectroscopy and chemometrics. J. Pharm. Biomed. Anal. 72
24 24
33–39. <author>
25 [27] L OBAUGH , N. J., W EST, R. and M C I NTOSH , A. R. (2001). Spatiotemporal analysis of exper- 25
26 imental differences in event-related potential data with partial least squares. Psychophys- 26
27 iology 38 517–530. 27 <author>
28
[28] M ARTENS , H. and N ÆS , T. (1992). Multivariate Calibration. Wiley, Chichester. MR1029523 28
<mr>
[29] N ÆS , T. and H ELLAND , I. S. (1993). Relevant components in regression. Scand. J. Stat. 20
29 29
239–250. MR1241390 <mr>
30 [30] NAIK , P. and T SAI , C.-L. (2000). Partial least squares estimator for single-index models. J. R. 30
31 Stat. Soc. Ser. B. Stat. Methodol. 62 763–771. MR1796290 31 <mr>
32 [31] N GUYEN , D. V. and ROCKE , D. M. (2002). Tumor classification by partial least squares using 32
33
microarray gene expression data. Bioinformatics 18 39–50. 33
<author>
[32] N GUYEN , D. V. and ROCKE , D. M. (2004). On partial least squares dimension reduction for
34 34
microarray-based classification: A simulation study. Comput. Statist. Data Anal. 46 407–
35 425. MR2067030 35 <mr>
36 [33] S CHWARTZ , R. W., K EMBHAVI , A., H ARWOOD , D. and DAVIS , L. S. (2009). Human detec- 36
37 tion using partial least squares analysis. In 2009 IEEE 12th International Conference on 37
38
Computer Vision 24–31. 38
<author>
[34] TER B RAAK , C. J. F. and DE J ONG , S. (1998). The objective function of partial least squares
39 39
regression. J. Chemom. 12 41–54. <author>
40 [35] W OLD , S., M ARTENS , H. and W OLD , H. (1983). The multivariate calibration problem in 40
41 chemistry solved by the PLS method. In Proceedings of the Conference on Matrix Pencils 41
42 (A. Ruhe and B. Kågström, eds.). Lecture Notes in Mathematics 973 286–293. Springer, 42
43
Heidelberg. 43
<author>
[36] WORSLEY, K. J. (1997). An overview and some new developments in the statistical analysis of PET and fMRI data. Hum. Brain Mapp. 5 254–258.

SCHOOL OF STATISTICS
UNIVERSITY OF MINNESOTA
313 FORD HALL
224 CHURCH ST. SE
MINNEAPOLIS, MINNESOTA 55455
USA
E-MAIL: dennis@stat.umn.edu

FACULTAD DE INGENIERÍA QUÍMICA, UNL
SANTIAGO DEL ESTERO 2819
SANTA FE
ARGENTINA
E-MAIL: liliana.forzani@gmail.com