
Chapter 12

Bracketing methods
SECTION 1 explains how the concept of bracketing may be thought of as a scheme to partition an index set of functions into finitely many regions, on each of which there exists a single approximating function with a guaranteed error bound.
SECTION 2 describes a refined technique of truncation and successive approximation, for the simplest case of an empirical process built from independent variables. It derives an L_1 maximal inequality as a recursive bound relating the errors of successive approximations.
SECTION 3 presents some examples to illustrate the uses of the results from Section 2.
SECTION 4 generalizes the method from Section 2 to an abstract setting that includes several sorts of dependence between the summands. The arguments are stated in an abstract form that covers both independent and dependent summands. All details of possible dependence are hidden in two Assumptions, one concerning the norm used to express approximations to classes of functions, the other specifying the behaviour of maxima of finite collections of centered sums.
SECTION 5 studies the special case of phi-mixing sequences.
SECTION 6 studies the special case of absolutely regular (beta-mixing) sequences.
SECTION 7 points out the difficulties involved in the study of strong-mixing sequences.
SECTION 8 develops a maximal inequality for tail probabilities, by means of a modification of the methods from Section 4.
1. What is bracketing?
Bracketing arguments have long been used in empirical process theory. A very simple form of bracketing is often used in textbooks to prove the Glivenko-Cantelli theorem, the most basic example of a uniform law of large numbers.
Write F_n for the empirical distribution function of a sample ξ_1, ..., ξ_n from a distribution function F on the real line. That is, F_n(t) denotes the proportion of the observations less than or equal to t,
    F_n(t) = (1/n) Σ_{i≤n} {ξ_i ≤ t}   for each t in R.
The Glivenko-Cantelli theorem asserts that sup_t |F_n(t) − F(t)| converges to zero almost surely.
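The convergence can be watched numerically. A minimal simulation sketch, not from the text (NumPy assumed; a fine grid stands in for the supremum over t):

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n, grid):
    """Sup over a grid of |F_n(t) - F(t)| for a Uniform(0,1) sample,
    where F(t) = t on [0, 1]."""
    xs = rng.uniform(size=n)
    # F_n(t) = proportion of observations <= t, evaluated at every grid point
    Fn = (xs[None, :] <= grid[:, None]).mean(axis=1)
    return np.max(np.abs(Fn - grid))

grid = np.linspace(0.0, 1.0, 1001)
devs = [sup_deviation(n, grid) for n in (100, 1000, 10000)]
# the deviations shrink, roughly at the rate 1/sqrt(n)
print(devs)
```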
The bracketing argument controls the contributions from an interval t_1 ≤ t ≤ t_2 by means of bounds that hold throughout the interval. For such t we have
    F_n(t_1) − F(t_2) ≤ F_n(t) − F(t) ≤ F_n(t_2) − F(t_1).
The two bounds converge almost surely to F(t_1) − F(t_2) and F(t_2) − F(t_1). If t_2 and t_1 are close enough together, meaning that the probability measure
Asymptopia: 23 March 2001 © David Pollard
of the interval (t_1, t_2] is small enough, then all the F_n(t) − F(t) values, for t_1 ≤ t ≤ t_2, get squeezed close to the origin. If we cover the whole real line by a union of finitely many such intervals, we are able to deduce that sup_t |F_n(t) − F(t)| is eventually small.
There is a more fruitful way to think of the increment F(t_2) − F(t_1). If P is the probability measure corresponding to the distribution function F, the increment equals the L_1(P) distance between the two indicator functions of (−∞, t_1] and (−∞, t_2]. The concept of bracketing then has an obvious extension to arbitrary classes of (measurable) functions on a measurable space (X, A).
bracket.def1 <1> Definition. A pair of P-integrable functions ℓ ≤ u on X defines a bracket,
    [ℓ, u] := {g : ℓ(x) ≤ g(x) ≤ u(x) for all x}.
For 1 ≤ q ≤ ∞, the bracketing number N^(q)_[](δ, P) for a subclass of functions F ⊆ L^q(P) is defined as the smallest value of N for which there exist brackets [ℓ_i, u_i] with P(u_i − ℓ_i)^q ≤ δ^q for i = 1, ..., N and F ⊆ ∪_{i≤N} [ℓ_i, u_i].
[Need bracketing functions in L^q also?]
The Definition allows the possibility that the bracketing numbers might be infinite, but they are useful only when finite. The quantity N^(q)_[](δ, P) is also called the metric entropy with bracketing.
Uniform approximations correspond to bracketing numbers for q = ∞. For proofs of uniform laws of large numbers, the bracketing numbers for q = 1 are more natural. As you will see later in this Chapter, q = 2 is better suited for approximations related to functional central limit theorems. In the earlier empirical process literature for central limit theorems for bounded classes of functions, bracketing numbers with q = 1 were often used; but extensions to unbounded classes do seem to require q = 2.
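For a concrete instance of the q = 1 case: under P = Uniform(0, 1), the class of indicators of half-lines (−∞, t] admits roughly 1/δ brackets of L_1(P)-size δ, built from the quantiles of P. A small sketch, with helper names of my own rather than the text's:

```python
import math

def uniform_brackets(delta):
    """Bracket endpoints t_0 < t_1 < ... < t_N for {1(-inf, t]} under
    P = Uniform(0, 1).  Every indicator with t in (t_{i-1}, t_i] satisfies
        1(-inf, t_{i-1}] <= 1(-inf, t] <= 1(-inf, t_i],
    and P(u_i - l_i) = t_i - t_{i-1} <= delta, so N^(1)_[] <= ceil(1/delta)."""
    n = math.ceil(1.0 / delta)
    return [i / n for i in range(n + 1)]

ts = uniform_brackets(0.1)
gaps = [b - a for a, b in zip(ts, ts[1:])]
print(len(gaps), max(gaps))  # 10 brackets, each of P-mass about 0.1
```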
For classes of functions the role of the empirical distribution function is taken over by the empirical measure P_n, which puts mass 1/n at each of the n observations,
    P_n g = (1/n) Σ_{i≤n} g(ξ_i).
On the real line, P_n(−∞, t] = F_n(t).
uslln <2> Example. Suppose the {ξ_i} are sampled independently from a fixed probability distribution P, and suppose F is a class of measurable functions with N^(1)_[](δ, F, P) finite for each δ > 0. Deduce a uniform strong law of large numbers, sup_F |P_n f − Pf| → 0 almost surely.
[Complete proof]
lipschitz <3> Example. Bracketing arguments often appear disguised as smoothness assumptions in the statistics literature. Typically F is a parametric class of functions {f_t : t ∈ T} indexed by a bounded subset of a Euclidean space R^k. Suppose the functions satisfy a Lipschitz condition, |f_t(x) − f_s(x)| ≤ M(x)|t − s|^β, for some fixed β > 0 and some fixed M in L^q(P). Write C for the L^q(P) norm of M.
For some constant C_0 there exists a set of N ≤ C_0(1/δ^{1/β})^k points ...
[Get Euclidean-style bracketing numbers]
More delicate bracketing arguments will be easier to describe if we recast the definition into a slightly different form. Suppose F ⊆ ∪_{i≤N} [ℓ_i, u_i], a covering for a finite δ-bracketing. Consider an f in F. As a tie-breaking rule, choose the smallest k for which f ∈ [ℓ_k, u_k]. Write A_δ f for (ℓ_k + u_k)/2 and B_δ f for u_k − ℓ_k. Then |f − A_δ f| ≤ B_δ f and ‖B_δ f‖ ≤ δ, for whatever norm ‖·‖ is used to define the bracketing. Even if F is infinite, there are only finitely many different approximating functions A_δ f and bounding functions B_δ f. Indeed, the bracketing serves to partition F into finitely many regions, π = {F_1, ..., F_N}: if f_1 and f_2 belong to the same F_k then they share the same approximating and bounding functions. Put another way, as maps from F into finite collections of functions, both A_δ and B_δ take constant values on each member of the partition π; they are simple functions, in their dependence on f. In general, let us call a function on F π-simple if it takes a constant value on each F_i in π.
The definition of bracketing, for an arbitrary norm on classes of functions, can be thought of as a scheme of approximation via simple functions.
bracket.defn <4> Definition. For given δ > 0, say that a finite partition π of F into disjoint regions supports a δ-bracketing (for the norm ‖·‖) if there exist functions A_δ(x, f) and B_δ(x, f) such that:
(i) |f(x) − A_δ(x, f)| ≤ B_δ(x, f) for all x;
(ii) ‖B_δ(·, f)‖ ≤ δ for every f;
(iii) each A_δ(x, ·) and B_δ(x, ·) is π-simple as a function on F.
The bracketing number N(δ) is defined as the smallest number of regions needed for a partition that supports a δ-bracketing.
The bracketing function N(δ) is decreasing in δ. Again, it is of use only when it is finite-valued. Typically N(δ) tends to infinity as δ tends to zero. For the finer applications of bracketing arguments, we will derive bounds expressed as integrals in δ involving the bracketing numbers. The bounds will be useful only when the integrals converge; the convergence corresponds to assumptions about the rate of increase of N(δ) as δ tends to zero.
The application of bracketing to prove uniform strong laws of large numbers is crude. More refined arguments are needed to get sharper bounds corresponding to the central limit theorem level of asymptotics. Traditionally these bounds have been expressed in terms of the empirical process ν_n = √n(P_n − P), by which sums are standardized in a way appropriate to central limit theorems,
    ν_n g = n^{−1/2} Σ_{i≤n} ( g(ξ_i) − Pg(ξ_i) ).
The results in this Chapter are stated for empirical processes constructed from random elements ξ_1, ..., ξ_n taking values in a space X, indexed by classes of integrable functions.
The general problem is to develop uniform approximations to the empirical process {ν_n f : f ∈ F} indexed by a class of functions F on X. In my opinion, the most useful general solutions give probabilistic bounds on sup_F |ν_n f − ν_n(A_δ f)|, such as an inequality for tail probabilities or an L^q bound. The behaviour of the process indexed by F can then be related to the behaviour of the finite-dimensional process defined by the finitely many approximating functions A_δ f.
In this Chapter, all the arguments for the various probabilistic bounds will make use of a recursive approximation scheme known as chaining. It is not hard to understand why useful bounds for the empirical process do not usually follow by means of a single bracketing approximation. The bracketing bound destroys the centering. For example, with the upper bound we have
    ν_n(f − A_δ f) = √n P_n(f − A_δ f) − √n P(f − A_δ f)
                   ≤ √n P_n(B_δ f) + √n P(B_δ f)
                   = ν_n(B_δ f) + 2√n P(B_δ f).
The lower bound reverses the sign on the 2√n P(B_δ f). If P(B_δ f) were small compared with n^{−1/2}, the change in centering would not be important. Unfortunately that level of approximation would usually require a decreasing value of δ as n gets larger; we would lose the benefits of an approximation by means of a fixed, finite collection of functions.
The solution to the problem of the recentering is to derive the approximation in several steps. Suppose A_1 and B_1 refer to the approximations and bounds for a δ_1-bracketing, and A_2 and B_2 refer to the bracketing for a smaller value δ_2. Apply the empirical process to both sides of the equality f − A_1 f = (f − A_2 f) + (A_2 f − A_1 f), then take a supremum over F to derive the inequality
recursion1 <5>
    sup_F |ν_n(f − A_1 f)| ≤ sup_F |ν_n(f − A_2 f)| + max_F |ν_n(A_2 f − A_1 f)|.
The two suprema bound the errors of approximation for the two bracketings. The maximum runs over at most N(δ_1)N(δ_2) pairs of differences between approximating functions; I write it as a maximum to remind you that it involves only finitely many differences, even for an infinite F. If we can bound probabilistically the contribution from that last term then we arrive at a recursive inequality relating the errors of the two bracketing approximations.
Repeated application of the same idea would give a bound for the crude bracketing approximation as a sum of a bound for a finer approximation plus a sum of terms coming from the maxima over differences of approximating functions. It remains only to bound the contributions from those maxima. Each of the differences A_2 f − A_1 f is bounded by B_1 f + B_2 f, with norm bounded by δ_1 + δ_2.
[Typically use the L_2 norm. Bernstein for tail probabilities of bounded functions. Chaining must stop when the constant term in the bound overwhelms the variance term. For uniformly bounded classes of functions, luckily the contributions at the end of the chain are easy to take care of. For unbounded classes, need to truncate. Cite Dudley 81 for first form. Moment condition on envelope. Best form due to Ossiander 1987 and Seattle group; see comments about history in the Notes Section.]
Bernstein <6> Lemma. For independent ξ_1, ..., ξ_n and a measurable function g bounded in absolute value by a constant τ,
    P{|ν_n g| ≥ t} ≤ 2 exp( −(1/2)t² / (‖g‖_2² + (2/3)tτ/√n) ).
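The Lemma can be probed by simulation. A sketch, not from the text, assuming the reconstruction of the bound displayed above:

```python
import numpy as np

rng = np.random.default_rng(2)

def bernstein_bound(t, var, tau, n):
    """2 exp(-(1/2) t^2 / (var + (2/3) t tau / sqrt(n))), for |g| <= tau
    and ||g||_2^2 = var, as in the reconstructed Lemma above."""
    return 2.0 * np.exp(-0.5 * t * t / (var + (2.0 / 3.0) * t * tau / np.sqrt(n)))

# g(x) = x - 1/2 on Uniform(0,1): tau = 1/2, ||g||_2^2 = 1/12, Pg = 0.
n, reps, t = 200, 20_000, 1.0
sims = rng.uniform(size=(reps, n)) - 0.5
nu = sims.sum(axis=1) / np.sqrt(n)          # one nu_n g per replication
freq = np.mean(np.abs(nu) >= t)             # empirical tail probability
bound = bernstein_bound(t, 1.0 / 12.0, 0.5, n)
print(freq, bound)
```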
The chaining bounds will be derived by a recursive procedure based on successive finite approximations to F in the sense of a norm ‖·‖. For independent {ξ_i}, it will usually be the L_2 norm,
    ‖g‖_2² = (1/n) Σ_{i≤n} Pg(ξ_i)².
However, the argument is written so that it works for other norms, such as those introduced by Rio (1994) and Doukhan, Massart & Rio (1994) for mixing processes. Only two extra specific properties are required of the norm.
In the literature, the most familiar example of such a bound is the Bennett inequality for sums of independent random variables. Suppose ξ_1, ..., ξ_n are independent and |g| is bounded by a constant τ, that is, Δ(g) ≤ τ. Then Bennett's inequality asserts that
Bennett <7>
    P{|ν_n g| ≥ ε‖g‖_2} ≤ 2 exp( −(1/2)ε² B(ετ/(√n‖g‖_2)) ),   for ε > 0,
where
    B(x) = ( (1 + x) log(1 + x) − x ) / (x²/2).
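The behaviour of the factor B(·) is easy to tabulate. A sketch, assuming the reconstructed formula above (with B(0) = 1 as the continuous extension):

```python
import math

def B(x):
    """Bennett's nuisance factor B(x) = ((1+x)log(1+x) - x) / (x^2/2),
    with B(0) = 1 (continuous extension -- my convention)."""
    if x == 0.0:
        return 1.0
    return ((1.0 + x) * math.log(1.0 + x) - x) / (x * x / 2.0)

# B decreases from 1; for large x it behaves like (2 log x)/x, which is
# how the nuisance factor comes to dominate the exponent in the bound.
vals = [B(x) for x in (0.0, 0.1, 1.0, 10.0, 100.0)]
print(vals)
```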
It is the presence of the nuisance factor B(ετ/(√n‖g‖_2)) that complicates the chaining argument for tail probabilities. If τ stays fixed while ‖g‖_2 is made smaller, the nuisance factor begins to dominate the bound. It was for this reason that Bass (1985) and Ossiander (1987) needed to add an extra truncation step to the chaining argument. The truncation keeps the argument of B(·) close enough to zero that one can ignore the nuisance factor, and act as if the centered sum has sub-gaussian tails.
The possible dependence for the general chaining argument with L_1 norms can also be hidden behind a single assumption.
The key ideas are easiest to explain (and the proof is simplest) for the L_1 bound, P sup_F |ν_n f − ν_n(A_δ f)|. The analogous arguments for tail bounds are better known in the literature.
2. Independent summands
Suppose ξ_1, ..., ξ_n are independent random variables. Several simplifications are possible when dealing with independent summands, largely because sums of bounded independent variables behave almost like subgaussian variables (Chapter 9?).
It is natural to use the L_2 norm,
    ‖f‖_2² := (1/n) Σ_{i≤n} P f(ξ_i)²,
for two reasons: it enters as a measure of scale in inequalities for finite collections of functions; and it is easy to bound contributions discarded during truncation arguments, by means of the inequality
independent.trunc <8>
    n^{−1/2} Σ_i Pg(ξ_i){g(ξ_i) > √n‖g‖_2/t} ≤ Σ_i Pg(ξ_i)² t/(n‖g‖_2) = t‖g‖_2.
This Section will be devoted to an explanation of the ideas involved in the proof of the following maximal inequality.
indep.mean <9> Theorem. Suppose ξ_1, ..., ξ_n are independent and let F be a class of functions with finite bracketing numbers (for the L_2 norm) for which ∫_0^1 √(log N(x)) dx < ∞. Then, for some universal constant C,
    P sup_F |ν_n(f − A_δ(f))| ≤ C ∫_0^δ √(log(2N(x))) dx + R_η,
where
    R_η = n^{−1/2} Σ_{i≤n} P B(ξ_i){B(ξ_i) > η√n},
with B(x) = max_F B_δ(x, f) and η = δ/√(log 2N(δ)).
Anyone who worries about questions of measurability could work throughout with the outer integral P*.
The independence is needed only to establish a maximal inequality for finitely many bounded functions. It depends upon the elementary fact (see Pollard (1996, Chapter 4?) or Chow & Teicher (1978, page 338)) that the function defined by ψ(x) = 2(e^x − 1 − x)/x² for x ≠ 0, and ψ(0) = 1, is positive and increasing over the whole real line.
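The elementary fact is easy to check numerically; a minimal sketch, not from the text:

```python
import math

def psi(x):
    """psi(x) = 2(e^x - 1 - x)/x^2 for x != 0, psi(0) = 1.
    Positive and increasing over the whole real line -- the elementary
    fact on which the maximal inequality rests."""
    if x == 0.0:
        return 1.0
    return 2.0 * (math.exp(x) - 1.0 - x) / (x * x)

xs = [-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0]
vals = [psi(x) for x in xs]
print(vals)  # all positive, strictly increasing
```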
indep.max <10> Lemma. Suppose ξ_1, ..., ξ_n are independent under P. Let G be a finite class consisting of N functions, each bounded in absolute value by a constant τ√n and with ‖g‖_2 ≤ δ. Then there exists a universal constant C_0 such that
    P max_{g∈G} |ν_n g| ≤ C_0 δ√(log(2N))   if τ ≤ δ/√(log(2N)).
Proof. Consider first the bound for a single sum. Let W_i = (g(ξ_i) − Pg(ξ_i))/√n. Then |W_i| ≤ 2τ and Σ_i PW_i² ≤ δ². The moment generating function of ν_n g = Σ_i W_i equals
    Π_i ( 1 + P(tW_i) + P[(1/2)t²W_i² ψ(tW_i)] )
      ≤ Π_i ( 1 + P[(1/2)t²W_i² ψ(2tτ)] )        ψ increasing
      ≤ exp( (1/2)t² Σ_i PW_i² ψ(2tτ) )           using 1 + x ≤ e^x
      ≤ exp( (1/2)t² δ² ψ(2tτ) ).
The same argument works for −g.
Sum up 2N such moment generating functions to get the maximal inequality. For fixed t > 0,
    exp(t P max_G |ν_n g|) ≤ P exp(t max_G |ν_n g|)      by Jensen's inequality
      ≤ P exp(t max_G ν_n g) + P exp(t max_G ν_n(−g))
      ≤ 2N exp( (1/2)t²δ² ψ(2tτ) ).
Take logarithms then put t = √(log(2N))/δ to get
    P max_G |ν_n g| ≤ δ√(log(2N)) ( 1 + (1/2)ψ(2τ√(log(2N))/δ) ).
The stated inequality holds with C_0 = 1 + ψ(2)/2.
[Comment on how the factor of 2 could be absorbed into the constant, except when N = 1. It would be notationally convenient if we could dispense with the 2.]
The proof of the Theorem consists of repeated applications of the Lemma to the finite maxima obtained from a sequence of approximations. For k = 0, 1, ... and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions. Write f_k(·) = A_k(·, f) and B_k(·) = B_k(·, f) for the π_k-simple functions that define the δ_k-bracketing. Define h_k = √(log 2N_k). Here, and subsequently, I omit the argument f when it can be inferred from context.
Integrals bound sums
Typically the integrals that appear in bounds like the one in Theorem <9> are just tidier substitutes for sums of error terms, a substitution made possible by the geometric decrease in the {δ_k}: if h(·) is a decreasing function on R^+, then
    δ_k h(δ_k) = 2(δ_k − δ_{k+1}) h(δ_k) ≤ 2 ∫ {δ_{k+1} < x ≤ δ_k} h(x) dx.
Sum over k, to deduce that
    Σ_{k=0}^m δ_k h(δ_k) ≤ 2 ∫_{δ_{m+1}}^{δ_0} h(x) dx ≤ 2 ∫_0^{δ_0} h(x) dx.
Put h(x) = √(log(2N(x))).
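The substitution can be checked numerically. A sketch, not from the text, taking N(x) = 1/x so that h(x) = √(log(2/x)):

```python
import math

def h(x):
    """A decreasing entropy-type function: h(x) = sqrt(log(2 N(x)))
    with N(x) = 1/x (an illustrative choice, not from the text)."""
    return math.sqrt(math.log(2.0 / x))

delta = 1.0
m = 20
lhs = sum((delta / 2**k) * h(delta / 2**k) for k in range(m + 1))

# midpoint-rule approximation of 2 * integral_0^delta h(x) dx
steps = 200_000
dx = delta / steps
rhs = 2.0 * sum(h((j + 0.5) * dx) * dx for j in range(steps))
print(lhs, rhs)  # the geometric sum sits below twice the integral
```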
Nested partitions using logs
The bracketing argument is much simpler if the π_k partitions are nested, that is, if each π_{k+1} is a refinement of the preceding π_k, and if we can take the B_k as decreasing in k. The logarithmic dependence on the bracketing numbers lets us arrange both: without loss of generality we may assume the partitions to be nested and the bounding functions to decrease as k increases.
Once again write h(x) for √(log 2N(x)). Let π*_k be the common refinement of the partitions π_1, ..., π_k and let B*_k(f) = min_{i≤k} B_i(f). Within each region of the π*_k partition choose A*_k(f) to correspond to the B_i(f) that achieves the minimum defining B*_k(f). Notice that π*_k partitions F into at most
    M_k = N(δ_1)N(δ_2)...N(δ_k) ≤ exp( Σ_{i≤k} h(δ_i)² ) ≤ exp( (Σ_{i≤k} h(δ_i))² )
regions.
The integral term in Theorem <9> also bounds the sum corresponding to the finer partition:
    Σ_{k=0}^∞ δ_k √(log M_k) ≤ Σ_{k=0}^∞ δ_k Σ_j {j ≤ k} h(δ_j)
      = Σ_{j=0}^∞ h(δ_j) 2δ_j        because Σ_k {j ≤ k} 2^{−k} = 2^{−j+1}
      ≤ 8 ∫_0^δ h(x) dx.
Rather than carry the superscript * for π*_k and B*_k, let me just write π_k and B_k, absorbing the extra constants due to the nesting into the constant C.
Truncation regions
Lemma <10> suggests that we truncate the functions at a level depending on their L_2 norms. Let {η_k} be a decreasing sequence of constants, to be specified. It will turn out that η_0 equals the η specified in the statement of the Theorem.
Split sup_F |ν_n(f − f_0)| into the contributions from the regions {B_0 ≤ η_0√n} and {B_0 > η_0√n}. For each f, bound |ν_n(f − f_0)| by
top.split <11>
    |ν_n (f − f_0){B_0 ≤ η_0√n}| + n^{−1/2} Σ_{i≤n} B_0(ξ_i, f){B_0(ξ_i, f) > η_0√n}
      + n^{−1/2} Σ_{i≤n} P B_0(ξ_i, f){B_0(ξ_i, f) > η_0√n}.
The third term contributes at most
    n^{−1/2} Σ_{i≤n} P B_0(ξ_i, f)²/(η_0√n) ≤ δ²/η_0 = δ h(δ),
which gets absorbed into the integral contribution to the bound. The supremum of the second term over F is less than the R_η error. The first term is the starting point for the recursive procedure known as chaining.
At each succeeding step in the chaining argument we truncate the differences f − f_k more severely, based on the size of B_k. Write T_k(f), or just T_k, for the indicator function of the set ∩_{i≤k} {B_i(f) ≤ η_i√n}. Because the partitions are nested, there is no harm in letting the truncations accumulate. I will argue recursively to bound the contribution from the truncated remainders,
    γ_k = P sup_F |ν_n (f − f_k)T_k(f)|.
Notice that γ_0 equals the first term in <11>.
Recursive inequality
Start from the recursive equality,
rec.ineq.indep <12>
    (f − f_k)T_k = (f − f_{k+1})T_{k+1} + (f_{k+1} − f_k)T_k + (f − f_{k+1})T_k T^c_{k+1}.
Notice that the truncations are arranged so that each function on the right-hand side of <12> is bounded.
Apply ν_n to both sides of <12>, take suprema of absolute values, then expectations, to get
mean.indep.rec <13>
    γ_k ≤ γ_{k+1}
      + P max_F |ν_n (f_{k+1} − f_k)T_k|          (links)
      + P sup_F |ν_n (f − f_{k+1})T_k T^c_{k+1}|   (trunc)
The differences f_{k+1} − f_k contribute an error due to moving down one level in the chain of approximations; they are contributed by the links of the chain. The indicators T_k T^c_{k+1} pick out the contribution to the error of approximation when we move from the kth level of truncation to the (k+1)st. The links term is already set up for an application of Lemma <10>. The trunc term can be bounded by a maximum over a finite class, by means of the property that defines a bracketing.
Bound for the links term
Notice that each of f_k = A_k(f), f_{k+1} = A_{k+1}(f), and T_k is π_{k+1}-simple; functions f that lie in the same region of the partition share the same values for these three quantities. The maximum need run over at most N_{k+1} ≤ exp(h²_{k+1}) representative f, one from each region.
Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ B_k + B_{k+1}. The indicator function T_k ensures that, for every f,
    sup_x |(f_k − f_{k+1})T_k| ≤ (η_k + η_{k+1})√n ≤ 2η_k√n,
    max_F ‖(f_k − f_{k+1})T_k‖_2 ≤ ‖B_k‖_2 + ‖B_{k+1}‖_2 ≤ δ_k + δ_{k+1} = 3δ_{k+1}.
Notice that the truncation plays no role in bounding the L_2 norm. The constraint required by Lemma <10> is satisfied provided
indep.constraint1 <14>
    2η_k ≤ 3δ_{k+1}/h_{k+1}.
If we can choose the truncation levels {η_k} to satisfy this constraint we will have
    P max_F |ν_n (f_k − f_{k+1})T_k| ≤ 3C_0 δ_{k+1} h_{k+1}.
Bound for the trunc term
The truncated remainder process (f − f_{k+1})T_k T^c_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by the π_{k+1}-simple function B_{k+1}T_k T^c_{k+1}. In general, if g and G are functions for which |g| ≤ G, then
    |ν_n g| ≤ n^{−1/2} Σ_{i≤n} G(ξ_i) + n^{−1/2} Σ_{i≤n} P G(ξ_i) = ν_n G + 2n^{−1/2} Σ_{i≤n} P G(ξ_i).
Invoke the inequality with g = (f − f_{k+1})T_k T^c_{k+1} and G = B_{k+1}T_k T^c_{k+1}:
bracket.bnd <15>
    P sup_F |ν_n (f − f_{k+1})T_k T^c_{k+1}| ≤ P max_F |ν_n B_{k+1}T_k T^c_{k+1}|
      + 2 max_F n^{−1/2} Σ_{i≤n} P B_{k+1}T_k T^c_{k+1}(ξ_i).
Again the maximum need run over at most N_{k+1} representative f. The functions B_{k+1}T_k T^c_{k+1} are bounded in absolute value by η_k√n (because B_{k+1} ≤ B_k ≤ η_k√n on T_k) and have L_2 norms less than δ_{k+1}. Provided
indep.constraint2 <16>
    η_k ≤ δ_{k+1}/h_{k+1},
Lemma <10> will bound the first term on the right-hand side by C_0 δ_{k+1} h_{k+1}. Inequality <8> bounds the second term by 2δ²_{k+1}/η_{k+1}.
In summary: provided constraints <14> and <16> hold, we have the recursive inequality
mean.indep.rec2 <17>
    γ_k ≤ γ_{k+1} + 4C_0 δ_{k+1} h_{k+1} + 2δ²_{k+1}/η_{k+1}.
We minimize R_η and the error term in <17> by choosing the η_k as large as possible, that is, η_k = δ_{k+1}/h_{k+1}.
Summation
The recursive inequality then simplifies to
    γ_k ≤ γ_{k+1} + (2 + 4C_0) δ_{k+1} h_{k+1}.
Repeated substitution, and replacement of the resulting sum by its bounding integral, leaves us with the inequality
    γ_0 ≤ γ_k + (8 + 16C_0) ∫_0^δ h(x) dx.
As k tends to infinity, γ_k tends to zero (it is bounded by 2nη_k), which leads to the integral bound as stated in the theorem.
3. A generic bracketing bound
In general the norm ‖·‖ plays the role of a scaling, as suggested by some probabilistic bound for a single ν_n g. Such bounds typically also depend on a second measure of size Δ(·). In this Chapter, unless otherwise specified, Δ(g) will denote the supremum norm sup_x |g(x)|.
Improvements:
Applies to general norms.
Lower terminal of integral not quite zero; helpful for mixing applications.
Add one more term to the recursive equality, to avoid the assumption that partitions are nested.
mean.assumption <18> Assumption. Suppose the norm ‖·‖ satisfies:
(i) ‖g_1‖ ≤ ‖g_2‖ if |g_1| ≤ |g_2|;
(ii) there exists a nonnegative increasing function D(·) on R^+ such that
    Σ_i Pg(ξ_i){g(ξ_i) > ‖g‖/t} ≤ ‖g‖D(t)   for each t > 0.
The form of the upper bound is suggested by the methods of Doukhan et al. (1994) for absolutely regular processes. For independent processes, with the L_2 norm ‖·‖_2, we can take D(t) ≡ t.
max.mean.assumption <19> Assumption. Suppose there exist increasing functions G(·) and H(·) for which the following property holds. If G is a finite set of at most N functions on X for which Δ(g) ≤ τ and ‖g‖ ≤ δ for each g in G, then
    P max_{g∈G} |ν_n g| ≤ δH(N)   if τ ≤ δ/G(N).
For example, for independent summands with ‖·‖_2 as norm, both H(N) and G(N) can be taken as multiples of √(log(2N)), as was shown in Section 2.
The upper bounds as stated are sensible only if the various integrals converge. The detail of the proof shows that the lower terminal of the integrals can be replaced by an ε > 0, with only a slight increase in the constant. As shown by Birgé & Massart (1993), such a refinement is important for applications to minimum contrast estimators for infinite-dimensional parameters. For Theorem <20>, the ε is determined by the equality
    ε√n = ∫_ε^δ J(x) dx.
For Theorem <39>, it is determined by the equality [what?]
main.mean <20> Theorem. Suppose Assumptions <18> and <19> hold. Then, for a fixed δ > 0 and some universal constant C_0,
    P sup_F |ν_n(f − A_δ(f))| ≤ C_0 ∫_ε^δ J(x) dx + n^{−1/2} Σ_{i≤n} P B(ξ_i){B(ξ_i) > ??},
where
    B(x) = max_F B_δ(x, f),
    J(x) = H(N(x)N(x/2)) + D(2G(N(x)N(x/2))),
    ?? = ??
and ε is the largest value for which H(N(2ε)) ≤ √n.
Proof. For k = 0, 1, ... and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions. Write f_k = A_k(·, f) and B_k = B_k(·, f) for the π_k-simple functions that define the δ_k-bracketing. Define G_k = G(N_k N_{k+1}) and H_k = H(N_k N_{k+1}).
The bound will be derived by a recursive argument, involving successive simple approximations to ν_n f. At the kth step, the approximating functions will be π*_k-simple, where π*_k denotes the common refinement of π_k and π_{k+1}. Suprema over classes of π*_k-simple functions reduce to maxima over at most N_k N_{k+1} representatives, one for each region of π*_k. Assumption <19> therefore suggests that we need functions bounded in absolute value by τ_k = δ_k/G_k, a property that will be achieved by truncation. Write T_k(f), or just T_k, for the indicator function of the set {B_k(f) ≤ τ_k}. Notice that τ_0 ≥ M, and hence T_0(f) = X. I will argue recursively to bound the contribution from the truncated remainders,
    γ_k = P* sup_F |ν_n (f − f_k)T_k(f)|.
Notice that γ_0 is the quantity we seek to bound.
The recursive inequality for γ_k will be derived from the equality
recursive.equality <21>
    (f − f_k)T_k = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T^c_k T_{k+1}
                   + (f_{k+1} − f_k)T_k T_{k+1} + (f − f_k)T_k T^c_{k+1}.
Here and subsequently I omit the argument f when it can be inferred from context. To verify equality <21>, notice that the first two terms on the right-hand side sum to (f − f_{k+1})T_k T_{k+1}; the third term then replaces the factor (f − f_{k+1}) by (f − f_k); the last term restores the contribution from the region where T_{k+1} fails. The role of the second term is to undo the effect of the B_k truncation after it has done its work; the successive truncations do not accumulate as in Ossiander's (1987) argument. Without such a tidying up, products such as N_k N_{k+1} would be replaced by products N_0 N_1 ... N_k N_{k+1}, which might cause summability difficulties if H(N) grew faster than a slow logarithmic rate. Notice that the truncations are arranged so that each summand on the right-hand side of <21> is bounded.
Apply ν_n to both sides of <21>, take suprema of absolute values, then expectations, to get
mean.recursive <22>
    γ_k ≤ γ_{k+1} + P sup_F |ν_n (f − f_{k+1})T^c_k T_{k+1}|
      + P max_F |ν_n (f_{k+1} − f_k)T_k T_{k+1}|
      + P sup_F |ν_n (f − f_k)T_k T^c_{k+1}|.
Notice that the maximum in the third term on the right-hand side runs over only finitely many distinct functions; Assumption <19> will handle this term directly. The bracketing will increase both the second and fourth terms to bounds on simple processes that will be handled by the same inequality.
The two Assumptions lead to simple bounds for the last three terms on the right-hand side of <22>.
Bound for the third term
Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ B_k + B_{k+1}. The indicator function T_k T_{k+1} ensures that
    max_F Δ( (f_k − f_{k+1})T_k T_{k+1} ) ≤ τ_k + τ_{k+1},
    max_F ‖(f_k − f_{k+1})T_k T_{k+1}‖ ≤ δ_k + δ_{k+1}.
The constraint required by Assumption <19> is satisfied:
    τ_k + τ_{k+1} = δ_k/G_k + δ_{k+1}/G_{k+1} ≤ (δ_k + δ_{k+1})/G_k.
It follows that
    P max_F |ν_n (f_k − f_{k+1})T_k T_{k+1}| ≤ (δ_k + δ_{k+1})H_k.
Bound for the second term
The truncated remainder process (f − f_{k+1})T^c_k T_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by B_{k+1}T^c_k T_{k+1}. In general, if g and h are functions for which |g| ≤ h, then
    |ν_n g| ≤ n^{−1/2} Σ_{i≤n} h(ξ_i) + n^{−1/2} Σ_{i≤n} Ph(ξ_i) = ν_n h + 2n^{−1/2} Σ_{i≤n} Ph(ξ_i).
Apply the bound with g = (f − f_{k+1})T^c_k T_{k+1} and h = B_{k+1}T^c_k T_{k+1}:
bracket.bnd <23>
    P sup_F |ν_n (f − f_{k+1})T^c_k T_{k+1}| ≤ P max_F |ν_n B_{k+1}T^c_k T_{k+1}|
      + 2n^{−1/2} max_F Σ_{i≤n} P B_{k+1}T^c_k T_{k+1}(ξ_i).
From the bounds
    max_F Δ(B_{k+1}T^c_k T_{k+1}) ≤ max_F Δ(B_{k+1}T_{k+1}) ≤ τ_{k+1},
    max_F ‖B_{k+1}T^c_k T_{k+1}‖ ≤ max_F ‖B_{k+1}‖ ≤ δ_{k+1},
Assumption <19> gives
    P max_F |ν_n B_{k+1}T^c_k T_{k+1}| ≤ δ_{k+1}H_k,
because τ_{k+1} = δ_{k+1}/G_{k+1} ≤ δ_{k+1}/G_k.
For the contribution from the expected values, split according to which of B_k or B_{k+1} is larger, to bound the function B_{k+1}T^c_k T_{k+1} by
mean.bnd <24>
    B_{k+1}{B_{k+1} > τ_k} + B_k{B_k > τ_k} = B_{k+1}{B_{k+1} > 2δ_{k+1}/G_k} + B_k{B_k > δ_k/G_k}.
Thus, via Assumption <18>,
    Σ_{i≤n} P B_{k+1}T^c_k T_{k+1}(ξ_i) ≤ δ_{k+1}D(G_k/2) + δ_k D(G_k).
Bound for the fourth term
By the symmetry between k and k + 1 in the argument for the second term, we can interchange their roles to get
bracket.bnd <25>
    P sup_F |ν_n (f − f_k)T^c_{k+1} T_k| ≤ P max_F |ν_n B_k T^c_{k+1} T_k|
      + 2n^{−1/2} max_F Σ_{i≤n} P B_k T^c_{k+1} T_k(ξ_i).
Assumption <19> applies because τ_k = δ_k/G_k, to bound the first contribution by δ_k H_k. Arguing as for the second term, bound the contribution from the means by
    Σ_{i≤n} P B_k(ξ_i){B_k(ξ_i) > δ_k/(2G_{k+1})} + P B_{k+1}(ξ_i){B_{k+1}(ξ_i) > δ_{k+1}/G_{k+1}}
      ≤ δ_{k+1}D(G_{k+1}) + δ_k D(2G_{k+1}).
Notice that D(2G_{k+1}) is the largest of the four contributions from D to the second and fourth terms.
Recursive inequality
The recursive inequality <22> has now simplified to
    γ_k ≤ γ_{k+1} + 6δ_{k+1}H_k + 6δ_{k+1}D(2G_{k+1}).
Recursive substitution, and replacement of the resulting sum by its bounding integral, leaves us with the inequality
    γ_0 ≤ γ_k + 12 ∫_{δ_k}^{δ_0} J(x) dx,
with J(x) as in the statement of the Theorem. As k tends to infinity, γ_k tends to zero (it is bounded by 2nτ_k), which leads to the integral bound as stated in the theorem.
A slightly better bound is obtained by stopping at a k just large enough to make γ_k comparable to the other terms in the bound. Remember that ε is determined by the equality
    ε√n = ∫_ε^δ J(x) dx = J.
Choose the largest k for which δ_{k+1} ≥ ε. Bound γ_k using the bracketing inequality,
    γ_k ≤ P max_F ν_n B_k{B_k ≤ τ_k} + 2n^{−1/2} Σ_{i≤n} P B_k{B_k ≤ τ_k}.
Notice that the T^c_{k+1} factor is missing. That has no effect on the first contribution, which, by virtue of Assumption <19>, is less than
    δ_k H(N_k) ≤ 2 ∫_{δ_{k+1}}^{δ_k} J(x) dx.
The sum of expected values is no longer in the form needed by Assumption <18>. Instead, it can be bounded by the corresponding second moment,
    2√n ( (1/n) Σ_{i≤n} P B_k² )^{1/2} ≤ 2√n δ_k ≤ 8J.
The entire γ_k contribution has been absorbed into the integral terms, leaving a final upper bound of 20J for γ_0.
4. Phi mixing
Let B_k denote the sigma-field generated by ξ_1, ..., ξ_k and A_k denote the sigma-field generated by ξ_k, ξ_{k+1}, .... Say that {ξ_i} has phi-mixing coefficients {φ_m} if, for all nonnegative integers k and m,
    |PAB − (PA)(PB)| ≤ φ_m PB   for all B ∈ B_k and A ∈ A_{k+m}.
If X is B_k-measurable and integrable, and Y is A_{k+m}-measurable and bounded by a constant K, a simple approximation argument (see Billingsley 1968, page 170) shows that
covar <26>
    |PXY − (PX)(PY)| ≤ 2Kφ_m P|X|.
This inequality leads to a bound on the moment generating function of a sum, which will play the same role as Lemma <indep.mgf>. The argument is a slightly modified form of the proof of Collomb's (1984) Lemma 1, with elimination of his first moment quantities from the bound.
Once again, work with the L_2 norm, ‖g‖_2² = Σ_{i≤n} Pg(ξ_i)².
phi.mgf <27> Lemma. Let W_1, ..., W_n be random variables with phi-mixing coefficients {φ_m}. Suppose each |W_i| is bounded by a constant τ, that PW_i = 0, and that Σ_{i≤n} PW_i² ≤ σ². Then, with C = 4 + 16Σ_{i≤n} φ_i,
    P exp(t Σ_{i≤n} W_i) ≤ exp( Ct²σ² + n√e φ_m/m )
for each t > 0 and each positive integer m with 1 ≤ m ≤ n and mtτ ≤ 1/4.
Proof. Break {1, …, n} into successive blocks B_1, B′_1, …, B_N, B′_N of length m (except for B_N and B′_N, which might be shorter). Notice that n + 2m ≥ 2Nm ≥ n, whence n/2m ≥ N − 1. Define B_i = Σ_{j∈B_i} W_j and T_i = B_1 + … + B_i. Define B′_i and T′_i in a similar fashion. Write σ²_j for P W_j², and V_i for Σ_{j∈B_i} σ²_j.
By convexity,

  P exp(t Σ_{i≤n} W_i) = P exp(½(2tT_N + 2tT′_N)) ≤ ½ P exp(2tT_N) + ½ P exp(2tT′_N).
Consider the first term on the right-hand side. Peel off the contribution from the block B_N, using the inequality <26> for a separation of at least m. Notice that |2tB_N| ≤ 2tmδ ≤ ½ by the constraint on m and t, whence exp(2tB_N) ≤ √e. From inequality <26>,

  P exp(2t(T_{N−1} + B_N)) ≤ P exp(2tT_{N−1}) P exp(2tB_N) + 2φ_m P exp(2tT_{N−1}) √e.
Because e^x ≤ 1 + x + x² for |x| ≤ ½, we also have

  P exp(2tB_N) ≤ 1 + 4t² P B²_N.
The mixing condition lets us bound P B²_N by ¼CV_N (the argument is essentially the one on page 172 of Billingsley (1968)), which leads to

  P exp(2tT_N) ≤ P exp(2tT_{N−1}) (1 + Ct²V_N + 2φ_m √e) ≤ P exp(2tT_{N−1}) exp(Ct²V_N + 2φ_m √e).
Repeat the argument another N − 2 times to get

  P exp(2tT_N) ≤ P exp(2tT_1) exp(Ct²(V_2 + … + V_N) + 2(N − 1)φ_m √e)
   ≤ exp(4t²V_1) exp(Ct²(V_2 + … + V_N) + (n/m)φ_m √e).
A similar argument gives a similar inequality for P exp(2tT′_N). The asserted bound on the moment generating function follows.

The inequality, as stated, requires no convergence condition on the mixing coefficients; the bound is for a fixed n. However, in applications the inequality is needed for arbitrarily large n. In that case one needs the familiar condition Σ_{k=1}^∞ φ_k < ∞.
The dependence between the variables has contributed the φ_m/m term, which indirectly provides the constraint appearing in the phi-mixing incarnation of Assumption <19>. The constraint is best expressed in terms of a decreasing function Φ for which φ_m/m ≤ Φ(m) for each m. The dependence throws an extra factor Φ^{-1}((log 2N)/n) into the definition of G. When log 2N > nΦ(1), this factor should be interpreted as 1.
phi.max <28> Corollary. Let {ξ_i} have phi-mixing coefficients for which Σ_{m=1}^∞ φ_m < ∞. Let G be a finite class consisting of N functions for which Δ(g) ≤ δ and ‖g‖₂ ≤ σ. Then there exists a constant C_4, depending only on the mixing coefficients, such that

  P max_{g∈G} |Sg| ≤ C_4 σ √log(2N)  if δ ≤ σ/G(N),

where

  G(N) = 16 Φ^{-1}((log 2N)/n) √log(2N).

That is, Assumption <19> holds with H(N) = C_4 √log(2N) and G(N) as given.
Proof. Argue as for Corollary <10> that, provided 0 < 8mtδ < 1,

  P max_G |Sg| ≤ log(2N)/t + Ctσ² + √e Φ(m) n/t.

Choose t = √log(2N)/σ, and let m be the smallest integer greater than 1 for which nΦ(m) ≤ log(2N). Then m/2 ≤ m − 1 ≤ Φ^{-1}((log 2N)/n), from which the constraint follows.
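The choices of t and m in this proof can be mimicked numerically. The sketch below is illustrative only: the rate Φ(m) = m^{−2}, its inverse, and all numerical values are hypothetical, chosen merely to confirm that the constraint 8mtδ < 1 holds when δ is at its largest allowed value σ/G(N).

```python
import math

def choose_params(n, N, sigma, Phi, Phi_inv):
    """Mimic the proof: t = sqrt(log 2N)/sigma, m the smallest integer > 1
    with n*Phi(m) <= log(2N), and G(N) = 16*Phi_inv(log(2N)/n)*sqrt(log 2N)."""
    L = math.log(2 * N)
    t = math.sqrt(L) / sigma
    m = 2
    while n * Phi(m) > L:
        m += 1
    G = 16 * Phi_inv(L / n) * math.sqrt(L)
    return t, m, G

Phi = lambda m: m ** -2.0          # hypothetical decreasing rate function
Phi_inv = lambda u: u ** -0.5      # its inverse

n, N, sigma = 10000, 50, 1.0
t, m, G = choose_params(n, N, sigma, Phi, Phi_inv)
delta = sigma / G                  # largest delta allowed by the corollary
assert n * Phi(m) <= math.log(2 * N) < n * Phi(m - 1)
assert 8 * m * t * delta < 1       # the constraint used in the proof
```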
Derive the analogous tail bound?

The effect of the extra factor in the definition of G is best understood through a specific example.
phi.clt1 <29> Example. Suppose {ξ_i} is a stationary sequence with marginal distribution P and mixing coefficients such that Φ(x) = O(x^{−p}). Suppose the class F has envelope F such that PF^v < ∞ for some v > 2, and bracketing numbers with log N₂(x) = O(x^{−2w}) for some w < 1. As for Example <mean.Ossiander>, a functional central limit theorem can be proved provided p, v, and w are suitably related.

Let η_n tend to zero so slowly that nP{F > η_n n^{1/v}} → 0. Then the truncated class F′ = { f {F ≤ η_n n^{1/v}}/√n : f ∈ F} has elements bounded by M_n = η_n n^{1/v − 1/2}, which tends to zero faster than a small negative power of n. Section incomplete from here on.

Need to let δ tend to zero fast enough so that the n^{1/p} does not blow up the entropy integral. This will be different from the fixed δ for the stochastic equicontinuity.
phi.clt2 <30> Example. Same setup as for the previous example, but work with a new norm, à la DMR.

DMR (following Rio 1994) realized that it was more elegant to absorb the required rate of convergence for a sequence of mixing coefficients into the definition of a norm,

  ‖X‖²_{2,r} = ∫_0^1 r^{-1}(u) Q_X(u)² du.

With a slight change of notation, their Lemma 5 corresponds precisely to my Assumption <18>. Make sure r^{-1} ≥ 1, so that the new norm is larger than the L₂ norm.
For a nonnegative random variable X define F̄(x) = P{X > x}, with corresponding quantile function Q_X(u) = inf{x : F̄(x) ≤ u} for 0 < u < 1. The subtle choice of inequalities ensures that

  F̄(x) > u if and only if x < Q_X(u).

Consequently, X has the same distribution as Q_X(U) where U is distributed Uniform(0, 1).
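A small numerical sketch (not from the original text) of the quantile construction: for a discrete distribution it checks the equivalence F̄(x) > u if and only if x < Q_X(u), and confirms that Q_X(U) reproduces the distribution of X when U is uniform on (0, 1).

```python
# A discrete nonnegative X: atoms with probabilities (arbitrary example).
vals = [0.0, 1.0, 2.0, 5.0]
probs = [0.4, 0.3, 0.2, 0.1]

def Fbar(x):
    """Tail function P{X > x}."""
    return sum(p for v, p in zip(vals, probs) if v > x)

def Q(u):
    """Quantile function inf{x : Fbar(x) <= u}; the inf is attained at an atom."""
    return min(v for v in vals if Fbar(v) <= u)

# The defining equivalence: Fbar(x) > u  iff  x < Q(u).
for u in [0.05, 0.15, 0.35, 0.65, 0.95]:
    for x in [-1.0, 0.0, 0.5, 1.0, 3.0, 5.0]:
        assert (Fbar(x) > u) == (x < Q(u))

# Q(U) with U uniform recovers the distribution of X: the set {u : Q(u) = v}
# has length P{X = v}.  Check on a fine grid avoiding the jump points.
grid = [(i + 0.5) / 10000 for i in range(10000)]
for v, p in zip(vals, probs):
    mass = sum(1 for u in grid if Q(u) == v) / len(grid)
    assert abs(mass - p) < 1e-3
```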
DMR5 <31> Lemma. Define ρ(ε) by the equality r(ρ(ε)) = ερ(ε). For each ε > 0,

  P X{X > ‖X‖_{2,r}/(ρ(ε)√ε)} ≤ ‖X‖_{2,r} √ε.
Proof. Write σ for ‖X‖_{2,r}. Put ρ = ρ(ε). By definition of the norm,

  σ² ≥ ∫_0^{r(ρ)} r^{-1}(u) Q_X(u)² du ≥ ρ r(ρ) Q_X(r(ρ))²,

from which it follows that Q_X(r(ρ)) ≤ σ/(ρ√ε). Use the quantile representation to rewrite the left-hand side of the asserted inequality as

  ∫_0^1 Q_X(u){Q_X(u) > Q_X(r(ρ))} du = ∫_0^1 Q_X(u){u < r(ρ)} du ≤ ∫_0^1 √(r^{-1}(u)/ρ) Q_X(u){u < r(ρ)} du,

which, by the Cauchy-Schwarz inequality, is less than σ√(r(ρ)/ρ) = σ√ε, as asserted.
Let g′ = g/√n and ‖g‖ = ‖g(ξ_1)‖_{2,r}. Deduce that

  Σ_{i≤n} P g′(ξ_i){g(ξ_i) > ‖g‖/(ρ(ε)√ε)} ≤ √n P g{g > ‖g‖_{2,r}/(ρ(ε)√ε)} ≤ ‖g‖ √(εn).

In consequence, D(ρ(ε)√ε) ≤ √(εn). Apply with ε = (log 2N)/n to get

  Σ_{i≤n} P g′(ξ_i){g(ξ_i) > ‖g‖/G(N)} ≤ ‖g‖ √log 2N.

That should lead to an entropy integral like the one for independent variables, provided bracketing numbers are calculated using the ‖·‖_{2,r} norm. The final clt should then look like the result for independent variables.
[] 5. Absolute regularity
DMR have established a fclt for stationary, absolutely regular sequences; absolute regularity (beta-mixing) is a slightly weaker property than the phi-mixing assumption. The precise definition of absolute regularity is unimportant here. It matters only that it leads to a moment maximal inequality involving an interesting new norm.

Using the Berbee coupling between absolutely regular sequences and independent sequences, DMR also established an inequality that corresponds to my Assumption <19>. Write L(N) for the maximum of 1 and log N. Let G be a finite class consisting of N functions for which Δ(g) ≤ δ and ‖g‖_{2,r} ≤ σ. Put

  G(N) = r^{-1}(L(N)/n) √L(N)/8.

Then Lemma 4 of DMR implies existence of a universal constant C_3 such that:

DMR4 <32>.  P max_{g∈G} |Sg/√n| ≤ C_3 σ √L(N)  if δ ≤ σ√n/G(N).

How to rescale?
Derive the DMR bound?
A saturation lemma <33> Lemma. Suppose D(·) is a decreasing nonnegative function on (0, 1] for which ∫_0^1 D(x) dx < ∞. The integral condition imposes some constraint on how fast D(x) can increase as x tends to zero. Often one needs more precise control over the rate. DMR have shown how to replace D by a slightly larger function for which one has such control. More precisely, the function defined by

  D̄(x) = sup_{0<t≤x} (t/x)² D(t)

has the following properties.

(i) D̄(x/2) ≤ 4D̄(x), for each x.
(ii) D̄ is decreasing.
(iii) D̄(x) ≥ D(x).
(iv) For each nonnegative function ψ on R⁺ for which ψ(x)/x is increasing,

  ∫_0^δ ψ(D̄(x)) dx ≤ 4 ∫_0^δ ψ(D(x)) dx,  for each 0 < δ ≤ 1.
Property (i) follows from the fact that the supremum defining D̄(x/2) runs over half the range, and it has an extra factor of (1/2)² in the denominator, as compared with the definition for D̄(x). Properties (ii) and (iii) follow from the equivalent definition for D̄,

  D̄(x) = sup_{0<s≤1} s² D(sx).

Property (iv) is more delicate. Split the range for the supremum defining D̄(x) into 0 < t ≤ x/2 and x/2 < t ≤ x, then bound D(t) from above by D(x/2) on the second subrange, to obtain

  D̄(x) ≤ max(¼ D̄(x/2), D(x/2)).

Apply ψ to both sides, invoke the fact that ψ(t/4) ≤ ψ(t)/4, then bound the maximum by a sum:
  ∫_0^δ ψ(D̄(x)) dx ≤ ∫_0^δ ψ(D̄(x/2))/4 dx + ∫_0^δ ψ(D(x/2)) dx.

With the change of variable y = x/2, the first integral on the right-hand side is seen to be less than half the integral on the left-hand side, and the second integral is seen to be less than 2∫_0^δ ψ(D(x)) dx. Property (iv) would then follow by a rearrangement of terms, provided all integrals are finite. To cover the case of infinite integrals, replace D̄ by D̄_C(x) = D̄(x) ∧ C, for a constant C, derive the inequality from property (iv) with D̄ replaced by D̄_C, then invoke Monotone Convergence as C tends to infinity.
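The saturated function D̄ is easy to compute on a grid. The sketch below is illustrative; the particular choice of D is arbitrary (steep near zero, flat beyond 1/2, so that the saturation is not trivial), and properties (i)-(iii) are checked numerically.

```python
def D(t):
    """An arbitrary decreasing test function: steep near 0, flat beyond 0.5."""
    return t ** -1.5 if t <= 0.5 else 1.0

def Dbar(x, steps=2000):
    """Saturation Dbar(x) = sup_{0 < t <= x} (t/x)^2 D(t),
    approximated on a grid of t values (t = x is included exactly)."""
    return max((t / x) ** 2 * D(t)
               for t in (x * (i + 1) / steps for i in range(steps)))

xs = [0.01 * (i + 1) for i in range(100)]            # grid on (0, 1]
vals = [Dbar(x) for x in xs]
for x, v in zip(xs, vals):
    assert v >= D(x) - 1e-9                          # property (iii)
    assert Dbar(x / 2) <= 4 * v + 1e-9               # property (i)
assert all(a >= b - 1e-9 for a, b in zip(vals, vals[1:]))   # property (ii)
```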
[] 6. Strong mixing

Let Y = (Y_1, …, Y_n). Fix a γ > 0. For each Q ≥ 2 define

  N(Q, Y) = ( Σ_{i≤n} (P|Y_i|^{Q+γ})^{Q/(Q+γ)} )^{1/Q}.

Doukhan (or Andrews and Pollard) have shown, under the assumption that

  Σ_{i≥1} (1 + i)^{Q−2} α_i^{γ/(Q+γ)} < ∞,
that

Doukhan.moment <34>  P|W_1 + … + W_n|^Q ≲ max(N(2, W), N(Q, W))^Q.

Let ‖g‖ = N(2, Y) where Y_i = g(ξ_i).
Fact: If Δ(g) ≤ ‖g‖ then N(Q, Y) ≤ N(2, Y).

Proof: Write δ = Δ(g), and let s = Q(2 + γ)/2. Then

  N(Q, Y) = ( Σ_{i≤n} (P|Y_i|^{Q+γ})^{Q/(Q+γ)} )^{1/Q}
   ≤ ( Σ_{i≤n} (P|Y_i|^s)^{Q/s} )^{1/Q}
   ≤ ( Σ_{i≤n} (δ^{s−2−γ} P|Y_i|^{2+γ})^{Q/s} )^{1/Q}
   = δ^{(Q−2)/Q} N(2, Y)^{2/Q}
   ≤ N(2, Y).
Fact: Σ_{i≤n} P|Y_i|{|Y_i| > ‖Y‖/t} ≤ ‖Y‖ t^{1+γ}.

Proof: Let σ = ‖Y‖. Let σ_i denote the L_{2+γ} norm of Y_i. Note that max_i σ_i ≤ σ. The left-hand side is less than

  Σ_{i≤n} P|Y_i|^{2+γ} t^{1+γ}/σ^{1+γ} ≤ t^{1+γ} Σ_{i≤n} σ_i² σ^γ/σ^{1+γ} = t^{1+γ} Σ_{i≤n} σ_i²/σ = σ t^{1+γ}.
[] 7. Tail probability bounds
A tail bound for a single Sg easily implies a maximal inequality for a finite class of functions: one has merely to add together a finite number of bounds. For example, with independent summands, and a class G of at most N functions each satisfying ‖g‖₂ ≤ σ and Δ(g) ≤ δ, the Bennett inequality would give

  P{max_G |Sg| > εσ} ≤ exp( log(2N) − ½ ε² B(εδ/σ) ).

In chaining arguments we need ε large enough to absorb the log(2N), which suggests that we replace ε by ε + √log(2N). If δ is constrained to be less than σ/(ε + √log(2N)) then we get the sub-gaussian tail exp(−cε²), for some constant c. It turns out that the derivation of such a maximal inequality for finite classes is the only part of the argument where independence is needed. Similar inequalities (derived by slightly more involved arguments; see Sections 3 and 4) hold for various types of mixing sequence; they play the same role in the chaining argument. In general, for a chaining argument with tail probabilities, all details about possible dependence can be hidden behind a single assumption.
max.tail.assumption <35> Assumption. Suppose there exist functions G(·, ·) and H(·), increasing in each argument, and a decreasing function ψ(·), for which the following property holds. If G is a finite set of at most N functions on X for which Δ(g) ≤ δ and ‖g‖ ≤ σ for each g ∈ G, then

  P{max_{g∈G} |Sg| ≥ σ(H(N) + ε)} ≤ ψ(ε)  if δ ≤ σ/G(N, ε),

for each ε > 0.
The argument is slightly more involved than for the proof of Theorem <20>, because there are two more sequences of constants to be chosen correctly.

main.tail <36> Theorem. Suppose Assumptions <18> and <35> hold. Suppose the functions in F are bounded in absolute value by a constant M ≤ δ/G(N(δ)N(δ/2)), for a fixed δ > 0. Let ε(·) be a decreasing, nonnegative function on R⁺. Then, for some universal constants C_1 and C_2,

  P{sup_F |S(f − A_δ(f))| > C_2 ∫_0^δ K(x) dx} ≤ C_1 ∫_0^δ ψ(ε(x))/x dx,

where K(x) = H(N(x)N(x/2)) + ε(x) + D(2G(N(x)N(x/2), ε(x))).
As before, for k = 0, 1, … and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions, and write f_k = A_k(f) and B_k = B_k(·, f) for the π_k-simple functions that define the δ_k-bracketing. Define G_k = G(N_k N_{k+1}, ε_k) and γ_k = δ_k/G_k and H_k = H(N_k N_{k+1}) and ψ_k = ψ(ε_k).
Start once more from the recursive equality

  (f − f_k)T_k = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T^c_k T_{k+1} + (f_{k+1} − f_k)T_k T_{k+1} + (f − f_k)T_k T^c_{k+1}.
Define a corresponding sequence of constants {R_k}, with

  R_k = R_{k+1} + 6δ_{k+1}(H_k + ε_{k+1} + D_k)  where D_k = D(2G_k).
This time write τ_k for P{sup_F |S(f − f_k)T_k| > R_k}. Then we have the recursive tail bound

  τ_k ≤ τ_{k+1} + P{sup_F |S(f − f_{k+1})T^c_k T_{k+1}| > δ_{k+1}(H_k + ε_k + 3D_k)}
    + P{max_F |S(f_{k+1} − f_k)T_k T_{k+1}| > (δ_k + δ_{k+1})(H_k + ε_k)}
    + P{sup_F |S(f − f_k)T_k T^c_{k+1}| > δ_k(H_k + ε_k + 3D_k)}.  tail.recursive <37>
The argument for the second, third, and fourth terms on the right-hand side of <37> parallels the argument started from <22>. I omit most of the repeated detail.

The maximum in the third term again runs over at most N_k N_{k+1} distinct representatives, each with a bound of at most γ_k + γ_{k+1} and a norm of at most δ_k + δ_{k+1}. The γ_k were again chosen to ensure that the constraint required by Assumption <35> is satisfied. It follows that the third term contributes at most ψ(ε_k).
As before,

  |S(f − f_{k+1})T^c_k T_{k+1}| ≤ |S B_{k+1}T^c_k T_{k+1}| + 2 max_F Σ_{i≤n} P B_{k+1}T^c_k T_{k+1}(ξ_i),

and the sum of the means is less than 3δ_{k+1}D_k. Thus the second term is less than

  P{max_F |S B_{k+1}T^c_k T_{k+1}| > δ_{k+1}(H(N_k N_{k+1}) + ε_{k+1})} ≤ ψ(ε_{k+1}).

Similarly, the fourth term is bounded by ψ(ε_k).
With repeated substitution, the recursive inequality <37> leads to

  τ_0 ≤ τ_k + 2 ∫_{δ_{k+1}}^{δ_0} ψ(ε(x))/x dx,

with

  R_0 = R_k + 12 ∫_{δ_{k+1}}^{δ_0} K(x) dx.

Argue as in the previous section, but with J(·) replaced by K(·), to bound the contributions from τ_k and R_k for an appropriately chosen k.

Not quite right. There is another contribution to the tail bound. Fix.
Old section on Tail probabilities
max.tail.assumption <38> Assumption. Suppose there exist functions G(·, ·) and H(·), increasing in each argument, and a decreasing function ψ(·), for which the following property holds. If G is a finite set of functions on X for which Δ(g) ≤ δ and ‖g‖ ≤ σ for each g ∈ G, then

  P{max_{g∈G} |Sg| ≥ σ(H(N) + ε)} ≤ ψ(ε)  if δ ≤ σ/G(N, ε),

for each ε > 0.
main.tail <39> Theorem. Suppose Assumptions <18> and <38> hold. Suppose the functions in F are bounded in absolute value by a constant M. Suppose δ > 0 is such that M ≤ δ/G(N(δ)N(δ/2)). Then, for some universal constant C_0,

  P{sup_F |S(f − A_δ(f))| > R_0} ≤ C_0 ∫_0^δ [H(N(x)N(x/2)) + D(G(8N(x)N(x/2)))] dx,

where R_0 = ??? and ???
Proof. The argument is slightly more involved than for the proof of Theorem <main>, because there is one more sequence of constants to be chosen correctly.

As before, for k = 0, 1, … and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions. Write f_k = A_k(f) and B_k(f) for the π_k-simple functions that define the δ_k-bracketing. Define G_k = G(N_k N_{k+1}, ε_k) and γ_k = δ_k/G_k and H_k = H(N_k N_{k+1}), where the {ε_k} remain to be chosen.
Start once more from the recursive equality

  (f − f_k)T_k = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T^c_k T_{k+1} + (A_{k+1} − A_k)T_k T_{k+1} + (f − f_k)T_k T^c_{k+1}.
Define a corresponding sequence of constants {R_k}, with

  R_k = R_{k+1} + ??? H_k + ??? δ_k ε_{k+1} + ??? Δ_k,

where Δ_k = δ_k D(1/γ_k). This time write τ_k for P{sup_F |S(f − f_k)T_k| > R_k}.
Then we have the recursive tail bound

  τ_k ≤ τ_{k+1} + P{sup_F |S(f − f_{k+1})T^c_k T_{k+1}| ≥ (δ_k + δ_{k+1})(H_k + ε_k + Δ_k)}
    + P{max_F |S(f_{k+1} − f_k)T_k T_{k+1}| > δ_k(H_k + ε_{k+1} + Δ_k)}
    + P{sup_F |S(f − f_k)T_k T^c_{k+1}| > δ_k(H_k + ε_{k+1} + Δ_k)}.  tail.recursive <40>
Bound for the third term

Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ B_k + B_{k+1}. The indicator function T_k T_{k+1} ensures that

  max_F Δ((f_k − f_{k+1})T_k T_{k+1}) ≤ γ_k + γ_{k+1},
  max_F ‖(f_k − f_{k+1})T_k T_{k+1}‖ ≤ δ_k + δ_{k+1}.
The constraint required by Assumption <38> is satisfied:

  γ_k + γ_{k+1} = δ_k/G_k + δ_{k+1}/G_{k+1} ≤ (δ_k + δ_{k+1})/G_k,

because G_{k+1} ≥ G_k. It follows that

  P{max_F |S(f_k − f_{k+1})T_k T_{k+1}| ≥ 3δ_{k+1}(H_k + ε_k)} ≤ ψ_k.
Bound for the second term

The truncated remainder process (f − f_{k+1})T^c_k T_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by B_{k+1}T^c_k T_{k+1}. In general, if g and h are functions for which |g| ≤ h, then

  |Sg| ≤ Σ_{i≤n} h(ξ_i) + Σ_{i≤n} P h(ξ_i) = Sh + 2 Σ_{i≤n} P h(ξ_i).

Apply the bound with g = (f − f_{k+1})T^c_k T_{k+1} and h = B_{k+1}T^c_k T_{k+1}:

tail.bracket.bnd <41>
  sup_F |S(f − f_{k+1})T^c_k T_{k+1}| ≤ max_F |S B_{k+1}T^c_k T_{k+1}| + 2 max_F Σ_{i≤n} P B_{k+1}T^c_k T_{k+1}(ξ_i).
From the bounds

  max_F Δ(B_{k+1}T^c_k T_{k+1}) ≤ max_F Δ(B_{k+1}T_{k+1}) ≤ γ_{k+1},
  max_F ‖B_{k+1}T^c_k T_{k+1}‖ ≤ max_F ‖B_{k+1}‖ ≤ δ_{k+1},

Assumption <38> gives

  P{max_F |S B_{k+1}T^c_k T_{k+1}| ≥ δ_{k+1}(H_k + ε_k)} ≤ ψ_k,

because γ_{k+1} = δ_{k+1}/G_{k+1} ≤ δ_{k+1}/G_k.
For the contribution from the expected values, split according to which of B_k or B_{k+1} is larger, to bound the function B_{k+1}T^c_k T_{k+1} by

tail.mean.bnd <42>
  B_{k+1}{B_{k+1} ≥ B_k > γ_k} + B_k{B_k > γ_k} ≤ B_{k+1}{B_{k+1} > 2δ_{k+1}/G_k} + B_k{B_k > δ_k/G_k}.

Thus, via Assumption <18>,

  Σ_{i≤n} P B_{k+1}T^c_k T_{k+1}(ξ_i) ≤ δ_{k+1}D(G_k/2) + δ_k D(G_k).
Bound for the fourth term

By the symmetry in k and k + 1 in the argument for the second term, we can interchange their roles to get

tail.bracket.bnd <43>
  sup_F |S(f − f_k)T^c_{k+1} T_k| ≤ max_F |S B_k T^c_{k+1} T_k| + 2 max_F Σ_{i≤n} P B_k T^c_{k+1} T_k(ξ_i).
Assumption <38> applies, because γ_k = δ_k/G_k, to bound the first contribution by ψ_k. Arguing as for the second term, bound the contribution from the means by

  Σ_{i≤n} P B_k(ξ_i){B_k(ξ_i) > δ_k/(2G_{k+1})} + P B_{k+1}(ξ_i){B_{k+1}(ξ_i) > δ_{k+1}/G_{k+1}} ≤ δ_{k+1}D(G_{k+1}) + δ_k D(2G_{k+1}).
Recursive inequality

The recursive inequality <40> has now simplified to

  τ_k ≤ τ_{k+1} + 6ψ_{k+1} + 6δ_k D(2G_{k+1}).
Because τ_k ≤ 2nδ_k → 0 as k → ∞, recursive substitution leaves us with a bound for τ_0 less than

  5 Σ_{k=0}^∞ ψ_k + 2n Σ_{k=0}^∞ δ_k.

The integral term in Theorem <main> is a cleaner looking bound for the sum.

Better to stop at some finite k?
NewBennett <44> Lemma. Suppose ξ_i = (x_i, y_i) with all x_i and y_i mutually independent. Suppose |f(x, y)| ≤ M(y)λ(x), with P exp(M(y_i)) bounded for all i and |λ(x)| ≤ δ for all x. Write V for Σ_{i≤n} P λ(x_i)². Then

  P{S f ≥ η} ≤ exp(−η²/(4V))  provided η ≤ 2V/δ.
The Bennett inequality <7> for P{Σ_i W_i ≥ η} would follow from a minimization of exp(−ηt + ½t²σ²ω(t)) over t > 0.
[] 8. Notes

The two main results (Theorems <20> and <39>) give maximal inequalities for expected values (L₁ norms) and tail probabilities. These inequalities abstract and extend the arguments first developed by the Seattle group (Pyke 1983, Alexander & Pyke 1986, Alexander 1984, Bass 1985, Ossiander 1987) and subsequently generalized by the Paris group (Massart 1987, Birgé & Massart 1993, Doukhan et al. 1994, Birgé & Massart 1995). Some of the history of the main ideas is discussed in Section History.

Check with the comments from Pyke.

The argument to establish the maximal inequality of Assumption <19> for finite classes is essentially Pisier's (1983) method combined with the first step in the derivation of the Bernstein/Bennett inequality.

Donsker? cf. Parthasarathy (1967).
Revesz
aad
donsker
Dudley (1978) introduced the concept of metric entropy with bracketing in order to prove a functional central limit theorem for empirical processes indexed by classes of sets, later extending it to classes of functions in Dudley (1981).
History according to Ron Pyke

Ken Alexander saw the paper of Pyke (1983), and realized how to improve the truncation technique used there. He applied the improvement in Alexander (1984?). They wrote another joint paper (Alexander & Pyke 1986); see the note at the end of the paper. Bass (1985) applied the truncation to set-indexed partial-sum processes (the paper was not written up before December 1984). Bass and Pyke (paper around 1983?) recognized the truncation problem; they didn't use the best form of truncation. Mina Ossiander worked on her dissertation during the spring and summer of 1984, producing her thesis (published as Ossiander 1987) and a technical report in November-December of that year. She started from the Alexander & Pyke paper, then developed a more general form of the truncation argument (?). There were many discussions between Ossiander and Bass. The final publication dates are not indicative of the true order in which the work was carried out, because of delays in refereeing.

AGOZ
dmr
Cite Dehardt.
birge and Massart

Mention where bracketing comes from: history of Seattle contributions, as related by Ron Pyke. The method is based on the truncation/bracketing argument developed by the Seattle group. I have isolated the key ingredients into the following two assumptions.
For an application to dependent variables, DMR introduced a new type of norm for a random variable X. The quantile transformation constructs a random variable Q(U) with the same distribution as X, by means of a U distributed uniformly on (0, 1). For a nonnegative decreasing function L determined by mixing assumptions, they defined a norm by ‖X‖² = P L(U)Q(U)². This new norm has properties analogous to those of an L₂ norm. They proved elegant maximal inequalities for mixing processes based on bracketings with the new norm.
References
Alexander, K. S. (1984), Probability inequalities for empirical processes and a law of the iterated logarithm, Annals of Probability 12, 1041-1067.
Alexander, K. S. & Pyke, R. (1986), A uniform central limit theorem for set-indexed partial-sum processes with finite variance, Annals of Probability 14, 582-597.
Andrews, D. W. K. & Pollard, D. (1994), An introduction to functional central limit theorems for dependent stochastic processes, International Statistical Review 62, 119-132.
Bass, R. (1985), Law of the iterated logarithm for set-indexed partial-sum processes with finite variance, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 70, 591-608.
Billingsley, P. (1968), Convergence of Probability Measures, Wiley, New York.
Birgé, L. & Massart, P. (1993), Rates of convergence for minimum contrast estimators, Probability Theory and Related Fields 97, 113-150.
Birgé, L. & Massart, P. (1995), Minimum contrast estimators on sieves, in D. Pollard, E. Torgersen & G. L. Yang, eds, A Festschrift for Lucien Le Cam, Springer-Verlag, New York, pp. ??????
Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Inter-
changeability, Martingales, Springer, New York.
Collomb, G. (1984), Propriétés de convergence presque complète du prédicteur à noyau, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 66, 441-460.
Doukhan, P. & Portal, F. (1983), Moments de variables aléatoires mélangeantes, Comptes Rendus de l'Académie des Sciences, Paris 297, 129-132.
Doukhan, P. & Portal, F. (1984), Vitesse de convergence dans le théorème central limite pour des variables aléatoires mélangeantes à valeurs dans un espace de Hilbert, Comptes Rendus de l'Académie des Sciences, Paris 298, 305-308.
Doukhan, P., Massart, P. & Rio, E. (1994), The functional central limit theorem for strongly mixing processes, Annales de l'Institut Henri Poincaré ??, ??????
Dudley, R. M. (1978), Central limit theorems for empirical measures, Annals of Probability 6, 899-929.
Dudley, R. M. (1981), Donsker classes of functions, in M. Csörgő, D. A. Dawson, J. N. K. Rao & A. K. M. E. Saleh, eds, Statistics and Related Topics, North-Holland, Amsterdam, pp. 341-352.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application,
Academic Press, New York, NY.
Massart, P. (1987), Quelques problèmes de vitesse de convergence pour des processus empiriques, PhD thesis, Université Paris Sud, Centre d'Orsay. Chapter 1A = Massart (1986); Chapter 1B = Invariance principles for empirical processes: the weakly dependent case.
Ossiander, M. (1987), A central limit theorem under metric entropy with L₂ bracketing, Annals of Probability 15, 897-919.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic,
New York.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2
of NSF-CBMS Regional Conference Series in Probability and Statistics,
Institute of Mathematical Statistics, Hayward, CA.
Pollard, D. (1996), An Explanation of Probability, ??? (Unpublished book
manuscript.).
Pyke, R. (1983), A uniform central limit theorem for partial-sum processes indexed by sets, in J. F. C. Kingman & G. E. H. Reuter, eds, Probability, Statistics and Analysis, Cambridge University Press, Cambridge, pp. 219-240.
Rio, E. (1994), Covariance inequalities for strongly mixing processes, Annales de l'Institut Henri Poincaré ??, ??????
Sen, P. (1974), Weak convergence of multidimensional empirical processes for stationary φ-mixing processes, Annals of Probability 2, 147-154.
Yukich, J. E. (1986), Rates of convergence for classes of functions: the non-iid case, Journal of Multivariate Analysis 20, 175-189.
