Está en la página 1de 14

Accurate estimator of correlations between asynchronous signals

Bence Toth

arXiv:0805.2310v1 [physics.data-an] 15 May 2008

ISI Foundation - Viale S. Severo 65,


10133 Torino, Italy
and
Institute of Physics,
Budapest University of Technology and Economics
- Budafoki u
t. 8. H-1111 Budapest, Hungary
Janos Kertesz
Institute of Physics and HAS-BME Condensed Matter Physics Group,
Budapest University of Technology and Economics
- Budafoki u
t. 8. H-1111 Budapest, Hungary
(Dated: May 16, 2008)

Abstract
The estimation of the correlation between time series is often hampered by the asynchronicity of
the signals. Cumulating data within a time window suppresses this source of noise but weakens the
statistics. We present a method to estimate correlations without applying long time windows. We
decompose the correlations of data cumulated over a long window using decay of lagged correlations
as calculated from short window data. This increases the accuracy of the estimated correlation significantly and decreases the necessary efforts of calculations both in real and computer experiments
by orders of magnitude.
PACS numbers: 05.45.Tp, 06.30.Ft, 05.40.Ca, 89.65.Gh

Electronic address: bence@maxwell.phy.bme.hu

Electronic address: kertesz@phy.bme.hu

I.

INTRODUCTION

Correlations between time series are fundamental for understanding and interpreting
stochastic processes. Very often the time between the signals of a series is distributed in an
uneven fashion, causing asynchronicity of the compared series.
For example, continuous time random walks [1] can be used to describe a broad range
of processes from transport in disordered solids [2, 3] to finance [4]. Correlated continuous
random walks produce asynchronous time series where the above mentioned problems occur.
Another example is the case of neutron activation analysis [5]. These experiments are
used for non-destructive analysis of materials in order to determine the concentration of
their constituents. In the analysis the specimen is bombarded by neutrons coming from a
source, causing the elements to form radioactive isotopes, and from the spectra of the of the
emissions of the radioactive sample, the concentration of the elements can be determined.
Since the radiation appears together for all kinds of atoms, different elements are going to
radiate in a correlated but asynchronous way [5, 6].
In case of financial correlations the problems arising from asynchronous signals is called
the Epps effect [7]. Epps showed first in 1979, that stock return correlations decrease as the
sampling frequency of data increases. Since his discovery the phenomenon has been detected
in several studies of different stock markets and foreign exchange markets [8, 9, 10, 11, 12, 13].
As an illustration we show the correlations between the returns of two liquid stocks as a
function of the time window in Figure 1.
When computing the Pearson correlation coefficient of two stationary signals we have
to face an important problem: The correlation measure is suitable to determine the grade
of co-movements of synchronous observations, while the signals often occur asynchronously.
The usual way to handle this problem is to cumulate data over a time window t and look
for the correlations between these binned data. However, for short t the noise due to
asynchronicity may reduce the measured correlations significantly.
In order to approach the asymptotic, proper value of the correlation coefficient, t should
be much larger than the scale of asynchronicity. On the other hand, this leads to the
reduction of the statistics, consequently it makes the estimates inaccurate.
It has been suggested to use measures of correlation other than the Pearson coefficient to
overcome the problem of asynchronicity. Ref. [14] presents a method of measuring covariance
2

0.25

KO/PEP

0.2

0.15

0.1

0.05

0
0

1000

2000

3000

4000

5000

t [sec]

6000

7000

8000

9000

FIG. 1: (Color online) The correlation coefficient between the price changes of the stocks of CocaCola and Pepsi as a function of sampling time scale. The measured correlation coefficient increases
with the binning time window t and t several hours are needed to have a good estimate.

based on Fourier series analysis of data. This method has been applied by Refs. [15] and
[16, 17] in the study of financial correlations. While the Fourier method is indeed somewhat
less sensitive to asynchronicity, the problem cannot be eliminated by its use. Refs. [18, 19]
propose a new estimator of the covariance of two diffusion processes that are observed only
at discrete times in a non-synchronous manner. Their estimator uses all available data and
does not require synchronization of observations, however in the presence of noise it becomes
inconsistent and its variance diverges. A good comparison of several covariance estimators
can be found in Refs [20, 21].
In this paper we describe an appropriate decomposition of the expression for the Pearson
correlation coefficient for large t by using the value of the coefficient for small time window
and decay of lagged correlations, which can also be calculated by using the good statistics
high resolution data only. We applied our method [22] already to explain the Epps effect
[7]. Here we show how the decomposition can be used to accurately estimate the asymptotic
correlation coefficients.
The paper is organized as follows: In Section II we review the decomposition of correlations. Section III contains the description of our method together with some demonstrative
results. We end the paper with a short summary in Section IV.

II.

DECOMPOSITION OF CORRELATIONS

Let us consider a system, where discrete stationary signals arrive from two different
sources (A and B) at different time instances resulting in two correlated time series. We
count the hits arriving, and denote their cumulative number measured from a reference time
t = 0 at time t by cA (t) and cB (t) respectively.
We are interested in the correlation between the number of hits arriving to our sensor in
a certain time window, thus the change in cA and cB . This will be denoted by
A
rt
(t) = cA (t) cA (t t).

(1)

The general Pearson correlation measure with time lag is defined by

A/B
Ct ( )

A

B
rt (t)rt
(t + )
=
,
AB

(2)

where rt (t) is the deviation of rt (t) from its mean,

A
B
(t + ) =
rt (t)rt

N
1 X A
B
rt (it)rt
(it + ),
=
N i=1

(3)

A
B
with N = [(T )/t] and A ( B ) is the standard deviation of rt
(rt
).

We use h. . . i for denoting time average. The equal-time correlation coefficient is naturally:
A/B

A/B

t Ct ( = 0).
As we can see, already by the definition, the length of the time window, i.e. the value of
t has a major role in measuring the correlation.
Assuming t = nt0 with n being a positive integer, we can deduce the following relationship between correlations on the two different time scales

A/B

t =

n1
X

x=n+1

!1/2

A

B

B
(n |x|) rt
(t)rt
(t + xt0 ) n2 rt
(t) rt
(t)
0
0
0
0

n1
X

x=n+1

A
2
A
(n |x|) rt
(t)rt
(t + xt0 ) n2 rt
(t)
0
0
0

n1
X

x=n+1

B
2

B
B
2
(t
+
xt
)

n
r
(t)
(n |x|) rt
(t)r
0
t0
t0
0

!1/2

(4)

More details on the deduction of Equation 4 can be found in Ref. [22], where we studied

A

the case when rt
(t) vanishes (which can often be assumed for financial returns), that
0
leads to a somewhat simpler equation. It is plausible to set t0 as the shortest meaningful
time scale in the system, that has to be chosen with the actual problem in mind.
This decomposition can be used to accurately estimate correlations by using high resolution, i.e., good statistics data.

III.

METHOD

Expression (4) enables to calculate the correlation coefficient for any sampling time scale,
t, by knowing the coefficient on a shorter sampling time scale, t0 , and the decay of lagged
correlations on the same shorter sampling time scale (given that t is multiple of t0 ).
The method we would like to propose relies on this decomposition of correlations.
We suggest the following procedure. Data should be binned with a small t0 such that a
good statistics is achieved, irrespective of the fact that noise due to asynchronicity may be
considerable. Then the correlation functions and the decay of lagged correlations should be
calculated using these data (of course the calculated correlation can be expected to be too
small). Plugging in these quantities into Equation (4) with a large enough t we obtain a
good estimate for the proper correlation. Using different values of t an extrapolation to
the proper, asymptotic correlation coefficient is possible.
We demonstrate the method in more details in this section. We use correlated random
walks and show that in case of directly measuring the correlations in large t time windows
we need very long time series in order to have a good estimate for correlations. When using
the decomposition method, we use high resolution data and thus can achieve an estimate
for the correlations with the same accuracy from a dataset of much shorter time span.
5

10020

position

10010

10000

W(t)
cA (t)
cB (t)
9990
0

100

200

300

400

time

FIG. 2: (Color online) A snapshot of the model with exponentially distributed waiting times. The
original random walk is shown with lines (black), the two sampled series with triangles and lines
(red) and dots and lines (blue).
A.

Demonstration

We construct correlated asynchronous time series in the following way: As a first step
we generate a core random walk with unit steps up or down in each second with equal
probability (W (t)). Second we sample the random walk, W (t), twice independently with
waiting times drawn from some distribution. This way we obtain two time series (cA (t) and
cB (t)), which are correlated since they are sampled from the same core random walk, but
the steps in the two walks are asynchronous. A snapshot of the random walks can be seen
in Figure 2.
The core random walk is:

W (t) = W (t 1) + (t),

(5)

where (t) is 1 with equal probability. Below we present results first for the exponentially
distributed waiting times between steps and second for the waiting times between steps being
drawn from a Weibull distribution. We study the correlation between the changes in the
position of the two random walkers.

1.

Exponential waiting times

We generate 50 pairs of time series, as described above, with waiting times between
consecutive changes from the distribution:

P(y) =

ey if y 0
0

y<0

(6)

with parameter = 1/60. Each time series has a length of 25000 time steps. Naturally,
since the time series are finite, the correlation coefficients that we measure will have errors. However, in the case of exponential waiting times, we can analytically determine the
correlation coefficients [23]:

A/B

t =


1
et 1 + 1.
t

(7)

In Figure 3 we show the results for the correlation coefficients on different time scales,
where the shortest time scale used was t0 = 10.
The black circles show the directly measured correlation coefficients on different scales.
Clearly, they have a considerable variance. The blue triangles and line show the exact analytical results corresponding to the analytical formula, which takes into account the noise
due to asynchronicity. The red squares and line show the result obtained by the decomposition method, choosing an arbitrarily chosen pair of time series, measuring the correlations as
well as their decay on the high frequency data and determining the correlations on all time
scales, using (4). As we can see, while the directly measured points (black circles) have a
quite large variance, the values obtained through decomposition are very close to the exact
correlation values.

2.

Weibull waiting times

In order to demonstrate the power of the method for a non Poisson process we generate
50 pairs of time series, as described above, with waiting times between consecutive changes
generated from a Weibull distribution:

25000 points;

=1/60

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

direct measurements
decomposition

0.1

analytical

200

400

600

800

1000

FIG. 3: (Color online) Comparison of the directly measured correlation coefficients and the coefficients determined through the decomposition method in case of exponentially distributed waiting
times. The black circles show the results of the directly measured correlations on the ensemble of
50 time series pairs. Red squares and line are determined by using one of the time series pairs
arbitrarily and applying the decomposition method. Blue triangles and line are the exact values
of the underlying correlation.

P(y) =

b y b1 (y/a)b
( ) e
a a

if y 0

y<0

(8)

with parameters a = 20 and b = 0.7. Again each time series has a length of 25000 time
steps and the directly measured correlation values have large variance. In Figure 4 we show
the results for the correlation coefficients on different time scales, again with t0 = 10.
Here we do not know the exact solution and compare the estimates with the average of
the empirical values. Again what we see is, that we get a good estimate of the means from
the decomposition formula while the variance of the measured values is quite large.
The two above examples on generated time series show that using our method and estimating the correlation coefficient between asynchronous signals from the high frequency
data leads to much smaller variation of the results than in case of direct measurement of
correlations on lower frequency data.
Generally, one is interested in the asymptotic value of the correlations, i.e. in the limit of
t . As we can see, even for the numerically generated data, determining the correlation
8

25000 points; a=20; b=0.7


1.0

0.9

0.8

0.7

0.6

0.5

0.4

0.3

direct measurements

0.2

decomposition
mean of ensemble

0.1

200

400

600

800

1000

FIG. 4: (Color online) Comparison of the directly measured correlation coefficients and the coefficients determined through the decomposition method in case of Weibull distributed waiting times.
The black circles show the results of the directly measured correlations on the ensemble of 50 time
series pairs. Red squares and line are determined by using one of the time series pairs arbitrarily
and applying the decomposition method. Blue triangles and line are the ensemble average of the
black circles.

for the scale of t = 1000 we still get a correlation different from the underlying asymptotic
correlation, that is 1. A possible way to determine the asymptotic correlation value is using
some extrapolation method.
We demonstrate that applying the most simple extrapolation method for the generated
data we can determine the exact underlying correlation with very good accuracy. We present
the results for the case of Weibull distributed waiting times. Since we are interested in the
t value, we use the plot of t as a function of 1/t for the extrapolation. Figure
5 shows the correlation points and the curve determined by a fourth order polynomial
extrapolation. The extrapolated curve intercepts the y-axis at the value of 1.01, which is
very close to the actual asymptotic correlation value, with an error of around 1%.

B.

Decay functions

As we can see from Equation (4) a subtle point of the correlation estimation is the measurement of the decay of lagged correlations on the short (t0 ) time scale, that we will call de
A

A
cay functions. There are three decay functions we have to measure: rt
(t)rt
(t + xt0 ) ,
0
0
9

1.0

0.8

0.6

0.4

0.2

0.0
0.00

0.01

0.02

0.03

0.04

0.05

1/ t

FIG. 5: (Color online) The correlation as a function of 1/t for the Weibull distributed waiting
times. The circles show the correlations determined using the decomposition method, and the (red)
curve is the fourth order polynomial extrapolation of the correlation values. The extrapolation gives
an asymptotic correlation value of 1.01, very close to the actual underlying value, that is 1.

A

B
B
rt0 (t)rt
(t + xt0 ) and rt
(t)rt
(t + xt0 ) . To have good estimation of the asymp0
0
0
totic correlation value, one has to have precise measurements of these decay functions.

In the above examples we had random walks as underlying processes, thus by definition
the autocorrelations are delta functions. However, because of the asynchronous sampling,
the cross-correlation is being smeared out and instead of a delta function we have finite
decay of the cross-correlation. These cases are simple from the point of view of the decay
functions. To demonstrate that in case of more complicated vanishing decay functions the
decomposition can still give a good estimation of the asymptotic correlation, we consider the
following time series. We generate a persistent random walk [24, 25], ie. a walk, where the
probability, , of jumping in the same direction as in the previous step is higher than 0.5.
Then, as we have done before, we sample the persistent random walk twice independently
with waiting times drawn from an Weibull distribution (again with parameters a = 20
and b = 0.7). This construction generates slowly vanishing decay functions (that would be
exponentially decaying without the asynchronous sampling). Again we generate 50 pairs of
time series, each being 25000 steps long, the persistency is = 0.999. Figures 6 and 7 shows
two of the decay functions (the decay of the autocorrelations are identical so we only show
one of them). Here we set t0 = 50.
In Figure 8 we show the results for the correlation coefficients on different time scales.
10

1.0

0.6

0.4

0.2

< r

to

A
(t)r

to

(t+ )

>

0.8

0.0

-0.2

1000

2000

3000

A

A (t + xt ) in case of the persistent random
FIG. 6: (Color online) The decay function, rt
(t)rt
0
0
0

walk, sampled with Weibull waiting times. The autocorrelations decrease slowly.

0.6

0.5

0.2

0.1

0.0

to

A
(t)r

0.3

< r

to

(t+ )

>

0.4

-0.1

-0.2

-3000

-2000

-1000

1000

2000

3000

A

B (t + xt ) in case of the persistent random
FIG. 7: (Color online) The decay function, rt
(t)rt
0
0
0

walk, sampled with Weibull waiting times. The cross-correlations decrease slowly.

The decomposition method gives good results in this case too.


Figure 9 shows the extrapolation to the asymptotic value of the correlation, using piecewise Cubic Hermite Interpolation method. The extrapolated curve intercepts the y-axis at
the value of 1.00.

11

1.0

0.9

0.8

0.7

0.6

0.5

direct measurements
decomposition
0.4

mean of ensemble

200

400

600

800

1000

FIG. 8: (Color online) Comparison of the directly measured correlation coefficients and the coefficients determined through the decomposition method in case of the persistent random walk, with
Weibull sampling. The black circles show the results of the directly measured correlations on the
ensemble of 50 time series pairs. Red squares and line are determined by using one of the time
series pairs arbitrarily and applying the decomposition method. Blue triangles and line are the
ensemble average of the black circles.

1.0

0.9

0.8

0.7

0.6

0.5

0.000

0.005

0.010

0.015

0.020

1/ t

FIG. 9: (Color online) The correlation as a function of 1/t for the persistent random walk, with
Weibull sampling. The circles show the correlations determined using the decomposition method,
and the (red) curve is determined by using Piecewise Cubic Hermite Interpolation method. The
extrapolation gives an asymptotic correlation value of 1.00.

12

IV.

SUMMARY

In this paper we discussed the problem of estimating the correlation coefficient between
two asynchronous signals. While the direct use of high resolution data results in an underestimation of the correlations, coarser binning of the the data leads to larger errors due to
loss of data.
We proposed a method, which enables to estimate the asymptotic value of correlations
from the high frequency data, without the need of using longer time scales and thus without
using worse statistics. The correlations from the high frequency data can be determined
very accurately, based on the good statistics. We demonstrated our method on generated
datasets, showing that the error of correlations determined by our method is much smaller
than the errors of correlations measured directly, using long time windows. Extrapolating to
the asymptotic correlation from the determined correlation values leads to a very accurate
estimation of the underlying correlation. A very important question in the estimation of the
asymptotic correlation value is the determination of the shortest meaningful time scale, t0 ,
on which we measure the decay functions. The asynchronicity of the signals slows down the
decrease of the decay functions. In the paper we showed that also in case of non-trivial decay
functions the decomposition gives a good estimation of the asymptotic correlation value.

Acknowledgments

We thank Zoltan Szatmary and Zsolt Kajcsos for useful discussions. Support by OTKA
K60456 and T049238 is acknowledged.

[1] E. W. Montroll and G. H. Weiss, J. Math. Phys. 6, 167 (1965).


[2] H. Scher and M. Lax, Physical Review B 7, 4491 (1973).
[3] H. Scher and M. Lax, Physical Review B 7, 4502 (1973).
[4] E. Scalas, Physica A 362, 225 (2006).
[5] G. Hevesy and H. Levi, Nature 137, 185 (1936).
[6] D. Soete, R. Gijbels, and J. Hoste, Neutron Activation Analysis (Wiley Interscience, New
York, 1972).

13

[7] T. W. Epps, Journal of the American Statistical Association 74, 291 (1979).
[8] M. Lundin, M. Dacorogna, and U. A. M
uller, Correlation of high-frequency financial time
series (Wiley & Sons, 1999).
[9] G. Bonanno, F. Lillo, and R. N. Mantegna, Quantitative Finance 1, 1 (2001).
[10] A. Zebedee, Working Paper, San Diego State University (2001).
[11] J. Muthuswamy, S. Sarkar, A. Low, and E. Terry, Journal of Futures Markets 21, 127 (2001).
[12] M. Tumminello, T. D. Matteo, T. Aste, and R. N. Mantegna, Eur. Phys. J. B. 55, 209 (2006).
[13] B. T
oth and J. Kertesz, Physica A 360, 505 (2006).
[14] P. Malliavin and M. E. Mancino, Finance Stochast. 6, 49 (2002).
[15] R. Ren`o, International Journal of Theoretical and Applied Finance 6, 87 (2003).
[16] O. Precup and G. Iori, Physica A 344, 252 (2004).
[17] G. Iori and O. V. Precup, Physical Review E 75 (2007).
[18] T. Hayashi and N. Yoshida, Bernoulli 11, 359 (2005).
[19] T. Hayashi and S. Kusuoka, Statistical Inference for Stochastic Processes 11, 93 (2008).
[20] V. Voev and A. Lunde, Journal of Financial Econometrics 5, 68 (2006).

[21] A. Palandri (2006), URL http://www.econ.ku.dk/Events News/Zeuthen/2006/Workshop 06/Papers/Palan


[22] B. T
oth and J. Kertesz (2007), URL http://arxiv.org/pdf/0704.1099.
[23] B. T
oth,

B. T
oth,

and J. Kertesz,

Proceedings of SPIE 6601 (2007),

URL

http://arxiv.org/pdf/0704.3798.
[24] R. Furth, Ann. Phys. (Leipzig) 43, 177 (1917).
[25] G. H. Weiss, Aspects and Applications of the Random Walk (North-Holland, Amsterdam,
1994).

14