IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-33, NO.
2, APRIL 1985 387
MAT1 WAX AND THOMAS KAILATH, FELLOW, IEEE Abstract-A new approach is presented to the problem of detecting the number of signals in a multichannel time-series, based on the appli- cation of the information theoretic criteria for model selection intro- duced by Akaike (AIC) and by Schwartz and Rissanen (MDL). Unlike the conventional hypothesis testing based approach, the new approach does not require any subjective threshold settings; the number of signals is obtained merely by minimizing the AIC or the MDL criteria. Si ul a- tion results that illustrate the performance of the new method for the detection of the number of signals received by a sensor array are presented. I I. INTRODUCTION N many problems in signal processing, the vector of observa- tions can be modeled as a superposition of a finite number of signals embedded in an additive noise. This is the case, for example, in sensor array processing, in harmonic retrieval, in retrieving the poles of a system from the natural response, and in retrieving overlapping echoes from radar backscatter. A key issue in these problems is the detection of the number of signals. One approach to this problem is based on the observation that the number of signals can be determined from the eigen- values of the covariance matrix of the observation vector. Bartlett [4] and Lawley [14] developed a procedure, based on a nFsted sequence of hypothesis tests, to implement this ap- proach. For each hypothesis, the likelihood ratio statistic is computed and compared t o a threshold; the hypothesis ac- cepted is the first one for which the threshold is crossed. The problem with this method is the subjective judgment required for deciding on the threshold levels. In this paper we present a new approach to the problem that is based on the application of the information theoretic cri- teria for model selection introduced by Akaike (AIC) and by Schwartz and Rissanen (MDL). The advantage of this ap- proach is that no subjective judgment is required in the deci- sion process; the number of signals is determined as the value for which the AIC or the MDL criteria is minimized. The paper is organized as follows. After the statement and formulation of the problem in Section 11, the information theoretic criteria for model selection are introduced in Section 111. The application of these criteria to the problem of detect- ing the number of signals and the consistency of these criteria Manuscript received May 19, 1983; revised June 1, 1984. This work was supported in part by the Air Force Office of Scientific Research, Air Force Systems Command, under Contract AF 49-620-79-C-0058, the U.S. Army Research Office, under Contract DAAG29-79-C-0215, and by the Joint Services Program at Stanford University under Con- tract DAAG29-81-K-0057. The authors are with the Information Systems Laboratory, Stanford University, Stanford, CA 94305. are discussed in Sections IV and V, respectively. Simulation results that illustrate the performance of the new method for sensor array processing are described in Section VI. Fre- quency domain extensions and some concluding remarks are presented in Sections VI1 and VIII, respectively. 11. FORMULATION OF THE PROBLEM The observation vector in certain important problems in sig- nal processing such as sensor array processing [15] [20] [5] [6] [lo] 121 [22] [25] harmonic retrieval [15] [12] pole retrieval from the natural response [ l l ] [24], and re- trieval of overlapping echos from radar backscatter [7] de- noted by the p l vector x(t), is successfully described by the following model x(t) A( q) n(t ) i = 1 where si( scalar complex waveform referred to as the ith signal A(@i ) = a p 1 complex vector, parameterized by an un- known parameter vector ai associated with the ith signal n( a p 1 complex vector referred t o as the additive noise. We assume that the q (q <p ) signals sl are complex (analytic), stationary, and ergodic Gaussian random processes, with zero mean and positive definite covariance ma- trix. The noise vector n( is assumed t o be complex, station- ary, and ergodic Gaussian vector process, independent of the signals, with zero mean and covariance matrix given by where a2 is an unknown scalar constant and is the identity matrix. A crucial problem associated with the, model described in (1) is that of determining the number of signals 4 from a finite set of observations x(tl), x(tN). A promising approach t o this problem is based on the struc- ture of the covariance matrix of the observation vector To introduce this approach, we first rewrite (1) as x(t) As(t) n(t) (2a) where A is the p 4 matrix A [A(%) N@q > I (2b) and is the 4 1 vector ST@) [s, (t) sq(t)] Because the noise is zero mean and independent of the sig- nals, it follows that the covariance matrix of is given by .OO 1985 IEEE
388 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-33, NO. 2, APRIL 1985 R t aZI (34 where ASAi (3b) with t denoting the conjugate transpose, and S denoting the covariance matrix of the signals, i.e., E[ s ( . s( s ) ~ ] Assuming that the matrix A is of full column rank, i.e., the vectors A(Qi ) (i 1 q) are linearly independent, and that the covariance matrix of the signals is nonsingular, it follows that the rank of Ik is q, or equivalently, the p 4 smallest eigenvalues of are equal t o zero. Denoting the eigenvalues of R by ha it follows, therefore, that the small- est p q eigenvalues of R are all equal to d i.e., X q + 1 h, 02. The number of signals q can hence be determined from the muZtiplicity of the smallest eigenvalue of R. The problem is that the covariance matrix R is unknown in practice. When estimated from a finite sample size, the resulting eigenvalues are all different with probability one, thus making it difficult to determine the number of signals merely by observing the eigenvalues. A more sophisticated approach to the problem, developed by Bartlett [4] and Lawley [14] is based on a se- quence of hypothesis tests. The problems associated with this approach is the subjective judgment needed in the selection of the threshold levels for the different tests. In this paper we take a different approach. We pose the de- tection problem as a model selection problem and then apply the information theoretic criteria for model selection intro- duced by Akaike (AIC) and by Schwartz and Rissanen (MDL). 111. INFORMATION THEORETIC CRITERIA The information theoretic criteria for model selection, intro- duced by Akaike [ l ] [2], Schwartz [21], and Rissanen [17] address the following general problem. Given a set of N obser- vations X {x(l), x(N)} and a family of models, that is, a parameterized family of probability densities f(XlO), select the model that best fits the data. Akaikes proposal was to select the model which gives the minimum AIC, defined by AIC -2 log f(Xl6) t 2k (6) where 6 is the maximum likelihood estimate of the parameter vector 0, and k. is tlie number of free adjusted parameters in 0. The first term is the well-known log-likelihood of tlie max- imum likelihood estimator of the parameters of the model. The second term is a bias correction term, inserted so as to make the AIC an unbiased estimate of the mean Kulback- Liebler distance bet wey the modeled density f(Xl0) and the estimated densityf(Xl0). Inspired by Akaikes pioneering work, Schwartz and Rissanen approached the. problem from quite different points of view. Schwartzs approach is based Bayesian arguments. He as- sumed that each competing model can be assigned a prior probability, and proposed to select the model that yields the maximum posterior probability. Rissanens approach is based on information theoretic arguments. Since each model can be used to encode the observed data, Rissanen proposed to select the model that yields the minimum code length. It turns out that in the large-sample limit, both Schwartzs and Rissanens approaches yield the same criterion, given by MDL= -lOgf(XIS) t i k log N. (7) Note that apart from a factor of 2, the first term is identical to the corresponding one in the AIC, while the second term has an extra factor of 3 log N. IV. ESTIMATING THE NUMBER OF SIGNALS To apply the information theoretic criteria to detect the number of signals, or equivalently, t o determine the rank of the matrix 9, we must first describe the family of competing models, or density functions that we are considering. Regard- ing the observations x( t l x(tN) as identical and statisti- caily independent complex Gaussian random vectors of zero mean, the family of models is necessarily described by the co- variance matrix of Since our models covariance matrix is given by (3), it seems natural to consider the following fam- ily of covariance matrices R( k ) 021 (8) where denotes a semipositive matrix of rank k, and u de- notes an unknown scalar. Note that k E (0, l , p l } ranges over the set of all possiblenumber of signals. Using the well-known spectral representation theorem from linear algebra, we can express R( k) as where XI, Xk and v,, Vk are the eigenvalues and eigenvectors, respectively, of R@) . Denoting by d k ) the parameter vector of the model, it follows that With this parameterization we now proceed to the derivation of the information theoretic criteria for the detection prob- lem. Since the observations are regarded as statistically inde- pendent complex Gaussian random vectors with zero mean, their joint probability density is given by Taking the logarithm and omitting terms that do not depend on the parameter vector d k ) , we find that the log-likelihood function L( @) is given by I,(@@)) -N log det R@) tr [ R@) ] - l (1 2a) where is the sample-covariance matrix defined by
WAX AND KAILATH: DETECTION OF SIGNALS BY INFORMATION THEORETIC CRITERIA 389 The maximum-likelihood estimate is the value of d k ) that maximizes (12). Following Anderson [3] these estimates are given by hi =l i i = l , . - . , k (1 3 4 where la lp and CI Cp are the eigenvalues a,nd eigenvectors, respectively, of the sample covariance matrix R. Substituting the maximum likelihood estimates (13) in the log-likelihood (1 2), with some straightforward manipulations, we obtain Note that the term in the bracket is the ratio of the geometric mean to the arithmetic mean of the smallest p -.k eigenvalues. The number of free parameters in d k ) is obtained by count- ing the number of degrees of freedom of the space spanned by d k ) . Recalling that the eigenvalues of a complex covariance matrix are real but that the eigenvectors are complex, it fol- lows that d k ) has k t 1 t ' 2pk parameters. However, not all of the parameters are independently adjusted; the eigenvectors are constrained to have unit norm and to be mutually orthog- onal. This amounts t o reduction of 2 k degrees of freedom due to the normalization and 2&k(k 1) degrees of freedom due to the mutua1.orthogonaIization. Thus, we obtain (number 'of free adjusted parameters) k t 1 t 2pk [ i k ( k l)] k(2p k ) t 1. (15) The form of AIC for this problem is therefore given by k) N t 2k(2p k) while the MDL criterion is given by k(2p k ) log N. 1 2 The number of signals is determined as the value of k g (0, 1, p I} for which either the AIC or the MDL is minimized. V. CONSISTENCY OF THE CRITERIA We have described two different criteria for estimating num- ber of signals. The natural question is which one should be preferred. What can be said about the goodness of the esti- mates obtained by these criteria. One possible benchmark test is the behavior as the sample size increases. One would prefer an estimator that yields the true number of signals with prob- ability one as the sample size increases to infinity. An estima- tor with this property is said to be consistent. By generalizing a method of proof given in Rissanen [18] and Hannan and Quinn [9], we shall show that the MDL yields a consistent estimate, while the AIC yields an inconsistent estimate that tends, asymptotically, t o overestimate the number of signals. The consistency of the MDL is proved by showing that in the large-sample limit, MDL(k) is minimized for k q. Taking first k 4 , it follows from (17) that 1 [MDL(q) MDL(k)] N log i = k + l q - k i = k + l t log Since the eigenvalues of the sample-covariance matrix l j (i 1, , 4 ) are consistent estimates of the eigenvalues of the true covariance matrix hi, it follows that in the large-sample limit the eigenvalues li (i k t 1, q) are not ali equal with probability one. Therefore, by the arithmetic-mean geometric- mean inequality it follows that in the large-sample limit This implies that the first term in (18) is negative with prob- ability one in' the large-sample limit'. Similarly, by the general- ized arithmetic-mean geometric-mean inequality w,A1 w 4 , >AY'A,W2 w 1 t wa it follows that in the large sample limit
390 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-33, NO. 2, APRIL 1985 This implies that the second term in (1 8) is also negative with probability one in the large-sample limit. Now, since the last term in (18) goes to zero as the sample size increases, it fol- lows that the difference [MDL(q) MDYk)] is negative with probability one in the large-sample limit for k q. Taking now k 4, it follows from (17) that 2 [MDL(k) MDL(q)] (k 4)(2p k log N Note that the terms in the curly bracket are twice the log- likelihoods of the maximum likelihood estimator under the hypotheses that the rank of is q and k, respectively. Thus, their difference is the likelihood-ratio for deciding between these two hypotheses. From the general theory of likelihood ratios (see, e.g., Cox and Hinkley [8]) it follows that the asymptotic distribution of this statistic is x with number of degrees of freedom equal to the difference of the dimensions of the parameter spaces under the two hypothesis, i.e., [ W P k) 11 M2 P 4) 11 (k q) ( 2p k Thus, as the sample size increase, the probability that the term in the curly bracket in (20) exceeds the first term in (20) is given by the area in the tail from (k 4) ( 2p k q) log N of the mentioned x2 distribution with (k q) ( 2p k 4 ) de- grees of freedom. Since the area in this tail approaches zero as the sample size increases, it follows that in the large-sample limit the difference [MDL(k) MDL(q)] is positive with prob- ability one for k q. Combining this with the previous result for k q, it follows that MDL(k) has a minimum at k q. Repeating the above arguments for the AIC, it follows that in the large-sample limit and for k <q , the difference [AIC(q) AIC(k)] is negative with probability one. However, for k 4, the difference [AIC(k) AIC(q)] has nonzero prob- ability to be negative even in the large-sample limit, since the tail from (k 4) ( 2p k q t 1) of the x2 distribution with (k q) ( 2p k 4 t 1) degrees of freedom is definitely not zero. Hence, the AIC tends to overestimate the number of signals q in the large-sample limit. We should note that from the analysis above it follows that any criteria of the form -log f(Xl4) t Q(N) k (23) where or(N) and or(N)/N+ 0 yields a consistent estimate of the number of signals. VI. SIMULATION RESULTS In this section we present simulation results that illustrate the performance of our method when applied to sensor array processing. By the well-known duality between spatial fre- quency and temporal frequency, these examples can also be interpreted in the context of harmonic retrieval. The exam- ples refer to a uniform linear array of p sensors with 4 incoher- ent sinusoidal plane waves impinging from directions { G1 , Assuming that the spacing between the sensors is equal to half the wavelength of the impinging wavefronts, the vector of the received signal at the array is then given by 2 A ( G ~ ) t n(t) (24a) where A(&) is the p 1 direction vector of the kt h wavefront. k=1 A ( $ ~ ) T [le-jTk e- j ( q- l ) Tk 1 (24b) with 7k 77 sin q5k (24c) and random phase uniformly distributed on (0, 277) n( vector of white noise with mean zero and covariance Note that this model is a special case of the general model presented in (1). In the first example, we considered an array with seven sen- sors ( p 7) and two sources (4 2) with directions-of-arrivals 20 and 25. The signal-to-noise ratio, defined as 10 log 1 /2u2, was 10 dB. Using N = 100 samples, the resulted eigenvalues of the sample-covariance matrix were 21.2359, 2.1717, 1.4279, 1.0979, 1,0544, 0.9432, and 0.7324. Observing the gradual decrease of the eigenvalues it is clear that the separation of the 3 smallest eigenvalues from the 2 large ones is a difficult task in which a naive approach is likely to fail. However, ap- plying the new approach we have presented above yielded the following values for the AIC and MDL. 0 1 2 3 4 5 6 AIC 1180.8 100.5 71.4 75.5 86.8 93.2 96.0 MDL 590.4 67.2 66.9 80.7 95.5 105.2 110.5. The minimum of both the AIC and the MDL is obtained, as expected, for the 4 2. In the second example, we added another source at 10 to the scenario described in the first example. The eigenvalues of the sample-covariance in this cpe were 15.5891, 8.1892, 1.4715, 1.1602, 1.0172, 0.9210, and 0.6528. The resulting values of the AIC and the MDL were as follows.
&AX AND KAILATH: DETECTION OF SIGNALS BY INFORMATION THEORETIC CRITERIA 391 AIC MDL In this case, the minimum value of both the AIC and the MDL is obtained, incorrectly, for 2. The failure of the AIC and MDL in this case is because of the inherent limitations of the problem. Indeed, repeating the example with signal-to-noise ratio of 10 dB yielded the following eigenvalues 14.5595, 6.7786,0.3786,0.1372,0.1109,0.0946, and 0.0687, and con- sequentially the following values of the AIC and the MDL. 5 AIC MDL Now the minimum of both the AIC and the MDL is obtained, correctly, for q 3. VII. EXTENSION TO THE FREQUENCY DOMAIN The starting point of our approach was the time domain rela- tion (1). However, in certain cases, especially in array process- ing, the frequency domain is more natural. As we shall show, our approach is easily extended t o handle frequency-domain observations. Consider the frequency domain analog of the time domain model (1); the observed vector is a Fourier coefficients vector expressed by 4 d w n ) A(mn ai) si(wn) n(wn> (25) i = where si (wn)= the Fourier coefficient of the i t h signal at fre- quency o n , A(w,, ai) a p 1 complex vector, determined by the parameter vector Qii associated with the i t h signal n( wn) a p 1 complex vector of the Fourier coefficients of the additive noise. The problem, as in the time domain, is to estimate the num- ber of signals 4 from L samples of the Fourier-coefficients vec- By the well-known analogy between multivariate analysis and time-series analysis (see, e.g., Wahba [ 23] ) , the time- domain approach carries over to the frequency domain with the role of the sample-covariance matrix played by the peri- odogram estimate of the spectral density matrix, given by t or xl ( on) , , xL( an) - Thus, the frequency-domain version of the AIC is given by f V where Zl (on) lP(wn) are the eigenvalu? of the peri- odogram estimate of the spectral-density matrix K( wn) . When the signals are wide band, namely, occupy several frequency bins, say, 01,. OZ +M, the information on the number of signals is contained in all the M frequency bins. As- suming that the observation time is much larger than the cor- relation times of the signals, it follows that the Fourier coeffi- cients corresponding to different frequencies are statistically independent (see, e.g., Whalen [ 26, p. 811). The AIC for de- tecting the number of wide-band signals that occupy the fre- quency band is given by the sum of (27) over the frequency range of interest 2M[k(2p k) ] (2 8) The corresponding expression for the MDL criterion can be similarly derived. VIII. CONCLUDING REMARKS A new approach to the detection of the number of signals in a multichannel time series has been presented. The approach is based on the application of the AIC and MDL information theoretic criteria for model selection. Unlike the conventional hypothesis test based approach, the new approach does not re- quire any subjective threshold settings; the number of signals is determined merely by minimizing either the AIC or the MDL criterion. It has been shown that the MDL criterion yields a consistent estimate of the number of signals, while the AIC yields an in- consistent estimate that tends, asymptotically, to overestimate the number of signals. The detection problem addressed in this paper is usually a part of a more complex combined detection-estimation prob- lem, where one wants to estimate the number as well as the parameters of the signals in (1). Because of the computational complexity involved, the problem is usually solved in two steps: first the number of signals is detected, and then, with an estimate of the number of signals at hand, the parameters of the signals are estimated. It should be pointed out, however, that in principle the AIC and MDL criteria can be applied to the combined detection-estimation problem [ 2 6 ] . The evaluation of the AIC and MDL in this case in- ;elves th% computation of the maximum likelihood estimates 1 , @k, for every possible number of signals k E p 1) . Solving this highly nonlinear problem for all Val- ues of k is computationally very expensive. Nevertheless, if one is ready to pay the price, the gain in term of performance, especially in difficult situations such as low signal-to-noise ratio, small sample size, or closely spaced signals, may be significant. REFERENCES H. Akaike, Information theory and an extension of the maxi- mum likelihood principle, in hoc. 2nd Int. Symp. Inform.
392 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-33, NO. 2, APRIL 1985 Theory, suppl. Problems of Control and Inform. Theory, 1973, 21 A new look at the statistical model identification, IEEE Trans. Automat. Contr., vol. AC-19, pp. 716-723, 1974. [3] T. W. Anderson, Asymptotic theory for principal component analysis,Ann. J. Math. Stat., vol. 34,, pp. 122-148, 1963. [4] M:S. Bartlett, A note on the multiplying factors for various x2 approximations, J. Roy. Stat. SOC., ser. E, vol. 16, pp. 296-298, 1954. [5] G. Bienvenu and L. Kopp, Adaptivity t o background noise spa- tial coherence for high resolution passive methods, in Proc. ICASSP80, Denver, CO, 1980, pp. 307-310. [6] T. P. Bronez and J. A. Cadzow, An algebraic approach to super resolution array processing, IEEE Trans. Aerosp. Electron. [7] A. Bruckstein, T. J. Shan, and T. Kailath, The resolution of overlapping echoes, IEEE Trans. Acoust., Speech, Signal Pro- cessing, submitted for publication. [8] D. R. Cox and D. V. Hinkley, Theoretical Statistics. London, England: Chapman and Hall, 1974. [9] E. J. Hannan and B. G. Quinn, The determination of the order of an autoregression, J. Roy. Stat. Soc., ser. E, vol. 41, pp. 190- 195,1979. [ l o] D. H. Johnson and S. R. Degraff, Improving the resolution of bearing in passive sonar arrays by eigenvalue analysis, IEEE 647,1982. Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 638- 111 R. Kumaresan and D. W. Tufts, Estimating the parameters of exponentially damped sinusoids and pole-zero modeling in noise, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, 121 Estimation of arrival of multiple plane waves, IEEE Trans. Aerosp. Electron. Syst., vol. AES-19, pp. 123-133, 1983. [13] S. Y. Kung and Y. H. Hu, Improved Pisarenkos sinusoidal spec- trum estimate via the SVD subspace approximation method, inProc. CDC, Orlando, FL, 1982, pp. 1312-1314. [14] D. N. Lawley, Tests of significance of the latent roots of the covariance and correlation matrices, Biometrica, vol. 43, pp. 151 N. L. Owsley, Spectral signal set extraction, Aspects of Signal Processing: Part II, G. Tacconi, Ed., Dordecht, Holland: Reidel, [16] V. F. Fisarenko, The retrieval of harmonics from a covariance function, Geophys. J. Roy. Astron. vol. 33, pp. 247-266, 1973. 171 J. Rissanen, Modeling by shortest data description, Automatica, [18] Consistent order estimation of autoregressive processes by shortest description of data, Analysis and Optimization of Sto- chastic Systems, Jacobs et al. Eds. New York: Academic, 1980. [19] A universal prior for the integers and estimation by mini- mum descriptionlength,Ann. Stat.,vol. ll,pp.417-431, 1983. I201 R. Schmidt, Multiple emitter location and signal parameter estimation, in Pvoc. RADC Spectral Estimation Workshop, Rome, NY, 1979, pp. 243-258. [21] G. Schwartz, Estimating the dimension of a model, Ann. Stat., [22] G. Su and M. Morf, The signal subspace approach for multiple emitter location, in Proc. 16th Asilomar Con$ Circuits Syst. Comput., Pacific Grove, CA, 1982, pp. 336-340. [23] G. Wahba, On the distribution of some statistics useful in the analysis of jointly stationary time-series, Ann. Math. Stat., vol. [24] M. Wax, R. Schmidt, and T. Kailath, Eigenstructure method for retrieving the poles from the natural response, in Proc. 22nd CDC, San Antonio, TX, 1983, pp. 1343-1344. pp. 267-281. Syst., VO~ . AES-19, pp. 123-133, 1983. pp. 833-840,1983. 128-136,1956. pp. 469-473,1977. VO~ . 14, pp. 465-471,1978. VO~ . 6, pp. 461-464,1978. 39, pp. 1849-1862,1968. [25] M. Wax, T. J. Shan, and T. Kailath, Spatio-temporal spectral analysis by eigenstructure methods, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 817-827,1984. [26] M. Wax, Detection and estimation of superimposed signals, Ph.D. dissertation, Stanford Univ., Stanford, CA, Mar. 1985. [27] A. D. Whalen, Detection of Signals in Noise. New York: Aca- demic, 1971.