Documentos de Académico
Documentos de Profesional
Documentos de Cultura
Queen Mary University of London, Centre for Digital Music, Mile End Road, London
density
mean = 0.15
? mean = 0.2
Difference
?
single absolute threshold
Cumulative normalisation
threshold distribution
Q
@ Q
R
@
+ s
Q 0.0 0.2 0.4 0.6 0.8 1.0
Thresholding Probabilistic thresholding
? ? Yin threshold parameter
YIN: single frequency PYIN:
estimate f multiple frequency-probability Fig. 2: The set of Beta distributions used as parameter priors
pairs {(f1 , p1 ), (f2 , p2 ), . . .}
for PYIN.
Fig. 1: Comparison of the first steps of the original YIN al-
gorithm the proposed PYIN algorithm, with our contribution
in bold print. Both have further steps for pitch refinement and = 18, 11 13 , 8), as shown in Fig.2. Given such a distribution,
pitch track smoothing, not pictured. and the prior probability pa of using the absolute minimum
strategy we can then define the probability that a period is
the fundamental period 0 according to YIN as
auto-correlation function (ACF)
N
X
t+W
X P ( = 0 |S, xt ) = a(si , ) P (si ) [Y (xt , si ) = ], (4)
rt ( ) = xj xj+ , (2) i=1
j=t+1
where [ ] is the Iverson bracket evaluating to unity for a true
from which (1) can be calculated as expression and to zero otherwise, and
(
dt ( ) = rt (0) + rt+ (0) 2rt ( ). (3) 1, if d0 ( ) < si
a(si , ) = (5)
pa , otherwise.
These two calculations comprise the first two steps of the
original YIN algorithm, as illustrated in Fig. 1. The third We use pa = 0.01. Note that if pa < 1, then (4) does not
step of the original YIN algorithm is a normalisation of the necessarily sum to unity. The remaining probability mass can
difference (1) obtaining a cumulative mean normalised dif- be interpreted as the probability that the frame is unvoiced.
ference function d0 ( ) (we omit the subscript index t for sim- Any for which P ( = 0 |S, xt ) > 0 yields a funda-
plicity) via a heuristic designed to compensate for low values mental frequency candidate f = 1/ . As in the original YIN
at short periods (high frequencies) induced by formant reso- algorithm frequency estimates are improved by parabolic in-
nances (for details see [7]). terpolation on the difference function d0 .
The fourth step of the original YIN algorithm is to find the The set of fundamental frequency candidates along with
dip in the difference function d0 that corresponds to the fun- their probabilities is the output of the first stage of the pro-
damental period. This is done by picking the smallest period posed PYIN algorithm. It has several appealing properties:
for which d0 has a local minimum and d0 ( ) < s for a fixed
threshold s (usually s = 0.1 or s = 0.15). In the case that 1. Any for which (4) is non-zero is a genuine YIN fre-
d0 ( ) > s for all (above threshold), the original YIN paper quency estimate for some threshold s, at a minimum of
proposes arg min d0 ( ) as the period estimate; alternatively the d0 difference function.
this can be used to estimate the pitch as unvoiced [11]. We
notate the period estimated by YIN as Y (xt , s). 2. The probabilistic estimate can be obtained in one orig-
Since both the choice of s and the strategy for handling the inal YIN loop with minimal computational overhead.
case that all values are above the threshold affect the result, 3. The conventional Yin estimate is among the candidates,
we propose to abandon the use of a single absolute thresh- i.e. PYIN covers at least as many true fundamental fre-
old and instead use a prior parameter distribution S given by quencies as the original YIN.
P (si ), where si , i = 1, . . . , N are possible thresholds. We
use thresholds ranging from 0.01 to unity in steps of 0.01, This last point is the main motivation for our method. In
i.e. N = 100. The distributions used in our experiments the original YIN algorithm, once an estimate is erroneous, the
are Beta distributions with means 0.1, 0.15, 0.2 ( = 1 and true value cannot be recovered. Our modification is based on
orig. noise sound live rec. phone rec. clip. peak is at 0 semitones. The window is always normalised to
full cand. 0.993 0.993 0.990 0.750 0.953 0.985
sum to 1. For more information refer to the source code. As-
YIN .10 0.988 0.980 0.886 0.648 0.880 0.979
YIN .15 0.989 0.988 0.934 0.644 0.843 0.958 suming independence between voicedness and pitch, the ac-
YIN .20 0.980 0.986 0.953 0.632 0.803 0.931 tual transition probability between two states defined by pitch
and voicedness is simply the product of the two individual
probabilities. The initial probabilities are set to be uniformly
Table 1: Recall by YIN parameter and degradation. distributed over the unvoiced states, and the HMM is decoded
using an efficient version of the Viterbi algorithm that exploits
the sparseness of the transition matrix.
the observation that in many cases there exists an unknown
threshold for which YIN will output the correct pitch, which
is illustrated in Table 1: PYINs set of candidates always has 3. RESULTS
a higher pitch recall than any YIN parametrisation. We will
show later that this substantially boosts the effectiveness of We synthesised singing from the F0 pitch tracks available
pitch tracking in Stage 2 of our proposed PYIN method. with the RWC Music Database [12] and saved them as lin-
ear PCM wav files at a sample rate of 44.1kHz. The tracks
2.2. Stage 2: HMM-based Pitch Tracking cover the 100 full-length songs of the popular music subsec-
tion of the database. In order to simulate a more realistic sit-
The pitch tracking step consists of choosing at most one pitch
uation without such clean data, we degraded the audio using
candidate at every frame. We divide the pitch space into M =
five presets of the Audio Degradation Toolbox (ADT [13]) in
480 bins ranging over four octaves from 55Hz (A1) to just
addition to the original wav files. The complete data com-
under 880Hz (A5) in steps of 10 cents (0.1 semitones).
prises in excess of 30 hours of audio. All results given are on
Such pitch bins can be modelled as states in a hidden this complete data set.
Markov model (HMM). The model would then directly use
We ran three different versions of the proposed PYIN
the probabilities of pitch candidates obtained in the first PYIN
method with the Beta parameter distributions introduced in
step as observation probabilities: the probability of each ob-
Section 2.1 (means 0.10, 0.15, 0.20, see Fig. 2). For compar-
served pitch candidate is assigned to the bin closest to its es-
ison we also ran three versions with a parameter distribution
timated frequency; this results in a sparse observation vector
S that has only one non-zero element si (here, too, these
pm , m = 1, . . . , M , where the only non-zero elements are
were 0.10, 0.15, 0.20). Since this effectively simplifies to
those closest to pitch candidates. We use this idea but de-
original YIN plus smoothing, we refer to these as YIN+S.
velop a more realistic HMM with one voiced (v = 1) and one
The baseline is the conventional YIN, also with three dif-
unvoiced (v = 0) state per pitch (i.e. with 2M pitches), in-
ferent versions of the original YIN method with threshold
spired by an existing note tracking method [9]. Assuming that
parameters s = 0.10, 0.15, 0.20. All methods are run in a
the prior probability of being in either a voiced or an unvoiced
Vamp plugin implementation with a step size of 256 (5.8ms)
state is P (v = 1) = P (v = 0) = 0.5, we define our models
and a frame size of 2048 (46.4ms).
observation probability as
(
0.5 pm , for v = 1 3.1. Quantitative Analysis on Synthetic Data
pm,v = P (6)
0.5 (1 P
k k ) for v = 0.
Recall. In the context of intonation analysis, it is desirable to
Transition probabilities in this model have two main pur- have a correct frequency estimate for as much of the voiced
poses: to favour natural (smooth) pitch tracks over discontin- data as possible, i.e. to obtain high recall. We calculate recall
uous ones, and to favour few changes between unvoiced and as the proportion of actually voiced frames (according to the
voiced states. We encode this behaviour in two distributions ground truth) which the extractor recognises as voiced and
pertaining to voicing transition and pitch transition: tracks with the correct frequency. We follow [10] and accept
a frame as correctly tracked if the estimate is within one semi-
pv = P (vt | vt 1 ) tone of the true frequency.
( Figure 3a shows that the proposed method with any of the
0.99, if no change
= (7) beta distributions tested (PYIN .10, .15, .20) clearly outper-
0.01 otherwise, and forms the original YIN estimate. The PYIN median recall
pij = P (pitcht = j | pitcht 1 = i). (8) values (0.977, 0.982, 0.984) approach much more closely the
upper bound given by the number of correct pitches covered
The latter is realised as a triangular weight distribution which by the full candidate set (0.98) than any of the conventional
encodes that a pitch jump can be at most 25 bins, which cor- YIN methods (medians all below 0.951). Note that this is true
responds to 2.5 semitones per frame. The highest likelihood despite the fact that the YIN methods have an advantage due
1.00 1.00 1.00
0.989
0.987
0.986
0.985
0.984
0.984
0.982
0.981
0.981
0.981
0.977
0.975
0.970
0.95 0.95 0.95
0.960
0.958
0.951
0.950
0.948
0.942
0.935
0.933
0.918
0.917
0.90 0.90 0.90
0.908
0.894
precision
recall
0.858
0.852
0.80 0.80 0.80
0.809
0.75 0.75 0.75
YIN+S .10
YIN+S .15
YIN+S .20
PYIN .10
PYIN .15
PYIN .20
full cand.
YIN .10
YIN .15
YIN .20
YIN+S .10
YIN+S .15
YIN+S .20
PYIN .10
PYIN .15
PYIN .20
YIN .10
YIN .15
YIN .20
YIN+S .10
YIN+S .15
YIN+S .20
PYIN .10
PYIN .15
PYIN .20
(a) Recall (b) Precision (c) F measure
Fig. 3: Box plots of performance measures providing median and 1st and 3rd quartiles.
estimate.
350
Note that the YIN+S estimates have a worse recall than
250
frequency