
Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi

Department of Systems and Informatics
Distributed Systems and Internet Technology Lab
University of Florence
Via S. Marta, 3, 50139 Florence, Italy
{pbellini, ivanb, nesi}@dsi.unifi.it

Assessing Optical Music Recognition Tools

As digitization and information technologies advance, document analysis and optical character recognition technologies have become more widely used. Optical Music Recognition (OMR), also commonly known as OCR (Optical Character Recognition) for Music, was first attempted in the 1960s (Pruslin 1966). Standard OCR techniques cannot be used in music-score recognition, because music notation has a two-dimensional structure: in a staff, the horizontal position denotes the different durations of notes, and the vertical position defines the height of the note (Roth 1994). Models for nonmusical OCR assessment have been proposed and largely used (Kanai et al. 1995; Ventzislav 2003).

An ideal system that could reliably read and understand music notation could be used in music production for educational and entertainment applications. OMR is typically used today to accelerate the conversion of music sheet images into a symbolic music representation that can be manipulated, thus creating new and revised music editions. Other applications use OMR systems for educational purposes (e.g., IMUTUS; see www.exodus.gr/imutus), generating customized versions of music exercises. A different use involves the extraction of symbolic music representations to be used as incipits or as descriptors in music databases and related retrieval systems (Byrd 2001).

OMR systems can be classified on the basis of the granularity chosen to recognize the symbols of the music score. The architecture of an OMR system is tightly related to the methods used for symbol extraction, segmentation, and recognition. Generally, the music-notation recognition process can be divided into four main phases: (1) the segmentation of the score image to detect and extract symbols; (2) the recognition of symbols; (3) the reconstruction of music information; and (4) the construction of the symbolic music notation model to represent the information (Bellini, Bruno, and Nesi 2004).
Computer Music Journal, 31:1, pp. 68-93, Spring 2007. © 2007 Massachusetts Institute of Technology.

Music notation may present very complex constructs and several styles. This problem has recently been addressed by the MUSICNETWORK and the Moving Picture Experts Group (MPEG) in their work on Symbolic Music Representation (www.interactivemusicnetwork.org/mpeg-ahg). Many music-notation symbols exist, and they can be combined in different ways to realize several complex configurations, often without using well-defined formatting rules (Ross 1970; Heussenstamm 1987).

Despite various research systems for OMR (e.g., Prerau 1970; Tojo and Aoyama 1982; Rumelhart, Hinton, and McClelland 1986; Fujinaga 1988, 1996; Carter 1989, 1994; Kato and Inokuchi 1990; Kobayakawa 1993; Selfridge-Field 1993; Ng and Boyle 1994, 1996; Coüasnon and Camillerapp 1995; Bainbridge and Bell 1996, 2003; Modayur 1996; Cooper, Ng, and Boyle 1997; Bellini and Nesi 2001; McPherson 2002; Bruno 2003; Byrd 2006) as well as commercially available products, optical music recognition (and, more generally, music recognition) is a research field affected by many open problems. The meaning of music recognition changes depending on the kind of application and its goals (Blostein and Carter 1992): audio generation from a musical score, music indexing and searching in a library database, music analysis, automatic transcription of a music score into parts, transcoding a score into interchange data formats, etc. For such applications, we must employ common tools to provide answers to questions such as "What does a particular percentage recognition rate that is claimed by this particular algorithm really mean?" and "May I invoke a common methodology to compare different OMR tools on the basis of my music?" As mentioned in Blostein and Carter (1992) and Miyao and Haralick (2000), there is no standard for expressing the results of the OMR process.


Another difficulty has to do with the lack of a standard language for symbolic music notation representation. Many music languages are available (Selfridge-Field 1997; Bellini, Bruno, and Nesi 2001; Good 2001), but none of them is satisfactory for a music-recognition evaluation. The same symbols can be modeled in different ways in different languages, and sometimes the same symbols may have multiple representations in the same language. This makes the comparison among OMR results much more difficult at the level of music-notation symbolic coding, and thus also in terms of the produced representation. At present, there is neither a standard database for music-score recognition nor a standard terminology. If a new recognition algorithm or system is proposed, it is difficult to compare it objectively with other algorithms or systems.

Characterizing the performance of OMR systems is important for two reasons. First, comparisons are important to predict performance. Typically, OMR is part of a larger music production process whose overall performance depends on the OMR recognition rate. Knowledge of end-to-end performance as a function of OMR accuracy allows one to predict the costs of production and to make decisions about the production process itself. Second, comparisons are important to monitor progress in research and development of OMR systems, because quantitative measures are needed. Periodic quantitative performance evaluation of OMR systems would allow assessment of progress in the field and could serve as a reference for researchers.

Currently, many commercial OMR tools are available: SharpEye2, SmartScore, PhotoScore, Capella-Scan, and others. (A list of OMR systems can be found in Byrd 2006.) The efficiency claimed by the most commercially used OMR systems is close to 90 percent. On the other hand, it is not clear how this value is estimated and what music-image score database was used. In any case, each of these products claims to be the best. Most of them break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are nth-generation photocopies. Besides, the end user cannot compare the relative performances of the products, because the various accuracy results are not reported on the same dataset. On the other hand, in many cases, music copyists prefer to enter music scores from scratch, because a recognition rate of 90 percent is not enough to make OMR systems attractive. (The costs of correction are very high, even for 90 percent accurate results.) Even in this case, the meaning of 90 percent is not clear.

Ng et al. (2004) propose a Quick-Test set of data, including three pages of music that contain examples of the most frequently found musical features and fundamental musical symbols. These pages do not contain any musical compositions taken from published works but rather contain specially designed sequences of notes and symbols to provide fundamental visual combinations of musical features. They are intended to provide initial data and results to establish basic capabilities of OMR systems before moving toward the complexities of an actual musical composition.

According to the analysis performed by the MUSICNETWORK working groups on music imaging, the assessment of the results of OMR systems can be based on simple estimations performed on the produced results in terms of reproduced music scores. Copyists insist that tiny details such as changes in basic symbols or annotation symbols (e.g., breath marks or fingering) are very important for assessment and conversion efficiency. On the other hand, OMR system builders state that the most relevant aspect to be considered is the capability of the system to recognize in a robust manner the most frequently used symbols (notes, groups of notes, etc.), whereas the tiny details are less relevant, and in some cases marginal. These two positions motivated us to analyze OMR results by using two assessment models focused on basic and composite symbols, respectively, and using human opinions to validate the results.

For these reasons, this article proposes two models for assessing OMR system performance based on two sets of metrics: those based on basic symbols, and those based on composite music symbols. These metrics are estimated by counting features of the results produced by OMR systems without the need to compare the output score-coding languages. The assessment models have been applied to a set of monophonic music score images processed by some OMR tools.

Some experts were contacted to provide their subjective estimation of the recognition quality of the music scores produced by each OMR tool. All evaluations were performed on classical monophonic pieces. Even if some OMR systems also provide interesting results on polyphonic music, we preferred to limit the assessment to monophonic music, for which OMR software products typically claim greater success. The quality assessment performed by experts was used to validate the results obtained in terms of metrics. The results obtained, together with a comparative analysis, are presented in this article. The assessment model proposed here can also be applied to the polyphonic case with some simple extensions. It should be noted, however, that the assessment model proposed does not depend on the output format produced by the OMR system, because the assessment is based on the symbols recognized and reproduced in the score. This also means that all the OMR systems considered have been allowed to work in their best operating conditions, in their native formats. This allowed estimating the effective value of the optical music-recognition process for each OMR system considered.

This article is organized as follows. First, metrics for assessing OMR recognition quality in terms of basic symbols and a set of metrics for assessing the quality of OMR systems on the basis of composite-symbol metrics are presented. Next, an example using the proposed metrics is reported. The example has been realized by estimating and comparing the behavior of three OMR systems against seven well-defined examples. The examples have been identified from the databases of the MUSICNETWORK partners and participants. Then, validation of the assessment metrics is proposed. A first validation has been performed by comparing the metrics defined without weights vis-à-vis those defined considering the relevance of the various symbols as stated by a set of experts. Finally, the proposed metrics are validated by producing a metric model using the judgments of the experts. Conclusions are drawn in the last section, and the appendices contain the complete data related to the assessment.

Metrics for Basic Symbols


Basic symbols in Western music notation and in the classical repertoire are the graphical elements that can be assembled to create more complex symbols and model the entire set of music-notation symbols. Thus, basic symbols refers to notation elements such as noteheads, rests, hooks of notes, etc. A set of basic symbols has been defined on the basis of a large set of examples considered as test images. The set of considered basic symbols is reported in Table 1. (Note that the staff line is not included among the basic symbols, whereas for the symbols reported in the table, a segment of the staff is present.) In fact, in the assessment model proposed, recognition is attempted without considering the staff as an independent symbol. A symbol is considered recognized only when the OMR system is able to reproduce it in the output in its correct position with respect to the others. Counting of correctly recognized symbols was performed manually by experts; for this reason, the clefs reported in Table 1 are only examples of the clefs irrespective of position, which is evaluated only in the results-assessment phase. This means that the model distinguishes between the recognition of the basic symbol and the recognition/identification of its correct position. In addition, some OMR systems eliminate staff lines in the process, whereas others use them for guiding recognition. The assessment model proposed is independent of the approach used, and Table 1 reports staff segments in those examples only to support this assertion.

Typically, some OMR systems perform a segmentation phase to extract basic symbols, which are used in a successive phase to reconstruct and recognize more complete and complex structures such as sixteenth notes, eighth notes, beamed notes, chords, beamed chords, etc., in a very large number of combinations. Thus, the stem symbol is missing, although the symbols related to hooks, which define the stem position/direction and the note value, are present. Assessment based on these basic symbols is performed with the goal of evaluating the capability of the OMR system in recognizing elementary information, while considering the recognition of each basic symbol even if it is incorrectly used for recreating the complete system.

Table 1. Basic Symbol Categories with Examples


Basic-symbol categories (each category is paired in the printed table with a graphical example that includes a segment of the staff):

Noteheads: Empty Note Head 4/4; Empty Note Head 2/4; Black Note Head 1/4 -> 1/64
Rests: Rest duration 4/4; Rest duration 2/4; Rest duration 1/4; Rest duration 1/8; Rest duration 1/16; Rest duration 1/32; Rest duration 1/64
Barlines: Barline Single; Barline Double; Barline End; Barline Start Refrain; Barline End Refrain
Accidentals: Sharp; Flat; Natural; Double Sharp; Double Flat
Clefs: Treble Clef; Bass Clef; Baritone Clef; Tenor Clef; Contralto Clef; Mezzo-soprano Clef; Soprano Clef
Hooks: Hook 1 (1/8); Hook 2 (1/16); Hook 3 (1/32); Hook 4 (1/64)
Beams: Beam 1 (1/8); Beam 2 (1/16); Beam 3 (1/32); Beam 4 (1/64)
Numbers: Number 1 through Number 9
Other symbols: Augmentation Dot; Accent; Slur; P (piano); F (forte); Comma (breath); C; Staccato; Fermata; Mordent; Turn; Grace Note; Trill; Tenuto

The list of basic symbols presented in Table 1 is not exhaustive. On the other hand, it allows evaluation of the performance of monophonic music-score recognition, as demonstrated hereafter. It could be extended to include other symbols or to divide some symbols into smaller elements to improve the accuracy of the evaluation. (For example, the time signature could be managed as single digits or as pairs of digits.) To assess the recognition rate in terms of basic symbols, a set of metrics has been defined to count the different types of recognized, missed, and/or confused symbols for each of the possible basic-symbol categories (indexed below by the variable i).


The first metric is estimated on the original score image, whereas the others are estimated on the reconstructed score image by comparing it with the original one. Here, we define NEBSi as the number of expected basic symbols of type i; NTBSi as the number of correctly recognized (true) basic symbols of type i (these are the so-called true positives: the number of symbols recognized consistently with the reference score image); NABSi as the number of added basic symbols of type i (these are false positives: the number of symbols recognized inconsistently with the reference score image); NFBSi as the number of incorrectly recognized basic symbols of type i; and NMBSi as the number of basic symbols of type i not recognized (missed).

Note that the false negatives have been divided into incorrectly recognized symbols (i.e., NFBS) and those that have been completely missed (i.e., NMBS), such that the number of false negatives can be obtained by adding those values. The number of true negatives is impossible to count, because they are not proposed in the final results by OMR systems: these are the symbols that an OMR system processes and correctly decides not to consider as valid symbols to be proposed in the output, and thus they do not appear in the output. These metrics have been used to define more significant indicators that can be used to assess the efficiency of the recognition process; they are similar to those traditionally applied in the field of pattern recognition and classification. For the category of basic symbols i, we define

NEBSi = NTBSi + NFBSi + NMBSi    (1)

The total number of expected basic symbols, TNEBS, is given by considering all NB categories:

TNEBS = Σ(i=1..NB) NEBSi    (2)

On the basis of the above counting-based metrics, the following metrics have been defined to obtain normalized values, regardless of the size of each score image in terms of number of symbols. Their values are expressed as percentages.

True Basic Symbols Rate (TBSR):

TBSR = [Σ(i=1..NB) NTBSi / TNEBS] × 100    (3)

Faulty Basic Symbols Rate (FBSR):

FBSR = [Σ(i=1..NB) NFBSi / TNEBS] × 100    (4)

Missed Basic Symbols Rate (MBSR):

MBSR = [Σ(i=1..NB) NMBSi / TNEBS] × 100    (5)

Added Basic Symbols Rate (ABSR):

ABSR = [Σ(i=1..NB) NABSi / TNEBS] × 100    (6)

By combining the above metrics, a general metric taking into account the faulty, missed, and added symbols can be defined as the Basic Symbols Error Rate (BSER):

BSER = FBSR + MBSR + ABSR    (7)

According to interviews and analysis performed with music notation experts (engravers, copyists, publishers, etc.), not all symbols have the same relevance, and thus the counts for these categories should be weighted in different manners when combining them into higher-level metrics. The comparison of results obtained by using weighted and non-weighted metrics may confirm this assumption, as discussed in the rest of the article (in the section on metric validation). For these reasons, a set of metrics including specific weights for each basic-symbol category has been derived.

Weighted True Basic Symbols Rate (WTBSR):

WTBSR = 100 × Σ(i=1..NB) w̄B,i NTBSi    (8)

Weighted Faulty Basic Symbols Rate (WFBSR):

WFBSR = 100 × Σ(i=1..NB) w̄B,i NFBSi    (9)

Weighted Missed Basic Symbols Rate (WMBSR):

WMBSR = 100 × Σ(i=1..NB) w̄B,i NMBSi    (10)

Weighted Added Basic Symbols Rate (WABSR):

WABSR = 100 × Σ(i=1..NB) w̄B,i NABSi    (11)

In each of these weighted metrics, the normalized weight w̄B,i is defined as

w̄B,i = wB,i / Σ(j=1..NB) NEBSj wB,j

This is simply the weight of category i normalized with respect to the sum of all the NEBSi weighted by their own wB,i. (Here, wB,i is the relevance, or weight, given to each basic symbol with respect to the others.) This normalization allows a 100 percent rate when the NTBSi, NFBSi, and NMBSi together match the NEBSi. In this way, the measure is normalized with respect to the music-score complexity. These weights range from 1 to 10, where 10 denotes the highest relevance. Their values have been estimated by collecting the judgment of experts. If all wB,i assume the same value, the metrics defined in Equations 8-11 are identical to those of Equations 3-6.

Similarly to BSER, a Weighted Basic Symbols Error Rate (WBSER) can be defined as

WBSER = WFBSR + WMBSR + WABSR    (12)

Like BSER (Equation 7), WBSER assumes that all the error rates of Equations 9-11 are equally important. To have a more precise evaluation of the basic-symbol recognition rate, the following metric has been defined, having different non-negative weights for WFBSR, WMBSR, WABSR, and WTBSR:

BSRR = wWFBSR · WFBSR + wWMBSR · WMBSR + wWABSR · WABSR + wWTBSR · WTBSR    (13)

BSRR can be regarded as an estimate of the recognition quality of an OMR system, where the weights are designed to balance the respective relevance of the equation terms. This is reasonable, because the quality may depend on the number of faulty, missed, added, and correctly recognized symbols. In this case, the weights attempt to correct the scale factor, as performed by the same authors in different contexts (Nesi and Querci 1998; Fioravanti and Nesi 2001). As done in many cases for other multi-term metrics, the values of the weights can be estimated using a multi-linear regression technique that assumes that BSRR is a good estimate of the general vote provided by a set of experts for a set of music scores. This approach allows us to collect enough data and expert judgments (which are considered as BSRR values) to invert the system and estimate the best-fitting weights on the basis of a statistical analysis (multi-linear regression analysis), while verifying whether they are statistically valid, as described in the section on validation. Once the weights have been estimated, the reported metric BSRR can be used as a model to predict the experts' estimation of the recognized music-notation score images for OMR systems on the basis of simple counting.
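To make the counting procedure concrete, the following is a minimal Python sketch, not part of the authors' tooling, that computes the unweighted and weighted basic-symbol rates of Equations 2-12 from per-category counts. The category names and numbers are hypothetical and serve only as an illustration.

# Hypothetical per-category counts: keys are basic-symbol categories,
# values are the manual counts described in the text.
expected = {"black_note_head": 120, "beam_1_8": 40, "sharp": 12}   # NEBSi
true_pos = {"black_note_head": 112, "beam_1_8": 36, "sharp": 11}   # NTBSi
faulty   = {"black_note_head": 5,   "beam_1_8": 2,  "sharp": 0}    # NFBSi
missed   = {"black_note_head": 3,   "beam_1_8": 2,  "sharp": 1}    # NMBSi
added    = {"black_note_head": 4,   "beam_1_8": 0,  "sharp": 1}    # NABSi
weights  = {"black_note_head": 10,  "beam_1_8": 10, "sharp": 10}   # wB,i in 1-10

tnebs = sum(expected.values())                                     # Equation 2

def rate(counts):
    """Unweighted rate (Equations 3-6): percentage of TNEBS."""
    return 100.0 * sum(counts.values()) / tnebs

tbsr, fbsr, mbsr, absr = map(rate, (true_pos, faulty, missed, added))
bser = fbsr + mbsr + absr                                          # Equation 7

# Normalized weights (denominator of the w-bar definition) and Equations 8-12.
norm = sum(expected[i] * weights[i] for i in expected)

def wrate(counts):
    """Weighted rate (Equations 8-11) using the normalized weights."""
    return 100.0 * sum(weights[i] / norm * counts.get(i, 0) for i in expected)

wtbsr, wfbsr, wmbsr, wabsr = map(wrate, (true_pos, faulty, missed, added))
wbser = wfbsr + wmbsr + wabsr                                      # Equation 12

print(f"TBSR={tbsr:.1f}%  BSER={bser:.1f}%  WTBSR={wtbsr:.1f}%  WBSER={wbser:.1f}%")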

Metrics for Composite Symbols


As discussed in the previous section, music notation presents a set of higher-level categories of symbols that are created by taking basic symbols into account. A composite music symbol is the final result of reconstructing relationships among basic symbols. Some basic symbols are at the same time composite music symbols (e.g., rests, clefs), whereas others are elementary components (a piece of beam, an augmentation dot, a hook, etc.) of a more complex/composite music symbol. Recognizing a single basic symbol does not imply that the identified music symbol is correct. From the point of view of the final result, the music symbol must be identified and characterized by its features and relationships with other symbols. The identification of a notehead does not imply complete note recognition; a note is characterized by its pitch, its duration, an accidental being correctly assigned, whether it is in a group of notes, etc. The realization of a beamed note group is an indication of the capability of reconstructing relationships among the notes of the group. This evaluation method is characterized by an accurate analysis; in fact, a music symbol or a relationship is considered correct only if the basic symbols and music symbols involved in the relationship are correct.

Therefore, the idea of defining metrics for assessing the recognition rate in terms of composite symbols is pursued with the goal of verifying and analyzing whether the recognition rate, as judged by experts, is based on basic-symbol analysis (the identification of some or all basic symbols and the relevance of the most important of them) or on the impression given by the number of recognized composite symbols (considering the relative position of each graphic detail, the reconstruction of composite groups, etc.).


Moreover, the general impression given to an expert may also be based on other aspects that are more difficult to quantify. Note that counting composite symbols is to some extent simpler than counting basic symbols, because there are fewer composite symbols per page than basic symbols. To assess the recognition rate in terms of composite symbols, a set of metrics has been defined to count several categories of recognized, faulty, and missed composite symbols for each of the possible composite symbols of category i. Such categories (NC in total) are described in the appendices, where guidelines for counting the following metrics are also reported. These metrics are quite similar to the parallel basic-symbol metrics, but with B replaced by C, which stands for composite:

1. NECSi: number of expected composite symbols of type i
2. NTCSi: number of correctly recognized (true) composite symbols of type i (the true positives)
3. NFCSi: number of wrongly recognized (faulty) composite symbols of type i (a contribution to the number of false negatives)
4. NMCSi: number of not recognized (missed) composite symbols of type i (a contribution to the number of false negatives)
5. NACSi: number of added composite symbols of type i (the false positives)

Equations 14-26 deal with composite symbols and are directly analogous to Equations 1-13 for basic symbols. (For explanatory text, please see the descriptions of Equations 1-13, replacing B with C.) Here, we define the following:

NECSi = NTCSi + NFCSi + NMCSi    (14)

Total Number of Expected Composite Symbols (TNECS):

TNECS = Σ(i=1..NC) NECSi    (15)

True Composite Symbols Rate (TCSR):

TCSR = [Σ(i=1..NC) NTCSi / TNECS] × 100    (16)

Faulty Composite Symbols Rate (FCSR):

FCSR = [Σ(i=1..NC) NFCSi / TNECS] × 100    (17)

Missed Composite Symbols Rate (MCSR):

MCSR = [Σ(i=1..NC) NMCSi / TNECS] × 100    (18)

Added Composite Symbols Rate (ACSR):

ACSR = [Σ(i=1..NC) NACSi / TNECS] × 100    (19)

Composite Symbols Error Rate (CSER):

CSER = FCSR + MCSR + ACSR    (20)

Weighted True Composite Symbols Rate (WTCSR):

WTCSR = 100 × Σ(i=1..NC) w̄C,i NTCSi    (21)

Weighted Faulty Composite Symbols Rate (WFCSR):

WFCSR = 100 × Σ(i=1..NC) w̄C,i NFCSi    (22)

Weighted Missed Composite Symbols Rate (WMCSR):

WMCSR = 100 × Σ(i=1..NC) w̄C,i NMCSi    (23)

Weighted Added Composite Symbols Rate (WACSR):

WACSR = 100 × Σ(i=1..NC) w̄C,i NACSi    (24)

Weighted Composite Symbols Error Rate (WCSER):

WCSER = WFCSR + WMCSR + WACSR    (25)

Composite Symbol Recognition Rate (CSRR):

CSRR = wWFCSR · WFCSR + wWMCSR · WMCSR + wWACSR · WACSR + wWTCSR · WTCSR    (26)

As stated for BSRR for the basic symbols, CSRR can be regarded as an estimation of the recognition quality of an OMR system. The comments made regarding the estimation of the BSRR weights remain valid in this case. Once the weights have been estimated, the CSRR metric can be used to predict the experts' estimation of the recognized music-notation score images for OMR systems on the basis of a simple counting of composite symbols.
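As a purely hypothetical illustration of these rates (the numbers are not taken from the test cases): if a score contains TNECS = 50 expected composite symbols, of which 40 are correctly recognized, 6 are faulty, and 4 are missed, and the OMR system additionally introduces 3 spurious symbols, then TCSR = 80 percent, FCSR = 12 percent, MCSR = 8 percent, ACSR = 6 percent, and CSER = 12 + 8 + 6 = 26 percent.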

Application of Metrics and Data Collection

To provide a set of test cases to assess the effective capabilities of the OMR systems considered, seven music-notation score images were selected from a large archive of digitized images collected at the University of Florence Department of Systems and Informatics (DSI). The image test set is available on the MUSICNETWORK Web site (www.interactivemusicnetwork.org/wg_imaging/Omr_Assessment/index.html) and can be used as a reference set by any other research group or by OMR tool-builders for self-assessment. The selected music score images are music exercises (in common Western music notation) and have the following features: font variability, music symbols that are often used in the classical music repertoire, variable density of music symbols, irregular groups (tuplets, triplets, etc.), small notes with or without accidentals (grace notes), different barlines (start and end refrain, end score, single barline, and double barline), clef and time signature changes, ornaments (mordent, turn, trill, etc.), and horizontal symbols (slurs and ties). All images were digitized by means of an HP ScanJet flatbed scanner at 300 dots per inch (DPI) resolution and 8-bit depth (grayscale), and they were stored in GIF format. The seven images were used as test cases for assessing and comparing three different OMR tools: O3MR, developed at the University of Florence inside the IMUTUS IST FP6 Project (Bruno and Nesi 2004); SmartScore version 2.0.3 (www.musitek.com); and SharpEye2 version 2.62 (www.visiv.co.uk). All images were submitted as input files to the OMR applications, and the corresponding recognized scores were printed and analyzed using the metrics presented in this article.

Basic Symbols Metrics Application

Tables B1, B2, and B3, reported in Appendix B, show the values obtained for the basic-symbol metrics estimated on the results produced on the seven test cases by the OMR tools SmartScore, O3MR, and SharpEye2. Analysis of the tables pointed out that SmartScore loses information, and its total NMBS is greater than its total NFBS. The accent symbol was never recognized, and there were some identification problems with time signatures, rests with duration shorter than an eighth note, and slurs. Furthermore, SmartScore added whole notes, augmentation dots, and staccato symbols; in fact, for such categories of basic symbols, the NABS is always nonzero. Mordents, trills, and fermatas were also not managed well. SharpEye2 showed a high recognition rate, but at the same time it lost information. It added fewer symbols than the other two OMR software tools; in fact, its total NABS was better than those of the others. It was able to reconstruct slurs and recognize trills, fermatas, and grace notes. Some difficulties in recognizing the baritone clef were detected. Finally, O3MR also showed a high recognition rate, but it lost some information and did not manage symbols such as fermatas, trills, staccato dots, mordents, turns, and grace notes. For these categories, NABS is always nonzero. The high value of total NFBS and NMBS is primarily owing to the difficulty of slur reconstruction.

In Figure 1, the distributions of values for the basic-symbol metrics TBSR and BSER for the test cases and the considered OMR tools are reported. Considering the average values, it is evident that SharpEye2 is better ranked with respect to the O3MR and SmartScore systems. However, if the variance is considered, O3MR seems to be more reliable, primarily because it achieved the best performance in four examples out of seven and thus exhibited the lowest variance. As can be noted by comparing the values obtained for TBSR and BSER, the TBSR illustrates the differences among the OMR systems. In the last column of the tables reported in Figure 1, the margin of error is reported for a confidence level of 95 percent.


Figure 1. Distribution of (a) TBSR and (b) BSER for the test cases and for OMR tools. (For TBSR, higher scores are better, and for BSER, lower scores are better.)


Composite Symbols Metrics Application

Table C1, reported in Appendix C, shows the values obtained for the composite-symbol metrics estimated on the results produced on the seven test cases by the OMR systems SmartScore, O3MR, and SharpEye2. Analysis of the reported results pointed out that SmartScore introduces errors in note reconstruction and adds notes, as shown by the NFCS and NACS. SmartScore seems to detect tuplets by analyzing the metric consistency of the measure and not by recognizing the printed number (e.g., 3) that explicitly indicates them. Analysis of NFCS and NMCS for slurs, rests, symbols above or below notes, signature changes, and key signatures shows that SmartScore has problems in reconstructing them. The high values of total NACS for each example highlight its main tendency to lose information. SharpEye2 does not introduce wrong notes, but it has some problems with tuplets. In contrast to SmartScore, SharpEye2 seems to recognize a tuplet by detecting the number that characterizes it. In grace-note detection, it does not discriminate appoggiaturas from acciaccaturas; it considers all grace notes to be appoggiaturas. For this reason, the NFCS value is high in test case 6 of Figure 1. The main limits of O3MR involve the recognition of slurs, tuplets, grace notes, and symbols above or below notes. In fact, for these categories, the values of NFCS and NMCS are relevant.

It introduces rests, probably owing to an incorrect decomposition of some music symbols, whereas the comparison with the total NACS of SmartScore shows that O3MR adds fewer symbols.

Figure 2 reports the distribution of values of the composite-symbol metrics TCSR and CSER for the test cases and for the considered OMR tools. From these values, it is evident that SharpEye2 provides better performance than O3MR. O3MR loses with respect to SharpEye2 because it has some difficulties with relevant symbols such as slurs and tuplets. O3MR achieves a better performance when compared with SmartScore.

The assessment reported in Figure 2 is based on symbol counting only in the resulting recognized music scores, without considering the relevance of symbols. However, according to the experts' opinions, the symbols' relevance is important. What often happens can be explained in terms of experts mentally assigning a different relevance to different basic and composite symbols. The work reported in the following section concerns validation of the proposed metrics with respect to the judgment provided by experts.

Metrics with Weights for Symbol Relevance

The analysis reported hereafter has been performed to identify which one among the metrics defined in the previous sections is the most suitable for evaluating and/or predicting the quality of a music score reconstructed by an OMR system.

Figure 2. Distribution of (a) TCSR and (b) CSER for the test cases and for the OMR tools. (For TCSR, higher scores are better, and for CSER, lower scores are better.)


Here, most suitable refers to the metric's proximity to the judgment of the experts. The first step was to collect the weights expressed by experts for modeling the relevance of basic and composite symbols. Once the metrics with symbol weights were obtained, the new set of metrics could be compared with those discussed in the previous section. This phase must be regarded as the first step toward the effective validation of the metrics against the judgment of the 15 experts on seven different image scores.

Weighting the Relevance of Basic and Composite Symbols

A questionnaire was prepared and submitted to a group of experts and users of OMR tools and symbolic music-notation software to estimate the weights expressing the relevance of the identified categories of symbols for both the basic and the composite metrics. The questionnaire was structured in a manner similar to Table 2 for basic symbols and to Table 3 for composite symbols. For each category of symbols, experts were asked to provide a relevance vote in the range 1-10. The results presented a low variance, and the median values are reported in Table 2 for basic symbols and in Table 3 for composite symbols.

In Table 2, symbols that affect the correct pitch and timing are the most important. Dynamics and articulation are secondary, as is beaming. Unusual pitch-related symbols such as the tenor clef are less important than more common ones such as the treble clef. The obtained weights were then used for estimating the metrics WTBSR, WFBSR, WMBSR, and WABSR for basic symbols and the metrics WTCSR, WFCSR, WMCSR, and WACSR for composite symbols. Results related to WTBSR and WTCSR are depicted in Figure 3. The results obtained with WTBSR and WTCSR are similar to those obtained for TBSR and TCSR. SharpEye2 consistently achieved the best rank, followed by O3MR.

Validation of Metrics

The second step focused on collecting judgments of the overall quality of the results produced by the OMR tools. To this end, the analysis was carried out while trying to validate the defined metrics against the votes and judgments provided by experts. The experts were asked to express their votes as a measure of quality in the test cases. The analysis was divided into two phases: (1) collection of the experts' judgments, and (2) validation of the proposed metrics, thus allowing the estimation of the weights included in the CSRR and BSRR metrics.

Figure 3. Distribution of metric values for (a) WTBSR and (b) WTCSR along the test cases.


Table 2. Weights for the Relevance of Basic Symbols


Basic symbol and relevance weight wB (1-10):

Empty Note Head 4/4: 10; Empty Note Head 2/4: 10; Black Note Head 1/4 -> 1/64: 10
Rest duration 4/4: 10; Rest duration 2/4: 10; Rest duration 1/4: 10; Rest duration 1/8: 10; Rest duration 1/16: 10; Rest duration 1/32: 10; Rest duration 1/64: 10
Barline Single: 10; Barline Double: 5; Barline End: 5; Barline Start Refrain: 5; Barline End Refrain: 5
Sharp: 10; Flat: 10; Natural: 10; Double Sharp: 10; Double Flat: 10
Treble Clef: 10; Bass Clef: 10; Baritone Clef: 7; Tenor Clef: 6; Contralto Clef: 6; Mezzo-soprano Clef: 7; Soprano Clef: 7
Hook 1 (1/8): 10; Hook 2 (1/16): 10; Hook 3 (1/32): 10; Hook 4 (1/64): 10
Beam 1 (1/8): 10; Beam 2 (1/16): 10; Beam 3 (1/32): 10; Beam 4 (1/64): 10
Augmentation Dot: 10; Accent: 5
Number 1: 6; Number 2: 10; Number 3: 10; Number 4: 10; Number 5: 10; Number 6: 10; Number 7: 10; Number 8: 10; Number 9: 10
Slur: 8; P (piano): 5; F (forte): 5; Comma (breath): 3; C: 10
Staccato: 5; Fermata: 5; Mordent: 5; Turn: 5; Grace Note: 5; Trill: 5; Tenuto: 5

Table 3. Weights for the Relevance of Composite Symbols


Composite symbol / relationship and relevance weight wC (1-10):

Note with pitch and duration: 10; Note with accidentals: 10; Groups of beamed notes: 7; Rests: 10
Time signature and time-signature change: 10; Key signature and key-signature change: 10
Symbols below or above notes (turn, trill, accent, staccato, etc.): 5; Grace notes: 5; Slurs, ties, and bends: 7
Augmentation dots: 10; Clefs: 10; Irregular note groups: 10; Number of measures: 10; Number of staves: 10


Table 4. Statistical Results of the Votes Provided by Experts


Experts' votes (OMR 1 = O3MR, developed at DSI; OMR 2 = SmartScore; OMR 3 = SharpEye2). For each tool, the columns give the median, average, and mode of the votes.

Test case     O3MR (median / average / mode)    SmartScore (median / average / mode)    SharpEye2 (median / average / mode)
1             6.00 / 5.92 / 7.00                6.00 / 4.85 / 6.00                      6.00 / 5.92 / 6.00
2             6.00 / 5.69 / 3.00                5.00 / 4.15 / 5.00                      5.00 / 5.15 / 5.00
3             6.00 / 5.62 / 6.00                5.00 / 4.62 / 5.00                      6.00 / 6.38 / 8.00
4             5.00 / 4.69 / 5.00                6.00 / 5.23 / 6.00                      6.00 / 5.46 / 5.00
5             6.00 / 5.54 / 6.00                4.00 / 4.00 / 3.00                      5.00 / 4.92 / 4.00
6             5.00 / 5.08 / 5.00                4.00 / 4.00 / 4.00                      5.00 / 5.15 / 4.00
7             6.00 / 5.54 / 6.00                4.00 / 4.69 / 4.00                      7.00 / 6.62 / 8.00
Total average        5.44                               4.47                                    5.49

To collect the perceived quality of the results of the OMR systems, a group of experts participated in a blind test for each test case. The original music score images were presented to the experts together with the corresponding printed image of the music score recognized by each OMR tool. The OMR tools were renamed OMR1, OMR2, and OMR3 to hide their actual names and manufacturers. For each score image produced by an OMR tool, a measure of quality was expressed as a vote in the range 1-10. This allowed collecting 15 votes for each test, and thus a total of 105 votes. Table 4 reports the statistical results of the experts' votes collected for the test cases (examples) for the three OMR systems. When analyzing the votes of the experts, it is self-evident that even in this case SharpEye2 was the most reliable and satisfactory tool for the experts. The values reported in the table are the median, mean, and mode of the values corresponding to the different test cases for the three different tools. The total average of the votes provided by the experts identified SharpEye2 as the higher-ranked OMR system, followed closely by O3MR.

Table 5 shows the correlation matrix for the experts' votes (listed as V-median, V-average, and V-mode) as well as the values of the most significant basic- and composite-symbol metrics. Note the high correlation between the basic-symbol metrics and the average votes (V-average) given by the experts. On the other hand, the composite-symbol metrics are better correlated

with the median votes (V-median). In addition, the composite-symbol metrics are well correlated with the corresponding basic-symbol metrics. According to the correlation table reported in Table 5, V-median and V-average turned out to be related to WABSR (an index of added basic symbols) and to WBSER. Furthermore, the correlation of V-median and V-average with WCSER is relevant.

Using the judgments expressed by the experts for the different examples allows us to invert the equations and to estimate the weights on the basis of a multi-linear regression analysis. This applies to BSRR (see Equation 13) and CSRR (see Equation 26), where BSRR and CSRR have been computed using the judgments of the experts. Both the BSRR and CSRR metrics contain weighted terms that are based on the basic and composite metrics previously considered. As shown in Table 5, most of the terms involved in those metrics are correlated in a significant manner with the real judgments of the experts, V-median and V-average. The weights involved in BSRR and CSRR weight the relevance of the terms with which they are associated: added symbols, faulty symbols, missed symbols, and true symbols. The following analysis can help in understanding which of these terms are most relevant in the model. By using the expressions of BSRR and CSRR and considering the votes of the experts on the several examples, a multi-linear least-squares regression technique can be applied (Rousseeuw and Leroy 1987; Nesi and Querci 1998; Fioravanti and Nesi 2001) to estimate the weights that minimize the deviation from a multi-linear relationship.

Table 5. Correlation Values of Metrics

Note that regions involving basic metrics, composite metrics, and expert-provided values have been highlighted with different shading.

Table 6 shows the results of the multi-linear regression analysis. A high value of correlation has been obtained for the BSRR metric. In Table 6, the values of the weights w are reported with their corresponding t-values and p-values, which establish the importance of each coefficient in the general model. The p-value can be considered as a probability that the reference distribution assigns to the set of possible values of the test statistic; when it is less than 0.05, the corresponding coefficient is significant with a confidence of 5 percent. The t-value is intuitively an index that establishes the importance of a coefficient in the general model. A coefficient can be considered significant if the t-value is greater than 1.5 (because a high number of measures have been used for the regression). In the case of BSRR, the metric components WFBSR and WMBSR are not significant. Thus, by removing these terms from the definition of the BSRR metric, a similar result is obtained with a simpler metric, called BSRR' and computed as

BSRR' = wWABSR · WABSR + wWTBSR · WTBSR    (27)

In this case, with the new version of the BSRR metric, a very slight decrease in correlation has been obtained, passing from 0.90 to 0.89. This shows that the coefficients for WABSR and WTBSR (added and correct symbols) are enough to obtain a well-correlated metric, and it gives evidence that the judgment of the experts is mainly based on the estimation of these aspects. The results obtained through the multi-linear regression analysis applied to the composite-symbol metrics are interesting as well. In this case, all terms are significant, and the correlation (0.84) is smaller than that obtained with the basic-symbol metrics. In all cases, the F-stat of the global multi-linear regression is practically zero, and therefore the entire process and data are statistically significant. In both cases (CSRR and BSRR'), the impact of the added symbols is negative in the estimation of the general vote, as is the addition of symbols in CSRR. In addition, it should be noted that the values obtained in terms of correlation between the judgment of the experts and the CSRR and BSRR' metrics are not completely different from those that have been obtained by using WBSER and WCSER or BSER and CSER. To some degree, this implies that the latter metrics can be used to provide a less precise estimation of the effective quality of recognition, while they can still be useful to compare the performance of OMR systems.

As a general conclusion, the proposed metrics

Table 6. Multi-Linear Regression Analysis Using Basic-Symbol and Composite-Symbol Metrics


BSRR (Equation 13)
  Term     w        t-value    p-value
  WFBSR    0.144    1.141      0.269
  WMBSR    0.001    0.017      0.987
  WABSR    0.135    1.818      0.087
  WTBSR    0.061    29.393     0.000
  Correlation = 0.902; Std. Error = 0.335; Std. Dev. = 0.647; R-Squared = 0.814; F-Stat = 0.000

CSRR (Equation 26)
  Term     w        t-value    p-value
  WFCSR    0.171    2.118      0.049
  WMCSR    0.072    2.002      0.061
  WACSR    0.205    3.878      0.001
  WTCSR    0.064    23.473     0.000
  Correlation = 0.841; Std. Error = 0.420; Std. Dev. = 0.601; R-Squared = 0.707; F-Stat = 0.000

BSRR' (Equation 27)
  Term     w        t-value    p-value
  WABSR    0.160    5.856      0.000
  WTBSR    0.063    50.008     0.000
  Correlation = 0.891; Std. Error = 0.332; Std. Dev. = 0.654; R-Squared = 0.795; F-Stat = 0.000

BSRR' and CSRR have been validated against the judgments expressed by a set of experts. This means that the identified model, including the proposed metrics and the estimated weights, can be used for assessing OMR systems, providing results similar to those produced by an expert. The advantage is that the assessment based on the metrics is simpler and can be performed without consulting experts, simply by counting the added, missed, true, and faulty symbols when comparing the music sheet produced by the OMR system with the original score. Such a global approach can be based on WBSER, WCSER, BSER, and/or CSER.
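The weight estimation described above can be reproduced with any least-squares routine. The following is a minimal Python sketch, using entirely hypothetical measurements rather than the article's data, that fits the two weights of the simplified BSRR' metric (Equation 27) against expert votes and reports the resulting correlation:

import numpy as np

# Hypothetical per-score measurements (one entry per test case / OMR tool pair)
# and the corresponding expert quality votes in the range 1-10.
wabsr = np.array([3.1, 7.4, 1.2, 5.0, 2.3, 9.8, 0.5])          # weighted added-symbol rates
wtbsr = np.array([88.0, 71.5, 93.2, 80.1, 90.4, 60.3, 96.0])   # weighted true-symbol rates
votes = np.array([6.0, 4.5, 7.0, 5.5, 6.5, 3.5, 7.5])          # expert votes playing the role of BSRR'

# Design matrix with one column per metric term (Equation 27 has two weights).
X = np.column_stack([wabsr, wtbsr])
w, *_ = np.linalg.lstsq(X, votes, rcond=None)                  # least-squares weight estimates
predicted = X @ w

# Correlation between the fitted metric and the expert votes, as reported in Table 6.
corr = np.corrcoef(predicted, votes)[0, 1]
print(f"w_WABSR = {w[0]:.3f}, w_WTBSR = {w[1]:.3f}, correlation = {corr:.3f}")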

Conclusions
This article addressed the problem of OMR-tool assessment. One of the most important open problems is the lack of standard methodologies, which prevents objective comparison of existing OMR tools. In this article, studies and approaches for the definition of OMR performance-evaluation models were described. Two kinds of metrics for performance evaluation were proposed: metrics based on basic symbols, and metrics based on composite music symbols and their relationships. They were validated through comparison with the quality judgments provided by experts in the OMR field. The metrics were used to collect data by using and comparing three OMR tools: SharpEye2, SmartScore, and O3MR. The most relevant part of the article lies not in the number of OMR tools analyzed and compared, or in the values obtained, but rather in the assessment model proposed.

We proposed a rigorous model for the assessment of OMR systems, and it has been validated by using the experts' experience. It can be easily applied to polyphonic music, because it is based on the simple counting of symbols and is independent of the format produced by the OMR system under consideration. During the validation phase, we found that the proposed metrics with optimal weights are well correlated with the votes of the experts regarding the quality of the reconstructed music score. Moreover, the proposed metrics can produce evaluations close to those provided by experts while considering a restricted number of terms. This allows a reduction of the effort in counting and measuring terms. Finally, the collected data allowed comparison of the performance of the OMR tools involved in the metric validation. The simplest metric shows that SharpEye2 exhibits the best performance and the ability to manage more symbols. On the other hand, O3MR is very reliable if the analysis is limited to the recognition of notes and rests.

Acknowledgments
The authors would like to thank all the partners of the MUSICNETWORK Project and all the participants in the Second Workshop of MUSICNETWORK at the University of Leeds in September 2003 for their contribution and patience in completing questionnaires and participating in the reported assessment work.


References
Bainbridge, D., and T. Bell. 1996. An Extensible Optical Music Recognition System. Australian Computer Science Communications 18(1):308–317. Bainbridge, D., and T. Bell. 2003. A Music Notation Construction Engine for Optical Music Recognition. Software: Practice & Experience 33(2):173–200. Bellini, P., I. Bruno, and P. Nesi. 2001. Optical Music Sheet Segmentation. Proceedings of WEDELMUSIC 2001. Piscataway, New Jersey: IEEE Press, pp. 183–190. Bellini, P., I. Bruno, and P. Nesi. 2004. An Off-Line Optical Music Sheet Recognition. In S. E. George, ed. Visual Perception of Music Notation: On-Line and Off-Line Recognition. Hershey, Pennsylvania: Idea Group, pp. 40–77. Bellini, P., and P. Nesi. 2001. WEDELMUSIC Format: An XML Music Notation Format for Emerging Applications. Proceedings of WEDELMUSIC 2001. Piscataway, New Jersey: IEEE Press, pp. 79–86. Blostein, D., and N. P. Carter. 1992. Recognition of Music Notation. In H. S. Baird, H. Bunke, and K. Yamamoto, eds. Structured Document Image Analysis. Berlin: Springer, pp. 573–574. Bruno, I. 2003. Music Score Image Analysis: Methods and Tools for Automatic Recognition and Indexing. Ph.D. Thesis, Department of Systems and Informatics, University of Florence, Italy. Bruno, I., and P. Nesi. 2004. Algorithmic Description of the Optical Music Recognition Module. DE7.1 Internal Report of the IMUTUS IST-2001-32270 Project. Byrd, D. 2001. Music-Notation Searching and Digital Libraries. In Proceedings of the 2001 Joint Conference on Digital Libraries. New York: Association for Computing Machinery, pp. 239–246. Byrd, D. 2006. OMR (Optical Music Recognition) Systems. Bloomington, Indiana: Indiana University. Available online at mypage.iu.edu/~donbyrd/OMRSystemsTable.html. Carter, N. P. 1989. Automatic Recognition of Printed Music in the Context of Electronic Publishing. Ph.D. Thesis, University of Surrey. Carter, N. P. 1994. Music Score Recognition: Problems and Prospects. Computing in Musicology 9:152–158. Cooper, D., K. C. Ng, and R. D. Boyle. 1997. MIDI Extensions for Musical Notation: Expressive MIDI. In

E. Selfridge-Field, ed. Beyond MIDI: The Handbook of Musical Codes. Cambridge, Massachusetts: MIT Press, pp. 402–447. Coüasnon, B., and J. Camillerapp. 1995. A Way to Separate Knowledge from Program in Structured Document Analysis: Application to Optical Music Recognition. Proceedings of the 1995 International Conference on Document Analysis and Recognition. New York: Institute of Electrical and Electronics Engineers, pp. 1092–1097. Fioravanti, F., and P. Nesi. 2001. Estimation and Prediction Metrics for Adaptive Maintenance Effort of Object-Oriented Systems. IEEE Transactions on Software Engineering 27:1062–1084. Fujinaga, I. 1988. Optical Music Recognition Using Projections. M.A. Thesis, McGill University. Fujinaga, I. 1996. Adaptive Optical Music Recognition. Ph.D. Dissertation, Music, McGill University. Good, M. 2001. MusicXML for Notation and Analysis. In E. Selfridge-Field and W. B. Hewlett, eds. The Virtual Score. Cambridge, Massachusetts: MIT Press, pp. 113–124. Heussenstamm, G. 1987. The Norton Manual of Music Notation. New York: Norton. Kanai, J., et al. 1995. Automated Evaluation of OCR Zoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(1):86–90. Kato, H., and S. Inokuchi. 1990. The Recognition System for Printed Piano Music Using Musical Knowledge and Constraints. Proceedings of the International Association for Pattern Recognition Workshop on Syntactic and Structural Pattern Recognition. Murray Hill, New Jersey, pp. 231–248. Kobayakawa, T. 1993. Auto Music Score Recognition System. Proceedings of SPIE: Character Recognition Technologies. Bellingham, Washington: International Society for Optical Engineering, pp. 112–123. McPherson, J. R. 2002. Introducing Feedback into an Optical Music Recognition System. Proceedings of the Third International Conference on Music Information Retrieval. Paris: IRCAM, pp. 259–260. Miyao, H., and R. M. Haralick. 2000. Format of Ground Truth Data Used in the Evaluation of the Results of an Optical Music Recognition System. IAPR Workshop on Document Analysis Systems, pp. 497–506. Modayur, B. R. 1996. Music Score Recognition: A Selective Attention Approach Using Mathematical Morphology. Unpublished manuscript. Seattle: University of Washington Electrical Engineering Department. Nesi, P., and T. Querci. 1998. Effort Estimation and Prediction of Object-Oriented Systems. Journal of Systems and Software 42:89–102.


Ng, K. C., and R. D. Boyle. 1994. Reconstruction of Music Scores from Primitive Sub-Segmentation. Unpublished manuscript. Leeds, UK: University of Leeds School of Computer Studies. Ng, K. C., and R. D. Boyle. 1996. Recognition and Reconstruction of Primitives in Music Scores. Image and Vision Computing 14(1):39–46. Ng, K. C., et al. 2004. Coding Images of Music. Interactive Music Network Project IST-2001-37168, Deliverable DE4.7.1. Prerau, D. S. 1970. Computer Pattern Recognition of Standard Engraved Music Notation. Ph.D. Thesis, MIT. Pruslin, D. 1966. Automatic Recognition of Sheet Music. Sc.D. Thesis, MIT. Ross, T. 1970. The Art of Music Engraving and Processing. Miami: Hansen. Roth, M. 1994. An Approach to Recognition of Printed Music. Diploma Thesis, Swiss Federal Institute of Technology Department of Computer Science. Rousseeuw, P. J., and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley. Rumelhart, D. E., G. E. Hinton, and J. L. McClelland. 1986. A General Framework for Parallel Distributed Processing. In D. E. Rumelhart and J. L. McClelland, eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, Massachusetts: MIT Press, pp. 45–76. Selfridge-Field, E. 1993. Optical Recognition of Musical Notation: A Survey of Current Work. Computing in Musicology 9:109–145. Selfridge-Field, E. 1997. Beyond MIDI: The Handbook of Musical Codes. Cambridge, Massachusetts: MIT Press. Tojo, A., and H. Aoyama. 1982. Automatic Recognition of Music Score. Proceedings of the 6th International Conference on Pattern Recognition. Munich, p. 1223. Ventzislav, A. 2003. Error Evaluation and Applicability of OCR Systems. Proceedings of the 4th International Conference on Computer Systems and Technologies. New York: ACM Press, pp. 308–313.

Appendix A: List of Categories of Complete Symbols and Evaluation Guidelines for Metrics

The list of the NC categories used during the test is reported here. For each category, the goal and the guidelines concerning how to measure the metrics presented in the article are reported as well. The selected set of complete-symbol categories is not exhaustive for any genre of musical score, and yet it covers a large part of Western classical music notation. On the other hand, it can be extended to model more symbols or to cope with structural issues, document layout, etc. The proposed list has been tuned to describe monophonic music scores and their relationships, together with the most important and often-used music notation symbols in common-practice Western music notation. Evaluation was performed by comparing the reconstructed score with the original score. (For a more detailed list of symbol categories, as well as tables showing complete numerical results for the three OMR systems, see musicnetwork.dsi.unifi.it/AssessingOpticalMusicRecognitionToolsAppendices-V2-2.pdf.)

Note with Pitch and Duration

This evaluation focuses on note correctness in terms of pitch and duration. A note is deemed correct when pitch and duration are correct; a fault when pitch or duration are incorrect; a miss when the note shape is not in the reconstructed score; and an add when the note shape is not in the original score.
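The decision rule above can be formalized directly. The following is a hypothetical Python sketch, not the procedure used in the article (where counting was performed manually by experts), that tallies the four outcomes for this category under the assumption that the notes of the original and reconstructed scores have already been put in correspondence:

def tally_notes(pairs):
    """pairs: list of (original, recognized); each note is a (pitch, duration)
    tuple, and None marks a note absent from one of the two scores."""
    counts = {"correct": 0, "fault": 0, "miss": 0, "add": 0}
    for original, recognized in pairs:
        if original is not None and recognized is None:
            counts["miss"] += 1       # note shape not in the reconstructed score
        elif original is None and recognized is not None:
            counts["add"] += 1        # note shape not in the original score
        elif original == recognized:
            counts["correct"] += 1    # pitch and duration both correct
        else:
            counts["fault"] += 1      # pitch or duration incorrect
    return counts

# Example: third note misread a fourth too low, fourth note missed,
# last note invented by the OMR system.
aligned = [(("C4", 1/4), ("C4", 1/4)), (("D4", 1/8), ("D4", 1/8)),
           (("E4", 1/4), ("B3", 1/4)), (("F4", 1/2), None), (None, ("G4", 1/8))]
print(tally_notes(aligned))   # {'correct': 2, 'fault': 1, 'miss': 1, 'add': 1}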

Rests and Duration

This evaluation addresses rest reconstruction. A rest is correct when its duration is correct; a fault when its duration is incorrect; a miss when the rest shape is not in the reconstructed score; and an add when the rest shape is not in the original score.

Note with Accidentals


This evaluation focuses on the assignment of accidentals such as sharps, flats, double sharps, naturals, and double flats to notes. An accidental assignment is counted as correct when the accidental has been associated with the correct note; a fault when a wrong accidental has been associated with the related note; a miss when the accidental has not been associated with the related note or has been skipped, as in the case of a missed note; and an add when the accidental has been associated with a note that, in the original score, does not have that kind of accidental or has no accidental at all.



Groups of Beamed Notes

This evaluation addresses the reproduction of note groups without considering the duration and pitch of each note. (That last feature is considered by the Note with Pitch and Duration evaluation.) A group of notes is correct when the group has been realized and matches the number of involved notes in the original score; a miss when the beaming has not been rebuilt in the reconstructed score and the notes are independent; an add when a group of notes is produced that does not exist in the original score; or a fault if the group of beamed notes is realized with a different number of notes with respect to the original score.

Time Signature (and Time Signature Change)

This evaluation focuses on verification of the time signature in terms of the numbers involved in the fraction. It considers both the fraction in the first measure and possible time-signature changes inside the score. This evaluation results in a correct when numerator and denominator are correct; a fault when the numerator or denominator is incorrect; a miss when the time signature does not exist in the reconstructed score; and an add when a time signature is produced that does not exist in the original score.

Key Signature (and Key Signature Change)

This evaluation focuses on verification of a piece's tonality. Here, tonality is linked to the number of accidentals used in representing the key signature. A key signature or a key-signature change must be considered. The results are counted as correct when the sharps or flats (and possible naturals involved in the tonality change) are correct; a fault when the number and/or position of the accidental symbols are incorrect; a miss when the key signature does not exist in the reconstructed score; and an add when a key signature is produced that does not exist in the original score.

Symbols Below or Above Notes

This evaluation focuses on the verification of ornaments and accents (e.g., staccato, accent, turn, mordent, trill, tenuto, etc.) associated with notes. An ornament or accent symbol is correct when it has been associated with the correct note; a fault when an incorrect symbol has been associated with the related note; a miss when the symbol has not been associated with the related note or has been skipped, as in the case of a missed note; and an add when the symbol has been associated with a note that, in the original score, includes neither that kind of symbol nor any symbol.

Grace Notes

This evaluation focuses on the recognition of acciaccaturas and appoggiaturas. In general, grace notes are related to a single symbol, but they can be used to build a group of small notes. For this evaluation, groups of small notes are considered as a unique symbol. A single grace note or a group of small notes is correct when the notes are correct in all their features (pitch, duration, and type of grace note), being perfectly reconstructed; a fault when a feature among pitch, duration, and type of note is incorrect; a miss when the notes do not exist in the reconstructed score; or an add when notes are reproduced even if they do not exist in the original score.

Slurs and Bends

This evaluation addresses the recognition of horizontal symbols: slurs, ties, and bends. Such horizontal symbols are rated correct when the start and the end note are correct; a fault when the start or end note is incorrect; a miss when the slur or bend does not exist in the reconstructed score; and an add when the slur or bend does not exist in the original score.

Augmentation Dots

This evaluation focuses on the recognition of augmentation dots and their association with notes. In the presence of multiple dots, they must be considered one by one. Each dot is rated correct when the augmentation dot has been linked to the correct note; a fault or a miss when the augmentation dot has not been linked to the note; and an add when the augmentation dot does not exist on the note in the original score.

Clefs (and Clef Changes)

This evaluation focuses on the recognition of clefs. A clef symbol can indicate different clefs; for example, the alto clef and tenor clef are represented by the same symbol. For this evaluation, the position of the symbol on the staff together with its shape must be taken into account. A clef is judged correct when shape and position are correct; a fault when shape or position is incorrect; a miss when the clef does not exist in the reconstructed score; and an add when the clef does not exist in the original score.

Irregular Note Groups

This evaluation concerns the recognition of irregular groups, also called tuplets. An irregular group is a set of notes, not necessarily beamed, with a numeric indication defining the kind of tuplet. A tuplet is correct when the irregular group has been reproduced with the correct numeric indication, if any; a fault when the irregular group is not fully reproduced (for example, the numeric indication is not correct); a miss when the irregular group has not been recognized; or an add when the identified group is not an irregular group in the original score.

Number of Measures

This evaluation focuses on the recognition of measures through barline detection and recognition. A pair of barlines defines a staff region in which notes, rests, or other notation symbols are placed; such a region is called a measure. Because consecutive measures share a barline, failing to recognize barlines can generate different kinds of errors in measure recognition, such as the fusion of measures, the addition of empty measures, or the fragmentation of measures. Taking these considerations into account, a measure is considered correct when its boundaries are correctly recognized; a fault or a miss when a measure of the original score has not been reproduced; or an add when the measure does not exist in the original score or a measure has been divided into two measures. Note that the missed recognition of a barline can generate the fusion of two consecutive measures; in this case, a measure is lost and must be counted as a miss, whereas the reconstructed measure can be considered correct.

Number of Staves

Finally, this evaluation focuses on staff recognition. A staff in the music-notation score page is counted as correct when it is detectable in the reconstructed score; a fault when the staff has different features (e.g., number of lines); a miss when the staff is not identified; or an add when the staff is not in the original score.


Appendix B: Application of Basic-Symbol Metrics


Tables B1, B2, and B3 show the results of the metrics for basic symbols, estimated respectively for SmartScore, O3MR, and SharpEye2 on the seven test cases. Each measurement was performed by more than one person; then, a verification of the results was performed to guarantee the correctness of the estimates. Each table is divided into subtables, and each of these is further divided into two sections: the expected-values column (symbols counted in the original score) and the columns under the "found" heading (symbols counted in the reconstructed score).


Table B1. Basic-symbol metrics for SmartScore on the seven test cases.


Table B2. Basic-symbol metrics for O3MR on the seven test cases.


Table B3. Basic-symbol metrics for SharpEye2 on the seven test cases.


Appendix C: Application of Composite-Symbol Metrics


Table C1 shows the values of the metrics estimated on composite symbols for the results obtained by using the OMR tools SmartScore, O3MR, and SharpEye2. Measurements were made by more than one person and then compared, as explained in Appendix B.

Table C1. Composite-symbol metrics for SmartScore, O3MR, and SharpEye2 on the seven test cases.

