
Proceedings of International Conference on Advancements in Engineering and Technology

www.iaetsd.in

A SURVEY ON ONE CLASS CLUSTERING HIERARCHY FOR PERFORMING DATA LINKAGE
S. Rajalakshmi,
Assistant Professor, Department of CSE,
Velammal Engineering College, Anna University,
Chennai, India.
raji780@yahoo.co.in

Abstract: Data linkage is the process of matching records from
several databases that refer to entities of the same type.
Linkage is possible even when the entities do not share a common
identifier. As today's databases grow, the complexity of the
matching process becomes a major challenge for data linkage.
Many indexing techniques have been developed for data linkage,
but each of them has efficiency limitations. In this paper, a
data linkage method called the One Class Clustering Tree (OCCT)
is presented, which addresses these challenges and performs data
linkage for entities that do not share a common identifier. The
technique builds a tree whose inner nodes represent the features
of the first set of entities and whose leaves represent the
features of the matching entities in the second set. OCCT uses
dedicated splitting criteria and pruning methods to perform the
linkage.
Keywords: linkage, classification, clustering, splitting, decision
tree induction, indexing techniques.

I. INTRODUCTION

Data linkage is the process of identifying different entries that
refer to the same entity across different data sources [1]. Its
main aim is to join datasets that do not share a common
identifier or foreign key. Linking records in this way
consolidates large datasets into smaller ones, and it also helps
to remove duplicate records within a dataset, a task known as
deduplication [19]. Data linkage can be classified into two
types: one-to-one and one-to-many [15]. In one-to-one data
linkage, the aim is to link an entity from one dataset with the
matching entity from the other dataset. In one-to-many data
linkage, the aim is to link an entity from the first dataset with
a group of matching entities from the other dataset. This paper
describes a data linkage approach called the One Class Clustering
Tree (OCCT), which is aimed at one-to-many data linkage. OCCT is
preferable to the indexing techniques because it can easily be
translated into linkage rules.

ISBN NO : 978 - 1502893314

A. Jayanthi,
M.E. (CSE), Department of CSE,
Velammal Engineering College, Anna University,
Chennai, India.
jayanthiarumugamk@gmail.com

The paper is structured as follows: Section II reviews indexing
techniques, Section III describes data linkage using OCCT, and
Section IV concludes the paper.
II. INDEXING TECHNIQUES
This section discusses the various indexing techniques and the
differences among them. The indexing process of data linkage can
be divided into two phases:
1) Build: All records in the database are read and their Blocking
Key Values (BKVs) are generated. Most indexing techniques use an
inverted index [6], in which the identifiers of records that have
the same BKV are inserted into the same inverted index list.
2) Retrieve: For every block, the list of record identifiers is
retrieved from the inverted index and the candidate record pairs
are generated from that list.
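The two phases can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from any of the cited works: the blocking key (the first three letters of a hypothetical `surname` field) and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical BKV: first three letters of the surname, upper-cased.
    return record["surname"][:3].upper()


def build_index(records):
    # Build phase: read every record and file its identifier under its BKV.
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[blocking_key(rec)].append(rec_id)
    return index


def candidate_pairs(index):
    # Retrieve phase: for every block, pair up the identifiers in its list.
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Assumed sample data for illustration only.
records = {
    1: {"surname": "Smith"},
    2: {"surname": "Smithe"},
    3: {"surname": "Smyth"},
    4: {"surname": "Jones"},
}
index = build_index(records)
pairs = candidate_pairs(index)
```

With these records only the pair (1, 2) is generated: "Smyth" receives the key SMY and lands in a different block from "Smith" and "Smithe", even though it is plausibly a variant of the same name.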
A. TRADITIONAL BLOCKING
Traditional blocking is one of the techniques used in data
linkage [1]. In traditional blocking, all records that have the
same BKV are inserted into the same block, and the records within
that block are compared with each other. This technique can be
implemented using an inverted index [6]. Its main disadvantage is
that errors and variations in the record fields used to generate
the BKVs lead to records being inserted into the wrong block.
The second disadvantage is that the sizes of the generated blocks
depend on the frequency distribution of the BKVs, so it is
difficult to predict the total number of candidate record pairs
that will be generated.
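To see why the number of candidate pairs depends on the block sizes, consider deduplicating a single database whose blocks contain n_b records each (the symbol n_b is introduced here for illustration). Every record in a block is compared with every other record in that block, so:

```latex
\text{candidate pairs} = \sum_{b} \frac{n_b\,(n_b - 1)}{2}
```

For example, 120 records split into blocks of 100, 10 and 10 give 4950 + 45 + 45 = 5040 pairs, while the same 120 records split into three blocks of 40 give 3 x 780 = 2340 pairs. A skewed BKV distribution therefore more than doubles the work in this case.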
International Association of Engineering and Technology for Skill Development

B. SORTED NEIGHBORHOOD INDEXING
Sorted neighborhood indexing sorts the database according to the
BKVs and then moves a window of a fixed number of records over
the sorted values; candidate record pairs are generated only from
the records within the current window. It has three variants:
the sorted array based approach [4], the inverted index based
approach [14] and the adaptive sorted neighborhood approach [16].
The sorted array based approach is unsuitable when the window
size is small. The inverted index based approach shares the
drawback of traditional blocking and is inefficient because it
takes a long time to split the entities. The adaptive sorted
neighborhood approach is unsuitable when the window size is too
large.
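The sorted array based variant can be sketched as follows. The function name, the use of the full surname as the sorting key and the sample data are assumptions made for this illustration.

```python
def sorted_neighborhood_pairs(bkvs, window):
    """Slide a fixed-size window over the records sorted by BKV.

    bkvs maps record id -> blocking key value; window is the number of
    records in the window (at least 2).
    """
    # Sort record ids by their BKV (ties broken by id for determinism).
    order = sorted(bkvs, key=lambda rid: (bkvs[rid], rid))
    pairs = set()
    for i in range(len(order)):
        # Pair record i with the next (window - 1) records in sorted order.
        for j in range(i + 1, min(i + window, len(order))):
            pairs.add(tuple(sorted((order[i], order[j]))))
    return pairs


# Assumed sample data for illustration only.
bkvs = {1: "jones", 2: "smith", 3: "smithe", 4: "smyth", 5: "taylor"}
pairs = sorted_neighborhood_pairs(bkvs, window=3)
```

With a window of three records, "smith" (2) is paired with "smyth" (4) because they are two positions apart in the sorted order, whereas "jones" (1) and "smyth" (4) are never compared. If more records share a sorting key than fit in the window, some of them are not compared, which is the small-window weakness noted above.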
C. Q-GRAM BASED INDEXING
Q-gram based indexing overcomes the drawback of traditional
blocking and sorted neighborhood indexing. The main aim of this
technique is to index the database such that records that have a
similar, and not just the same, BKV are inserted into the same
block [8]. However, a much larger number of candidate record
pairs is generated, leading to a more time-consuming process.
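One common formulation of q-gram indexing, assumed here rather than taken from the text above, splits each BKV into bigrams, generates every sub-list obtained by deleting bigrams down to a minimum length set by a threshold `t`, and uses each concatenated sub-list as an index key. All names and sample values below are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(s, q=2):
    # Overlapping substrings of length q, in order of appearance.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def sublists(grams, min_len):
    # All sub-lists reachable by repeatedly deleting one q-gram,
    # stopping once a sub-list has min_len elements.
    result = {tuple(grams)}
    frontier = {tuple(grams)}
    while frontier:
        shorter = set()
        for g in frontier:
            if len(g) > min_len:
                for i in range(len(g)):
                    shorter.add(g[:i] + g[i + 1:])
        shorter -= result
        result |= shorter
        frontier = shorter
    return result


def qgram_pairs(bkvs, t):
    index = defaultdict(set)
    for rid, bkv in bkvs.items():
        grams = qgrams(bkv)
        min_len = max(1, int(len(grams) * t))
        for sub in sublists(grams, min_len):
            index["".join(sub)].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Assumed sample data for illustration only.
names = {1: "smith", 2: "smyth", 3: "jones"}
loose = qgram_pairs(names, t=0.5)
strict = qgram_pairs(names, t=0.75)
```

With `t=0.5`, "smith" and "smyth" share the key "smth" (bigrams "sm" and "th") and become a candidate pair; with `t=0.75` they share no key. Lowering the threshold catches more variants but multiplies the number of keys per record, which illustrates the extra cost mentioned above.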
D. SUFFIX ARRAY-BASED INDEXING
Suffix array-based indexing is one of the most efficient
approaches compared with the previous techniques. The basic idea
is to insert the BKVs and their suffixes into a suffix array
based inverted index [11]. A variant called robust suffix array
based indexing merges the inverted index lists of suffix values
that are similar to each other in the sorted suffix array [13].
This merging step, however, takes a lot of time.
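A minimal sketch of the basic (non-robust) variant is shown below. The minimum suffix length of 4 and the sample names are assumptions for illustration, and the merging of similar suffixes used by the robust variant is not implemented.

```python
from collections import defaultdict
from itertools import combinations


def suffixes(bkv, min_len):
    # The BKV itself plus every suffix of at least min_len characters.
    if len(bkv) <= min_len:
        return [bkv]
    return [bkv[i:] for i in range(len(bkv) - min_len + 1)]


def suffix_index(bkvs, min_len=4):
    # Inverted index: suffix -> set of record ids whose BKV has that suffix.
    index = defaultdict(set)
    for rid, bkv in bkvs.items():
        for suf in suffixes(bkv, min_len):
            index[suf].add(rid)
    return index


def candidate_pairs(index):
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Assumed sample data for illustration only.
names = {1: "catherine", 2: "katherine", 3: "christine"}
pairs = candidate_pairs(suffix_index(names))
```

Here "catherine" and "katherine" differ only in the first letter, so they share the suffix "atherine" (and all shorter ones) and become a candidate pair, while "christine" shares no suffix of four or more characters with either.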

III. DATA LINKAGE USING OCCT
OCCT is induced using one of the splitting criteria. The
splitting criterion determines which attribute should be used at
each step of building the tree. OCCT uses a prepruning process
to decide which branches should be trimmed.

[Figure 1 boxes, in order: Database A and Database B; Construct OCCT using all entities; Prepruning technique; Compare entities; Matching entity / Non-matching entity]

E. CANOPY CLUSTERING
Canopy clustering [14] is built by converting the BKVs into lists
of tokens, with each unique token becoming a key in the inverted
index. Canopies are then formed using either a threshold-based
approach or a nearest-neighbor-based approach. The drawback of
canopy clustering is similar to that of the sorted neighborhood
technique based on the sorted array.
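The threshold-based variant can be sketched as below. This is a simplified illustration under stated assumptions: tokens are bigrams, similarity is the Jaccard coefficient, the canopy center is chosen deterministically (smallest id) rather than at random, and similarities are computed by brute force instead of through the inverted index. The thresholds `loose` and `tight` are hypothetical parameter names.

```python
def bigrams(s):
    # Token set of a BKV: its overlapping two-character substrings.
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canopies(bkvs, loose, tight):
    """Form overlapping clusters; requires loose <= tight."""
    tokens = {rid: bigrams(v) for rid, v in bkvs.items()}
    pool = set(bkvs)
    result = []
    while pool:
        center = min(pool)  # deterministic choice for reproducibility
        sims = {rid: jaccard(tokens[center], tokens[rid]) for rid in pool}
        # Everything within the loose threshold joins this canopy.
        result.append(sorted(rid for rid, s in sims.items() if s >= loose))
        # Records within the tight threshold cannot start or join new canopies.
        pool -= {rid for rid, s in sims.items() if s >= tight}
        pool.discard(center)
    return result


# Assumed sample data for illustration only.
names = {1: "smith", 2: "smyth", 3: "smithe", 4: "jones"}
clusters = canopies(names, loose=0.3, tight=0.7)
```

In this run "smyth" (2) joins the canopy centered on "smith" but is not similar enough to be removed from the pool, so it later forms a canopy of its own; this overlap between canopies is what distinguishes the technique from disjoint blocking.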
F. STRING-MAP-BASED INDEXING
String-map-based indexing [9] maps the BKVs to objects in a
multidimensional Euclidean space such that the distances between
pairs of strings are preserved. Groups of similar strings are
then generated by extracting the objects that are close to each
other. However, this technique fails when the database is too
large or too small.
Hence, all of the indexing techniques discussed above have
drawbacks in the data linkage process. To overcome these
problems, a new approach called the One Class Clustering Tree is
proposed. It uses four splitting criteria, namely the
coarse-grained Jaccard coefficient, the fine-grained Jaccard
coefficient, Least Probable Intersection (LPI) and Maximum
Likelihood Estimation (MLE), for splitting the data, together
with pruning techniques.
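Both Jaccard-based criteria build on the standard Jaccard coefficient between two sets. As a reminder (the symbols S and T are illustrative and do not come from the text above):

```latex
J(S, T) = \frac{|S \cap T|}{|S \cup T|}, \qquad 0 \le J(S, T) \le 1
```

For example, S = {a, b, c} and T = {b, c, d} share two of four distinct elements, so J(S, T) = 2/4 = 0.5. How the coarse-grained and fine-grained variants apply this measure to candidate splits is specified in [20].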

FINAL RESULT
Fig 1: Work Flow Diagram
Initially, the tree is constructed so that its inner nodes
consist of attributes and its leaves represent clusters of
matching entities. Secondly, prepruning is applied: the
algorithm stops expanding a branch whenever the sub-branch does
not improve the accuracy of the model. OCCT uses a probabilistic
model to find the similar entities that are to be matched, and
this probabilistic approach helps to avoid overfitting. For
these reasons, OCCT is preferred over the indexing techniques
for data linkage.
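The interplay of splitting and prepruning can be illustrated with a deliberately simplified sketch. It is not the OCCT algorithm of [20]: the split score (average pairwise Jaccard overlap between the sets of matched values in each partition), the stopping rule `max_overlap`, the dictionary-based tree and the sample data are all assumptions made for this example, and the leaves hold plain value sets rather than probabilistic models.

```python
from itertools import combinations


def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def partition(pairs, attr):
    # Group matched (a_record, b_value) pairs by the A-side attribute value.
    groups = {}
    for a_rec, b_val in pairs:
        groups.setdefault(a_rec[attr], []).append((a_rec, b_val))
    return groups


def split_score(pairs, attr):
    # Lower is better: partitions whose B-side value sets barely overlap.
    b_sets = [{b for _, b in g} for g in partition(pairs, attr).values()]
    if len(b_sets) < 2:
        return 1.0  # the attribute does not actually split the data
    sims = [jaccard(s, t) for s, t in combinations(b_sets, 2)]
    return sum(sims) / len(sims)


def build(pairs, attrs, max_overlap=0.5):
    if attrs:
        best = min(attrs, key=lambda a: split_score(pairs, a))
        # Prepruning: expand only if the best split separates the B values.
        if split_score(pairs, best) < max_overlap:
            rest = [a for a in attrs if a != best]
            children = {v: build(g, rest, max_overlap)
                        for v, g in partition(pairs, best).items()}
            return {"attr": best, "children": children}
    return {"leaf": sorted({b for _, b in pairs})}


# Assumed sample data: A-side records paired with a matching B-side value.
pairs = [
    ({"city": "Chennai", "plan": "basic"}, "cricket"),
    ({"city": "Chennai", "plan": "gold"}, "cricket"),
    ({"city": "Delhi", "plan": "basic"}, "hockey"),
    ({"city": "Delhi", "plan": "gold"}, "hockey"),
]
tree = build(pairs, ["city", "plan"])
```

In this toy run the tree splits on `city`, because the two cities map to disjoint sets of B-side values, and then stops: splitting further on `plan` would produce partitions with identical value sets, so the branch is not expanded.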
IV. CONCLUSION
This paper described the OCCT approach, which performs
one-to-many data linkage. The method is based on a one-class
decision tree model that summarizes the knowledge of which
records should be linked together. Because it uses a one-class
approach, it gives more accurate results. The OCCT model has
also proved successful in three different domains, namely data
leakage prevention, recommender systems and fraud detection.


REFERENCES
1. I.P. Fellegi and A.B. Sunter, "A Theory for Record Linkage,"
J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210,
Dec. 1969.
2. D.D. Dorfman and E. Alf, "Maximum-Likelihood Estimation of
Parameters of Signal-Detection Theory and Determination of
Confidence Intervals-Rating-Method Data," J. Math. Psychology,
vol. 6, no. 3, pp. 487-496, 1969.
3. J.R. Quinlan, "Induction of Decision Trees," Machine Learning,
vol. 1, no. 1, pp. 81-106, March 1986.
4. M.A. Hernandez and S.J. Stolfo, "The Merge/Purge Problem for
Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of
Data (SIGMOD 95), 1995.
5. P. Langley, Elements of Machine Learning, San Francisco,
Morgan Kaufmann, 1996.
6. I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes,
second ed. Morgan Kaufmann, 1999.
7. S. Guha, R. Rastogi, and K. Shim, "Rock: A Robust Clustering
Algorithm for Categorical Attributes," Information Systems,
vol. 25, no. 5, pp. 345-366, July 2000.
8. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas,
S. Muthukrishnan, and D. Srivastava, "Approximate String Joins
in a Database (Almost) for Free," Proc. 27th Int'l Conf. Very
Large Data Bases (VLDB 01), pp. 491-500, 2001.
9. L. Jin, C. Li, and S. Mehrotra, "Efficient Record Linkage in
Large Data Sets," Proc. Eighth Int'l Conf. Database Systems for
Advanced Applications (DASFAA 03), pp. 137-146, 2003.
10. I.S. Dhillon, S. Mallela, and D.S. Modha,
"Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD
Int'l Conf. Knowledge Discovery and Data Mining, pp. 89-98, 2003.
11. A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for
Multi-Source Information Integration," Proc. Int'l Workshop
Challenges in Web Information Retrieval and Integration
(WIRI 05), 2005.
12. A.J. Storkey, C.K.I. Williams, E. Taylor, and R.G. Mann, "An
Expectation Maximisation Algorithm for One-to-Many Record
Linkage," University of Edinburgh Informatics Research Report,
2005.
13. P. Christen, "A Comparison of Personal Name Matching:
Techniques and Practical Issues," Proc. IEEE Sixth Data Mining
Workshop (ICDM 06), 2006.

14. P. Christen, "Towards Parameter-Free Blocking for Scalable
Record Linkage," Technical Report TR-CS-07-03, Dept. of Computer
Science, The Australian Nat'l Univ., 2007.
15. P. Christen and K. Goiser, "Quality and Complexity Measures
for Data Linkage and Deduplication," Quality Measures in Data
Mining, vol. 43, pp. 127-151, 2007.
16. S. Yan, D. Lee, M.Y. Kan, and L.C. Giles, "Adaptive Sorted
Neighborhood Methods for Efficient Record Linkage," Proc.
Seventh ACM/IEEE-CS Joint Conf. Digital Libraries (JCDL 07),
2007.
17. A. Gershman et al., "A Decision Tree Based Recommender
System," in Proc. the 10th Int. Conf. on Innovative Internet
Community Services, pp. 170-179, 2010.
18. M. Yakout, A.K. Elmagarmid, H. Elmeleegy, M. Quzzani, and
A. Qi, "Behavior Based Record Linkage," in Proc. of the VLDB
Endowment, vol. 3, no. 1-2, pp. 439-448, 2010.
19. P. Christen, "A Survey of Indexing Techniques for Scalable
Record Linkage and Deduplication," IEEE Trans. Knowledge and
Data Eng., vol. 24, no. 9, pp. 1537-1555, Sept. 2012,
doi:10.1109/TKDE.2011.127.
20. M. Dror, A. Shabtai, L. Rokach, and Y. Elovici, "OCCT: A
One-Class Clustering Tree for Implementing One-to-Many Data
Linkage," IEEE Trans. on Knowledge and Data Engineering,
TKDE-2011-09-0577, 2013.
