An Efficient Subtopic Retrieval System Using Hybrid Approach

International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 7, July 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 2355

Manpreet Kaur^1, Usvir Kaur^2
^1 Research Fellow, ^2 Asst. Professor
^1,2 Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab.

Abstract: Subtopic retrieval is the task of finding
documents that cover many different subtopics of a query
topic. This means that the utility of a document in a
ranking depends on the other documents in the ranking.
Subtopic retrieval poses challenges both for improving
performance and for developing effective algorithms, and
current ranking systems offer limited support for it. The
two main post-processing techniques for search results are
clustering and diversification, both of which have been
used in search methods for search engine optimization over
the last couple of years. In this work we propose an
efficient hybrid approach that uses both techniques:
agglomerative clustering for clustering and max-min
diversification for diversification. This reduces the load
on the processor during search optimization. We also
compare our hybrid results with previous results in terms
of precision and recall, and the comparison shows that the
new hybrid approach gives better results.

Keywords: SEO, Content retrieval, Hybrid approach,
Clustering, Diversification.

I. INTRODUCTION

Search engine optimisation (SEO) is the process of
improving the volume and quality of traffic to a web
site from search engines via "natural" ("organic" or
"algorithmic") search results. Usually, the earlier a
site is presented in the search results, or the higher
it "ranks", the more searchers will visit that site.
SEO can also target different kinds of search,
including image search, local search, and
industry-specific vertical search engines [1]. SEO is
not necessarily an appropriate strategy for every web
site, and other Internet marketing strategies can be
much more effective, depending on the site operator's
goals. A successful Internet marketing campaign may
drive organic search results to pages, but it may also
involve the use of paid advertising on search engines
and other pages, building high quality web pages to
engage and persuade, addressing technical issues that
may keep search engines from crawling and indexing
those sites, setting up analytics programs to enable
site owners to measure their successes, and improving
a site's conversion rate.

Search engine marketing, or SEM, is a form of Internet
marketing that seeks to promote websites by increasing
their visibility in the search engine results pages
(SERPs) and has a proven ROI (Return on Investment)
[1]. According to the Search Engine Marketing
Professionals Organization, SEM methods include search
engine optimisation (SEO), paid placement, and paid
inclusion. Other sources, including the New York
Times, define SEM as the practice of buying paid
search listings, as distinct from SEO, which seeks to
obtain better free search listings.

Clustering:
In search results, the listings from any one site are
usually limited to a certain number and grouped
together to make the search results appear neat and
organized and to ensure diversity amongst the top
ranked results [2]. Clustering can also refer to a
technique that allows search engines to group hubs and
authorities on a specific topic together to further
enhance their value by showing their relationships.
Clustering is the unsupervised classification of
patterns (observations, data items, or feature
vectors) into groups (clusters). The clustering
problem has been addressed in many contexts and by
researchers in many disciplines; this reflects its
broad appeal and usefulness as one of the steps in
exploratory data analysis. However, clustering is a
difficult problem combinatorially, and differences in
assumptions and contexts in different communities have
made the transfer of useful generic concepts and
methodologies slow to occur. Jain et al. [6] present
an overview of pattern clustering methods from a
statistical pattern recognition perspective, with the
goal of providing useful advice and references to
fundamental concepts accessible to the broad community
of clustering practitioners.

Diversification:
With the growth of the Web and the variety of search
engine users, web search effectiveness and user
satisfaction can be improved by diversification.
Minack et al. [7] survey recent approaches to search
result diversification in both full-text and
structured content search. They identify commonalities
in the proposed methods and describe an overall
framework for result diversification. They discuss
different diversity dimensions and measures, as well
as possible ways of considering the relevance /
diversity trade-off, and summarise existing efforts to
evaluate diversity in search. For each of these steps,
they also point out aspects that are missing in
current approaches as possible directions for future
work.

Instead of clustering the top search results by their
similarity, one can aim at re-ranking them on the
basis of criteria that maximize their diversity, so as
to present top results that are as different from each
other as possible. This technique, known as
diversification of search results, is a recent
research topic that, again, deals with the query
ambiguity issue. To some extent, today's search
engines, such as Google and Yahoo!, apply some
diversification technique to their top results.

1) Max-sum Diversification. The first objective
combines the sums of the relevance and diversity
measures as a weighted sum.
2) Max-min Diversification. The second objective aims
at maximizing the sum of the minimum relevance and the
minimum dissimilarity within the set.
3) Average dissimilarity Diversification. The third
objective adds the original relevance of a result to
its average dissimilarity with respect to all other
results in the set. The sum over the whole set is to
be maximised.
4) Max-sum of max-score Diversification. Similarly to
max-sum diversification, this objective maximises the
sum of dissimilarity of the result set, but it only
produces sets that have the maximum relevance sum.
Therefore, it does not find sets with higher diversity
scores but a slightly lower relevance sum.
5) Max-product Diversification. Based on the already
selected results, Zhai et al. choose the next result
by maximizing the parameterised product of the
relevance of the next result and its dissimilarity to
the selected results.
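
The first two objectives can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in this work: the Jaccard distance over term sets, the weight lam, and the toy relevance scores and term sets are all hypothetical choices introduced here.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Dissimilarity of two term sets: 1 - |intersection| / |union|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def max_sum_score(selected, relevance, terms, lam=1.0):
    """Max-sum: total relevance plus lam times total pairwise dissimilarity."""
    total_rel = sum(relevance[d] for d in selected)
    total_dis = sum(jaccard_distance(terms[a], terms[b])
                    for a, b in combinations(selected, 2))
    return total_rel + lam * total_dis


def max_min_score(selected, relevance, terms, lam=1.0):
    """Max-min: minimum relevance plus lam times minimum pairwise dissimilarity."""
    min_rel = min(relevance[d] for d in selected)
    pair_dis = [jaccard_distance(terms[a], terms[b])
                for a, b in combinations(selected, 2)]
    min_dis = min(pair_dis) if pair_dis else 0.0
    return min_rel + lam * min_dis


# Hypothetical toy data: three results for an ambiguous query.
relevance = {"a": 0.9, "b": 0.8, "c": 0.4}
terms = {
    "a": {"jaguar", "car"},
    "b": {"jaguar", "car", "speed"},
    "c": {"jaguar", "cat"},
}

# Exhaustively score every pair; feasible only for very small candidate sets.
best = max(combinations(sorted(relevance), 2),
           key=lambda s: max_min_score(s, relevance, terms, lam=2.0))
print(best)
```

With a larger lam the max-min score favours pairs whose members are far apart even when one of them is less relevant; with a smaller lam relevance dominates.
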
II. RELATED WORK
A number of studies show the benefits of clustering
and diversification techniques.
Weideman [1] carried out an empirical study on the
application of ethical SEO (Search Engine
Optimization) techniques to a website in an attempt to
increase its visibility to the three main search
engines. Neither paid placement nor any black hat
techniques were considered. The results indicated that
the website occupied first position on the three top
search engines for a number of selected keyword
definitions. It can be concluded that the
implementation of ethical methodologies, careful
placement of text and the use of key phrases can
dramatically increase the visibility of a real-world
website.

Di Marco and Navigli [2] present a novel approach to
web search result clustering based on the automatic
discovery of word senses from raw text, a task
referred to as Word Sense Induction (WSI). Key to
their approach is to first acquire the senses (i.e.,
meanings) of an ambiguous query and then cluster the
search results based on their semantic similarity to
the induced word senses. Their experiments, conducted
on datasets of ambiguous queries, show that the
approach outperforms both web clustering and search
engines.

Wang et al. [3] aim at mining the subtopics of a query
either indirectly from the returned results of
retrieval systems or directly from the query itself,
in order to diversify the search results. For the
indirect subtopic mining approach, clustering the
retrieval results and summarizing the content of the
clusters is investigated. In addition, labeling topic
categories and concept tags on each returned document
is explored. For the direct subtopic mining approach,
several external resources, such as Wikipedia, the
Open Directory Project, search query logs, and the
related search services of search engines, are
consulted. Furthermore, they propose a diversified
retrieval model to rank documents with respect to the
mined subtopics for balancing relevance and diversity.
Experiments are conducted on the ClueWeb09 dataset
with the topics of the TREC09 and TREC10 Web Track
diversity tasks. Experimental results show that the
proposed subtopic-based diversification algorithm
significantly outperforms the state-of-the-art models
in the TREC09 and TREC10 Web Track diversity tasks.
The best performance the proposed algorithm achieves
is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214
on TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201,
and α#-nDCG@10 0.311 on TREC10. The results indicate
that the subtopic mining technique with up-to-date
user search query logs is the most effective way to
generate the subtopics of a query, and that the
proposed subtopic-based diversification algorithm can
select documents covering various subtopics.

Cai et al. [4] proposed a hierarchical clustering
method using visual, textual and link analysis. By
using a vision-based page segmentation algorithm, a
web page is partitioned into blocks, and the textual
and link information of an image can be accurately
extracted from the block containing that image. By
using block-level link analysis techniques, an image
graph can be constructed. Spectral techniques are then
applied to find a Euclidean embedding of the images
that respects the graph structure. Thus each image has
three kinds of representation: a visual feature based
representation, a textual feature based representation
and a graph based representation. Using spectral
clustering techniques, the search results can be
clustered into different semantic clusters. An image
search example illustrates the potential of these
techniques.

Santos et al. [5] introduce a novel probabilistic
framework for web search result diversification, which
explicitly accounts for the various aspects associated
with an underspecified query. In particular, they
diversify a document ranking by estimating how well a
given document satisfies each uncovered aspect and the
extent to which different aspects are satisfied by the
ranking as a whole. They thoroughly evaluate the
framework in the context of the diversity task of the
TREC 2009 Web track. Moreover, they exploit query
reformulations provided by three major web search
engines (WSEs) as a means to uncover different query
aspects. The results attest the effectiveness of the
framework when compared to state-of-the-art
diversification approaches in the literature.
Additionally, by simulating an upper-bound query
reformulation mechanism from official TREC data, they
draw useful insights regarding the effectiveness of
the query reformulations generated by the different
WSEs in promoting diversity.

Carpineto et al. [7] present a comparative study of
the performance of clustering and diversification,
using a set of complementary evaluation measures that
can be applied to both partitions and ranked lists,
and two specialized test collections focusing on broad
and ambiguous queries, respectively. The main finding
of their experiments is that diversification of top
hits is more useful for quick coverage of distinct
subtopics, whereas clustering is better for full
retrieval of single subtopics, with a better balance
in performance achieved by generating multiple subsets
of diverse search results. They also found that there
is little scope for improvement over the search engine
baseline unless one is interested in strict
full-subtopic retrieval, and that search results
clustering methods do not perform well on queries with
low divergence subtopics, mainly because of the
difficulty of generating discriminative cluster
labels.

III. PROPOSED METHODOLOGY

Clustering and diversification are two essential
methods that have been used for the last couple of
years in search methods for search engine
optimization. Clustering has been used for content
retrieval (inner) whereas diversification is used for
topic retrieval. We design an algorithm that
implements the features of both clustering and
diversification so that the process becomes faster.
The basic purpose of this work is to reduce the load
on the processor during search optimization. We also
aim to create a hybrid architectural algorithm that
makes the search engine optimization process more
effective.
3.1 Proposed Model

The proposed model focuses on the following
objectives, which help to reduce the load on the
processor.

a. To create a hybrid architectural algorithm using
clustering and diversification.
b. To increase the effectiveness of the system in
terms of precision and recall.
c. To implement the hybrid algorithm by combining
features of both clustering and diversification.
d. To improve the optimization and search operation
for a query.

3.2 Basic Block Design

In the proposed work, a hybrid algorithm is used that
combines clustering and diversification. Clustering
has been used for content retrieval (inner) whereas
diversification is used for topic retrieval. We use
the AMBIENT data set to assess the complexity of SEO
and database searching.

Fig 1: Block Design of Hybrid Approach

The block design of the hybrid system is shown in Fig
1. We first apply agglomerative clustering to the data
set and then apply max-min diversification to the
result. An agglomerative approach begins with each
pattern in a distinct (singleton) cluster and
successively merges clusters together until a stopping
criterion is satisfied. The max-min objective aims at
maximizing the sum of the minimum relevance and the
minimum dissimilarity within the set. This gives us
the hybrid results, which we then compare with the
previous systems that use only clustering or only
diversification.
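
The agglomerative step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the single-linkage merge rule, the Jaccard distance over term sets, the distance threshold used as the stopping criterion, and the toy documents are all assumptions made for the example.

```python
def jaccard_distance(a, b):
    """Dissimilarity of two term sets: 1 - |intersection| / |union|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def agglomerative_clusters(items, distance, threshold):
    """Single-linkage agglomerative clustering.

    Starts with one singleton cluster per item and repeatedly merges the
    two closest clusters until the closest pair is farther apart than
    `threshold` (the stopping criterion assumed here).
    Returns clusters as lists of indices into `items`.
    """
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best
        clusters[x].extend(clusters[y])
        del clusters[y]
    return clusters


# Hypothetical term sets for four results of the query "jaguar".
docs = [
    {"jaguar", "car"},
    {"jaguar", "car", "speed"},
    {"jaguar", "cat"},
    {"jaguar", "cat", "animal"},
]
clusters = agglomerative_clusters(docs, jaccard_distance, threshold=0.5)
print(clusters)
```

The quadratic search for the closest pair keeps the sketch short; for the 100 URLs per query used in this work it is adequate, but a library implementation would be preferable for larger result sets.
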

3.3 Hybrid Algorithm

STEP1: Load the query; every single query returns 100
URLs.
STEP2: Apply agglomerative clustering to the result
URLs.
STEP3: Form clusters according to repeated URLs and
repeated words, and unique URLs and unique words.
STEP4: Repeat steps 2 to 3 for every query.
STEP5: Apply the diversification method to every
cluster and find the max-min query results for every
query.
STEP6: Report the results in terms of URLs and words.
STEP7: Compute the precision and recall values for
every query result and repeat steps 5 to 6.
STEP8: Repeat step 7 for every query.
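
Step 5 can be illustrated with a short Python sketch that takes the clusters produced by steps 2 to 3 as given and selects diverse results from each. It is a sketch under assumptions rather than the implementation used in this work: the greedy selection rule (a common heuristic for max-min style objectives), the Jaccard distance, the weight lam, the per-cluster cut-off, and the record fields url, relevance and terms are all hypothetical.

```python
def jaccard_distance(a, b):
    """Dissimilarity of two term sets: 1 - |intersection| / |union|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def greedy_max_min(cluster, k, lam=1.0):
    """Pick up to k results from one cluster.

    Seeds with the most relevant result, then repeatedly adds the result
    whose relevance plus lam times its minimum distance to the already
    selected results is largest.
    """
    remaining = sorted(cluster, key=lambda r: r["relevance"], reverse=True)
    if not remaining:
        return []
    selected = [remaining.pop(0)]
    while remaining and len(selected) < k:
        def gain(r):
            nearest = min(jaccard_distance(r["terms"], s["terms"])
                          for s in selected)
            return r["relevance"] + lam * nearest
        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
    return selected


def hybrid_results(clusters, per_cluster=2, lam=1.0):
    """Diversify inside every cluster and concatenate the chosen URLs."""
    urls = []
    for cluster in clusters:
        urls.extend(r["url"] for r in greedy_max_min(cluster, per_cluster, lam))
    return urls


# Hypothetical clusters for the query "jaguar" (output of steps 2 to 3).
cars = [
    {"url": "u1", "relevance": 0.90, "terms": {"jaguar", "car"}},
    {"url": "u2", "relevance": 0.85, "terms": {"jaguar", "car"}},
    {"url": "u3", "relevance": 0.60, "terms": {"jaguar", "car", "dealer", "price"}},
]
cats = [
    {"url": "u4", "relevance": 0.70, "terms": {"jaguar", "cat"}},
]
print(hybrid_results([cars, cats]))
```

In this toy run the near-duplicate u2 is skipped in favour of the less relevant but more distinct u3, which is the behaviour the diversification step is meant to provide.
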

Fig 2: Query Results using Hybrid Approach

The hybrid approach shows query results as in Fig 2:
the 1st box shows the clustering, the 2nd the
diversification, the 3rd the subtopics related to the
query, and the 4th the rejected words.
IV. RESULTS
The proposed approach is compared with previous
approaches in which only a single technique, either
clustering or diversification, is used for searching.
The effectiveness of the system is measured in terms
of precision and recall. Precision (also called
positive predictive value) is the fraction of
retrieved instances that are relevant, while recall
(also known as sensitivity) is the fraction of
relevant instances that are retrieved. Both precision
and recall are therefore based on an understanding and
measure of relevance.
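
The two measures follow directly from their definitions, as the short Python sketch below shows. The function name and the example URL lists are hypothetical and serve only to illustrate the arithmetic; they are not data from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 URLs returned, 3 URLs judged relevant, 2 in common.
p, r = precision_recall(["u1", "u2", "u3", "u4"], ["u1", "u3", "u5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved URLs are relevant (precision 2/4) and two of the three relevant URLs are retrieved (recall 2/3).
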

Fig 3: Clustering Results

Fig 4: Diversification Results

Fig 5: Hybrid Results

Table 1 shows the precision and recall values obtained
with the clustering, diversification and hybrid
approaches for the topic retrieval system. The hybrid
approach gives better results on both measures:
precision rises to 0.79 from 0.69 (clustering) and
0.65 (diversification), and recall rises to 0.54 from
0.50 and 0.49. The hybrid approach therefore offers a
more effective subtopic retrieval strategy.

Table 1: Precision and Recall Results
Technique        Precision  Recall
Clustering       0.69       0.50
Diversification  0.65       0.49
Hybrid           0.79       0.54

V. CONCLUSION

While search engines are good for many search tasks,
they may be less effective for satisfying broad or
ambiguous queries. The results on different subtopics
of a query are typically mixed together in the ranked
list, so the user may have to sift through a large
number of irrelevant items to locate those of
interest. The number of real user queries affected by
this problem is potentially large, partly because
informational queries have been estimated to account
for 80% of web queries, and partly because today
virtually any web query expressed in very few words
has multiple subtopics (or meanings, or
interpretations). In this paper, we proposed a new
hybrid approach using agglomerative clustering and
max-min diversification for a subtopic retrieval
system. The proposed algorithm gives better results in
terms of precision and recall than clustering or
diversification alone.
REFERENCES
[1] Melius Weideman, "Use of Ethical SEO Methodologies to
Achieve Top Rankings in Top Search Engines," Proceedings of the
2007 Computer Science and IT Education Conference.

[2] Di Marco and R. Navigli, "Clustering and Diversifying Web
Search Results with Graph-Based Word Sense Induction," 12
September 2012.

[3] Chieh-Jen Wang, Yung-Wei Lin, Ming-Feng Tsai, and Hsin-Hsi
Chen, "Mining subtopics from different aspects for diversifying
search results," Springer Science+Business Media New York,
2012.

[4] Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma, and Ji-Rong
Wen, "Hierarchical Clustering of WWW Image Search Results
Using Visual, Textual and Link Information," October 10-16,
2004, New York, New York, USA.

[5] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis,
"Exploiting Query Reformulations for Web Search Result
Diversification," April 26-30, 2010.

[6] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data
Clustering," ACM Computing Surveys, vol. 31, no. 3, September
1999.

[7] Enrico Minack, Gianluca Demartini, and Wolfgang Nejdl,
"Current Approaches to Search Result Diversification," L3S
Research Center, Leibniz Universität Hannover, 30167 Hannover,
Germany.

[8] B. J. Jansen, D. L. Booth, and A. Spink, "Determining the
informational, navigational, and transactional intent of Web
queries," Information Processing and Management, vol. 44, no. 3,
pp. 1251-1266, 2008.

[9] C. Zhai, W. W. Cohen, and J. Lafferty, "Beyond Independent
Relevance: Methods and Evaluation Metrics for Subtopic
Retrieval," in Proceedings of the 26th International ACM SIGIR
Conference on Research and Development in Information
Retrieval, Toronto, Canada. ACM Press, 2003, pp. 10-17.

[10] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and
W.-Y. Ma, "Improving web search results using affinity graph," in
Proceedings of the 28th International ACM SIGIR Conference on
Research and Development in Information Retrieval, Salvador,
Brazil. ACM Press, 2006, pp. 504-511.

[11] A. Swaminathan, C. Mathew, and D. Kirovski, "Essential
Pages," Microsoft Research, Tech. Rep. MSR-TR-2008-15, 2008.

[12] C. Zhai, W. W. Cohen, and J. Lafferty, "Diversifying search
results," in Proceedings of the Second ACM International
Conference on Web Search and Data Mining (WSDM 2009),
Barcelona, Spain. ACM Press, 2009, pp. 5-14.

[13] E. Di Giacomo, W. Didimo, L. Grilli, and G. Liotta, "Graph
Visualization Techniques for Web Clustering Engines," IEEE
Transactions on Visualization and Computer Graphics, vol. 13, no.
2, pp. 294-304, 2007.

[14] N. Kumar and K. Srinathan, "Automatic keyphrase extraction
from scientific documents using N-gram filtration technique," in
Proceedings of the 2nd European Semantic Web Conference,
Heraklion, Greece. Springer, 2008, pp. 199-208.

[15] D. Crabtree, X. Gao, and P. Andreae, "Improving web
clustering by cluster selection," in Proceedings of the 2005
IEEE/WIC/ACM International Conference on Web Intelligence,
Compiegne University of Technology, France. IEEE, 2005, pp.
172-178.

[16] L. Ruixu and J. Whang, "A new cluster merging algorithm of
suffix tree clustering," in IFIP TC12 International Conference on
Intelligent Information Processing (IIP 2006), Adelaide, Australia.
Springer, 2006, pp. 197-203.

[17] H. Chen and D. R. Karger, "Less is more: probabilistic
models for retrieving fewer relevant documents," in Proceedings
of the 29th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Seattle,
Washington, USA. ACM Press, 2006, pp. 429-436.
