Search Engine
Guided By :- XxX
5,981,044 65.5
1,294,261 14.1
1,142,364 12.5
206,969 2.3
175,074 1.9
91,288 1.0
55,122 0.6
27,002 0.3
26,462 0.3
24,681 0.3
Finding documents:
The required documents may be distributed over
tens of thousands of servers and must first be
located.
Formulating queries:
The user needs to express exactly what kind of
information is to be retrieved.
Determining relevance:
The system must determine whether a
document contains the required information or not.
Types of Search Engine
Heap:
It is a large unstructured chunk of
virtual memory where strings can be
appended.
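A minimal sketch of such a heap, assuming strings are stored as UTF-8 bytes in one growing buffer, terminated by a NUL byte, and referred to by their starting offset. The class name `StringHeap` and its methods are illustrative and do not come from the slides.

```python
class StringHeap:
    """Append-only string storage backed by one growing byte buffer.

    Illustrative only: the NUL terminator and offset-based "pointers"
    are assumptions made for this sketch.
    """

    def __init__(self):
        self._buf = bytearray()

    def append(self, text: str) -> int:
        """Append text and return the offset where it starts."""
        offset = len(self._buf)
        self._buf += text.encode("utf-8") + b"\x00"  # NUL marks the end
        return offset

    def read(self, offset: int) -> str:
        """Read back the string that starts at the given offset."""
        end = self._buf.index(b"\x00", offset)
        return self._buf[offset:end].decode("utf-8")


heap = StringHeap()
url_ptr = heap.append("http://example.com/")
title_ptr = heap.append("Example Domain")
print(heap.read(url_ptr), "|", heap.read(title_ptr))
```

Strings are only ever appended, so an offset handed out once stays valid for the lifetime of the heap.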
Hash table:
It is the third data structure, after url_table
and the heap, and has n entries.
Any URL can be run through a hash
function to produce a nonnegative
integer less than n.
All URLs that hash to the value k are
hooked together on a linked list.
Every entry in url_table is also entered
into the hash table.
The main use of the hash table is to start with a
URL and quickly determine
whether it is already present in url_table.
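The lookup described above can be sketched as follows. This is a simplified illustration under stated assumptions: Python lists stand in for the linked lists, `zlib.crc32` is used as the hash function, and the class name `UrlIndex` and the dictionary layout of each `url_table` entry are hypothetical.

```python
import zlib


class UrlIndex:
    """url_table plus a hash table with n buckets (illustrative sketch)."""

    def __init__(self, n: int = 8):
        self.n = n
        self.url_table = []                    # one entry per known URL
        self.buckets = [[] for _ in range(n)]  # chain of url_table indices per hash value

    def _hash(self, url: str) -> int:
        """Map a URL to a nonnegative integer less than n."""
        return zlib.crc32(url.encode("utf-8")) % self.n

    def find(self, url: str):
        """Return the url_table index of url, or None if it is not present."""
        for idx in self.buckets[self._hash(url)]:
            if self.url_table[idx]["url"] == url:
                return idx
        return None

    def add(self, url: str, title: str = "") -> int:
        """Insert url unless it is already known; return its url_table index."""
        idx = self.find(url)
        if idx is not None:
            return idx
        self.url_table.append({"url": url, "title": title})
        idx = len(self.url_table) - 1
        self.buckets[self._hash(url)].append(idx)  # every entry is also hashed
        return idx


index = UrlIndex(n=4)
first = index.add("http://a.example/", "Page A")
second = index.add("http://b.example/", "Page B")
again = index.add("http://a.example/")  # already present, no new entry
print(first, second, again, len(index.url_table))
```

Only the chain for one hash value is scanned, so a crawler can check whether it has already seen a URL without walking the whole of url_table.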
Data structure for crawler
[Figure: data structure for the crawler. Labels: pointers to URL, pointers to title, hash code, overflow chains, string storage (heap) holding URL and title strings, hash table with n entries.]
Term Frequency,
Where,
$|D|$ : total number of documents in the corpus
$|\{d : t_i \in d\}|$ : number of documents where the
term $t_i$ appears (that is, where its count $n_{i,j} \neq 0$).
Inverse Document Frequency
There are different ways of calculating the IDF
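For reference, the standard textbook definitions can be written as below; the formulas on the original slides are not recoverable, so this is a reconstruction under assumptions. Here $n_{i,j}$ is the number of times term $t_i$ occurs in document $d_j$, $|D|$ is the total number of documents, and $|\{d : t_i \in d\}|$ is the number of documents containing $t_i$. Two common IDF variants are shown, the plain ratio and its logarithm. Treating the plain ratio as the "first formula" is an assumption, made because it is consistent with the example that divides by 0.0001.

```latex
\[
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
\]
\[
\mathrm{idf}_i = \frac{|D|}{|\{d : t_i \in d\}|}
\qquad\text{or}\qquad
\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}
\]
\[
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i
\]
```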
E.g., the TF-IDF score for the term "computer" in
the collection would be:
1) TF-IDF = 0.03 / 0.0001 = 300, using the first
formula of IDF.
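The arithmetic above can be checked with a short sketch. The raw counts are hypothetical, chosen only so that the term frequency is 0.03 and the fraction of documents containing the term is 0.0001; reading the "first formula" as the plain ratio |D| / (documents containing the term) is likewise an assumption. The logarithmic variant is included for comparison.

```python
import math


def tf_idf_ratio(term_count, doc_length, docs_with_term, total_docs):
    """TF-IDF using the plain-ratio IDF (assumed to be the 'first formula')."""
    tf = term_count / doc_length
    idf = total_docs / docs_with_term
    return tf * idf


def tf_idf_log(term_count, doc_length, docs_with_term, total_docs):
    """TF-IDF using the base-10 logarithmic IDF, shown for comparison."""
    tf = term_count / doc_length
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf


# Hypothetical counts: "computer" occurs 3 times in a 100-word document
# and appears in 1,000 of 10,000,000 documents (a fraction of 0.0001).
ratio_score = tf_idf_ratio(3, 100, 1_000, 10_000_000)
log_score = tf_idf_log(3, 100, 1_000, 10_000_000)
print(round(ratio_score, 6), round(log_score, 6))
```

Dividing the term frequency by the document fraction 0.0001 is the same as multiplying it by 10,000, which is why the plain-ratio score comes out so much larger than the logarithmic one.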