Build a Searchable Knowledge Base
Jimmy Lai
Yahoo! Search Engineer
r97922028 [at] ntu.edu.tw
2014/05/18
http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
Outline
Introduction to Knowledge Base
Construct a Knowledge Base
Search the Knowledge Base
  string match
  synonym search
  full text search
  geo search
  putting it all together
More Applications
Knowledge
Knowledge is power. - Francis Bacon, 1597
Knowledge is boundless and connected, so an efficient interface to search and browse the knowledge base is essential.
Let's try to build a searchable knowledge base.
Applications of Knowledge Bases
Personal assistants: Siri, Google Now
Search engines: Google's Knowledge Graph
Construct a Knowledge Base
1. Find good data sources.
2. Aggregate the data into knowledge entities.
3. Construct structured data for each knowledge entity.
4. Search the knowledge base.
5. Navigate the knowledge base.
Wikipedia
A collaborative encyclopedia with more than 30M articles in 287 languages.
http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
A good data source for a knowledge base. However, Wikipedia's data is not well structured.
DBpedia
http://wiki.dbpedia.org/About
Structured data from Wikipedia.
A good data source for a knowledge base.
[Figure: a knowledge entity with its identifier, abstract, and relations]
What Can Python Do for Us?
Data wrangling
  Process the raw text data
  Aggregate the data from different sources
  Output the data in JSON format
Connecting the data flow between systems
  Automation scripts for starting services and feeding data
  REST API implementing the search strategy
Example code
git clone git@github.com:jimmylai/knowledge.git
https://github.com/jimmylai/knowledge
Required Python packages:
1. fabric
2. pysolr
3. django
Data Preparation
1. Download the data from DBpedia:
http://downloads.dbpedia.org/current/en/
2. Select the knowledge entities of interest:
bzcat instance_types_en.nt.bz2 | python get_id_list.py
3. Parse and aggregate the entity data from the files:
data file                          script           data field
short_abstracts_en.nt.bz2          get_abstract.py  abstract
raw_infobox_properties_en.nt.bz2   get_relation.py  relations
geo_coordinates_en.nt.bz2          get_geo.py       latlon
redirects_en.nt.bz2                get_redirect.py  redirects
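As a rough illustration of the parsing step, the sketch below reads N-Triples lines and collects one literal value per entity. It is not the code from the repository: the regular expression, the function names, and the sample line are assumptions made for illustration, and the pattern only handles triples whose object is a literal.

```python
import bz2
import re

# Matches a triple whose object is a literal, e.g.
#   <subject> <predicate> "text"@en .
# Assumption: a simplified pattern, not a complete N-Triples parser.
TRIPLE_RE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def parse_literal_triple(line):
    """Return (subject, predicate, literal), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, predicate, literal = match.groups()
    # Only escaped quotes and backslashes are unescaped in this sketch.
    literal = literal.replace('\\"', '"').replace('\\\\', '\\')
    return subject, predicate, literal


def collect_field(lines, wanted_ids=None):
    """Map each subject URI to its literal value, keeping only wanted_ids if given."""
    result = {}
    for line in lines:
        parsed = parse_literal_triple(line)
        if parsed is None:
            continue  # comments, blank lines, or triples with URI objects
        subject, _predicate, literal = parsed
        if wanted_ids is None or subject in wanted_ids:
            result[subject] = literal
    return result


def collect_field_from_bz2(path, wanted_ids=None):
    """Stream a .nt.bz2 dump line by line without unpacking it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return collect_field(handle, wanted_ids)


# Hypothetical sample input: one comment line and one made-up triple.
sample = [
    "# a comment line",
    '<http://dbpedia.org/resource/Lake_Yosemite> '
    '<http://www.w3.org/2000/01/rdf-schema#comment> '
    '"Lake Yosemite is an artificial freshwater lake."@en .',
]
abstracts = collect_field(sample)
```

Passing the ID list from step 2 as `wanted_ids` keeps only the selected entities, so each data file can be reduced to a small dictionary before the fields are merged.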
Aggregated Data Format
"http://dbpedia.org/resource/Lake_Yosemite": {
    "latlon": "37.376389,-120.428889",
    "redirects": [
        "Lake_yosemite"
    ],
    "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
    "relations": {
        "type": "http://dbpedia.org/resource/Reservoir",
        "location": "http://dbpedia.org/resource/California"
    }
}
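To show how an aggregated entity like this might be turned into a flat document for indexing, here is a minimal sketch. The field names `name`, `abstract`, and `location` follow the schemas used later in the slides; the `id` field, the function name, and the rule for deriving the name from the identifier are assumptions, not taken from the repository.

```python
from urllib.parse import unquote


def entity_to_doc(identifier, entity):
    """Flatten one aggregated entity into a dict of index fields."""
    # Assumption: the display name is the last URI segment, with
    # underscores turned into spaces.
    name = unquote(identifier.rsplit("/", 1)[-1]).replace("_", " ")
    doc = {
        "id": identifier,  # hypothetical unique key field
        "name": name,
        "abstract": entity.get("abstract", ""),
    }
    if "latlon" in entity:
        # Kept as a "lat,lon" string, the same shape as in the aggregated data.
        doc["location"] = entity["latlon"]
    return doc


# Shortened copy of the example entity, for illustration only.
entity = {
    "latlon": "37.376389,-120.428889",
    "redirects": ["Lake_yosemite"],
    "abstract": "Lake Yosemite is an artificial freshwater lake.",
}
doc = entity_to_doc("http://dbpedia.org/resource/Lake_Yosemite", entity)
```

Entities without coordinates simply omit the `location` field, which matches a schema where that field is not required.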
Search by Solr
Solr is a full-text, real-time search engine based on Apache Lucene.
It provides a REST-like API.
pysolr makes Solr easy to use from Python.
Download the latest version (4.8.0) from
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0
and extract it to the solr/solr-4.8.0 directory.
Start the Solr server and then check the web UI:
fab start_solr
http://localhost:8983/solr/
Search - String Match
To be able to search by entity name
python feed_data.py string_match
config: solr/conf/string_match/schema.xml
<field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/>
Feed the entities to Solr. Each entity has name and abstract fields.
Search - String Match
http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
Search by entity name.
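The query URL above can be built programmatically rather than by hand. The sketch below uses only the standard library; the helper name and the `SOLR_BASE` constant are assumptions for illustration, and the escaping covers only quotes and backslashes inside the phrase.

```python
from urllib.parse import urlencode

# Assumption: Solr runs locally on the default port, as in the slides.
SOLR_BASE = "http://localhost:8983/solr"


def string_match_url(core, name):
    """Build a select URL that matches the name field as an exact phrase."""
    # Escape backslashes and quotes so the name stays inside the quoted phrase.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", 'name:"%s"' % escaped),
        ("wt", "json"),
        ("indent", "true"),
    ]
    return "%s/%s/select?%s" % (SOLR_BASE, core, urlencode(params))


url = string_match_url("string_match", "San Francisco")
```

With a running Solr server, the resulting URL can be fetched with any HTTP client and the response decoded as JSON.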
Search - Synonym
To be able to search by synonym of entity name
python feed_data.py synonym_string_match
config: solr/conf/synonym_string_match/schema.xml
<field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>
<fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Restart the Solr server to reload the synonym file.
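The slides do not show how synonyms.txt is produced. One plausible approach, sketched below as an assumption, is to write one comma-separated line of equivalent names per entity from its redirects. The function name, the underscore-to-space rule, the backslash escaping of commas, and the sample redirects are all illustrative guesses rather than the repository's method.

```python
def synonym_line(canonical_id, redirects):
    """Return one line of equivalent names for synonyms.txt, or None if there are no synonyms."""

    def clean(title):
        # Assumption: titles use underscores for spaces. Commas are escaped
        # because the synonym file uses them as separators.
        return title.replace("_", " ").replace(",", "\\,")

    names = [clean(canonical_id.rsplit("/", 1)[-1])]
    seen = {names[0].lower()}
    for redirect in redirects:
        name = clean(redirect)
        # With ignoreCase="true", names that differ only by case are redundant.
        if name.lower() not in seen:
            seen.add(name.lower())
            names.append(name)
    if len(names) < 2:
        return None
    return ", ".join(names)


# Hypothetical redirects, for illustration only.
line = synonym_line(
    "http://dbpedia.org/resource/San_Francisco",
    ["SF", "San_Fran", "San_francisco"],
)
```

Because the filter is configured with expand="true", any name on such a line would match all the others on the same line.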

Synonym handling at index time
Synonym handling at query time
Search - Synonym
Search by synonym.
Search - Full Text Search
To be able to search across the text of the name and abstract
python feed_data.py full_text_search
config: solr/conf/full_text_search/schema.xml
<copyField source="name" dest="text"/>
<copyField source="abstract" dest="text"/>
Feed the entities to Solr. Each name and abstract field will be copied to the text field. After that, we can run full-text searches without specifying a field to search.
Search - Full Text Search
Search - Geo Search
To be able to search by distance given a location
python feed_data.py geo_search
config: solr/conf/geo_search/schema.xml
<field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false"/>
Feed the entities to Solr. Each entity contains a location field in a format like "51.670100,-3.230100".
Given condition on distance
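A distance condition can be expressed in Solr as a filter query using the `geofilt` query parser, which takes the spatial field (`sfield`), a center point (`pt`), and a distance in kilometers (`d`). The sketch below builds such parameters; the helper name, the query text, and the coordinates are assumptions for illustration.

```python
from urllib.parse import urlencode


def geo_filter_params(query, lat, lon, distance_km, field="location"):
    """Return request parameters that restrict a query to a radius around a point."""
    geofilt = "{!geofilt sfield=%s pt=%s,%s d=%s}" % (field, lat, lon, distance_km)
    return [
        ("q", query),
        ("fq", geofilt),  # filter query: keeps only documents within distance_km
        ("wt", "json"),
    ]


# Hypothetical example: entities matching "lake" within 50 km of a point.
params = geo_filter_params("lake", 37.77, -122.42, 50)
query_string = urlencode(params)
```

Putting the distance condition in `fq` rather than `q` keeps it separate from the text query, so the same text search can be reused with or without a location.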
Search - Putting It All Together
Search Strategy
1. Input a query
2. Search by synonym match
3. Search by full text
4. If a location is given, filter the results by geo search
Implement the search strategy as an API
Implement the search strategy in a Django view
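The view code itself is not reproduced in this transcript. As a framework-independent sketch of the strategy, the function below takes the individual searches as callables, so it can be tested without Solr or Django. Treating full-text search as a fallback when the synonym match finds nothing is an assumption, as are all names and the stub data.

```python
def search(query, synonym_search, full_text_search, location=None, geo_filter=None):
    """Run the strategy: synonym match, then full text, then an optional geo filter."""
    results = synonym_search(query)
    if not results:
        # Assumption: full-text search is used only when the synonym match is empty.
        results = full_text_search(query)
    if location is not None and geo_filter is not None:
        results = geo_filter(results, location)
    return results


# Hypothetical in-memory stand-ins for the three Solr-backed searches.
NAMES = {"sf": ["San Francisco"]}
TEXTS = {"lake": ["Lake Yosemite", "Lake Tahoe"]}
NEARBY = {"merced": {"Lake Yosemite"}}


def by_synonym(query):
    return NAMES.get(query.lower(), [])


def by_text(query):
    return TEXTS.get(query.lower(), [])


def near(results, location):
    return [r for r in results if r in NEARBY.get(location, set())]


hits = search("lake", by_synonym, by_text, location="merced", geo_filter=near)
```

In a web view, the query and location would be read from the request, passed to a function like this, and the results returned as JSON.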
Review
A knowledge base with synonym, full-text, and geo search APIs.
The knowledge entities are connected by relations.
More Applications
Question answering system:
1. Query analysis: identify the intent (e.g. looking for a specific type of entity)
2. Search in the knowledge base
3. Return the knowledge entity
Modern search engines don't just provide web page URLs. They give users direct answers.
More Data Sources and Knowledge Entities
Open Data
Open APIs
My Life at Yahoo!
Build online services for billions of users.
Big data mining on cloud infrastructures.
Open and innovative working environment.
International teamwork and English communication.
Business trips to Silicon Valley.
Send me your resume if you need a referral:
r97922028 [at] ntu.edu.tw