
Inside the architecture of

Google's Knowledge Graph


Despite the massive amounts of computing power dedicated by search engine companies to
crawling and indexing trillions of documents on the Internet, search engines still can't do what
nearly any human can: tell the difference between a star, a 1970s TV show, and a Turkish
alternative rock band. That's because Web indexing has been based on the bare words found on
webpages, not on what they mean.
"Since the beginning, search engines have essentially matched strings of text," says Shashi
Thakur, a technical lead for Google's search team. "When you try to match strings, you don't get
a sense of what those strings mean. We should have a connection to real-world knowledge of
things and their properties and connections to other things."
Making those connections is the reason for recent major changes within the search engines at
Microsoft and Google. Microsoft's Satori and Google's Knowledge Graph both extract data from
the unstructured information on webpages to create a structured database of the nouns of the
Internet: people, places, things, and the relationships between them all. The changes aren't
cosmetic; for Google, for example, this was the company's biggest retooling to search since
rolling out "universal search" in 2007.
The efforts are in part a fruition of ideas put forward by a team from Yahoo Research in a 2009
paper called "A Web of Concepts," in which the researchers outlined an approach to extracting

conceptual information from the wider Web to create a more knowledge-driven approach to
search. They defined three key elements to creating a true web of concepts:

Information extraction: pulling structured data (addresses, phone numbers, prices, stock
numbers and such) out of Web documents and associating it with an entity

Linking: mapping the relationships between entities (connecting an actor to films he's
starred in and to other actors he has worked with)

Analysis: discovering categorical information about an entity from the content (such as
the type of food a restaurant serves) or from sentiment data (such as whether the restaurant has
positive reviews).
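As a rough illustration of those three elements in miniature (the document text, the regex, and all field names here are invented for this sketch, not drawn from the Yahoo paper):

```python
# Illustrative sketch of the three elements of a "web of concepts";
# every name and rule below is an assumption for demonstration.
import re

doc = "Joe's Diner, 123 Main St. Call 555-0123. Great tacos, loved it!"

# 1. Information extraction: pull structured data out of the document.
phone = re.search(r"\d{3}-\d{4}", doc).group()

# 2. Linking: associate the extracted data with an entity record.
entity = {"name": "Joe's Diner", "phone": phone, "related": ["tacos"]}

# 3. Analysis: categorize the entity and score sentiment from the text.
entity["cuisine"] = "Mexican" if "tacos" in doc else "unknown"
entity["sentiment"] = "positive" if "loved" in doc else "neutral"

print(entity["phone"], entity["cuisine"], entity["sentiment"])
# → 555-0123 Mexican positive
```

A real pipeline would of course use learned extractors and sentiment models rather than keyword checks, but the three stages compose in the same order.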
Google and Microsoft have just begun to tap into the power of that kind of knowledge. And their
respective entity databases remain in their infancy. As of June 1, Satori had mapped over 400
million entities and Knowledge Graph had reached half a billion, a tiny fraction of the potential
index of entities that the two search tools could amass.
In interviews with Ars, members of the teams at both Google and Microsoft walked us through
the inner workings of Knowledge Graph and Satori. Additionally, we dug through the components
of both search technologies to understand how they work, how they differ from the "old school"
search, and what projects like these mean to the future of the Web.

Graphing the Web


Entity extraction is not exactly a new twist in search; Microsoft acquired language processing-based entity extraction technology when it bought FAST Search and Transfer back in 2008, for
instance. What's new about what Google and Microsoft are doing is the sheer scope of the entity
databases they plan to build, the relationships and actions they are exposing through search,
and the underlying data store they are using to handle the massive number of objects and
relationships within the milliseconds required to render a search result.
By any standard, Knowledge Graph and Satori are already huge databases, but they aren't
really "databases" in the traditional sense. Rather than being based on relational or object
database models, they are graph databases based on the same graph theory approach used
by Facebook's Open Graph to map relationships between its users and their various activities.
Graph databases are based on entities (or nodes) and the mapped relationships (or "links")
between them. They're a good match for Web content because, in a way, the Web itself is a
graph database, with its pages as nodes and relationships represented by the hyperlinks
connecting them.
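A minimal sketch of that node-and-link model (the class and method names are hypothetical, not either company's internal schema):

```python
# Minimal sketch of a graph database: entities (nodes) and typed,
# directed relationships (links) between them. All names here are
# illustrative assumptions, not Google's or Microsoft's internals.
class Entity:
    def __init__(self, entity_id, name, entity_type):
        self.id = entity_id
        self.name = name
        self.type = entity_type
        self.links = []          # outgoing (relation, target_id) pairs

class Graph:
    def __init__(self):
        self.entities = {}

    def add_entity(self, entity_id, name, entity_type):
        self.entities[entity_id] = Entity(entity_id, name, entity_type)

    def link(self, source_id, relation, target_id):
        self.entities[source_id].links.append((relation, target_id))

    def neighbors(self, entity_id, relation):
        """Follow links of one relation type from a node."""
        return [t for (r, t) in self.entities[entity_id].links if r == relation]

g = Graph()
g.add_entity("m1", "Blade Runner", "film")
g.add_entity("p1", "Ridley Scott", "person")
g.link("m1", "directed_by", "p1")
print(g.neighbors("m1", "directed_by"))  # → ['p1']
```

Traversals like `neighbors` are the basic query primitive of a graph store; answering "who directed this film?" is one hop rather than a relational join.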
Google's Knowledge Graph derives from Freebase, a proprietary graph database acquired by
Google in 2010 when it bought Metaweb. Google's Thakur, who is technical lead on Knowledge
Graph, says that significant additional development has been done to get the database up to
Google's required capacity. Based on some of the architecture discussed by Google, Knowledge
Graph may also rely on some batch processes powered by Google's Pregel graph engine, the
high-performance graph processing tool that Google developed to handle many of its Web
indexing tasks, though Thakur declined to discuss those sorts of details.

What's an entity, anyway?


The entities in both Knowledge Graph and Satori are essentially semantic data objects, each with
a unique identifier, a collection of properties based on the attributes of the real-world topic they
represent, and links representing the topic's relationships to other entities. They also include
actions that someone searching for that topic might want to take.
To get a better picture of what an entity looks like, let's look at an example from Freebase.
Freebase's schema allows for a wide range of entity types, each with its own specific
set of properties. These properties can be inherited from one type of entity to another, and
entities can be linked to other entities for parts of their information. For example, a Blu-ray disc of
a movie is a "film distribution medium," which is a separate entity from the film itself. But it links
back to the entity for the original film for information like the director and the cast.
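That linkage can be sketched roughly as follows; the IDs, type names, and property names below are loosely modeled on Freebase conventions but are assumptions, not actual Freebase records:

```python
# Hypothetical sketch of Freebase-style entities: a Blu-ray disc is a
# separate "film distribution medium" entity that resolves properties
# like director and cast by following a link back to the film entity.
film = {
    "id": "/m/example_film",
    "type": "/film/film",
    "properties": {"name": "Blade Runner", "director": "Ridley Scott",
                   "starring": ["Harrison Ford", "Rutger Hauer"]},
}

bluray = {
    "id": "/m/example_film_bluray",
    "type": "/film/distribution_medium",
    "properties": {"format": "Blu-ray"},
    "links": {"distributed_film": film},   # link back to the film entity
}

def resolve(entity, prop):
    """Look up a property locally, then follow links to other entities."""
    if prop in entity["properties"]:
        return entity["properties"][prop]
    for target in entity.get("links", {}).values():
        value = resolve(target, prop)
        if value is not None:
            return value
    return None

print(resolve(bluray, "format"))    # → Blu-ray
print(resolve(bluray, "director"))  # → Ridley Scott
```

The point of the indirection is deduplication: the director and cast live on one film entity, and every edition, disc, or broadcast of that film inherits them through a link.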

The schema for Google's Knowledge Graph is based on the same principles, but with some
significant changes to make it scale to Google's needs. Thakur said that when Google purchased
Metaweb, Freebase's database had 12 million entities; Knowledge Graph now tracks 500 million
entities and over 3.5 billion relationships between those entities. To ensure that the entities
themselves didn't become bloated with underused data and hinder the scaling-up of the
Knowledge Graph, Google's team threw out the user-defined schema from Freebase and turned
to their most reliable gauge of the data users wanted: Google's search query stream.
"We have the luxury of having access to searches, which are like the zeitgeist," Thakur said.
"The search stream gives us a window into what people care about and what properties they look
for." The Knowledge Graph team processed Google's stream of search data to prioritize the
properties assigned to entities based on what users were most interested in: how tall buildings
are, what movies an actor starred in, how many times a celebrity went to rehab.
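A toy version of that prioritization (assumed, not Google's actual pipeline: the queries, keyword lists, and matching rule are all invented) just counts how often each candidate property of an entity type shows up in the query stream:

```python
# Sketch: rank an entity type's properties by how often the query
# stream asks about them. All data and rules here are assumptions.
from collections import Counter

queries = [
    "how tall is the empire state building",
    "empire state building height",
    "empire state building architect",
    "burj khalifa height",
]

# Keywords that signal interest in a given property of a "building".
property_keywords = {"height": ["tall", "height"], "architect": ["architect"]}

counts = Counter()
for q in queries:
    for prop, keywords in property_keywords.items():
        if any(k in q for k in keywords):
            counts[prop] += 1

# Keep the most-asked-about properties in the entity schema.
ranked = [prop for prop, _ in counts.most_common()]
print(ranked)  # → ['height', 'architect']
```

At Google's scale the same idea runs over billions of queries, but the output is the same shape: a demand-ordered list of properties worth storing on each entity type.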

Finally, here is Google's latest patent on the subject:


Identifying entities using search results
Invented by Thomas A. Lasko, Andrew Tomkins, Michael Angelo, Matthew K. Gray,
Russell Ryan, Namrata U. Godbole, and Roni F. Zeiger
Assigned to Google
US Patent 8,775,439
Granted July 8, 2014
Filed September 27, 2011

Abstract

Methods, systems, and apparatus, including computer programs encoded on
computer storage media, for identifying entities using search results.

One of the methods includes the actions of: determining that a first search query
includes a respective text reference to each of one or more predetermined
attributes, wherein each attribute is associated with a first entity type; for each
of a plurality of entities of the first entity type, generating a combined
search query that includes the first search query and a name of the entity;
obtaining search results for each of the plurality of entities using the combined
search query for each respective entity; and using the obtained search results to
generate combined search results to include in a response to the first search
query.
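The flow the abstract describes can be sketched as follows; the function names, the query joining rule, and the toy search backend are assumptions for illustration, not the patent's implementation:

```python
# Sketch of the patented flow: given a first query that references
# attributes of an entity type, issue one combined query per candidate
# entity and merge the results. search() is a stand-in backend.
def identify_entities(first_query, candidate_entities, search):
    combined_results = []
    for name in candidate_entities:
        # Combined query = first search query + the entity's name.
        combined_query = f"{first_query} {name}"
        combined_results.extend(search(combined_query))
    return combined_results

# Toy search backend, for illustration only.
def fake_search(query):
    return [f"result for: {query}"]

hits = identify_entities("restaurants with patio seating",
                         ["Cafe A", "Cafe B"], fake_search)
print(len(hits))  # → 2
```

The interesting design choice is that the entity names come from the system, not the user: the original attribute-only query ("restaurants with patio seating") is expanded behind the scenes into one concrete query per known entity of the matching type.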
