

Case Study

LinkedIn and its System, Network and Analytics Data Storage

Sai Srinivas K (B09016), Sai Sagar J (B09014), Rajeshwari R (B09026) and Ashish K Gupta (B09008) Distributed Database Systems, Spring 2013, IIT Mandi Instructor: Dr. Arti Kashyap

DDB, Spring 2013

Abstract

This paper is a case study of LinkedIn, a social networking website for people in professional occupations, covering its data storage systems and some of the System, Network and Analytics aspects of the site. The SNA team at LinkedIn has a web site that hosts the open source projects built by the group. Notable among these projects is Project Voldemort, a distributed, low-latency key-value structured storage system similar in purpose to Amazon's Dynamo and Google's BigTable. We survey the research to date and the back-end systems behind the website, which reports more than 200 million users in more than 200 countries and territories.

I. Introduction

LinkedIn Corporation is a social networking website for professional networking among people in various occupations. The company was founded by Reid Hoffman and founding team members from PayPal and Socialnet.com (Allen Blue, Lee Hower, Eric Ly, David Eves, Ian McNish, Chris Saccheri, Jean-Luc Vaillant, Konstantin Guericke, Stephen Beitzel and Yan Pujante), and launched on May 5, 2003 in Santa Monica, California [1]. LinkedIn's CEO is Jeff Weiner, previously a Yahoo! Inc. executive; founder Reid Hoffman, previously CEO of LinkedIn, is now Chairman of the Board.

1.1 Features

The site supports professional networking by letting each user maintain a list of connections, which holds the contact details of everyone connected to them. A user can invite anyone, whether a site user or not, to become a connection. However, if the invitee selects "I don't know" or "Spam", this counts as a report against the inviter, and if the inviter gets too many such responses the account may be restricted or closed. This list of connections can then be used in a number of ways. A network of contacts is built up from a user's direct connections, their second-degree connections (connections of each of their connections) and their third-degree connections (connections of the second-degree connections).
This is similar to the concept of mutual friends on Facebook: a user can gain an introduction to someone they find interesting through a shared contact. Users can upload their resumes or build them within their profiles in order to share their work and community experience. The network can be used to find jobs, people and business opportunities recommended by someone in one's contact network. Employers can list jobs and search for potential candidates. Job seekers can review the profiles of hiring managers and discover which of their existing contacts can introduce them.

Users can post their own photos and view photos of others to aid in identification. Users can follow companies and receive notifications about people joining them and about available offers. Users can save or bookmark jobs that they would like to apply for.

The "gated-access approach" (where contact with any professional requires either an existing relationship or the intervention of a contact of theirs) is intended to build trust among the service's users and is one of the special features of LinkedIn. The LinkedIn Answers feature, similar to Yahoo! Answers, allows users to ask questions for the community to answer. This feature is free, and the main difference from the latter is that questions are potentially more business-oriented and the identity of the people asking and answering questions is known. LinkedIn cites a new 'focus on development of new and more engaging ways to share and discuss professional topics across LinkedIn', a recent development that may lead to the retirement of the ageing LinkedIn Answers feature. Other LinkedIn features include LinkedIn Polls as a form of research (for the users) and LinkedIn DirectAds as a form of sponsored advertising. LinkedIn allows users to endorse each other's skills. This feature also allows users to efficiently provide commentary on other users' profiles, thus reinforcing the network build-up. However, there is no way of flagging anything other than positive content.

1.1.1 Applications

The Applications Platform allows other online services to be embedded within a member's profile page. Examples are the Amazon Reading List, which allows LinkedIn members to display books they are reading, a connection to Tripit (travel itinerary), and WordPress and TypePad applications, which allow members to display their latest blog postings within their LinkedIn profile. Later on, LinkedIn allowed businesses to list products and services on company profile pages; it also permitted LinkedIn members to "recommend" products and services and write reviews.
1.1.2 Groups

LinkedIn also supports the formation of interest groups (which are similarly popular on many social networking sites and blogs). The majority are related to employment, although a very wide range of topics is covered, mainly around professional and career issues; the current focus is on groups for both academic and corporate alumni. Groups support a limited form of discussion area, moderated by the group owners and managers. Since groups offer the ability to reach a wide audience without so easily falling foul of anti-spam solutions, there is a constant stream of spam postings, and there now exists a range of firms that offer a spamming service for this very purpose. Groups also keep their members informed through emails with updates to the group, including the most talked-about discussions within their professional circles. Groups may be private, accessible to members only, or open to Internet users in general to read, though users must join in order to post messages.

1.1.3 Job listings

LinkedIn allows users to research companies with which they may be interested in working. When the name of a given company is typed in the search box, statistics about the company are provided. These may include the location of the company's headquarters and offices, a list of present and former employees, the percentage of the most common titles/positions held within the company, etc. LinkedIn launched a new feature allowing companies to include an "Apply with LinkedIn" button on job listing pages, a useful development. The plugin allows potential employees to apply for positions using their LinkedIn profiles as resumes. All applications are also saved under a "Saved Jobs" tab.

II. SNA LinkedIn

The Search, Network, and Analytics (SNA) team at LinkedIn hosts the open source projects built by the group on its data blog. Notable among these projects is Project Voldemort, a distributed, low-latency key-value structured storage system similar in purpose to Amazon's Dynamo and Google's BigTable. The data team at LinkedIn works on LinkedIn's information retrieval systems, the social graph system, data-driven features, and supporting data infrastructure.

2.1 Project Voldemort

Voldemort is a distributed key-value storage system. It has the following properties:

- Data is automatically replicated over multiple servers (Data Replication).
- Data is automatically partitioned so each server contains only a subset of the total data (Data Partitioning).
- Server failures are handled transparently, without users noticing (Transparent Failures).
- Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, Avro and Java Serialization.
- Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system (Versioning).
- Each node is independent of other nodes, with no central point of failure or coordination (Node Independence).
- Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor.
- Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart (Data Placement).

2.1.1 Comparison with the Relational Database

Voldemort is not a relational database; it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R mapper like ActiveRecord or Hibernate, this will provide horizontal scalability and much higher availability, but at great loss of convenience. For large applications under internet-type scalability pressure, a system will likely consist of a number of functionally partitioned services or APIs, which may manage storage resources across multiple data centres using storage systems which may themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the data is not available in any single database. A typical pattern is to introduce a caching layer, which will require hash table semantics anyway. For such applications Voldemort offers a number of advantages:

- Voldemort combines in-memory caching with the storage system so that a separate caching tier is not required (instead the storage system itself is just fast).
- Unlike MySQL replication, both reads and writes scale horizontally.
- Data partitioning is transparent, and allows for cluster expansion without rebalancing all data.
- Data replication and placement is decided by a simple API, to be able to accommodate a wide range of application-specific strategies.
- The storage layer is completely mockable, so development and unit testing can be done against a throw-away in-memory storage system without needing a real cluster (or even a real storage system) for simple testing.

Voldemort is also used for certain high-scalability storage problems where simple functional partitioning is not sufficient. It is still a new system under development, which may have rough edges and probably plenty of uncaught bugs.
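The "mockable storage layer" advantage can be illustrated with a minimal sketch. Everything here is hypothetical and written for illustration only: `InMemoryStore` and `add_email` are not part of Voldemort's API; they only assume that application code talks to storage through get, put and delete.

```python
from typing import Any, Dict, Hashable, Optional

_MISSING = object()  # sentinel so that a stored None still counts as present


class InMemoryStore:
    """Throw-away stand-in for a key-value store (hypothetical, for tests only)."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        return self._data.get(key)

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value

    def delete(self, key: Hashable) -> bool:
        # Returns True only if the key was actually present.
        return self._data.pop(key, _MISSING) is not _MISSING


def add_email(store: Any, member_id: str, email: str) -> None:
    """Application logic under test: works with any object exposing get/put."""
    profile = store.get(member_id) or {"emails": []}
    profile["emails"].append(email)
    store.put(member_id, profile)
```

Because `add_email` depends only on the three-operation interface, a unit test can pass in `InMemoryStore()` and never touch a cluster.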


2.1.2 Design

Key-Value Storage

Project Voldemort, created by LinkedIn, is a simple key-value data store, because its primary goal is to provide high performance and availability to users. Both keys and values can be complex compound objects including lists or maps, but nonetheless the only supported queries are effectively the following:

    value = store.get(key)
    store.put(key, value)
    store.delete(key)

This may not be good enough for all storage problems, since it involves a variety of trade-offs: no complex query filters, all joins must be done in code, no foreign key constraints, no triggers, and so on.

2.1.3 System Architecture

The representation below [2] is the logical view, in which each layer implements a simple storage interface of put, get, and delete. Each of these layers is responsible for performing one function such as TCP/IP network communication, serialization, version reconciliation, inter-node routing, etc. For example, the routing layer is responsible for taking an operation, say a PUT, and delegating it to all N storage replicas in parallel, while handling any failures. [3]

There is flexibility in where the intelligent routing of data to partitions is done; it can sit at any of these layers. Routing can be done on the client side, or on the server side to enable hardware load-balanced HTTP clients. Likewise, one could add a compression layer that compresses byte values at any level below the serialization level.

The representation below [4] is the physical architecture, with front-end, back-end and Voldemort clusters connected through load balancers (either a hardware load balancer or a round-robin software load balancer) and "partition-aware routing", which is the storage system's internal routing. All the possible tier architectures are denoted in the diagram.

Eliminating hops is efficient from the latency perspective, because there are fewer hops, and from the throughput perspective, since there are fewer potential bottlenecks, but it comes at a cost: it requires the routing intelligence to move up the stack. Beyond that, this flexibility makes high-performance configurations possible. Disk access is the single biggest performance hit in storage; the second is network hops. Disk access can be avoided by partitioning the data set and caching as much as possible. Network hops require architectural flexibility to eliminate. In the diagram one can implement 3-hop, 2-hop, or 1-hop remote services using different configurations. This enables very high performance to be achieved when it is possible to route service calls directly to the appropriate server.

Data partitioning and replication [5]

Data needs to be partitioned across a cluster of servers so that no single server needs to hold the complete data set. Even when the data can fit on a single disk, disk access for small values may be slowed down by seek time, so partitioning improves cache efficiency by splitting the data into smaller chunks. The servers in the cluster are not interchangeable, and requests need to be routed to a server that holds the requested data, not just to any available server at random.

Similarly, servers regularly fail, become overloaded, or are brought down for maintenance. If there are S servers and each server is assumed to fail independently with probability p in a given day, then the probability of losing at least one server in a day is 1 - (1 - p)^S. Therefore we cannot store data on only one server, or the probability of data loss will grow with cluster size.

The simplest possible way to accomplish this would be to cut the data into S partitions (one per server) and store copies of a given key K on R servers. One way to associate the R servers with key K would be to take a = K mod S and store the value on servers a, a+1, ..., a+R-1 (taken modulo S). So for any probability p you can pick an appropriate replication factor R to achieve an acceptably low probability of data loss. This system has the nice property that anyone can calculate the location of a value just by knowing its key, which allows look-ups in a peer-to-peer fashion without contacting a central metadata server that has a mapping of all keys to servers.

The downside of the above approach appears when a server is added to or removed from the cluster (for example on failure). In this case the modulus S may change and all data will shift between servers. Even if S does not change, load will not distribute evenly from a single removed or failed server to the rest of the cluster. Consistent hashing is a technique that avoids these problems, and Voldemort uses it to compute the location of each key on the cluster. Using this technique Voldemort has the property that when a server fails, load will distribute equally over all remaining servers in the cluster. Likewise, when a new server is added to a cluster of S servers, only a fraction 1/(S+1) of the values must be moved to the new machine.

To visualize the consistent hashing method we can see the possible integer hash values as a ring beginning with 0 and circling around to 2^31-1. This ring is divided into Q equally-sized partitions with Q >> S, and each of the S servers is assigned Q/S of these. A key is mapped onto the ring using an arbitrary hash function, and then we compute a list of R servers responsible for this key by taking the first R unique nodes when moving over the partitions in a clockwise direction. The diagram [6] below pictures a hash ring for servers A, B, C, D. The arrows indicate keys mapped onto the hash ring and the resulting list of servers that will store the value for that key if R=3.

Features such as load balancing and semantic partitioning are implemented by Kafka, Sensei DB and others.

2.1.4 Data Format & Queries

In Voldemort data is divided into stores, unlike in a relational database where it is broken into 2D tables. The word "table" is not used because the data need not be tabular (a value can contain lists and mappings, which are not considered in a strict relational mapping). Each key is unique to a store, and each key can have at most one value.

Queries

Voldemort supports hash table semantics, so a single value can be modified at a time and retrieval is by primary key. This makes distribution across machines particularly easy since everything can be split by the primary key. Although one-to-many relations are not supported, lists are supported as values, which accomplishes the same thing, so it is possible to store a reasonable number of values associated with a single key. In most cases this denormalization is a huge performance improvement since there is only a single set of disk seeks; but for very large one-to-many relationships (say where a key maps to tens of millions of values), which must be kept on the server and streamed lazily via a cursor, this approach is not practical. This rare case must be broken up into sub-queries or otherwise handled at the application level.

The simplicity of the queries can be an advantage: since each has very predictable performance, it is easy to break down the performance of a service into the number of storage operations it performs and quickly estimate the load. In contrast, SQL queries are often opaque, and execution plans can be data dependent, so it can be very difficult to estimate whether a given query will perform well with realistic data under load (especially for a new feature which has neither data nor load). Also, having a three-operation interface makes it possible to transparently mock out the entire storage layer and unit test using a mock-storage implementation that is little more than a HashMap. This makes unit testing outside of a particular container or environment much more practical.

2.1.5 Consistency & Versioning

When taking multiple simultaneous writes distributed across multiple servers, and perhaps multiple data centres, consistency of data becomes a difficult problem. The traditional solution to this problem is distributed transactions, but these are both slow (due to many round trips) and fragile, as they require all servers to be available to process a transaction. In particular, any algorithm which must talk to more than 50% of the servers to ensure consistency becomes quite problematic if the application is running in multiple data centres, since the latency for cross-data-centre operations will be extremely high.

An alternate solution is to tolerate the possibility of inconsistency, and resolve inconsistencies at read time. Applications usually do a read-modify-update sequence when modifying data. For example, if a user adds an email address to their account we might load the user object, add the email, and then write the new values back to the database. Transactions in databases are a solution to this problem, but are not a real option when the transaction must span multiple page loads (which may or may not complete, and which can complete on any particular time frame).

The value for a given key is consistent if, in the absence of updates, all reads of that key return the same value. In the read-only world data is created in a consistent way and not changed. When we add both writes and replication, we encounter problems: now we need to update multiple values on multiple machines and leave things in a consistent state. In the presence of server failures this is very hard; in the presence of network partitions it is provably impossible (a partition is when, e.g., A and B can reach each other and C and D can reach each other, but A and B can't reach C and D).

There are several methods for reaching consistency with different guarantees and performance trade-offs, such as two-phase commit (2PC), Paxos-style consensus and read-repair. The first two approaches prevent permanent inconsistency. The third approach involves writing all inconsistent versions, then detecting the conflict at read time and resolving it; this is the approach used by the SNA team. It involves little co-ordination and is completely failure tolerant, but may require additional application logic to resolve conflicts. It has the best availability guarantees and the highest efficiency (only W network round trips are required for writes to N replicas, where W can be configured to be less than N). 2PC typically requires 2N blocking round trips. Paxos variations vary quite a bit but are comparable to 2PC.

Another approach to reaching consistency is hinted handoff. In this method, if during a write we find that the destination nodes are down (failure handling), we store a "hint" of the updated value on one of the live nodes. When the down nodes come back up, the hints are pushed to them, thereby making the data consistent.

2.1.6 Routing Parameters

Any persistent system needs to answer the question "where is my stuff?".
This is a very easy question if we have a centralized database, since the answer is always "somewhere on the database server". In a partitioned key system there are multiple machines that may have the data. When we do a read we need to read from at least one server to get the answer; when we do a write we need to (eventually) write to all N of the replicas. There are thus three parameters that matter:

N - The number of replicas
R - The number of machines to read from
W - The number of writes to block for

Note that if R + W > N then we are guaranteed to "read our writes". If W = 0, then writes are non-blocking and there is no guarantee of success whatsoever. Puts and deletes are neither immediately consistent nor isolated. The semantics are as follows: if a put/delete operation succeeds without an exception then it is guaranteed that at least W nodes carried out the operation; however, if the write fails (say because too few nodes succeed in carrying out the operation) then the state is unspecified. If at least one put/delete succeeds then the value will eventually be the new value; however, if none succeeded then the value is lost. If the client wants to ensure the state after a failed write operation it must issue another write.

2.1.7 Performance [7]

Getting real applications deployed requires having simple, well understood, predictable performance. Understanding and tuning the performance of a cluster of machines is an important criterion too. Note that there are a number of tuneable parameters: the cache size on a node, the number of nodes you read and write to on each operation, the amount of data on a server, etc.

Estimating network latency and data/cache ratios

Disk is far and away the slowest and lowest-throughput operation. Disk seeks are 5-10ms and a lookup could involve multiple disk seeks. When the hot data is primarily in memory you are benchmarking the software; when it is primarily on disk you are benchmarking your disk system. The calculation done when planning a feature is to take the estimated total data size, divide by the number of nodes and multiply by the replication factor. This is the amount of data per node. Then compare this to the cache size per node. This gives the fraction of the total data that can be served from memory. This fraction can be compared to some estimate of the hotness of the data. For example, if the requests are completely random, then a high proportion should be in memory. If instead the requests represent data about particular members, only some fraction of members are logged in at once, and one member session generates many requests, then you may survive with a much lower fraction.

Network is the second biggest bottleneck after disk. The maximum throughput one Java client can get for round trips through a socket to a service that does absolutely nothing seems to be about 30-40k req/sec over localhost. Adding work on the client or server side or adding network latency can only decrease this.

Some results of LinkedIn performance tests

The throughput seen from a single multithreaded client talking to a single server, where the "hot" data set is in memory, under artificially heavy load in a performance lab:

Reads: 19,384 req/sec
Writes: 16,559 req/sec

Note that this is to a single node cluster, so the replication factor is 1. Obviously doubling the replication factor will halve the client req/sec since it is doing 2x the operations. So these numbers represent the maximum throughput from one client; by increasing the replication factor, decreasing the cache size, or increasing the data size on the node, we can make the performance arbitrarily slow. Note that in this test the server is actually fairly lightly loaded since it has only one client, so this does not measure the maximum throughput of a server, just the maximum throughput from a single client [8].
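The placement and routing rules described in this section (a ring of partitions, a list of distinct replica servers found by walking clockwise, and the R + W > N condition) can be tied together in a short sketch. This is an illustration under stated assumptions, not Voldemort's implementation: the use of MD5, the round-robin assignment of partitions to servers, and all function names are hypothetical.

```python
import hashlib
from typing import List


def key_to_partition(key: bytes, num_partitions: int) -> int:
    """Map a key onto a ring 0 .. 2^31-1 cut into equal-sized partitions."""
    digest = hashlib.md5(key).digest()  # hash choice is an assumption
    position = int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
    # Scale the ring position down to a partition index in [0, num_partitions).
    return (position * num_partitions) >> 31


def preference_list(key: bytes, owners: List[str], replicas: int) -> List[str]:
    """Walk clockwise from the key's partition and collect the first
    `replicas` distinct servers; owners[i] is the server owning partition i."""
    start = key_to_partition(key, len(owners))
    chosen: List[str] = []
    for step in range(len(owners)):
        server = owners[(start + step) % len(owners)]
        if server not in chosen:
            chosen.append(server)
            if len(chosen) == replicas:
                break
    return chosen


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """With R + W > N every read set overlaps every successful write set."""
    return r + w > n
```

For example, with `owners = ["A", "B", "C", "D"] * 3` (Q = 12 partitions over S = 4 servers), `preference_list(b"member:42", owners, 3)` returns three distinct servers, and `reads_see_latest_write(3, 2, 2)` holds while `reads_see_latest_write(3, 1, 1)` does not.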

2.2 Support for batch computed data: read-only stores

One of the most data-intensive storage needs is storing batch computed data about members and content in the system. These jobs often deal with the relationships between entities (e.g. related users, or related news articles) and so for N entities can produce up to N^2 relationships. An example at LinkedIn is member networks, which are in the 12TB range if stored explicitly for all members. Batch processing of data is generally much more efficient than random access, which means one can easily produce more batch computed data than can be easily accessed by the live system; Hadoop greatly expands this ability. Therefore a Voldemort persistence back end was created that supports very efficient read-only access and takes a lot of the pain out of building, deploying, and managing large, read-only batch computed data sets.

Much of the pain of dealing with batch computing comes from the "push" process that transfers data from a data warehouse or Hadoop instance to the live system. In a traditional database this will often mean rebuilding the index on the live system with the new data. Doing millions of SQL insert or update statements is generally not at all efficient, and typically in a SQL database the data will be deployed as a new table and then swapped to replace the current data when the new table is completely built. This is better than doing millions of individual updates, but it still means the live system is now building a many-GB index for the new data set (or performa) while simultaneously serving live traffic. This alone can take hours or days, and may destroy the performance of live queries. Some people have fixed this by swapping out at the database level (e.g. having an online and an offline database, and then swapping), but this requires effort and means only half the hardware is being utilized.

Voldemort fixes this process by making it possible to prebuild the index itself offline (on Hadoop or wherever), and simply push it out to the live servers and transparently swap. A driver program initiates the fetch and swap procedure in parallel across a whole Voldemort cluster. In LinkedIn's tests this process is reported to reach the I/O limit of either the Hadoop cluster or the Voldemort cluster. This also helps in associating the hot data with its corresponding keys.
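The "swap" step can be pictured with a small sketch. It assumes a layout that the source does not describe: each pushed data set lives in its own version directory and readers open a `latest` symlink, which is repointed with an atomic rename. The directory names and both functions are hypothetical, and the atomicity assumption holds on POSIX file systems.

```python
import os
import tempfile


def swap_in_new_version(store_dir: str, new_version: str) -> None:
    """Repoint store_dir/latest at an already-fetched version directory."""
    link = os.path.join(store_dir, "latest")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_version, tmp_link)  # relative target inside store_dir
    os.replace(tmp_link, link)         # rename over the old link in one step


def demo() -> tuple:
    """Build two fake versions, swap twice, and read through `latest`."""
    seen = []
    with tempfile.TemporaryDirectory() as store_dir:
        for version, payload in (("version-0", "old"), ("version-1", "new")):
            os.mkdir(os.path.join(store_dir, version))
            with open(os.path.join(store_dir, version, "data"), "w") as fh:
                fh.write(payload)
        for version in ("version-0", "version-1"):
            swap_in_new_version(store_dir, version)
            with open(os.path.join(store_dir, "latest", "data")) as fh:
                seen.append(fh.read())
    return tuple(seen)
```

The point of the design is that the expensive work (building the files) happens elsewhere, while the live server only performs a cheap rename.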

Benchmarking anything that involves disk access is notoriously difficult because of sensitivity to three factors:

1. The ratio of data to memory
2. The performance of the disk subsystem
3. The entropy of the request stream

The ratio of data to memory and the entropy of the request stream determine how many cache misses will be sustained, so these are critical. A random request stream is more or less un-cacheable, but fortunately almost no real request streams are random. They tend to have strong temporal locality, which is what page cache eviction algorithms exploit. So we can assume a large ratio of memory to disk, and test against a simulated request stream to get performance information.

Any build process will consist of three stages: (1) partitioning the data into separate sets for each destination node, (2) gathering all data for a given node, and (3) building the lookup structure for that node.

2.2.1 Build Time [8]

The tested time is the complete build time, including mapping the data out to the appropriate node-chunk, shuffling the data to the nodes that will do the build, and finally creating the store files. In general, the time was roughly evenly split between the map, shuffle and reduce phases. The number of map and reduce tasks is a very important parameter: experiments on a smaller data set show that varying the number of tasks can change the build time by more than 25%, but due to time constraints LinkedIn used the defaults Hadoop produced for testing. Here are the times taken:

100GB: 28 mins (400 mappers, 90 reducers)
512GB: 2 hrs 16 mins (2313 mappers, 350 reducers)
1TB: 5 hrs 39 mins (4608 mappers, 700 reducers)

This neglects the additional benefits of Hadoop for handling failures, dealing with slower nodes, etc.


DDB, Spring 2013 |8 In addition, this process is scalable: it can be run on a number of machines equal to the number of chunks (700 in our 1TB case) not the number of destination nodes (only 10). Data transfer between the clusters happens at a steady rate bound by the disk or network. In LinkedIns Amazon instances this is around 40MB/second. 2.2.2 Online Performance [8] Lookup time for a single Voldemort node compares well to a single MySQL instance as well. Consider a local test against the 100GB per-node data from the 1 TB test. Let it run on an Amazon Extra Large instance with 15GB of RAM and the 4 ephemeral disks in a RAID 10 configuration. To run the tests 1 million requests from a real request stream recorded on the production system against each of storage systems, be simulated. Then the following performance for 1 million requests against a single node is resulted: MySQL Reqs per sec. Median req. Time Avg. req. Time 99th percentile req. time 727 0.23 ms 13.7 ms 127.2 ms Voldemort 1291 0.05 ms 7.7 ms 100.7 ms There are three Hadoop Grids, A, B, and C, for which White Elephant will compute statistics as follows: 1. Upload Task: a task that periodically runs on the Job Tracker for each grid and incrementally copies new log files into a Hadoop grid for analysis. 2. Compute: a sequence of MapReduce jobs coordinated by a Job Executor parses the uploaded logs and computes aggregate statistics. 3. Viewer: a viewer app incrementally loads the aggregate statistics, caches them locally, and exposes a web interface which can be used to slice and dice statistics for your Hadoop clusters 2.3.1 Architecture [10] Here's a diagram outlining the White Elephant architecture:

These numbers are both for local requests with no network involved as the only intention is to benchmark the storage layer of these systems. 2.3 White Elephant: The Hadoop Tool LinkedIns solution of a Hadoop Tool to manage and configure the Network analytics is White Elephant. At LinkedIn it is used for product development (e.g., predictive analytics applications like People You May Know and Endorsements), descriptive statistics for powering our internal dashboards, ad-hoc analysis by data scientists, and ETL. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts. White Elephant fills several needs:

2.4 Sensei DB Sensei DB is a distributed searchable database that handles complex semi-structured queries. It can be used to power consumer search systems with rich structured data. It is an Open-source, distributed, real-time, semi-structured database which powers LinkedIn homepage and LinkedIn Signal. Some Features of this database include:

Scheduling: when you have a handful of periodic jobs, its easy to reason about when they should run, but that quickly doesnt scale. The ability to schedule jobs at periods of low utilization helps maximize cluster efficiency. Capacity planning: to plan for future hardware needs, operations need to understand the resource usage growth of jobs. Billing: Hadoop clusters have finite capacity, so in a multi-tenant environment its important to know the resources used by a product feature against its business value.

Full-text search
Fast real-time updates
Structured and faceted search
BQL: SQL-like query language
Fast key-value lookup
High performance under concurrent heavy update and query volumes
Hadoop integration

2.5 Avatara: OLAP for Web-scale Analytics Products

The last important part of LinkedIn's SNA stack described in this paper is Avatara, an OLAP system for web-scale analytics products. LinkedIn has many analytical insight products, such as "Who's Viewed My Profile?" and "Who's Viewed This Job?". At their core, these are multidimensional queries. For example, "Who's Viewed My Profile?" takes a member's profile views and breaks them down by industry, geography, company, school, etc. to show the richness of the people who viewed the profile, and likewise of those who viewed a job [12].
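The breakdown of profile views by dimension described above can be illustrated with a minimal sketch. It is not Avatara's implementation; the records, the field names and the breakdown helper are hypothetical and exist only to show what a multidimensional count looks like.

```python
from collections import Counter

# Hypothetical profile-view events for one member; the field names and
# values are illustrative and are not taken from LinkedIn's schema.
profile_views = [
    {"industry": "Software", "geography": "Bangalore", "company": "Acme"},
    {"industry": "Software", "geography": "Delhi", "company": "Globex"},
    {"industry": "Finance", "geography": "Bangalore", "company": "Initech"},
]


def breakdown(views, dimensions):
    """Count the views per value of each requested dimension."""
    return {dim: Counter(view[dim] for view in views) for dim in dimensions}


summary = breakdown(profile_views, ["industry", "geography"])
for dimension, counts in summary.items():
    print(dimension, dict(counts))
```

Each requested dimension yields its own table of counts, which is the kind of per-dimension summary a "Who's Viewed My Profile?" page would display.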

Sensei DB supports faceted search over the rich structured data that LinkedIn incorporates into user profiles. The fundamental idea was to give individuals an easy and natural way to slice and dice search results, or content in general, so a faceted search paradigm is well suited not only to retrieval but also to navigation and discovery. Since a LinkedIn member profile has these rich structural dimensions along with rich text data, it seemed only a matter of time before such an interface was created.

A click on a facet value is equivalent to filtering the search results by that value. For example, searching for "John" and then selecting the facet value San Francisco should return only people called John in San Francisco, i.e. John + facet_value(San Francisco) = John AND location:(San Francisco). Navigating through results in this way never leads to a dead end.
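The filtering rule above can be sketched in a few lines. This is a toy in-memory illustration, not Sensei's API: the member records and the search and facet_counts helpers are hypothetical. It also assumes that the interface only offers facet values that occur in the current results, which is one common way to guarantee that a click never produces an empty result.

```python
from collections import Counter

# Hypothetical member records; names and fields are illustrative only.
members = [
    {"name": "John Smith", "location": "San Francisco"},
    {"name": "John Doe", "location": "New York"},
    {"name": "Jane Roe", "location": "San Francisco"},
]


def search(records, text, facets=None):
    """Return records whose name contains `text` and that match every
    selected facet value (text query AND field:value AND ...)."""
    facets = facets or {}
    return [
        record
        for record in records
        if text.lower() in record["name"].lower()
        and all(record.get(field) == value for field, value in facets.items())
    ]


def facet_counts(results, field):
    """Count how many of the current results carry each value of `field`."""
    return Counter(record[field] for record in results)


johns = search(members, "john")
offered_locations = facet_counts(johns, "location")
sf_johns = search(members, "john", {"location": "San Francisco"})
print(dict(offered_locations))
print([record["name"] for record in sf_johns])
```

Because offered_locations is computed from the current result set, every value in it matches at least one record, so selecting it cannot lead to a dead end.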

What was implemented is essentially a query engine for the following type of query:

    SELECT f1, f2, ..., fn
    FROM members
    WHERE c1 AND c2 AND c3 ...
    MATCH (fulltext query, e.g. "java engineer")
    GROUP BY fx, fy, fz
    ORDER BY fa, fb
    LIMIT offset, count

Deferring this query to a traditional RDBMS over tens to hundreds of millions of rows, with a sub-second query latency SLA, is not feasible. Thus a distributed system like Sensei, which handles the above query at internet scale, is necessary. Below is a faceted search snapshot [11].

[Figure: faceted search snapshot]


Online analytical processing (OLAP) has been the traditional approach to solving these multi-dimensional analytical problems. However, LinkedIn had to build a solution that can answer these queries in milliseconds across 175+ million members, and so built Avatara. Avatara is LinkedIn's scalable, low-latency, and highly available OLAP system for "sharded" multi-dimensional queries within the time constraints of a request/response loop.

An interesting insight for LinkedIn's use cases is that queries span relatively few dimensions, usually tens and at most a hundred, so the data can be sharded across a primary dimension. For "Who's Viewed My Profile?", the cube can be sharded by the member herself, as the product does not allow analyzing profile views of anyone other than the member currently logged in.

Here is a brief overview of how it works. As shown in the figure below, Avatara consists of two components:

1. An offline engine that computes cubes in batch
2. An online engine that serves queries in real time

[Figure: Avatara architecture]

The offline engine computes cubes with high throughput by leveraging Hadoop for batch processing. It then writes the cubes to the Voldemort DDBS. The online engine queries the Voldemort store when a member loads a page. Every piece of this architecture runs on commodity hardware and can easily be scaled horizontally. The above diagram also shows how Hadoop integrates with Voldemort, LinkedIn's key-value store.

2.5.1 Offline Engine

The offline batch engine processes data through a pipeline that has three phases:

1. Pre-processing
2. Projections and joins
3. Cubification

Each phase runs one or more Hadoop jobs and produces output that is the input for the subsequent phase. Hadoop is used for its built-in high throughput, fault tolerance and horizontal scalability. The pipeline pre-processes raw data as needed, projects out dimensions of interest, performs user-defined joins, and at the end transforms the data into cubes. The result of the batch engine is a set of small sharded cubes, represented by key-value pairs, where each key is a shard (for example, member_id for "Who's Viewed My Profile?") and the value is the cube for that shard.

2.5.2 Online Engine

All cubes are bulk loaded into Voldemort. The online query engine retrieves and processes data from Voldemort and returns the results to the client. It provides SQL-like operators such as select, where and group by, plus some math operations. The widespread adoption of SQL makes it easy for application developers to interact with Avatara. With Avatara, 80% of queries can be satisfied within 10 ms, and 95% of queries can be answered within 25 ms for "Who's Viewed My Profile?" on a high-traffic day.

2.6 Conclusions

When the scale of data began to overload the LinkedIn servers, their solution wasn't to add more nodes but to cut out some of the matching heuristics that required too much compute power. Instead of writing algorithms to make People You May Know more accurate, their team worked on getting LinkedIn's Hadoop infrastructure in place and built a distributed database called Voldemort. They then built Azkaban, an open source scheduler for batch processes such as Hadoop jobs, and Kafka, another open source tool referred to as the big data equivalent of a message broker. At a high level, Kafka is responsible for managing the company's real-time data and getting those hundreds of feeds to the apps that subscribe to them with minimal latency. A 2012 study comparing systems for storing APM monitoring data reported that Voldemort, Cassandra, and HBase offered linear scalability in most cases, with Voldemort having the lowest latency and Cassandra having the highest throughput.

Why hasn't LinkedIn shifted away from a NoSQL database like Voldemort? The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical. It is made much worse by the serial nature of most applications: each component of the page blocks on reads from the data store, as well as on the completion of the operations that come before it. Non-relational data stores reverse this model completely, because they don't have the complex read operations of SQL, as mentioned by the LinkedIn SNA team in the interview with Ryan King.

Acknowledgements

The authors would like to acknowledge the LinkedIn Data Team, which has open-sourced its data stores such as Voldemort and SNA tools such as Sensei DB, Avatara and Azkaban, thereby providing various means for research.


References

[1]
[2] Dynamo: Amazon's Highly Available Key-Value Store
[3] - The data team which manages the SNA of LinkedIn
[4] ml
[5] data_store%29
[6] Time, Clocks, and the Ordering of Events in a Distributed System, for the versioning details.
[7] Eventual Consistency Revisited. A discussion on Werner Vogels' blog on the developer's interaction with the storage system and what the tradeoffs mean in practical terms.
[8] Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. Consistency, Availability and Partition-tolerance.
[9] Berkeley DB performance. A somewhat biased overview of bdb performance.
[10] Google's Bigtable, for comparison, a very different approach.
[11] One Size Fits All: An Idea Whose Time Has Come and Gone. Very interesting paper by the creator of Ingres, Postgres and Vertica.
[12] One Size Fits All? - Part 2, Benchmarking Results. Benchmarks mentioned in the paper.
[13] Consistency in Amazon's Dynamo. Blog posts on Dynamo.
[14] Paxos Made Simple, Two-phase commit. Wikipedia description.
[15] The Life of a Typeahead Query. The various technical aspects and challenges of real-time typeahead search in the context of a social network.
[16] Efficient type-ahead search on relational data: a TASTIER approach. A relational approach for typeahead searching by means of specialized index structures and algorithms for joining related tuples in the database.
[16] LinkedIn, A powerhouse. Interviews with the developing team.
[17] The principles behind MapReduce and Hadoop.
[18] !forum/project-voldemort
