LIquid: a Large-Scale Relational Graph Database
29 Jul 2024 (2 months ago)
LinkedIn's Liquid: A Relational Graph Database
- The speaker, a graph database implementer, introduces Liquid, a relational graph database built at LinkedIn.
- Liquid handles about 2 million queries per second (QPS) and is a large-scale database system.
Motivation for Liquid
- LinkedIn built Liquid because the company's core value lies in second-degree connections, which are connections beyond a user's immediate network.
- LinkedIn's primary workload is graph serving, which involves browsing a large graph to find interesting connections.
- Graph serving requires real-time performance, with responses within a quarter to half a second.
- Other graph workloads include graph analytics, which involves analyzing the entire graph and takes minutes, and graph computation, which involves running algorithms on the entire graph and can take minutes or hours.
Liquid's Design
- Liquid is based on the relational model and data log, which are well-established concepts in computer science.
- The speaker emphasizes that Liquid is not a new invention but rather a practical application of existing theoretical concepts.
Data Representation
- The database uses two tables: one for vertices and one for edges.
- Vertices are represented by strings, but the database maps them to integers for efficiency.
- Edges are represented by three vertices, forming a compound primary key. This means an edge either exists or it doesn't, making it a simple structure.
- The speaker emphasizes that this simple structure is the core of the database, with the remaining details being more complex.
Indexing and Access
- The database is designed for browsing large amounts of data, requiring fast access to a potentially massive number of edges.
- Due to the size of the data, random access is unavoidable, and the database uses hash tables to achieve this.
- The speaker explains that the choice of three vertices for an edge is deliberate. It allows for efficient indexing and aligns with how people often encode knowledge.
Data Structures and Inverted Index
- The database is designed to work with data structures commonly used by programmers, such as strings and pointers.
- The speaker explains how a simple fact, like "Herman Melville wrote Moby Dick," can be represented in a relational database using structures and pointers. This creates a hand-built inverted index of the underlying relation.
DataLog Query Language
- The speaker then introduces DataLog, a query language that provides a more natural syntax for expressing queries compared to SQL.
- DataLog uses rules and assertions to represent relationships and facts.
- The speaker emphasizes that DataLog allows for querying inbound edges, which is a valuable feature for exploring a new database. This capability is difficult to achieve with traditional SQL queries.
Predicates in DataLog
- The speaker introduces the concept of predicates in DataLog, which are used to define the properties of relationships.
- Predicates can have different cardinalities, indicating the number of possible values for a given property.
- The speaker demonstrates how to define and use predicates in DataLog, showing how they can be used to represent information about books and their titles.
- The speaker highlights that predicates themselves are nodes in the graph, allowing for the addition of metadata to them. This enables queries that can explore the relationships between predicates.
Compound Predicates
- The speaker explains that predicates in Liquid are simple binary relationships, but many real-world relationships are more complex.
- Examples of complex relationships include "employer-employee with start date," "marriage with spouse and start date," and "actor-role-film."
- To represent these complex relationships, Liquid uses "compound predicates," which are essentially relationship tables with a compound primary key.
- This approach allows for the representation of multi-way relationships and their associated attributes, such as start dates, stop dates, and scores.
- The speaker emphasizes that this concept of compound predicates is not new and has been used in relational databases for decades, dating back to Peter Chen's entity-relationship modeling in the 1970s.
Advantages of Graph Databases
- The text discusses the advantages of using a graph database for representing relationships, particularly symmetric relationships like "mutual friends."
- In SQL, representing a mutual friendship requires additional logic to ensure the order of friends doesn't matter.
- Graph databases, however, naturally handle symmetric relationships through their structure, eliminating the need for extra logic.
Node Representation and Ontology
- The text highlights that nodes in a graph database are immutable strings representing entities, and they can be assigned types for parsing purposes.
- The text argues against storing complex data structures within nodes, citing the challenges of inheritance and ontology management.
- The example of the Beatles being classified as a person in a hierarchical system illustrates the limitations of a single ontology.
- The text emphasizes that graph databases are flexible and don't require a rigid hierarchical structure, allowing for the representation of facts without imposing a specific order or structure.
Query Evaluation Challenges
- The text discusses the challenges of query evaluation in a large-scale relational graph database, specifically focusing on the issues of cross products and static planning.
- Cross products, a common operation in SQL, can lead to results that are significantly larger than the actual data, making queries inefficient.
- The text highlights the need for subgraph returns, which allow for the retrieval of multiple tables instead of a single table, enabling more efficient processing of complex relationships.
- The text emphasizes the importance of outer joins, which are essential for efficiently handling relationships in a graph database.
- Dynamic planning, which involves choosing the optimal query plan based on the specific data, is crucial for efficient query execution in a graph database.
- The text notes that unlike sorted storage, hashtable storage allows for constant-time access to set sizes, making dynamic planning more efficient.
Dynamic Plan Optimization
- Dynamic plan optimization is close to optimal: The dynamic plan for finding the size of a set is not as efficient as a hand-coded plan, but it is still close to optimal. This is because the dynamic plan needs to perform some calculations to determine the best approach, but these calculations are tractable.
Graph Database Definition
- Graph database definition: A graph database is an implementation of the relational model with four key properties:
- All relationships are equal: Every relationship is represented as an edge, unlike SQL databases where relationships within a table are considered first-class and relationships across tables are second-class.
- Fast edge retrieval: Since there are many edges in a graph database, retrieving them must be fast (ideally in constant time).
- Query results are subgraphs: Complex queries should return structured results that are easy for applications to consume.
- Constant-time schema change: Adding new predicates should be possible without requiring any significant changes to the underlying database structure.
Comparison to SQL
- Comparison to SQL: Liquid differs from SQL in several ways:
- No denormalization: Liquid is fully normalized, unlike SQL databases which often use denormalization for performance.
- No nulls or ternary logic: Edges either exist or they don't, and unbound variables must be explicitly bound.
- Constraints through predicates: Liquid allows constraints to be defined through predicates, which is more flexible than SQL's approach of using columns for constraints.
Benefits of Liquid
- The speaker discusses the benefits of using a relational graph database like Liquid.
- Liquid offers a simpler schema than OWL, allowing for n-relationships, symmetric relationships, and composable rules.
- The speaker emphasizes that Liquid does not invent a new query language but instead leverages the existing relational model.
- The speaker believes that the future of data is big graphs, and the future of applications is big queries.
Liquid's Features
- Liquid allows for compound relationships, such as representing a skill level with a score.
- There are no limits on fan-out in Liquid, allowing for representation of skewed data distributions.
- The speaker argues that the graph model is superior to traditional data models for data modeling, as it is isomorphic to entity relationship modeling.
Search Workloads and Indexing
- The speaker suggests that graph databases are particularly well-suited for handling "search" workloads, which are often handled with caching in front of conventional databases.
- The speaker believes that indexing is a better approach than caching, as it provides complete data and consistency properties.
Caching vs. Indexing
- The previous system at LinkedIn was a cached system.
- The current system, Liquid, is uncached.
- Site reliability engineers (SREs) at LinkedIn are happy with the uncached nature of Liquid.