
July 11th, 2010

Apples, Oranges and NOSQL


Roi Aldaag, Architect & Consultant
Nadav Wiener, Architect & Consultant

Agenda
Introduction
  What is NoSQL?
  What's wrong with RDBMS?
  Why now?
RDBMS vs. NoSQL
  Scaling
  CAP Theorem
  ACID vs. BASE
NoSQL Taxonomy
  Key / Value
  Column
  Document
  Graph
How to choose?
  Comparing Apples to Oranges
  Polyglot Persistence

Introduction

Introduction
Question: What do they all have in common?

Introduction
Before we answer, some facts:

Daily Page Views   Daily Visitors   Data size
7.8 x 10^9         620 x 10^6       Petabytes
7.1 x 10^9         500 x 10^6       Petabytes
550 x 10^6         56 x 10^6        Petabytes
350 x 10^6         37 x 10^6        Terabytes
82 x 10^6          12 x 10^6        Terabytes

Source: http://www.alexa.com (July 2010)

Introduction
Answer: They use NoSQL data stores


Introduction

Why!?


Introduction
Relational DBs Have Scaling Limitations
ACID doesn't scale well horizontally
  Sharding breaks relations
  Joins are inefficient
  Transaction overhead
Schema is not flexible
  Predefined
  Hard to evolve

Introduction
What is NoSQL?
"No SQL" / "Not Only SQL"
A collective description of open source, non-relational data stores
  Highly distributed
  Highly scalable
  Not ACID
  and... doesn't use SQL
Term coined in 2009 at a convention called NoSQL (Eric Evans)
Started a movement that is gaining momentum


Introduction
Why now?
NoSQL data stores predate RDBMS (1970)
  But remained a niche
RDBMS: the most popular and generic option
Web 2.0 introduced new requirements:
  Exponential increase in data
  Information connectivity
  Semi-structured data
NoSQL data stores had answers
  When the time was right
  When RDBMSs didn't

Introduction
It's theory time:

Scaling

Scaling
Scaling Up
Adding resources to a single node in a system
  Add more CPUs or memory
  Move the system to a larger machine
Pros:
  Quick and simple
Cons:
  Outgrowing the capacity of the largest system available (Moore's law)
  Expensive
  Creates vendor lock-in

Scaling
Scaling Out
Add more nodes to a system
Functional Scaling (vertical)
  Grouping data by function and spreading functional groups across databases
Sharding (horizontal)
  Splitting the same functional data across multiple databases
Pros: More flexible
Cons: More complex
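Sharding can be pictured as a routing function that sends each key to one of several databases. The sketch below is only an illustration: it assumes hash-modulo routing (the slides do not name a routing scheme), and `shard_for` and `ShardedStore` are hypothetical names, with one in-memory dict standing in for each database.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme: hash modulo)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class ShardedStore:
    """Hypothetical stand-in: one dict per shard instead of one database per shard."""

    def __init__(self, shard_count: int):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=3)
for user in ["alice", "bob", "carol", "dave"]:
    store.put(user, {"name": user})
```

With plain modulo routing, changing `shard_count` remaps most keys, which is one face of the "more complex" con above.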

Distributed Databases

Distributed Databases
Many nodes, same database
[Diagram: Node 1, Node 2, Node 3]

Distributed Databases
What are the requirements of a distributed database?
Consistency
  All clients see the same data
Availability
  All clients can always access data
Partition tolerance
  The ability to continue working when the network topology is broken
  The ability to recover once the network is healed

Distributed Databases
CAP Theorem (E. Brewer, N. Lynch)
You can fully satisfy at most 2 out of 3
  Compromise on the 3rd
Not all or nothing
  Choose various levels of consistency, availability or partition tolerance
Recognize which of the CAP properties your business needs for the task

Distributed Databases
CA: Consistency & Availability
Partition tolerance is compromised
Single-site clusters (easier to ensure all nodes are always in contact)
When a network partition occurs, the system blocks
E.g. Two-Phase Commit (2PC)

Distributed Databases
CP: Consistency & Partition tolerance
Availability is compromised
Access to some data may be temporarily limited
The rest is still consistent/accurate
E.g. sharded database

Distributed Databases
AP: Availability & Partition tolerance
Consistency is compromised
System is still available under partitioning
Some data returned may be temporarily out of date
Requires a conflict resolution strategy
E.g. DNS, caches, master/slave replication
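To make the AP trade-off concrete, here is a toy sketch of two replicas that keep accepting writes during a partition and reconcile afterwards. It assumes last-write-wins by timestamp as the conflict resolution strategy; `Replica` and `sync` are hypothetical names, not the API of any product mentioned in these slides.

```python
class Replica:
    """Toy replica: keeps the newest (timestamp, value) pair per key."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries once the partition heals; the newer timestamp wins on both sides."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


r1, r2 = Replica(), Replica()
# During the partition each replica stays available and accepts writes on its own.
r1.write("cart", ["apple"], timestamp=1)
r2.write("cart", ["orange"], timestamp=2)
stale = r1.read("cart")  # r1 has not seen the newer write yet
sync(r1, r2)             # partition healed: both replicas converge
```

Note that last-write-wins silently drops the older write, which is why some systems prefer manual or application-level resolution.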

ACID vs. BASE

ACID vs. BASE
ACID: a quick recap
Atomicity
  When a part of the transaction fails, the entire transaction fails; database state is left unchanged
Consistency
  A transaction takes the database from one consistent state to another
Isolation
  A transaction can't see dirty state from other transactions
Durability
  Commit means commit.
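Atomicity is easy to observe with an embedded relational database. The sketch below uses Python's standard `sqlite3` module; the `accounts` table and the `transfer` helper are made up for this illustration. A failed second statement rolls back the first one as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


overdraft_ok = transfer(conn, "alice", "bob", 500)  # debit fails, credit is undone
after_failure = balances(conn)
small_ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```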

ACID vs. BASE
BASE
The CAP complement of ACID
  Just had to be called BASE
Backronym:
  Basically Available
  Soft State
  Eventually Consistent

ACID vs. BASE
RDBMS & ACID / NoSQL & BASE
RDBMSs strive to provide ACID guarantees
  ACID forces consistency
NoSQL solutions often scale through BASE
  BASE accepts that conflicts will happen

Taxonomy

Taxonomy
[Diagram: Key / Value, Column, Document (XML, TXT, BIN), Graph]

Taxonomy

Key / Value Databases


Taxonomy
Key/Value Stores
Simple key/value lookups (DHT)
Value is opaque
Focus on scaling to huge amounts of data
Designed to handle massive load
E.g.
  Riak
  Project Voldemort
  Redis
Based on Amazon's Dynamo paper

Taxonomy
Key/Value e.g.: Riak
No single point of failure
No machines are special or central
MapReduce queries (Erlang / JavaScript)
HTTP/JSON API
Ring cluster with automatic replication
Elastic / partition rebalancing

Written in: Erlang, C, JavaScript
Developed by: Basho Technologies
Java client: (jonjlee / riak-java-client)
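The ring and rebalancing bullets can be illustrated with consistent hashing in the style of the Dynamo paper. This is a generic sketch, not Riak's actual implementation: the `Ring` class, the virtual-node count and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring; every node owns several points on it."""

    def __init__(self, nodes, vnodes=8):
        self.points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]
        self.node_count = len(set(nodes))

    def preference_list(self, key, replicas=2):
        """Walk clockwise from the key's position and collect distinct nodes."""
        replicas = min(replicas, self.node_count)
        i = bisect.bisect_right(self.hashes, _hash(key))
        found = []
        while len(found) < replicas:
            node = self.points[i % len(self.points)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found


ring = Ring(["node1", "node2", "node3"])
owners = ring.preference_list("bucket/key", replicas=2)

# Rebalancing: after adding a node, a key either stays put or moves to the new node.
bigger = Ring(["node1", "node2", "node3", "node4"])
keys = [f"key{i}" for i in range(200)]
moved = [k for k in keys
         if ring.preference_list(k, 1) != bigger.preference_list(k, 1)]
```

Because no node is special, any node could compute the same preference list for a key, which fits the "no single point of failure" bullet.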

Key/Value e.g.: Riak
Data Model
  Key/value pairs are stored in a Bucket
  A Bucket ~ a namespace
Versioning
  Each update is tracked by a Vector Clock
  An algorithm for determining ordering and detecting conflicts
When in conflict
  Last wins / manual resolution
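A vector clock can be sketched as a per-node counter map. The functions below are a minimal illustration with hypothetical names (`increment`, `compare`), not Riak's internal representation: two versions conflict when neither clock dominates the other.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify clock a relative to clock b: equal, before, after or conflict."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "conflict"  # concurrent updates: needs last-wins or manual resolution


v1 = increment({}, "n1")    # first write, coordinated by node n1
v2 = increment(v1, "n1")    # a later update through n1
v3 = increment(v1, "n2")    # a concurrent update through n2, also based on v1
```

Here `v1` is an ancestor of both `v2` and `v3`, while `v2` and `v3` are siblings that the store cannot order on its own.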

Key/Value e.g.: Riak
Example: REST API
Read an object
  GET /riak/bucket/key
Store a new object
  POST /riak/bucket
Store an object with an existing key (update)
  PUT /riak/bucket/key

Key/Value e.g.: Riak
MapReduce
A framework supporting distributed computing on large data sets on clusters of machines
Leverages parallel processing power
Introduced by Google
Inspired by the map / reduce functions in functional programming
  Map step
  Reduce step

Key/Value e.g.: Riak
MapReduce example: Inverted Index
Map
  Parse each document
  Emit a sequence of <word, doc_id> pairs

Input <doc_id, doc_text>, one document per node:
  Node 1: <100, TXT1>
  Node 2: <200, TXT2>
  Node 3: <300, TXT3>

Output <word, doc_id>:
  <word1, 100>, <word2, 100>, <word2, 200>, <word3, 300>

Key/Value e.g.: Riak
MapReduce example: Inverted Index
Reduce
  Accept all pairs for a given word
  Sort the corresponding document IDs
  Emit a <word, list(document_id)> pair

Output <word, list(document_id)>:
  <word1, (100)>, <word2, (100, 200)>, <word3, (300)>
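The two steps of the inverted index example can be sketched in a single process. This is plain Python rather than a Riak MapReduce job, and `map_step`, `reduce_step` and `inverted_index` are illustrative names; in a real cluster the map step would run on the nodes holding the documents.

```python
from collections import defaultdict


def map_step(doc_id, doc_text):
    """Parse one document and emit (word, doc_id) pairs."""
    return [(word, doc_id) for word in doc_text.lower().split()]


def reduce_step(word, doc_ids):
    """Accept all doc ids for one word and emit (word, sorted unique ids)."""
    return (word, sorted(set(doc_ids)))


def inverted_index(documents):
    grouped = defaultdict(list)  # shuffle phase: group emitted pairs by word
    for doc_id, text in documents.items():
        for word, emitted_id in map_step(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_step(word, ids) for word, ids in grouped.items())


# Same shape as the slide: three documents, three distinct words.
documents = {100: "word1 word2", 200: "word2", 300: "word3"}
index = inverted_index(documents)
```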

Taxonomy

BigTable and Column Oriented Databases


Taxonomy
Column Stores: BigTable derivatives
Conceptually a single, infinitely large table
Each row can have a different number of columns
Table is sparse: |rows| * |columns| > |values|
Based on Google's BigTable paper
E.g.
  Cassandra
  HBase
  Hypertable

Taxonomy
Use Case: Manage products with diverse attributes
RDBMS:
  Create a central table with common attributes
  Create a table per product with unique attributes
  Use a join query
  Alternatively, create a table that holds metadata on products
NoSQL: Column-oriented database
  Use arbitrary columns

Taxonomy
Column Store e.g.: Cassandra
Data model: Google's BigTable
Infrastructure: Amazon Dynamo
Incremental scalability
Flexible schema
No single point of failure (distributed P2P)
Optimistic replication (Gossip protocol)

Written in: Java
Developed by: Facebook
Java client: e.g. Hector / Thrift

Column e.g.: Cassandra
Data Model
Column
  Smallest increment of data: tuple of name, value, timestamp

{
  name: "emailAddress",
  value: "nosql@alphacsp.com",
  timestamp: 123456789
}

Column e.g.: Cassandra
SuperColumn
  A sorted, associative, unbounded array of columns

{ // this is a SuperColumn
  name: "homeAddress",
  // with an unbounded array of Columns
  value: {
    // the key is the name of the Column
    street: {name: "street", value: "s", timestamp:...},
    city: {name: "city", value: "c", timestamp:...},
    zip: {name: "zip", value: "z", timestamp:...}
  }
}

Column e.g.: Cassandra
ColumnFamily
  A container (~Table) for columns sorted by their names
  Column Families are referenced and sorted by row keys

Users = { // ColumnFamily
  john: { // key to row in CF
    "role" : "admin",
    "status" : "offline",
    "nick" : "dude1934"
  }, // end row
  fred: { // another row
    "nick" : "freddy",
    "email" : "fred@example.com",
    "age" : "25",
    "gender" : "male"
  },
  // more rows
}
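The column and column family structures above can be mimicked with nested dictionaries. The sketch below is an in-memory illustration only, not a Cassandra client: `ColumnFamily`, `insert` and `get_row` are hypothetical names, and resolving competing writes by the newer timestamp is a simplification assumed for the example.

```python
import time


class ColumnFamily:
    """Toy column family: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        existing = row.get(column)
        # Assumed rule for this sketch: the write with the newer timestamp wins.
        if existing is None or ts >= existing[1]:
            row[column] = (value, ts)

    def get_row(self, row_key):
        """Return (column name, value) pairs sorted by column name."""
        row = self.rows.get(row_key, {})
        return [(name, row[name][0]) for name in sorted(row)]


users = ColumnFamily("Users")
users.insert("john", "role", "admin", timestamp=1)
users.insert("john", "nick", "dude1934", timestamp=1)
users.insert("fred", "nick", "freddy", timestamp=1)
users.insert("fred", "email", "fred@example.com", timestamp=1)
users.insert("fred", "nick", "fred", timestamp=0)  # older write, ignored
```

Rows in the same family carry different column sets, which is the sparse-table property described earlier.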

Column e.g.: Cassandra
Keyspace
  The outermost grouping of data (~DB Schema)
  Contains ColumnFamilies
  There is no imposed relationship between ColumnFamilies

Column e.g.: Cassandra
Example
[Diagram: a Keyspace containing a Tweets CF and a Timeline CF]

Taxonomy

Document Oriented Databases

Taxonomy
Document Store
Store semi-structured documents (think JSON)
Document versioning
Map/Reduce based queries, sorting, aggregation, etc.
DB is aware of internal structure
E.g.
  MongoDB
  CouchDB
  JackRabbit (JCR / JSR 170)

Taxonomy
Use Case: Blog with tagged posts and comments
RDBMS:
  Table for each: posts, comments, tags
  Foreign relations
NoSQL: Document storage
  Store post + tags + comments as a document

Taxonomy
Document Store e.g.: MongoDB
MongoDB (from "humongous")
Manages collections of JSON-like documents (BSON)
Queries can return specific fields of documents
Supports secondary indexes
Atomic operations on single documents

Developed by: 10gen
Written in: C++
Clients: Java, Scala and more

Document e.g.: MongoDB
Example: Blog posts
Suppose you host a blog, where each post is tagged:

db.posts.save({
  _id : 3,
  author : "john",
  title : "Apples, Oranges and NOSQL",
  text : "This article will",
  tags : [ "database", "nosql" ]
});

Notice how posts have an array of tags

Document e.g.: MongoDB
MongoDB supports secondary indexes and a query optimizer
  Compound indexes are also supported

db.posts.ensureIndex({ tags: 1 });
db.posts.ensureIndex({ author: 1 });
db.posts.find({ author: "john", tags: "nosql" });
// Result:
{
  "_id" : 3,
  "author" : "john",
  "title" : "Apples, Oranges and NOSQL",
  "text" : "This article will",
  "tags" : [ "database", "nosql", "mongodb" ]
}

Document e.g.: MongoDB
Let's update our posts to include some comments:

db.posts.update({ _id: 3 }, {
  $inc: { comments_count: 4 },
  $pushAll: { comments: [
    { text: "Comment 1" },
    { text: "Comment 2", author: "Mr. T" },
    { text: "Comment 3" },
    { text: "Comment 4" }
  ] }
});
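To show what the query and update above do to a document, here is an in-memory imitation in Python. It is not MongoDB or a MongoDB driver: `matches` and `apply_update` are hypothetical helpers that only mimic array-aware equality matching and the `$inc` / `$pushAll` operators used on the slide.

```python
import copy


def matches(doc, query):
    """Equality match; a list field matches when it contains the wanted value."""
    for field, wanted in query.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def apply_update(doc, update):
    """Return an updated copy of doc, imitating $inc and $pushAll only."""
    updated = copy.deepcopy(doc)
    for field, amount in update.get("$inc", {}).items():
        updated[field] = updated.get(field, 0) + amount
    for field, items in update.get("$pushAll", {}).items():
        updated.setdefault(field, []).extend(items)
    return updated


post = {"_id": 3, "author": "john", "tags": ["database", "nosql"]}
found = matches(post, {"author": "john", "tags": "nosql"})
updated = apply_update(post, {
    "$inc": {"comments_count": 2},
    "$pushAll": {"comments": [{"text": "Comment 1"},
                              {"text": "Comment 2", "author": "Mr. T"}]},
})
```

Later MongoDB releases replaced `$pushAll` with `$push` combined with `$each`, so the slide's update reflects the 2010-era shell.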

Taxonomy

Graph Databases


Taxonomy
Graph databases
Inspired by mathematical graph theory: G = (V, E)
Models the structure of data
Navigational data model
Scalability / data complexity
Data model: key-value pairs on edges / nodes
Relationships: edges between nodes
E.g.
  Neo4j
  Pregel (Google's PageRank)
  AllegroGraph

Taxonomy
Use Case: Connected data - deep relationship links between users in a social network
RDBMS:
  Complex recursive algorithm
  Multiple self joins
  Round trips to DB / bulk read and resolve in RAM
NoSQL: Graph storage
  Network traversal
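The "network traversal" in this use case can be sketched as a breadth-first walk over an adjacency list. This is plain Python, not a graph database API; the `within_depth` function and the sample `friends` data are made up for illustration.

```python
from collections import deque


def within_depth(graph, start, max_depth):
    """Breadth-first traversal: map each reachable user to its distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance


# Hypothetical social network stored as user -> list of friends.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
two_hops = within_depth(friends, "ann", 2)
three_hops = within_depth(friends, "ann", 3)
```

Each extra level of depth is one more loop iteration here, whereas the relational approach needs another self join or round trip per level.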

Taxonomy
Graph e.g.: Neo4j
High-performance graph engine
Embedded / disk based
Work with an OO model: nodes, relationships, properties
ACID transactions
  JTA support: participate in 2PC with your RDBMS

Developed by: Neo Technology
Written in: Java
Clients: Java, client libraries on other platforms

Graph e.g.: Neo4j

http://neo4j.org/


Comparing Apples to Oranges

Comparing Apples to Oranges
Comparing Data Structures
RDBMS
  Databases contain tables, columns and rows
  All rows have the same structure
  Inherent ORM mismatch
NoSQL
  Choose your data structure
  Data is stored in its natural structure (e.g. documents, graphs, objects)

Comparing Apples to Oranges
Comparing Schema Flexibility
RDBMS
  Strict schema, difficult to evolve
  Maintains relations and enforces data integrity
NoSQL
  Structure of data can be changed dynamically
    e.g. column stores: Cassandra
  Data can sometimes be completely opaque
    e.g. key/value: Project Voldemort

Comparing Apples to Oranges
Comparing Normalization & Relations
RDBMS
  The data model is normalized to remove data duplication
  Normalization establishes table relations
NoSQL
  Denormalization is not a dirty word
  Relations are not explicitly defined
  Related data is usually grouped and stored as one unit
    E.g. document, column

Comparing Apples to Oranges
Comparing Data Access
RDBMS
  CRUD operations using SQL
  Access data from multiple tables using SQL joins
  Generic API such as JDBC
NoSQL
  Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
  MapReduce, graph traversals
  REST APIs, portable serialization formats
    BSON, JSON, Apache Thrift, Memcached

Comparing Apples to Oranges
Comparing Reporting Capabilities
RDBMS
  Slice and dice data, then reassemble any way you like
NoSQL
  Hard to repurpose data for ad-hoc usage
  Plan ahead
  Think in advance
    How and what you store
    Data access patterns

Summary

Summary
Why NoSQL / BASE
ACID ruled exclusively for the last 40 years
  Doesn't compromise on consistency
Database industry neglected distributed DBs with availability
Vacuum was filled by NoSQL BASE architectures
  Strict A and P, minimize the compromise on C
Relational databases are now trying to catch up

Summary
NoSQL Limitations
Missing some query capabilities
  Joins / composite transactions
Eventual consistency: not for every problem
Not a drop-in replacement for an ACID RDBMS
No standardization -> product lock-in
Relatively immature (support, bugs, community)

Summary
Choose the right tool for the job
Relational databases and NoSQL databases are designed to meet different needs
RDBMS-only should not be the default
NoSQL databases outperform RDBMSs in their particular niche
No one size fits all / no silver bullet
...but you don't have to choose just one

Summary
Polyglot Persistence
Poly: many
Glot: language
Meshing up persistence mechanisms to best meet requirements
Good integration stories:
  E.g. Neo4j + JDBC using JTA
