Agenda
Introduction
  What is NoSQL?
  What's wrong with RDBMS?
  Why now?
RDBMS vs. NoSQL
  Scaling
  CAP Theorem
  ACID vs. BASE
NoSQL Taxonomy
  Key/Value
  Column
  Document
  Graph
How to choose?
  Comparing Apples to Oranges
  Polyglot Persistence
Introduction
Question: What do they all have in common?
Before we answer, some facts:
Answer: They use NoSQL data stores
Why!?
Relational DBs Have Scaling Limitations
ACID doesn't scale well horizontally
  Sharding breaks relations
  Joins across shards are inefficient
What is NoSQL?
"No SQL" / "Not Only SQL"
A collective description of open-source, non-relational data stores
  Highly distributed
  Highly scalable
  Not ACID
  ...and they don't use SQL
Term popularized in 2009 by a meetup named NoSQL (name suggested by Eric Evans)
Started a movement that is gaining momentum
Why now?
NoSQL data stores predate the RDBMS (1970)
  But remained a niche
  RDBMS became the most popular and generic option
Web 2.0 introduced new requirements:
  Exponential increase in data
  Information connectivity
  Semi-structured data
It's theory time:
Scaling
Scaling Up
Adding resources to a single node in a system
  Add more CPUs or memory
Cons:
  Outgrowing the capacity of the largest system available (Moore's law)
  Expensive
  Creates vendor lock-in
Scaling Out
Add more nodes to a system
Functional scaling (vertical partitioning)
  Grouping data by function and spreading functional groups across databases
Sharding (horizontal partitioning)
  Splitting the same functional data across multiple databases
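A minimal sketch of how sharding can route keys, assuming hash-based placement (one of several possible strategies; range-based or directory-based placement are alternatives). The three in-memory dicts stand in for separate databases, and the names `shard_for`, `put`, and `get` are hypothetical.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical setup: three shards, each modelled as an in-memory dict.
shards = [dict() for _ in range(3)]

def put(key, value):
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    """Look the key up on the single shard that can hold it."""
    return shards[shard_for(key, len(shards))].get(key)

put("user:1", {"name": "alice"})
put("user:2", {"name": "bob"})
print(get("user:1"))
```

Note that a lookup touches exactly one shard, while a query spanning many keys (such as a join) may have to visit all of them. Changing the number of shards also remaps most keys under this simple modulo scheme.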
Distributed Databases
[Figure: a distributed database spread across Node 1, Node 2, and Node 3]
What are the requirements from distributed databases?
Consistency
  All clients see the same data
Availability
  All clients can always access data
Partition tolerance
  The ability to continue working when the network topology is broken
  The ability to recover once the network is healed
CAP Theorem (conjectured by E. Brewer, proved by S. Gilbert and N. Lynch)
You can fully satisfy at most 2 out of 3
  Compromise on the third
Recognize which of the CAP properties your business needs for the task
CA: Consistency & Availability
Partition tolerance is compromised
Single-site clusters (easier to ensure all nodes are always in contact)
When a network partition occurs, the system blocks
e.g. Two-Phase Commit (2PC)
CP: Consistency & Partition tolerance
Availability is compromised
Access to some data may be temporarily limited
The rest is still consistent/accurate
e.g. Sharded database
AP: Availability & Partition tolerance
Consistency is compromised
System is still available under partitioning
Some data returned may be temporarily out of date
Requires a conflict resolution strategy
e.g. DNS, caches, Master/Slave replication
ACID
Atomicity
  All of the operations in a transaction complete, or none of them do
Consistency
  A transaction takes the database from one consistent state to another
Isolation
  A transaction can't see dirty state from other transactions
Durability
  Commit means commit.
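To make these guarantees concrete, here is a small sketch using Python's standard `sqlite3` module with an in-memory database. The `accounts` table, the `CHECK` constraint, and the `transfer` helper are assumptions made up for illustration. A transfer that would break the constraint is rolled back as a whole, so the database never shows a half-applied state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as {name: balance}."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

print(transfer(conn, "alice", "bob", 30))   # succeeds
print(transfer(conn, "alice", "bob", 500))  # violates the CHECK, both updates roll back
print(balances(conn))
```

In the failing transfer the credit to `bob` is applied first and then undone, which is the atomicity part. Doing the same across several shards is what makes ACID hard to scale horizontally.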
Taxonomy
[Figure: the NoSQL families: Key/Value, Column, Document (XML, TXT, BIN), Graph]
Key/Value Stores
Simple key/value lookups (DHT)
Value is opaque
Focus on scaling to huge amounts of data
Designed to handle massive load
E.g. Riak, Project Voldemort, Redis
  Riak and Project Voldemort are based on Amazon's Dynamo paper
Key/Value e.g.: Riak
No single point of failure
No machines are special or central
MapReduce queries (Erlang / JavaScript)
HTTP/JSON API
Ring cluster with automatic replication
Elastic / partition rebalancing
Written in: Erlang, C, JavaScript
Developed by: Basho Technologies
Java client: (jonjlee / riak-java-client)
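The "ring cluster with automatic replication" idea can be sketched with a consistent-hash ring. This is an illustrative model in the spirit of the Dynamo paper, not Riak's actual code. The class name `HashRing`, the `vnodes` parameter, and the node names are all hypothetical.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Place a string on the ring using a stable hash."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node claims several points ("virtual nodes") on the ring,
        # which spreads keys more evenly and keeps every node equal.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key, replicas=3):
        """Return up to `replicas` distinct nodes, walking clockwise from the key."""
        start = bisect.bisect_right(self._points, _hash(key))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners

ring = HashRing(["node1", "node2", "node3", "node4"])
print(ring.preference_list("user:42"))
```

Because a key's owners depend only on ring positions, adding or removing a node moves only the keys adjacent to that node's points, which is what makes rebalancing elastic.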
Versioning
Each update is tracked by a vector clock
  An algorithm for determining ordering and detecting conflicts
When in conflict
  Last write wins / manual resolution
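A minimal sketch of how a vector clock orders updates and detects conflicts. Clocks are modelled as plain dicts of per-node counters. The function names and node names are hypothetical, and real stores attach extra metadata such as timestamps and pruning rules.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if clock `a` has seen every update that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify the relationship between two versions."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "conflict"  # concurrent updates: neither version saw the other

v1 = increment({}, "node1")     # first write, coordinated by node1
va = increment(v1, "node2")     # update of v1 through node2
vb = increment(v1, "node3")     # concurrent update of v1 through node3

print(compare(va, v1))  # after
print(compare(va, vb))  # conflict
```

When `compare` reports a conflict, the store either applies a policy such as last write wins or returns both siblings to the client for manual resolution.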
37
38
39
<word, doc_id>: <word1, 100>, <word2, 100>, <word2, 200>, <word2, 300>
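The <word, doc_id> pairs above look like the output of a map phase that builds an inverted index. The following is a plain-Python sketch of that idea, not a query written against Riak's MapReduce API. The document texts are made up so that they produce the same four pairs.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]

def reduce_phase(pairs):
    """Group the emitted pairs by word into word -> sorted list of doc ids."""
    index = defaultdict(list)
    for word, doc_id in sorted(pairs):
        index[word].append(doc_id)
    return dict(index)

# Hypothetical documents keyed by doc_id.
docs = {100: "word1 word2", 200: "word2", 300: "word2"}

pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index)
```

Each call to `map_phase` needs only its own document, so in a distributed store the map step can run on the node that holds the data and only the small pairs travel over the network.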
Column Stores: BigTable derivatives
Conceptually a single, infinitely large table
Each row can have a different number of columns
Table is sparse: |rows| * |columns| > |values|
Based on Google's BigTable paper
E.g. Cassandra, HBase, Hypertable
Use Case: Manage products with diverse attributes
RDBMS:
  Create a central table with common attributes
  Create a table per product type with its unique attributes
  Use a join query
  Alternatively, create a table that holds metadata on products
NoSQL:
  Column-oriented database
  Use arbitrary columns per row
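A rough sketch of the column-oriented approach to this use case, modelling a column family as a dict of rows where each row holds only the columns it actually has. The row keys, column names, and helper functions are hypothetical, and no real column-store client is involved.

```python
from collections import defaultdict

# A column family modelled as: row key -> {column name -> value}.
products = defaultdict(dict)

def insert(row_key, column, value):
    """Set one column on one row; other rows are unaffected."""
    products[row_key][column] = value

def get_column(row_key, column, default=None):
    """Read one column; a missing column simply is not stored."""
    return products.get(row_key, {}).get(column, default)

# Two product types sharing some columns and adding their own.
insert("book:1", "name", "NoSQL Basics")
insert("book:1", "price", 30)
insert("book:1", "isbn", "hypothetical-isbn")
insert("tv:7", "name", "42-inch TV")
insert("tv:7", "price", 400)
insert("tv:7", "screen_inches", 42)

all_columns = {column for row in products.values() for column in row}
stored_values = sum(len(row) for row in products.values())
print(len(products) * len(all_columns), ">", stored_values)
```

The final line illustrates the sparseness condition |rows| * |columns| > |values|: the equivalent relational table would carry NULLs for every attribute a product does not have.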
Column Store e.g.: Cassandra
Data model: Google's BigTable
Infrastructure: Amazon's Dynamo
Incremental scalability
Flexible schema
No single point of failure (distributed P2P)
Optimistic replication (gossip protocol)
Written in: Java
Developed by: Facebook
Java client: e.g. Hector / Thrift
{ // this is a SuperColumn
  name: "homeAddress",
  // with an unbounded array of Columns
  value: {
    // each key is the name of the Column
    street: {name: "street", value: "s", timestamp:...},
    city: {name: "city", value: "c", timestamp:...},
    zip: {name: "zip", value: "z", timestamp:...}
  }
}
[Figure: Column Family]
[Figure: Keyspace with a Timeline column family (CF)]
Document Store
Store semi-structured documents (think JSON)
Document versioning
Map/Reduce based queries, sorting, aggregation, etc.
DB is aware of the documents' internal structure
E.g. MongoDB, CouchDB, Jackrabbit (JCR, JSR 170)
Use Case: Blog with tagged posts and comments
RDBMS:
  Table for each: posts, comments, tags
  Foreign-key relations
NoSQL:
  Document storage
  Store post + tags + comments as a single document
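As a sketch of the document approach, the post below carries its tags and comments inside one JSON-like structure, so reading a post needs no joins. The field names, the in-memory `posts` collection, and the helper functions are assumptions for illustration, not the API of any particular document store.

```python
import json

# One document holds the post together with its tags and comments.
post = {
    "_id": 1,
    "title": "Hello NoSQL",
    "tags": ["database", "nosql"],
    "comments": [
        {"author": "alice", "text": "Nice post"},
        {"author": "bob", "text": "Thanks"},
    ],
}

# A collection modelled as a dict keyed by document id.
posts = {post["_id"]: post}

def find_by_tag(collection, tag):
    """Return every document whose `tags` array contains the tag."""
    return [doc for doc in collection.values() if tag in doc.get("tags", [])]

def add_comment(collection, post_id, author, text):
    """Append a comment inside the post document itself."""
    collection[post_id]["comments"].append({"author": author, "text": text})

add_comment(posts, 1, "carol", "Bookmarked")
print(json.dumps(find_by_tag(posts, "nosql"), indent=2))
```

Because the store understands the document's internal structure, a query such as `find_by_tag` can be served by an index on the `tags` field rather than by joining three tables.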
Document Store e.g.: MongoDB
MongoDB (from "humongous")
Manages collections of JSON-like documents (BSON)
Queries can return specific fields of documents
Supports secondary indexes
Atomic operations on single documents
Developed by: 10gen
Written in: C++
Clients: Java, Scala and more
[Figure: sample document with five fields; the values shown are 3, "john", "Apples, Oranges and NOSQL", "This article will", ["database", "nosql", "mongodb"]]
Graph Databases
Inspired by mathematical graph theory: G = (V, E)
Models the structure of data
Navigational data model
Scalability / data complexity
Data model: key-value pairs on edges / nodes
Relationships: edges between nodes
E.g. Neo4j, Pregel (Google, used for PageRank), AllegroGraph
Use Case: Connected data - deep relationship links between users in a social network
RDBMS:
  Complex recursive algorithm
  Multiple self-joins
  Round trips to the DB / bulk read and resolve in RAM
NoSQL:
  Graph storage
  Network traversal
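The network traversal mentioned above can be sketched as a breadth-first search over an adjacency list. The `friends` graph and the function name `within_depth` are hypothetical. A graph database runs this kind of walk natively, whereas a relational schema would need one self-join per hop.

```python
from collections import deque

# Hypothetical social graph: user -> direct friends.
friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dan"],
    "cid": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

def within_depth(graph, start, max_depth):
    """Breadth-first traversal.

    Returns {user: hops} for every user reachable from `start`
    in at most `max_depth` hops, excluding `start` itself.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, []):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance

print(within_depth(friends, "ann", 2))
```

The cost of the traversal depends on the number of nodes and edges actually visited, not on the total size of the data set.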
Graph e.g.: Neo4j
High-performance graph engine
Embedded / disk-based
Works with an OO model: nodes, relationships, properties
ACID transactions
  JTA support: participate in 2PC with your RDBMS
Developed by: Neo Technology
Written in: Java
Clients: Java, client libraries on other platforms
http://neo4j.org/
Summary
Why NoSQL / BASE
ACID has ruled almost exclusively for the last 40 years
  It doesn't compromise on consistency
The database industry neglected distributed DBs that prioritize availability
The vacuum was filled by NoSQL BASE architectures
  Strict A and P, minimize the compromise on C
NoSQL Limitations
Missing some query capabilities
  Joins / composite transactions
Eventual consistency: not for every problem
Not a drop-in replacement for an RDBMS where ACID is required
No standardization -> product lock-in
Relatively immature (support, bugs, community)
Choose the right tool for the job
Relational databases and NoSQL databases are designed to meet different needs
RDBMS-only should not be the default
NoSQL databases outperform RDBMSs in their particular niche
No one size fits all / no silver bullet
...but you don't have to choose just one
Polyglot Persistence
Poly: many; glot: language
Combining persistence mechanisms to best meet requirements
Good integration stories:
  E.g. Neo4j + JDBC using JTA