
TEC215

Technical Deep-Dive in a Column-Oriented In-Memory Database

Prof. Hasso Plattner, Stephan Müller
Research Group of Prof. Hasso Plattner, Hasso Plattner Institute for Software Engineering, University of Potsdam

Motivation

All areas have to be taken into account:

- Changed hardware
- Advances in data processing (software)
- Complex enterprise applications

Our focus is where all three meet.

Why a New Data Management?!

DBMS architecture has not changed for decades. A redesign is needed to handle the changes in:

- Hardware trends (CPU/cache/memory)
- Changed workloads
- Data characteristics/amount
- New application requirements

[Figure: traditional DBMS architecture - a query engine sitting on top of a buffer pool.]

Some academic prototypes: MonetDB, C-Store, HyPer, HYRISE. Several database vendors picked up the idea and have new databases in place (e.g., SAP, Vertica, Greenplum, Oracle).

Changes in Hardware

...give an opportunity to re-think the assumptions of yesterday because of what is possible today:

- Multi-core architecture (96 cores per server)
- One blade at ~$50,000 = 1 enterprise-class server
- Parallel scaling across blades
- A 64-bit address space; 2 TB in current servers
- 25 GB/s per core
- Cost-performance ratio rapidly declining
- Memory hierarchies

Main memory becomes cheaper and larger.

In the Meantime, Research Has Come Up With

...several advances in software for processing data:

- Column-oriented data organization (the column store)
  - Sequential scans allow the best bandwidth utilization between CPU cores and memory
  - Independence of tuples within columns allows easy partitioning and therefore parallel processing
- Lightweight compression
  - Reduces the data volume while increasing processing speed through late materialization
- And more, e.g., parallel scan/join/aggregation

Data Management for Enterprise Applications

Challenge: Diverse Applications

One data management layer (CPUs with multiple cores and caches, plus main memory) has to serve:

- Transactional data entry - sources: machines, transactional apps, user interaction, etc.
- Real-time analytics on structured data - sources: reporting, classical analytics, planning, simulation
- Event processing on stream data - sources: machines, sensors, high-volume systems
- Text analytics on unstructured data - sources: web, social, logs, support systems, etc.

OLTP vs. OLAP

Online Transaction Processing vs. Online Analytical Processing

Modern enterprise resource planning (ERP) systems are challenged by mixed workloads, including OLAP-style queries. For example:

- OLTP-style: create sales order, invoice, accounting documents; display customer master data or a sales order
- OLAP-style: dunning, available-to-promise, cross-selling, operational reporting (list open sales orders)

But: today's data management systems are optimized either for daily transactional or for analytical workloads, storing their data along rows or columns respectively.

Drawbacks of any Separation

- The OLAP subsystem does not have the latest data
- The OLAP subsystem only has a predefined subset of the data
- A cost-intensive ETL process has to sync both systems
- There is a lot of redundancy
- Different data schemas introduce complexity for applications combining both sources

Workarounds in OLTP

As in OLAP systems, OLTP systems rely on redundant data to overcome the shortcomings of today's data management:

- Materialized views
- Materialized aggregates
- Pre-computed and materialized result sets

Since the database has been the bottleneck, complex data processing is done on the application server:

- Simple SQL statements
- Nested-loop joins (SELECT ... SELECT SINGLE ... ENDSELECT)

Batch processing leads to:

- Long-running business processes
- Inflexibility (e.g., ATP rescheduling)

Enterprise Applications Have a Specific Database Footprint

Today's enterprise applications:

- Complex processes
- Increased data set (but driven by real-world events)
- Separated into transactional (OLTP) and analytical (OLAP) applications

Enterprise data management:

- Wide schemas
- Sparse data with limited domains

Workload characteristics:

- Complex analytical queries
- Set processing
- Read access
- Insert operations instead of updates

Enterprise Data Characteristics

Enterprise data is sparse data:

- Most tables are empty (~150 important tables)
- Many columns are not used even once (~50%)
- Many columns have a low cardinality of values
- NULL values/default values are dominant
- Sparse distribution facilitates high compression

[Chart: number of tables per record-count bucket (0, 1-100, 100-1000, 1000-10000, 10K-100K, 100K-1M, 1M-10M, >10M); the largest bucket by far (~46,400 tables) holds tables with 0 records.]

Enterprise Data Characteristics

Most columns have a low cardinality of distinct values, which facilitates high compression. Share of columns per number of distinct values:

Number of distinct values | Inventory Management | Financial Accounting
1 - 32                    | 78 %                 | 64 %
33 - 1023                 | 13 %                 | 24 %
1024 - 100,000,000        |  9 %                 | 12 %

Enterprise Workloads are Read-Mostly

Enterprise applications have evolved; it is not just OLAP vs. OLTP. The workload in enterprise applications consists of:

- Mainly read queries (OLTP 83 %, OLAP 94 %)
- Many queries accessing large sets of data

[Chart: workload composition. Both OLTP and OLAP enterprise workloads are dominated by reads (lookups, table scans, range selects) with a small share of writes (inserts, modifications, deletes); the TPC-C benchmark, in contrast, has a far higher write share.]

Approach

Change the overall data management system assumptions:

- In-memory only
- Start with read-optimized data structures
- Transactional features as needed
- Vertically partitioned (column store)
- CPU-cache optimized
- Only one optimization objective: main memory access

In-memory column + row store for OLTP + OLAP + text.

Rethink how enterprise application persistence is built:

- Single data management system
- No redundant data, no materialized views or cubes
- Computational application logic closer to the database (i.e., complex queries, stored procedures)


In-Memory Data Processing

Recap: Memory Hierarchy

Recap: Latency Numbers

L1 cache reference (cached data word)            0.5 ns
Branch mispredict                                  5 ns
L2 cache reference                                 7 ns
Main memory reference                            100 ns    (0.1 us)
Send 2K bytes over a 1 Gbps network           20,000 ns     (20 us)
SSD random read                              150,000 ns    (150 us)
Read 1 MB sequentially from memory           250,000 ns    (250 us)
Disk seek                                 10,000,000 ns     (10 ms)
Send packet CA -> Netherlands -> CA      150,000,000 ns    (150 ms)

In-Memory Data Processing

In a DBMS, on disk as well as in memory, data processing is often:

- Not CPU-bound, but bandwidth-bound (the I/O bottleneck): the CPU could process data faster than it arrives

Memory access:

- Not truly random (in the sense of constant latency)
- Data is read in blocks/cache lines, even if only parts of a block are requested - a potential waste of bandwidth

[Figure: ten values V1-V10 stored consecutively; a cache line spans several adjacent values (V1-V5 on cache line 1, V6-V10 on cache line 2), so requesting a single value always loads its entire cache line.]

Data Layout in Main Memory

Basics (1)

- Memory in today's computers has a linear address layout: addresses start at 0x0 and go up to 0xFFFFFFFFFFFFFFFF on a 64-bit system
- Not every system is fully 64-bit addressable; e.g., modern Intel systems use only 48 bits, which allows up to 256 TB of RAM on a single machine
- Virtual memory allocated by a program can be distributed over this space
- Each UNIX process has its own view of the address space
- Address translation is done in hardware by the CPU

Basics (2)

The memory layout is only linear; every higher-dimensional access is mapped onto this linear band.
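
To make this concrete, here is a minimal sketch (illustrative, not from the lecture) of how a two-dimensional table access is mapped onto the linear address band. Choosing which index varies fastest is exactly the row-store vs. column-store decision discussed next:

    #include <cstddef>

    // Row-major: all attributes of one row are adjacent in memory.
    std::size_t rowMajorOffset(std::size_t row, std::size_t col,
                               std::size_t numCols) {
        return row * numCols + col;
    }

    // Column-major: all values of one column are adjacent in memory.
    std::size_t columnMajorOffset(std::size_t row, std::size_t col,
                                  std::size_t numRows) {
        return col * numRows + row;
    }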

Physical Data Representation

- Row store: rows are stored consecutively; optimal for row-wise access (e.g., SELECT *)
- Column store: columns are stored consecutively; optimal for attribute-focused access (e.g., SUM, GROUP BY)
- Note: the concept is independent of the storage type, but only an in-memory implementation allows fast tuple reconstruction in the case of a column store

[Figure: the same table (Doc Num, Doc Date, Sold-To, Value, Sales Org, Status) stored row-wise (Row 1, Row 2, Row 3, Row 4 consecutively) and column-wise (one contiguous block per attribute).]

Row Data Layout

- Data is stored tuple-wise
- Leverages co-location of the attributes of a single tuple
- Low cost for tuple reconstruction, but higher cost for a sequential scan of a single attribute

Columnar Data Layout

- Data is stored attribute-wise
- Leverages sequential scan speed in main memory for predicate evaluation
- Tuple reconstruction is more expensive

Row-oriented storage

A table with attributes A, B, C and four rows is linearized row by row:

A1 B1 C1 | A2 B2 C2 | A3 B3 C3 | A4 B4 C4

Column-oriented storage

The same table linearized column by column:

A1 A2 A3 A4 | B1 B2 B3 B4 | C1 C2 C3 C4

Example: OLTP-Style Query

Accessing one complete tuple:

    struct Tuple { int a, b, c; };
    Tuple data[4];
    fill(data);
    Tuple third = data[3];

- Row-oriented storage: the attributes of the tuple are adjacent; the access touches 2 cache lines in the example
- Column-oriented storage: each attribute lives in a different column, so 3 cache lines are touched (one per attribute)

Example: OLAP-Style Query

Summing a single attribute over all tuples:

    struct Tuple { int a, b, c; };
    Tuple data[4];
    fill(data);

    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += data[i].a;

- Row-oriented storage: the a-values are interleaved with b and c; the scan touches 3 cache lines in the example
- Column-oriented storage: all a-values are adjacent; the scan touches only 1 cache line
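
As a runnable variant of the example above (a sketch; the slide's fill() stands for real initialization), the same aggregation can be written against both layouts. On large arrays the column version reads only the bytes it needs, while the row version drags b and c through the cache as well:

    #include <vector>

    struct Tuple { int a, b, c; };

    // Row layout: consecutive a-values are sizeof(Tuple) bytes apart,
    // so each 64-byte cache line carries only a few useful values.
    long sumA(const std::vector<Tuple>& rows) {
        long sum = 0;
        for (const Tuple& t : rows) sum += t.a;
        return sum;
    }

    // Column layout: all a-values are contiguous; the scan exploits
    // the full memory bandwidth.
    long sumA(const std::vector<int>& columnA) {
        long sum = 0;
        for (int a : columnA) sum += a;
        return sum;
    }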

Mixed Workloads

Mixed workloads involve attribute-focused (OLAP-style) and entity-focused (OLTP-style) queries on the same data.

Mixed Workloads: Choosing the Layout

Cache misses per layout for the examples above:

Layout | OLTP misses | OLAP misses | Mixed
Row    | 2           | 3           | 5
Column | 3           | 1           | 4

Dictionary Encoding

Motivation

Main memory access is the new bottleneck. Idea: trade CPU time to compress and decompress data.

- Compression reduces the number of I/O operations to main memory
- Fewer cache misses, because more information fits on a cache line
- Operating directly on compressed data: offsetting with bit-encoded fixed-length data types, based on a limited value domain

Dictionary Encoding Example

8 billion humans, with the attributes first name, last name, gender, country, city, and birthday: 200 bytes per tuple. Each attribute is dictionary-encoded.

Sample Data

rec ID | fname | lname   | gender | city      | country | birthday
39     | John  | Smith   | m      | Chicago   | USA     | 12.03.1964
40     | Mary  | Brown   | f      | London    | UK      | 12.05.1964
41     | Jane  | Doe     | f      | Palo Alto | USA     | 23.04.1976
42     | John  | Doe     | m      | Palo Alto | USA     | 17.06.1952
43     | Peter | Schmidt | m      | Potsdam   | GER     | 11.11.1975

Dictionary Encoding a Column

- A column is split into a dictionary and an attribute vector
- The dictionary stores all distinct values with an implicit value ID
- The attribute vector stores value IDs for all entries in the column; the position is stored implicitly
- This enables offsetting with bit-encoded fixed-length data types

Dictionary for fname        Attribute vector for fname
Value ID | Value            Position | Value ID
23       | John             39       | 23
24       | Mary             40       | 24
25       | Jane             41       | 25
26       | Peter            42       | 23
                            43       | 26
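
A minimal sketch of this split (value IDs here are dense and start at 0, unlike the 23-26 in the example; names are illustrative):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct EncodedColumn {
        std::vector<std::string> dictionary;         // value ID -> value
        std::vector<std::uint32_t> attributeVector;  // position -> value ID
    };

    EncodedColumn encode(const std::vector<std::string>& column) {
        EncodedColumn enc;
        std::unordered_map<std::string, std::uint32_t> ids;
        for (const std::string& value : column) {
            // Known values reuse their ID; new values extend the dictionary.
            auto [it, isNew] = ids.try_emplace(
                value, static_cast<std::uint32_t>(enc.dictionary.size()));
            if (isNew) enc.dictionary.push_back(value);
            enc.attributeVector.push_back(it->second);
        }
        return enc;
    }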

Querying Data Using Dictionaries

Search for an attribute value:

1. Search the dictionary for the value ID of the requested value
2. Scan the attribute vector for that value ID
3. Replace the value IDs in the result with the corresponding dictionary values

Dictionary for fname        Attribute vector for fname
Value ID | Value            Position | Value ID
23       | John             39       | 23
24       | Mary             40       | 24
25       | Jane             41       | 25
26       | Peter            42       | 23
                            43       | 26
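
The three steps translate directly into code; a sketch reusing EncodedColumn from above (a linear dictionary lookup here; a sorted dictionary would allow binary search, as the next slide shows):

    std::vector<std::size_t> scanEquals(const EncodedColumn& col,
                                        const std::string& needle) {
        std::vector<std::size_t> positions;
        // 1. Find the value ID of the requested value in the dictionary.
        std::uint32_t id = 0;
        while (id < col.dictionary.size() && col.dictionary[id] != needle) ++id;
        if (id == col.dictionary.size()) return positions;  // value not present
        // 2. Scan the fixed-width value IDs instead of the strings.
        for (std::size_t pos = 0; pos < col.attributeVector.size(); ++pos)
            if (col.attributeVector[pos] == id) positions.push_back(pos);
        // 3. Materialization would map positions back through the dictionary.
        return positions;
    }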

Sorted Dictionary

- Dictionary entries are sorted either by their numeric value or lexicographically
- Dictionary lookup complexity: O(log(n)) instead of O(n)
- Dictionary entries can be compressed to reduce the amount of required storage
- Selection criteria with ranges are less expensive (order-preserving dictionary)
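
With a sorted, order-preserving dictionary, a range predicate reduces to two binary searches; a sketch (std::lower_bound provides the O(log n) lookup):

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Returns the half-open value-ID interval [first, last) whose
    // dictionary values v satisfy lo <= v < hi; the attribute vector
    // can then be scanned with two integer comparisons per entry.
    std::pair<std::size_t, std::size_t>
    idRange(const std::vector<std::string>& sortedDict,
            const std::string& lo, const std::string& hi) {
        auto first = std::lower_bound(sortedDict.begin(), sortedDict.end(), lo);
        auto last  = std::lower_bound(sortedDict.begin(), sortedDict.end(), hi);
        return { std::size_t(first - sortedDict.begin()),
                 std::size_t(last - sortedDict.begin()) };
    }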

Compression Rate

Depends on cardinality/entropy:

- Table cardinality: number of tuples in a relation
- Column cardinality: number of distinct values in a column
- Entropy, a measure of information density: entropy = column cardinality / table cardinality

Data Size Examples

Column      | Cardinality | Bits   | Entropy        | Item size | Plain size | Size with dictionary (dictionary + column)
First names | 5 million   | 23 bit | 6.25 * 10^-4   | 49 Byte   | 365.10 GB  | 234 MB + 21.42 GB
Last names  | 8 million   | 23 bit | 1 * 10^-3      | 50 Byte   | 372.5 GB   | 381 MB + 21.42 GB
Gender      | 2           | 1 bit  | 2.5 * 10^-10   | 1 Byte    | 7.45 GB    | 2 Byte + 0.93 GB
City        | 1 million   | 20 bit | 1.25 * 10^-4   | 49 Byte   | 365.08 GB  | 46.73 MB + 18.62 GB
Country     | 200         | 8 bit  | 2.5 * 10^-8    | 49 Byte   | 365.08 GB  | 6.09 KB + 7.45 GB
Birthday    | 40,000      | 16 bit | 5 * 10^-6      | 2 Byte    | 14.90 GB   | 76.29 KB + 14.90 GB
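
As a cross-check on the first row, using only the table's own assumptions: 8 * 10^9 tuples x 49 bytes ~ 392 * 10^9 bytes ~ 365.10 GB plain. Dictionary-encoded, the dictionary costs 5 * 10^6 distinct values x 49 bytes ~ 234 MB, and the attribute vector needs ceil(log2(5 * 10^6)) = 23 bits per tuple, i.e. 8 * 10^9 x 23 bits = 23 * 10^9 bytes ~ 21.42 GB: a compression factor of roughly 17, which recurs in the scan-performance estimates below.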

In-Memory Database Operations

Scan Performance

8 billion humans with the attributes first name, last name, gender, country, city, birthday: 200 bytes per tuple.

Question: how many men/women? Assumed scan speed: 2 MB/ms/core.

Row Store: Full Table Scan

- Table size = 8 billion tuples x 200 bytes per tuple ~ 1.6 TB
- Scanning through all rows at 2 MB/ms/core takes 800 seconds with 1 core

Row Store: Stride Access on Gender

- 8 billion cache-line accesses x 64 bytes ~ 512 GB
- Reading at 2 MB/ms/core takes 256 seconds with 1 core

Column Store: Layout

- Attribute vectors: 91 GB
- Dictionaries: 700 MB
- Total: ~92 GB (compression factor: 17)

Column Store: Full Column Scan on Gender

- Size of the gender attribute vector = 8 billion tuples x 1 bit per tuple ~ 1 GB
- Scanning the attribute vector at 2 MB/ms/core takes 0.5 seconds with 1 core

Column Store: Full Column Scan on Birthday

- Size of the birthday attribute vector = 8 billion tuples x 2 bytes per tuple = 16 GB
- Scanning the column at 2 MB/ms/core takes 8 seconds with 1 core

Scan Performance Summary

How many women, how many men? Time with 1 core:

Storage                          | Time in seconds | vs. column store
Column store, full column scan   | 0.5             | -
Row store, stride access         | 256             | 512x slower
Row store, full table scan       | 800             | 1,600x slower
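
A sketch of what the winning 0.5-second scan does: with gender packed at 1 bit per person (1 = m is an assumption here), counting men is a pure sequential read plus one population count per 64-bit word:

    #include <bit>       // std::popcount, C++20
    #include <cstdint>
    #include <vector>

    // genderBits packs 64 persons per word; men are counted by summing
    // the set bits - the scan is bandwidth-bound, not CPU-bound.
    std::uint64_t countMen(const std::vector<std::uint64_t>& genderBits) {
        std::uint64_t men = 0;
        for (std::uint64_t word : genderBits)
            men += static_cast<std::uint64_t>(std::popcount(word));
        return men;
    }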

SELECT Example

    SELECT first_name, last_name
    FROM world_population
    WHERE country = 'Italy' AND gender = 'm'

id    | fname     | lname     | country     | gender
2394  | Gianluigi | Buffon    | Italy       | m
3010  | Lena      | Gercke    | Germany     | f
3040  | Mario     | Balotelli | Italy       | m
3949  | Manuel    | Neuer     | Germany     | m
4902  | Lukas     | Podolski  | Germany     | m
20102 | Klaas-Jan | Huntelaar | Netherlands | m

Query Plan

Two selection operators feed into a logical AND:

- Predicates are evaluated and generate position lists
- Intermediate position lists are logically combined
- The final position list is used for materialization

Query Execution

Dictionary for country: 0 Algeria, 1 France, 2 Germany, 3 Italy, 4 Netherlands
Dictionary for gender: 0 f, 1 m

Encoded table:

id    | fname     | lname     | country | gender
2394  | Gianluigi | Buffon    | 3       | 1
3010  | Lena      | Gercke    | 2       | 0
3040  | Mario     | Balotelli | 3       | 1
3949  | Manuel    | Neuer     | 2       | 1
4902  | Lukas     | Podolski  | 2       | 1
20102 | Klaas-Jan | Huntelaar | 4       | 1

- The scan for gender = 1 (m) yields positions {0, 2, 3, 4, 5}
- The scan for country = 3 (Italy) yields positions {0, 2}
- AND combines them to {0, 2}, which materialize to Gianluigi Buffon and Mario Balotelli
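
A sketch of the plan's building blocks on the encoded columns above (scanForId emits sorted position lists; the AND is a sorted-list intersection):

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Predicate scan: positions where the attribute vector holds valueId.
    std::vector<std::size_t> scanForId(const std::vector<std::uint32_t>& av,
                                       std::uint32_t valueId) {
        std::vector<std::size_t> positions;
        for (std::size_t pos = 0; pos < av.size(); ++pos)
            if (av[pos] == valueId) positions.push_back(pos);
        return positions;  // ascending by construction
    }

    // Logical AND of two predicates = intersection of their position lists.
    std::vector<std::size_t> positionListAnd(const std::vector<std::size_t>& a,
                                             const std::vector<std::size_t>& b) {
        std::vector<std::size_t> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(out));
        return out;
    }
    // positionListAnd(scanForId(country, 3), scanForId(gender, 1)) -> {0, 2}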

Tuple Reconstruction

Tuple Reconstruction in a Row Store

- All attributes of a record are stored consecutively: a 200-byte tuple spans 4 cache lines (256 bytes)
- Read at 2 MB/ms/core: 0.128 microseconds with 1 core

Tuple Reconstruction in a Column Store (Virtual Record IDs)

- All attributes are stored in separate columns; implicit record IDs are used to reconstruct rows
- One cache access per attribute: 6 cache lines (384 bytes) for the six attributes
- Read at 2 MB/ms/core: 0.192 microseconds with 1 core
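
A sketch of tuple reconstruction over virtual record IDs (columns are shown decoded for brevity; with dictionary encoding each access would add one dictionary lookup):

    #include <array>
    #include <string>
    #include <vector>

    struct Person {
        std::string fname, lname, gender, country, city, birthday;
    };

    // The row is reassembled by probing the same position in every
    // column - one cache access per attribute, as estimated above.
    Person reconstruct(const std::array<std::vector<std::string>, 6>& cols,
                       std::size_t recId) {
        return Person{ cols[0][recId], cols[1][recId], cols[2][recId],
                       cols[3][recId], cols[4][recId], cols[5][recId] };
    }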

Insert

Example

Table world_population:

rowID | fname   | lname    | gender | country | city      | birthday
0     | Martin  | Albrecht | m      | GER     | Berlin    | 08-05-1955
1     | Michael | Berg     | m      | GER     | Berlin    | 03-05-1970
2     | Hanna   | Schulze  | f      | GER     | Hamburg   | 04-04-1968
3     | Anton   | Meyer    | m      | AUT     | Innsbruck | 10-20-1992
4     | Sophie  | Schulze  | f      | GER     | Potsdam   | 09-03-1977
...

    INSERT INTO world_population
    VALUES ('Karen', 'Schulze', 'w', 'GER', 'Rostock', '06-20-2012')

INSERT (1): Without a New Dictionary Entry

The lname column already contains 'Schulze'. Dictionary (D) and attribute vector (AV) for lname before the insert:

D:  0 Albrecht, 1 Berg, 2 Meyer, 3 Schulze
AV: 0 1 3 2 3   (Albrecht, Berg, Schulze, Meyer, Schulze)

1. Look up 'Schulze' in D: entry found, value ID 3
2. Append the value ID to the AV: AV becomes 0 1 3 2 3 3

INSERT (2): With a New Dictionary Entry I/II

The city column does not yet contain 'Rostock'. Dictionary and attribute vector for city before the insert:

D:  0 Berlin, 1 Hamburg, 2 Innsbruck, 3 Potsdam
AV: 0 0 1 2 3   (Berlin, Berlin, Hamburg, Innsbruck, Potsdam)

1. Look up 'Rostock' in D: no entry found
2. Append the new value to D: 4 Rostock (no re-sorting necessary, since 'Rostock' sorts last)
3. Append the value ID to the AV: AV becomes 0 0 1 2 3 4

INSERT (2): With a New Dictionary Entry II/II

The fname column does not yet contain 'Karen', and 'Karen' does not sort last. Dictionary and attribute vector for fname before the insert:

D:  0 Anton, 1 Hanna, 2 Martin, 3 Michael, 4 Sophie
AV: 2 3 1 0 4   (Martin, Michael, Hanna, Anton, Sophie)

1. Look up 'Karen' in D: no entry found
2. Append the new value to D: 0 Anton, 1 Hanna, 2 Martin, 3 Michael, 4 Sophie, 5 Karen
3. Sort D: 0 Anton, 1 Hanna, 2 Karen, 3 Martin, 4 Michael, 5 Sophie
4. Change the value IDs in the AV accordingly: AV becomes 3 4 1 0 5
5. Append the new value ID to the AV: AV becomes 3 4 1 0 5 2

RESULT

Table world_population after the insert:

rowID | fname   | lname    | gender | country | city      | birthday
0     | Martin  | Albrecht | m      | GER     | Berlin    | 08-05-1955
1     | Michael | Berg     | m      | GER     | Berlin    | 03-05-1970
2     | Hanna   | Schulze  | f      | GER     | Hamburg   | 04-04-1968
3     | Anton   | Meyer    | m      | AUT     | Innsbruck | 10-20-1992
4     | Ulrike  | Schulze  | f      | GER     | Potsdam   | 09-03-1977
5     | Karen   | Schulze  | f      | GER     | Rostock   | 06-20-2012
...
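
All three insert cases collapse into one routine per column; a minimal sketch over a sorted dictionary (re-sorting is realized as a positioned insert plus value-ID remapping, matching steps 1-5 above):

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    void insertValue(std::vector<std::string>& dict,   // sorted dictionary
                     std::vector<std::uint32_t>& av,   // attribute vector
                     const std::string& value) {
        auto it = std::lower_bound(dict.begin(), dict.end(), value);
        auto id = static_cast<std::uint32_t>(it - dict.begin());
        if (it == dict.end() || *it != value) {
            dict.insert(it, value);        // new entry at its sorted position
            for (std::uint32_t& v : av)    // "change value IDs in the AV"
                if (v >= id) ++v;
        }
        av.push_back(id);                  // "append value ID to the AV"
    }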

Insert-Only

Facts about Insert-Only

Principles:

- Never delete any data; invalidate outdated tuples instead
- A logical update becomes a technical insert/append

Advantages:

- Gap-less time travel is possible
- Legal requirements, e.g., auditability, can easily be met
- Implicit logging
- Snapshot isolation and locking are simplified

Disadvantage: increased memory consumption, though the applied compression reduces the overhead.

Implementation Possibilities

1. Point representation
   - Store the complete tuple on every attribute change
   - Save the insert timestamp in a column valid_from
   - Writes are faster, reads are slower
2. Interval representation
   - Store the complete tuple on every attribute change
   - Update the replaced tuple, storing the current timestamp in valid_to; the same timestamp is stored in valid_from of the new tuple
   - Reads are faster, writes are slower
3. History reconstruction on demand
   - Update the existing tuple on changes (not insert-only any longer)
   - Store outdated values in a separate history table

Status updates can be done in-place with timestamps; timestamps are not compressed.
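
A sketch of why reads are slower under the point representation: finding "the current tuple" means searching for the maximum valid_from per id (the nested lookup the slides mention; types and names are illustrative):

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct VersionedTuple {
        int id;
        std::string city;          // stand-in for the payload attributes
        std::int64_t valid_from;   // insert timestamp
    };

    // Latest version = the tuple with the maximum valid_from for this id.
    std::optional<VersionedTuple>
    currentVersion(const std::vector<VersionedTuple>& table, int id) {
        std::optional<VersionedTuple> best;
        for (const VersionedTuple& t : table)
            if (t.id == id && (!best || t.valid_from > best->valid_from))
                best = t;
        return best;
    }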

Insert-Only 1: Point Representation

Introduce one additional column: valid_from.

id       | fname     | lname      | gender | country | city      | birthday   | valid_from
0        | Martin    | Albrecht   | m      | GER     | Berlin    | 08-05-1955 | 10-11-2011
1        | Michael   | Berg       | m      | GER     | Berlin    | 03-05-1970 | 10-11-2011
2        | Hanna     | Schulze    | f      | GER     | Hamburg   | 04-04-1968 | 10-11-2011
3        | Anton     | Meyer      | m      | AUT     | Innsbruck | 10-20-1992 | 10-11-2011
4        | Ulrike    | Schulze    | f      | GER     | Potsdam   | 09-03-1977 | 10-11-2011
5        | Sophie    | Schulze    | f      | GER     | Rostock   | 06-20-2012 | 10-11-2011
1        | Michael   | Berg       | m      | GER     | Potsdam   | 03-05-1970 | 07-02-2012
...      | ...       | ...        | ...    | ...     | ...       | ...        | ...
8 * 10^9 | Zacharias | Perdopolus | m      | GRE     | Athen     | 03-12-1979 | 10-11-2011

- The primary key is composed of id and valid_from
- Insert is easy: valid_from = current timestamp
- Select, group by, join: require a nested join to determine the current valid_from timestamp for each object

Insert-Only 2: Interval Representation

Introduce two additional columns: valid_from and valid_to.

id       | fname     | lname      | gender | country | city      | birthday   | valid_from | valid_to
0        | Martin    | Albrecht   | m      | GER     | Berlin    | 08-05-1955 | 10-11-2011 |
1        | Michael   | Berg       | m      | GER     | Berlin    | 03-05-1970 | 10-11-2011 | 07-02-2012
2        | Hanna     | Schulze    | f      | GER     | Hamburg   | 04-04-1968 | 10-11-2011 |
3        | Anton     | Meyer      | m      | AUT     | Innsbruck | 10-20-1992 | 10-11-2011 |
4        | Ulrike    | Schulze    | f      | GER     | Potsdam   | 09-03-1977 | 10-11-2011 |
5        | Sophie    | Schulze    | f      | GER     | Rostock   | 06-20-2012 | 10-11-2011 |
1        | Michael   | Berg       | m      | GER     | Potsdam   | 03-05-1970 | 07-02-2012 |
...      | ...       | ...        | ...    | ...     | ...       | ...        | ...        | ...
8 * 10^9 | Zacharias | Perdopolus | m      | GRE     | Athen     | 03-12-1979 | 10-11-2011 |

- The primary key is composed of id and valid_from
- Insert requires an update of the formerly current tuple
- Select, group by, join are easy: the WHERE clause eliminates tuples out of range
- Finding up-to-date entries can be supported by an additional bit vector on the column valid_to

Snapshot Isolation

- Snapshot isolation guarantees consistent reads: during a transaction, all reads retrieve the values that were active at the moment the transaction started
- Conflicts like lost updates may happen in theory, but are prevented through pre-write checks in the database or application-level locks

[Figure: two concurrent transactions on a timeline; when transaction 2 wants to insert after transaction 1 has already inserted, an alert is raised since the values transaction 2 read are no longer valid.]

Status Updates

When status fields are changed by replacement, do we need to insert a new version of the tuple?

- Insert-only would lead to overhead (e.g., clearing in FI)
- Most status fields are binary
- Therefore: uncompressed in-place updates with a row timestamp

[Figure: an invoice status flips from Unpaid (t = NULL) to Paid (t = 2009/06/30); the timestamp doubles as the status flag.]

Handling Data Modifications

Motivation

Inserting new tuples directly into a compressed structure can be expensive:

- New values can require reorganizing the dictionary
- The number of bits required to encode all dictionary values can change, in which case the whole attribute vector has to be reorganized

Deletion of tuples is expensive as well: all attribute vectors have to be reorganized, and the value IDs of all following tuples have to be moved.

Differential Buffer

- New values are written to a dedicated differential buffer (the delta)
- A cache-sensitive B+ tree (CSB+) is used for faster search on the delta
- Reads go against both structures; writes go only to the delta

[Figure: the main store (compressed attribute vector plus sorted dictionary, 8 billion entries) next to the differential buffer (uncompressed attribute vector plus unsorted dictionary indexed by a CSB+ tree, up to 50,000 entries).]

- Inserts of new values are faster, because neither dictionary nor attribute vector needs to be re-sorted
- Range selects on the differential buffer are expensive, because its dictionary is unsorted
- The differential buffer requires more memory: no attribute vector compression, plus an additional CSB+ tree for the dictionary

Tuple Lifetime

Michael moves from Berlin to Potsdam:

    UPDATE world_population
    SET city = 'Potsdam'
    WHERE fname = 'Michael' AND lname = 'Berg'

Main store (table world_population):

recId    | fname     | lname      | gender | country | city      | birthday
0        | Martin    | Albrecht   | m      | GER     | Berlin    | 08-05-1955
1        | Michael   | Berg       | m      | GER     | Berlin    | 03-05-1970
2        | Hanna     | Schulze    | f      | GER     | Hamburg   | 04-04-1968
3        | Anton     | Meyer      | m      | AUT     | Innsbruck | 10-20-1992
4        | Ulrike    | Schulze    | f      | GER     | Potsdam   | 09-03-1977
5        | Sophie    | Schulze    | f      | GER     | Rostock   | 06-20-2012
...      | ...       | ...        | ...    | ...     | ...       | ...
8 * 10^9 | Zacharias | Perdopolus | m      | GRE     | Athen     | 03-12-1979

The update appends the new version of the tuple to the differential buffer:

recId | fname   | lname | gender | country | city    | birthday
0     | Michael | Berg  | m      | GER     | Potsdam | 03-05-1970

Problem: the tuple is now present in both the main store and the differential buffer. Tuples of a table are therefore marked by a validity vector, which reduces the required amount of reorganization:

- Like an attribute vector, but for validity; 1 bit required per database tuple
- Invalidated tuples stay in the database table until the next reorganization takes place
- Search results are filtered using the validity vector

With the validity vector, the old version of Michael's tuple in the main store (recId 1) is marked invalid (valid = 0), while all other main-store tuples and the new tuple in the differential buffer remain valid (valid = 1).
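
Putting differential buffer and validity vector together, a point-query sketch (std::map stands in for the CSB+ tree, an assumption made for brevity; delta positions are numbered after the main store):

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct MainColumn {                              // compressed, read-optimized
        std::vector<std::string> dict;               // sorted dictionary
        std::vector<std::uint32_t> av;               // attribute vector
    };
    struct DeltaColumn {                             // write-optimized
        std::map<std::string, std::uint32_t> dict;   // stand-in for the CSB+ tree
        std::vector<std::uint32_t> av;
    };

    std::vector<std::size_t> scan(const MainColumn& main, const DeltaColumn& delta,
                                  const std::vector<bool>& valid,  // main-store bits
                                  const std::string& needle) {
        std::vector<std::size_t> hits;
        auto it = std::lower_bound(main.dict.begin(), main.dict.end(), needle);
        if (it != main.dict.end() && *it == needle) {
            auto id = static_cast<std::uint32_t>(it - main.dict.begin());
            for (std::size_t pos = 0; pos < main.av.size(); ++pos)
                if (main.av[pos] == id && valid[pos])   // skip invalidated tuples
                    hits.push_back(pos);
        }
        if (auto d = delta.dict.find(needle); d != delta.dict.end())
            for (std::size_t pos = 0; pos < delta.av.size(); ++pos)
                if (delta.av[pos] == d->second)
                    hits.push_back(main.av.size() + pos);
        return hits;
    }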

Stored Procedures

Facts about Stored Procedures

- Basically a procedural program stored in the database
- Written in a (vendor-)specific language (e.g., PL/SQL, Java, ...)
- Usually supports constructs like loops and conditions
- Takes parameters as input and returns a result set
- Main usage: set operations that cannot be expressed in SQL (or are very hard to express)
- Additional usages: access control, data validation, data conversion

Advantages

- Performance: no data transfer between database and application server
- Code reduction: stored procedures written in a generic way can be shared; less code in the application layer
- Improved security: no SQL injection possible inside stored procedures

Implications on Application Development

How does it all come together?

1. Mixed workload combining OLTP and analytic-style queries
   - Column stores are best suited for analytic-style queries
   - An in-memory database enables fast tuple reconstruction
   - An in-memory column store allows aggregation on the fly
2. Sparse enterprise data
   - Lightweight compression schemes are optimal
   - Speeds up query execution
   - Improves the feasibility of an in-memory database
3. Mostly-read workload
   - Read-optimized stores provide the best throughput, i.e., a compressed in-memory column store
   - A write-optimized store as a delta partition is sufficient to handle data changes

An In-Memory Database for Enterprise Applications

In-memory database (IMDB):

- Data resides permanently in main memory; main memory is the primary persistence
- Still: logging to disk and recovery from disk
- Main memory access is the new bottleneck
- Cache-conscious algorithms and data structures are crucial (locality is king)

[Figure: architecture overview - interface services and session management; query execution, metadata, and transaction manager; a distribution layer per blade; per blade, active data in a main store and a differential store (with combined columns), inverted indexes, and an object data guide, connected by a merge process; data aging moves passive data (history) to non-volatile memory; time travel, logging to a log volume, recovery, and snapshots complete the picture.]

Simplified Application Development

Compared to the traditional stack (application cache, database cache, prebuilt aggregates on top of the raw data), a column-oriented in-memory database needs:

- Fewer caches
- No redundant data (OLAP/OLTP, LiveCache)
- No maintenance of materialized views or aggregates
- Minimal index maintenance

Enterprise Application PoCs

SAP ERP Financials on In-Memory Technology

An in-memory column database for an ERP system with a combined workload (parallel OLTP/OLAP queries). Leverage in-memory capabilities to:

- Reduce the amount of data
- Aggregate on the fly
- Run analytic-style queries (to replace materialized views)
- Execute stored procedures

Use case: the SAP ERP Financials solution

- Post and change documents
- Display open items
- Run the dunning job
- Analytical queries, such as the balance sheet

Current Financials Solutions

[Figure: today's Financials architecture, with base tables plus a hierarchy of materialized aggregates, sum tables, and secondary indexes.]

The Target Financials Solution

Only base tables, algorithms, and some indexes.

Feasibility of Financials on In-Memory Technology in 2009

Modifications to SAP Financials:

- Removed secondary indexes, sum tables, and pre-calculated/materialized tables
- Reduced code complexity and simplified locks
- Insert-only to enable history (replacing change documents)
- Added stored procedures with business functionality

Test data: the European division of a retailer

- ERP 2005 ECC 6.0 EhP3, 5.5 TB system database size
- Financials: 23 million headers / 8 GB in main memory; 252 million items / 50 GB in main memory (including inverted indexes for join attributes and the insert-only extension)

In-Memory Financials on SAP ERP

Tables in the classic system:

- Accounting documents: BKPF, BSEG
- Dunning data: MHNK, MHND
- Sum tables: GLT0, LFC1, KNC1
- Secondary indexes: BSAD, BSAK, BSAS, BSID, BSIK, BSIS
- Change documents: CDHDR, CDPOS

With in-memory Financials, only the accounting documents BKPF and BSEG remain.

Reduction by a Factor 10

                  | Traditional DBMS | IMDB
BKPF              | 8.7 GB           | 1.5 GB
BSEG              | 255 GB           | 50 GB
Secondary indexes | 255 GB           | -
Sum tables        | 0.55 GB          | -
Complete          | 519.25 GB        | 51.5 GB

Booking an Accounting Document

- Insert into BKPF and BSEG only
- The lack of updates reduces locks

Dunning Run

The dunning run determines all open and due invoices (customer-defined queries on 250 million records):

- Current system: 20 min
- New logic: 1.5 sec, thanks to the in-memory column store, parallelized stored procedures, and the simplified Financials schema

Why?

Being able to perform the dunning run in such a short time lowers TCO: add more functionality, run other jobs in the meantime. In a multi-tenant cloud setup, hardware must be used wisely.

Bring Application Logic Closer to the Storage Layer

The classic dunning implementation issues a flood of queries from the application server:

- Select the accounts to be dunned (1 SELECT); for each:
  - Select the open account items from BSID (10,000 SELECTs); for each:
    - Calculate the due date
    - Select the dunning procedure, level, and area (10,000 SELECTs)
    - Create MHNK entries
- Create and write the dunning item tables (31,000 entries)

Instead, the whole logic becomes one single stored procedure executed within the database, and the dunning item tables are calculated on the fly.

Factor: 800x Acceleration

Quantity: 250 million items, 380k open, 200k due. Hardware: 4 CPUs x 6 cores, 256 GB RAM.

Operations measured: select open items (in the later variants including the T047 & KNB5 joins, e.g. 0.63 s, 1.01 s, and 0.6 s across the variants); determine due date and dunning level; filter 1 (verify dunning levels); filter 2 (check last dunning); generate MHNK (aggregate); generate MHND (execute filters). Across the variants, more and more steps are folded into the initial select or deferred to the aggregation, and the filters are executed in parallel.

Total runtimes:

Original version                            ~20 minutes
Variant 1                                   ~1 minute
Variant 2 (#3, #4 executed in parallel)     ~3.0 s
Variant 3 (#2, #3, #4 executed in parallel) ~1.5 s

Dunning Application

[Screenshots: the dunning application prototype.]

Available-to-Promise Check

Can I get enough quantities of a requested product on a desired delivery date?

Goal: analyze and validate the potential of in-memory, highly parallel data processing for available-to-promise (ATP) checks.

Challenges:

- Dynamic aggregation
- Instant rescheduling in minutes vs. nightly batch runs
- Real-time and historical analytics

Outcome:

- Real-time ATP checks without materialized views
- Ad-hoc rescheduling
- No materialized aggregates

In-Memory Available-to-Promise

[Screenshot: the in-memory ATP prototype.]

Demand Planning

Flexible analysis of demand planning data:

- Zooming to choose the granularity
- Filtering by certain products or customers
- Browsing through time spans
- Combination of location-based geo data with planning data in an in-memory database
- External factors, such as temperature or the level of cloudiness, can be overlaid to incorporate them into planning decisions

GORFID

HANA for streaming data processing. Use case: in-memory RFID data management, evaluating SAP OER. Prototypical implementation of:

- An RFID read-event repository on HANA
- A discovery service on HANA (10 billion data records with ~3 seconds response time)
- Frontends for iPhone and iPad 2

Key findings:

- HANA is suited for streaming data (using bulk inserts)
- Analytics on streaming data is now possible

GORFID: Near Real-Time as a Concept

Bulk load every 2-3 seconds: more than 50,000 inserts/s.

Thanks!
Questions?

Online lecture (starting Sept 3rd): http://openhpi.de

Stephan Müller
Hasso Plattner Institute
stephan.mueller@hpi.uni-potsdam.de
