
Schema Design for a Data Warehouse

1. What is a Data Warehouse


- A data warehouse typically runs on large numbers of (low-end) processors working in parallel: in a data center, a large number of low-end servers rather than a smaller set of high-end servers.
- A database is a collection of related data; a data warehouse is likewise a collection of information, together with a supporting system.
- It holds historical information for comparative and competitive analysis.
- It offers enhanced data quality and completeness.
1.1 What is Data Mining
- The operational data is analyzed using statistical techniques (e.g. R) and clustering
techniques (e.g. Mahout) to find hidden patterns and trends.
- Data mining performs some kind of summarization of the data, which a data
warehouse can use for faster analytical processing for business intelligence.

2. Choosing the Best Technology


- Parallel database systems, e.g. MySQL Cluster (http://en.wikipedia.org/wiki/MySQL_Cluster)
- MapReduce, e.g. Hadoop

3. Parallel database systems approach


- Almost any parallel processing task can be written as a set of
database queries (possibly using user-defined functions and aggregates to
filter and combine data).

MapReduce approach
- The same tasks are written as a set of MapReduce jobs.
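As a sketch, a filtering-and-aggregation task that would need a hand-written Map and Reduce program can be expressed as a single declarative query. The table and column names here are hypothetical, not from these notes:

```sql
-- Hypothetical example: total sales per region for 2013.
-- In MapReduce terms, the WHERE clause is the Map-side filter
-- and the GROUP BY / SUM is the Reduce-side aggregation.
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE sale_year = 2013
GROUP BY region;
```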

4. Some Real-World Examples: RDBMS


- eBay's Teradata configuration uses just 72 nodes (two quad-core CPUs, 32 GB
RAM, 104 300 GB disks per node) to manage approximately 2.4 PB of relational data.
- Fox Interactive Media's warehouse is implemented on a 40-node Greenplum
DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500 GB
disks, and 16 GB RAM (1 PB total disk space).

Some Real-World Examples: Hadoop


- Facebook: over 16,000 servers, approximately 4,000 per data center, commodity hardware.
- Amazon: over 46,000 servers, commodity hardware.
DBMS => small number of high-end, specialist machines
Hadoop => large number of commodity machines

4. Schemas
- The two classes of systems make different choices in several key areas. For
example, all DBMSs require that data conform to a well-defined schema,
whereas MapReduce permits data to be in any arbitrary format. Other differences
include how each system provides indexing and compression optimizations, its
programming model, and the way in which data is distributed.

- Star schema topology


- A single fact table points to different dimension tables. Many data
warehouses have one fact table and multiple dimensions.
- A fact table contains measures: data you can weigh or calculate on.
- Dimensions are more subject-based.
- A factless fact table is a fact table without measures, e.g. one
recording when severe weather alerts are in effect.
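A minimal star schema for the severe-weather example above might look like this in HiveQL (table and column names are illustrative, not from the original notes):

```sql
-- Dimension tables: subject-based descriptive attributes.
CREATE TABLE dim_date (date_key INT, calendar_date STRING, day_of_week STRING);
CREATE TABLE dim_location (location_key INT, city STRING, state STRING);

-- Factless fact table: no measures, just keys recording that a
-- severe weather alert was in effect for a date and location.
CREATE TABLE fact_weather_alert (
  date_key     INT,
  location_key INT
);
```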

5. Schema Support Mapreduce vs DBMS


- Parallel DBMSs require data to fit into the relational paradigm
- MapReduce does not require that data files adhere to a schema
- A MapReduce programmer must often write a custom parser; this is at least an
equivalent amount of work, and there are also other potential problems with not
using a schema for large data sets.
- The structure of MapReduce input files must be built into the Map and Reduce programs.
- If a MapReduce data set is shared, a second programmer must decipher the code
written by the first programmer to work out how to process the input file.
- DBMSs separate the schema from the application and store it in a set of system
catalogs that can be queried.
- DBMSs ensure the integrity of the data is enforced without additional work
on the programmer's behalf.
- MapReduce is quite flexible. However, if sharing is needed, it is advantageous
to have a data description, e.g. star schema definitions with integrity constraints.
- Hive can provide this, and it is a major point driving the adoption of
Hive in MapReduce shops.
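In Hive, the table definition lives in the metastore rather than inside each program, so a second programmer can query the schema instead of deciphering the first programmer's parser code. A minimal sketch (the table name and columns are hypothetical):

```sql
-- The schema is recorded once, in Hive's metastore.
CREATE TABLE tbl_employee (
  employee_name STRING,
  salary        DOUBLE
);

-- Any user can recover the schema from the metastore,
-- without reading the code that produced the data.
DESCRIBE tbl_employee;
```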

6. Indexing
- All modern DBMSs use hash or B-tree indexes to accelerate access to data;
using a proper index reduces the scope of the search dramatically.
- MapReduce frameworks do not provide built-in indexes; the programmer must
implement any indexes needed.
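In a DBMS the index is declared once and the query optimizer uses it automatically. A sketch in standard (DBMS-side) SQL, reusing the hypothetical tbl_employee table from the query examples later in these notes:

```sql
-- A B-tree index on salary; a query such as
--   SELECT employee_name FROM tbl_employee WHERE salary > 100;
-- can then seek into the index instead of scanning the whole table.
CREATE INDEX idx_salary ON tbl_employee (salary);
```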
7. Data Distribution

- Parallel DBMSs use knowledge of data distribution and location to their advantage: a
parallel query optimizer strives to balance computational workloads while minimizing
the amount of data transmitted over the network connecting the nodes of the cluster.
- An MR programmer must perform these tasks manually.

Hive
Hive provides data warehousing facilities on top of an existing Hadoop cluster.

- Hive was originally developed at Facebook.


- gives Hadoop SQL-like capabilities and database-like functionality.
- Hive is not a full data warehouse. You can create schemas and design database
tables with Hive, but certain limitations exist, e.g. no indexing.
- Hive is best suited for data warehouse applications where:
- relatively static data is analyzed,
- fast response times are not required, and
- the data is not changing rapidly.
- Hive queries are executed as MapReduce jobs; the Hive compiler generates
MapReduce jobs for most queries.
- Data can be accessed via a simple query language, called HiveQL, which is similar
to SQL
- Hive supports primitive data types such as TIMESTAMP, STRING, FLOAT,
BOOLEAN, DECIMAL, BINARY, DOUBLE, INT, TINYINT, SMALLINT and BIGINT.
In addition, primitive data types can be combined to form complex data types, such
as structs, maps and arrays.
- Familiar JDBC and ODBC drivers allow many applications to pull Hive data for
seamless reporting. Hive allows users to read data in arbitrary formats, using SerDes
and Input/Output formats.
- Hive does not provide record-level update, insert, or delete.
- Hive does not provide transactions either.
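A sketch of how the primitive types above combine into complex types in a Hive table definition (the table and field names are illustrative):

```sql
CREATE TABLE employee_profile (
  name     STRING,
  salaries ARRAY<DOUBLE>,                       -- list of primitives
  skills   MAP<STRING, INT>,                    -- skill name -> years
  address  STRUCT<street:STRING, city:STRING>   -- named fields
);
```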

There are two types of tables (schemas) in Hive:


1. Managed (internal) tables
2. External tables
When you drop a managed (internal) table, Hive drops both the data and the metadata.
When you drop an external table, Hive drops only the metadata; the underlying data remains.
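The difference can be sketched in HiveQL (the table names and HDFS path are illustrative):

```sql
-- Managed (internal) table: Hive owns the data.
-- DROP TABLE removes both the metadata and the underlying files.
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only records metadata over existing files.
-- DROP TABLE removes the metadata; the files in HDFS remain.
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/raw/logs';
```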
Hive Query Language as it applies to schemas

SELECT employee_name FROM tbl_employee WHERE salary > 100;

With the ALTER command, we can change a schema in Hive:

ALTER TABLE hive_table_name RENAME TO new_name;

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

ALTER TABLE employee CHANGE id id BIGINT;
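ALTER can also extend an existing schema; a small sketch (the added column is hypothetical):

```sql
-- Append a new column to the end of the employee schema.
ALTER TABLE employee ADD COLUMNS (department STRING);
```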

HIVE AGGREGATION

SORT BY sorts the data within each reducer. You can use any number of reducers for a SORT
BY operation.

ORDER BY sorts all of the data together, which has to pass through one reducer. Thus,
ORDER BY in Hive uses a single reducer.

ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the
rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered
final results.
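The difference can be sketched with two queries over the tbl_employee table used earlier (DISTRIBUTE BY is included so rows with the same key land on the same reducer; this pattern is illustrative, not from the original notes):

```sql
-- Total order: everything passes through a single reducer.
SELECT employee_name, salary FROM tbl_employee ORDER BY salary;

-- Per-reducer order: each reducer sorts only its own partition,
-- so the concatenated output is only partially ordered.
SELECT employee_name, salary FROM tbl_employee
DISTRIBUTE BY employee_name
SORT BY salary;
```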
