
Deep partitioning in Hive (Why and how?)

There are many situations where we need to update a record in Hive and the only option
available is to overwrite the complete table or a partition.

Let's consider a use case: we are importing many tables from an OLTP system into a Hadoop cluster with Sqoop. Later we want to keep a ledger table in HDFS that records the source OLTP row count and the imported table's row count, so we can reconcile source and target.
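For context, each such import might look like the Sqoop command sketched below; the JDBC URL, credentials, and target directory are illustrative assumptions, not details from this setup:

# Illustrative only: host, database, user, and target directory are assumptions
sqoop import \
  --connect jdbc:mysql://oltp-host:3306/sales \
  --username etl_user -P \
  --table bookings \
  --target-dir /data/sales/bookings/${fromdate} \
  --num-mappers 4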

Table design:

date       | table_group | table_name | system | count | timestamp
2016-09-01 | sales       | bookings   | OLTP   | 29998 | 2016-08-31 03:14:07.99
2016-09-01 | sales       | bookings   | Hive   | 29998 | 2016-08-31 03:14:15.99
2016-09-02 | sales       | forecasts  | OLTP   | 15888 | 2016-09-01 03:14:07.99
2016-09-02 | sales       | forecasts  | Hive   | 15887 | 2016-09-01 03:14:15.99

These tables are loaded into HDFS by Sqoop at different times in the data pipeline. If we partitioned the ledger table by date alone, we would have to collect the row counts of all the tables as a batch process after the daily load completes, leaving a delay between the actual load and the ledger update.

What happens if this table is partitioned by date, table_group, table_name, and system? Each granular partition then holds exactly one row, which means we can mimic an update in Hive by dropping a one-row file with the count into the right partition as and when each table finishes loading.

Implementation:

CREATE EXTERNAL TABLE `count_ledger` (
  `count` string,
  `ts` string)
PARTITIONED BY (
  `date` string,
  `table_group` string,
  `table_name` string,
  `system` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION '/data/count_ledger';
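Each count file is a single CSV line whose fields match the two non-partition columns (count, ts). A minimal sketch of producing one, with an illustrative file name and values:

# One row: count and timestamp, comma-separated to match
# the table's FIELDS TERMINATED BY ','
echo "29998,2016-08-31 03:14:15.99" > /home/Hive_booking_count.dat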
HDFS put to the external directory for the "OLTP count" and the "Hive count" (note that each directory level is named partition_column=value, matching the DDL):

hadoop fs -put /home/OLTP_booking_count.dat \
  /data/count_ledger/date=${fromdate}/table_group=sales/table_name=bookings/system=OLTP/

hadoop fs -put /home/Hive_booking_count.dat \
  /data/count_ledger/date=${fromdate}/table_group=sales/table_name=bookings/system=Hive/

MSCK repair for updating partitions automatically:

MSCK REPAIR TABLE count_ledger;

Note: We created the HDFS directory structure to mirror the partition columns (one name=value directory per partition column, in order), which is essential for MSCK to discover the partitions. The big benefit is that we don't have to issue ALTER TABLE commands to add partitions.
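For comparison, without MSCK each new partition would have to be registered explicitly with a statement like the sketch below (shown only to illustrate what MSCK saves us from scripting per table, per system, per day):

ALTER TABLE count_ledger ADD IF NOT EXISTS
PARTITION (`date`='2016-09-01', table_group='sales',
           table_name='bookings', `system`='OLTP');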

HDFS directory:

/data/count_ledger/date=2016-09-01/table_group=sales/table_name=bookings/system=Hive/

resembles the Hive table partitioning:

PARTITIONED BY (
  `date` string,
  `table_group` string,
  `table_name` string,
  `system` string)

Just run MSCK at any time to keep the table's partitions in sync with the HDFS directory structure.

Mimicking update: In this design we can update the count of any table at any time by overwriting its one-row file in HDFS, and we can insert a new record with a simple HDFS put; in both cases the data is immediately available for querying.
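As a sketch of such an in-place update, the -f flag of hadoop fs -put overwrites the existing file for a partition (file name and date are illustrative):

# -f replaces the existing one-row file, effectively updating the record
hadoop fs -put -f /home/Hive_booking_count.dat \
  /data/count_ledger/date=2016-09-01/table_group=sales/table_name=bookings/system=Hive/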

Performance: Query performance improves because most of the filtering is resolved from the metastore itself (partition pruning); Hive only has to read the contents of a single one-record file.

SELECT `count`
FROM count_ledger
WHERE `date` = '2016-09-01' AND table_group = 'sales'
  AND table_name = 'bookings' AND `system` = 'Hive';

With the sample data above, this returns 29998.

Cons: This design creates numerous partitions, each backed by a tiny file, which adds to NameNode memory pressure (the NameNode tracks every file and block in memory). For example, 100 tables × 2 systems × 365 days is already about 73,000 partitions a year. So this idea is best reserved for status tables or smaller aggregate tables (like a country-level summary).
