HiveQL: Partitioned Tables

I work on a project where we have a really huge dataset and we use a Hadoop cluster. I often need to select data from a specific table where one of columns is date. If table is really huge it takes a lot of time to make queries like this: “select * from table_name where day=’2015-05-01′”.

It’s better to use partitioned tables in this case (HiveQL querie example):

CREATE TABLE schema.table_A (field1 int, field2 string, field3 float)
PARTITIONED BY (day) stored as orc

Selecting data from this type of tables is much faster, because each day is actually stored in a separate file on a disk.

If we want to insert data into this table with overwriting we should use a query like this:

INSERT OVERWRITE TABLE schema.table_A PARTITION (day = '2015-05-01')
select * from table_B