I work on a project where we have a really huge dataset and we use a Hadoop cluster. I often need to select data from a specific table where one of columns is date. If table is really huge it takes a lot of time to make queries like this: “select * from table_name where day=’2015-05-01′”.
It’s better to use partitioned tables in this case (HiveQL querie example):
CREATE TABLE schema.table_A (field1 int, field2 string, field3 float) PARTITIONED BY (day) stored as orc
Selecting data from this type of tables is much faster, because each day is actually stored in a separate file on a disk.
If we want to insert data into this table with overwriting we should use a query like this:
INSERT OVERWRITE TABLE schema.table_A PARTITION (day = '2015-05-01') select * from table_B