Partitioning: An Anecdote for performance
Partitioning are of two types in hive,
1. Dynamic and Static partitioning
When any one of them is used, queries are run against only a portion of the data, providing significant performance gains
Static Partitioning—Used when the values for partition columns are known well in advance of loading the data into a Hive table
Dynamic Partitioning—Used when the values for partition columns are known only during loading of the data into a Hive table
When to choose Dynamic Partitioning?
Deciding when to use dynamic partitioning can be very challenging. We recommend using dynamic partitioning in these common scenarios:
Loading from an existing table that is not partitioned: In this scenario, the user doesn’t employ partitioning initially because the table in question is expected to remain relatively small. However, over the course of time, the table grows quite large and performance issues begin to appear. These issues should be corrected using a one-time load to dynamically partition the table.
Unknown values for partition columns: In some scenarios, it’s very difficult to know the unique values of all partition columns unless the data is manually inspected. As you can imagine, manual inspection isn’t a realistic option when dealing with batch pipelines or terabytes of data. You can try writing a Hive query to retrieve all unique value sets of partition columns. Let’s say, the query result contains many unique value sets. You’ll end up creating and executing an “ALTER TABLE” statement for each unique value set. Running a Hive query, preparing alter table statements, and executing them will significantly delay your data pipelines. Also, it’s an error-prone and cumbersome process. Hence, static partitioning is not used under these circumstances. However, dynamic partitioning can come to your rescue if you’d like to offload this work to Hive. Hive can detect the unique value sets for partition columns and create partitions on-the-fly.
Let’s revisit the scenario we discussed earlier in this blog. Imagine you have a very big table with data accumulated over many years. You want to improve the performance by partitioning the data. To complicate the scenario further, imagine you decide to not only partition the data by year, but also by month, day, hour, and advertiser id (let’s assume your data has an advertiser_id column). In this particular case, dynamic partitioning is very helpful. Why? You may be running a pipeline that ingests data hourly. You know exactly which year, month, day, and hour the data belongs to, but you can’t assume or tell which advertiser_id each record contains unless you manually inspect the data. Hive can automatically partition the data on all the required columns if dynamic partitioning is used.
Modifying the number of partition columns: In this scenario, the user initially designs a table with limited partition columns. As the data grows, the user decides to address performance concerns by adding additional partition column[s]. This is done in Hive by: 1. creating a new table with all required partition columns, 2. loading data into the new table from the already existing partitioned table, and 3. deleting the existing table. Dynamic partitioning should be used to perform step (2), as explained earlier.
Avoiding Dynamic Partitioning Pitfalls
So far, we’ve covered the advantages of dynamic partitioning. Now, let’s look at some common mistakes and pitfalls that users often encounter. The most common mistake made when using dynamic—or even static—partitions is to specify an existing column name as a partition column. In Hive, the partition columns are virtual columns, which helps with organizing and fetching the data efficiently. If you see the following error while creating a table with partitions, it means the column specified in a partition specification is also a regular column present in the table schema:
"Error in semantic analysis: Column repeated in partitioning columns"
Some other important points to remember when using dynamic partitions are:
Typically, data is loaded into a partitioned table in Hive by “INSERT OVERWRITE TABLE <table_name> PARTITION (col1=value1, col2=value2…) SELECT <columns> FROM <table_name>…” query. As mentioned above, when using dynamic partitioning, values for partition columns must not be specified. Hive will detect the values automatically from the data.
In order to detect the values for partition columns automatically, partition columns must be specified at the end of the “SELECT” clause in the same order as they are specified in the “PARTITIONED BY” clause while creating the table.
If the table has only dynamic partition columns, then the configuration setting hive.exec.dynamic.partition.mode should be set to non-strict mode:
SET hive.exec.dynamic.partition.mode=non-strict;
Hive enforces a limit on the number of dynamic partitions it can create. The default is 100 dynamic partitions per node, with a total (default) limit of 1000 dynamic partitions across all nodes. However, this setting is configurable. If your job tries to create more partitions than allowed, you may see the following exception:
metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
The setting hive.exec.max.dynamic.partitions,controls the total number of dynamic partitions for a table. Whereas, hive.exec.max.dynamic.partitions.pernode controls the maximum number of dynamic partitions that can be created on a node. So, to fix the above exception, try gradually increasing the number of partitions allowed using the above settings:
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
ET hive.exec.max.dynamic.partitions.pernode=256;
Now, let’s look at a complete example demonstrating how to create and load data using dynamic partitioning.
Change dynamic partition mode to non-strict:
SET hive.exec.dynamic.partition.mode=non-strict
Create the table with partitions:
CREATE TABLE patents (
citing_patent INT,
cited_patent INT,
assignee STRING,
companyname STRING,
publication_date STRING)
PARTITIONED BY (
year INT,
month INT,
day INT)
Load the data:
INSERT OVERWRITE INTO TABLE patents PARTITION (year, month, day)
SELECT citing, cited, name, company, year(publication_date), month(publication_date), day(publication_date)
FROM patents_raw_data;
The table, “patents_raw_data”, is an external table, which points to patent raw data. Notice the order of the partition columns specified in the “SELECT” clause is in exactly the same order as the partition columns specified in the “PARTITIONED BY” clause in create table query. Also, the columns year, month, and day are purposefully specified at the very end in the “SELECT” clause. Hive splits the data into multiple partitions by year, month, and day values. It also updates the Hive metastore automatically without explicit user intervention.
Next, let’s look at what can happen with dynamic partitioning when the data is skewed. The data that belongs to one partition is sent to only one reducer. However, consider a scenario where 90 percent of the data belongs to only one partition and the rest is spread across multiple partitions. In this situation, one reducer will be heavily loaded, while all other reducers have finished their work. The time required to finish the job will depend solely on the longest running reducer. This will significantly increase the data load time.
To overcome this problem, we suggest our customers perform some queries on the data to check how evenly it’s distributed. The key idea is to distribute the data evenly across the reducers. To achieve even distribution, the table can be further divided by buckets or new partition columns. The solution will vary depending on the use case and the nature of the data. In one recent example, we worked with one of our customers to address their performance issues. After gaining a deeper understanding of their use case and data, we determined that the data was skewed and suggested bucketing the data by another column to evenly spread the data load across the reducers. This resolved the problem and significantly reduced their data load times.
There aren’t many significant downsides to using dynamic partitioning. The only word of caution here is, if you don’t know your data well, you could unintentionally implement bad partitioning strategies when using dynamic partitioning. Static partitioning is much more controlled and you have to know your data to define the partitions. With dynamic partitioning, a user might mistakenly assume that there will only be 10 partition values flowing in as part of a dataset, when a partition column actually has thousands of different values in the incoming data. This could create thousands of tiny partitions and have unintentional side-effects.
Overall, dynamic partitioning is a well-implemented and tested feature, but is often overlooked in practice. Due to its proven ability to significantly ease user pain and improve performance, we encourage our customers to use dynamic partitioning—in accordance with the best practices outlined above—whenever it fits their requirements.
Partitioning are of two types in hive,
1. Dynamic and Static partitioning
When any one of them is used, queries are run against only a portion of the data, providing significant performance gains
Static Partitioning—Used when the values for partition columns are known well in advance of loading the data into a Hive table
Dynamic Partitioning—Used when the values for partition columns are known only during loading of the data into a Hive table
When to choose Dynamic Partitioning?
Deciding when to use dynamic partitioning can be very challenging. We recommend using dynamic partitioning in these common scenarios:
Loading from an existing table that is not partitioned: In this scenario, the user doesn’t employ partitioning initially because the table in question is expected to remain relatively small. However, over the course of time, the table grows quite large and performance issues begin to appear. These issues should be corrected using a one-time load to dynamically partition the table.
Unknown values for partition columns: In some scenarios, it’s very difficult to know the unique values of all partition columns unless the data is manually inspected. As you can imagine, manual inspection isn’t a realistic option when dealing with batch pipelines or terabytes of data. You can try writing a Hive query to retrieve all unique value sets of partition columns. Let’s say, the query result contains many unique value sets. You’ll end up creating and executing an “ALTER TABLE” statement for each unique value set. Running a Hive query, preparing alter table statements, and executing them will significantly delay your data pipelines. Also, it’s an error-prone and cumbersome process. Hence, static partitioning is not used under these circumstances. However, dynamic partitioning can come to your rescue if you’d like to offload this work to Hive. Hive can detect the unique value sets for partition columns and create partitions on-the-fly.
Let’s revisit the scenario we discussed earlier in this blog. Imagine you have a very big table with data accumulated over many years. You want to improve the performance by partitioning the data. To complicate the scenario further, imagine you decide to not only partition the data by year, but also by month, day, hour, and advertiser id (let’s assume your data has an advertiser_id column). In this particular case, dynamic partitioning is very helpful. Why? You may be running a pipeline that ingests data hourly. You know exactly which year, month, day, and hour the data belongs to, but you can’t assume or tell which advertiser_id each record contains unless you manually inspect the data. Hive can automatically partition the data on all the required columns if dynamic partitioning is used.
Modifying the number of partition columns: In this scenario, the user initially designs a table with limited partition columns. As the data grows, the user decides to address performance concerns by adding additional partition column[s]. This is done in Hive by: 1. creating a new table with all required partition columns, 2. loading data into the new table from the already existing partitioned table, and 3. deleting the existing table. Dynamic partitioning should be used to perform step (2), as explained earlier.
Avoiding Dynamic Partitioning Pitfalls
So far, we’ve covered the advantages of dynamic partitioning. Now, let’s look at some common mistakes and pitfalls that users often encounter. The most common mistake made when using dynamic—or even static—partitions is to specify an existing column name as a partition column. In Hive, the partition columns are virtual columns, which helps with organizing and fetching the data efficiently. If you see the following error while creating a table with partitions, it means the column specified in a partition specification is also a regular column present in the table schema:
"Error in semantic analysis: Column repeated in partitioning columns"
Some other important points to remember when using dynamic partitions are:
Typically, data is loaded into a partitioned table in Hive by “INSERT OVERWRITE TABLE <table_name> PARTITION (col1=value1, col2=value2…) SELECT <columns> FROM <table_name>…” query. As mentioned above, when using dynamic partitioning, values for partition columns must not be specified. Hive will detect the values automatically from the data.
In order to detect the values for partition columns automatically, partition columns must be specified at the end of the “SELECT” clause in the same order as they are specified in the “PARTITIONED BY” clause while creating the table.
If the table has only dynamic partition columns, then the configuration setting hive.exec.dynamic.partition.mode should be set to non-strict mode:
SET hive.exec.dynamic.partition.mode=non-strict;
Hive enforces a limit on the number of dynamic partitions it can create. The default is 100 dynamic partitions per node, with a total (default) limit of 1000 dynamic partitions across all nodes. However, this setting is configurable. If your job tries to create more partitions than allowed, you may see the following exception:
metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
The setting hive.exec.max.dynamic.partitions,controls the total number of dynamic partitions for a table. Whereas, hive.exec.max.dynamic.partitions.pernode controls the maximum number of dynamic partitions that can be created on a node. So, to fix the above exception, try gradually increasing the number of partitions allowed using the above settings:
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
ET hive.exec.max.dynamic.partitions.pernode=256;
Now, let’s look at a complete example demonstrating how to create and load data using dynamic partitioning.
Change dynamic partition mode to non-strict:
SET hive.exec.dynamic.partition.mode=non-strict
Create the table with partitions:
CREATE TABLE patents (
citing_patent INT,
cited_patent INT,
assignee STRING,
companyname STRING,
publication_date STRING)
PARTITIONED BY (
year INT,
month INT,
day INT)
Load the data:
INSERT OVERWRITE INTO TABLE patents PARTITION (year, month, day)
SELECT citing, cited, name, company, year(publication_date), month(publication_date), day(publication_date)
FROM patents_raw_data;
The table, “patents_raw_data”, is an external table, which points to patent raw data. Notice the order of the partition columns specified in the “SELECT” clause is in exactly the same order as the partition columns specified in the “PARTITIONED BY” clause in create table query. Also, the columns year, month, and day are purposefully specified at the very end in the “SELECT” clause. Hive splits the data into multiple partitions by year, month, and day values. It also updates the Hive metastore automatically without explicit user intervention.
Next, let’s look at what can happen with dynamic partitioning when the data is skewed. The data that belongs to one partition is sent to only one reducer. However, consider a scenario where 90 percent of the data belongs to only one partition and the rest is spread across multiple partitions. In this situation, one reducer will be heavily loaded, while all other reducers have finished their work. The time required to finish the job will depend solely on the longest running reducer. This will significantly increase the data load time.
To overcome this problem, we suggest our customers perform some queries on the data to check how evenly it’s distributed. The key idea is to distribute the data evenly across the reducers. To achieve even distribution, the table can be further divided by buckets or new partition columns. The solution will vary depending on the use case and the nature of the data. In one recent example, we worked with one of our customers to address their performance issues. After gaining a deeper understanding of their use case and data, we determined that the data was skewed and suggested bucketing the data by another column to evenly spread the data load across the reducers. This resolved the problem and significantly reduced their data load times.
There aren’t many significant downsides to using dynamic partitioning. The only word of caution here is, if you don’t know your data well, you could unintentionally implement bad partitioning strategies when using dynamic partitioning. Static partitioning is much more controlled and you have to know your data to define the partitions. With dynamic partitioning, a user might mistakenly assume that there will only be 10 partition values flowing in as part of a dataset, when a partition column actually has thousands of different values in the incoming data. This could create thousands of tiny partitions and have unintentional side-effects.
Overall, dynamic partitioning is a well-implemented and tested feature, but is often overlooked in practice. Due to its proven ability to significantly ease user pain and improve performance, we encourage our customers to use dynamic partitioning—in accordance with the best practices outlined above—whenever it fits their requirements.