recommended way to use DECIMAL in Parquet

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

recommended way to use DECIMAL in Parquet

Dave Oshinsky
I've been experimenting with transferring information from Oracle RDBMS to Parquet files, and then querying the Parquet using Drill.  Oracle tables often contain a lot of DECIMAL information, since that seems to be a very common way to represent numbers there.  My initial approach was to archive the DECIMAL values to variable length "bytes" in the Parquet.  However, I found that Drill does  not expect to see DECIMAL values as "bytes" (variable length), but rather as one of the fixed length types.  See https://issues.apache.org/jira/browse/DRILL-4184 for more info on this.

I then tried archiving DECIMAL values to fixed length byte arrays in Parquet (rather than variable length ones), with the length being the full precision (size) of each DECIMAL field.  This increases my Parquet file sizes by about 10%, but now Drill queries with a where clause on the DECIMAL values work, but with some oddities as shown here:

0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/ acct2M_after.parquet` where acct_no = 70000008.0;
+-----------+----------+
|  acct_no  |  curbal  |
+-----------+----------+
| 70000008  | 7000.00  |
+-----------+----------+
1 row selected (25.411 seconds)
0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/acct2M_after.parquet` where acct_no = 70000008;
+----------+---------+
| acct_no  | curbal  |
+----------+---------+
+----------+---------+
No rows selected (24.331 seconds)
0: jdbc:drill:zk=local> select acct_no, curbal from dfs.`c:/dao/acct2M_after.parquet` where cast(acct_no AS DECIMAL) = 70000008;
+-----------+----------+
|  acct_no  |  curbal  |
+-----------+----------+
| 70000008  | 7000.00  |
+-----------+----------+
1 row selected (25.742 seconds)
0: jdbc:drill:zk=local>

Note how the query only finds the row if the value in the where clause has ".0" appended, or the value is explicitly casted to DECIMAL.

Two questions:

1)      Are there any plans to allow for the more efficient storage of DECIMAL as "bytes" (rather than fixed length byte arrays, at the full/maximum precision)?

2)      Is it expected that the ".0" must be appended in the where clause (or an explicit cast applied to DECIMAL), or should I create a JIRA for that?  Intuitively speaking, a number is a number (whether it's decimal or not) so this behavior was a bit unexpected to me.

Thanks,
Dave Oshinsky





***************************Legal Disclaimer***************************
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or distribution
by others is strictly prohibited. If you have received the message by mistake,
please advise the sender by reply email and delete the message. Thank you."
**********************************************************************