
Friday, 3 February 2017

Performing SCD Type 2 in Hive


To begin with, I want to suggest a couple of approaches to implementing an SCD Type 2 dimension in Hive:
- Do everything in the RDBMS and just Sqoop the results (makes sense only if Hadoop is used purely for data storage and discovery)
- Bring in incremental data and rebuild the dimension every day (makes sense only when the dataset is not huge, many users access it, and it must always be accessed in SCD Type 2 form)
- Bring in incremental data and make sense of it only when you need it (makes sense however big the data is and however many users access it in SCD Type 2 fashion)

Now that we have settled on our final approach, let's dive deeper into the technical specifics of how to do it. It turns out you can implement SCD Type 2 as schema-on-read using a Hive SerDe, a UDF, or the Hive TRANSFORM function. Let's begin the step-by-step process to uncover the finer details:
- Bring in the external data source.
 Let's assume the original dataset of a company that tracks its employees' onsite engagements looks as follows:
 main_file.txt 
1,Victor,CyberTron,2014-02-18,Chicago
2,Natalie,CyberTron,2014-01-01,NewYork
3,Nathan,CyberTron,2014-01-01,NewYork
4,Sam,Acme,2011-01-01,Boston
4,Sam,Acme,2014-06-01,Atlanta

change_file.txt 
1,Victor,Acme,2014-06-09,Chicago
2,Natalie,Acme,2014-06-01,NewYork
3,Nathan,Acme,2014-06-01,NewYork
6,Hammond,Acme,2014-07-01,NewYork

- Add both files to an external table, then insert into a table bucketed by Id, as follows. The table is sorted by (Id, EffDt) so that within each bucket all records for a given Id sit together in effective-date order, which is exactly what the transform script below relies on. The loading step itself is sketched after the DDL.

CREATE TABLE IF NOT EXISTS VJ_Main (
    Id        int,
    Name      String,
    Employer  String,
    EffDt     String,
    City      String)
CLUSTERED BY (Id)
SORTED BY (Id, EffDt)
INTO 4 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
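
A minimal sketch of the load itself, assuming a staging table named vj_staging and file locations under /home/cloudera/testhiveSCD (both are my assumptions, not from the original run):

CREATE EXTERNAL TABLE IF NOT EXISTS vj_staging (
    Id int, Name String, Employer String, EffDt String, City String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/cloudera/testhiveSCD/main_file.txt' INTO TABLE vj_staging;
LOAD DATA LOCAL INPATH '/home/cloudera/testhiveSCD/change_file.txt' INTO TABLE vj_staging;

-- On Hive releases before 2.0, these settings make the insert honour
-- the CLUSTERED BY / SORTED BY spec of VJ_Main
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE VJ_Main
SELECT Id, Name, Employer, EffDt, City FROM vj_staging;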

Query the table to verify its contents:
hive -e "select * from vj_main"
4  Sam      Acme       2011-01-01  Boston
4  Sam      Acme       2014-06-01  Atlanta
1  Victor   CyberTron  2014-02-18  Chicago
1  Victor   Acme       2014-06-09  Chicago
2  Natalie  CyberTron  2014-01-01  NewYork
2  Natalie  Acme       2014-06-01  NewYork
6  Hammond  Acme       2014-07-01  NewYork
3  Nathan   CyberTron  2014-01-01  NewYork
3  Nathan   Acme       2014-06-01  NewYork

- At this step, we execute the following Hive query with the TRANSFORM function:

ADD FILE /home/cloudera/testhiveSCD/map_script.sh;
FROM VJ_Main X
SELECT TRANSFORM ( X.Id, X.Name, X.Employer, X.EffDt, X.City )
USING 'map_script.sh'
AS Id, Name, Employer, City, EffStartDt, EffEndDt;
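
To persist the result rather than just print it, the same TRANSFORM can feed an insert into a results table; a minimal sketch, where the target table vj_scd2 is my assumption:

CREATE TABLE IF NOT EXISTS vj_scd2 (
    Id int, Name String, Employer String, City String,
    EffStartDt String, EffEndDt String);

ADD FILE /home/cloudera/testhiveSCD/map_script.sh;
FROM VJ_Main X
INSERT OVERWRITE TABLE vj_scd2
SELECT TRANSFORM ( X.Id, X.Name, X.Employer, X.EffDt, X.City )
USING 'map_script.sh'
AS Id, Name, Employer, City, EffStartDt, EffEndDt;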

The bash script 'map_script.sh' is as follows:
#!/bin/bash
# Input rows arrive tab-delimited as (Id, Name, Employer, EffDt, City),
# sorted by (Id, EffDt); output is (Id, Name, Employer, City, EffStartDt, EffEndDt).
unset prevId
unset EffEndDt
while IFS= read -r line
do
    firstpart=`echo "$line" | cut -f1-3`    # Id, Name, Employer
    secondpart=`echo "$line" | cut -f5-`    # City
    EffDt=`echo "$line" | cut -f4`
    EffStartDt="$EffDt"
    RecId=`echo "$line" | cut -f1`
    if [ "X${prevId}" != "X${RecId}" ]
    then
        # New Id: the previous record was the latest for its Id, so it stays open-ended
        unset EffEndDt
    else
        # Same Id: this record's effective date closes out the previous record
        EffEndDt="$EffDt"
    fi
    if [ -n "$prevId" ]
    then
        printf '%s\t%s\t%s\t%s\n' "$prevfirstpart" "$prevsecondpart" "$prevStartDt" "$EffEndDt"
    fi
    prevId="$RecId"
    prevfirstpart="$firstpart"
    prevsecondpart="$secondpart"
    prevStartDt="$EffStartDt"
done
# The last record read is the current one for its Id, so its end date is empty (NULL in Hive)
printf '%s\t%s\t%s\t\n' "$prevfirstpart" "$prevsecondpart" "$prevStartDt"

- After the query executes, the result appears as follows:
4  Sam      Acme       Boston   2011-01-01  2014-06-01
4  Sam      Acme       Atlanta  2014-06-01  (null)
1  Victor   CyberTron  Chicago  2014-02-18  2014-06-09
1  Victor   Acme       Chicago  2014-06-09  (null)
2  Natalie  CyberTron  NewYork  2014-01-01  2014-06-01
2  Natalie  Acme       NewYork  2014-06-01  (null)
6  Hammond  Acme       NewYork  2014-07-01  (null)
3  Nathan   CyberTron  NewYork  2014-01-01  2014-06-01
3  Nathan   Acme       NewYork  2014-06-01  (null)

Now the output is in SCD Type 2 format. This entire process can also be done in Hive with a custom UDF, or by overriding the deserializer to implement the same windowing as the bash script. The only difference is that those implementations would be written in Java, using StructObjectInspectors instead of environment variables to carry state between rows.
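
As an aside, on Hive 0.11 and later the same windowing can be expressed with the built-in LEAD analytic function, with no external script at all; a minimal sketch:

-- Each record's end date is the next record's start date within the same Id,
-- and NULL for the latest (current) record
SELECT Id, Name, Employer, City,
       EffDt AS EffStartDt,
       LEAD(EffDt) OVER (PARTITION BY Id ORDER BY EffDt) AS EffEndDt
FROM VJ_Main;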
