- Do everything in the RDBMS and just Sqoop over the results (makes sense only if Hadoop is used purely for data storage and discovery)
- Bring in incremental data and rebuild the dimension every day (makes sense only when the dataset is not huge, there are many users, and every access needs to be in SCD Type 2 form)
- Bring in incremental data and make sense of it only when you need it (makes sense whether or not the data is big, and however many users access it in SCD Type 2 fashion; a hedged Sqoop sketch for the incremental pull follows this list)
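For the incremental pull itself, Sqoop's incremental import mode is the usual vehicle. A minimal sketch, assuming a MySQL source table called employees and connection details that are not part of this post:

sqoop import \
  --connect jdbc:mysql://dbhost/hrdb \
  --username etl -P \
  --table employees \
  --target-dir /user/cloudera/testhiveSCD/staging \
  --incremental lastmodified \
  --check-column EffDt \
  --last-value '2014-06-01' \
  --append \
  --fields-terminated-by ','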
Now that we have suggested our final approach, let's dive deeper into the technical specifics of how to do it. It turns out you can implement SCD Type 2 schema on read using a Hive SerDe, a UDF, or the Hive TRANSFORM function. Now let's begin the step-by-step process to uncover the finer details:
- Bring in the External Data Source.
Let's assume the original dataset of a company that tracks its employees' onsite engagements looks as follows:
main_file.txt
1,Victor,CyberTron,2014-02-18,Chicago
2,Natalie,CyberTron,2014-01-01,NewYork
3,Nathan,CyberTron,2014-01-01,NewYork
4,Sam,Acme,2011-01-01,Boston
4,Sam,Acme,2014-06-01,Atlanta
change_file.txt
1,Victor,Acme,2014-06-09,Chicago
2,Natalie,Acme,2014-06-01,NewYork
3,Nathan,Acme,2014-06-01,NewYork
6,Hammond,Acme,2014-07-01,NewYork
- Load both files into an external staging table and then insert them into a table bucketed on Id and sorted on EffDt, as follows (a sketch of the staging step appears after the DDL):
Create Table IF NOT EXISTS VJ_Main
( Id int,
Name String,
Employer String,
EffDt String,
City String)
CLUSTERED BY (Id)
SORTED BY (EffDt)
INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
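The post does not show the staging step itself, so here is a minimal sketch of it; the table name VJ_Staging, the HDFS paths, and the hadoop fs commands are assumptions, not from the original workflow:

hadoop fs -mkdir -p /user/cloudera/testhiveSCD/staging
hadoop fs -put main_file.txt change_file.txt /user/cloudera/testhiveSCD/staging/

CREATE EXTERNAL TABLE IF NOT EXISTS VJ_Staging
( Id int,
Name String,
Employer String,
EffDt String,
City String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/testhiveSCD/staging';

-- Bucketing and sorting must be enforced at insert time (pre-Hive 2.0 settings).
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;
INSERT INTO TABLE VJ_Main
SELECT * FROM VJ_Staging;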
Query the table to verify the load:
hive -e "select * from vj_main"
4 Sam Acme 2011-01-01 Boston
4 Sam Acme 2014-06-01 Atlanta
1 Victor CyberTron 2014-02-18 Chicago
1 Victor Acme 2014-06-09 Chicago
2 Natalie CyberTron 2014-01-01 NewYork
2 Natalie Acme 2014-06-01 NewYork
6 Hammond Acme 2014-07-01 NewYork
3 Nathan CyberTron 2014-01-01 NewYork
3 Nathan Acme 2014-06-01 NewYork
- Next, execute the following Hive query with the TRANSFORM function. Because the table is bucketed on Id and sorted on EffDt, every version of a given Id reaches the script adjacent to the others and in effective-date order, which is exactly what the windowing logic below relies on:
ADD FILE /home/cloudera/testhiveSCD/map_script.sh;
FROM VJ_Main X
SELECT TRANSFORM ( X.Id, X.Name, X.Employer, X.EffDt, X.City )
USING 'map_script.sh'
AS Id, Name, Employer, City, EffStartDt, EffEndDt;
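If you would rather not depend on the table's physical bucketing and sort order, the same grouping can be requested in the query itself. This variant is a sketch, not from the original post, but uses standard Hive DISTRIBUTE BY / SORT BY syntax:

ADD FILE /home/cloudera/testhiveSCD/map_script.sh;
FROM (
  SELECT Id, Name, Employer, EffDt, City
  FROM VJ_Main
  DISTRIBUTE BY Id
  SORT BY Id, EffDt
) X
SELECT TRANSFORM ( X.Id, X.Name, X.Employer, X.EffDt, X.City )
USING 'map_script.sh'
AS Id, Name, Employer, City, EffStartDt, EffEndDt;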
The bash script 'map_script.sh' is as follows:
#!/bin/bash
# Consumes tab-separated rows (Id, Name, Employer, EffDt, City) that arrive
# grouped by Id and sorted ascending by EffDt; emits one SCD Type 2 row per
# input row: Id, Name, Employer, City, EffStartDt, EffEndDt (tab-separated).
# '\N' is Hive's default text representation of NULL.
unset prevId
EffEndDt='\N'
while read -r line
do
    firstpart=$(echo "$line" | cut -f1-3)     # Id, Name, Employer
    secondpart=$(echo "$line" | cut -f5-)     # City
    EffDt=$(echo "$line" | cut -f4)
    EffStartDt="$EffDt"
    RecId=$(echo "$line" | cut -f1)
    if [ "X${prevId}" != "X${RecId}" ]
    then
        # New Id: the buffered record was the latest version of the previous
        # Id, so its end date stays open (null).
        EffEndDt='\N'
    else
        # Same Id: the buffered record is closed out by the current row's
        # effective date.
        EffEndDt="$EffDt"
    fi
    if [ ! -z "$prevId" ]
    then
        printf '%s\t%s\t%s\t%s\n' "$prevfirstpart" "$prevsecondpart" "$prevStartDt" "$EffEndDt"
    fi
    prevId="$RecId"
    prevfirstpart="$firstpart"
    prevsecondpart="$secondpart"
    prevStartDt="$EffStartDt"
done
# Flush the last buffered record; it is the latest version of its Id, so its
# end date must be open (null), not the stale value left over from the loop.
if [ ! -z "$prevId" ]
then
    printf '%s\t%s\t%s\t%s\n' "$prevfirstpart" "$prevsecondpart" "$prevStartDt" '\N'
fi
- After the query executes, the result appears as follows:
4 Sam Acme Boston 2011-01-01 2014-06-01
4 Sam Acme Atlanta 2014-06-01 (null)
1 Victor CyberTron Chicago 2014-02-18 2014-06-09
1 Victor Acme Chicago 2014-06-09 (null)
2 Natalie CyberTron NewYork 2014-01-01 2014-06-01
2 Natalie Acme NewYork 2014-06-01 (null)
6 Hammond Acme NewYork 2014-07-01 (null)
3 Nathan CyberTron NewYork 2014-01-01 2014-06-01
3 Nathan Acme NewYork 2014-06-01 (null)
Now the output is in SCD Type 2 format. This entire process can also be done natively in Hive by writing a Hive UDF or by overriding the deserializer of a SerDe to implement the same windowing as the bash script. The only difference is that those implementations are written in Java and use StructObjectInspectors instead of the shell variables used above.
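With the history in this shape, point-in-time lookups become a simple range filter. Here is a hedged sketch of persisting the transform output and querying it as of a date; the table name VJ_SCD2 and the as-of date are assumptions, not from the post:

ADD FILE /home/cloudera/testhiveSCD/map_script.sh;

CREATE TABLE IF NOT EXISTS VJ_SCD2
( Id int,
Name String,
Employer String,
City String,
EffStartDt String,
EffEndDt String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

FROM VJ_Main X
INSERT OVERWRITE TABLE VJ_SCD2
SELECT TRANSFORM ( X.Id, X.Name, X.Employer, X.EffDt, X.City )
USING 'map_script.sh'
AS Id, Name, Employer, City, EffStartDt, EffEndDt;

-- Where was each employee engaged as of 2014-03-01?
SELECT Id, Name, Employer, City
FROM VJ_SCD2
WHERE EffStartDt <= '2014-03-01'
  AND ( EffEndDt IS NULL OR EffEndDt > '2014-03-01' );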