Search This Blog

Sunday, 28 May 2017

scd type without duplicates

SCDtype without duplicates, Used in hive as due to immature update functionalities in hive 

truncate table cust_income;
truncate table cust_income_intmdt;
truncate table cust_income_target;


insert into table cust_income values
(1,'P','100000','20000','120000','1-2-2016'),
(2,'P','120000','20000','140000','1-2-2016'),
(3,'P','130000','20000','150000','1-2-2016');


insert into cust_income_intmdt
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, tgt.end_dt from cust_income_target tgt where tgt.end_dt is not null
union all
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, src.dw_ld_ts end_dt from cust_income_target tgt
left outer join
cust_income src on (tgt.cust_id = src.cust_id) where tgt.end_dt is null and (tgt.annual_net_inc/NVL(src.annual_net_inc,1))<> 1
union all
select src.cust_id, src.person_org_code, src.annual_net_inc, src.total_tax, src.est_gross_inc, src.dw_ld_ts, src.dw_ld_ts start_dt, null end_dt from cust_income src
left outer join
cust_income_target tgt on (src.cust_id = tgt.cust_id) where (src.annual_net_inc/NVL(tgt.annual_net_inc,1))<> 1;

truncate table cust_income_target;

insert into cust_income_target
select * from cust_income_intmdt;

truncate table cust_income_intmdt;

truncate table cust_income;

insert into table cust_income values
(1,'P','110000','20000','130000','2-2-2016'),
(2,'P','130000','20000','150000','2-2-2016'),
(4,'P','130000','20000','150000','2-2-2016'),
(3,'P','130000','20000','170000','1-2-2017');

insert into cust_income_intmdt
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, tgt.end_dt from cust_income_target tgt where tgt.end_dt is not null
union all
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, case when(tgt.annual_net_inc/NVL(src.annual_net_inc,1))<> 1 then src.dw_ld_ts else tgt.end_dt end end_dt from cust_income_target tgt
left outer join
cust_income src on (tgt.cust_id = src.cust_id) where tgt.end_dt is null
union all
select src.cust_id, src.person_org_code, src.annual_net_inc, src.total_tax, src.est_gross_inc, src.dw_ld_ts, src.dw_ld_ts start_dt, null end_dt from cust_income src
left outer join
cust_income_target tgt on (src.cust_id = tgt.cust_id) where (src.annual_net_inc/NVL(tgt.annual_net_inc,1))<> 1;

truncate table cust_income_target;

insert into cust_income_target
select * from cust_income_intmdt;

truncate table cust_income_intmdt;

truncate table cust_income;

insert into table cust_income values
(4,'P','130000','20000','150000','3-2-2017'),
(2,'P','140000','20000','160000','3-2-2016'),
(5,'P','150000','70000','170000','3-2-2016');

insert into cust_income_intmdt
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, tgt.end_dt from cust_income_target tgt where tgt.end_dt is not null
union all
select tgt.cust_id, tgt.person_org_code, tgt.annual_net_inc, tgt.total_tax, tgt.est_gross_inc, tgt.dw_ld_ts, tgt.start_dt, case when(tgt.annual_net_inc/NVL(src.annual_net_inc,1))<> 1 then src.dw_ld_ts else tgt.end_dt end end_dt from cust_income_target tgt
left outer join
cust_income src on (tgt.cust_id = src.cust_id) where tgt.end_dt is null
union all
select src.cust_id, src.person_org_code, src.annual_net_inc, src.total_tax, src.est_gross_inc, src.dw_ld_ts, src.dw_ld_ts start_dt, null end_dt from cust_income src
left outer join
cust_income_target tgt on (src.cust_id = tgt.cust_id) where (src.annual_net_inc/NVL(tgt.annual_net_inc,1))<> 1;

truncate table cust_income_target;

insert into cust_income_target
select * from cust_income_intmdt;

truncate table cust_income_intmdt;

Sql Tips

  1. FROM generates the data set
  2. WHERE filters the generated data set
  3. GROUP BY aggregates the filtered data set
  4. HAVING filters the aggregated data set
  5. SELECT transforms the filters aggregated data set
  6. ORDER BY sorts the transformed data set
  7. LIMIT .. OFFSET frames the sorted data set

Tuesday, 23 May 2017

Using Group By Without Aggregate Functions Versus Distinct

Group by without Aggregate functions versus Distinct (sql stuff)




Given this data:
Col1  Col2  Col3
 A     X     1
 A     Y     2
 A     Y     3
 B     X     0
 B     Y     3
 B     Z     1

The below query


SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2, Col3

Would result in exactly the same table.


However, this query:
SELECT Col1, Col2 FROM data GROUP BY Col1, Col2
Would result in
Col1  Col2
 A     X  
 A     Y  
 B     X  
 B     Y  
 B     Z  

Now, a query:
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2
Would create a problem: the line with A, Y is the result of grouping the two lines
 A     Y     2
 A     Y     3
So, which value should be in Col3, '2' or '3'?
Normally you would use a group by to calculate e.g. a sum:
SELECT Col1, Col2, SUM(Col3) FROM data GROUP BY Col1, Col2

So in the line we had a problem with we now get (2+3) = 5.


Grouping by all your columns in your select is effectively the same as using DISTINCT, and it is preferable to use the DISTINCT keyword word readability in this case.
So instead of
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2
use
SELECT DINSTINCT Col1, Col2, Col3 FROM data

Sunday, 21 May 2017

Yarn verus Zookeeper


Yarn Vs Zookeeper (in brief)

YARN is the resource manager in Hadoop-2 architecture. It is similar to Mesos, as a role:
Given a cluster, and requests of resources, YARN will grant access to those resources (by making orders to NodeManagers which actually manage nodes). So YARN is the central scheduling coordinator of the cluster taking care that job requests get scheduled to the cluster in an orderly fashion taking into accounts resources constraints, scheduling strategies, priorities, fairness, and any rules.
So yes, YARN manages a cluster of nodes from the resource allocation coordination and scheduling perspective.


Zookeeper is in another business: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Zookeeper is a cluster of its own, with 3 or 5 nodes, and does not manage a cluster outside of it, it just like a database superficially, it allows writes and reads, in a consistent fashion (it is a CP system from CAP perspective).

Now to their relation: YARN has a HA variant (a highly available setup). In that HA setup, Automatic failover (though embedded leader election) is set up via Zookeeper.
How does this failover works automatically over zookeeper generically? (meaning, nothing yarn specific here, imagine any daemon with failover capability over a set of hosts): You can simply imagine that in zookeeper, there is a piece of information about "what yarn nodes are there"? and there could be 0 (nasty, yarn is down), 1 (ok, we got yarn up), or 2 (great, first node from this list is the current yarn master, while the second one is a standby failover yarn node, currently waiting and just copying updates from the master so he is ready if the times come. notice that there is an order here, which can be lexicographical, sorting some attribute of the hosts or host names themselves). This is just an example how leader election would work: the leader is the first element in a sorted list of nodes "competing" to be a leader of the pack.

Spark Memory Management

 Spark’s performance advantage over MapReduce  is due to Spark’s In-Memory Persistence and Memory Management Rather than writing to disk ...