Search This Blog

Tuesday, 23 May 2017

Using Group By Without Aggregate Functions Versus Distinct

Group by without Aggregate functions versus Distinct (sql stuff)




Given this data:
Col1  Col2  Col3
 A     X     1
 A     Y     2
 A     Y     3
 B     X     0
 B     Y     3
 B     Z     1

The below query


SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2, Col3

Would result in exactly the same table.


However, this query:
SELECT Col1, Col2 FROM data GROUP BY Col1, Col2
Would result in
Col1  Col2
 A     X  
 A     Y  
 B     X  
 B     Y  
 B     Z  

Now, a query:
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2
Would create a problem: the line with A, Y is the result of grouping the two lines
 A     Y     2
 A     Y     3
So, which value should be in Col3, '2' or '3'?
Normally you would use a group by to calculate e.g. a sum:
SELECT Col1, Col2, SUM(Col3) FROM data GROUP BY Col1, Col2

So in the line we had a problem with we now get (2+3) = 5.


Grouping by all your columns in your select is effectively the same as using DISTINCT, and it is preferable to use the DISTINCT keyword word readability in this case.
So instead of
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2
use
SELECT DINSTINCT Col1, Col2, Col3 FROM data

No comments:

Post a Comment

Spark Memory Management

 Spark’s performance advantage over MapReduce  is due to Spark’s In-Memory Persistence and Memory Management Rather than writing to disk ...