vkptechdiary: Using Group By Without Aggregate Functions Versus Distinct

Tuesday, 23 May 2017

Using Group By Without Aggregate Functions Versus Distinct

Group by without Aggregate functions versus Distinct (sql stuff)

Given this data:

Col1  Col2  Col3
 A     X     1
 A     Y     2
 A     Y     3
 B     X     0
 B     Y     3
 B     Z     1

The below query

SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2, Col3

Would result in exactly the same table.

However, this query:

SELECT Col1, Col2 FROM data GROUP BY Col1, Col2

Would result in

Col1  Col2
 A     X  
 A     Y  
 B     X  
 B     Y  
 B     Z

Now, a query:

SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2

Would create a problem: the line with A, Y is the result of grouping the two lines

 A     Y     2
 A     Y     3

So, which value should be in Col3, '2' or '3'?
Normally you would use a group by to calculate e.g. a sum:

SELECT Col1, Col2, SUM(Col3) FROM data GROUP BY Col1, Col2

So in the line we had a problem with we now get (2+3) = 5.

Grouping by all your columns in your select is effectively the same as using DISTINCT, and it is preferable to use the DISTINCT keyword word readability in this case.
So instead of

SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2

use

SELECT DINSTINCT Col1, Col2, Col3 FROM data

vkptechdiary

Search This Blog

Tuesday, 23 May 2017

Using Group By Without Aggregate Functions Versus Distinct

No comments:

Post a Comment

Spark Memory Management