I was always taught that the way to count rows in SQL is count(*). Google tutorial for BigQuery also has that - link.
But there is a problem with such approach if you start working with complicated queries that use sub-queries. Look at this code:
It will return a mistake:
Somewhere inside the guts of BigQuery count(*) and 0 have different types and it returns a mistake. Of course you can cast count(*) as integer – this code works:
But this solution is a nightmare – you never recall it upfront, only after BigQuery throws a mistake. A better way is to do INTEGER(count(*)) in the very beginning.
Today I learned that the best way is to do sum(1). Shorter and just works, use everywhere and never get that mistake again:
But there is a problem with such approach if you start working with complicated queries that use sub-queries. Look at this code:
SELECT
SUM(IF(year%2=0, children, 0))
FROM (
SELECT year, COUNT(1) AS children
FROM [publicdata:samples.natality]
GROUP BY year
)
It will return a mistake:
“Error: Argument type mismatch in function IF: 'children' is type uint64, '0' is type int32.”
Somewhere inside the guts of BigQuery count(*) and 0 have different types and it returns a mistake. Of course you can cast count(*) as integer – this code works:
SELECT
SUM(IF(year%2=0, INTEGER(children), 0))
FROM (
SELECT year, COUNT(*) AS children
FROM [publicdata:samples.natality]
GROUP BY year
)
But this solution is a nightmare – you never recall it upfront, only after BigQuery throws a mistake. A better way is to do INTEGER(count(*)) in the very beginning.
Today I learned that the best way is to do sum(1). Shorter and just works, use everywhere and never get that mistake again:
SELECT
SUM(IF(year%2=0, children, 0))
FROM (
SELECT year, sum(1) AS children
FROM [publicdata:samples.natality]
GROUP BY year
)
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.