Thoughts on statistical inference and big data: BigQuery: don’t count(*) -> sum(1) !!!

I was always taught that the way to count rows in SQL is count(*). Google tutorial for BigQuery also has that - link.

But there is a problem with such approach if you start working with complicated queries that use sub-queries. Look at this code:

SELECT
  SUM(IF(year%2=0, children, 0))
FROM (
  SELECT year, COUNT(1) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year 
)

It will return a mistake:

“Error: Argument type mismatch in function IF: 'children' is type uint64, '0' is type int32.”

Somewhere inside the guts of BigQuery count(*) and 0 have different types and it returns a mistake. Of course you can cast count(*) as integer – this code works:

SELECT
  SUM(IF(year%2=0, INTEGER(children), 0))
FROM (
  SELECT year, COUNT(*) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year
)

But this solution is a nightmare – you never recall it upfront, only after BigQuery throws a mistake. A better way is to do INTEGER(count(*)) in the very beginning.

Today I learned that the best way is to do sum(1). Shorter and just works, use everywhere and never get that mistake again:

SELECT
  SUM(IF(year%2=0, children, 0))
FROM (
  SELECT year, sum(1) AS children
  FROM [publicdata:samples.natality]
  GROUP BY year 
)

Thoughts on statistical inference and big data

Thursday, November 26, 2015

BigQuery: don’t count(*) -> sum(1) !!!

No comments:

Post a Comment