Hi
I have a sub-query in Snowflake that is using the function of LISTAGG and PARTITION BY in multiple SELECT DISTINCT statements for example LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG, kindly see attachment to see the full SQL statement.
Example of one of the subqueries:
โฆ
en AS
(
SELECT DISTINCT
IDENTIFIER,
ID,
LISTAGG(DISTINCT ID, ' | ') OVER (PARTITION BY IDENTIFIER) AS ID_AGG,
TRIM(SUBJECT_REF) AS EN_NUM,
LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG,
DQ_EN_CARD,
LISTAGG(DISTINCT DQ_EN_CARD, ' | ') OVER (PARTITION BY IDENTIFIER) AS DQ_EN_CARD_AGG,
DATA_SOURCE,
ORIGINAL_SOURCE
FROM
db.schema.view1
WHERE
subj_type = 'English'
) ...
I wanted to replicate the same in Ataccama ONE Desktop using the Group Aggregator step, where
- I allocate each sub-query in one aggregation set with a condition of subj_type = '<value>' in the When condition.
- For each column in the SELECT DISTINCT subquery if it is a non-aggregated value, I used first() function to get a distinct value, for example first(ID). I tried using distinct() but it was not working with an error of 'unknown column' when the column is clearly available there. Why is that so?
- For each column in the SELECT DISTINCT subquery if it is aggregated value, I used this expression to replicate the same as LISTAGG.. concatenate(distinct(DQ_EN_CARD),'|'), however some columns are not producing any results but some have results like 1|1|0 for column like DQ_EN_CARD_AGG when the actual result should only show distinct value of 1 and 0 so it should be 1|0. I believe because in Ataccama it behaves slightly different compared to the Snowflake as Snowflake read and group line by line, but Ataccama seems to read all aggregations sets first then only group them?
- Based on this query for example, LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG I define the Group By in the step to IDENTIFIER to replicate the PARTITION BY. I bet the behaviour is different here since I did not get the expected results.
- Finally, I joined the right output with the left output of the Group Aggregator because to my understanding the right join only will produce the columns that we add in the aggregation set, other columns that are not used in the aggregation set shall be retrieved from the left output.
So I am not sure whether this step is best to fit in this use case or not. Appreciate your assistance. Attached is the full SQL statement.