Skip to main content
Solved

Replicate LISTAGG(DISTINCT columnA, ' | ') OVER (PARTITION BY columnB) in ONE Desktop

  • March 11, 2025
  • 1 reply
  • 36 views

Forum|alt.badge.img

Hi

I have a sub-query in Snowflake that is using the function of LISTAGG and PARTITION BY in multiple SELECT DISTINCT statements for example LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG, kindly see attachment to see the full SQL statement. 

Example of one of the subqueries:
 

โ€ฆ

en AS

(

SELECT DISTINCT 

    IDENTIFIER,

    ID,

    LISTAGG(DISTINCT ID, ' | ') OVER (PARTITION BY IDENTIFIER) AS ID_AGG,

    TRIM(SUBJECT_REF) AS EN_NUM,

    LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG,

    DQ_EN_CARD,

    LISTAGG(DISTINCT DQ_EN_CARD, ' | ') OVER (PARTITION BY IDENTIFIER) AS DQ_EN_CARD_AGG,

    DATA_SOURCE,

    ORIGINAL_SOURCE

FROM

    db.schema.view1    

WHERE 

    subj_type = 'English'

) ...

I wanted to replicate the same in Ataccama ONE Desktop using the Group Aggregator step, where

  1. I allocate each sub-query in one aggregation set with a condition of subj_type = '<value>' in the When condition.
  2. For each column in the SELECT DISTINCT subquery if it is a non-aggregated value, I used first() function to get a distinct value,  for example first(ID). I tried using distinct() but it was not working with an error of 'unknown column' when the column is clearly available there. Why is that so?
  3. For each column in the SELECT DISTINCT subquery if it is aggregated value, I used this expression to replicate the same as LISTAGG.. concatenate(distinct(DQ_EN_CARD),'|'), however some columns are not producing any results but some have results like 1|1|0 for column like DQ_EN_CARD_AGG when the actual result should only show distinct value of 1 and 0 so it should be 1|0. I believe because in Ataccama it behaves slightly different compared to the Snowflake as Snowflake read and group line by line, but Ataccama seems to read all aggregations sets first then only group them?
  4. Based on this query for example,  LISTAGG(DISTINCT TRIM(SUBJECT_REF), ' | ') OVER (PARTITION BY IDENTIFIER) AS EN_NUM_AGG I define the Group By in the step to IDENTIFIER to replicate the PARTITION BY. I bet the behaviour is different here since I did not get the expected results.
  5. Finally, I joined the right output with the left output of the Group Aggregator because to my understanding the right join only will produce the columns that we add in the aggregation set, other columns that are not used in the aggregation set shall be retrieved from the left output.

So I am not sure whether this step is best to fit in this use case or not. Appreciate your assistance. Attached is the full SQL statement.

Best answer by Phil Holbrook

Hi โ€‹@Radziah 

There are two steps in ONE Desktop that you could use to tackle this problem: the group aggregator and the representative creator.  Group aggregator is easier to understand coming from SQL - so we'll stick with that.  The way you are using the group aggregator generally looks OK.

Your main problem is that concatenate and distinct are very different types of expression in ONE Desktop.

Concatenate is an aggregating function - it is only meaningful in the context of a grouping operation (like the group aggregator) and it operates across each set of rows that are being grouped.  If you use the statement,
concatenate(DQ_EN_CARD, '|')
you are saying, 
"Take the value of DQ_EN_CARD from every row and concatenate them separated by the '|' character"
... exactly as you would expect from the SQL equivalent.

The "distinct" function is not an aggregating function and doesn't work like SQL.  It is not valid in an aggregating context, which is why you got the "Unknown column" error. Although the column exists, it is not single-valued.

Distinct is actually a word-set operation - fully specified as
set.distinct(<string>,<separator>)
but you can omit the "set." prefix.  There are many word set operators which can be very useful. They operate on a single string value, from a single row, and treat the string as a set of values with the specified separator. The separator defaults to a space, hence "word set" operator - it was designed to handle sentences.

So, if you use the statement:
distinct('1|3|2|3|1', '|')
you will get the result '1|3|2'. 

So - to achieve the concatenated set of distinct values you want in the group aggregator step, you need to specify
distinct(concatenate(DQ_EN_CARD, '|'), '|')

It has to be this way round, because you need to convert your multiple rows into a single string with the concatenate function, then deduplicate it with distinct.  You have to specify the separator in the context of both functions.

Similarly, there is a "sort" word set operation, so
sort(distinct('1|3|2|3|1', '|'), '|')
is '1|2|3' (Watch out, though - itโ€™s an alpha-sort!)

It is a common use case to group values using an aggregating function and a separator, and then process the resulting string with word set operations.

You can read up on both aggregating functions and word-set expressions in the ONE Desktop built-in help under 
Help -> Help Contents -> Ataccama ONE Desktop -> ONE Basics -> Expressions

View original
Did this topic help you find an answer to your question?

1 reply

Phil Holbrook
Ataccamer
Forum|alt.badge.img+1
  • Ataccamer
  • 16 replies
  • Answer
  • March 13, 2025

Hi โ€‹@Radziah 

There are two steps in ONE Desktop that you could use to tackle this problem: the group aggregator and the representative creator.  Group aggregator is easier to understand coming from SQL - so we'll stick with that.  The way you are using the group aggregator generally looks OK.

Your main problem is that concatenate and distinct are very different types of expression in ONE Desktop.

Concatenate is an aggregating function - it is only meaningful in the context of a grouping operation (like the group aggregator) and it operates across each set of rows that are being grouped.  If you use the statement,
concatenate(DQ_EN_CARD, '|')
you are saying, 
"Take the value of DQ_EN_CARD from every row and concatenate them separated by the '|' character"
... exactly as you would expect from the SQL equivalent.

The "distinct" function is not an aggregating function and doesn't work like SQL.  It is not valid in an aggregating context, which is why you got the "Unknown column" error. Although the column exists, it is not single-valued.

Distinct is actually a word-set operation - fully specified as
set.distinct(<string>,<separator>)
but you can omit the "set." prefix.  There are many word set operators which can be very useful. They operate on a single string value, from a single row, and treat the string as a set of values with the specified separator. The separator defaults to a space, hence "word set" operator - it was designed to handle sentences.

So, if you use the statement:
distinct('1|3|2|3|1', '|')
you will get the result '1|3|2'. 

So - to achieve the concatenated set of distinct values you want in the group aggregator step, you need to specify
distinct(concatenate(DQ_EN_CARD, '|'), '|')

It has to be this way round, because you need to convert your multiple rows into a single string with the concatenate function, then deduplicate it with distinct.  You have to specify the separator in the context of both functions.

Similarly, there is a "sort" word set operation, so
sort(distinct('1|3|2|3|1', '|'), '|')
is '1|2|3' (Watch out, though - itโ€™s an alpha-sort!)

It is a common use case to group values using an aggregating function and a separator, and then process the resulting string with word set operations.

You can read up on both aggregating functions and word-set expressions in the ONE Desktop built-in help under 
Help -> Help Contents -> Ataccama ONE Desktop -> ONE Basics -> Expressions


Reply


Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie settings