Solved

Ataccama One Desktop - Select only distinct rows

September 21, 2022

Marnix Wisselaar
Star Blazer L2

Is there a way to quickly select only distinct rows in Ataccama One Desktop?

(I currently use a record descriptor and a filter with the regex find(“.:.:1”, rd_column).)

Best answer by DannyRyan

DannyRyan
Head of Technical Training
Answer
September 21, 2022

Hi @Marnix Wisselaar
Using the Record Descriptor Builder step is certainly the best practice approach to solve this use case.

An alternative to using the Filter step with Regular Expression is to use the built-in expression function word().

I have attached a quick example using the expression function word().

select_only_distinct_records.zip

Danny | Did it answer your question? Mark it as the correct answer!

Marnix Wisselaar
Author
Star Blazer L2
September 21, 2022

Thanks so much. Great how you documented it! Will certainly be added to our cookbook ;)

Radziah
Data Voyager
February 26, 2025

Hi @DannyRyan

I downloaded your plan and ran it, but unfortunately I could not see any results: both columns came out blank. What do the input and the expected results look like on your end? I have a similar use case, replicating a Snowflake select distinct in Ataccama.

DannyRyan
Head of Technical Training
February 26, 2025

Dear @Radziah,

Thank you for reaching out and for trying out the plan. I understand you're experiencing an issue with blank results in both columns, and I'm happy to help you troubleshoot this.

To better understand the situation, would you mind sharing a screenshot of your plan's configuration and the execution results? This will give me valuable context.

The original plan is designed to generate 100 random records, and it should function correctly. Given the randomized nature of the input data, it's worth noting that running the plan multiple times can produce varied datasets, which may affect the distinct values reflected in the output.

It's also possible that there might have been modifications to the plan, potentially impacting the flow of the 100 randomly generated records into the RecordDescriptorBuilder step. Sharing your plan file would allow us to review it together and pinpoint any potential discrepancies.

Please feel free to attach the screenshot and your plan file to your reply. I look forward to assisting you in resolving this issue.

Radziah
Data Voyager
February 27, 2025

Hi @DannyRyan

Thanks for the prompt reply. I did not change anything in the plan and ran it as is. Please find attached my plan and a screenshot of the output.

I also tried changing the random record generation from 100 to 50, which gave me blank output as well.

select_only_distinct_records.zip

DannyRyan
Head of Technical Training
February 27, 2025

Hi @Radziah

I've made some tweaks to the plan and attached a new version for you. This one should now show both all the records and the filtered ones, which I hope will be much clearer!

Here's a breakdown of what I've changed:

Replacing the Filter step: I switched the regular "Filter" step to an "Extract Filter" step. This allows us to output both the full dataset and the filtered results within the same pipeline, making it easier to see what's happening.
Adding a Text File Writer: I've included an extra "Text File Writer" step to capture all the records. This way, you'll have a complete view of the data.
Fixing the filter condition: This was the key change! The original filter condition was preventing any results from showing up.

Let's talk about the expression change in the Filter/Extract Filter step:

Original: word(rd_value,2) = '1'
Updated: word(rd_value,1,':') = '1'

Here's why:

The "Record Descriptor" has three values separated by colons (":").
The first value is the "Group ID," the second is the "Group Size," and the third is the "Position" within the "Group ID."
To find distinct records, we need to look for records where the "Group Size" is '1' (meaning it's the only record of its kind).
The original expression was looking at the wrong index. We needed to target the "Group Size" (index 1), not the "Position" (index 2).
Also we needed to tell the word() function that the delimiter was a colon.
So, by changing the index to 1 and explicitly setting the delimiter to ":", we're now correctly checking for records with a "Group Size" of '1'.
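The difference between the two expressions can be sketched outside Ataccama. The Python below is only an illustration: the helper word() is a hypothetical stand-in, not Ataccama's implementation, and its space default for the delimiter is an assumption made to show how a missing delimiter could leave the filter with nothing to match. Indexes are zero-based, as described above.

```python
def word(text: str, index: int, delimiter: str = " ") -> str:
    """Hypothetical stand-in for the word() expression.

    Returns the zero-based piece of `text` split on `delimiter`,
    or an empty string when that piece does not exist.
    The default delimiter (a space) is an assumption.
    """
    parts = text.split(delimiter)
    return parts[index] if 0 <= index < len(parts) else ""


# Descriptor layout from the thread: group ID : group size : position
rd_value = "3:1:1"

# Original condition: no delimiter given, so the string is never split on ':'
original_matches = word(rd_value, 2) == "1"

# Updated condition: split on ':' and read the group size (index 1)
updated_matches = word(rd_value, 1, ":") == "1"

print(original_matches, updated_matches)
```

In this sketch the original condition never matches because the whole descriptor stays in one piece, while the updated condition reads the group size and matches.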

I really hope this updated plan works well for you and helps you get a better grasp of:

Record Descriptors
The word() expression
Debugging plans and expressions

Please don't hesitate to reach out if you have any more questions or need further clarification. I'm happy to help!

Best regards,

Danny

select_only_distinct_records.zip

Radziah
Data Voyager
March 4, 2025

Thanks @DannyRyan

I ran the latest plan a couple of times and got what I assume is the expected output: only ‘Community’ appears in output_distinct, since its group size is 1, based on the rd_value of 3:1:1. Is that right?

My understanding of select distinct is this: say we have generated 49 rows of ‘Data’, 50 rows of ‘People’ and 1 row of ‘Community’. Shouldn't the output then have three rows in total: 1 row ‘Data’, 1 row ‘People’ and 1 row ‘Community’? After running your plan multiple times, the only output_distinct I managed to get is like the one in the image below:

DannyRyan
Head of Technical Training
March 4, 2025

Dear @Radziah,

Thank you for your inquiry. It appears your specific requirement for extracting unique values from a dataset differs from the original post's context.

To achieve the desired outcome of displaying only unique values for each group within your data, you can utilize the following implementation.

Record descriptors, as previously mentioned, consist of three components separated by colons (:). These components are:

Record Group Identifier (Index 0): This identifies the group to which a record belongs. In your case, it would represent the distinct categories (e.g., 'People', 'Data', 'Community').
Group Size (Index 1): This indicates the total number of records within a specific group.
Position within Group (Index 2): This denotes the ordinal position of a record within its group. For instance, if a group has three records, their positions would be 1, 2, and 3.

To select only one record per Group ID, we can apply a filter based on the position within the group. Specifically, we can filter for records where the position is equal to 1. This can be expressed as:

word(rd_value, 2, ':') = '1'

This condition can be interpreted as follows:

Split the Record Descriptor string into three distinct words, using the colon (:) character as the delimiter.
Evaluate the third word (index 2), which represents the Position within Group.
If the Position within Group is equal to '1', then retain the corresponding record.

Consequently, the output will comprise all records with a Position within Group value of 1, effectively representing the first record from each distinct group.
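The two filters discussed in this thread can be compared with a small simulation. The Python below is a sketch only: build_descriptors() is a hypothetical helper that mimics the "group ID:group size:position" layout described above, and the group IDs are simply assigned in order of first appearance, which is an assumption. The input uses the 49/50/1 row counts from the question.

```python
from collections import Counter, defaultdict


def build_descriptors(values):
    """Return one 'group_id:group_size:position' string per input value.

    A simplified model of a record descriptor; group IDs are assigned
    in order of first appearance (an assumption for this sketch).
    """
    sizes = Counter(values)      # total records per group
    group_ids = {}               # value -> group ID
    seen = defaultdict(int)      # running position within each group
    descriptors = []
    for value in values:
        group_id = group_ids.setdefault(value, len(group_ids) + 1)
        seen[value] += 1
        descriptors.append(f"{group_id}:{sizes[value]}:{seen[value]}")
    return descriptors


values = ["Data"] * 49 + ["People"] * 50 + ["Community"]
rd_values = build_descriptors(values)

# Position (index 2) equal to '1': the first record of every group
first_per_group = [v for v, rd in zip(values, rd_values) if rd.split(":")[2] == "1"]

# Group size (index 1) equal to '1': only values that occur exactly once
only_singletons = [v for v, rd in zip(values, rd_values) if rd.split(":")[1] == "1"]

print(first_per_group)
print(only_singletons)
```

Filtering on the position keeps one row each for Data, People and Community, which is the select distinct behaviour asked for; filtering on the group size keeps only Community.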

I hope this explanation is clear and helpful. Please do not hesitate to ask if you have any further questions.

Sincerely,

Danny

Radziah
Data Voyager
March 6, 2025

Hi @DannyRyan

Thank you! This time it works as expected, as shown in the image below. Say I have more columns whose distinct values I want to read, named column2 and column3. If I want to see the distinct values across all columns, should I add column2 and column3 under src_value under Expressions in the Record Descriptor Builder?

DannyRyan
Head of Technical Training
March 6, 2025

Hi @Radziah

You're welcome! I'm glad to hear the previous solution addressed your needs.

Yes, you are absolutely correct. To obtain distinct values across multiple columns you should include those columns in the "Partition By" section of the Record Descriptor Builder. This effectively groups the records based on the combined values of all specified columns.

Think of it as concatenating the values of those columns (column1, column2, column3, etc.) and then identifying the unique combinations.

To illustrate this, I've created an example using six "food groups," with each group represented by a separate column in the "Partition By" section.

In the attached screenshot, you'll notice the following:

Record 1 has a "group size" of 1, indicating that the specific combination of values across all six food group columns is unique within the dataset.
Records 3 and 4, on the other hand, have a "group size" of 2. This signifies that these two records share identical values across all six food group columns.
By adding more columns to the "Partition By" section, you're essentially expanding the criteria for determining distinctness, allowing you to identify unique combinations across a broader set of attributes.
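Grouping on the combined values of several columns can be sketched as follows. This is an illustrative Python model, not the Record Descriptor Builder itself: the rows and their values are made up, and partition_by is a hypothetical name standing in for the columns listed in the "Partition By" section.

```python
from collections import Counter

# Made-up sample rows; column names follow the question above
rows = [
    {"column1": "apple", "column2": "rice", "column3": "milk"},
    {"column1": "pear", "column2": "rice", "column3": "milk"},
    {"column1": "pear", "column2": "oats", "column3": "milk"},
    {"column1": "pear", "column2": "oats", "column3": "milk"},
]

# Hypothetical stand-in for the "Partition By" columns
partition_by = ["column1", "column2", "column3"]


def group_key(row):
    """Combine the partition columns into one key, like concatenating them."""
    return tuple(row[column] for column in partition_by)


# Group size for every record, based on the combined key
group_sizes = Counter(group_key(row) for row in rows)
sizes_per_row = [group_sizes[group_key(row)] for row in rows]

# One entry per unique combination, in order of first appearance
distinct_combinations = list(dict.fromkeys(group_key(row) for row in rows))

print(sizes_per_row)
print(distinct_combinations)
```

Here the first record has a group size of 1 while the third and fourth share a group of size 2, mirroring the pattern described for the screenshot. Adding or removing names in partition_by changes which records count as identical.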

I hope this clarifies the process.

Sincerely,

Danny

select_distinct_records_multiple_columns.zip
