Is there a way to quickly select only distinct rows in Ataccama One Desktop?
(I now used a record descriptor and filter with Regex find(“.:.:1”, rd_column)
Is there a way to quickly select only distinct rows in Ataccama One Desktop?
(I now used a record descriptor and filter with Regex find(“.:.:1”, rd_column)
Hi
Using the Record Descriptor Builder step is certainly the best practice approach to solve this use case.
An alternative to using the Filter step with Regular Expression is to use the built-in expression function word().
I have attached a quick example using the expression function word().
Thanks so much. Great how you documented it! Will certainly be added to our cookbook ;)
Hi
I downloaded your plan and ran it but unfortunately I could not see the results as it turned to be blank for both columns. What would be the the input and expected results look like on your end? As I have similar use case to replicate the select distinct values from Snowflake in Ataccama.
Dear @Radziah,
Thank you for reaching out and for trying out the plan. I understand you're experiencing an issue with blank results in both columns, and I'm happy to help you troubleshoot this.
To better understand the situation, would you mind sharing a screenshot of your plan's configuration and the execution results? This will give me valuable context.
The original plan is designed to generate 100 random records, and it should function correctly. Given the randomized nature of the input data, it's worth noting that running the plan multiple times can produce varied datasets, which may affect the distinct values reflected in the output.
It's also possible that there might have been modifications to the plan, potentially impacting the flow of the 100 randomly generated records into the RecordDescriptorBuilder step. Sharing your plan file would allow us to review it together and pinpoint any potential discrepancies.
Please feel free to attach the screenshot and your plan file to your reply. I look forward to assisting you in resolving this issue.
Hi
Thanks for prompt reply, I did not change anything in the plan and ran it as it is. Please find my attached plan and the screenshot of the output.
i tried to change the 100 random records generation to 50 also gave me blank output.
Hi
I've made some tweaks to the plan and attached a new version for you. This one should now show both all the records and the filtered ones, which I hope will be much clearer!
Here's a breakdown of what I've changed:
Let's talk about the expression change in the Filter/Extract Filter step:
word(rd_value,2) = '1'
word(rd_value,1,':') = '1'
Here's why:
I really hope this updated plan works well for you and helps you get a better grasp of:
Please don't hesitate to reach out if you have any more questions or need further clarification. I'm happy to help!
Best regards,
Danny
Thanks
I ran the latest plan couple of times only got the expected output which I would say only Community will be out in the output_distinct since the group size is 1? Based on the rd_value of 3:1:1.
My understanding of select distinct is let’s say we have generated 49 rows of ‘Data’, 50 rows of ‘People’ and 1 row of ‘Community’, isn’t it supposed to have three rows of output in total; 1 row ‘Data’, 1 row ‘People’ and 1 row ‘Community’? In your plan after I ran multiple times, the only output_distinct that I managed to get is like in the image below:
Dear
Thank you for your inquiry. It appears your specific requirement for extracting unique values from a dataset differs from the original post's context.
To achieve the desired outcome of displaying only unique values for each group within your data, you can utilize the following implementation.
Record descriptors, as previously mentioned, consist of three components separated by colons (:). These components are:
To select only one record per Group ID, we can apply a filter based on the position within the group. Specifically, we can filter for records where the position is equal to 1. This can be expressed as:
word(rd_value, 2, ':') = '1'
This condition can be interpreted as follows:
Consequently, the output will comprise all records with a Position within Group value of 1, effectively representing the first record from each distinct group.
I hope this explanation is clear and helpful. Please do not hesitate to ask if you have any further questions.
Sincerely,
Danny
Hi
Thank you! This time it works as expected as shown in the image below. Let’s say if I have more columns to read the distinct values, additional columns named column2 and column3. If i want to see the distinct values of all columns, I should add the column2 and column3 under src_value under Expressions in Record Descriptor Builder?
Hi
You're welcome! I'm glad to hear the previous solution addressed your needs.
Yes, you are absolutely correct. To obtain distinct values across multiple columns you should include those columns in the "Partition By" section of the Record Descriptor Builder. This effectively groups the records based on the combined values of all specified columns.
Think of it as concatenating the values of those columns (column1, column2, column3, etc.) and then identifying the unique combinations.
To illustrate this, I've created an example using six "food groups," with each group represented by a separate column in the "Partition By" section.
In the attached screenshot, you'll notice the following:
Record 1 has a "group size" of 1, indicating that the specific combination of values across all six food group columns is unique within the dataset.
Records 3 and 4, on the other hand, have a "group size" of 2. This signifies that these two records share identical values across all six food group columns.
By adding more columns to the "Partition By" section, you're essentially expanding the criteria for determining distinctness, allowing you to identify unique combinations across a broader set of attributes.
I hope this clarifies the process.
Sincerely,
Danny
No account yet? Create an account
Enter your E-mail address. We'll send you an e-mail with instructions to reset your password.