Skip to main content

Is there a way to quickly select only distinct rows in Ataccama One Desktop? 

(I now used a record descriptor and filter with Regex find(“.:.:1”, rd_column)

Hi @Marnix Wisselaar 
Using the Record Descriptor Builder step is certainly the best practice approach to solve this use case.

An alternative to using the Filter step with Regular Expression is to use the built-in expression function word().

I have attached a quick example using the expression function word().

 


Thanks so much. Great how you documented it! Will certainly be added to our cookbook ;)

 


Hi ​@DannyRyan 

I downloaded your plan and ran it but unfortunately I could not see the results as it turned to be blank for both columns. What would be the the input and expected results look like on  your end? As I have similar use case to replicate the select distinct values from Snowflake in Ataccama.


Dear @Radziah,

Thank you for reaching out and for trying out the plan. I understand you're experiencing an issue with blank results in both columns, and I'm happy to help you troubleshoot this.

To better understand the situation, would you mind sharing a screenshot of your plan's configuration and the execution results? This will give me valuable context.

The original plan is designed to generate 100 random records, and it should function correctly. Given the randomized nature of the input data, it's worth noting that running the plan multiple times can produce varied datasets, which may affect the distinct values reflected in the output.

It's also possible that there might have been modifications to the plan, potentially impacting the flow of the 100 randomly generated records into the RecordDescriptorBuilder step. Sharing your plan file would allow us to review it together and pinpoint any potential discrepancies.

Please feel free to attach the screenshot and your plan file to your reply. I look forward to assisting you in resolving this issue.


Hi ​@DannyRyan 

Thanks for prompt reply, I did not change anything in the plan and ran it as it is. Please find my attached plan and the screenshot of the output.

i tried to change the 100 random records generation to 50 also gave me blank output.


Hi ​@Radziah 

I've made some tweaks to the plan and attached a new version for you. This one should now show both all the records and the filtered ones, which I hope will be much clearer!

Here's a breakdown of what I've changed:

  • Replacing the Filter step: I switched the regular "Filter" step to an "Extract Filter" step. This allows us to output both the full dataset and the filtered results within the same pipeline, making it easier to see what's happening.
  • Adding a Text File Writer: I've included an extra "Text File Writer" step to capture all the records. This way, you'll have a complete view of the data.
  • Fixing the filter condition: This was the key change! The original filter condition was preventing any results from showing up.

Let's talk about the expression change in the Filter/Extract Filter step:

  • Original: word(rd_value,2) = '1'
  • Updated: word(rd_value,1,':') = '1'

Here's why:

  • The "Record Descriptor" has three values separated by colons (":").
  • The first value is the "Group ID," the second is the "Group Size," and the third is the "Position" within the "Group ID."
  • To find distinct records, we need to look for records where the "Group Size" is '1' (meaning it's the only record of its kind).
  • The original expression was looking at the wrong index. We needed to target the "Group Size" (index 1), not the "Position" (index 2).
  • Also we needed to tell the word() function that the delimiter was a colon.
  • So, by changing the index to 1 and explicitly setting the delimiter to ":", we're now correctly checking for records with a "Group Size" of '1'.

I really hope this updated plan works well for you and helps you get a better grasp of:

  • Record Descriptors
  • The word() expression
  • Debugging plans and expressions

 

Please don't hesitate to reach out if you have any more questions or need further clarification. I'm happy to help!

Best regards,

Danny


Thanks ​@DannyRyan 

I ran the latest plan couple of times only got the expected output which I would say only Community will be out in the output_distinct since the group size is 1? Based on the rd_value of 3:1:1.

My understanding of select distinct is let’s say we have generated 49 rows of ‘Data’, 50 rows of ‘People’ and 1 row of ‘Community’, isn’t it supposed to have three rows of output in total; 1 row ‘Data’, 1 row ‘People’ and 1 row ‘Community’? In your plan after I ran multiple times, the only output_distinct that I managed to get is like in the image below:

 


Reply