Issue: We receive files to be tested against our DQ rules (Monitoring Projects) every two weeks. Each set of files is referenced by its date created (e.g. 2024-10-01, 2024-10-14, ...). When running Monitoring Projects against these items we only get the Execution Date, which can differ from the file date. When extracting data for reporting from ONE we only have the Extraction Date and don't necessarily know which file date the plan results are for (sometimes we have to rerun old files against new rules, etc.).
Question: Is there a way to add a custom attribute / parameter for when we run the Monitoring Project (at the processing id level) so we can report our DQ results by File date?
The two options we could think of:
In the extraction component, connect to the catalog items and get the file date, which is in the table itself. (We tried this, but we need to extract history, so we would need the file date, catalog item, and Monitoring Project processing id to join with the DQ Aggregation Results tasks in the component; it didn't seem like that was possible, and it would be expensive.)
Add a custom attribute/parameter that we are prompted to fill in when running the Monitoring Project (and could set through an API call for automation) that can be brought into the Monitoring Project results without having to connect to the actual catalog item data.
Thanks for any help in advance!
Greg
Hi Greg, to understand the requirement better, it would be great if you could explain what kind of data from the platform you expect to see in the report (if I understand correctly, the data prepared by the platform is consumed by some external data viz tool). Apart from that, could you please let us know which platform version you're on at the moment?
You can define the "shape" of your export data in one of two ways. One option is Post Processing Components (Exports) for each catalog item in the monitoring project, which, depending on the number of such CIs, can be quite a complex and time-consuming task since you'll have to build some processing logic in each of the PPCs. The other is to use the default PPC output from the platform, which is usually the source data (which can include the file date, as you mentioned) plus technical attributes like valid/invalid rules and explanations, and process everything in one place on a ONE runtime server where you can run different scheduled jobs. Some of our customers export platform metadata, join it with the aggregated DQ results from the platform, and then join that with the output generated by the post processing component. Having aggregated DQ results, platform metadata, and PPC exports in one place gives you a lot of flexibility to prepare the reports in a format suitable for you.
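To make the consolidation idea concrete, here is a minimal pandas sketch of joining the three feeds outside the platform. All table and column names here (processing_id, catalog_item, file_date, etc.) are hypothetical stand-ins for whatever your actual exports contain, not real Ataccama schema:

```python
import pandas as pd

# Hypothetical aggregated DQ results export
dq_results = pd.DataFrame({
    "processing_id": [101, 102],
    "rule_instance_id": ["r1", "r2"],
    "invalid_count": [5, 0],
})

# Hypothetical platform metadata export linking runs to catalog items
platform_metadata = pd.DataFrame({
    "processing_id": [101, 102],
    "catalog_item": ["customer", "orders"],
})

# Hypothetical PPC export that still carries the source file date
ppc_export = pd.DataFrame({
    "catalog_item": ["customer", "orders"],
    "file_date": ["2024-10-01", "2024-10-14"],
})

# Join aggregated DQ results to metadata, then to the PPC export,
# so every result row ends up tagged with its file date.
report = (dq_results
          .merge(platform_metadata, on="processing_id")
          .merge(ppc_export, on="catalog_item"))
print(report[["file_date", "catalog_item", "rule_instance_id", "invalid_count"]])
```

The same join logic could equally run as a scheduled job on the runtime server; pandas is only used here to show the shape of the merge.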
Regarding the API call parametrization approach, I've never had to implement this myself, so I'll need to check with someone from our engineering whether that's possible.
I hope this is helpful. Ivan
Hi @ivan.kozlov,
Thanks for getting back to me. Answers below (please let me know if they make sense):
We have two components we run from ONE Desktop (eventually to be automated on the Runtime Server) that derive DQ results from Monitoring Projects (by dimension, result, catalog item, and rule instance id). These are used along with custom post processing plans containing failed rows (by rule instance id) to create Tableau data sources for reporting DQ results by monitoring project execution. The catalog items in the monitoring projects are based on views that automatically increment to the latest file date when we receive new data (e.g. view customer_2024_10_18 -> new file arrives -> view becomes customer_2024_11_04).
The metadata for the file date and the monitoring project execution date need to be together in the feeds so that users can review results by file_date (file metadata) instead of, as now, selecting by execution date (DQ Monitoring Project metadata). The execution date could be many days later, or out of sequence, when we have to re-run files to re-assess their quality after modifying rules.
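As a small illustration of why execution date alone isn't enough, here is a hedged pandas sketch (invented column names and values) where a re-run makes execution order diverge from file order, and reporting keeps the latest execution per file_date:

```python
import pandas as pd

# Hypothetical results feed: the same file_date can be re-run on a much
# later execution_date after rules change, so execution order != file order.
results = pd.DataFrame({
    "file_date":      ["2024-10-18", "2024-11-04", "2024-10-18"],
    "execution_date": ["2024-10-20", "2024-11-05", "2024-11-10"],
    "passed_pct":     [92.0, 95.0, 97.0],
})

# For reporting, keep the latest execution per file_date so users select
# results by file_date rather than by when DQ happened to run.
latest = (results.sort_values("execution_date")
                 .groupby("file_date", as_index=False)
                 .last())
print(latest)
```

This is only the reporting-side view of the problem; it still assumes file_date has already been attached to each result row somehow, which is exactly the missing piece.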
I tried to add logic inside each of the components for this, but our export components are built around the DQ results tasks, not the data in the actual files. Additionally, we pull in the last five runs for trending of results, and that data is not easily accessible from a catalog item metadata call.
All this to say: if I could create a custom field/parameter at the Monitoring Project processing level (i.e. the file date) and add it to the DQ results for ingestion into Tableau, I could then choose results by file date. I was thinking of updating the metadata model to this end (adding an attribute) but am not sure of the implications or whether it would work.
Please let me know if that makes sense. ANY and ALL help is greatly appreciated!
Greg
@greglvaughan I think I understand what your problem is - there should be a way within the post processing plan to add a field to the result file. You should be able to add a "processing_date" field, always set to CURRENT_DATE(). This should allow you to have the date of the file as you listed above (ingestion), and then every time the monitoring project is run, it will record the date that data quality was run against the file.
Hi,
Thanks for the suggestion. Unfortunately, I need the data to flow the other way: the file date has to bubble up into the DQ results. The only way I have found to do this without messing with the metadata model is to add a filter to the catalog items in the monitoring project. I can then use the DQ Filter task to get the value and insert it into my DQ results extraction component.
The downside is that if you are already using a filter on your catalog items, as soon as you add a second one the whole thing breaks (this was noted in some other community posts). I am hoping a future release of the product will address this, but for now I am going to try and:
add a dummy catalog item with just one filter on file date
add the catalog item to the monitoring project
create a ‘branch’ in the extraction component file to get this value and merge it with the other results for inclusion in Tableau reports
I will let you know if it works.. Wish me luck ;)
gV
p.s we are on 14.5 and going to 15.3 in about 2 weeks.
The above solution worked. I was able to create a VCI that retrieved the file date from the views used to extract data into Ataccama for DQ rules analysis. I then created a simple completeness check on the fields, placed it into the Monitoring Project, and used the 'DQ Monitoring Project Values' task (among others) to get the file date by monitoring project processing id.
This id and the file date were then joined with the main DQ results flow in the component... and voilà!
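For anyone reproducing this, the final join is conceptually just a merge on processing id. Here's a minimal pandas sketch with hypothetical data (the real work happens in the ONE component's join step, not in Python):

```python
import pandas as pd

# Hypothetical output of the VCI/filter branch: one file date per
# monitoring project processing id.
file_dates = pd.DataFrame({
    "processing_id": [201, 202],
    "file_date": ["2024-10-18", "2024-11-04"],
})

# Hypothetical main DQ results flow extracted from the monitoring project.
dq_results = pd.DataFrame({
    "processing_id": [201, 201, 202],
    "rule_instance_id": ["r1", "r2", "r1"],
    "result": ["PASS", "FAIL", "PASS"],
})

# Tag every DQ result row with its file date for the Tableau feed.
tableau_feed = dq_results.merge(file_dates, on="processing_id")
print(tableau_feed)
```

With file_date on every row, the Tableau data source can be filtered and trended by file date instead of execution date.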
thanks for all the suggestions and help!
gV
Thank you so much for sharing the solution here for other members @greglvaughan!