
Hello,

I have a requirement to append records to a Parquet file during RDM on-publish synchronization. Does anyone know how to append data to a Parquet file in the RDM on-publish event?

I believe the Parquet file writer will write data to the file, but when a new record is created in RDM, will it append the data to the same file, or will it replace the file completely and create a new one?

Thanks for any help.

-Ojaswini

Hello community,

Here are my findings so far.

1. General behavior
- With Parquet files, each write operation creates a new file at the target location, so the old file has to be deleted. This makes it difficult to handle updates and deletions through the RDM UI's 'onPublish' functionality.

2. Adding a new record
- This can be done.
- Each time, a new file is generated containing the complete set of records: the union of the records received from the integration input during 'onPublish' and those retrieved from the RDM extended Reader. Note that if the RDM extended Reader is not used, the file will contain only the new records.

3. Updating a record
- The Parquet file writer does not distinguish between original and updated records. It writes both to the resulting file: the original records from the RDM extended Reader and the updated records received from the integration input.
- It has no concept of a primary key.
- Because RDM applies the update only after the on-publish process has completed, during the on-publish plan execution there can be two records with the same ID: the original record and the updated one. Both are written to the file.

For example:

record1: C Cancel (original)

record2: C Cancelled (updated)

4. Deleting a record
- Because RDM applies the deletion only after the on-publish process has completed, during the on-publish plan execution the input contains both the original records and the records marked for deletion. Both are written to the file.
- A condition step (e.g., to distinguish between New, Updated, and Deleted records) cannot be used to route records to separate outputs, because DQC (Data Quality Center) allows only one Parquet file writer at a time for a single on-publish operation. This makes it impossible to segregate records by status while writing.

However, the Parquet file can still be written correctly by taking the integration input from the on-publish step and the records from the RDM extended Reader, and joining the two. Once they are joined, filter out the deleted records before writing to the Parquet file.
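To illustrate the join-and-filter idea outside of the plan, here is a minimal Python sketch. It is not Ataccama code: the record layout, the `status` field with values `NEW` / `UPDATED` / `DELETED`, and the function names are all assumptions made for the example. It compares a plain union (which reproduces the duplicate problem described above) with a merge keyed on the record ID.

```python
# Hypothetical data: "id" is the record key, "status" marks the change type.
# Neither field name comes from RDM; they are assumptions for this sketch.
existing = [  # stands in for the records read by the RDM extended Reader
    {"id": "A", "name": "Active"},
    {"id": "C", "name": "Cancel"},
    {"id": "D", "name": "Draft"},
]
changes = [  # stands in for the integration input received on publish
    {"id": "C", "name": "Cancelled", "status": "UPDATED"},
    {"id": "D", "name": "Draft", "status": "DELETED"},
    {"id": "N", "name": "New", "status": "NEW"},
]


def naive_union(existing, changes):
    """Plain union: originals, updates and deleted records all end up in the output."""
    return existing + [{"id": c["id"], "name": c["name"]} for c in changes]


def merge_by_key(existing, changes, key="id"):
    """Join on the key, let changes override originals, and drop deleted records."""
    merged = {rec[key]: rec for rec in existing}
    for change in changes:
        if change["status"] == "DELETED":
            merged.pop(change[key], None)
        else:  # NEW or UPDATED
            merged[change[key]] = {k: v for k, v in change.items() if k != "status"}
    return list(merged.values())


if __name__ == "__main__":
    print(len(naive_union(existing, changes)))  # duplicates for C and D are included
    for rec in merge_by_key(existing, changes):
        print(rec)
```

The result of `merge_by_key` is the complete record set that would then be handed to the single Parquet writer, which still replaces the whole file on each publish.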

I hope this will help someone.

Thanks,

Ojaswini


Thank you for sharing your solutions @Ojaswini 🙋‍♀️

