
Hello,

I am currently working on a data masking project in ONE Desktop and I'm seeking some advice on how to approach a specific scenario using two tables.

I have two CSV files:

  1. Classification Table (CSV #1): This table contains two columns - 'ColumnName' and 'DataClassification'. The 'ColumnName' field lists the names of various columns, and 'DataClassification' provides their respective data classifications, such as "Restricted", "Confidential", "Public", or it may be blank.

  2. Data Table (CSV #2): This table contains multiple columns, named according to the entries in the 'ColumnName' field of the Classification Table. Each column holds different sets of data.

Classification Table (CSV #1):

ColumnName   DataClassification
Name         Confidential
Age          Public
Email        Restricted


Data Table (CSV #2):

Name    Age   Email
Alice   30    alice@example.com
Bob     25    bob@example.com


My goal is to mask the data in the Data Table (CSV #2) based on the classification provided in the Classification Table (CSV #1). For instance, if a column is marked as "Confidential" in CSV #1, the entire corresponding column in CSV #2 should be masked accordingly.

I'm looking for guidance or suggestions on the most scalable way to achieve this without resorting to writing a custom step. It seems like all of the available steps in the desktop app either go row by row, which is an issue when I'm trying to conditionally mask an entire column, or require manual input to specify the columns.

Hi @danielsu 

The most scalable way is to leverage the fact that Ataccama plans are actually XML files.
So what you can do is:

  1. Use the Classification Table as an input
  2. Generate a plan that will read the CSV file based on the column list
  3. Apply the necessary transformations based on the classification (see the sketch after this list)
  4. Write the results wherever you want.
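To illustrate the per-column decision that such a generated plan needs to encode, here is a minimal sketch in plain Python (not Ataccama), assuming hypothetical file names classification.csv / data.csv and placeholder masking rules:

```python
import csv

# Placeholder masking rules keyed by the DataClassification values from CSV #1.
# These are illustrative only, not Ataccama masking functions.
MASK_RULES = {
    "Restricted":   lambda v: "*" * len(v),                            # mask everything
    "Confidential": lambda v: v[:1] + "*" * (len(v) - 1) if v else v,  # keep first character
}

# 1. Read the Classification Table: column name -> classification.
with open("classification.csv", newline="") as f:
    classification = {row["ColumnName"]: row["DataClassification"].strip()
                      for row in csv.DictReader(f)}

# 2. Read the Data Table and mask each column according to its classification.
with open("data.csv", newline="") as f_in, \
     open("data_masked.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col, value in row.items():
            rule = MASK_RULES.get(classification.get(col, ""))
            if rule:
                row[col] = rule(value)
        writer.writerow(row)
```

In the generated plan, this decision is baked into the XML itself, so no step has to reason about column names at runtime.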

This approach, though, implies that your Classification Table holds all the columns from the Data Table, in the correct order. If this is not the case, you can use a semi-manual approach:

  1. Create a generator plan as in the previous approach, but this time use Integration Input/Output steps instead of the actual Read/Write steps, and use the .comp extension for the result
  2. Manually create a plan that reads data from the Data Table and writes it to the desired destination
  3. Put the component created in step #1 into the plan created in step #2 and use Auto-Map to link the columns.

@AKislyakov 

If I have the columns that need to be masked in a lookup table, is it possible for the columns to be fed into an alter column step or similar?

We’re trying to make this as generic as possible as part of an export project execution, so any help would be appreciated.


No, the problem is that inside an expression you don’t know what the name of the target column is. Another approach could be to transpose the data to a vertical format, apply masking based on Column Name, and then transform it back.

Row ID   Column Name   Column Value
1        Name          Alice
1        Age           30
1        Email         alice@example.com


The problem with this approach is that these transformation parts need manual configuration and are error-prone.
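For reference, here is roughly what the transpose, mask by Column Name, and pivot-back sequence does, expressed as a plain Python sketch (hypothetical file names, a simple full-mask rule, not actual Ataccama steps):

```python
import csv

# Assumption: these classifications get masked; everything else passes through.
MASKED = {"Restricted", "Confidential"}

with open("classification.csv", newline="") as f:
    classification = {r["ColumnName"]: r["DataClassification"].strip()
                      for r in csv.DictReader(f)}

with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# 1. Transpose to vertical format: (Row ID, Column Name, Column Value).
vertical = [(i, col, val)
            for i, row in enumerate(rows, start=1)
            for col, val in row.items()]

# 2. Mask by Column Name: a single rule on a single value column,
#    which is exactly what a row-by-row step can do.
masked = [(rid, col, "*" * len(val) if classification.get(col) in MASKED else val)
          for rid, col, val in vertical]

# 3. Pivot back to the original horizontal layout.
pivoted = {}
for rid, col, val in masked:
    pivoted.setdefault(rid, {})[col] = val

with open("data_masked.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for rid in sorted(pivoted):
        writer.writerow(pivoted[rid])
```

In a plan, the transpose and pivot-back parts are the pieces that need the manual, column-specific configuration mentioned above.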


@AKislyakov Thank you so much for the reply. I’ve been looking into possible solutions and I’m also wondering if there’s any function in the Ataccama expression language to get the current entry’s column name? Something like a simple getColumnName()? That would solve the problem as well, since I could pair it with a lookup table of the columns that need to be masked.

Also, could you clarify step #2?

  1. Generate a plan that will read the CSV file based on the column list

How exactly can I read a CSV file based on the column list without hardcoding an expression in the input reader?

I’m also a little confused on what you meant by leveraging the fact that Ataccama plans are XML files. How can that be leveraged to solve this problem?

Appreciate the support.


If you open an Ataccama plan in a text editor or click the XML button at the top of ONE Desktop, you will see that the plan is actually an XML file.

[screenshot of a plan opened in the XML view]

Thus, you can generate plans just like any other XML files.

In the attached example (bin/csv_read_write_generator.comp) I take an input that contains a list of CSV files and their columns in the following format:

FOLDER_NAME   FILE_NAME      FILE_EXT   ORDER   COLUMN_NAME
../data/in/   sample_input   csv        1       col_1
../data/in/   sample_input   csv        2       col_2


For each combination of FOLDER_NAME, FILE_NAME, and FILE_EXT, it generates a plan in the generated_plans folder that:

  1. Reads the FOLDER_NAME/FILE_NAME.FILE_EXT file with the specified columns;
  2. Adds a new column;
  3. Writes original columns to the FOLDER_NAME/../out/FILE_NAME.FILE_EXT output file.
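The generator itself is essentially text templating. A rough Python equivalent of that grouping-and-writing logic, with a hypothetical input file name and placeholder XML element names (they are not the real Ataccama plan schema; in practice you copy the XML of a plan you built once in ONE Desktop and substitute the file path and column list):

```python
import csv
import xml.etree.ElementTree as ET
from itertools import groupby
from pathlib import Path

def file_key(r):
    """Group rows belonging to the same input file."""
    return (r["FOLDER_NAME"], r["FILE_NAME"], r["FILE_EXT"])

with open("file_columns.csv", newline="") as f:  # hypothetical generator input
    rows = sorted(csv.DictReader(f), key=lambda r: file_key(r) + (int(r["ORDER"]),))

Path("generated_plans").mkdir(exist_ok=True)

for (folder, name, ext), group in groupby(rows, key=file_key):
    # NOTE: "plan", "reader", "writer", "column" are placeholder element names,
    # not the actual Ataccama plan schema.
    plan = ET.Element("plan")
    reader = ET.SubElement(plan, "reader", fileName=f"{folder}{name}.{ext}")
    columns = ET.SubElement(reader, "columns")
    for row in group:
        ET.SubElement(columns, "column", name=row["COLUMN_NAME"])
    ET.SubElement(plan, "writer", fileName=f"{folder}../out/{name}.{ext}")
    ET.ElementTree(plan).write(f"generated_plans/{name}.plan",
                               encoding="utf-8", xml_declaration=True)
```

The sketch covers only the read and write parts of the template; the new column from step 2 would be one more element substituted in the same way.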

Here is an example of what a generated plan looks like:

[screenshot of a generated plan]


Hi @danielsu - I’m closing this thread for now. If you have any questions please don’t hesitate to follow up here or create a new post 🙋‍♀️

