Data profiling: speed and partial results

(Tom De Decker) #1

Hi everybody,

We started playing around with the DQ Analyzer today, using different database connections. One thing we did was attempt making a data profile of one of the larger tables in our IT environment (37 mio rows, about 40 GB of data). Meanwhile, one of our dba’s monitored the resource usage to make sure nothing too impactful could happen to our production system.

Some findings:

  • (1) Resource usage on the source (Oracle) DB is really low. Which is excellent. Is seems like DQA is doing hardly anything more than a pure select * statement. CPU usage of my computer is also well within acceptable bounds.

  • (2) On the other hand, it takes quite a lot of time to finish processing that many rows. After about 1h, the run had processed about 10 mio rows, after which I aborted the job. Given (1), this made us wonder if putting some of the calculations on the “select” part of the job could lead to gain in processing speed. Obviously, we don’t know what actually happens in the background, so we might be way off.

  • (3) The previous point also made us wonder about whether/how DQA could handle actual table monstrosities (100+ GB). Is Ataccama One faster or more efficient than DQA?

  • (4) Finally, but also related, we were a bit disappointed to see that, after we interrupted the run, there were no partial results for those first 10 mio rows that did appear to be processed (according to the Console). Maybe that could be a nice feature to add (e.g. after cancelling a job, prompt the user whether he wants to see profiling results for the processed rows)?



(Katrin Popova) #2

Hello Tom,

Thank you for providing us feedback, we value your opinion.

You are correct in point (1): Ataccama does not perform any calculations in the database. It just performs a select to get the data from there. All the processing is done where the application is installed (laptop or sever for the commercial product). For large amount of data usually runtime properties configuration is important and also making sure that the environment is set correctly (e.g. Java memory parameters).

Canceling job in DQA causes the data processing and aggregations to stop immediately (the profiling report is not available during processing). For results of data samples please consider to either constrain your SQL query in DB reader step (e.g. in where condition) or Condition step to control the data flow.

The biggest difference is that Ataccama ONE runs on server. So you could have one server with great performance and multiple users can use it. Ataccama ONE is using similar approach for data processing. The processing would take the same amount of time (on the same HW), if it is done by the same engine, so the performance would be the same. In addition, Ataccama ONE has sampling feature that would allow huge tables to be sampled and the ability to profile part of it. The processing would be significantly faster based on the sample size. There is also possibility to run profiling on cluster.