We started playing around with the DQ Analyzer today, using different database connections. One thing we did was attempt to create a data profile of one of the larger tables in our IT environment (37 million rows, about 40 GB of data). Meanwhile, one of our DBAs monitored the resource usage to make sure nothing too impactful could happen to our production system.
(1) Resource usage on the source (Oracle) DB is really low, which is excellent. It seems like DQA is doing hardly anything more than a plain select * statement. CPU usage on my computer is also well within acceptable bounds.
(2) On the other hand, it takes quite a lot of time to finish processing that many rows. After about one hour, the run had processed roughly 10 million rows, at which point I aborted the job. Given (1), this made us wonder whether pushing some of the calculations into the “select” part of the job could lead to a gain in processing speed, along the lines of the sketch below. Obviously, we don’t know what actually happens in the background, so we might be way off.
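To make point (2) a bit more concrete, here is a rough sketch of what we had in mind by “pushing calculations into the select part”: letting Oracle compute simple aggregates (counts, distinct counts, min/max, averages) so that only a single summary row travels over the network, instead of streaming all 37 million rows to the client. The table, column names and connection details below are invented purely for illustration, and we have no idea whether DQA can actually be configured this way.

```python
# Hypothetical sketch: compute a few per-column profile statistics on the
# database side instead of streaming every row to the client.
# Table, columns and connection details are made up for illustration.
import oracledb

SQL = """
SELECT
    COUNT(*)                    AS row_count,
    COUNT(customer_id)          AS non_null_ids,
    COUNT(DISTINCT customer_id) AS distinct_ids,
    MIN(order_date)             AS min_order_date,
    MAX(order_date)             AS max_order_date,
    AVG(order_amount)           AS avg_order_amount
FROM sales.orders
"""

def profile_on_db_side(dsn: str, user: str, password: str) -> dict:
    """Run the aggregates in Oracle and fetch only the one-row summary."""
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SQL)
            columns = [d[0].lower() for d in cur.description]
            return dict(zip(columns, cur.fetchone()))

if __name__ == "__main__":
    stats = profile_on_db_side("dbhost/orclpdb1", "report_user", "secret")
    for name, value in stats.items():
        print(f"{name}: {value}")
```

Of course, doing it this way would shift the work onto the Oracle server, so the very low source-side resource usage we praised in (1) would no longer hold; it would be a trade-off between network/client time and database load.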
(3) The previous point also made us wonder whether, and how, DQA would handle real table monstrosities (100+ GB). Is Ataccama One faster or more efficient than DQA for that kind of volume?
(4) Finally, and related to this, we were a bit disappointed to see that, after we interrupted the run, there were no partial results for those first 10 million rows that did appear to have been processed (according to the Console). Maybe that could be a nice feature to add (e.g. after cancelling a job, prompt the user to ask whether they want to see profiling results for the rows processed so far)?