x
I'm sorry, but you have been denied access to edit this topic.
Georg Verweyen
sad I’m sad

Large number of records

Starting a test with bigger table, I detect that the DQ_Analyzer can "only" work with 1.703.936 records. With 1.736.704 records the programs ended with a "java.lang.OutOfMemoryError: Java heap space" message. There was no resource bottleneck. This hint to a java configuration problem (yes, I read the configuration hint http://community.ataccama.com/ataccam.... Why need the program these memory? Can you switch of same statistics to live with less memory?

The whole table as over 22 mio records.
1 person has
this problem
+1
Reply

  • Hi Georg,

    Thank you for reporting the problem. It's actually not the number of records that forces DQ Analyzer to need more memory - in fact, it stores at most ~10,000 values for each column. Instead, it's probably the width of the record (number of columns or size of the data) that consumes the space. The value of 10,000 is a memory-to-performance tradeoff ratio we came up with during our measurements of "typical" tables - but it might not fit all scenarios.

    The easiest way to overcome the problem is indeed to increase memory for the runtime process as noted in the message you mentioned in the post. Because most of the 10,000 buckets were probably filled in, if you were able to process 1M records with 256MB memory, 512MB should be probably be enough to handle the 22M record load.

    There is also another (but undocumented in DQ Analyzer) way to deal with the problem. Data Quality Center engine comes with a set of performance tuning parameters that can be used to tune up various aspects of the process. Since DQ Analyzer uses this engine as well, the same parameters can be used to tune its performance.

    The "magic parameter" is "sorter.inMemory", which normally contains the value 10,000 mentioned before. We can set it to, say, 1000 to reduce memory consumption - the cache will hold 10x less records, but the performance should not drop rapidly (say, 5% performance loss in this case).

    Now the final part - how to set this up:

    • First, use the "plan file" option in Create profile dialog.
    • Click the arrow just behind the run button and choose "Run configuration..."
    • Click the Arguments tab and type "-Dsorter.inMemory=1000" (without the quotes) into the "VM arguments" field
    • Choose Apply and Run - the lower memory settings will be used this time. The same parameters will be used for any subsequent launches of the same plan file.


    Let us know if this solves the issue for you.

    Pavel

  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    sad I’m frustrated
    Hi Pavel,

    a first quick check of the solution: 3.000.000 records runs and 10.000.000 same error again.

    28.06.2010 06:51:34 [INFO] Running runtime...
    28.06.2010 06:51:51 [FATAL] Error in step tag_small (cz.adastra.cif.tasks.io.jdbc.read.JdbcReader)

    java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:1633)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1410)
    at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2887)
    at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
    at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2582)
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1758)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2172)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2690)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2619)
    at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1465)
    at cz.adastra.cif.model.internal.modelimpl.ConnectionPool$_E.executeQuery(ConnectionPool.java:707)
    at cz.adastra.cif.tasks.io.jdbc.read.JdbcReaderInstance.run(JdbcReaderInstance.java:120)
    at cz.adastra.cif.processor.internal.runner.B.E(ComplexStepNode.java:62)
    at cz.adastra.cif.processor.internal.runner.F.run(Node.java:115)
    at java.lang.Thread.run(Unknown Source)
    28.06.2010 06:51:51 [INFO] Stopping runtime...


    I will check the second hint this evening.

    Georg V.

    By the way the table has four columns...
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Hi Georg,

    Thank you for the error log - this sheds more light to the problem. The true cause for the problem is a MySQL driver's little-known "feature". Some time ago, they added a new protocol of obtaining data from the server (or maybe they modified the existing) so that it first reads the entire dataset and then provides it to the application (here, DQ Analyzer) - thus giving it no chance of reading huge tables. We previously thought that the issue occurred in a few versions of the driver and was subsequently fixed, but it seems that it is not the case.

    Fortunately, a workaround for the problem exists. The driver can be switched into "fetch" mode where it reads only a certain number of rows into memory before it provides them to the application. The fetch mode can be switched on by editing the connection URL in the database connection definition in DQ Analyzer. Normally, the URL looks something like this:

    jdbc:mysql://<host>:3306/<database>?characterEncoding=utf8&useUnicode=true

    To switch the fetch mode on, one has to add "&useCursorFetch=true&defaultFetchSize=100" at the end:

    jdbc:mysql://<host>:3306/<database>?characterEncoding=utf8&useUnicode=true&useCursorFetch=true&defaultFetchSize=100

    This should fix the problem for good - let me know if that worked for you as well. We will now check the recent drivers from MySQL team and see if the problem persists - and make sure the driver developers are aware of the problem.

    Pavel
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    sad I’m tired
    Hi Pavel,

    the modification of the jdbc connection string works. These morning the 3 millons data records needs 02:22 min. Now the first checks with 1,6 millions data records need 11:43 min. I will start a full run, but I calculate the total runtime of 03:07:00.

    Georg
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Hi Georg,

    Maybe try to set the defaultFetchSize parameter to 1000 instead of 100. If the table contains just a few columns, it should improve the performance.

    Pavel
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    Hi Pavel,

    I clean up my disk and start the runs against the 20.4 mio records table (data was from the OpenStreetMap-Projekt, all description tags of the german nodes and ways). After a hour my disk (more than 100 times free disk space from the MySQL table space) are completed filled by a temporary file.

    30.06.2010 06:01:31 [INFO] Running runtime...
    30.06.2010 06:58:23 [FATAL] Error in step tag (cz.adastra.cif.tasks.io.jdbc.read.JdbcReader)

    java.sql.SQLException: Incorrect key file for table 'C:\xampp\tmp\#sql814_18_6.MYI'; try to repair it
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3562)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3494)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1960)
    at com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1346)
    at com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:846)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1367)
    at com.mysql.jdbc.StatementImpl.createResultSetUsingServerFetch(StatementImpl.java:560)
    at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1397)
    at cz.adastra.cif.model.internal.modelimpl.ConnectionPool$_E.executeQuery(ConnectionPool.java:707)
    at cz.adastra.cif.tasks.io.jdbc.read.JdbcReaderInstance.run(JdbcReaderInstance.java:120)
    at cz.adastra.cif.processor.internal.runner.B.E(ComplexStepNode.java:62)
    at cz.adastra.cif.processor.internal.runner.F.run(Node.java:115)
    at java.lang.Thread.run(Unknown Source)
    30.06.2010 06:58:24 [INFO] Stopping runtime...
    30.06.2010 06:58:24 [INFO] Finished!
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Well, MySQL never stops surprising me. The file C:\xampp\tmp\#sql814_18_6.MYI is a MySQL temporary table that is created by MySQL engine or the driver (hard to tell) to allow for the fetch-based access to the table data. Not only is surprising that they have to actually create the table, but the size of the temporary table is always enormous. I went through the documentation and the only relevant discussion about the approach was that "it was easier to implement it that way" - so there is probably no option to read the data efficiently using the JDBC driver.

    There is, however, still an option how to actually read the data without consuming all the available memory or the disk space. Instead of JDBC, we can use ODBC - and as far as I can tell, ODBC driver seems to be working fine (no outrageous memory consumption or disk usage).

    To set up an ODBC connection to MySQL, you have to download and install the package from http://www.mysql.com/downloads/connec... , then go to "Control Panel" > "Administrative Tools" > "Data Sources (ODBC)", Click Add, select MySQL, fill in the required parameters and give the connection a name. In DQ Analyzer, click new database connection, choose the "ODBC" driver, and replace "<connection_name>" with the data source name you have selected while creating the ODBC data source. You should be able to use the new connection in the same way as you did with the JDBC one.

    It is worth mentioning that ODBC connections are generally slower than their JDBC counterparts, but it seems that it is the only option to get large data from a MySQL table, apart from exporting the data to a CSV file and analyzing the file instead.

    Hopefully, this will finally help you to analyze the data. Let me know if that worked as outlined above and how fast it was,

    Pavel
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen

    04.07.2010 13:34:37 [INFO] Using following licences:
    04.07.2010 13:34:37 [INFO] C:\Ataccama DQ Analyzer 6\runtime\lib
    04.07.2010 13:34:37 [INFO] Using following licences:
    04.07.2010 13:34:37 [INFO] C:\Ataccama DQ Analyzer 6\runtime\lib\..\license_keys\license.plf
    04.07.2010 13:34:37 [INFO] Using Ataccama DQ Analyzer configuration file C:\Users\Georg\Projekte\neue Projekte\openstreetmap\DQ\tag.plan
    04.07.2010 13:34:37 [INFO] Using runtime configuration file C:\Users\Georg\AppData\Local\Temp\runtimeCfg3676394729793320878.tmp
    04.07.2010 13:34:40 [INFO] Ataccama DQ Analyzer engine initialized.
    04.07.2010 13:34:40 [INFO] Creating runtime...
    04.07.2010 13:34:41 [INFO] Starting runtime...
    04.07.2010 13:34:41 [INFO] Running runtime...
    Out of memory (Needed 2343068 bytes)
    04.07.2010 13:38:07 [FATAL] Error in step tag (cz.adastra.cif.tasks.io.jdbc.read.JdbcReader)

    java.sql.SQLException: [MySQL][ODBC 5.1 Driver][mysqld-5.1.37-log]MySQL client ran out of memory
    at sun.jdbc.odbc.JdbcOdbc.createSQLException(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbc.standardError(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbc.SQLExecDirect(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbcStatement.execute(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbcStatement.executeQuery(Unknown Source)
    at cz.adastra.cif.model.internal.modelimpl.ConnectionPool$_E.executeQuery(ConnectionPool.java:707)
    at cz.adastra.cif.tasks.io.jdbc.read.JdbcReaderInstance.run(JdbcReaderInstance.java:120)
    at cz.adastra.cif.processor.internal.runner.B.E(ComplexStepNode.java:62)
    at cz.adastra.cif.processor.internal.runner.F.run(Node.java:115)
    at java.lang.Thread.run(Unknown Source)
    04.07.2010 13:38:26 [INFO] Stopping runtime...
    04.07.2010 13:38:27 [INFO] Finished!

    04.07.2010 13:38:27 [WARNING] TASK COMPLETED WITH ERRORS. Elapsed time: 00:03:46
    #46;.\license_keys\license.plf
    04.07.2010 13:34:37 [INFO] Using Ataccama DQ Analyzer configuration file C:\Users\Georg\Projekte\neue Projekte\openstreetmap\DQ\tag.plan
    04.07.2010 13:34:37 [INFO] Using runtime configuration file C:\Users\Georg\AppData\Local\Temp\runtimeCfg3676394729793320878.tmp
    04.07.2010 13:34:40 [INFO] Ataccama DQ Analyzer engine initialized.
    04.07.2010 13:34:40 [INFO] Creating runtime...
    04.07.2010 13:34:41 [INFO] Starting runtime...
    04.07.2010 13:34:41 [INFO] Running runtime...
    Out of memory (Needed 2343068 bytes)
    04.07.2010 13:38:07 [FATAL] Error in step tag (cz.adastra.cif.tasks.io.jdbc.read.JdbcReader)

    java.sql.SQLException: [MySQL][ODBC 5.1 Driver][mysqld-5.1.37-log]MySQL client ran out of memory
    at sun.jdbc.odbc.JdbcOdbc.createSQLException(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbc.standardError(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbc.SQLExecDirect(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbcStatement.execute(Unknown Source)
    at sun.jdbc.odbc.JdbcOdbcStatement.executeQuery(Unknown Source)
    at cz.adastra.cif.model.internal.modelimpl.ConnectionPool$_E.executeQuery(ConnectionPool.java:707)
    at cz.adastra.cif.tasks.io.jdbc.read.JdbcReaderInstance.run(JdbcReaderInstance.java:120)
    at cz.adastra.cif.processor.internal.runner.B.E(ComplexStepNode.java:62)
    at cz.adastra.cif.processor.internal.runner.F.run(Node.java:115)
    at java.lang.Thread.run(Unknown Source)
    04.07.2010 13:38:26 [INFO] Stopping runtime...
    04.07.2010 13:38:27 [INFO] Finished!

    04.07.2010 13:38:27 [WARNING] TASK COMPLETED WITH ERRORS. Elapsed time: 00:03:46


    The two javaw-process wanted 1 GB (each one) memory [Parameter: default memory for launch configuration 512 MB, default parallelism level (number of threads) 1]
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    sad I’m anxious
    Ok, restart the machine, stop all unnecessary jobs like skype, icq etc, setting the parameter default for launch configuration to 256 MB, press the button apply and start the job at 14:10.

    One javaw-process fetchs 1.4 GB memory and after 2 hours 53 minutes the processes ends without errors.

    04.07.2010 17:02:24 [INFO] Current status: Reading data (pass 1/2) - 20442600 records read
    04.07.2010 17:02:29 [INFO] Current status: Reading data (pass 1/2) - 20444319 records read
    04.07.2010 17:02:29 [INFO] Current status: Computing profile (pass 2/2) - 0% complete
    04.07.2010 17:02:34 [INFO] Current status: Computing profile (pass 2/2) - 2% complete
    04.07.2010 17:02:39 [INFO] Current status: Computing profile (pass 2/2) - 4% complete
    04.07.2010 17:02:44 [INFO] Current status: Computing profile (pass 2/2) - 6% complete
    04.07.2010 17:02:49 [INFO] Current status: Computing profile (pass 2/2) - 8% complete
    04.07.2010 17:02:54 [INFO] Current status: Computing profile (pass 2/2) - 9% complete
    04.07.2010 17:02:59 [INFO] Current status: Computing profile (pass 2/2) - 11% complete
    04.07.2010 17:03:04 [INFO] Current status: Computing profile (pass 2/2) - 13% complete
    04.07.2010 17:03:09 [INFO] Current status: Computing profile (pass 2/2) - 13% complete
    04.07.2010 17:03:14 [INFO] Current status: Computing profile (pass 2/2) - 15% complete
    04.07.2010 17:03:19 [INFO] Current status: Computing profile (pass 2/2) - 17% complete
    04.07.2010 17:03:24 [INFO] Current status: Computing profile (pass 2/2) - 19% complete
    04.07.2010 17:03:29 [INFO] Current status: Computing profile (pass 2/2) - 22% complete
    04.07.2010 17:03:34 [INFO] Current status: Computing profile (pass 2/2) - 75% complete
    04.07.2010 17:03:39 [INFO] Current status: Computing profile (pass 2/2) - 77% complete
    04.07.2010 17:03:44 [INFO] Current status: Computing profile (pass 2/2) - 79% complete
    04.07.2010 17:03:49 [INFO] Current status: Computing profile (pass 2/2) - 80% complete
    04.07.2010 17:03:54 [INFO] Current status: Computing profile (pass 2/2) - 82% complete
    04.07.2010 17:03:59 [INFO] Current status: Computing profile (pass 2/2) - 89% complete
    04.07.2010 17:04:04 [INFO] Current status: Computing profile (pass 2/2) - 93% complete
    04.07.2010 17:04:09 [INFO] Current status: Computing profile (pass 2/2) - 100% complete
    04.07.2010 17:04:41 [INFO] Stopping runtime...
    04.07.2010 17:04:41 [INFO] Finished!

    04.07.2010 17:04:41 [INFO] Task completed successfully. Elapsed time: 02:53:47


    thanks on all fellow combatant.

    Georg V.
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    I started a new effort creating a csv-file (on newer data) and running the DQ-Analyzer. Please notice the runtime.

    08.07.2010 07:45:42 [INFO] Current status: Reading data (pass 1/2) - 22427634 records read
    08.07.2010 07:45:42 [INFO] Current status: Computing profile (pass 2/2) - 1% complete
    08.07.2010 07:45:47 [INFO] Current status: Computing profile (pass 2/2) - 3% complete
    08.07.2010 07:45:52 [INFO] Current status: Computing profile (pass 2/2) - 5% complete
    08.07.2010 07:45:57 [INFO] Current status: Computing profile (pass 2/2) - 7% complete
    08.07.2010 07:46:02 [INFO] Current status: Computing profile (pass 2/2) - 9% complete
    08.07.2010 07:46:07 [INFO] Current status: Computing profile (pass 2/2) - 11% complete
    08.07.2010 07:46:12 [INFO] Current status: Computing profile (pass 2/2) - 13% complete
    08.07.2010 07:46:17 [INFO] Current status: Computing profile (pass 2/2) - 14% complete
    08.07.2010 07:46:22 [INFO] Current status: Computing profile (pass 2/2) - 17% complete
    08.07.2010 07:46:27 [INFO] Current status: Computing profile (pass 2/2) - 21% complete
    08.07.2010 07:46:32 [INFO] Current status: Computing profile (pass 2/2) - 75% complete
    08.07.2010 07:46:37 [INFO] Current status: Computing profile (pass 2/2) - 76% complete
    08.07.2010 07:46:42 [INFO] Current status: Computing profile (pass 2/2) - 78% complete
    08.07.2010 07:46:47 [INFO] Current status: Computing profile (pass 2/2) - 80% complete
    08.07.2010 07:46:52 [INFO] Current status: Computing profile (pass 2/2) - 82% complete
    08.07.2010 07:46:57 [INFO] Current status: Computing profile (pass 2/2) - 89% complete
    08.07.2010 07:47:02 [INFO] Current status: Computing profile (pass 2/2) - 93% complete
    08.07.2010 07:47:07 [INFO] Current status: Computing profile (pass 2/2) - 100% complete
    08.07.2010 07:47:46 [INFO] Stopping runtime...
    08.07.2010 07:47:46 [INFO] Finished!

    08.07.2010 07:47:46 [WARNING] Task completed with warnings. Elapsed time: 00:10:54
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited

  • Georg Verweyen
    I started a new effort creating a csv-file (on newer data) and running the DQ-Analyzer. Please notice the runtime.

    08.07.2010 07:45:42 [INFO] Current status: Reading data (pass 1/2) - 22427634 records read
    08.07.2010 07:45:42 [INFO] Current status: Computing profile (pass 2/2) - 1% complete
    08.07.2010 07:45:47 [INFO] Current status: Computing profile (pass 2/2) - 3% complete
    08.07.2010 07:45:52 [INFO] Current status: Computing profile (pass 2/2) - 5% complete
    08.07.2010 07:45:57 [INFO] Current status: Computing profile (pass 2/2) - 7% complete
    08.07.2010 07:46:02 [INFO] Current status: Computing profile (pass 2/2) - 9% complete
    08.07.2010 07:46:07 [INFO] Current status: Computing profile (pass 2/2) - 11% complete
    08.07.2010 07:46:12 [INFO] Current status: Computing profile (pass 2/2) - 13% complete
    08.07.2010 07:46:17 [INFO] Current status: Computing profile (pass 2/2) - 14% complete
    08.07.2010 07:46:22 [INFO] Current status: Computing profile (pass 2/2) - 17% complete
    08.07.2010 07:46:27 [INFO] Current status: Computing profile (pass 2/2) - 21% complete
    08.07.2010 07:46:32 [INFO] Current status: Computing profile (pass 2/2) - 75% complete
    08.07.2010 07:46:37 [INFO] Current status: Computing profile (pass 2/2) - 76% complete
    08.07.2010 07:46:42 [INFO] Current status: Computing profile (pass 2/2) - 78% complete
    08.07.2010 07:46:47 [INFO] Current status: Computing profile (pass 2/2) - 80% complete
    08.07.2010 07:46:52 [INFO] Current status: Computing profile (pass 2/2) - 82% complete
    08.07.2010 07:46:57 [INFO] Current status: Computing profile (pass 2/2) - 89% complete
    08.07.2010 07:47:02 [INFO] Current status: Computing profile (pass 2/2) - 93% complete
    08.07.2010 07:47:07 [INFO] Current status: Computing profile (pass 2/2) - 100% complete
    08.07.2010 07:47:46 [INFO] Stopping runtime...
    08.07.2010 07:47:46 [INFO] Finished!

    08.07.2010 07:47:46 [WARNING] Task completed with warnings. Elapsed time: 00:10:54
  • (some HTML allowed)
    How does this make you feel?
    Add Image
    I'm

    e.g. sad, anxious, confused, frustrated indifferent, undecided, unconcerned kidding, amused, unsure, silly happy, confident, thankful, excited