What Ataccama DQC reader/writer step should you choose for Big Data sources?

  • 7 February 2023

Some of the most common data sources our users connect to are relational databases, e.g. Oracle, MSSQL, PostgreSQL, etc. For those sources, we usually recommend connecting via JDBC drivers and using the JDBC Reader and JDBC Writer steps to work with the data.
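
For example, a JDBC connection string for PostgreSQL typically looks like the following (the host, port, and database name are just placeholders):

jdbc:postgresql://dbhost:5432/salesdb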

But for big data sources, e.g. Databricks, there are a few different steps available within the Ataccama ONE Desktop environment. Ever wondered which one you should choose?

 

Hive Reader and Writer:

The Hive Reader and Writer steps are the simplest to configure, but they also offer the fewest options. You need to specify the exact database, table, and column names you want to read from / write to. For the Reader step, you can provide a simple WHERE clause in the Filter.
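
For illustration only, assuming a hypothetical table with order_status and order_date columns, the Filter could contain a simple predicate such as:

order_status = 'SHIPPED' AND order_date >= '2023-01-01'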

 

Spark SQL Reader:

For more complex SQL queries, you can use the Spark SQL Reader. It is not limited to WHERE clauses: you can also use joins, aggregations, and so on.
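
As an example, a query like the following (the database, table, and column names are made up for illustration) can join and aggregate data before it enters your plan:

SELECT c.customer_id, COUNT(o.order_id) AS order_count
FROM sales.customers c
JOIN sales.orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id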

 

Spark Reader and Writer:

For even more complex operations, you can use the Spark Reader and Writer steps. They are more complex to configure, because they expose the Spark DataFrame reader / writer APIs.
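
As a rough sketch of what these steps correspond to, the equivalent PySpark DataFrame reader / writer calls look roughly like this (the format, option, path, and table name below are illustrative assumptions, not the actual step configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a DataFrame from files; format, option, and path are placeholders
df = (spark.read
      .format("parquet")
      .option("mergeSchema", "true")
      .load("/data/landing/orders"))

# Write the result back out as a table; mode and table name are placeholders
(df.write
   .mode("overwrite")
   .format("parquet")
   .saveAsTable("analytics.orders_processed"))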

 

JDBC Reader and Writer:

The JDBC Reader and Writer steps can theoretically be used as a last resort on low volumes of data. However, you will likely see slow performance when using JDBC connectors against big data environments, so this approach is generally not recommended.


2 replies


Hi @MayKwok , are there any tutorials/resources on the usage of the Spark SQL Reader / Spark Reader and Writer steps inside a .plan file?


Hello Deepak,

The steps have a detailed description in the bottom left that explains how to configure each field and links to the external Spark documentation. There are also context tooltips when hovering over the fields. Those should be enough to help you configure the steps.

 Regards,

Maksim
