Apache NiFi provides the "ExecuteSQL" processor to execute a query and return the results as flow files. But if we set the Execution option to "All Nodes", does NiFi divide the query into batches and execute them in parallel across the nodes (similar to how Sqoop does)?
1 Answer
If you use ExecuteSQL and set Execution to "All Nodes", the same query is run on every node; it is not split up.
If you want Sqoop-like behavior, use a processor like GenerateTableFetch set to run on the primary node only, then connect it to ExecuteSQL with a load-balanced connection so that the generated fetch queries get distributed across the cluster.
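To make the idea concrete, here is a rough Python sketch of what the flow above does conceptually: one node splits a table scan into page-sized queries, and those queries are then handed out to the cluster nodes in turn. This is only an illustration, not NiFi code. The function names, the table name `orders`, the column `id`, the node names, and the LIMIT/OFFSET query shape are all assumptions; the SQL that GenerateTableFetch actually emits depends on the configured database type.

```python
from itertools import cycle


def generate_fetch_queries(table, order_column, row_count, partition_size):
    """Split one table scan into page-sized queries (hypothetical helper)."""
    queries = []
    for offset in range(0, row_count, partition_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_column} "
            f"LIMIT {partition_size} OFFSET {offset}"
        )
    return queries


def distribute_round_robin(queries, nodes):
    """Assign each query to the next node in turn (hypothetical helper)."""
    assignment = {node: [] for node in nodes}
    for node, query in zip(cycle(nodes), queries):
        assignment[node].append(query)
    return assignment


# Assumed example values: a 25,000-row table fetched in pages of 10,000 rows.
queries = generate_fetch_queries("orders", "id", row_count=25_000, partition_size=10_000)
assignment = distribute_round_robin(queries, ["node-1", "node-2"])

for node, assigned in assignment.items():
    print(node, len(assigned), "queries")
```

In the real flow, the splitting is done by GenerateTableFetch and the hand-out is done by the load-balanced connection, so each ExecuteSQL instance only runs the fetch queries it receives.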
10 Comments
Akhil
Bryan, could you also answer this question? stackoverflow.com/questions/56126682/…
Bryan Bende
Hard to answer, since it depends on lots of factors: size of the DB table, size of the NiFi cluster, hardware specs of the cluster, etc. In general, Sqoop will probably win for large-scale performance.
Akhil
GenerateTableFetch will generate multiple queries based on the size of the table and pass them to subsequent processors as flow files, ExecuteSQL in this case. How can I load balance the ExecuteSQL processor? Can you give an example?
Bryan Bende
I mentioned this in the answer: run GenerateTableFetch on the primary node, connect it to ExecuteSQL, and configure load balancing on the connection - blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
Bryan Bende
Yes, the only processors that should ever be set to 'primary node only' are source processors, so in this case that would be GenerateTableFetch. Also, you can easily run a two-node cluster locally to test it out.