Spring Batch Partitioner – Case Study with Source Code – Best Practices

I’m writing this post because I reported a bug in the Spring community JIRA; this is the link:

https://jira.spring.io/browse/BATCH-2309

I started a sample project to reproduce the problem and show the community what I was experiencing, but to my surprise I discovered I was using the partitioner feature incorrectly. I am writing this post to share what I learned from this experience, to help those working through the same questions.

My goal: I wanted to use the partitioner for parallel processing, but I was worried about using the table’s primary key (the ID column), because my table has gaps (the ID column is not sequential). Because of the gaps, the partitioner would assign a different number of records to each thread, distributing the load inefficiently.

For example, take the standard ColumnRangePartitioner from the official samples:

https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/ColumnRangePartitioner.java

Suppose that my table has the following records: IDs 1, 8, 9, 10, 11, 12, 13, 14 and 15.

min: 1

max: 15

gridSize = number of threads = 2 in this example

target size calculation: int targetSize = (max - min) / gridSize + 1;

(15 - 1) / 2 + 1 = 8

In this example:

Thread 1 will receive the range 1 to 8.

Thread 2 will receive the range 9 to 16.

The problem: thread 1 receives only two records to process (IDs 1 and 8), while thread 2 receives seven. In this case the partitioner splits the records unevenly between the threads.
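For reference, here is the heart of that partitioner, condensed from the linked sample (jdbcTemplate, table and column are fields configured on the bean):

// Condensed from the linked ColumnRangePartitioner sample.
public Map<String, ExecutionContext> partition(int gridSize) {
    int min = jdbcTemplate.queryForObject(
            "SELECT MIN(" + column + ") FROM " + table, Integer.class);
    int max = jdbcTemplate.queryForObject(
            "SELECT MAX(" + column + ") FROM " + table, Integer.class);
    int targetSize = (max - min) / gridSize + 1; // width of each ID range

    Map<String, ExecutionContext> result = new HashMap<>();
    int number = 0;
    int start = min;
    int end = start + targetSize - 1;
    while (start <= max) {
        ExecutionContext context = new ExecutionContext();
        if (end >= max) {
            end = max;
        }
        // Each partition (thread) only sees IDs in [minValue, maxValue].
        context.putInt("minValue", start);
        context.putInt("maxValue", end);
        result.put("partition" + number, context);
        start += targetSize;
        end += targetSize;
        number++;
    }
    return result;
}

Note that the split is pure arithmetic on the ID range: the partitioner knows nothing about how many rows actually exist inside each range, which is exactly why the gaps cause the imbalance above.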

My goal: I want to split the records equally among all threads.

Where I was going wrong: to achieve this I tried queries using Oracle’s ROWNUM and/or NTILE features, so that the split would be done on a sequential, gap-free number instead of the ID column, making the load uniform across the threads. But JdbcPagingItemReader cannot be used in a multithreaded setup with Oracle ROWNUM: a paging reader re-executes (parts of) the query once per page, and since ROWNUM is recomputed on every execution, the numbering shifts between pages. Threads end up with conflicting IDs, and there is no guarantee that all records are processed.

The correct way: you can use JdbcPagingItemReader with the primary key column (which may be a single column or multiple columns), or JdbcCursorItemReader with either the PK column or a ROWNUM/NTILE column to do the division.

Why does JdbcCursorItemReader not cause mistaken IDs or lost records?

This class executes the query only once in the database and uses the cursor to fetch records in chunks as they are needed. A ROWNUM column is safe in this case: since the query is processed only once, the numbering never changes, and no data is lost.
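As a minimal sketch of this approach (the bean wiring and the minValue/maxValue keys are my assumptions, e.g. a partitioner that splits a COUNT(*)-based row-number range; this is not the exact POC code):

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class CursorReaderConfig {

    // A minimal sketch: assumes the partitioner stored a "minValue"/"maxValue"
    // row-number range in each step execution context.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> cursorReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['minValue']}") Integer minValue,
            @Value("#{stepExecutionContext['maxValue']}") Integer maxValue) {

        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        // ROWNUM is assigned over an already-ordered subquery, so the outer
        // filter sees a stable numbering; the cursor is opened exactly once
        // per step, so the numbers can never shift mid-read.
        reader.setSql("SELECT * FROM ("
                + " SELECT T.*, ROWNUM AS RN FROM ("
                + "  SELECT ID, DESCRIPTION FROM TABLE_SOURCE"
                + "  WHERE FLAG_PROCESSED = 'N' ORDER BY ID) T)"
                + " WHERE RN BETWEEN ? AND ?");
        reader.setPreparedStatementSetter(ps -> {
            ps.setInt(1, minValue);
            ps.setInt(2, maxValue);
        });
        reader.setRowMapper(new ColumnMapRowMapper());
        reader.setFetchSize(100); // rows stream from the single open cursor
        return reader;
    }
}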

To illustrate and make this easier to understand, I created an example project with the various possible configurations, available here:


GitHub Example Project:

https://github.com/victorjabur/PartitionSpringBatch_DataLose_Poc_BATCH-2309

Here are the SQL scripts that create the database tables used in this PoC:

https://github.com/victorjabur/PartitionSpringBatch_DataLose_Poc_BATCH-2309/tree/master/src/main/resources/sql

  1. JdbcCursorItemReader-OracleNtile – it works (the simple partitioner the two cursor variants rely on is sketched just after this list)
  2. JdbcCursorItemReader-OracleRownum – it works
  3. JdbcPagingItemReader-OracleNtile – it does not work, don’t use this. The paging reader does not work with NTILE
  4. JdbcPagingItemReader-OracleRownum – it does not work, don’t use this. The paging reader does not work with ROWNUM
  5. JdbcPagingItemReader-TablePrimaryKey – it works, but the records are not distributed uniformly (the threads do not receive the same number of records)
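For the two working cursor variants, the partitioner itself can stay trivial: it only hands each thread a bucket number, and the query does the real splitting. A minimal sketch, with hypothetical names rather than the exact POC classes:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Sketch: each partition only carries its bucket number; the NTILE/ROWNUM
// query on the reader side does the real splitting.
public class BucketPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        // NTILE numbers its buckets from 1 to gridSize.
        for (int bucket = 1; bucket <= gridSize; bucket++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("bucket", bucket);
            partitions.put("partition" + bucket, context);
        }
        return partitions;
    }
}

Both the NTILE and the ROWNUM configurations can share a partitioner like this; only the reader’s SQL changes.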

What is Oracle NTILE?
This Oracle Database feature splits a result set into a desired number of buckets, so that each thread can consume one of them. For example, suppose I have 1000 records in the database to be divided among 10 threads:

SELECT ID, DESCRIPTION, FLAG_PROCESSED, NTILE(10) OVER (ORDER BY ID)
AS CONTAINER_COLUMN FROM TABLE_SOURCE
WHERE FLAG_PROCESSED = 'N';

With this query you can use the CONTAINER_COLUMN column: its values come back already pre-split into buckets (1 through 10), ready to be divided among the various threads.
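Each step-scoped reader then wraps the NTILE query and keeps only its own bucket. A sketch in the same configuration style as the cursor reader above; the "bucket" key matches the BucketPartitioner sketch, and NTILE(10) must match the gridSize the job was configured with:

@Bean
@StepScope
public JdbcCursorItemReader<Map<String, Object>> ntileReader(
        DataSource dataSource,
        @Value("#{stepExecutionContext['bucket']}") Integer bucket) {

    JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    // Wrap the NTILE query so the outer filter selects one bucket per thread.
    reader.setSql("SELECT * FROM ("
            + " SELECT ID, DESCRIPTION, FLAG_PROCESSED,"
            + "  NTILE(10) OVER (ORDER BY ID) AS CONTAINER_COLUMN"
            + " FROM TABLE_SOURCE WHERE FLAG_PROCESSED = 'N')"
            + " WHERE CONTAINER_COLUMN = ?");
    reader.setPreparedStatementSetter(ps -> ps.setInt(1, bucket));
    reader.setRowMapper(new ColumnMapRowMapper());
    return reader;
}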

This is the official documentation, with a more detailed explanation:

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions101.htm

That’s it.

Any questions or suggestions are very welcome.

Credits to these posts:

http://alexandreesl.wordpress.com/2014/09/21/spring-batch-construindo-processamento-massivo-de-dados-em-java/
http://www.mkyong.com/spring-batch/spring-batch-partitioning-example/
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples

Cheers,
Victor Jabur


7 thoughts on “Spring Batch Partitioner – Case Study with Source Code – Best Practices”

  1. Hi, really interesting article, and thanks for the code. It really helps me to understand better.

    My question is,
    Instead of MIN(), MAX() or NTILE(), did you try using offset and limit after some ordering?
    Ex: OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;

    Or did you experience any problems with that?
    We can split the rows equally with fetch and limit, right?
    Correct me if I’m wrong.

  2. Wow! Superb one!
    It helped me a lot with decisions about the Spring Batch partitioner approach, along with tasklet, reader, writer & processor. 🙂

    Any idea, if I need to fetch massive data (approx. 100,000 records) from a REST service as input for my batch module and write that data into my own database table, which approach would be the most suitable one to use?

    I have done some research & prepared the approach below:
    1. Use 1 job with 2 steps, where the 1st step has a tasklet that reads the data from the web service into a map<PK ID, List>
    2. map<PK ID, List>: I am thinking of using the flyweight design pattern for the JSON data objects, but I need to research whether it is possible in my batch application scenario
    3. The tasklet from step 1 will store the data in a temporary database table (I do not know a better approach than spending one extra database hit to store the data)
    4. My 1st job’s 2nd step is a chunk (reader, processor, writer) with commit-interval = 500 or 800, so the data is read from the temporary table in chunks of 500 or 800 records;
    along with this I need to research how the partitioner approach will fit at the reader.
    I am just scared of getting out-of-memory exceptions :S
    5. The reader will read from the temp database table, the processor may be optional / map from the temp database entity to the actual table entity, and the writer will write the chunk data to the actual table.
    My thought here is not to have 2 different entities for the temp & actual DB tables

    I somehow feel that there must be a more optimized way or a better design approach, but as this is my 1st time with such a use case, I came up with this approach; if you can suggest improvements, that would be best 🙂

    Sorry for the long, help-seeking message 😦

    Thanks in advance for your help, Victor!
