Spring Batch Partitioner – Case Study with SourceCode – Best Practices
I started a sample project which could reproduce the problem to show the community what I was experiencing, but to my surprise I was using the partitioner feature incorrectly. I am writing this post to share what I learned throughout this experience to help those who are going through the same questions.
I’m writing this post because i report a bug at Spring Community Jira, this is the link:
My Goal: I wanted to use the resource partitioner for parallel processing but was worried to use the primary key of the table (column ID) because my table has gaps (id column is not incremental) and for this reason the partitioner would distribute number of different records for each thread, thus being inefficient in their distribution.
For example:
This is the good example partitioner:
Suppose that my table has the following records: Ids 1, 8, 9,10 11, 12, 13, 14, 15.
min: 1
max: 15
gridSize = number of threads = 2 in this example
target size calculation: int targetSize = (max – min) / gridSize + 1;
(15 – 1) / 2 + 1 = 8
In this example:
Thread number 1 will receive to process: 1 to 8
Thread number 2 will receive to process: 9 to 16
The Problem: Thread 1 receives only two records to process (The Id’s 1 and 8) and the thread 2 will receive 7 records to process. At this case the partitioner to split incorrectly number of records between threads.
My Goal: I want to split the number of records equally between all threads.
Where I was going wrong: To achieve my goal I tried to use a query that makes use of rownum and / or ntile oracle feature, the goal was to use the split an id that is sequential, with no gaps in the id column table, so the load would be uniform among the threads. The JdbcPagingItemReader class can not be used with multithreaded characteristics using Oracle ROWNUM because the query is partially executed multiple times in the database and there is no guarantee that all records are processed because a confusion of Ids between threads occurs.
The correct way: You can use JdbcPagingItemReader using the Primary Key column (may be single or multiple columns) or JdbcCursorItemReader can use both the PK column or Rownum / NTILE to do division.
Why use JdbcCursorItemReader not cause problems of mistaken IDs or lost records ?
This class executes the query once the database and will use chunk mode to fetch the records as needed. If you use a rownum column in this case will not cause data loss because every query is processed only once in the database.
To illustrate and facilitate understanding, I created a design example set with various possible configurations available here:
GitHub Example Project:
Here are the sql scripts to create the database tables used in this poc:
- JdbcCursorItemReader-OracleNtile – It works
- JdbcCursorItemReader-OracleRownum – It works
- JdbcPagingItemReader-OracleNtile – It not works, don’t use this. PagingReader does not work with NTile
- JdbcPagingItemReader-OracleRownum – It not works, don’t use this. PagingReader does not work with Rownum
- JdbcPagingItemReader-TablePrimaryKey – It works, but the records aren’t distributed in an uniform way (same quantity for each thread)
What is Oracle NTile ?
This feature of Oracle Database can create a desired number of containers so that each thread can consume one. For example: I have 1000 records in the database to be divided among 10 threads:
[code language=”sql”]
SELECT ID, DESCRIPTION, FLAG_PROCESSED, NTILE(10) OVER (ORDER BY ID)
AS CONTAINER_COLUMN FROM TABLE_SOURCE
WHERE FLAG_PROCESSED = ‘N’;
[/code]
With this query, you can use the column “CONTAINER_COLUMN”, values are already pre split into buckets ready to be divided among the various threads.
This is the documentation with more clarified explanation:
That’s it.
Any question or suggestion is very welcome.
Credits to this post:
Cheers,
Victor Jabur