We noticed a data inconsistency in a 3-node asynchronous replication Postgres setup. A single JDBC transaction inserted/updated rows in two tables but we are noticing that one table did not have a few rows inserted. A few hours after the transaction the primary node went down and standby node became master.
Environment
- PostgreSQL 13.x (3 nodes, asynchronous streaming replication)
- JDBC (PostgreSQL driver)
- BoneCP connection pool
- connection.setAutoCommit(false)
- Isolation: REPEATABLE READ
- Explicit connection.commit() after multiple executeBatch() calls
- queryTimeout = 600s (long transaction)
- Application logs show no rollback, no exceptions
- Updates made in a transaction have the same value for column version
- Entries are written to postgres under a global lock
- No entries have been deleted from the table
- data‑checksums were not enabled
Code for inserting:
try (Connection connection = dataSource.getConnection()) {
connection.setAutoCommit(false);
connection.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
// 1. Insert into main table (table_a)
// size of updated is around 400000
if (!updated.isEmpty()) {
try (PreparedStatement ps = connection.prepareStatement(
"INSERT INTO table_a (id, version, content) VALUES (?, ?, ?)")) {
for (Storable s : updated.values()) {
ps.setString(1, s.getId());
ps.setLong(2, s.getVersion());
ps.setBytes(3, s.toBytes());
ps.addBatch();
}
int[] updates = ps.executeBatch();
int updateCount = IntStream.of(updates).sum();
if (updateCount != updated.size()) {
throw new SQLException("Unexpected updateCount=" + updateCount);
}
}
}
// 2. Update index table (table_b)
if (!updated.isEmpty()) {
try (PreparedStatement ps = connection.prepareStatement(
"UPDATE table_b SET id=?, name=?, version=? WHERE id=?")) {
for (Storable s : updated.values()) {
ps.setString(1, s.getId());
ps.setString(2, s.getName());
ps.setLong(3, s.getVersion());
ps.setString(4, s.getId());
ps.addBatch();
}
int[] updates = ps.executeBatch();
int updateCount = IntStream.of(updates).sum();
if (updateCount != updated.size()) {
throw new SQLException("Unexpected updateCount=" + updateCount);
}
}
}
connection.commit(); // <- Commit succeeds without errors
} catch (SQLException e) {
connection.rollback();
...
}
Observed Behavior
The transaction completed successfully (no rollback or SQL exception).
After a failover (a few hours later):
The new primary had the updated entries in table_b
But for a few IDs(which are present in table_b), table_a had no new rows for that version. None of the postgres nodes have these rows in table_a.
For most IDs, both tables were updated correctly.
The tables were first queried after failover, so it’s unclear if the inconsistency existed before failover or appeared later.
PostgreSQL logs showed no errors during the transaction or failover.
Questions:
- Could this be caused by disk or memory corruption (given data_checksums were disabled)?
- Could it be caused due to an issue with async replication? (Logs for successful syncing of wal are present)
- What specific PostgreSQL or system logs should be checked to confirm whether the WAL for this transaction was fully written and applied?