Fix negative replication delay metric #8520

tristan957 · 2024-07-25T20:50:58Z

In some cases, we can get a negative metric for replication_delay_bytes. My best guess from all the research I've done is that we evaluate pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by the time both have been evaluated, the replay LSN has advanced past the receive LSN value we sampled. In this case, our lag can effectively be modeled as 0 due to the speed of the WAL reception and replay.
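The change to the metric query is not quoted in this conversation, so the following is only a minimal sketch of the "model it as 0" idea, assuming the metric is computed with pg_wal_lsn_diff(). The literal LSNs in the first statement are made up for illustration; the column alias in the second statement reuses the metric name from the description above.

```sql
-- Illustration with made-up LSNs: the replay LSN (second argument) was
-- sampled slightly later and is already ahead of the sampled receive LSN,
-- so the raw difference comes out negative.
SELECT pg_wal_lsn_diff('0/3000100'::pg_lsn, '0/3000160'::pg_lsn) AS raw_delay;
-- raw_delay = -96

-- Clamped variant (a sketch, not necessarily the query that was merged):
-- a negative difference is reported as 0.
-- GREATEST ignores NULL arguments, so this also yields 0 when either
-- function returns NULL.
SELECT GREATEST(
         0,
         pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
       ) AS replication_delay_bytes;
```

The clamp trades a tiny amount of precision for a metric that never goes below zero, which matches the reasoning that a negative value means there is effectively no lag.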

tristan957 · 2024-07-25T20:55:53Z

I've also given thought to something along the lines of:

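DO $$
DECLARE
  receive_lsn pg_lsn;
  replay_lsn pg_lsn;
  replication_delay_bytes numeric;
BEGIN
  -- Sample the replay LSN first, so that the receive LSN read afterwards
  -- is taken at a later point in time.
  SELECT pg_last_wal_replay_lsn() INTO replay_lsn;
  SELECT pg_last_wal_receive_lsn() INTO receive_lsn;

  -- Bytes of WAL received but not yet replayed.
  SELECT pg_wal_lsn_diff(receive_lsn, replay_lsn) INTO replication_delay_bytes;

  -- A DO block cannot return a value, so surface the result as a notice.
  RAISE NOTICE 'replication_delay_bytes = %', replication_delay_bytes;
END $$;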

Then we can guarantee that the receive LSN is never smaller than the replay LSN, at the expense of a slightly larger reported replication delay: the extra is the amount of WAL replayed between the time replay_lsn is set and the time receive_lsn is set. In practice, this is likely negligible.
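To make the size of that overestimate concrete, here is a small worked example. All LSNs are hypothetical values picked for easy arithmetic; t1 and t2 are just labels for the two sampling moments.

```sql
-- All LSNs below are hypothetical values chosen for illustration.
-- t1: replay LSN sampled  = 0/3000100
-- t2: receive LSN sampled = 0/3000180 (replay has meanwhile reached 0/3000160)
SELECT
  pg_wal_lsn_diff('0/3000180'::pg_lsn, '0/3000100'::pg_lsn) AS reported_delay,
  pg_wal_lsn_diff('0/3000180'::pg_lsn, '0/3000160'::pg_lsn) AS delay_at_t2,
  pg_wal_lsn_diff('0/3000160'::pg_lsn, '0/3000100'::pg_lsn) AS overestimate;
-- reported_delay = 128, delay_at_t2 = 32, overestimate = 96
```

The reported value exceeds the delay at t2 by exactly the WAL replayed between the two samples (96 bytes here), and it cannot go negative as long as the receive LSN sampled second is not behind the replay LSN sampled first.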

github-actions · 2024-07-25T21:38:43Z

3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)

Code coverage* (full report)

functions: 32.5% (7022 of 21597 functions)
lines: 49.9% (55877 of 111995 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results:
759186d at 2024-07-30T22:48:37.765Z

tristan957 · 2024-07-29T14:51:01Z

Discussion with Laurenz Albe, here: https://dba.stackexchange.com/questions/341186/postgres-hot-standby-has-a-negative-replication-delay

tristan957 · 2024-07-30T17:11:38Z

The more I think about this, the more I think the current PR is probably the best way to solve this problem. A negative value really means that we are processing WAL so fast that there is essentially no lag.

That said, I wonder whether a sophisticated user would find a small non-zero value more believable than 0.

tristan957 requested a review from hlinnaka July 25, 2024 20:50

ololobus mentioned this pull request Jul 30, 2024

Epic: stabilize physical replication #6211 (Open)

hlinnaka approved these changes Jul 30, 2024

vm-image-spec.yaml (review thread resolved)

tristan957 force-pushed the tristan957/negative-delay branch from 70d72e1 to c38ad20 Compare July 30, 2024 21:50

tristan957 marked this pull request as ready for review July 30, 2024 21:53

tristan957 force-pushed the tristan957/negative-delay branch from c38ad20 to 759186d Compare July 30, 2024 21:53

hlinnaka approved these changes Jul 31, 2024

tristan957 merged commit 5e0409d into main Jul 31, 2024
65 checks passed

tristan957 deleted the tristan957/negative-delay branch July 31, 2024 15:17
