Fix negative replication delay metric #8520

Merged
merged 1 commit into from
Jul 31, 2024

tristan957
Member

In some cases, we can get a negative metric for replication_delay_bytes. My best guess from all the research I've done is that we evaluate pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by the time everything is said and done, the replay LSN has advanced past the receive LSN. In this case, our lag can effectively be modeled as 0 due to the speed of the WAL reception and replay.
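The idea of modeling a negative lag as 0 can be sketched as a single query. This is only an illustration, assuming the metric is computed with pg_wal_lsn_diff() over the two LSN functions named above. The use of GREATEST and the column alias are assumptions for the sketch, not necessarily what the merged change does.

```sql
-- Hypothetical sketch: clamp the computed lag at zero so that a replay LSN
-- that momentarily reads ahead of the receive LSN is reported as "no lag"
-- instead of a negative byte count.
SELECT GREATEST(
         0,
         pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
       ) AS replication_delay_bytes;
```

The clamp does not change the order in which the two LSNs are read. It only hides the negative result that the race can produce.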

@tristan957 tristan957 requested a review from hlinnaka July 25, 2024 20:50
@tristan957
Member Author

I've also given thought to something along the lines of:

DO $$
DECLARE
  receive_lsn pg_lsn;
  replay_lsn pg_lsn;
  replication_delay_bytes numeric;
BEGIN
  SELECT pg_last_wal_replay_lsn() INTO replay_lsn;
  SELECT pg_last_wal_receive_lsn() INTO receive_lsn;

  SELECT pg_wal_lsn_diff(receive_lsn, replay_lsn) INTO replication_delay_bytes;
END $$;

Because the replay LSN is read first and the receive LSN is read afterward, we can guarantee that the receive LSN will never be smaller than the replay LSN. The cost is a slightly larger reported replication delay: the amount of WAL replayed between the time replay_lsn is set and the time receive_lsn is set. In practice, this is likely negligible.

github-actions bot commented Jul 25, 2024

3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)


Code coverage* (full report)

  • functions: 32.5% (7022 of 21597 functions)
  • lines: 49.9% (55877 of 111995 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
759186d at 2024-07-30T22:48:37.765Z :recycle:

tristan957 commented Jul 30, 2024

The more I think about this, the more I think the current PR is probably the best way to solve this problem. A negative value really means that we are processing WAL so fast that there is essentially no lag.

I guess the open question is whether a really savvy user would find a small non-zero value more believable than 0.

Review thread on vm-image-spec.yaml (resolved)
@tristan957 tristan957 marked this pull request as ready for review July 30, 2024 21:53
@tristan957 tristan957 merged commit 5e0409d into main Jul 31, 2024
65 checks passed
@tristan957 tristan957 deleted the tristan957/negative-delay branch July 31, 2024 15:17