Fix negative replication delay metric #8520
I've also given thought to something along the lines of:

DO $$
DECLARE
    receive_lsn pg_lsn;
    replay_lsn pg_lsn;
    replication_delay_bytes numeric;
BEGIN
    -- Read the replay LSN first, so the receive LSN read afterwards cannot be behind it.
    SELECT pg_last_wal_replay_lsn() INTO replay_lsn;
    SELECT pg_last_wal_receive_lsn() INTO receive_lsn;
    SELECT pg_wal_lsn_diff(receive_lsn, replay_lsn) INTO replication_delay_bytes;
END $$;

Then we can guarantee that the receive LSN will never be smaller than the replay LSN, at the expense of a slightly larger reported replication delay: the amount of WAL replayed between the time
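To illustrate why the read order matters, here is a toy Python model, not Postgres code. The `Standby` class and `min_lag` helper are made up for this sketch, and the WAL rates are arbitrary. It assumes only that the replay position never passes the receive position at any single instant, and that both can advance between the two reads.

```python
import random


class Standby:
    """Toy hot standby: at any instant, replay_lsn <= receive_lsn."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.receive_lsn = 0
        self.replay_lsn = 0

    def tick(self):
        # Receive some WAL, then replay at most what has been received.
        self.receive_lsn += self._rng.randint(0, 1000)
        replayed = self.replay_lsn + self._rng.randint(0, 2000)
        self.replay_lsn = min(self.receive_lsn, replayed)


def min_lag(receive_first, samples=10_000, seed=0):
    """Smallest (receive - replay) seen when the standby advances between reads."""
    standby = Standby(seed)
    worst = None
    for _ in range(samples):
        if receive_first:
            receive = standby.receive_lsn
            standby.tick()  # WAL keeps flowing between the two reads
            replay = standby.replay_lsn
        else:
            replay = standby.replay_lsn
            standby.tick()
            receive = standby.receive_lsn
        lag = receive - replay
        worst = lag if worst is None else min(worst, lag)
    return worst


if __name__ == "__main__":
    print("receive read first, min lag:", min_lag(receive_first=True))
    print("replay read first,  min lag:", min_lag(receive_first=False))
```

In this model, reading the replay position first can only overstate the lag, while reading the receive position first can produce a negative value.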
Discussion with Laurenz Albe here: https://dba.stackexchange.com/questions/341186/postgres-hot-standby-has-a-negative-replication-delay
The more I think about this, the more I think the current PR is probably the best way to solve this problem. A negative value really means that we are processing WAL so fast that there is essentially no lag. Though I wonder: if you're a really smart user, would seeing a small non-zero value be more believable than seeing 0?
In some cases, we can get a negative metric for replication_delay_bytes. My best guess from all the research I've done is that we evaluate pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by the time everything is said and done, the replay LSN has advanced past the receive LSN. In this case, our lag can effectively be modeled as 0 due to the speed of the WAL reception and replay.
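The idea of modeling the lag as 0 can be sketched in Python as below. This is only an illustration of clamping, not the PR's actual change, which is not shown in this conversation. The helper names `parse_lsn` and `replication_delay_bytes` are hypothetical. It relies on the textual `pg_lsn` form being two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits).

```python
def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '0/3000060' to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replication_delay_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Byte difference between receive and replay positions, clamped at 0.

    A negative raw difference means replay overtook the receive value that
    was sampled earlier, so the lag is reported as 0.
    """
    raw = parse_lsn(receive_lsn) - parse_lsn(replay_lsn)
    return max(0, raw)


if __name__ == "__main__":
    # Receive ahead of replay: ordinary positive lag.
    print(replication_delay_bytes("0/3000060", "0/3000000"))
    # Replay sampled after it passed the earlier receive value: clamped.
    print(replication_delay_bytes("0/3000000", "0/3000060"))
```

Clamping hides the sampling race rather than removing it, which is the trade-off discussed in the comments above.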