Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: fixes for layer visibility metric #8603

Merged
merged 6 commits into from
Aug 6, 2024

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Aug 5, 2024

Problem

In staging, we could see that occasionally tenants were wrapping their pageserver_visible_physical_size metric past zero to 2^64.

This is harmless right now, but will matter more later when we start using visible size in things like the /utilization endpoint.

Summary of changes

  • Add debug asserts that detect this case. test_gc_of_remote_layers works as a reproducer for this issue once the asserts are added.
  • Tighten up the interface around access_stats so that only Layer can mutate it.
  • In Layer, wrap calls to record_access in code that will update the visible size statistic if the access implicitly marks the layer visible (this was what caused the bug)
  • In LayerManager::rewrite_layers, use the proper set_visibility layer function instead of directly using access_stats (this is an additional path where metrics could go bad.)
  • Removed unused instances of LayerAccessStats in DeltaLayer and ImageLayer which I noticed while reviewing the code paths that call record_access.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Aug 5, 2024
@jcsp jcsp changed the title Jcsp/fix layer visibility metric pageserver: fixes for layer visibility metric Aug 5, 2024
Copy link

github-actions bot commented Aug 5, 2024

2112 tests run: 2043 passed, 0 failed, 69 skipped (full report)


Code coverage* (full report)

  • functions: 32.8% (7154 of 21803 functions)
  • lines: 50.5% (57750 of 114289 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
1e944bc at 2024-08-06T13:37:27.624Z :recycle:

@jcsp jcsp marked this pull request as ready for review August 5, 2024 13:13
@jcsp jcsp requested a review from a team as a code owner August 5, 2024 13:13
@jcsp jcsp requested a review from VladLazar August 5, 2024 13:13
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but you might need to rebase to reconcile with #8601

@jcsp jcsp enabled auto-merge (squash) August 6, 2024 12:22
@jcsp jcsp merged commit 42229aa into main Aug 6, 2024
63 checks passed
@jcsp jcsp deleted the jcsp/fix-layer-visibility-metric branch August 6, 2024 13:47
jcsp added a commit that referenced this pull request Aug 12, 2024
## Problem

In staging, we could see that occasionally tenants were wrapping their
pageserver_visible_physical_size metric past zero to 2^64.

This is harmless right now, but will matter more later when we start
using visible size in things like the /utilization endpoint.

## Summary of changes

- Add debug asserts that detect this case. `test_gc_of_remote_layers`
works as a reproducer for this issue once the asserts are added.
- Tighten up the interface around access_stats so that only Layer can
mutate it.
- In Layer, wrap calls to `record_access` in code that will update the
visible size statistic if the access implicitly marks the layer visible
(this was what caused the bug)
- In LayerManager::rewrite_layers, use the proper set_visibility layer
function instead of directly using access_stats (this is an additional
path where metrics could go bad.)
- Removed unused instances of LayerAccessStats in DeltaLayer and
ImageLayer which I noticed while reviewing the code paths that call
record_access.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants