pageserver: add `bench_ingest` #7409

jcsp · 2024-04-17T15:58:02Z

Problem

We lack a Rust bench for the in-memory layer and delta layer write paths. It is useful to benchmark these components independently of Postgres and WAL decoding.

Related: #8452

Summary of changes

- Refactor `DeltaLayerWriter` to avoid carrying a `Timeline`, so that it can be cleanly tested + benched without a Tenant/Timeline test harness. It only needed the `Timeline` for building `Layer`, so this can be done in a separate step.
- Add `bench_ingest`, which exercises a variety of workload "shapes" (big values, small values, sequential keys, random keys).
- Include a small uncontroversial optimization: in `freeze`, only exhaustively walk values to assert ordering relative to `end_lsn` in debug mode.
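
The last item can be illustrated with a minimal sketch. The names `Lsn`, `all_below` and `freeze` here are hypothetical stand-ins rather than the pageserver's actual types, and the assumption that `end_lsn` is an exclusive upper bound is mine. The point is only that the O(n) walk runs in debug builds and is skipped in release builds.

```rust
/// Hypothetical stand-in for an LSN; the real pageserver type differs.
type Lsn = u64;

/// Returns true if every recorded LSN is strictly below `end_lsn`.
/// (Assumption: `end_lsn` is treated as an exclusive bound.)
fn all_below(lsns: &[Lsn], end_lsn: Lsn) -> bool {
    lsns.iter().all(|&lsn| lsn < end_lsn)
}

/// Sketch of a `freeze`-like step: the exhaustive walk over all values
/// only happens when debug assertions are enabled.
fn freeze(lsns: &[Lsn], end_lsn: Lsn) -> Lsn {
    if cfg!(debug_assertions) {
        assert!(all_below(lsns, end_lsn), "value at or above end_lsn");
    }
    end_lsn
}

fn main() {
    let lsns = vec![10, 20, 30];
    println!("frozen at {}", freeze(&lsns, 31));
}
```

In a release build (`cargo bench` uses an optimized profile by default) the branch is compiled out, so the bench does not pay for the walk.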

These benches are limited by drive performance on a lot of machines, but still useful as a local tool for iterating on CPU/memory improvements around this code path.
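
As a rough illustration of the workload "shapes", the sketch below generates keys for the three patterns that appear in the bench labels (`seq`, `rand`, `rand-1024keys`). The `KeyLayout` enum, `generate_keys` and the xorshift generator are hypothetical and only serve to keep the example dependency-free; they are not the types used in `bench_ingest.rs`.

```rust
/// Hypothetical description of a key pattern, mirroring the bench labels.
#[derive(Clone, Copy)]
enum KeyLayout {
    /// Keys 0, 1, 2, ... in order.
    Sequential,
    /// Keys drawn from the full u64 range.
    Random,
    /// Random keys drawn from a small fixed set of `n` distinct keys (n > 0).
    RandomReuse(u64),
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Produce `n` keys for the given layout, deterministically for a given seed.
fn generate_keys(layout: KeyLayout, n: u64, seed: u64) -> Vec<u64> {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    (0..n)
        .map(|i| match layout {
            KeyLayout::Sequential => i,
            KeyLayout::Random => rng.next(),
            KeyLayout::RandomReuse(distinct) => rng.next() % distinct,
        })
        .collect()
}

fn main() {
    // 128 MiB of 100-byte values, as in the "128MB/100b" bench labels.
    let values_per_run = (128 * 1024 * 1024) / 100;
    println!("values per run: {values_per_run}");
    for (name, layout) in [
        ("seq", KeyLayout::Sequential),
        ("rand", KeyLayout::Random),
        ("rand-1024keys", KeyLayout::RandomReuse(1024)),
    ] {
        println!("{name}: {:?}", generate_keys(layout, 4, 42));
    }
}
```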

Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd):


```
ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1160 s 1.1230 s 1.1289 s]
                        thrpt:  [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s.
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.9001 s 1.9056 s 1.9110 s]
                        thrpt:  [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s]
Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s.
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.0715 s 1.0828 s 1.0937 s]
                        thrpt:  [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [425.49 ms 429.07 ms 432.04 ms]
                        thrpt:  [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

ingest-big-values/ingest 128MB/8k seq
                        time:   [373.03 ms 375.84 ms 379.17 ms]
                        thrpt:  [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [81.534 ms 82.811 ms 83.364 ms]
                        thrpt:  [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s]
Found 1 outliers among 10 measurements (10.00%)
```
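
For reading the output above: the `thrpt` figures are the configured byte count divided by the measured time, so the middle estimates can be re-derived by hand. The helper below is an illustrative sketch (the function name is made up) that reproduces a few of them from the reported times, taking "128MB" to mean 128 MiB.

```rust
/// Throughput in MiB/s for `bytes` processed in `seconds`.
fn mib_per_sec(bytes: u64, seconds: f64) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / seconds
}

fn main() {
    let bytes = 128 * 1024 * 1024;
    // Middle time estimates copied from the bench output above.
    for (name, secs) in [("100b seq", 1.1230), ("100b rand", 1.9056), ("8k seq", 0.37584)] {
        println!("{name}: {:.2} MiB/s", mib_per_sec(bytes, secs));
    }
}
```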

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2024-04-17T16:03:33Z

2138 tests run: 2069 passed, 0 failed, 69 skipped (full report)

Flaky tests (2)

Postgres 16

test_subscriber_restart: release

Postgres 15

test_top_tenants: release

Code coverage* (full report)

functions: 32.8% (7153 of 21803 functions)
lines: 50.5% (57735 of 114292 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
da223bc at 2024-08-06T16:48:12.791Z


part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with `tokio-epoll-uring` compared to `std-fs`. That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache. `tokio-epoll-uring` cannot beat that at this time, and possibly never. (However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use `tokio-epoll-uring`. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer. That means a single Timeline can have multiple 64k buffers alive, especially if flushing is slow. This poses an OOM risk.

We should either bound the number of frozen layers (#7317), or change the freezing code to flush the buffer and drop the allocation.

However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` is too noisy, even with instance storage. It varies by 30-40%. I suspect that's due to compaction. Raising the amount of data by 10x doesn't help with the noisiness.

So, I used the `bench_ingest` from @jcsp's #7409. Specifically, the `ingest-small-values/ingest 128MB/100b seq` and `ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| Buffer size | IO engine         | seq (MiB/s) | seq, no delta (MiB/s) |
|-------------|-------------------|-------------|-----------------------|
| 8k          | std-fs            | 55          | 165                   |
| 8k          | tokio-epoll-uring | 37          | 107                   |
| 64k         | std-fs            | 55          | 180                   |
| 64k         | tokio-epoll-uring | 48          | 164                   |

The `8k` is from before this PR, the `64k` is with this PR. The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs` performance in the `seq` benchmark. Notably, `seq` appears to hit some other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of `std-fs` performance.
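
To give a feel for why the buffer size matters with small values, here is a minimal sketch that counts how often a buffered writer hands data to the underlying sink for 8 KiB and 64 KiB buffers. It uses the standard library's `BufWriter` as a stand-in; `blob_io` and `EphemeralFile` have their own buffering code, and `CountingSink` and `count_writes` are hypothetical names invented for this example.

```rust
use std::io::{BufWriter, Write};

/// Sink that counts how many times the buffered layer hands data to it.
/// Stands in for a file: each call would be one write / IO submission.
#[derive(Default)]
struct CountingSink {
    calls: usize,
    bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

/// Write `n_values` values of `value_len` bytes through a buffer of
/// `buf_size` bytes; return (underlying write calls, total bytes written).
fn count_writes(buf_size: usize, n_values: usize, value_len: usize) -> (usize, usize) {
    let value = vec![0u8; value_len];
    let mut writer = BufWriter::with_capacity(buf_size, CountingSink::default());
    for _ in 0..n_values {
        writer.write_all(&value).expect("in-memory sink cannot fail");
    }
    writer.flush().expect("in-memory sink cannot fail");
    let sink = writer.get_ref();
    (sink.calls, sink.bytes)
}

fn main() {
    // 100-byte values, as in the "100b" benchmarks.
    for buf_size in [8 * 1024, 64 * 1024] {
        let (calls, bytes) = count_writes(buf_size, 100_000, 100);
        println!("buffer {buf_size} B: {calls} underlying writes for {bytes} bytes");
    }
}
```

Fewer, larger writes mean fewer submissions to the IO engine for the same amount of data, which is consistent with the direction of the `tokio-epoll-uring` numbers in the table, though the sketch measures call counts only and not time.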

part of #7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable
* Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this (limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch `blob_io` for `InMemoryLayer` => Plan for this is laid out in #8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the API a bit in preliminary PR #8186 to make it less error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i ...`.

Full report: https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest, but
* significantly lower pressure on PS PageCache (eviction rate drops to 1/3)
  * (that's why I'm working on this)
* noticeable but modestly lower CPU time

This is good enough for merging this PR because the changes require opt-in.

We'll do more testing in staging & pre-prod.
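
The core idea from "High-Level Changes" (values appended in LSN order, then read back in key order from an in-memory copy) can be sketched as follows. `EphemeralBuf` and its methods are hypothetical; `data` stands in for the ephemeral file's contents after it has been read fully into memory, and the real code goes through `BlobReaderRef::Slice` rather than plain slicing.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory layer: values appended in arrival (LSN) order,
/// plus an index from (key, lsn) to the value's position in the buffer.
#[derive(Default)]
struct EphemeralBuf {
    data: Vec<u8>,
    index: BTreeMap<(u64, u64), (usize, usize)>, // (key, lsn) -> (offset, len)
}

impl EphemeralBuf {
    /// Append a value; offsets grow in write order, not in key order.
    fn put(&mut self, key: u64, lsn: u64, value: &[u8]) {
        let offset = self.data.len();
        self.data.extend_from_slice(value);
        self.index.insert((key, lsn), (offset, value.len()));
    }

    /// Emit values in (key, lsn) order. Each lookup is a random access into
    /// `data`, but it is served from memory rather than a page cache or disk.
    fn flush_in_key_order(&self) -> Vec<((u64, u64), &[u8])> {
        self.index
            .iter()
            .map(|(&k, &(off, len))| (k, &self.data[off..off + len]))
            .collect()
    }
}

fn main() {
    let mut buf = EphemeralBuf::default();
    buf.put(7, 1, b"c");
    buf.put(3, 2, b"a");
    buf.put(7, 3, b"d");
    buf.put(3, 4, b"b");
    for ((key, lsn), value) in buf.flush_in_key_order() {
        println!("key={key} lsn={lsn} value={:?}", value);
    }
}
```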
# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer` size (aka "checkpoint distance"), hence there's no hard limit on the memory allocation we do for flushing. In practice, we a) [log a warning](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743) when we flush oversized layers, so we'd know which tenant is to blame and b) if we were to put a hard limit in place, we would have to decide what to do if there is an InMemoryLayer that exceeds the limit. It seems like a better option to guarantee a max size for frozen layer, dependent on `checkpoint_distance`. Then limit concurrency based on that.

**metrics**: we do have the [flush_time_histo](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726), but that includes the wait time for the semaphore. We could add a separate metric for the time spent after acquiring the semaphore, so one can infer the wait time. Seems unnecessary at this point, though.
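
The metrics idea (timing the work after the semaphore is acquired separately from the wait for it) might look roughly like this. It is a sketch only: the blocking `Semaphore` built from `Mutex` and `Condvar` and the `timed_flush` helper are hypothetical and std-only, whereas an async service would more likely use an async semaphore, and no actual metric is registered here.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::{Duration, Instant};

/// Minimal blocking counting semaphore, for illustration only.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `work` under the semaphore; return (time waiting, time working).
/// Note: a panic in `work` would leak the permit in this simplified version.
fn timed_flush<F: FnOnce()>(sem: &Semaphore, work: F) -> (Duration, Duration) {
    let start = Instant::now();
    sem.acquire();
    let acquired = Instant::now();
    work();
    let done = Instant::now();
    sem.release();
    (acquired - start, done - acquired)
}

fn main() {
    // One permit: at most one "flush" holds its buffer in memory at a time.
    let sem = Arc::new(Semaphore::new(1));
    let handles: Vec<_> = (0..3)
        .map(|i| {
            let sem = Arc::clone(&sem);
            std::thread::spawn(move || {
                let (wait, work) =
                    timed_flush(&sem, || std::thread::sleep(Duration::from_millis(20)));
                println!("flush {i}: waited {wait:?}, worked {work:?}");
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```

With the two durations recorded separately, the existing total flush time minus the post-acquire time gives the semaphore wait, which is the inference the paragraph above describes.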

VladLazar

Looks good - just some questions

pageserver/benches/bench_ingest.rs


This was referenced Apr 17, 2024

perf(walingest): mitigate bulk ingest throughput regression through larger EphemeralFile in-memory buffer #7273

Closed

bypass PageCache for L0 flush #7418

Closed

problame mentioned this pull request Apr 25, 2024

perf!: use larger buffers for blob_io and ephemeral_file #7485

Merged

problame mentioned this pull request Jun 27, 2024

L0 flush: opt-in mechanism to bypass PageCache reads and writes #8190

Merged

pageserver: make bench'able methods public

74eda0b

jcsp force-pushed the jcsp/ingest-bench branch from 5e1448d to 9b8ae32 Compare August 1, 2024 15:10

jcsp mentioned this pull request Aug 1, 2024

pageserver: 2x ingest performance #8452

Open

jcsp changed the title ~~Jcsp/ingest bench~~ pageserver: add bench_ingest Aug 1, 2024

jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels Aug 1, 2024

jcsp added 3 commits August 1, 2024 15:43

pageserver: refactor DeltaLayerWriter to not need a Timeline

137cbb4

pageserver: add ingest bench

ae7d635

pageserver: downgrade an assertion to debug

5dcfe1c

jcsp force-pushed the jcsp/ingest-bench branch from 9b8ae32 to 5dcfe1c Compare August 1, 2024 15:45

jcsp requested a review from arpad-m August 1, 2024 15:45

jcsp marked this pull request as ready for review August 1, 2024 15:45

jcsp requested a review from a team as a code owner August 1, 2024 15:45

VladLazar reviewed Aug 2, 2024

pageserver/benches/bench_ingest.rs (resolved)

pageserver/benches/bench_ingest.rs (resolved)

pageserver/benches/bench_ingest.rs (outdated, resolved)

arpad-m reviewed Aug 2, 2024

pageserver/benches/bench_ingest.rs (resolved)

arpad-m approved these changes Aug 2, 2024

add a doc comment

a8be0f3

jcsp added 2 commits August 5, 2024 12:23

s/field3/field6/

d152a57

clean up temp dir

c2d5395

jcsp enabled auto-merge (squash) August 5, 2024 12:38

jcsp added 5 commits August 5, 2024 17:38

Merge remote-tracking branch 'upstream/main' into jcsp/ingest-bench

513cafd

update split_writer for merge

bf3e767

Merge remote-tracking branch 'upstream/main' into jcsp/ingest-bench

2af99ae

Merge remote-tracking branch 'upstream/main' into jcsp/ingest-bench

d38afdb

Merge remote-tracking branch 'upstream/main' into jcsp/ingest-bench

da223bc

jcsp merged commit ca5390a into main Aug 6, 2024
63 checks passed

jcsp deleted the jcsp/ingest-bench branch August 6, 2024 16:39
