Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ARM SVE2. #8051

Merged
merged 51 commits into from
Mar 15, 2024
Merged

Support for ARM SVE2. #8051

merged 51 commits into from
Mar 15, 2024

Conversation

zvookin
Copy link
Member

@zvookin zvookin commented Jan 29, 2024

Heavily based on Steve Suzuki's work here: #6781 . Hopefully easier to merge with less effect on existing ARM support and fewer constraints on CodeGen_LLVM.

@zvookin
Copy link
Member Author

zvookin commented Jan 29, 2024

I can't seem to name Steve Suzuki in the reviewers list. Apologies if I'm missing something.

Z Stern added 4 commits January 29, 2024 22:23
Replace internal_error with nop return for CodeGen_LLVM::match_vector_type_scalable called on scalar.
Remove dead code.
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
src/CodeGen_ARM.cpp Outdated Show resolved Hide resolved
@steven-johnson
Copy link
Contributor

looks like legit failure in correctness_vector_reductions on arm64

@zvookin
Copy link
Member Author

zvookin commented Feb 24, 2024

Why is the IR sharing issue only affecting this SIMD op check test, not others?

@abadams
Copy link
Member

abadams commented Feb 24, 2024

The other simd op check tests just use Exprs and some dummy calls to functions that get subbed out with per-test-case local imageparams, so there's no shared mutable IR. This test defines Funcs that are used by multiple test cases. It's the sharing of those Funcs that caused the problem.

@abadams
Copy link
Member

abadams commented Feb 24, 2024

Though now that you say that, I guess we should move the deep copying code I added into the base class so that this isn't an issue elsewhere later.

@steven-johnson
Copy link
Contributor

Is this ready to land? Are there remaining issues? Should I run a global presubmit inside Google first?

@zvookin
Copy link
Member Author

zvookin commented Feb 27, 2024

It should be ready to land, pending validation that it does not break anything in existing ARM support, so please run a global presubmit. We can discuss testing correctness one to one. My first attempt to test this under QEMU failed completely because FreeBSD on QEMU on Mac OS (M3) is an exercise in disk corruption.

@steven-johnson
Copy link
Contributor

Looks to be clean inside Google, are we ready to land?

@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Mar 6, 2024
@abadams
Copy link
Member

abadams commented Mar 6, 2024

I still need to benchmark it on our workloads.

@abadams
Copy link
Member

abadams commented Mar 12, 2024

On average it makes no difference to our neon workloads, though neon codegen definitely changed in a bunch of places.

I did find one minor inefficiency while eyeballing things, which uncovered an inefficiency in main. Consider this pipeline:

#include "Halide.h"
using namespace Halide;

int main() {
    Func f, g;
    Var x;

    f(x) = x / 17.0f;
    f.compute_root();

    g(x) = {cast<int>(ceil(f(x))), cast<int>(floor(f(x)))};

    g.vectorize(x, 4);

    g.compile_to_assembly("/dev/stdout", {}, "", Target{"arm-64-linux-no_asserts-no_runtime-no_bounds_query"});

    return 0;
}

On both main and the branch this generates the inner loop:

	ldr	q0, [x9], #16
	subs	x8, x8, #1
	frintp	v1.4s, v0.4s
	frintm	v0.4s, v0.4s
	fcvtzs	v1.4s, v1.4s
	fcvtzs	v0.4s, v0.4s
	str	q1, [x10], #16
	str	q0, [x11], #16

It rounds up and rounds down as a float, then converts to an int. It should instead be using a different rounding mode on the int conversions and just using fcvtms and fcvtps

If you remove the .vectorize call then on main we get:

	ldr	s0, [x8], #4
	subs	x19, x19, #1
	fcvtps	w9, s0
	fcvtms	w10, s0
	str	w9, [x20], #4
	str	w10, [x21], #4

and on the branch we get:

	ldr	s0, [x8], #4
	subs	x19, x19, #1
	frintp	s1, s0
	frintm	s0, s0
	fcvtzs	v1.2s, v1.2s
	fcvtzs	v0.2s, v0.2s
	st1	{ v1.s }[0], [x20], #4
	st1	{ v0.s }[0], [x21], #4

It looks like the branch goes down the vector path even in the scalar case, and hits the issue we already have with vector floor/ceil.

For the vector case, it looks like the llvm IR is:

  %22 = tail call <4 x float> @llvm.floor.v4f32(<4 x float> %18)
  %23 = fptosi <4 x float> %22 to <4 x i32>

but we should instead be calling the intrinsic llvm.aarch64.neon.fcvtms.v4i32.v4f32

I'll open a PR onto this branch

@abadams
Copy link
Member

abadams commented Mar 13, 2024

Oops, looks like I broke something. Will fix.

@abadams
Copy link
Member

abadams commented Mar 13, 2024

Can't seem to repro locally, and in the llvm commit log there have been reverts of aarch64 stuff. I'll just do a merge with main and see what happens.

This revealed a bug. FindIntrinsics was not enabled for scalars anyway,
so it was semi-pointless.
@abadams abadams merged commit 76a7dd4 into main Mar 15, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants