Commit 9a7809b

[LV, VP] VP intrinsics support for the Loop Vectorizer
This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active-vector-length predication support take advantage of their capabilities, and architectures with general masked predication support can only exploit predication on memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions.

The other important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; the VP intrinsics define this vector-length parameter as the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics:

- The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for the current tail-folding implementation.
- The second way is to insert instructions that compute `min(VF, trip_count - index)` for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic `get_vector_length` that can be lowered to architecture-specific instruction(s) to compute the EVL.

We also add a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

* Use VP intrinsics for all possible vector operations. That work has two possible implementations:
  1. Introduce a new pass which transforms the emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it requires few changes in the Loop Vectorizer itself; the disadvantage is that it may require duplicating some existing Loop Vectorizer functionality in a separate pass, keeping similar code in different passes, and performing the same analysis at least twice.
  2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that no separate pass is needed, which may reduce compile time, and code duplication is avoided. It requires some extra work in the Loop Vectorizer to add VectorBuilder support and smart emission of vector instructions/VP intrinsics. Also, fully supporting the Loop Vectorizer will require adding a new PHI recipe to handle the EVL of the previous iteration, plus extending several existing recipes with new operands (depending on the design).
* Switch to VP intrinsics for memory operations for both VLS and VLA vectorization.

Differential Revision: https://reviews.llvm.org/D99750
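For concreteness, the loop shape the second and third strategies aim for looks roughly like the IR below. This is a hand-written sketch, not output of this patch: the function, the <vscale x 4 x i32> element type, and the all-true mask are illustrative assumptions, and the second strategy would replace the `get_vector_length` call with ordinary sub/umin arithmetic.

define void @copy(ptr %dst, ptr %src, i64 %tc) {
entry:
  ; All-true mask: predication below comes from %evl alone.
  %head = insertelement <vscale x 4 x i1> poison, i1 true, i64 0
  %alltrue = shufflevector <vscale x 4 x i1> %head, <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; EVL for this iteration, computed from the remaining trip count.
  %remaining = sub i64 %tc, %iv
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 4, i1 true)
  %src.gep = getelementptr inbounds i32, ptr %src, i64 %iv
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %src.gep, <vscale x 4 x i1> %alltrue, i32 %evl)
  %dst.gep = getelementptr inbounds i32, ptr %dst, i64 %iv
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %dst.gep, <vscale x 4 x i1> %alltrue, i32 %evl)
  ; The IV advances by EVL, not by VF, hence the new EVL-based IV phi recipe.
  %evl.zext = zext i32 %evl to i64
  %iv.next = add nuw i64 %iv, %evl.zext
  %done = icmp uge i64 %iv.next, %tc
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)
declare <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr, <vscale x 4 x i1>, i32)
declare void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32>, ptr, <vscale x 4 x i1>, i32)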
Parent: c177507

24 files changed: +1591 / -31 lines

llvm/include/llvm/Analysis/TargetTransformInfo.h

Lines changed: 4 additions & 1 deletion
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };

 struct TailFoldingInfo {
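Targets normally pick a style via TTI's getPreferredTailFoldingStyle hook; a target with EVL hardware could eventually return the new enumerator from there. A hypothetical sketch only (this patch itself just adds the enumerator and a command-line override, shown later in LoopVectorize.cpp):

TailFoldingStyle
MyTTIImpl::getPreferredTailFoldingStyle(bool IVUpdateMayOverflow) const {
  // Hypothetical target/subtarget query, not part of this commit.
  return ST->hasVLPredication() ? TailFoldingStyle::DataWithEVL
                                : TailFoldingStyle::DataAndControlFlow;
}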

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 4 additions & 0 deletions
@@ -234,6 +234,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }

+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
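The RISC-V implementation is deliberately coarse: any subtarget with V instructions reports EVL support, ignoring the opcode, type, and alignment arguments (the cost model currently passes placeholders anyway; see the computeMaxVF change below). As a sketch of the intended extension point, a hypothetical out-of-tree target would opt in the same way (MyTTIImpl and hasVLPredication are made-up names):

bool MyTTIImpl::hasActiveVectorLength(unsigned Opcode, Type *DataTy,
                                      Align Alignment) const {
  // Hypothetical subtarget query; a real target could refine this per
  // Opcode/DataTy/Alignment once the cost model passes real values.
  return ST->hasVLPredication();
}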

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Lines changed: 16 additions & 0 deletions
@@ -75,6 +75,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                       const APInt &Imm, Type *Ty,
                                       TTI::TargetCostKind CostKind);

+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+  /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+  /// Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated version
+  /// support.
+  /// \param DataType the type of the instruction with the \p Opcode checked for
+  /// prediction support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

   bool shouldExpandReduction(const IntrinsicInst *II) const;

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 146 additions & 9 deletions
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
-        clEnumValN(
-            TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-            "data-and-control-without-rt-check",
-            "Similar to data-and-control, but remove the runtime check")));
+        clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                   "data-and-control-without-rt-check",
+                   "Similar to data-and-control, but remove the runtime check"),
+        clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                   "Use predicated EVL instructions for tail folding if the "
+                   "target supports vector length predication")));

 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1098,9 +1101,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
       // handled.
       if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
           isa<VPInterleaveRecipe>(CurRec) ||
-          isa<VPScalarIVStepsRecipe>(CurRec) ||
-          isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-          isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+          isa<VPScalarIVStepsRecipe>(CurRec) || isa<VPHeaderPHIRecipe>(CurRec))
         continue;

       // This recipe contributes to the address computation of a widen
@@ -1633,6 +1634,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }

+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1778,6 +1796,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;

+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4733,6 +4755,41 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable
+      // for now.
+      // TODO: extend it for fixed vectors, if required.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }

@@ -5342,6 +5399,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;

+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8596,6 +8657,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
     VPlanTransforms::truncateToMinimalBitwidths(
         *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
     VPlanTransforms::optimize(*Plan, *PSE.getSE());
+    if (CM.useVPIWithVPEVLVectorization())
+      VPlanTransforms::addExplicitVectorLength(*Plan);
     assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
     VPlans.push_back(std::move(Plan));
   }
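VPlanTransforms::addExplicitVectorLength itself lives in VPlanTransforms.cpp/.h, which are among the 24 changed files but not shown in this excerpt. Inferred from the recipes and opcodes added below, its job is roughly the following (a sketch of intent, not the verbatim implementation):

// 1. Insert a VPEVLBasedIVPHIRecipe next to the canonical IV phi, starting
//    at the same start value.
// 2. Emit VPInstruction::ExplicitVectorLength to compute this iteration's
//    EVL from the remaining trip count, and expose it to memory recipes
//    (surfaced as State.EVL during VPlan execution).
// 3. Advance the new phi with VPInstruction::ExplicitVectorLengthIVIncrement
//    so the induction variable steps by EVL rather than by VF.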
@@ -9451,6 +9514,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
     State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }

+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
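For reference, the calls these helpers produce have roughly the following shapes (types are illustrative; the mangled suffixes follow from the actual data/pointer types, and the align parameter attribute is what the addParamAttr calls attach):

  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %val, ptr align 4 %addr, <vscale x 4 x i1> %mask, i32 %evl)
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 %addr, <vscale x 4 x i1> %mask, i32 %evl)
  call void @llvm.vp.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> %val, <vscale x 4 x ptr> align 4 %ptrs, <vscale x 4 x i1> %mask, i32 %evl)
  %g = call <vscale x 4 x i32> @llvm.vp.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> align 4 %ptrs, <vscale x 4 x i1> %mask, i32 %evl)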

@@ -9482,14 +9591,31 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     }
   }

+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());

     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder, State.get(getAddr(), Part), StoredVal, CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9519,7 +9645,18 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     State.setDebugLocFrom(LI->getDebugLoc());
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Value *NewLI;
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse loading after vp_reverse is added.
+        NewLI = lowerLoadUsingVectorIntrinsics(
+            Builder, DataTy, State.get(getAddr(), Part), CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,

llvm/lib/Transforms/Vectorize/VPlan.h

Lines changed: 43 additions & 0 deletions
@@ -244,6 +244,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;

+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1136,6 +1142,8 @@ class VPInstruction : public VPRecipeWithIRFlags {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1245,6 +1253,8 @@ class VPInstruction : public VPRecipeWithIRFlags {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2316,6 +2326,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };

+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
 public:
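The recipe's execute implementation is defined out-of-line in one of the files not shown in this excerpt. A sketch of the expected shape, assuming it follows the pattern of the sibling header-phi recipes (create the scalar phi, let a later fixup add the backedge value):

void VPEVLBasedIVPHIRecipe::execute(VPTransformState &State) {
  BasicBlock *VectorPH = State.CFG.getPreheaderBBFor(this);
  Value *Start = State.get(getOperand(0), VPIteration(0, 0));
  PHINode *Phi = State.Builder.CreatePHI(Start->getType(), 2, "evl.based.iv");
  Phi->addIncoming(Start, VectorPH);
  Phi->setDebugLoc(getDebugLoc());
  State.set(this, Phi, 0);
  // The backedge value (IV + EVL, via ExplicitVectorLengthIVIncrement) is
  // wired up after the loop body has been generated.
}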

llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp

Lines changed: 8 additions & 8 deletions
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWienIntOrFpInduction
+            // which needs special handling due it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
