[TableGen][DecoderEmitter] Add option to emit type-specialized code #146593

jurahul · 2025-07-01T19:40:10Z

This change attempts to reduce the size of the disassembler code generated by DecoderEmitter.

Current state:

Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction, which is the entry point into the generated code, and decodeToMCInst, which is invoked when a decode op is reached while traversing the decoder table. Both functions are templated on InsnType, the type of the raw instruction bits supplied to decodeInstruction.
Several backends call decodeInstruction with different types, leading to several template instantiations of this function in the final code. As an example, AMDGPU instantiates this function with DecoderUInt128 for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions.
Since there is just one decodeToMCInst generated, it has code that handles all instruction sizes. The decoders emitted for different instruction sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 has decoder code for 32/64-bit instructions that is never exercised. Conversely, the instantiation with InsnType == uint64_t has decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

With this change, the DecoderEmitter will stop generating a single templated decodeInstruction and will instead generate several overloaded versions of this function and the associated decodeToMCInst function as well. Instead of using the templated InsnType, it will use an auto-inferred type which can be either a standard C++ integer type, APInt, or a std::bitset. As a result, decoders for 32-bit instructions will appear only in the 32-bit variant of decodeToMCInst and 64-bit decoders will appear only in the 64-bit variant, which fixes the code duplication in the templated variant.
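
To make the difference concrete, here is a minimal sketch of the two shapes. It is not the emitter's actual output: the `Decoded` enum, the function name `decodeToMCInstTemplated`, and the four toy decoders are all made up for illustration. The templated body carries every case in each instantiation, while each overload carries only the cases for its own width, with indices numbered per width.

```cpp
#include <cstdint>

// Hypothetical decoder results; real decoders populate an MCInst instead.
enum class Decoded { Add32, Mul32, Load64, Store64, Unknown };

// Templated shape: one body handles every width, so every instantiation
// contains the cases for all widths even though it can only reach some.
template <typename InsnType>
Decoded decodeToMCInstTemplated(unsigned Idx, InsnType /*Insn*/) {
  switch (Idx) {
  case 0: return Decoded::Add32;   // 32-bit decoder
  case 1: return Decoded::Mul32;   // 32-bit decoder
  case 2: return Decoded::Load64;  // 64-bit decoder
  case 3: return Decoded::Store64; // 64-bit decoder
  default: return Decoded::Unknown;
  }
}

// Overloaded shape: each width gets its own function with only its own
// decoders, and the decoder indices restart at 0 for each width.
inline Decoded decodeToMCInst(unsigned Idx, uint32_t /*Insn*/) {
  switch (Idx) {
  case 0: return Decoded::Add32;
  case 1: return Decoded::Mul32;
  default: return Decoded::Unknown;
  }
}

inline Decoded decodeToMCInst(unsigned Idx, uint64_t /*Insn*/) {
  switch (Idx) {
  case 0: return Decoded::Load64;
  case 1: return Decoded::Store64;
  default: return Decoded::Unknown;
  }
}
```

With the overloads, index 0 means a different decoder depending on the instruction type, which is also why the per-width indices can stay small.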

Additionally, the DecodeIndex will now be computed per instruction bitwidth instead of being computed globally across all bitwidths as before. So, the values will generally be smaller than before and hence will consume fewer bytes in their ULEB128 encoding in the decoder tables, resulting in a further reduction in the size of the decode tables.
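
As a rough illustration of why smaller indices shrink the tables: ULEB128 stores 7 payload bits per byte, so any value below 128 fits in one byte and any value below 16384 fits in two. The helper below is a standalone sketch with a made-up name (`uleb128Size`); it is not the code the emitter uses.

```cpp
#include <cstdint>

// Number of bytes needed to encode Value as ULEB128.
// Each output byte carries 7 bits of payload plus a continuation bit.
inline unsigned uleb128Size(uint64_t Value) {
  unsigned Size = 0;
  do {
    Value >>= 7;
    ++Size;
  } while (Value != 0);
  return Size;
}
```

For example, if a global numbering pushed a decoder index from 100 to 300, its table entry would grow from one byte to two.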

Since the non-templated decoder also requires some changes in the target's C++ code, this PR adds an option, GenerateTemplatedDecoder, to InstrInfo. It defaults to false, but targets can set it to true to fall back to templated code. The goal is to migrate all targets to the non-templated decoder and deprecate this option in the future.

Adopt this feature for the AMDGPU backend. In a release build, this results in a net 35% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. Actual numbers measured locally for a Linux x86_64 build using clang-18.1.3 toolchain are:

.text 378780 -> 244684, i.e., a 35% reduction in size
.rodata 352648 -> 334960 i.e., a 5% reduction in size
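
For anyone who wants to recheck the percentages, the arithmetic is just (before - after) / before. A small sketch (the helper name is illustrative):

```cpp
// Percentage reduction when a size goes from Before to After.
inline double percentReduction(double Before, double After) {
  return (Before - After) / Before * 100.0;
}
```

Applied to the numbers above, .text gives roughly 35.4% and .rodata roughly 5.0%.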

For targets that do not use multiple instantiations of decodeInstruction, opting in to this feature may not result in a code/data size improvement, but it may improve compile time by avoiding the use of templated code.

topperc · 2025-07-01T20:21:28Z

What's the motivation?

jurahul · 2025-07-01T20:27:36Z

Please see the PR description that I just added. I also have data to support this, will add soon (tabulating it ATM)

jurahul · 2025-07-01T20:29:50Z

Failure is in the unit test (TableGen/VarLenDecoder.td), I need to update it.

topperc · 2025-07-01T20:39:25Z

Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size.

topperc · 2025-07-01T20:42:33Z

Oh is it because RISC-V uses 3 bit widths, but only 2 types for DecodeInstruction?

topperc · 2025-07-01T20:50:32Z

Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument?

jurahul · 2025-07-01T20:54:02Z

Right, this is a POC at this point, which shows that the proposed optimization works. I am open to changing the interface here as well. The command line one was simple enough to not mess with the tablegen instruction class etc., but that is an option, though it feels more intrusive. The command line is moderately complex and localized to the decoder emitter.

jurahul · 2025-07-01T21:07:20Z

Repeating the type in each per-instruction record might be redundant (and we would need more verification as well, to check that for a given size, all insts of that size have the C++ type specified and that it's consistent). One option is to add a new InstructionDecoderTypeAndSize class that records this information, and DecoderEmitter can use that if it's present, else fall back to templated code. Something like

class InstructionDecoderTypeAndSize<string CPPType, list<int> Bitwidths> {
}

// The parameter name "Entries" is illustrative; TableGen requires one.
class InstructionDecoderTypeAndSizes<list<InstructionDecoderTypeAndSize> Entries> {
}

and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.

// RISCV.td
// Opt-in to non-templated decoder code.
def : InstructionDecoderTypeAndSizes<[
                InstructionDecoderTypeAndSize<"uint64_t", [48]>,
                InstructionDecoderTypeAndSize<"uint32_t", [16,32]>]>;

or more simply

class InstructionDecoderTypeAndSizes<list<string> CPPTypes, list<list<int>> Bitwidths> {
}

def : InstructionDecoderTypeAndSizes<
           [ "uint32_t", "uint64_t"],
           [ [16,32],    [64]     ]>;

topperc · 2025-07-01T22:55:53Z

RISCV uses a common base class for each of the 3 instruction sizes. Other targets may be similar.

class RVInst<dag outs, dag ins, string opcodestr, string argstr,                 
             list<dag> pattern, InstFormat format>                               
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<32> Inst;                                                           
  // SoftFail is a field the disassembler can use to provide a way for           
  // instructions to not match without killing the whole decode process. It is   
  // mainly used for ARM, but Tablegen expects this field to exist or it fails   
  // to build the decode table.                                                  
  field bits<32> SoftFail = 0;                                                   
  let Size = 4;                                                                  
}                                                                                
                                                                                 
class RVInst48<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<48> Inst;                                                           
  field bits<48> SoftFail = 0;                                                   
  let Size = 6;                                                                  
}                                                                                
                                                                                 
class RVInst64<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<64> Inst;                                                           
  field bits<64> SoftFail = 0;                                                   
  let Size = 8;                                                                  
}

jurahul · 2025-07-01T23:58:44Z

Right, but nonetheless, we will have the type specified per instruction instance and we will still need to validate, for example, that for all instructions with a particular size, the type string is the same. To me that seems like unnecessary duplication of this information, followed by additional verification to make sure that it's consistent. Also, unlike the size in bytes, which is a core property of the instruction, the C++ type used to represent its bits in memory does not seem to be a core property. Many backends seem to choose the same type (for example uint64_t) for all their 16/32/48/64-bit insts. Adoption-wise as well, sticking it in the per-inst record seems more invasive (for example, in our backend and several other downstream backends the core instruction records are auto-generated, so the adoption curve for this increases further).

jurahul · 2025-07-02T00:55:36Z

Requesting not a review per se but opinions on the user interface for this optimization. Choices proposed are:

1. Command line option, as in this PR. +: non-intrusive in terms of .td files. -: need to parse it, and parsing can be flaky.
2. Per-instruction record carries the C++ type (@topperc's suggestion). +: no command line option parsing flakiness. -: (IMO) too invasive; I see difficulty or increased complexity in adoption for auto-generated inst defs used in several downstream backends; needs additional validation for consistency across all insts of a given size.
3. Option embedded as a new singleton record in the .td file. +: no command line option parsing flakiness, less intrusive than option 2 (as the single def is standalone, not attached to anything else), no consistency checks. -: ?

topperc · 2025-07-02T00:56:55Z

Many backends seems to choose the same type (for example uint64_t) for all their 16/32/48/64 bit insts.

I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added.

I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad.

jurahul · 2025-07-02T01:00:21Z

Right, is the option #3 above palatable? We essentially encode it as a standalone record that the DecoderEmitter will look for.

topperc · 2025-07-02T01:05:34Z

Maybe it should be stored in the InstrInfo class like isLittleEndianEncoding or the Target class?

…6. NFC Insn is passed to decodeInstruction which is a template function based on the type of Insn. By using uint64_t we ensure only one version of decodeInstruction is created. This reduces the file size of RISCVDisassembler.cpp.o by ~25% in my local build. This should get even more size benefit than llvm#146593.

jurahul · 2025-07-02T03:24:58Z

InstrInfo seems reasonable. Let me rework the PR to add that

jurahul · 2025-07-02T17:16:34Z

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

topperc · 2025-07-02T17:21:50Z

Does this have any binary size effect on RISCV after #146619?

jurahul · 2025-07-02T17:31:10Z

I have not tested. My speculation is, no binary size change but just minor compile time improvement by avoiding template specialization. I'll check and report back.

jurahul · 2025-07-02T21:01:57Z

Looks like templating adds a little bit to the code size. Building the RISCVDisassembler.cpp.o in a release config with/without this change results in the following:

Old:
196112 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o 
New:
196096 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o

So, 16 bytes less. Not significant though.

topperc · 2025-07-02T21:14:19Z

Could just be a difference in the name mangling of the function name? Or are you checking the .text size?

jurahul · 2025-07-02T21:22:58Z

yeah, your guess was right. I dumped the sizes with size -A and I see:

New:

.text._ZN12_GLOBAL__N_114decodeToMCInstEjN4llvm14MCDisassembler12DecodeStatusEmRNS0_6MCInstEmPKS1_Rb     27023

.text._ZN12_GLOBAL__N_117decodeInstructionEPKhRN4llvm6MCInstEmmPKNS2_14MCDisassemblerERKNS2_15MCSubtargetInfoE        9238 

Old:                                                                                                                                                                                                                                                  

.text._ZN12_GLOBAL__N_114decodeToMCInstImEEN4llvm14MCDisassembler12DecodeStatusEjS3_T_RNS1_6MCInstEmPKS2_Rb                             27023
 .text._ZN12_GLOBAL__N_117decodeInstructionImEEN4llvm14MCDisassembler12DecodeStatusEPKhRNS1_6MCInstET_mPKS2_RKNS1_15MCSubtargetInfoE      9238

That is, text sizes are the same but mangled names are different and that likely leads to larger object file sizes.

jurahul · 2025-07-02T21:27:49Z

Note though that what you did for RISCV may not be applicable/desirable for all targets. For example, AMDGPU has 128 bit instructions, so I am assuming if we just use a 128-bit type for all instructions, we may pay a penalty in terms of the bit extraction costs (32 vs 64-bit may not be as bad).

jurahul · 2025-07-02T22:21:19Z

@topperc My question is still unanswered. WDYT of this new interface to opt in to this optimization?

topperc · 2025-07-22T20:38:42Z

dispatched to the correct decodeToMCInst* based on the bit width either by making it an argument to decodeInstruction or storing it in the first byte of the table? RISC-V could continue using uint64_t for the decodeInstruction template.

This sounds like a good idea to me. (Not sure I understand the "templated decodeInstruction" part.)

Prior to this change decodeInstruction had a template parameter for the type of Insn. I was just suggesting to go back to that.

s-barannikov · 2025-07-22T20:48:54Z

I see. As an alternative, generate single, non-templated decodeInstruction for the widest integer type (e.g., uint64_t for RISC-V, uint32_t for ARM). This would prevent accidental instantiation of the template for different types.

jurahul · 2025-07-23T03:30:47Z

Let me try going back to a templated decodeInstruction / single copy of decodeInstruction. Note that it will have to dynamically dispatch to both decodeToMCInst and fieldFromInstruction, but that will likely help reduce the code size. A single non-templated one for the widest type is a possibility as well, but that may be a bitset<> type for some targets, like AMDGPU.

In any case, we are potentially trading off some runtime decoding perf for smaller code size. The choice is whether it's in terms of always operating at the highest instruction bitwidth vs dynamic dispatch to the 2 functions above. And in either case, when we call decodeToMCInst, we will still call the appropriate variant with the proper data type. So I am thinking I will just go with the first suggestion above, as follows:

// Note: insn param is gone
static DecodeStatus decodeInstructionImpl(const uint8_t DecodeTable[], MCInst &MI, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI, llvm::function_ref<uint64_t(unsigned Start, unsigned Len)> fieldFromInstructionPtr, llvm::function_ref<DecodeStatus(...)> decodeToMCInstPtr) {
...
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI, uint16_t insn, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI) {
  // dispatch to decodeInstructionImpl with appropriate lambdas.
}

topperc · 2025-07-23T04:05:41Z

Why would we need to dynamically dispatch to fieldFromInstruction? Can't it use the same template type as decodeInstruction and just work?

jurahul · 2025-07-23T05:16:31Z

What I am thinking is: if we make the entire decodeInstruction templated, we still get code duplication when it's instantiated for each type (RISCV will just do 1 instantiation, but other targets may not). So, I am trying to implement this with no templating (which also matches well with the option name). That entails an impl function that is bitwidth-agnostic (this will address code size), while we still have a per-bitwidth entry point that just calls the impl function with appropriate lambdas to encapsulate any insn-specific code, which includes calls to fieldFromInstruction, decodeToMCInst, and the +ve/-ve mask checks for softfail.
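
A toy version of that structure, to show the moving parts. Everything here is a stand-in: `Status` replaces DecodeStatus, `std::function` replaces llvm::function_ref so the snippet compiles on its own, and the "decoder table" is reduced to a single field check. It is only meant to show a width-agnostic core plus thin per-width entry points that hide the instruction type behind lambdas.

```cpp
#include <cstdint>
#include <functional>

// Stand-in for DecodeStatus; hypothetical, not LLVM's enum.
enum class Status { Fail, Success };

// Callbacks that hide the concrete instruction type from the shared core.
using FieldFn = std::function<uint64_t(unsigned Start, unsigned Len)>;
using DecodeFn = std::function<Status(unsigned DecodeIdx)>;

// Width-agnostic core, compiled once. The toy "table" is a single rule:
// bits [Start, Start + Len) must equal Expected, then decoder DecodeIdx runs.
inline Status decodeInstructionImpl(unsigned Start, unsigned Len,
                                    uint64_t Expected, unsigned DecodeIdx,
                                    const FieldFn &Field,
                                    const DecodeFn &Decode) {
  if (Field(Start, Len) != Expected)
    return Status::Fail;
  return Decode(DecodeIdx);
}

// 16-bit entry point. Toy rule: the low two bits must be 0b01.
inline Status decodeInstruction(uint16_t Insn, unsigned &WidthUsed) {
  auto Field = [Insn](unsigned Start, unsigned Len) -> uint64_t {
    return (uint64_t(Insn) >> Start) & ((uint64_t(1) << Len) - 1);
  };
  auto Decode = [&WidthUsed](unsigned) {
    WidthUsed = 16; // a real decoder would fill in an MCInst here
    return Status::Success;
  };
  return decodeInstructionImpl(0, 2, 0x1, 0, Field, Decode);
}

// 32-bit entry point. Toy rule: the low two bits must be 0b11.
inline Status decodeInstruction(uint32_t Insn, unsigned &WidthUsed) {
  auto Field = [Insn](unsigned Start, unsigned Len) -> uint64_t {
    return (uint64_t(Insn) >> Start) & ((uint64_t(1) << Len) - 1);
  };
  auto Decode = [&WidthUsed](unsigned) {
    WidthUsed = 32;
    return Status::Success;
  };
  return decodeInstructionImpl(0, 2, 0x3, 0, Field, Decode);
}
```

The cost being traded is visible here: every field extraction in the core goes through an indirect call.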

jurahul · 2025-07-23T05:39:39Z

OTOH: operating at the highest bitwidth always may not be bad. Basically, we just call fieldFromInstruction on a larger bitwidth value, but that may not be as expensive as an indirect function call. And then just at the point of decodeToMCInst call, we call the version with the appropriate bitwidth and type.

Given that there are no benchmarks here to measure which one is faster, we need to make a judgement call. Always doing the highest bitwidth (in decodeInstruction) might be less expensive than dynamic calls to fieldFromInstruction? So the 2 choices are essentially:

1. decodeInstructionImpl operates at the highest bitwidth and then does a dynamic dispatch to the appropriate decodeToMCInst() with appropriate downcasting, or
2. what I suggested above, where the calls to both fieldFromInstruction and decodeToMCInst are dynamic.

Note that fieldFromInstruction is called both from decodeInstruction as well as from decodeToMCInst. The one in decodeToMCInst will be with the type corresponding to the bitwidth of the instruction, so those are not affected by this. The ones in decodeInstruction (in handling extractfield and checkfield ops) are affected by this choice.
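
To give a feel for what "operating at the highest bitwidth" costs per field extraction, here is a sketch of the two flavours. The name `extractField` is hypothetical and this is not the generated fieldFromInstruction; it assumes 0 < Len < 64 and that the field lies within the instruction.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Integer path: one shift and one mask on a machine word.
template <typename T>
inline std::enable_if_t<std::is_unsigned<T>::value, uint64_t>
extractField(T Insn, unsigned Start, unsigned Len) {
  return (static_cast<uint64_t>(Insn) >> Start) & ((uint64_t(1) << Len) - 1);
}

// Wide path: shift and mask the whole bitset, then narrow the result.
// The shift and the AND both operate on all N bits.
template <std::size_t N>
inline uint64_t extractField(const std::bitset<N> &Insn, unsigned Start,
                             unsigned Len) {
  const std::bitset<N> Mask((uint64_t(1) << Len) - 1);
  return ((Insn >> Start) & Mask).to_ullong();
}
```

Both return the same value for the same bits; the difference is only in how much work the shift and mask do.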

@topperc and @s-barannikov WDYT? Am I overthinking this?

jurahul · 2025-07-25T22:52:46Z

I finally got around to prototyping this. I went with the idea that we will have a single non-templated decodeInstructionImpl that operates at the max instruction width and that accepts a function_ref<> for dynamic dispatch to the correct version of decodeToMCInst. As an example, here's the code generated for AMDGPU:

static DecodeStatus decodeInstructionImpl(const uint8_t DecodeTable[], MCInst &MI, const std::bitset<128> &insn, 
                                          uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI,
                                          function_ref<DecodeStatus(unsigned, DecodeStatus, MCInst &, uint64_t, const MCDisassembler *, bool &)> decodeToMCInstPtr) {
...
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI, uint32_t insn, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = insn;
  auto DecodeToMCInst = [insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI, uint64_t Address, const MCDisassembler *DisAsm, bool &DecodeComplete) {
    return decodeToMCInst32(DecodeIdx, S, insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI, uint64_t insn, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = insn;
  auto DecodeToMCInst = [insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI, uint64_t Address, const MCDisassembler *DisAsm, bool &DecodeComplete) {
    return decodeToMCInst64(DecodeIdx, S, insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI, const std::bitset<96> &insn, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI) {
  const std::bitset<96> Mask(maskTrailingOnes<uint64_t>(64));
  std::bitset<128> InsnMaxWidth((insn & Mask).to_ulong());
  InsnMaxWidth |= std::bitset<128>(((insn >> 64) & Mask).to_ulong()) << 64;

  auto DecodeToMCInst = [&insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI, uint64_t Address, const MCDisassembler *DisAsm, bool &DecodeComplete) {
    return decodeToMCInst96(DecodeIdx, S, insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI, const std::bitset<128> &insn, uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = insn;
  auto DecodeToMCInst = [&insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI, uint64_t Address, const MCDisassembler *DisAsm, bool &DecodeComplete) {
    return decodeToMCInst128(DecodeIdx, S, insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

We still have the per-bit-width overloads of decodeInstruction that upcast the bits to the max bit width and call decodeInstructionImpl. With this, the code size regression for RISCV is not as much as the earlier version:

Target  Section  New      Old

RISCV   text     57270    55660
RISCV   rodata   37825    38058

AMDGPU  text     268044   440444
AMDGPU  rodata   360568   378952
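
The upcast done in the 96-bit overload above can be written generically. The helper below is a sketch with a made-up name (`widen`), not part of the generated code; it copies the source bitset into the wider one 64 bits at a time. It uses to_ullong so each 64-bit chunk fits regardless of how wide unsigned long is on the platform.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Zero-extend a narrower bitset into a wider one, 64 bits at a time.
template <std::size_t To, std::size_t From>
inline std::bitset<To> widen(const std::bitset<From> &In) {
  static_assert(To >= From, "target must be at least as wide as the source");
  // Mask selecting the low 64 bits (or all bits when From < 64).
  const std::bitset<From> Low64(~0ULL);
  std::bitset<To> Out;
  for (std::size_t Shift = 0; Shift < From; Shift += 64) {
    // After masking, the value always fits in unsigned long long.
    const unsigned long long Chunk = ((In >> Shift) & Low64).to_ullong();
    Out |= std::bitset<To>(Chunk) << Shift;
  }
  return Out;
}
```

Usage would look like `std::bitset<128> Wide = widen<128>(Insn96);`, where `Insn96` is a `std::bitset<96>`.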

I'll do some more testing and put this version up for review.

jurahul · 2025-07-30T16:14:10Z

Planning to do some profiling with the new llvm-mc --runs option to see which of the above options has better decode times. I am assuming we want to err on the side of faster decoder over smaller code size (will use some tests from AMDGPU and RISCV)

jurahul force-pushed the decoder_emitter_type_specialization branch from 30d0838 to 2d7d1dc Compare July 1, 2025 22:19

jurahul requested review from s-barannikov, mshockwave and topperc July 2, 2025 00:48

topperc mentioned this pull request Jul 2, 2025

[RISCV] Use uint64_t for Insn in getInstruction32 and getInstruction16. NFC #146619

Merged

jurahul force-pushed the decoder_emitter_type_specialization branch from 2d7d1dc to 04366ee Compare July 2, 2025 14:38

jurahul added 15 commits July 25, 2025 13:26

[TableGen][DecoderEmitter] Add option to emit type-specialized `decod…

e9eb4ed

…eToMCInst`

Review feedback

6c0374a

Specialize decodeInstruction

e988597

Full specialization

eed9ba5

Rename NonTemplatedInsnType to a more generic DecoderOption

b8aac4a

Revert unneeded changes

9d17c1b

Add a struct CPPType to track C++ type better

7f7ae09

Try fix clang-format

5228c47

Review feedback

8c19c0a

Flip default, fix build, and fix RISCV disassembler

a1697c7

Delete unused header added in earlier version

c69de70

Fix test cases, review feedback

56f910f

Fix build failure

0144cf7

Update comment

1fdac70

Single non-templated decodeInstructionImpl

522e20f

Rename insn to Insn

a08bd99

jurahul force-pushed the decoder_emitter_type_specialization branch from 96a955e to a08bd99 Compare July 25, 2025 23:12

jurahul added 4 commits July 26, 2025 06:29

Fix bug in TryDecode, need to use TmpMI

a21f975

Fix lit tests for insn->Insn rename

9005d4e

ulong -> uulong for Windows

d78b836

uulong -> ullong

6623763

jurahul mentioned this pull request Jul 29, 2025

[llvm-mc] Add --runs option for benchmarking #151149

Merged
