diff --git a/llvm/docs/CommandGuide/index.rst b/llvm/docs/CommandGuide/index.rst
index 88fc1fd326b76..f85f32a1fdd51 100644
--- a/llvm/docs/CommandGuide/index.rst
+++ b/llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
    llvm-dis
    llvm-dwarfdump
    llvm-dwarfutil
+   llvm-ir2vec
    llvm-lib
    llvm-libtool-darwin
    llvm-link
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
new file mode 100644
index 0000000000000..13fe4996b968f
--- /dev/null
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+   training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+   at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+   For information about using IR2Vec programmatically within LLVM passes and
+   the C++ API, see the `IR2Vec Embeddings `_
+   section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Example Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=
+
+   Specify the operation mode. Valid values are:
+
+   * ``triplets`` - Generate triplets for vocabulary training
+   * ``embeddings`` - Generate embeddings using trained vocabulary (default)
+
+.. option:: --level=
+
+   Specify the embedding generation level. Valid values are:
+
+   * ``inst`` - Generate instruction-level embeddings
+   * ``bb`` - Generate basic block-level embeddings
+   * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=
+
+   Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=
+
+   Specify the path to the vocabulary file (required for embedding mode).
+   The vocabulary file should be in JSON format and contain the trained
+   vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+   for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=
+
+   Specify the weight for opcode embeddings (default: 1.0). This controls
+   the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=
+
+   Specify the weight for type embeddings (default: 0.5). This controls
+   the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=
+
+   Specify the weight for argument embeddings (default: 0.2). This controls
+   the relative importance of operand information in the final embedding.
+
+.. option:: -o
+
+   Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+   Print a summary of command line options.
+
+.. note::
+
+   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
+   mode. These options are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+   ...
+
+Each line represents the information of one instruction, with the opcode, type,
+and operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings `_.
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index ed0769bebeac3..965a21b8c84b8 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
 Using IR2Vec
 ------------
 
+.. note::
+
+   This section describes how to use IR2Vec within LLVM passes. A standalone
+   tool :doc:`CommandGuide/llvm-ir2vec` is available for generating the
+   embeddings and triplets from LLVM IR files, which can be useful for
+   training vocabularies and generating embeddings outside of compiler passes.
+
 For generating embeddings, first the vocabulary should be obtained. Then, the
 embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
 
@@ -524,6 +531,10 @@ Further Details
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings `_.
+
+For information about using IR2Vec tool for generating embeddings and
+triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
 The LLVM source code for ``IR2Vec`` can also be explored to understand the
 implementation details.
 
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
 where the ``name`` is a path fragment. We will expect to find 2 files,
 ``.in`` (readable, data incoming from the managing process) and
 ``.out`` (writable, the model runner sends data to the managing process)
-
diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
index eba8c2e5678b1..c9e2c7c713e18 100644
--- a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
+++ b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -276,12 +276,8 @@ int main(int argc, char **argv) {
       "Generates embeddings for a given LLVM IR and "
       "supports triplet generation for vocabulary "
       "training and embedding generation.\n\n"
-      "Usage:\n"
-      " Triplet mode: llvm-ir2vec --mode=triplets input.bc\n"
-      " Embedding mode: llvm-ir2vec --mode=embeddings "
-      "--ir2vec-vocab-path=vocab.json --level=func input.bc\n"
-      " Levels: --level=inst (instructions), --level=bb (basic blocks), "
-      "--level=func (functions)\n");
+      "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
+      "information.\n");
 
   // Validate command line options
   if (Mode == TripletMode && Level.getNumOccurrences() > 0)