From 1b6dc9c8c995b65267e8dbb1aea39cbf5f3179ec Mon Sep 17 00:00:00 2001
From: Jianhui Li
Date: Mon, 7 Jul 2025 23:43:12 -0700
Subject: [PATCH 01/27] Add matrix_desc and operations

---
 docs/rfcs/XeGPU.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md
index 16e52b48f..b9bdd3ff1 100644
--- a/docs/rfcs/XeGPU.md
+++ b/docs/rfcs/XeGPU.md
@@ -329,6 +329,20 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem
 `nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`.
+## XeGPU operations to access shared local memory
+Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can attach an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (also called a linear layout).
+Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use load_matrix and store_matrix to move matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's load_matrix and store_matrix work at the workgroup level only. They use xegpu.layout to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
+
+The motivation for the `matrix_desc` data type and its related operations is to simplify the programming model. Rather than trying to reuse `tensor_desc` to describe a matrix/tile in shared local memory, it is more straightforward to use a dedicated data type for it. The use of shared local memory is usually local to the lowering and not exposed to the workgroup-level user, for example when supporting the lowering of transpose, reduction, and convert-layout operations. The creation of a matrix_desc therefore does not take a memref as input and implicitly allocates shared local memory. The shared local memory may be blocked to facilitate optimized lowering to chunked loads or 1D block loads.
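+As a quick orientation, the sketch below strings the operations summarized in the table that follows into the intended end-to-end flow (allocate, take a subview, store, and reload). It is illustrative only: the shapes and attributes are borrowed from the examples in this RFC, and `%vec` is a hypothetical vector value produced earlier in the kernel.
+```mlir
+// Illustrative flow only; see the op definitions in the table below.
+// Allocate a 256x128 bf16 matrix in shared local memory, row-major.
+%mdesc = xegpu.create_matrix_desc : matrix_desc<256x128xbf16, @mem_layout=block[8, 16]>
+// Take a 128x128 subview starting at row 128 of the allocation.
+%sub = xegpu.matrix_desc_subview %mdesc[128, 0] : matrix_desc<256x128xbf16, @mem_layout=block[8, 16]> -> matrix_desc<128x128xbf16, @row_stride=128, @mem_layout=block[8, 16]>
+// Store a register tile into the subview, then read it back.
+xegpu.store_matrix %sub, %vec : matrix_desc<128x128xbf16, @row_stride=128, @mem_layout=block[8, 16]>, vector<128x128xbf16>
+%reloaded = xegpu.load_matrix %sub : matrix_desc<128x128xbf16, @row_stride=128, @mem_layout=block[8, 16]> -> vector<128x128xbf16>
+```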
+ + +| Ops | Syntax | Example | +| :--- | :---- | :--- | +|create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16, @mem_layout=block[8, 16] > | +|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @layout_type=1> -> matrix_desc<128x128xbf16, @row_stride=256, @mem_layout=block[8, 16]> | +|load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]> -> vector<128x256xbf16> | +|store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]>, vector<128x256xbf16> | + ## XeGPU Attributes to support Work Item Level semantics **Attribute xegpu.sg_map** From 39198a30b84fb3f462131b48e389f200b7102d71 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Tue, 8 Jul 2025 09:04:15 -0700 Subject: [PATCH 02/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index b9bdd3ff1..0a54c6508 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -343,6 +343,26 @@ The motivation of `matrix_desc` data type and related operations is to simplify |load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]>, vector<128x256xbf16> | +User creates `matrix_desc` to hold a matrix in the share local memory. The operation allocates a share local memory for the matrix, assuming the matrix is row-major and contiguous. The block attribute indicates the matrix has a blocked layout. +```mlir +%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16, @mem_layout=block[8, 16]> +``` +User creates a subview of matrix. +```mlir +%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16> +%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, row_stride=128> +``` + +Users load a matrix from share local memory to vector. +```mlir +vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @mem_layout=block[8, 16]> -> vector<256x128xbf6> +``` + +Users store a matrix to share local memory from vector. 
+```mlir +store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @mem_layout=block[8, 16]>, vector<256x128xbf6> +``` + ## XeGPU Attributes to support Work Item Level semantics **Attribute xegpu.sg_map** From c1ab2984fad72d309e3ddca20b9d57153ffdf87b Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Tue, 8 Jul 2025 20:03:53 -0700 Subject: [PATCH 03/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 86 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 75 insertions(+), 11 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 0a54c6508..a4d576f5e 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -338,31 +338,95 @@ The motivation of `matrix_desc` data type and related operations is to simplify | Ops | Syntax | Example | | :--- | :---- | :--- | -|create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16, @mem_layout=block[8, 16] > | -|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @layout_type=1> -> matrix_desc<128x128xbf16, @row_stride=256, @mem_layout=block[8, 16]> | -|load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]> -> vector<128x256xbf16> | -|store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @mem_layout=block[8, 16]>, vector<128x256xbf16> | +|create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16> | +|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @layout_type=1> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | +|load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | +|store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | -User creates `matrix_desc` to hold a matrix in the share local memory. The operation allocates a share local memory for the matrix, assuming the matrix is row-major and contiguous. The block attribute indicates the matrix has a blocked layout. +User creates `matrix_desc` to hold a matrix in the share local memory. The operation allocates a share local memory for the matrix, assuming the matrix is row-major and contiguous. ```mlir -%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16, @mem_layout=block[8, 16]> +%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> ``` -User creates a subview of matrix. +User creates a subview of matrix. The new matrix maybe associated with `block` and `strides` atttribute to describe the memory layout. The `strides` attributes allows matrix_desc being further decomposed to subgroup and work item level. 
The `block` attribute indicates the matrix has a blocked layout. ```mlir -%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16> -%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, row_stride=128> +%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> +%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> ``` Users load a matrix from share local memory to vector. ```mlir -vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @mem_layout=block[8, 16]> -> vector<256x128xbf6> +vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> ``` Users store a matrix to share local memory from vector. ```mlir -store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @mem_layout=block[8, 16]>, vector<256x128xbf6> +store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @block=[8, 16]>, vector<256x128xbf6> ``` +**Cooperative Transpose Example** +Suppose we have wg-level user input code +```mlir +#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[1, 0] } +#Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } +#dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } + +%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > +%a = vector.transpose %1 #Coop_wg :vector<32x256xf16> -> vector<256x32xf16> +%a_dpas = Conv_layout %2 #Coop_wg #dpas_wg +``` + +After an optimization pass which optimize the transpose-A pattern, the transformed code uses store_matrix and load_matrix. Note the load_nd and store_matrix has smaller sg_data so the subgroups perform cooperative transpose. +```mlir +#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[1, 0] } +#dpas_t_wg = {sg_layout = [4, 8], sg_data= [32, 32], order=[1, 0] } + +%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > +%m1 = create_matrix_desc : matrix_desc<32x256xf16> +%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, strides=[1, 32], #coop_wg> +store_matrix %m1t[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, strides=[1, 32], #Coop_t_wg> +barrier +%m4 = matrix_desc_subview : matrix_desc<256x32xf16, #dpas_t_wg > +%a_dpas = load_matrix %m2 [0, 0] #dpas_t_wg : vector<256x32xf16>, matrix_desc<256x32xf16, #dpas_t_wg> +``` + +After wg->sg level distribution and blocking, this lowers to the following sg-level code. 
+```mlir +%at1 = load_nd %tdesc, sg_coords1: tensor_desc<32x256xf16> -> vector<8x16xf16> +%at2 = load_nd %tdesc, sg_coords2: tensor_desc<32x256xf16> -> vector<8x16xf16> +%m1 = create_matrix_desc : matrix_desc<32x256xf16> + +%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> +store_matrix %m1t, sg_slm_coord1, %at1: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> +store_matrix %m1t, sg_slm_coord2, %at2: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> +barrier +%m4 = matrix_desc_subview : matrix_desc<256x32xf16, block=[16, 16] > +%a_dpas1 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> +%a_dpas2 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> +%a_dpas3 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> +%a_dpas4 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> +``` + +After assigned with lane_layout +```mlir +#Coop_t_lane ={lane_layout = [1, 16] , lane_data= [1, 1]} +#dpas_t_lane = {lane_layout = [2, 8], lane_data= [1, 2]} + +%at1 = load_nd %tdesc, sg_coords1: tensor_desc<32x256xf16, #Coop_t_lane> -> vector<8x16xf16> +%at2 = load_nd %tdesc, sg_coords2: tensor_desc<32x256xf16, #Coop_t_lane> -> vector<8x16xf16> +%m1 = create_matrix_desc : matrix_desc<32x256xf16> + +%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> +store_matrix %m1t, sg_slm_coord1, %at1: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> +store_matrix %m1t, sg_slm_coord2, %at2: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> +barrier +%m4 = matrix_desc_subview : matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane > +%a_dpas1 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> +%a_dpas2 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> +%a_dpas3 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> +%a_dpas4 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> +``` + + ## XeGPU Attributes to support Work Item Level semantics **Attribute xegpu.sg_map** From 1bbcdaa6237c4fa57e86609d23f62fa09098cd51 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Tue, 8 Jul 2025 20:05:41 -0700 Subject: [PATCH 04/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index a4d576f5e..c5d12038f 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -408,7 +408,7 @@ barrier After assigned with lane_layout ```mlir -#Coop_t_lane ={lane_layout = [1, 16] , lane_data= [1, 1]} +#Coop_t_lane ={lane_layout = [1, 16] , lane_data= [8, 1]} #dpas_t_lane = {lane_layout = [2, 8], lane_data= [1, 2]} %at1 = load_nd %tdesc, sg_coords1: tensor_desc<32x256xf16, #Coop_t_lane> -> vector<8x16xf16> From 8906c08928001759a8252551de1057034f7f533f Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Wed, 9 Jul 2025 07:51:05 -0700 Subject: [PATCH 05/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/XeGPU.md 
b/docs/rfcs/XeGPU.md index c5d12038f..52bae60a2 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -366,7 +366,7 @@ store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @block=[8, 16]>, ve **Cooperative Transpose Example** Suppose we have wg-level user input code ```mlir -#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[1, 0] } +#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1] } #Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } @@ -377,7 +377,7 @@ Suppose we have wg-level user input code After an optimization pass which optimize the transpose-A pattern, the transformed code uses store_matrix and load_matrix. Note the load_nd and store_matrix has smaller sg_data so the subgroups perform cooperative transpose. ```mlir -#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[1, 0] } +#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1] } #dpas_t_wg = {sg_layout = [4, 8], sg_data= [32, 32], order=[1, 0] } %at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > From 3c0c8ad99cebee22554a33bf099bfd7f8454cdb1 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Wed, 9 Jul 2025 16:20:37 -0700 Subject: [PATCH 06/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 113 +++++++++++++++++++++++++++++++-------------- 1 file changed, 79 insertions(+), 34 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 52bae60a2..227460a78 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -377,55 +377,100 @@ Suppose we have wg-level user input code After an optimization pass which optimize the transpose-A pattern, the transformed code uses store_matrix and load_matrix. Note the load_nd and store_matrix has smaller sg_data so the subgroups perform cooperative transpose. ```mlir -#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1] } -#dpas_t_wg = {sg_layout = [4, 8], sg_data= [32, 32], order=[1, 0] } +#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1 } +#dpas_t_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } %at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > -%m1 = create_matrix_desc : matrix_desc<32x256xf16> -%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, strides=[1, 32], #coop_wg> -store_matrix %m1t[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, strides=[1, 32], #Coop_t_wg> +%m = create_matrix_desc : matrix_desc<32x256xf16> +%mt = matrix_desc_subview %m: matrix_desc<32x256xf16, strides=[1, 32], #coop_t_wg> +store_matrix %mt[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, strides=[1, 32], #coop_t_wg> barrier -%m4 = matrix_desc_subview : matrix_desc<256x32xf16, #dpas_t_wg > -%a_dpas = load_matrix %m2 [0, 0] #dpas_t_wg : vector<256x32xf16>, matrix_desc<256x32xf16, #dpas_t_wg> +%ma = matrix_desc_subview : matrix_desc<256x32xf16, #dpas_t_wg> +%a_dpas = load_matrix %ma [0, 0] #dpas_t_wg : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` -After wg->sg level distribution and blocking, this lowers to the following sg-level code. +After wg->sg level distribution, this lowers to the following sg-level code. 
```mlir -%at1 = load_nd %tdesc, sg_coords1: tensor_desc<32x256xf16> -> vector<8x16xf16> -%at2 = load_nd %tdesc, sg_coords2: tensor_desc<32x256xf16> -> vector<8x16xf16> -%m1 = create_matrix_desc : matrix_desc<32x256xf16> +#coop_t_inst ={ inst_data=[8, 16] } +#dpas_t_inst = {inst_data=[16, 16] } +create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16, #coop_t_inst> +%at = load_nd %tdesc: tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> -%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -store_matrix %m1t, sg_slm_coord1, %at1: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -store_matrix %m1t, sg_slm_coord2, %at2: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> +%m = create_matrix_desc : matrix_desc<32x256xf16> +%mt_sg = create_matrix_desc %m [sg_idy*8, sg_idx*32]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #coop_t_inst > +store_matrix %mt_sg, %at: vector<8x32xf16>, matrix_desc<8x32xf16, block=[16, 16], strides=[1, 32], #coop_t_inst > barrier -%m4 = matrix_desc_subview : matrix_desc<256x32xf16, block=[16, 16] > -%a_dpas1 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> -%a_dpas2 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> -%a_dpas3 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> -%a_dpas4 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16]> +%ma = matrix_desc_subview %m : matrix_desc<256x32xf16, block=[16, 16], #dpas_t_inst> +%ma_sg = matrix_desc_subview %ma [sg_idy*32, sg_idx*32%32]: matrix_desc<32x32xf16, block=[16, 16] , #dpas_t_inst> +%a_dpas = load_matrix ma_sg: matrix_desc<32x32xf16, block=[16, 16], #dpas_t_inst >-> vector<32x32xf16> + ``` -After assigned with lane_layout +After blocking according to inst_data. 
```mlir -#Coop_t_lane ={lane_layout = [1, 16] , lane_data= [8, 1]} -#dpas_t_lane = {lane_layout = [2, 8], lane_data= [1, 2]} +create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16> +%at = load_nd %tdesc, sg_coords1: tensor_desc<8x32xf16> -> vector<8x32xf16> +%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> +%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> +%m = create_matrix_desc : matrix_desc<32x256xf16> +%mt_sg = create_matrix_desc %m1 [0, 0]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -%at1 = load_nd %tdesc, sg_coords1: tensor_desc<32x256xf16, #Coop_t_lane> -> vector<8x16xf16> -%at2 = load_nd %tdesc, sg_coords2: tensor_desc<32x256xf16, #Coop_t_lane> -> vector<8x16xf16> -%m1 = create_matrix_desc : matrix_desc<32x256xf16> +%mt_inst0 = create_matrix_desc % mt_sg [sg_idy*8, sg_idx*32]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -> matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> +%mt_inst1 = create_matrix_desc % mt_sg [sg_idy*8, sg_idx*32+16]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -> matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> -%m1t = matrix_desc_subview %m1: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> -store_matrix %m1t, sg_slm_coord1, %at1: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> -store_matrix %m1t, sg_slm_coord2, %at2: vector<8x16xf16>, matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #Coop_t_lane> +store_matrix %mt_inst0, %at0: vector<8x16xf16>, matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> +store_matrix %mt_inst1, %at1: vector<8x16xf16>, matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> barrier -%m4 = matrix_desc_subview : matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane > -%a_dpas1 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> -%a_dpas2 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> -%a_dpas3 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> -%a_dpas4 = load_matrix %m2 sg_slm_coord3 #dpas_t_wg : vector<16x16xf16>, matrix_desc<256x32xf16, block=[16, 16], #dpas_t_lane> -``` +%ma = matrix_desc_subview %m: matrix_desc<256x32xf16, block=[16, 16] > +%ma_inst0 = matrix_desc_subview % ma [sg_idy*32, sg_idx*32%32]: matrix_desc<16x16xf16, block=[16, 16] > +% ma_inst1 = matrix_desc_subview %ma [sg_idy*32, sg_idx*32%32+16]: matrix_desc<16x16xf16, block=[16, 16] > +% ma_inst2 = matrix_desc_subview %ma [sg_idy*32+16, sg_idx*32%32]: matrix_desc<16x16xf16, block=[16, 16] > +% ma_inst3 = matrix_desc_subview %ma [sg_idy*32+16, sg_idx*32%32+16]: matrix_desc<16x16xf16, block=[16, 16] > +%a_dpas_0 = load_matrix ma_inst0: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> +%a_dpas_1 = load_matrix ma_inst1: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> +%a_dpas_2 = load_matrix ma_inst2: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> +%a_dpas_3 = load_matrix ma_inst3: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> +``` + +MaterializeSLMAccess pass replace matrix_desc and related ops to memref and load and load_1d. Pseudo code used to simplify the code. 
+```mlir +create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16> +%at = load_nd %tdesc, sg_coords1: tensor_desc<8x32xf16> -> vector<8x32xf16> +%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> +%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> + +%blk_y=sg_idy*8 /16: index +%blk_in_y=sg_idy*8 %16: index +%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex> +%blk_x=%sg_idx_vec /16: vector<16xindex > +%blk_in_x=%sg_idx_vec %16: vector<16xindex > +%sg_start_offset_vec = %blk_y * 16 + %blk_in_y + %blk_x * 512 + %blk_in_x*16 +%tdesc0 = xegpu.create_tdesc %m, %sg_start_offset_vec: memref<8192xf16, 3>, %sg_start_offset_vec ->tdesc<8x16xf16, chunk=8, scope=slm> + +%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex> +%blk_x2=%sg_idx_vec /16: vector<16xindex > +%blk_in_x2=%sg_idx_vec %16: vector<16xindex > +%sg_start_offset_vec2 = %blk_y * 16 + %blk_in_y + %blk_x * 512 + %blk_in_x*16 +%tdesc1 = xegpu.create_tdesc %m, %sg_start_offset_vec2: memref<8192xf16, 3>, %sg_start_offset_vec ->tdesc<8x16xf16, chunk=8, scope=slm> + +xegpu.store %tdesc0, %at0: tdesc<8x32xf16, chunk=8, scope=slm>, vector<8x16xf16> +xegpu.store %tdesc1, %at1: tdesc<8x32xf16, chunk=8, scope=slm>, vector<8x16xf16> +barrier +%inst_start_offset0 = sg_idy*2* 512 +%tdesc0 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > +%inst_start_offset0 = sg_idy*2* 512 + 256 +%tdesc1 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > +%inst_start_offset0 = sg_idy*2* 512 + 512 +%tdesc2 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > +%inst_start_offset0 = sg_idy*2* 512 + 512 + 256 +%tdesc3 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > + +a_dpas_0 = Load_nd %tdesc0: tdesc<256xf16 > -> vector<256xf16> +a_dpas_1 = Load_nd %tdesc1: tdesc<256xf16 > -> vector<256xf16> +a_dpas_2 = Load_nd %tdesc2: tdesc<256xf16 > -> vector<256xf16> +a_dpas_3 = Load_nd %tdesc3: tdesc<256xf16 > -> vector<256xf16> +``` ## XeGPU Attributes to support Work Item Level semantics From bfc834b7b9bf387b8cb79c16a652b138aeaa0c0e Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 10 Jul 2025 19:43:36 -0700 Subject: [PATCH 07/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 227460a78..b8ebaaa8a 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -330,10 +330,9 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem `nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`. ## XeGPU operations to access share local memory -Users must create `matrix_desc` to hold a matrix in the share local memory. The matrix must be row-major. The matrix can attach a attribute for its memory layout, for example, a blocked layout or just original non-blocked row-major layout (aka. linear layout). -User can get a subview of an existing `matrix_desc` to get a new `matrix_desc`, potentially having a stride. Then user can use load_matrix and store_matrix to move the matrix data between share local memory and vectors (registers). The matrix is typically 2d and but can be multi-dimension. XeGPU's load_matrix and store_matrix works at workgroup level only. 
It uses xegpu.layout to describe how the matrix is decomposed to data fragments and maps to work items. The workgroup level operation loads the entire matrix to vector. +XeGPU introduced `matrix_desc` data type to simplify programming share local memory (slm). In Xe2, the slm access is not directly programmed by user, but as an implementation to support transpose, reduction, and convert layout of a workgroup level tile. There is a common programming pattern that user allocates a 2d matrix in slm with row-major contiguous layout, then distribute the matrix to each subgroup and lane. The distribution process involves complex address computation, espeically when the memory access try to view the 2d matrix in a transposed view. The address computation becomes even more complicated for Xe2's 1d block load as it requires to block the slm so the innermost block is contiguous in slm. `matrix_desc` data type simplified the distribution by encoding the transposed and blocked layout as attribute, which separates the logical addresses compution in distribution and later physical address computation. The distribution process works on a logical address on top of row-major contiguous view of 2d matrix, and later materialized to physical address using the slm's memory layout attributes as required by 1d block load and regular load. -The motivation of `matrix_desc` data type and related operations is to simplify the programming model. Rather than trying to reuse `tensor_desc` to describe the matrix/tile in the share local memory, it is straightforward to use a dedicate data type to describe it. The use of share local memory is usually very local not exposed to workgroup level user, for example, supporting the lowering of transpose, reduction, and convert layout operations. So the createion of matrix_desc doesn't take a memref as input and implictly allocate share local memory. The share local memory may be blocked to facilitate the optimized lowering to load chunk or 1d block load. +Users must create `matrix_desc` to hold a matrix in the share local memory. The matrix must be row-major. The matrix can attach a attribute for its memory layout, for example, a blocked layout or just original non-blocked row-major layout (aka. linear layout). User can get a subview of an existing `matrix_desc` to get a new `matrix_desc`, potentially having strided and blocked layout attributes. Then user can use load_matrix and store_matrix to move the matrix data between slm and vectors (registers). The matrix is typically 2d and but can be multi-dimension. XeGPU's load_matrix and store_matrix works at workgroup scope only. | Ops | Syntax | Example | @@ -344,10 +343,11 @@ The motivation of `matrix_desc` data type and related operations is to simplify |store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | User creates `matrix_desc` to hold a matrix in the share local memory. The operation allocates a share local memory for the matrix, assuming the matrix is row-major and contiguous. + ```mlir %mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> ``` -User creates a subview of matrix. The new matrix maybe associated with `block` and `strides` atttribute to describe the memory layout. The `strides` attributes allows matrix_desc being further decomposed to subgroup and work item level. The `block` attribute indicates the matrix has a blocked layout. 
+User creates a subview of matrix. The new matrix maybe associated with `block` and `strides` atttributes to describe the memory layout. The `strides` attributes allows matrix_desc being further decomposed to subgroup and work item level. The `block` attribute indicates the matrix has a blocked layout. The `block` attribute facilitates the optimized lowering to 1d block load, and `strides` for load with chunk. It can also attach `xegpu.layout` attribute to describe how the matrix is decomposed to data fragments and maps to work items. ```mlir %mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> %mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> From be46fadf0d598b3679a1d648716cdfcd7c742cbd Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 10 Jul 2025 21:59:27 -0700 Subject: [PATCH 08/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index b8ebaaa8a..bae6f5ddf 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -329,12 +329,40 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem `nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`. -## XeGPU operations to access share local memory -XeGPU introduced `matrix_desc` data type to simplify programming share local memory (slm). In Xe2, the slm access is not directly programmed by user, but as an implementation to support transpose, reduction, and convert layout of a workgroup level tile. There is a common programming pattern that user allocates a 2d matrix in slm with row-major contiguous layout, then distribute the matrix to each subgroup and lane. The distribution process involves complex address computation, espeically when the memory access try to view the 2d matrix in a transposed view. The address computation becomes even more complicated for Xe2's 1d block load as it requires to block the slm so the innermost block is contiguous in slm. `matrix_desc` data type simplified the distribution by encoding the transposed and blocked layout as attribute, which separates the logical addresses compution in distribution and later physical address computation. The distribution process works on a logical address on top of row-major contiguous view of 2d matrix, and later materialized to physical address using the slm's memory layout attributes as required by 1d block load and regular load. +## matrix_desc Type: Simplified Shared Local Memory (SLM) Abstraction +To streamline programming of shared local memory (SLM) on Intel Xe architecture, the XeGPU dialect introduces a new type: matrix_desc. This abstraction is designed to simplify the management of workgroup-level tiles in SLM, especially in scenarios involving layout transformations such as transpose, reduction, and blocking. +**Background and Motivation** +On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. 
+ +Prior to the introduction of matrix_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. + +This complexity was further compounded by hierarchical distribution, where workgroup-level tiles are subdivided across subgroups, instructions, and individual lanes — each step requiring separate address transformation logic. This made the code error-prone and difficult to optimize. + +**Design and Semantics** + +The matrix_desc type addresses these challenges by: + +-Encoding layout transformations (e.g., transpose, blocking) as static attributes of the descriptor. + +-Separating logical and physical address computation: + + -The distribution and unrolling process operates on a conceptual row-major 2D matrix. + + -The physical address materialization then maps logical coordinates to hardware-compliant SLM addresses, guided by layout attributes in matrix_desc. + +This separation simplifies distribution and unrolling passes and enables systematic, robust transformations during compilation. The descriptor encapsulates all necessary layout metadata to generate correct and efficient SLM access patterns — supporting both regular loads and 1D block loads — without requiring the user to write explicit address arithmetic. Users must create `matrix_desc` to hold a matrix in the share local memory. The matrix must be row-major. The matrix can attach a attribute for its memory layout, for example, a blocked layout or just original non-blocked row-major layout (aka. linear layout). User can get a subview of an existing `matrix_desc` to get a new `matrix_desc`, potentially having strided and blocked layout attributes. Then user can use load_matrix and store_matrix to move the matrix data between slm and vectors (registers). The matrix is typically 2d and but can be multi-dimension. XeGPU's load_matrix and store_matrix works at workgroup scope only. +**Basic Usage** + +To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. The underlying memory is assumed to follow a row-major layout, and the base matrix_desc represents a raw, unannotated matrix in this layout. The base matrix may be n-dimensional. + +Only matrix_desc instances created via subview may carry an xegpu.layout attribute, which specifies the mapping of lanes and registers to fragments of the matrix. This attribute guides the tile distribution process based on the assumed row-major view of the original matrix. In addition, subviewed matrix_desc instances may carry layout metadata such as blocking and striding, which are used to control physical address computation when accessing SLM. + +Data movement between SLM and vector registers is performed using load_matrix and store_matrix, which operate at workgroup scope and require the input matrix_desc to be 2D. If the original matrix is higher-dimensional, it must be subviewed to a 2D shape before it can be used with these operations. 
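+For example, a triple-buffered 3D allocation has to be reduced to a 2D view before it can feed load_matrix or store_matrix. The sketch below is illustrative only; it reuses the shapes from the subview example later in this document, and %buf_idx is a hypothetical buffer index.
+```mlir
+// Illustrative: rank-reducing subview of a 3D SLM allocation to a 2D view.
+%mdescs = xegpu.create_matrix_desc : matrix_desc<3x256x128xbf16>
+%mdesc_2d = xegpu.matrix_desc_subview %mdescs[%buf_idx, 0, 0]
+  : matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16>
+// The 2D view can now be used with the workgroup-scope load/store operations.
+%tile = xegpu.load_matrix %mdesc_2d : matrix_desc<256x128xbf16> -> vector<256x128xbf16>
+```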
+ | Ops | Syntax | Example | | :--- | :---- | :--- | |create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16> | From c75986c3dfa4d1bdbb609232dcaecb0ccc42b393 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 10 Jul 2025 22:01:43 -0700 Subject: [PATCH 09/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index bae6f5ddf..a48ff0547 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -340,21 +340,10 @@ This complexity was further compounded by hierarchical distribution, where workg **Design and Semantics** -The matrix_desc type addresses these challenges by: - --Encoding layout transformations (e.g., transpose, blocking) as static attributes of the descriptor. - --Separating logical and physical address computation: - - -The distribution and unrolling process operates on a conceptual row-major 2D matrix. - - -The physical address materialization then maps logical coordinates to hardware-compliant SLM addresses, guided by layout attributes in matrix_desc. +The matrix_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the physical address materialization phase maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the matrix_desc. This separation simplifies distribution and unrolling passes and enables systematic, robust transformations during compilation. The descriptor encapsulates all necessary layout metadata to generate correct and efficient SLM access patterns — supporting both regular loads and 1D block loads — without requiring the user to write explicit address arithmetic. -Users must create `matrix_desc` to hold a matrix in the share local memory. The matrix must be row-major. The matrix can attach a attribute for its memory layout, for example, a blocked layout or just original non-blocked row-major layout (aka. linear layout). User can get a subview of an existing `matrix_desc` to get a new `matrix_desc`, potentially having strided and blocked layout attributes. Then user can use load_matrix and store_matrix to move the matrix data between slm and vectors (registers). The matrix is typically 2d and but can be multi-dimension. XeGPU's load_matrix and store_matrix works at workgroup scope only. - - **Basic Usage** To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. The underlying memory is assumed to follow a row-major layout, and the base matrix_desc represents a raw, unannotated matrix in this layout. The base matrix may be n-dimensional. From d5b06840d336b14a13c1e7613f30a49cc9aab3e8 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 10 Jul 2025 22:42:48 -0700 Subject: [PATCH 10/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 58 +++++++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 16 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index a48ff0547..5f94858b9 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -330,8 +330,11 @@ Attribute `Memory_kind` describes the memory kind. 
"global" means the global mem `nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`. ## matrix_desc Type: Simplified Shared Local Memory (SLM) Abstraction + To streamline programming of shared local memory (SLM) on Intel Xe architecture, the XeGPU dialect introduces a new type: matrix_desc. This abstraction is designed to simplify the management of workgroup-level tiles in SLM, especially in scenarios involving layout transformations such as transpose, reduction, and blocking. + **Background and Motivation** + On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. Prior to the introduction of matrix_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. @@ -359,29 +362,29 @@ Data movement between SLM and vector registers is performed using load_matrix an |load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | -User creates `matrix_desc` to hold a matrix in the share local memory. The operation allocates a share local memory for the matrix, assuming the matrix is row-major and contiguous. +Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation allocates SLM for the matrix, assuming a row-major contiguous layout. ```mlir %mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> ``` -User creates a subview of matrix. The new matrix maybe associated with `block` and `strides` atttributes to describe the memory layout. The `strides` attributes allows matrix_desc being further decomposed to subgroup and work item level. The `block` attribute indicates the matrix has a blocked layout. The `block` attribute facilitates the optimized lowering to 1d block load, and `strides` for load with chunk. It can also attach `xegpu.layout` attribute to describe how the matrix is decomposed to data fragments and maps to work items. +Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. The resulting matrix_desc may be annotated with layout attributes such as @block and @strides to describe its memory layout more precisely. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. 
Additionally, a subview may carry an xegpu.layout attribute that defines how the matrix is logically partitioned into fragments and mapped to work items. ```mlir %mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> %mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> ``` -Users load a matrix from share local memory to vector. +Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. ```mlir vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> ``` -Users store a matrix to share local memory from vector. +Users can store a matrix from a vector value into shared local memory using the store_matrix operation. ```mlir store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @block=[8, 16]>, vector<256x128xbf6> ``` **Cooperative Transpose Example** -Suppose we have wg-level user input code +This example demonstrates a cooperative transpose pattern in which a matrix tile is loaded by a workgroup and collaboratively transposed across subgroups or threads. The operation is broken into two steps: a local transpose using vector.transpose and a cooperative re-layout using xegpu.convert_layout, where neighboring subgroups within a workgroup exchange data to form the desired transposed tile layout. ```mlir #Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1] } #Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } @@ -391,19 +394,42 @@ Suppose we have wg-level user input code %a = vector.transpose %1 #Coop_wg :vector<32x256xf16> -> vector<256x32xf16> %a_dpas = Conv_layout %2 #Coop_wg #dpas_wg ``` +In this flow: -After an optimization pass which optimize the transpose-A pattern, the transformed code uses store_matrix and load_matrix. Note the load_nd and store_matrix has smaller sg_data so the subgroups perform cooperative transpose. -```mlir -#Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1 } -#dpas_t_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } +vector.transpose applies a local transpose within each thread’s register tile. + +xegpu.convert_layout performs a cooperative data exchange among threads/subgroups to assemble a larger tile in the transposed layout. + +The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. + +After an optimization pass that targets the transpose-A pattern, the code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. + +It is generally preferred to fuse transpose and convert_layout earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. 
+ +```mlir +#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout +#dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0] } // target DPAS layout + +%at = xegpu.load_nd %tdesc + : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> + +%m = xegpu.create_matrix_desc + : matrix_desc<32x256xf16> + +%mt = xegpu.matrix_desc_subview %m[0, 0] + : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> + +xegpu.store_matrix %mt[0, 0], %at + : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> + +xegpu.barrier + +%ma = xegpu.matrix_desc_subview %m[0, 0] + : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> + +%a_dpas = xegpu.load_matrix %ma[0, 0] + : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> -%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > -%m = create_matrix_desc : matrix_desc<32x256xf16> -%mt = matrix_desc_subview %m: matrix_desc<32x256xf16, strides=[1, 32], #coop_t_wg> -store_matrix %mt[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, strides=[1, 32], #coop_t_wg> -barrier -%ma = matrix_desc_subview : matrix_desc<256x32xf16, #dpas_t_wg> -%a_dpas = load_matrix %ma [0, 0] #dpas_t_wg : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` After wg->sg level distribution, this lowers to the following sg-level code. From 3657d362c46109eebdcc9fdc2983d22a101e0d62 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 10 Jul 2025 23:06:13 -0700 Subject: [PATCH 11/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 360 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 273 insertions(+), 87 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 5f94858b9..b9abf89e3 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -369,8 +369,11 @@ Users create a `matrix_desc` to represent a matrix stored in shared local memory ``` Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. The resulting matrix_desc may be annotated with layout attributes such as @block and @strides to describe its memory layout more precisely. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. Additionally, a subview may carry an xegpu.layout attribute that defines how the matrix is logically partitioned into fragments and mapped to work items. ```mlir -%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0]: matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> -%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster*64]: matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> +%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] + : matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> + +%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] + : matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> ``` Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. 
@@ -384,6 +387,7 @@ store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @block=[8, 16]>, ve ``` **Cooperative Transpose Example** + This example demonstrates a cooperative transpose pattern in which a matrix tile is loaded by a workgroup and collaboratively transposed across subgroups or threads. The operation is broken into two steps: a local transpose using vector.transpose and a cooperative re-layout using xegpu.convert_layout, where neighboring subgroups within a workgroup exchange data to form the desired transposed tile layout. ```mlir #Coop_t_wg ={sg_layout = [4, 8], sg_data= [8, 32], order=[0, 1] } @@ -396,13 +400,14 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile ``` In this flow: -vector.transpose applies a local transpose within each thread’s register tile. +1. vector.transpose applies a local transpose within each thread’s register tile. -xegpu.convert_layout performs a cooperative data exchange among threads/subgroups to assemble a larger tile in the transposed layout. +2. xegpu.convert_layout performs a cooperative data exchange among threads/subgroups to assemble a larger tile in the transposed layout. -The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. +3. The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. -After an optimization pass that targets the transpose-A pattern, the code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. +**After an optimization pass that targets the transpose-A pattern** +The code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. It is generally preferred to fuse transpose and convert_layout earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. @@ -432,87 +437,268 @@ xegpu.barrier ``` -After wg->sg level distribution, this lowers to the following sg-level code. -```mlir -#coop_t_inst ={ inst_data=[8, 16] } -#dpas_t_inst = {inst_data=[16, 16] } -create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16, #coop_t_inst> -%at = load_nd %tdesc: tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> - -%m = create_matrix_desc : matrix_desc<32x256xf16> -%mt_sg = create_matrix_desc %m [sg_idy*8, sg_idx*32]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32], #coop_t_inst > -store_matrix %mt_sg, %at: vector<8x32xf16>, matrix_desc<8x32xf16, block=[16, 16], strides=[1, 32], #coop_t_inst > -barrier -%ma = matrix_desc_subview %m : matrix_desc<256x32xf16, block=[16, 16], #dpas_t_inst> -%ma_sg = matrix_desc_subview %ma [sg_idy*32, sg_idx*32%32]: matrix_desc<32x32xf16, block=[16, 16] , #dpas_t_inst> -%a_dpas = load_matrix ma_sg: matrix_desc<32x32xf16, block=[16, 16], #dpas_t_inst >-> vector<32x32xf16> - -``` - -After blocking according to inst_data. 
-```mlir -create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16> -%at = load_nd %tdesc, sg_coords1: tensor_desc<8x32xf16> -> vector<8x32xf16> -%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> -%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> -%m = create_matrix_desc : matrix_desc<32x256xf16> -%mt_sg = create_matrix_desc %m1 [0, 0]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> - -%mt_inst0 = create_matrix_desc % mt_sg [sg_idy*8, sg_idx*32]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -> matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> -%mt_inst1 = create_matrix_desc % mt_sg [sg_idy*8, sg_idx*32+16]: matrix_desc<32x256xf16, block=[16, 16], strides=[1, 32]> -> matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> - -store_matrix %mt_inst0, %at0: vector<8x16xf16>, matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> -store_matrix %mt_inst1, %at1: vector<8x16xf16>, matrix_desc<8x16xf16, block=[16, 16], strides=[1, 32]> -barrier -%ma = matrix_desc_subview %m: matrix_desc<256x32xf16, block=[16, 16] > -%ma_inst0 = matrix_desc_subview % ma [sg_idy*32, sg_idx*32%32]: matrix_desc<16x16xf16, block=[16, 16] > -% ma_inst1 = matrix_desc_subview %ma [sg_idy*32, sg_idx*32%32+16]: matrix_desc<16x16xf16, block=[16, 16] > -% ma_inst2 = matrix_desc_subview %ma [sg_idy*32+16, sg_idx*32%32]: matrix_desc<16x16xf16, block=[16, 16] > -% ma_inst3 = matrix_desc_subview %ma [sg_idy*32+16, sg_idx*32%32+16]: matrix_desc<16x16xf16, block=[16, 16] > -%a_dpas_0 = load_matrix ma_inst0: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> -%a_dpas_1 = load_matrix ma_inst1: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> -%a_dpas_2 = load_matrix ma_inst2: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> -%a_dpas_3 = load_matrix ma_inst3: matrix_desc<16x16xf16, block=[16, 16]>, vector<16x16xf16> -``` - -MaterializeSLMAccess pass replace matrix_desc and related ops to memref and load and load_1d. Pseudo code used to simplify the code. 
-```mlir -create_nd_tdesc %tdesc_sg [widy*32+sg_idy*8, widx*256+sg_idx*32] : : memref<4096x4096xf16> -> : tensor_desc<8x32xf16> -%at = load_nd %tdesc, sg_coords1: tensor_desc<8x32xf16> -> vector<8x32xf16> -%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> -%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> - -%blk_y=sg_idy*8 /16: index -%blk_in_y=sg_idy*8 %16: index -%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex> -%blk_x=%sg_idx_vec /16: vector<16xindex > -%blk_in_x=%sg_idx_vec %16: vector<16xindex > -%sg_start_offset_vec = %blk_y * 16 + %blk_in_y + %blk_x * 512 + %blk_in_x*16 -%tdesc0 = xegpu.create_tdesc %m, %sg_start_offset_vec: memref<8192xf16, 3>, %sg_start_offset_vec ->tdesc<8x16xf16, chunk=8, scope=slm> - -%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex> -%blk_x2=%sg_idx_vec /16: vector<16xindex > -%blk_in_x2=%sg_idx_vec %16: vector<16xindex > -%sg_start_offset_vec2 = %blk_y * 16 + %blk_in_y + %blk_x * 512 + %blk_in_x*16 -%tdesc1 = xegpu.create_tdesc %m, %sg_start_offset_vec2: memref<8192xf16, 3>, %sg_start_offset_vec ->tdesc<8x16xf16, chunk=8, scope=slm> - -xegpu.store %tdesc0, %at0: tdesc<8x32xf16, chunk=8, scope=slm>, vector<8x16xf16> -xegpu.store %tdesc1, %at1: tdesc<8x32xf16, chunk=8, scope=slm>, vector<8x16xf16> - -barrier -%inst_start_offset0 = sg_idy*2* 512 -%tdesc0 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > -%inst_start_offset0 = sg_idy*2* 512 + 256 -%tdesc1 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > -%inst_start_offset0 = sg_idy*2* 512 + 512 -%tdesc2 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > -%inst_start_offset0 = sg_idy*2* 512 + 512 + 256 -%tdesc3 = xegpu.create_nd_tdesc %m1, % inst_start_offset0 : memref<8192xf16, 3>, index->tdesc<256xf16 > - -a_dpas_0 = Load_nd %tdesc0: tdesc<256xf16 > -> vector<256xf16> -a_dpas_1 = Load_nd %tdesc1: tdesc<256xf16 > -> vector<256xf16> -a_dpas_2 = Load_nd %tdesc2: tdesc<256xf16 > -> vector<256xf16> -a_dpas_3 = Load_nd %tdesc3: tdesc<256xf16 > -> vector<256xf16> +**Adding attributes for Instruction-Level Blocking** +In this example, the matrix is distributed according to hardware capability, using instruction-level blocking. This type of blocking does not change the physical memory layout (i.e., there is no memory-level tiling); instead, it affects how data is accessed and lowered to instructions like store_scatter and load_gather. + +Each lane handles 2 f16 elements (32 bits), and the matrix_desc uses a @strides attribute to represent memory layout. During lowering, offsets must be computed by composing: 1) The logical layout (e.g., xegpu.layout), and 2) The memory layout (e.g., @strides), to produce final physical offsets. 
+```mlir +#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] } +#dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [1, 32], order = [1, 0] } +// Load a tile cooperatively using logical layout #Coop_t_wg +%at = xegpu.load_nd %tdesc + : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> + +// Allocate a matrix_desc over shared memory +%m = xegpu.create_matrix_desc + : matrix_desc<32x256xf16> + +// Subview with strides, using the same layout for consistent offset mapping +%mt = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> + +// Store matrix with instruction-level blocking (per-inst offsets computed using strides) +xegpu.store_matrix %mt[0, 0], %at + : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> + +// Synchronize before loading the transposed result +xegpu.barrier + +// Subview for DPAS layout (transposed, larger tile per subgroup) +%ma = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> + +// Load matrix cooperatively with new layout +%a_dpas = xegpu.load_matrix %ma[0, 0] + : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> +``` +**Optimized with Blocking: Lowering to store_chunk and 1D Block Load** +This pattern demonstrates a more optimized strategy using instruction-level blocking, enabling the use of efficient memory instructions such as store_chunk and 1D block load. For correct and efficient lowering, several constraints must be satisfied: + +The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of store_chunk and 1D block load. + +Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related matrix_desc subviews (e.g., producer and consumer) must have consistent block sizes. In some cases, the block shape may be transposed between the two to accommodate different access orders. + +Each instruction must access only within its assigned matrix block boundary — no cross-block accesses are allowed. + +During lowering, store_matrix is lowered to store_chunk if the matrix has strides, and load_matrix is lowered to 1D block load if the matrix has a blocked layout. 
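The "no cross-block accesses" constraint above can be pictured with a small Python sketch (an assumed helper for illustration, not part of the dialect or its lowering passes): an inst_data-sized access starting at a given logical origin must fall entirely inside one @block tile.

```python
# Illustrative check: the first and last element touched by an inst_data-shaped access
# must land in the same @block tile in every dimension.
def within_one_block(origin, inst_data, block):
    return all(o // b == (o + n - 1) // b
               for o, n, b in zip(origin, inst_data, block))

print(within_one_block((8, 16), (8, 16), (16, 16)))  # True: stays inside block (0, 1)
print(within_one_block((8, 8),  (8, 16), (16, 16)))  # False: straddles two blocks
```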
+ +```mlir +#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 16], order = [0, 1] } +#dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [16, 16], order = [1, 0] } +// Load matrix cooperatively from global memory +%at = xegpu.load_nd %tdesc + : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> + +// Allocate shared local memory buffer +%m = xegpu.create_matrix_desc + : matrix_desc<32x256xf16> + +// Subview with both block and stride attributes +%mt = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> + +// Store cooperatively into SLM using blocking-aware lowering (store_chunk) +xegpu.store_matrix %mt[0, 0], %at + : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> + +// Synchronize before reuse +xegpu.barrier + +// Subview with matching block shape for 1D block load +%ma = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> + +// Load cooperatively from SLM using 1D block load +%a_dpas = xegpu.load_matrix %ma[0, 0] + : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16> +``` + +**Workgroup to Subgroup Distribution** +This example illustrates how data is distributed from workgroup to subgroups. It demonstrates how load_matrix and store_matrix cooperate with matrix_desc subviews to enable efficient subgroup distribution. In this step, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. + +The matrix is assumed to be stored in row-major contiguous layout, and indexing into it is performed using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the load_matrix/store_matrix abstraction is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. 
+ +```mlir +#coop_t_inst = { inst_data = [8, 16] } +#dpas_t_inst = { inst_data = [16, 16] } + +// Each subgroup loads its portion of the global matrix using inst_data layout +%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] + : memref<4096x4096xf16> -> tensor_desc<8x32xf16, #coop_t_inst> + +%at = xegpu.load_nd %tdesc_sg + : tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> + +// Allocate a matrix in SLM with row-major layout +%m = xegpu.create_matrix_desc + : matrix_desc<32x256xf16> + +// Subgroup subview: logical coords computed assuming row-major layout +%mt_sg = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32] + : matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> + -> matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32]> + +// Store vector into SLM using per-inst tile and logical mapping +xegpu.store_matrix %mt_sg, %at + : vector<8x32xf16>, matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> + +// Barrier to synchronize SLM access +xegpu.barrier + +// Subview for DPAS tile shape +%ma = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> + +// Subgroup-level subview for cooperative DPAS load +%ma_sg = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * (32 % 32)] + : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> + -> matrix_desc<32x32xf16, @block=[16, 16]> + +// Load matrix cooperatively from SLM using 1D block load +%a_dpas = xegpu.load_matrix %ma_sg + : matrix_desc<32x32xf16, @block=[16, 16], #dpas_t_inst> -> vector<32x32xf16> +``` +**Unrolling Guided by Inst_data** +This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This pattern ensures that each store operation writes within its assigned block boundary, respecting the @block and @strides attributes. + +On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. 
+ +```mlir +// Load global matrix fragment cooperatively +%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] + : memref<4096x4096xf16> -> tensor_desc<8x32xf16> + +%at = xegpu.load_nd %tdesc_sg + : tensor_desc<8x32xf16> -> vector<8x32xf16> + +// Extract 16-column instruction tiles +%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> +%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> + +// Create shared memory backing matrix +%m = xegpu.create_matrix_desc : matrix_desc<32x256xf16> + +// Compute instruction-tile-level subviews +%mt_inst0 = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32] + : matrix_desc<32x256xf16> -> matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> + +%mt_inst1 = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32 + 16] + : matrix_desc<32x256xf16> -> matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> + +// Store unrolled tiles into shared local memory +xegpu.store_matrix %mt_inst0, %at0 + : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> + +xegpu.store_matrix %mt_inst1, %at1 + : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> + +// Synchronize to ensure SLM is ready +xegpu.barrier + +// Create 32×32 transposed matrix view for DPAS-style consumption +%ma = xegpu.matrix_desc_subview %m + : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16]> + +// Compute 16×16 instruction tiles for DPAS load +%ma_inst0 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32] + : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> + +%ma_inst1 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] + : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> + +%ma_inst2 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] + : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> + +%ma_inst3 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] + : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> + +// Load unrolled tiles for compute +%a_dpas_0 = xegpu.load_matrix %ma_inst0 + : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> + +%a_dpas_1 = xegpu.load_matrix %ma_inst1 + : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> + +%a_dpas_2 = xegpu.load_matrix %ma_inst2 + : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> + +%a_dpas_3 = xegpu.load_matrix %ma_inst3 + : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> +``` + +**MaterializeSLMAccess: Lowering matrix_desc to Physical Memory Access** +This step lowers high-level matrix_desc abstractions and cooperative memory operations (store_matrix, load_matrix) into explicit memory operations (store_chunk, load_1d) over shared local memory (memref). It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. + +Key Concepts: +Matrix-to-Memory Conversion: Replace matrix_desc-based tile abstractions with raw memref and compute physical offsets explicitly. + +Chunked Store: Each thread stores a small fragment (e.g., 8×16) using logical offsets composed with layout metadata. Lowered to store_chunk. + +1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per tile. 
+ +Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides. + +```mlir +// Load global input tile into vector +%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] + : memref<4096x4096xf16> -> tensor_desc<8x32xf16> +%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> + +// Unroll 8x32 into two 8x16 tiles +%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> +%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> + +// Shared local memory buffer +%m = memref.alloc : memref<8192xf16, 3> + +// Compute blocked offset vectors for SLM store +%blk_y = divi_signed %sg_idy * 8, 16 : index +%blk_in_y = remi_signed %sg_idy * 8, 16 : index +%sg_idx_vec = addi %sg_idx * 32, dense<[0, ..., 15]> : vector<16xindex> +%blk_x = divi_unsigned %sg_idx_vec, 16 +%blk_in_x = remi_unsigned %sg_idx_vec, 16 +%offset_vec0 = addi (addi (addi (%blk_y * 16, %blk_in_y), %blk_x * 512), %blk_in_x * 16) + +// Create tensor_desc for SLM store +%tdesc0 = xegpu.create_tdesc %m, %offset_vec0 + : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm> + +// Repeat for second tile +%sg_idx_vec2 = addi %sg_idx * 32, dense<[16, ..., 31]> : vector<16xindex> +%blk_x2 = divi_unsigned %sg_idx_vec2, 16 +%blk_in_x2 = remi_unsigned %sg_idx_vec2, 16 +%offset_vec1 = addi (addi (addi (%blk_y * 16, %blk_in_y), %blk_x2 * 512), %blk_in_x2 * 16) +%tdesc1 = xegpu.create_tdesc %m, %offset_vec1 + : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm> + +// Transpose and store +%at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16> +%at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16> +xegpu.store %tdesc0, %at0_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16> +xegpu.store %tdesc1, %at1_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16> + +// Barrier to ensure SLM visibility +xegpu.barrier + +// ---------------------- Load 1D Block ---------------------- + +// Compute per-tile physical offsets +%inst_start_offset0 = mul %sg_idy, 2 * 512 +%inst_start_offset1 = add %inst_start_offset0, 256 +%inst_start_offset2 = add %inst_start_offset0, 512 +%inst_start_offset3 = add %inst_start_offset0, 768 + +// Create tdesc for 1D block loads +%tdesc0 = xegpu.create_nd_tdesc %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> tdesc<256xf16> +%tdesc1 = xegpu.create_nd_tdesc %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> tdesc<256xf16> +%tdesc2 = xegpu.create_nd_tdesc %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> tdesc<256xf16> +%tdesc3 = xegpu.create_nd_tdesc %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> tdesc<256xf16> + +// Load 1D tiles +%a_dpas_0 = xegpu.load_nd %tdesc0 : tdesc<256xf16> -> vector<256xf16> +%a_dpas_1 = xegpu.load_nd %tdesc1 : tdesc<256xf16> -> vector<256xf16> +%a_dpas_2 = xegpu.load_nd %tdesc2 : tdesc<256xf16> -> vector<256xf16> +%a_dpas_3 = xegpu.load_nd %tdesc3 : tdesc<256xf16> -> vector<256xf16> ``` ## XeGPU Attributes to support Work Item Level semantics From 3b79e51bbeae8e84b9ee01642ff3359fc197d28b Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 11 Jul 2025 17:43:42 -0700 Subject: [PATCH 12/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 203 ++++++++++++++------------------------------- 1 file changed, 62 insertions(+), 141 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index b9abf89e3..7de5b2dba 100644 --- 
a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -351,14 +351,15 @@ This separation simplifies distribution and unrolling passes and enables systema To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. The underlying memory is assumed to follow a row-major layout, and the base matrix_desc represents a raw, unannotated matrix in this layout. The base matrix may be n-dimensional. -Only matrix_desc instances created via subview may carry an xegpu.layout attribute, which specifies the mapping of lanes and registers to fragments of the matrix. This attribute guides the tile distribution process based on the assumed row-major view of the original matrix. In addition, subviewed matrix_desc instances may carry layout metadata such as blocking and striding, which are used to control physical address computation when accessing SLM. +matrix_desc_view creates a matrix_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). Additionally, an xegpu.layout attribute is added to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. The matrix_desc_subview creates a subview on top of the matrix_desc produced by matrix_desc_view, inheriting all of its layout attributes. The subview is then subject to decomposition and distribution. Data movement between SLM and vector registers is performed using load_matrix and store_matrix, which operate at workgroup scope and require the input matrix_desc to be 2D. If the original matrix is higher-dimensional, it must be subviewed to a 2D shape before it can be used with these operations. 
| Ops | Syntax | Example | | :--- | :---- | :--- | |create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16> | -|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @layout_type=1> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | +|matrix_desc_view | operation ::= xegpu.matrix_desc_view \$mdesc attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_a_layout = xegpu.matrix_desc_view %mdesc:matrix_desc<256x128xbf16> -> matrix_desc<256x128xbf16, @stride=[1, 256], @block=[8, 16]> | +|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | |load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | @@ -367,10 +368,16 @@ Users create a `matrix_desc` to represent a matrix stored in shared local memory ```mlir %mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> ``` -Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. The resulting matrix_desc may be annotated with layout attributes such as @block and @strides to describe its memory layout more precisely. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. Additionally, a subview may carry an xegpu.layout attribute that defines how the matrix is logically partitioned into fragments and mapped to work items. +matrix_desc_view annotates matrix_desc with layout attributes such as @block and @strides to describe its memory layout more precisely. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. + +```mlir +%mdesc_a_layout = xegpu.matrix_desc_view %mdesc:matrix_desc<256x128xbf16> -> matrix_desc<256x128xbf16, @stride=[1, 256], @block=[8, 16]> +``` +Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherit memory layout attributes from the base matrix_desc. Additionally, a view may carry an xegpu.layout attribute that defines how the matrix is logically partitioned into fragments and mapped to work items. 
+ ```mlir %mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] - : matrix_desc<3x256x128xbf16> -> matrix_desc<256x128xbf16, @block=[8, 16]> + : matrix_desc<3x256x128xbf16, @block=[8, 16]> -> matrix_desc<256x128xbf16, @block=[8, 16]> %mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] : matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> @@ -394,7 +401,7 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile #Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } -%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16 > +%at = load_nd %tdesc: tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16 > %a = vector.transpose %1 #Coop_wg :vector<32x256xf16> -> vector<256x32xf16> %a_dpas = Conv_layout %2 #Coop_wg #dpas_wg ``` @@ -406,7 +413,7 @@ In this flow: 3. The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. -**After an optimization pass that targets the transpose-A pattern** +**After optimization that targets the transpose-A pattern** The code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. It is generally preferred to fuse transpose and convert_layout earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. @@ -415,68 +422,38 @@ It is generally preferred to fuse transpose and convert_layout earlier in the pi #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0] } // target DPAS layout -%at = xegpu.load_nd %tdesc - : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> - -%m = xegpu.create_matrix_desc - : matrix_desc<32x256xf16> - -%mt = xegpu.matrix_desc_subview %m[0, 0] - : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> - -xegpu.store_matrix %mt[0, 0], %at - : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> - -xegpu.barrier - -%ma = xegpu.matrix_desc_subview %m[0, 0] - : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> - -%a_dpas = xegpu.load_matrix %ma[0, 0] - : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> - +%at = xegpu.load_nd %tdesc : tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> +%mt = xegpu.matrix_desc_view %m : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +xegpu.store_matrix %mt[0, 0], %at : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +gpu.barrier +%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> +%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` -**Adding attributes for Instruction-Level Blocking** -In this example, the matrix is distributed according to hardware capability, using instruction-level blocking. 
This type of blocking does not change the physical memory layout (i.e., there is no memory-level tiling); instead, it affects how data is accessed and lowered to instructions like store_scatter and load_gather. +**Adding attributes for Instruction-Level Blocking: basic blocking** +In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic blocking does not change the physical memory layout (i.e., there is no memory-level tiling); instead, it loweres to instructions like store_scatter and load_gather. -Each lane handles 2 f16 elements (32 bits), and the matrix_desc uses a @strides attribute to represent memory layout. During lowering, offsets must be computed by composing: 1) The logical layout (e.g., xegpu.layout), and 2) The memory layout (e.g., @strides), to produce final physical offsets. ```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [1, 32], order = [1, 0] } -// Load a tile cooperatively using logical layout #Coop_t_wg -%at = xegpu.load_nd %tdesc - : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> -// Allocate a matrix_desc over shared memory -%m = xegpu.create_matrix_desc - : matrix_desc<32x256xf16> - -// Subview with strides, using the same layout for consistent offset mapping -%mt = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> - -// Store matrix with instruction-level blocking (per-inst offsets computed using strides) -xegpu.store_matrix %mt[0, 0], %at - : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +%at = xegpu.load_nd %tdesc: tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = xegpu.create_matrix_desc: matrix_desc<8192xf16> +%mt = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +xegpu.store_matrix %mt[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> -// Synchronize before loading the transposed result -xegpu.barrier +gpu.barrier -// Subview for DPAS layout (transposed, larger tile per subgroup) -%ma = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> - -// Load matrix cooperatively with new layout -%a_dpas = xegpu.load_matrix %ma[0, 0] - : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> +%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> +%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` **Optimized with Blocking: Lowering to store_chunk and 1D Block Load** -This pattern demonstrates a more optimized strategy using instruction-level blocking, enabling the use of efficient memory instructions such as store_chunk and 1D block load. For correct and efficient lowering, several constraints must be satisfied: +This pattern demonstrates a more optimized strategy for instruction-level blocking, enabling the use of efficient memory instructions such as store_chunk and 1D block load. For correct and efficient lowering, several constraints must be satisfied: The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of store_chunk and 1D block load. -Blocking must be explicitly expressed in the memory layout via the @block attribute. 
Two related matrix_desc subviews (e.g., producer and consumer) must have consistent block sizes. In some cases, the block shape may be transposed between the two to accommodate different access orders. +Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related matrix_desc subviews (e.g., producer and consumer) must have consistent block sizes. If one matrix_desc is transposed, the block shape should match the transposed shape of the other one. Each instruction must access only within its assigned matrix block boundary — no cross-block accesses are allowed. @@ -485,38 +462,21 @@ During lowering, store_matrix is lowered to store_chunk if the matrix has stride ```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 16], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [16, 16], order = [1, 0] } -// Load matrix cooperatively from global memory -%at = xegpu.load_nd %tdesc - : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> -// Allocate shared local memory buffer -%m = xegpu.create_matrix_desc - : matrix_desc<32x256xf16> - -// Subview with both block and stride attributes -%mt = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> - -// Store cooperatively into SLM using blocking-aware lowering (store_chunk) -xegpu.store_matrix %mt[0, 0], %at - : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> - -// Synchronize before reuse -xegpu.barrier - -// Subview with matching block shape for 1D block load -%ma = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> +%at = xegpu.load_nd %tdesc : tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> +%mt = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> +xegpu.store_matrix %mt[0, 0], %at : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> -// Load cooperatively from SLM using 1D block load -%a_dpas = xegpu.load_matrix %ma[0, 0] - : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16> +gpu.barrier +%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> +%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16> ``` **Workgroup to Subgroup Distribution** This example illustrates how data is distributed from workgroup to subgroups. It demonstrates how load_matrix and store_matrix cooperate with matrix_desc subviews to enable efficient subgroup distribution. In this step, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. -The matrix is assumed to be stored in row-major contiguous layout, and indexing into it is performed using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the load_matrix/store_matrix abstraction is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. 
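As a concrete illustration of the split between logical coordinates and physical offsets, the following Python sketch (illustrative only, not dialect code) computes the physical SLM offset of a logical element for the 32x256 matrix with @strides=[1, 32] and @block=[16, 16] used in these examples; the block-level strides [256, 512] and in-block strides [1, 16] are the ones spelled out in the MaterializeSLMAccess example further below.

```python
# Illustrative only: logical (r, c) -> physical offset under @block=[16, 16], using the
# blocked strides from the MaterializeSLMAccess example
# ([32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16]).
def blocked_offset(r, c, block=(16, 16), blk_strides=(256, 512), in_strides=(1, 16)):
    blk_y, in_y = divmod(r, block[0])   # logical row -> (block row, row within block)
    blk_x, in_x = divmod(c, block[1])   # logical col -> (block col, col within block)
    return (blk_y * blk_strides[0] + blk_x * blk_strides[1]
            + in_y * in_strides[0] + in_x * in_strides[1])

# A subview only shifts the logical coordinates fed in; @block and @strides are unchanged.
print(blocked_offset(8, 32))   # block (0, 2), in-block (8, 0) -> 2*512 + 8 = 1032
```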
+The matrix is assumed to be stored in row-major contiguous layout, and indexing into it is performed using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the matrix_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. ```mlir #coop_t_inst = { inst_data = [8, 16] } @@ -525,126 +485,87 @@ The matrix is assumed to be stored in row-major contiguous layout, and indexing // Each subgroup loads its portion of the global matrix using inst_data layout %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] : memref<4096x4096xf16> -> tensor_desc<8x32xf16, #coop_t_inst> - %at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> - -// Allocate a matrix in SLM with row-major layout %m = xegpu.create_matrix_desc : matrix_desc<32x256xf16> - -// Subgroup subview: logical coords computed assuming row-major layout -%mt_sg = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32] +%mt = xegpu.matrix_desc_view %m + : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> +%mt_sg = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32] : matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> - -> matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32]> - -// Store vector into SLM using per-inst tile and logical mapping + -> matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> xegpu.store_matrix %mt_sg, %at : vector<8x32xf16>, matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> -// Barrier to synchronize SLM access -xegpu.barrier +gpu.barrier -// Subview for DPAS tile shape -%ma = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> - -// Subgroup-level subview for cooperative DPAS load +%ma = xegpu.matrix_desc_view %m + : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> %ma_sg = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * (32 % 32)] : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> -> matrix_desc<32x32xf16, @block=[16, 16]> - -// Load matrix cooperatively from SLM using 1D block load %a_dpas = xegpu.load_matrix %ma_sg : matrix_desc<32x32xf16, @block=[16, 16], #dpas_t_inst> -> vector<32x32xf16> ``` **Unrolling Guided by Inst_data** -This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This pattern ensures that each store operation writes within its assigned block boundary, respecting the @block and @strides attributes. - -On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. +This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. 
This pattern ensures that each load and store operation writes within its assigned block boundary, respecting the @block and @strides attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. ```mlir -// Load global matrix fragment cooperatively %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] : memref<4096x4096xf16> -> tensor_desc<8x32xf16> - -%at = xegpu.load_nd %tdesc_sg - : tensor_desc<8x32xf16> -> vector<8x32xf16> - -// Extract 16-column instruction tiles +%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> %at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> %at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> - -// Create shared memory backing matrix -%m = xegpu.create_matrix_desc : matrix_desc<32x256xf16> - -// Compute instruction-tile-level subviews -%mt_inst0 = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32] - : matrix_desc<32x256xf16> -> matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> - -%mt_inst1 = xegpu.matrix_desc_subview %m[%sg_idy * 8, %sg_idx * 32 + 16] - : matrix_desc<32x256xf16> -> matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> - -// Store unrolled tiles into shared local memory +%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> +%mt = xegpu.matrix_desc_view %m + : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> +%mt_inst0 = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32] + : matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> +%mt_inst1 = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32 + 16] + : matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> xegpu.store_matrix %mt_inst0, %at0 : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> - xegpu.store_matrix %mt_inst1, %at1 : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> -// Synchronize to ensure SLM is ready -xegpu.barrier +gpu.barrier -// Create 32×32 transposed matrix view for DPAS-style consumption %ma = xegpu.matrix_desc_subview %m - : matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16]> - -// Compute 16×16 instruction tiles for DPAS load + : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16]> %ma_inst0 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32] : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> - %ma_inst1 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> - %ma_inst2 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> - %ma_inst3 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> - -// Load unrolled tiles for compute %a_dpas_0 = xegpu.load_matrix %ma_inst0 : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> - %a_dpas_1 = xegpu.load_matrix %ma_inst1 : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> - %a_dpas_2 = xegpu.load_matrix %ma_inst2 : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> - %a_dpas_3 = xegpu.load_matrix 
%ma_inst3 : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> ``` **MaterializeSLMAccess: Lowering matrix_desc to Physical Memory Access** -This step lowers high-level matrix_desc abstractions and cooperative memory operations (store_matrix, load_matrix) into explicit memory operations (store_chunk, load_1d) over shared local memory (memref). It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. +This step lowers high-level matrix_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. Key Concepts: Matrix-to-Memory Conversion: Replace matrix_desc-based tile abstractions with raw memref and compute physical offsets explicitly. -Chunked Store: Each thread stores a small fragment (e.g., 8×16) using logical offsets composed with layout metadata. Lowered to store_chunk. +Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk. -1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per tile. +1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per 1D block. Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides. ```mlir -// Load global input tile into vector %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] : memref<4096x4096xf16> -> tensor_desc<8x32xf16> -%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> - -// Unroll 8x32 into two 8x16 tiles +%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> %at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> %at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> @@ -678,7 +599,7 @@ xegpu.store %tdesc0, %at0_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf xegpu.store %tdesc1, %at1_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16> // Barrier to ensure SLM visibility -xegpu.barrier +gpu.barrier // ---------------------- Load 1D Block ---------------------- From 5dd778e4f28b456fd32c7b25abd44e16b7dde64d Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 11 Jul 2025 18:49:08 -0700 Subject: [PATCH 13/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 34 +++++++++++++++++++++------------- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 7de5b2dba..9d34da7d3 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -414,6 +414,7 @@ In this flow: 3. The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. **After optimization that targets the transpose-A pattern** + The code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. 
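The cooperative split can be made concrete with a small Python sketch (illustrative only, assuming a row-major linearization of the subgroup grid; in the real layout the order attribute controls this): under #Coop_t_wg with sg_layout=[4, 8] and sg_data=[8, 32], the 32x256 tile is covered by 32 subgroup fragments of 8x32 elements each.

```python
# Illustrative only: which 8x32 fragment of the 32x256 tile subgroup (i, j) owns
# under sg_layout=[4, 8], sg_data=[8, 32].
def sg_fragment_origin(sg_id, sg_layout=(4, 8), sg_data=(8, 32)):
    i, j = divmod(sg_id, sg_layout[1])        # assumed row-major subgroup numbering
    return (i * sg_data[0], j * sg_data[1])   # logical origin of the fragment

origins = [sg_fragment_origin(s) for s in range(4 * 8)]
assert origins[0] == (0, 0) and origins[-1] == (24, 224)  # 32 fragments tile 32x256 exactly
```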
It is generally preferred to fuse transpose and convert_layout earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. @@ -431,7 +432,8 @@ gpu.barrier %a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` -**Adding attributes for Instruction-Level Blocking: basic blocking** +**Adding Attributes for Basic Instruction-Level Blocking** + In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic blocking does not change the physical memory layout (i.e., there is no memory-level tiling); instead, it loweres to instructions like store_scatter and load_gather. ```mlir @@ -448,7 +450,8 @@ gpu.barrier %ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> %a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> ``` -**Optimized with Blocking: Lowering to store_chunk and 1D Block Load** +**Optimized Blocking: Lowering to store_chunk and 1D Block Load** + This pattern demonstrates a more optimized strategy for instruction-level blocking, enabling the use of efficient memory instructions such as store_chunk and 1D block load. For correct and efficient lowering, several constraints must be satisfied: The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of store_chunk and 1D block load. @@ -474,6 +477,7 @@ gpu.barrier ``` **Workgroup to Subgroup Distribution** + This example illustrates how data is distributed from workgroup to subgroups. It demonstrates how load_matrix and store_matrix cooperate with matrix_desc subviews to enable efficient subgroup distribution. In this step, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. The matrix is assumed to be stored in row-major contiguous layout, and indexing into it is performed using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the matrix_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. @@ -508,6 +512,7 @@ gpu.barrier : matrix_desc<32x32xf16, @block=[16, 16], #dpas_t_inst> -> vector<32x32xf16> ``` **Unrolling Guided by Inst_data** + This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This pattern ensures that each load and store operation writes within its assigned block boundary, respecting the @block and @strides attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. 
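To show what "unrolling guided by inst_data" amounts to, here is a minimal Python sketch (illustrative only): enumerating the inst_data-sized tiles inside one subgroup's fragment yields the origins of the separate matrix_desc_subview / load_matrix operations in the IR below.

```python
# Illustrative only: a 32x32 subgroup fragment with inst_data=[16, 16] unrolls into
# four 16x16 instruction tiles, matching the four load_matrix ops in the example below.
def inst_tile_origins(sg_data=(32, 32), inst_data=(16, 16)):
    return [(r, c)
            for r in range(0, sg_data[0], inst_data[0])
            for c in range(0, sg_data[1], inst_data[1])]

print(inst_tile_origins())   # [(0, 0), (0, 16), (16, 0), (16, 16)]
```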
```mlir @@ -530,7 +535,7 @@ xegpu.store_matrix %mt_inst1, %at1 gpu.barrier -%ma = xegpu.matrix_desc_subview %m +%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16]> %ma_inst0 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32] : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> @@ -551,6 +556,7 @@ gpu.barrier ``` **MaterializeSLMAccess: Lowering matrix_desc to Physical Memory Access** + This step lowers high-level matrix_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. Key Concepts: @@ -572,15 +578,16 @@ Offset Calculation: Logical per-lane coordinates are transformed into logical bl // Shared local memory buffer %m = memref.alloc : memref<8192xf16, 3> +// ---------------------- Chunked Load ---------------------- // Compute blocked offset vectors for SLM store %blk_y = divi_signed %sg_idy * 8, 16 : index %blk_in_y = remi_signed %sg_idy * 8, 16 : index %sg_idx_vec = addi %sg_idx * 32, dense<[0, ..., 15]> : vector<16xindex> %blk_x = divi_unsigned %sg_idx_vec, 16 %blk_in_x = remi_unsigned %sg_idx_vec, 16 -%offset_vec0 = addi (addi (addi (%blk_y * 16, %blk_in_y), %blk_x * 512), %blk_in_x * 16) - -// Create tensor_desc for SLM store +// calculate physic addresses with pre-computed strides of the blocked matrix. +// [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16] +%offset_vec0 = addi (addi (addi (%blk_in_y, %blk_in_x * 16), %blk_y * 256),%blk_x * 512) %tdesc0 = xegpu.create_tdesc %m, %offset_vec0 : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm> @@ -588,34 +595,35 @@ Offset Calculation: Logical per-lane coordinates are transformed into logical bl %sg_idx_vec2 = addi %sg_idx * 32, dense<[16, ..., 31]> : vector<16xindex> %blk_x2 = divi_unsigned %sg_idx_vec2, 16 %blk_in_x2 = remi_unsigned %sg_idx_vec2, 16 -%offset_vec1 = addi (addi (addi (%blk_y * 16, %blk_in_y), %blk_x2 * 512), %blk_in_x2 * 16) +%offset_vec1 = addi (addi (addi (%blk_in_y, %blk_in_x2 * 16), %blk_y * 256),%blk_x2 * 512) %tdesc1 = xegpu.create_tdesc %m, %offset_vec1 : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm> -// Transpose and store +// The transpose is added as we remove the transpose attribute out from chunked load/store and expect an explict data transpose. +// it will be no op after lane distribution since each lane owns same data when [8,1] is transpose to [1, 8] %at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16> %at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16> xegpu.store %tdesc0, %at0_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16> xegpu.store %tdesc1, %at1_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16> -// Barrier to ensure SLM visibility gpu.barrier // ---------------------- Load 1D Block ---------------------- - -// Compute per-tile physical offsets +// Compute per-block physical offsets +// pre-computed strides of the blocked matrix: [256x32] blocked as [2x16x16x16, strides=512x256x16x1] +// sg_idx*32 coord to blocked matrix ccord: sg_idx*32%32/16 (0), sg_idx*32%32%16 (0). 
%32 due matrix shape[1] is 32 +// sg_idy*32 coord to blocked matrix coord: sg_idy*32/16, sg_idy*32%16 (0) +// then map to physical addr using stride [2x16x16x16, strides=512x256x16x1], get sg_idy*32/16 *512 %inst_start_offset0 = mul %sg_idy, 2 * 512 %inst_start_offset1 = add %inst_start_offset0, 256 %inst_start_offset2 = add %inst_start_offset0, 512 %inst_start_offset3 = add %inst_start_offset0, 768 -// Create tdesc for 1D block loads %tdesc0 = xegpu.create_nd_tdesc %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> tdesc<256xf16> %tdesc1 = xegpu.create_nd_tdesc %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> tdesc<256xf16> %tdesc2 = xegpu.create_nd_tdesc %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> tdesc<256xf16> %tdesc3 = xegpu.create_nd_tdesc %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> tdesc<256xf16> -// Load 1D tiles %a_dpas_0 = xegpu.load_nd %tdesc0 : tdesc<256xf16> -> vector<256xf16> %a_dpas_1 = xegpu.load_nd %tdesc1 : tdesc<256xf16> -> vector<256xf16> %a_dpas_2 = xegpu.load_nd %tdesc2 : tdesc<256xf16> -> vector<256xf16> From 63bed6aeca05550bf0373d098a472dfcae70346d Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Mon, 21 Jul 2025 16:01:09 -0700 Subject: [PATCH 14/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 226 ++++++++++++++++++--------------------------- 1 file changed, 92 insertions(+), 134 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 9d34da7d3..0e953058b 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -349,40 +349,32 @@ This separation simplifies distribution and unrolling passes and enables systema **Basic Usage** -To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. The underlying memory is assumed to follow a row-major layout, and the base matrix_desc represents a raw, unannotated matrix in this layout. The base matrix may be n-dimensional. - -matrix_desc_view creates a matrix_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). Additionally, an xegpu.layout attribute is added to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. The matrix_desc_subview creates a subview on top of the matrix_desc produced by matrix_desc_view, inheriting all of its layout attributes. The subview is then subject to decomposition and distribution. - -Data movement between SLM and vector registers is performed using load_matrix and store_matrix, which operate at workgroup scope and require the input matrix_desc to be 2D. If the original matrix is higher-dimensional, it must be subviewed to a 2D shape before it can be used with these operations. +To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. Create_matrix_desc initializes a matrix_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The matrix_desc_subview creates a subview on top of the matrix_desc, inheriting all of its layout attributes. Load_matrix and store_matrix performs data movement between SLM and vector registers. 
xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. | Ops | Syntax | Example | | :--- | :---- | :--- | -|create_matrix_desc | operation ::= xegpu.create_matrix_desc attr-dict : type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc : matrix_desc<256x128xbf16> | -|matrix_desc_view | operation ::= xegpu.matrix_desc_view \$mdesc attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_a_layout = xegpu.matrix_desc_view %mdesc:matrix_desc<256x128xbf16> -> matrix_desc<256x128xbf16, @stride=[1, 256], @block=[8, 16]> | +|create_matrix_desc | operation ::= xegpu.create_matrix_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc %m: memref<65536xi8, 3> -> matrix_desc<256x128xbf16> | |matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | |load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | -Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation allocates SLM for the matrix, assuming a row-major contiguous layout. +Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result matrix_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. -```mlir -%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> -``` -matrix_desc_view annotates matrix_desc with layout attributes such as @block and @strides to describe its memory layout more precisely. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. +When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout. ```mlir -%mdesc_a_layout = xegpu.matrix_desc_view %mdesc:matrix_desc<256x128xbf16> -> matrix_desc<256x128xbf16, @stride=[1, 256], @block=[8, 16]> +%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> +%mdesc_b = xegpu. create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<32x256xf16, @strides=[1, 32]> ``` -Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. 
Subview inherit memory layout attributes from the base matrix_desc. Additionally, a view may carry an xegpu.layout attribute that defines how the matrix is logically partitioned into fragments and mapped to work items. +Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base matrix_desc. For GEMM use case, matrix operations typically work on 2D matrix_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. ```mlir %mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] : matrix_desc<3x256x128xbf16, @block=[8, 16]> -> matrix_desc<256x128xbf16, @block=[8, 16]> %mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] - : matrix_desc<256x128xbf16> -> matrix_desc<256x64xbf16, @strides=[128, 1]> + : matrix_desc<256x128xbf16, @strides=[128, 1]> -> matrix_desc<256x64xbf16, @strides=[128, 1]> ``` - Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. ```mlir vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> @@ -401,9 +393,9 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile #Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } -%at = load_nd %tdesc: tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16 > -%a = vector.transpose %1 #Coop_wg :vector<32x256xf16> -> vector<256x32xf16> -%a_dpas = Conv_layout %2 #Coop_wg #dpas_wg +%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16> +%a_dpas = Conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> ``` In this flow: @@ -415,50 +407,51 @@ In this flow: **After optimization that targets the transpose-A pattern** -The code is transformed to use store_matrix and load_matrix to materialize the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. +The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. -It is generally preferred to fuse transpose and convert_layout earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. +It is generally preferred to detect the “transpose + convert_layout” pattern and fuse them earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. 
```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0] } // target DPAS layout -%at = xegpu.load_nd %tdesc : tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> -%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> -%mt = xegpu.matrix_desc_view %m : matrix_desc<32x256xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> -xegpu.store_matrix %mt[0, 0], %at : vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%mt = xegpu. create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<32x256xf16, @strides=[1, 32]> +xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32]> gpu.barrier -%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> -%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> +%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<256x32xf16> +%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: matrix_desc<256x32xf16> -> vector<256x32xf16> ``` -**Adding Attributes for Basic Instruction-Level Blocking** +**Layout Assignment** +***Basic Blocking: Using regular load and store instruction*** -In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic blocking does not change the physical memory layout (i.e., there is no memory-level tiling); instead, it loweres to instructions like store_scatter and load_gather. +In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic instruction blocking does not try to block memory layout. It lowers to instructions like chunked store and load_gather. 
```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [1, 32], order = [1, 0] } -%at = xegpu.load_nd %tdesc: tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> -%m = xegpu.create_matrix_desc: matrix_desc<8192xf16> -%mt = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> -xegpu.store_matrix %mt[0, 0], %at: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32], #Coop_t_wg> +%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%m = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @strides=[1, 32]> +xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32]> gpu.barrier -%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, #dpas_t_wg> -%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, #dpas_t_wg> -> vector<256x32xf16> +%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16> +%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: matrix_desc<256x32xf16> -> vector<256x32xf16> ``` -**Optimized Blocking: Lowering to store_chunk and 1D Block Load** +***Optimized Blocking: Lowering to store_chunk and 1D Block Load*** -This pattern demonstrates a more optimized strategy for instruction-level blocking, enabling the use of efficient memory instructions such as store_chunk and 1D block load. For correct and efficient lowering, several constraints must be satisfied: +This pattern demonstrates a more optimized strategy for instruction-level blocking, enabling the use of efficient memory instructions such as 1D block load. For correct and efficient lowering, several constraints must be satisfied: -The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of store_chunk and 1D block load. +- The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of chunked store and 1D block load. -Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related matrix_desc subviews (e.g., producer and consumer) must have consistent block sizes. If one matrix_desc is transposed, the block shape should match the transposed shape of the other one. +- Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related matrix_desc (e.g., producer and consumer) must have consistent block sizes. If one matrix_desc is transposed, the block shape should match the transposed shape of the other one. -Each instruction must access only within its assigned matrix block boundary — no cross-block accesses are allowed. +- Each instruction must access only within its assigned matrix block boundary — no cross-block accesses are allowed. During lowering, store_matrix is lowered to store_chunk if the matrix has strides, and load_matrix is lowered to 1D block load if the matrix has a blocked layout. 
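A rough, non-normative sketch of these two lowering targets (mirroring the forms used in the MaterializeSLMAccess example later in this document; the SSA names %frag_t, %slm, %lane_offsets, and %blk_offset are illustrative):

```mlir
// Strided memory layout (e.g., @strides=[1, 32]): store_matrix becomes a chunked
// scatter store, where each of the 16 lanes writes a chunk of 8 contiguous elements.
xegpu.store %frag_t, %slm, %lane_offsets @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex>

// Blocked memory layout (e.g., @block=[16, 16]): load_matrix becomes 1D block loads,
// each reading one contiguous 16x16 block (256 elements) from its block start offset.
%blk = xegpu.load_nd %slm, %blk_offset : memref<8192xf16, 3>, index -> vector<256xf16>
```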
@@ -466,21 +459,21 @@ During lowering, store_matrix is lowered to store_chunk if the matrix has stride #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 16], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [16, 16], order = [1, 0] } -%at = xegpu.load_nd %tdesc : tensor_desc<4096x4096xf16, #Coop_t_wg> -> vector<32x256xf16> -%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> -%mt = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> -xegpu.store_matrix %mt[0, 0], %at : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #Coop_t_wg> +%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier -%ma = xegpu.matrix_desc_view %m : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -%a_dpas = xegpu.load_matrix %ma[0, 0] : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16> +%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]> +%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16> ``` **Workgroup to Subgroup Distribution** -This example illustrates how data is distributed from workgroup to subgroups. It demonstrates how load_matrix and store_matrix cooperate with matrix_desc subviews to enable efficient subgroup distribution. In this step, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. +This example illustrates how load_matrix and store_matrix are distributed from workgroup to subgroups. After distribution, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. -The matrix is assumed to be stored in row-major contiguous layout, and indexing into it is performed using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the matrix_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. +The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the matrix_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. 
```mlir #coop_t_inst = { inst_data = [8, 16] } @@ -491,29 +484,19 @@ The matrix is assumed to be stored in row-major contiguous layout, and indexing : memref<4096x4096xf16> -> tensor_desc<8x32xf16, #coop_t_inst> %at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> -%m = xegpu.create_matrix_desc - : matrix_desc<32x256xf16> -%mt = xegpu.matrix_desc_view %m - : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> -%mt_sg = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32] - : matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> - -> matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> -xegpu.store_matrix %mt_sg, %at - : vector<8x32xf16>, matrix_desc<8x32xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] #coop_t_inst + : vector<8x32xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier - -%ma = xegpu.matrix_desc_view %m - : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> -%ma_sg = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * (32 % 32)] - : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst> - -> matrix_desc<32x32xf16, @block=[16, 16]> -%a_dpas = xegpu.load_matrix %ma_sg - : matrix_desc<32x32xf16, @block=[16, 16], #dpas_t_inst> -> vector<32x32xf16> +%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]> +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * (32 % 32)] #dpas_t_inst + : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16> ``` -**Unrolling Guided by Inst_data** -This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This pattern ensures that each load and store operation writes within its assigned block boundary, respecting the @block and @strides attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. +**Unrolling Guided by Inst_data** +This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This inst_data attributes ensures that each store operation writes within its assigned block boundary, respecting the @block attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. 
```mlir %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] @@ -521,38 +504,23 @@ This example illustrates how matrix loads and stores can be unrolled into smalle %at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> %at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> %at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> -%m = xegpu.create_matrix_desc : matrix_desc<8192xf16> -%mt = xegpu.matrix_desc_view %m - : matrix_desc<8192xf16> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32], #coop_t_inst> -%mt_inst0 = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32] - : matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> -%mt_inst1 = xegpu.matrix_desc_subview %mt[%sg_idy * 8, %sg_idx * 32 + 16] - : matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %mt_inst0, %at0 - : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %mt_inst1, %at1 - : vector<8x16xf16>, matrix_desc<8x16xf16, @block=[16, 16], @strides=[1, 32]> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at0, %mt[%sg_idy * 8, %sg_idx * 32] + : vector<8x16xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at1, %mt[%sg_idy * 8, %sg_idx * 32 + 16] + : vector<8x16xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier - -%ma = xegpu.matrix_desc_view %m - : matrix_desc<8192xf16> -> matrix_desc<256x32xf16, @block=[16, 16]> -%ma_inst0 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32] - : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> -%ma_inst1 = xegpu.matrix_desc_subview %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] - : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> -%ma_inst2 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] - : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> -%ma_inst3 = xegpu.matrix_desc_subview %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] - : matrix_desc<256x32xf16> -> matrix_desc<16x16xf16, @block=[16, 16]> -%a_dpas_0 = xegpu.load_matrix %ma_inst0 - : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> -%a_dpas_1 = xegpu.load_matrix %ma_inst1 - : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> -%a_dpas_2 = xegpu.load_matrix %ma_inst2 - : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> -%a_dpas_3 = xegpu.load_matrix %ma_inst3 - : matrix_desc<16x16xf16, @block=[16, 16]> -> vector<16x16xf16> +%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]> +%a_dpas_0 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32] + : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> +%a_dpas_1 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] + : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> +%a_dpas_2 = xegpu.load_matrix %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] + : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> +%a_dpas_3 = xegpu.load_matrix %[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] + : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> ``` **MaterializeSLMAccess: Lowering matrix_desc to Physical Memory Access** @@ -560,13 +528,11 @@ gpu.barrier This step lowers high-level matrix_desc operations (store_matrix, 
load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates.

Key Concepts:
-Matrix-to-Memory Conversion: Replace matrix_desc-based tile abstractions with raw memref and compute physical offsets explicitly.
+- Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk.

-Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk.
+- 1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per 1D block.

-1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per 1D block.
-
-Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides.
+- Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides.

```mlir
%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32]
@@ -576,41 +542,38 @@ Offset Calculation: Logical per-lane coordinates are transformed into logical bl
%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16>

// Shared local memory buffer
-%m = memref.alloc : memref<8192xf16, 3>
+%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
+
+// ---------------------- Chunked Store ----------------------
+// The transpose is added because the transpose attribute is removed from chunked load/store, so an explicit data transpose is expected.
+// It becomes a no-op after lane distribution, since each lane owns the same data when [8, 1] is transposed to [1, 8].
+%at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16>

-// ---------------------- Chunked Load ----------------------
// Compute blocked offset vectors for SLM store
-%blk_y = divi_signed %sg_idy * 8, 16 : index
-%blk_in_y = remi_signed %sg_idy * 8, 16 : index
-%sg_idx_vec = addi %sg_idx * 32, dense<[0, ..., 15]> : vector<16xindex>
-%blk_x = divi_unsigned %sg_idx_vec, 16
-%blk_in_x = remi_unsigned %sg_idx_vec, 16
+%blk_y=sg_idy*8 /16: index
+%blk_in_y=sg_idy*8 %16: index
+%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex>
+%blk_x=%sg_idx_vec /16: vector<16xindex >
+%blk_in_x=%sg_idx_vec %16: vector<16xindex >
+
// calculate physical addresses with pre-computed strides of the blocked matrix.
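// Illustrative derivation (assuming @strides=[1, 32] and @block=[16, 16]): a logical
// element (r, c) maps to block coords (r/16, c/16) and in-block coords (r%16, c%16).
// In-block strides follow the matrix strides, i.e. [1, 16]; one block holds 16*16 = 256
// elements, so stepping one block down costs 256 and stepping one block right costs
// 2*256 = 512 (the 32 rows form 32/16 = 2 block rows).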
// [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16]
-%offset_vec0 = addi (addi (addi (%blk_in_y, %blk_in_x * 16), %blk_y * 256),%blk_x * 512)
-%tdesc0 = xegpu.create_tdesc %m, %offset_vec0
-   : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm>
+%offset_vec0 = %blk_y * 256 + %blk_x * 512 + %blk_in_y + %blk_in_x*16
+xegpu.store %at0_t, %m, %offset_vec0 @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex>

// Repeat for second tile
-%sg_idx_vec2 = addi %sg_idx * 32, dense<[16, ..., 31]> : vector<16xindex>
-%blk_x2 = divi_unsigned %sg_idx_vec2, 16
-%blk_in_x2 = remi_unsigned %sg_idx_vec2, 16
-%offset_vec1 = addi (addi (addi (%blk_in_y, %blk_in_x2 * 16), %blk_y * 256),%blk_x2 * 512)
-%tdesc1 = xegpu.create_tdesc %m, %offset_vec1
-   : memref<8192xf16, 3>, vector<16xindex> -> tdesc<16x8xf16, chunk=8, scope=slm>
-
-// The transpose is added as we remove the transpose attribute out from chunked load/store and expect an explict data transpose.
-// it will be no op after lane distribution since each lane owns same data when [8,1] is transpose to [1, 8]
-%at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16>
 %at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16>
-xegpu.store %tdesc0, %at0_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16>
-xegpu.store %tdesc1, %at1_t : tdesc<16x8xf16, chunk=8, scope=slm>, vector<16x8xf16>
+%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex>
+%blk_x2=%sg_idx_vec2 /16: vector<16xindex >
+%blk_in_x2=%sg_idx_vec2 %16: vector<16xindex >
+%offset_vec1 = %blk_y * 256 + %blk_x2 * 512 + %blk_in_y + %blk_in_x2*16
+xegpu.store %at1_t, %m, %offset_vec1 @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex>
 gpu.barrier

// ---------------------- Load 1D Block ----------------------
// Compute per-block physical offsets
-// pre-computed strides of the blocked matrix: [256x32] blocked as [2x16x16x16, strides=512x256x16x1]
+// pre-computed strides of the blocked matrix: [256x32] blocked as [16x2x16x16, strides=512x256x16x1]
// sg_idx*32 coord to blocked matrix coord: sg_idx*32%32/16 (0), sg_idx*32%32%16 (0). 
%32 due matrix shape[1] is 32 // sg_idy*32 coord to blocked matrix coord: sg_idy*32/16, sg_idy*32%16 (0) // then map to physical addr using stride [2x16x16x16, strides=512x256x16x1], get sg_idy*32/16 *512 @@ -619,15 +582,10 @@ gpu.barrier %inst_start_offset2 = add %inst_start_offset0, 512 %inst_start_offset3 = add %inst_start_offset0, 768 -%tdesc0 = xegpu.create_nd_tdesc %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> tdesc<256xf16> -%tdesc1 = xegpu.create_nd_tdesc %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> tdesc<256xf16> -%tdesc2 = xegpu.create_nd_tdesc %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> tdesc<256xf16> -%tdesc3 = xegpu.create_nd_tdesc %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> tdesc<256xf16> - -%a_dpas_0 = xegpu.load_nd %tdesc0 : tdesc<256xf16> -> vector<256xf16> -%a_dpas_1 = xegpu.load_nd %tdesc1 : tdesc<256xf16> -> vector<256xf16> -%a_dpas_2 = xegpu.load_nd %tdesc2 : tdesc<256xf16> -> vector<256xf16> -%a_dpas_3 = xegpu.load_nd %tdesc3 : tdesc<256xf16> -> vector<256xf16> +%a_dpas_0 = xegpu.load_nd %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> vector<256xf16> +%a_dpas_1 = xegpu.load_nd %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> vector<256xf16> +%a_dpas_2 = xegpu.load_nd %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> vector<256xf16> +%a_dpas_3 = xegpu.load_nd %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> vector<256xf16> ``` ## XeGPU Attributes to support Work Item Level semantics From da43f5d3e38ece998bffc77a86b94700918dd170 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Mon, 21 Jul 2025 16:02:17 -0700 Subject: [PATCH 15/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 0e953058b..bea245905 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -496,6 +496,7 @@ gpu.barrier ``` **Unrolling Guided by Inst_data** + This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This inst_data attributes ensures that each store operation writes within its assigned block boundary, respecting the @block attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. 
```mlir From dbaea60b89ad98ce91912b4e02e85b8ca79242a1 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Wed, 13 Aug 2025 18:57:37 -0700 Subject: [PATCH 16/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index bea245905..b667327ed 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -354,9 +354,9 @@ To represent a matrix stored in shared local memory (SLM), users must create a m | Ops | Syntax | Example | | :--- | :---- | :--- | |create_matrix_desc | operation ::= xegpu.create_matrix_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc %m: memref<65536xi8, 3> -> matrix_desc<256x128xbf16> | -|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview \$mdesc, DynamicIndexList<\$coord> attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | -|load_matrix | operation ::= xegpu.load_matrix $mdesc attr-dict : type($mdesc), {type(coords)} -> type($res) | %result = xegpu.load_matrix %mdesc : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | -|store_matrix | operation ::= xegpu.store_matrix $mdesc, $val attr-dict : type($mdesc), {type(coords)}, type($val) | %result = xegpu.store_matrix %mdesc, %val : matrix_desc<128x256xbf16, @block=[8, 16]>, vector<128x256xbf16> | +|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview $mdesc[$offsets] attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | +|load_matrix | operation ::= xegpu.load_matrix $mdesc[$offsets] attr-dict : type($mdesc), type(offsets) -> type($res) | %result = xegpu.load_matrix %mdesc[0, 0] : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | +|store_matrix | operation ::= xegpu.store_matrix $val, $mdesc[$offsets] attr-dict : type($val), type($mdesc), type(offsets) | %result = xegpu.store_matrix %val %mdesc[0, 0] : vector<128x256xbf16>, matrix_desc<128x256xbf16, @block=[8, 16]> | Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result matrix_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. @@ -377,12 +377,14 @@ Users can create a subview of a matrix_desc to represent a sliced or partitioned ``` Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. 
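To make the buffer sizing concrete (an illustrative note; the SSA names here are hypothetical, and the ops follow the examples later in this document): a 32x256 f16 matrix occupies 32 * 256 * 2 = 16384 bytes, which is why it is backed by a 1D i8 memref in SLM (address space 3).

```mlir
// 32 x 256 f16 elements x 2 bytes = 16384 bytes of shared local memory
%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
%mdesc = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @strides=[1, 32]>
```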
```mlir -vec_a = load_matrix matrix_desc_a: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> +vec_a = load_matrix matrix_desc_a[0, 0]: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16> ``` Users can store a matrix from a vector value into shared local memory using the store_matrix operation. ```mlir -store_matrix matrix_desc_b, vec_a :matrix_desc<256x128xbf16, @block=[8, 16]>, vector<256x128xbf6> +store_matrix vec_a, matrix_desc_b[0, 0] : vector<256x128xbf6>, matrix_desc<256x128xbf16, @block=[8, 16]> +xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> ``` **Cooperative Transpose Example** From be04d1d71e09f616e2ff8836bb15a3aa05a0ca1b Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 15 Aug 2025 07:19:26 -0700 Subject: [PATCH 17/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 100 ++++++++++++++++++++++----------------------- 1 file changed, 50 insertions(+), 50 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index b667327ed..63388dc1a 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -329,62 +329,62 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem `nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`. -## matrix_desc Type: Simplified Shared Local Memory (SLM) Abstraction +## mem_desc Type: Simplified Shared Local Memory (SLM) Abstraction -To streamline programming of shared local memory (SLM) on Intel Xe architecture, the XeGPU dialect introduces a new type: matrix_desc. This abstraction is designed to simplify the management of workgroup-level tiles in SLM, especially in scenarios involving layout transformations such as transpose, reduction, and blocking. +To streamline programming of shared local memory (SLM) on Intel Xe architecture, the XeGPU dialect introduces a new type: mem_desc. This abstraction is designed to simplify the management of workgroup-level tiles in SLM, especially in scenarios involving layout transformations such as transpose, reduction, and blocking. **Background and Motivation** On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. -Prior to the introduction of matrix_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. +Prior to the introduction of mem_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. 
Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. This complexity was further compounded by hierarchical distribution, where workgroup-level tiles are subdivided across subgroups, instructions, and individual lanes — each step requiring separate address transformation logic. This made the code error-prone and difficult to optimize. **Design and Semantics** -The matrix_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the physical address materialization phase maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the matrix_desc. +The mem_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the physical address materialization phase maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the mem_desc. This separation simplifies distribution and unrolling passes and enables systematic, robust transformations during compilation. The descriptor encapsulates all necessary layout metadata to generate correct and efficient SLM access patterns — supporting both regular loads and 1D block loads — without requiring the user to write explicit address arithmetic. **Basic Usage** -To represent a matrix stored in shared local memory (SLM), users must create a matrix_desc object. Create_matrix_desc initializes a matrix_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The matrix_desc_subview creates a subview on top of the matrix_desc, inheriting all of its layout attributes. Load_matrix and store_matrix performs data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. +To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix performs data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. 
| Ops | Syntax | Example | | :--- | :---- | :--- | -|create_matrix_desc | operation ::= xegpu.create_matrix_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_matrix_desc %m: memref<65536xi8, 3> -> matrix_desc<256x128xbf16> | -|matrix_desc_subview | operation ::= xegpu.matrix_desc_subview $mdesc[$offsets] attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.matrix_desc_subview %mdesc[128, 0]:matrix_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> matrix_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | -|load_matrix | operation ::= xegpu.load_matrix $mdesc[$offsets] attr-dict : type($mdesc), type(offsets) -> type($res) | %result = xegpu.load_matrix %mdesc[0, 0] : matrix_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | -|store_matrix | operation ::= xegpu.store_matrix $val, $mdesc[$offsets] attr-dict : type($val), type($mdesc), type(offsets) | %result = xegpu.store_matrix %val %mdesc[0, 0] : vector<128x256xbf16>, matrix_desc<128x256xbf16, @block=[8, 16]> | +|create_mem_desc | operation ::= xegpu.create_mem_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_mem_desc %m: memref<65536xi8, 3> -> mem_desc<256x128xbf16> | +|mem_desc_subview | operation ::= xegpu.mem_desc_subview $mdesc[$offsets] attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.mem_desc_subview %mdesc[128, 0]:mem_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> mem_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | +|load_matrix | operation ::= xegpu.load_matrix $mdesc[$offsets] attr-dict : type($mdesc), type(offsets) -> type($res) | %result = xegpu.load_matrix %mdesc[0, 0] : mem_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | +|store_matrix | operation ::= xegpu.store_matrix $val, $mdesc[$offsets] attr-dict : type($val), type($mdesc), type(offsets) | %result = xegpu.store_matrix %val %mdesc[0, 0] : vector<128x256xbf16>, mem_desc<128x256xbf16, @block=[8, 16]> | -Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result matrix_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. +Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result mem_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout. ```mlir -%mdesc_a = xegpu.create_matrix_desc: matrix_desc<256x128xbf16> -%mdesc_b = xegpu. create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<32x256xf16, @strides=[1, 32]> +%mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> +%mdesc_b = xegpu. 
create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> ``` -Users can create a subview of a matrix_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base matrix_desc. For GEMM use case, matrix operations typically work on 2D matrix_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. +Users can create a subview of a mem_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base mem_desc. For GEMM use case, matrix operations typically work on 2D mem_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. ```mlir -%mdesc_a = xegpu.matrix_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] - : matrix_desc<3x256x128xbf16, @block=[8, 16]> -> matrix_desc<256x128xbf16, @block=[8, 16]> +%mdesc_a = xegpu.mem_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] + : mem_desc<3x256x128xbf16, @block=[8, 16]> -> mem_desc<256x128xbf16, @block=[8, 16]> -%mdesc_coop_a = xegpu.matrix_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] - : matrix_desc<256x128xbf16, @strides=[128, 1]> -> matrix_desc<256x64xbf16, @strides=[128, 1]> +%mdesc_coop_a = xegpu.mem_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] + : mem_desc<256x128xbf16, @strides=[128, 1]> -> mem_desc<256x64xbf16, @strides=[128, 1]> ``` Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. ```mlir -vec_a = load_matrix matrix_desc_a[0, 0]: matrix_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> -%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16> +vec_a = load_matrix mem_desc_a[0, 0]: mem_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] : mem_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16> ``` Users can store a matrix from a vector value into shared local memory using the store_matrix operation. ```mlir -store_matrix vec_a, matrix_desc_b[0, 0] : vector<256x128xbf6>, matrix_desc<256x128xbf16, @block=[8, 16]> -xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +store_matrix vec_a, mem_desc_b[0, 0] : vector<256x128xbf6>, mem_desc<256x128xbf16, @block=[8, 16]> +xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> ``` **Cooperative Transpose Example** @@ -419,11 +419,11 @@ It is generally preferred to detect the “transpose + convert_layout” pattern %at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> -%mt = xegpu. create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<32x256xf16, @strides=[1, 32]> -xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32]> +%mt = xegpu. 
create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> +xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]> gpu.barrier -%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3>-> matrix_desc<256x32xf16> -%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: matrix_desc<256x32xf16> -> vector<256x32xf16> +%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<256x32xf16> +%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: mem_desc<256x32xf16> -> vector<256x32xf16> ``` **Layout Assignment** @@ -437,13 +437,13 @@ In this example, the xegpu.layout is extended to support instruction-level block %at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> -%m = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @strides=[1, 32]> -xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, matrix_desc<32x256xf16, @strides=[1, 32]> +%m = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @strides=[1, 32]> +xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]> gpu.barrier -%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16> -%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: matrix_desc<256x32xf16> -> vector<256x32xf16> +%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16> +%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg: mem_desc<256x32xf16> -> vector<256x32xf16> ``` ***Optimized Blocking: Lowering to store_chunk and 1D Block Load*** @@ -451,7 +451,7 @@ This pattern demonstrates a more optimized strategy for instruction-level blocki - The inst_data field must specify a meaningful 2D shape that aligns with the capabilities of chunked store and 1D block load. -- Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related matrix_desc (e.g., producer and consumer) must have consistent block sizes. If one matrix_desc is transposed, the block shape should match the transposed shape of the other one. +- Blocking must be explicitly expressed in the memory layout via the @block attribute. Two related mem_desc (e.g., producer and consumer) must have consistent block sizes. If one mem_desc is transposed, the block shape should match the transposed shape of the other one. - Each instruction must access only within its assigned matrix block boundary — no cross-block accesses are allowed. 
@@ -463,19 +463,19 @@ During lowering, store_matrix is lowered to store_chunk if the matrix has stride
 
 %at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
 %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
-%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
-xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
+%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
+xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
 gpu.barrier
 
-%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]>
-%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg : matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_wg> -> vector<256x32xf16>
+%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]>
+%a_dpas = xegpu.load_matrix %ma[0, 0] #dpas_t_wg : mem_desc<256x32xf16, @block=[16, 16]> -> vector<256x32xf16>
 ```
 
 **Workgroup to Subgroup Distribution**
 
 This example illustrates how load_matrix and store_matrix are distributed from workgroup to subgroups. After distribution, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute.
 
-The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the matrix_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward.
+The distribution process assumes the matrix is stored in a row-major contiguous layout and performs indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward.
```mlir
 #coop_t_inst = { inst_data = [8, 16] }
@@ -487,19 +487,19 @@ The distribution process assumes matrix stored in row-major contiguous layout, a
 %at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16>
 %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
-%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
+%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
 xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] #coop_t_inst
-      : vector<8x32xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
+      : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
 gpu.barrier
 
-%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]>
+%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]>
 %a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * (32 % 32)] #dpas_t_inst
-      : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16>
+      : mem_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16>
 ```
 
 **Unrolling Guided by Inst_data**
-This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This inst_data attributes ensures that each store operation writes within its assigned block boundary, respecting the @block attributes. On the load side, the matrix_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements.
+This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. The inst_data attribute ensures that each store operation writes within its assigned block boundary, respecting the @block attribute. On the load side, the mem_desc is read as multiple 16×16 instruction tiles through separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements.
```mlir %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] @@ -508,27 +508,27 @@ This example illustrates how matrix loads and stores can be unrolled into smalle %at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> %at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> -%mt = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> xegpu.store_matrix %at0, %mt[%sg_idy * 8, %sg_idx * 32] - : vector<8x16xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> + : vector<8x16xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> xegpu.store_matrix %at1, %mt[%sg_idy * 8, %sg_idx * 32 + 16] - : vector<8x16xf16>, matrix_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> + : vector<8x16xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier -%ma = xegpu.create_matrix_desc %m : memref<16384xi8, 3> -> matrix_desc<256x32xf16, @block=[16, 16]> +%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]> %a_dpas_0 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32] - : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> %a_dpas_1 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] - : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> %a_dpas_2 = xegpu.load_matrix %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] - : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> %a_dpas_3 = xegpu.load_matrix %[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] - : matrix_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> ``` -**MaterializeSLMAccess: Lowering matrix_desc to Physical Memory Access** +**MaterializeSLMAccess: Lowering mem_desc to Physical Memory Access** -This step lowers high-level matrix_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. +This step lowers high-level mem_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. Key Concepts: - Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk. From 53690aea184ae36b86a64402f82bbc81725ef8a2 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 15 Aug 2025 07:21:38 -0700 Subject: [PATCH 18/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 63388dc1a..254b76ef5 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -360,8 +360,6 @@ To represent a matrix stored in shared local memory (SLM), users must create a m Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). 
The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result mem_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. -When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout. - ```mlir %mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> %mdesc_b = xegpu. create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> From d0cc26846461a46a622707101ad0122668c97aa6 Mon Sep 17 00:00:00 2001 From: Igor Zamyatin Date: Fri, 15 Aug 2025 10:33:22 -0500 Subject: [PATCH 19/27] Nit fixes --- docs/rfcs/XeGPU.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 254b76ef5..3116899b7 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -349,7 +349,7 @@ This separation simplifies distribution and unrolling passes and enables systema **Basic Usage** -To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix performs data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. +To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. | Ops | Syntax | Example | | :--- | :---- | :--- | @@ -362,7 +362,7 @@ Users create a `mem_desc` to represent a matrix stored in shared local memory (S ```mlir %mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> -%mdesc_b = xegpu. create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> +%mdesc_b = xegpu.create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> ``` Users can create a subview of a mem_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base mem_desc. 
For GEMM use case, matrix operations typically work on 2D mem_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. @@ -393,9 +393,9 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile #Coop_wg = {sg_layout = [8, 4] , sg_data= [32, 8], order=[1, 0] } #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } -%at = load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> %a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16> -%a_dpas = Conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> +%a_dpas = xegpu.conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> ``` In this flow: From e418745c46cbdfd62dd2a92bf83fce93eda2b6ed Mon Sep 17 00:00:00 2001 From: Garra1980 Date: Fri, 15 Aug 2025 18:22:37 +0200 Subject: [PATCH 20/27] Fix pre-commit --- docs/rfcs/XeGPU.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 3116899b7..f1f92fa2e 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -337,7 +337,7 @@ To streamline programming of shared local memory (SLM) on Intel Xe architecture, On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. -Prior to the introduction of mem_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. +Prior to the introduction of mem_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. This complexity was further compounded by hierarchical distribution, where workgroup-level tiles are subdivided across subgroups, instructions, and individual lanes — each step requiring separate address transformation logic. This made the code error-prone and difficult to optimize. @@ -349,7 +349,7 @@ This separation simplifies distribution and unrolling passes and enables systema **Basic Usage** -To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. 
Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. +To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. | Ops | Syntax | Example | | :--- | :---- | :--- | @@ -364,7 +364,7 @@ Users create a `mem_desc` to represent a matrix stored in shared local memory (S %mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> %mdesc_b = xegpu.create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> ``` -Users can create a subview of a mem_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base mem_desc. For GEMM use case, matrix operations typically work on 2D mem_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. +Users can create a subview of a mem_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base mem_desc. For GEMM use case, matrix operations typically work on 2D mem_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. ```mlir %mdesc_a = xegpu.mem_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] @@ -395,7 +395,7 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile %at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> %a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16> -%a_dpas = xegpu.conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> +%a_dpas = xegpu.conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> ``` In this flow: @@ -411,7 +411,7 @@ The code is transformed to use store_matrix and load_matrix to implement the tra It is generally preferred to detect the “transpose + convert_layout” pattern and fuse them earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. 
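As a reading aid for the `sg_layout`/`sg_data`/`order` attributes used in the fused code below, here is a minimal Python sketch of how a linear subgroup id might map to the sub-tile it owns inside the workgroup tile. The interpretation of `order` (first listed dimension varies fastest) and the exact-cover assumption (`sg_layout[i] * sg_data[i]` equals the workgroup tile shape) are my own reading for illustration, not dialect semantics.

```python
# Illustration only: one plausible mapping from a linear subgroup id to the
# (row, col) offset of the fragment that subgroup owns, given sg_layout,
# sg_data and order. Assumes the exact-cover case.

def sg_offsets(sg_id, sg_layout, sg_data, order):
    """Return (row_offset, col_offset) of the fragment owned by `sg_id`."""
    # `order` lists dimensions from fastest- to slowest-varying when the
    # linear subgroup id is folded onto the sg_layout grid (assumption).
    fast, slow = order
    coord = [0, 0]
    coord[fast] = sg_id % sg_layout[fast]
    coord[slow] = sg_id // sg_layout[fast]
    return coord[0] * sg_data[0], coord[1] * sg_data[1]

# #Coop_t_wg-style layout: 32 subgroups tiling a 32x256 matrix.
coop = dict(sg_layout=[4, 8], sg_data=[8, 32], order=[0, 1])
print(sg_offsets(0, **coop))   # (0, 0)
print(sg_offsets(1, **coop))   # (8, 0)  -> dim 0 varies fastest for order=[0, 1]
print(sg_offsets(4, **coop))   # (0, 32)
```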
-```mlir +```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0] } // target DPAS layout @@ -427,7 +427,7 @@ gpu.barrier **Layout Assignment** ***Basic Blocking: Using regular load and store instruction*** -In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic instruction blocking does not try to block memory layout. It lowers to instructions like chunked store and load_gather. +In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic instruction blocking does not try to block memory layout. It lowers to instructions like chunked store and load_gather. ```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] } @@ -553,19 +553,19 @@ Key Concepts: // Compute blocked offset vectors for SLM store %blk_y=sg_idy*8 /16: index %blk_in_y=sg_idy*8 %16: index -%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex> -%blk_x=%sg_idx_vec /16: vector<16xindex > +%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex> +%blk_x=%sg_idx_vec /16: vector<16xindex > %blk_in_x=%sg_idx_vec %16: vector<16xindex > // calculate physic addresses with pre-computed strides of the blocked matrix. -// [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16] +// [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16] %offset_vec0 = %blk_y * 256+ + %blk_x * 512 + %blk_in_y + %blk_in_x*16 xegpu.store %at0_t, %m, %offset_vec0 @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex> // Repeat for second tile %at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16> -%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex> -%blk_x2=%sg_idx_vec2 /16: vector<16xindex > +%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex> +%blk_x2=%sg_idx_vec2 /16: vector<16xindex > %blk_in_x2=%sg_idx_vec2 %16: vector<16xindex > %offset_vec1 = %blk_y * 256+ + %blk_x2 * 512 + %blk_in_y+ %blk_in_x2*16 xegpu.store %at1_t, %m, %offset_vec1: @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex> From bf87ab2fa6921368ca60264068331ec72a356534 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 25 Sep 2025 15:43:39 -0700 Subject: [PATCH 21/27] add lane level attributes --- docs/rfcs/XeGPU.md | 34 +++++++++++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index f1f92fa2e..a8eaf67ba 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -1,3 +1,4 @@ + # RFC for XeGPU Dialect ## Summary @@ -375,16 +376,43 @@ Users can create a subview of a mem_desc to represent a sliced or partitioned vi ``` Users can load a matrix from shared local memory into a vector value using the load_matrix operation. The result is a vector type in the IR, representing a tile stored in registers. ```mlir -vec_a = load_matrix mem_desc_a[0, 0]: mem_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> +vec_a = xegpu.load_matrix mem_desc_a[0, 0]: mem_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> %a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] : mem_desc<256x32xf16, @block=[16, 16]> -> vector<32x32xf16> ``` - Users can store a matrix from a vector value into shared local memory using the store_matrix operation. 
```mlir -store_matrix vec_a, mem_desc_b[0, 0] : vector<256x128xbf6>, mem_desc<256x128xbf16, @block=[8, 16]> +xegpu.store_matrix vec_a, mem_desc_b[0, 0] : vector<256x128xbf6>, mem_desc<256x128xbf16, @block=[8, 16]> xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> ``` +At the lane level, a load_matrix operation retrieves a single element from the matrix in slm, with the element address determined by the lane’s offset. +If the `vec_len` and `vec_dir` attributes are present, the operation instead retrieves a vector of length `vec_len` along the direction specified by `vec_dir`. +If the `subgroupBlockIO` attribute is present, the load is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets, +and returns the per-lane portion of the cooperatively loaded block. +When +```mlir +// Load a single element per lane +%a = xegpu.load_matrix %ma[%sg_idy * 32, 0+%lane_id] : mem_desc<256x32xf16> -> f16 +// Load a vector along the column direction +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0+%lane_id] @vec_dir=col @vec_len=16: mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> -> vector<16xf16> +// Cooperative subgroup block load +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> +``` + +At the lane level, a store_matrix operation writes a single element to the matrix in slm, with the element address determined by the lane’s offset. +If the `vec_len` and `vec_dir` attributes are present, the operation instead writes a vector of length `vec_len` along the direction specified by `vec_dir`. +If the `subgroupBlockIO` attribute is present, the store is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets, +and writes the per-lane portion of the data to the matrix cooperatively. +When +```mlir +// Store a single element per lane +xegpu.store_matrix %a, %ma[%sg_idy * 32, 0+%lane_id] : f16, mem_desc<256x32xf16> +// Store a vector along the column direction +xegpu.store_matrix %a_dpas, %ma[%sg_idy * 32, 0+%lane_id] @vec_dir=col @vec_len=16: vector<16xf16>, mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> +// Cooperative subgroup block Store +xegpu.store_matrix %a_dpas, %ma[%sg_idy * 32, 0] @subgroupBlockIO : vector<16xf16>, mem_desc<256x32xf16, @block=[16, 16]> +``` + **Cooperative Transpose Example** This example demonstrates a cooperative transpose pattern in which a matrix tile is loaded by a workgroup and collaboratively transposed across subgroups or threads. The operation is broken into two steps: a local transpose using vector.transpose and a cooperative re-layout using xegpu.convert_layout, where neighboring subgroups within a workgroup exchange data to form the desired transposed tile layout. 
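A small NumPy sketch may help confirm the intuition behind the strided store used in the cooperative transpose described just above: writing a 32x256 tile through logical strides `[1, 32]` and reading the same buffer back as a row-major 256x32 matrix yields the transpose. The `@block` attribute is left out here for clarity; the sketch is an illustration of the logical view, not a statement of dialect semantics.

```python
# Illustration only: check that a strided store followed by a row-major reload
# of the same SLM buffer implements a transpose.
import numpy as np

rows, cols = 32, 256
a = np.arange(rows * cols, dtype=np.float32).reshape(rows, cols)  # source tile
slm = np.zeros(rows * cols, dtype=np.float32)                     # flat SLM buffer

# store_matrix with logical strides [1, 32]: element (r, c) -> offset r*1 + c*32
for r in range(rows):
    for c in range(cols):
        slm[r * 1 + c * 32] = a[r, c]

# load_matrix from a row-major 256x32 view of the same buffer
at = slm.reshape(cols, rows)       # 256x32, row-major
assert np.array_equal(at, a.T)     # the reloaded view is exactly the transpose
```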
From 40ae990cde4c670a3b7ddf48590b76a6e008fcfd Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Thu, 25 Sep 2025 18:47:44 -0700 Subject: [PATCH 22/27] add subgroup to lane distribution example --- docs/rfcs/XeGPU.md | 35 ++++++++++++++++++++++++++++++++--- 1 file changed, 32 insertions(+), 3 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index a8eaf67ba..a1b105967 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -504,17 +504,19 @@ This example illustrates how load_matrix and store_matrix are distributed from w The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. ```mlir +#load_t_inst = { inst_data = [8, 32] } #coop_t_inst = { inst_data = [8, 16] } #dpas_t_inst = { inst_data = [16, 16] } // Each subgroup loads its portion of the global matrix using inst_data layout %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] - : memref<4096x4096xf16> -> tensor_desc<8x32xf16, #coop_t_inst> + : memref<4096x4096xf16> -> tensor_desc<8x32xf16, #load_t_inst> %at = xegpu.load_nd %tdesc_sg - : tensor_desc<8x32xf16, #coop_t_inst> -> vector<8x32xf16> + : tensor_desc<8x32xf16, #load_t_inst> -> vector<8x32xf16> +%at2 = xegpu.conv_layout %at #coop_t_inst %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> %mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] #coop_t_inst +xegpu.store_matrix %at2, %mt[%sg_idy * 8, %sg_idx * 32] #coop_t_inst : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier @@ -552,6 +554,33 @@ gpu.barrier : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16x16xf16> ``` +**Subgroup to Lane distribution** + +```mlir +%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] + : memref<4096x4096xf16> -> tensor_desc<8x32xf16> +%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<16xf16> +%at0 = vector.extract %at[0] : vector<16xf16> -> vector<8xf16> +%at1 = vector.extract %at[8] : vector<16xf16> -> vector<8xf16> +%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at0, %mt[%sg_idy * 8, %sg_idx * 32 + %lane_id ] @vec_len=8 @vec_dir=col + : vector<8xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at1, %mt[%sg_idy * 8, %sg_idx * 32 + 16 + %lane_id] @vec_len=8 @vec_dir=col + : vector<8xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> + +gpu.barrier +%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]> +%a_dpas_0 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32] @subgroupBlockIO + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> +%a_dpas_1 = xegpu.load_matrix %ma[%sg_idy * 32, %sg_idx * 32 % 32 + 16] @subgroupBlockIO + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> +%a_dpas_2 = 
xegpu.load_matrix %ma[%sg_idy * 32 + 16, %sg_idx * 32 % 32] @subgroupBlockIO + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> +%a_dpas_3 = xegpu.load_matrix %[%sg_idy * 32 + 16, %sg_idx * 32 % 32 + 16] @subgroupBlockIO + : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> +``` + **MaterializeSLMAccess: Lowering mem_desc to Physical Memory Access** This step lowers high-level mem_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. From 978ce9df2de006324f781d54f260c20073749bbb Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 26 Sep 2025 17:48:06 +0000 Subject: [PATCH 23/27] update the lowering to xevm --- docs/rfcs/XeGPU.md | 106 +++++++++++++++++++++++---------------------- 1 file changed, 54 insertions(+), 52 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index a1b105967..781e3ddd9 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -336,7 +336,7 @@ To streamline programming of shared local memory (SLM) on Intel Xe architecture, **Background and Motivation** -On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. +On Xe GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes. Prior to the introduction of mem_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load. @@ -344,11 +344,11 @@ This complexity was further compounded by hierarchical distribution, where workg **Design and Semantics** -The mem_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the physical address materialization phase maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the mem_desc. +The mem_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the XeVM lowering pass maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the mem_desc. 
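As an illustration of that mapping, the sketch below gives one plausible Python rendering of the logical-to-physical offset computation implied by the `@strides`/`@block` decompositions worked out later in this section (for instance, the `[16x2x16x16, strides=512x256x16x1]` decomposition of a row-major 256x32 matrix blocked by `[16, 16]`). The function name and the exact blocking rule are assumptions for illustration only, inferred from those worked examples.

```python
# Illustration only: one plausible reading of how a logical (row, col)
# coordinate maps to a flat SLM element offset for a mem_desc that carries
# @strides and @block attributes.

def physical_offset(r, c, shape, strides, block):
    """Map logical (r, c) of a `shape` matrix with logical `strides` and
    blocking `block` to a flat element offset."""
    dims = sorted((0, 1), key=lambda d: strides[d])      # fastest dim first
    fast, slow = dims
    idx = (r, c)
    inner = idx[fast] % block[fast] + (idx[slow] % block[slow]) * block[fast]
    blk_fast, blk_slow = idx[fast] // block[fast], idx[slow] // block[slow]
    blk_elems = block[0] * block[1]
    n_fast = shape[fast] // block[fast]                  # blocks along fast dim
    return inner + blk_fast * blk_elems + blk_slow * blk_elems * n_fast

# Cross-check against the decomposition quoted later in this section:
# 256x32 row-major, @block=[16, 16]  ->  [16x2x16x16, strides=512x256x16x1]
for r in (0, 17, 255):
    for c in (0, 3, 31):
        ref = (r // 16) * 512 + (c // 16) * 256 + (r % 16) * 16 + (c % 16)
        assert physical_offset(r, c, (256, 32), (32, 1), (16, 16)) == ref
```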
This separation simplifies distribution and unrolling passes and enables systematic, robust transformations during compilation. The descriptor encapsulates all necessary layout metadata to generate correct and efficient SLM access patterns — supporting both regular loads and 1D block loads — without requiring the user to write explicit address arithmetic. -**Basic Usage** +**OP definition** To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. @@ -385,6 +385,7 @@ xegpu.store_matrix vec_a, mem_desc_b[0, 0] : vector<256x128xbf6>, mem_desc<256x1 xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> ``` +**Lane level attributes** At the lane level, a load_matrix operation retrieves a single element from the matrix in slm, with the element address determined by the lane’s offset. If the `vec_len` and `vec_dir` attributes are present, the operation instead retrieves a vector of length `vec_len` along the direction specified by `vec_dir`. If the `subgroupBlockIO` attribute is present, the load is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets, @@ -433,11 +434,11 @@ In this flow: 3. The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS. -**After optimization that targets the transpose-A pattern** +**Cooperative Transpose Optimization pass targeting transpose-A pattern** The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads. -It is generally preferred to detect the “transpose + convert_layout” pattern and fuse them earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions. +It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence. ```mlir #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout @@ -458,13 +459,15 @@ gpu.barrier In this example, the xegpu.layout is extended to support instruction-level blocking. 
The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic instruction blocking does not try to block memory layout. It lowers to instructions like chunked store and load_gather. ```mlir -#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] } +#Load_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 32], order = [0, 1] } +#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 16], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [1, 32], order = [1, 0] } -%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Load_t_wg> -> vector<32x256xf16> +%at2 = xegpu.conv_layout %at #coop_t_wg %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> %m = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @strides=[1, 32]> -xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]> +xegpu.store_matrix %at2, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]> gpu.barrier @@ -484,13 +487,15 @@ This pattern demonstrates a more optimized strategy for instruction-level blocki During lowering, store_matrix is lowered to store_chunk if the matrix has strides, and load_matrix is lowered to 1D block load if the matrix has a blocked layout. ```mlir +#Load_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 32], order = [0, 1] } #Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 16], order = [0, 1] } #dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [16, 16], order = [1, 0] } -%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> +%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Load_t_wg> -> vector<32x256xf16> +%at2 = xegpu.conv_layout %at #coop_t_wg %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> %mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> +xegpu.store_matrix %at2, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier %ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]> @@ -501,7 +506,7 @@ gpu.barrier This example illustrates how load_matrix and store_matrix are distributed from workgroup to subgroups. After distribution, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute. -The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. +The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. 
Only at the XeVM lowering stage are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward. ```mlir #load_t_inst = { inst_data = [8, 32] } @@ -556,6 +561,8 @@ gpu.barrier **Subgroup to Lane distribution** +This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while `vec_len` and `vec_dir` indicate chunked loads. + ```mlir %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] : memref<4096x4096xf16> -> tensor_desc<8x32xf16> @@ -581,72 +588,67 @@ gpu.barrier : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> ``` -**MaterializeSLMAccess: Lowering mem_desc to Physical Memory Access** +**XeGPU lowering to XeVM** -This step lowers high-level mem_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. +This step lowers lane level mem_desc operations (store_matrix, load_matrix) into XeVM/LLVM operations. At this point, the XeVM code performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates. Key Concepts: -- Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk. +- **Chunked Load/Store**: Each thread loads or stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to llvm.load/llvm.store with a vector operand. -- 1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per 1D block. - -- Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides. +- **1D Block Load/Store:** In a transposed layout (e.g., 256×32), the matrix is blocked into 16×16 tiles. Elements within each block are contiguous in memory, allowing efficient loading via `XeVM.blockload`. All lanes use the same uniform block address and cooperatively load a contiguous block, with each lane retrieving multiple elements at a stride equal to the subgroup size. The uniform block address is computed by applying the layout metadata (as a function) to the logical base offset of the tile. 
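The same assumed blocking rule reproduces the uniform block-start offsets used in the pseudo code below; a short sketch (illustration only, reusing the `[16x2x16x16, strides=512x256x16x1]` decomposition of the row-major 256x32 matrix blocked by `[16, 16]`):

```python
# Illustration only: block-start offsets of the four 16x16 instruction tiles,
# relative to the subgroup's base offset sg_idy * 1024.
def block_start(r, c):
    # (r, c) is the block's top-left logical coordinate
    return (r // 16) * 512 + (c // 16) * 256

sg_idy = 3  # any subgroup row index
base = block_start(sg_idy * 32, 0)
assert base == sg_idy * 1024
assert [block_start(sg_idy * 32 + dr, dc) - base
        for dr, dc in ((0, 0), (0, 16), (16, 0), (16, 16))] == [0, 256, 512, 768]
```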
```mlir -%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] +// psudo code +//%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] : memref<4096x4096xf16> -> tensor_desc<8x32xf16> -%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16> -%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16> -%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16> - -// Shared local memory buffer -%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> +//%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<16xf16> +%at0 = vector.extract %at[0] : vector<16xf16> -> vector<8xf16> +%at1 = vector.extract %at[8] : vector<16xf16> -> vector<8xf16> +%m_i8 = llvm.alloca 16384 {alignment = 1024} : !llvm.ptr +%m = llvm.bitcast %m_i8 : !llvm.ptr to !llvm.ptr // ---------------------- Chunked Store ---------------------- -// The transpose is added as we remove the transpose attribute out from chunked load/store and expect an explict data transpose. -// it will be no op after lane distribution since each lane owns same data when [8,1] is transpose to [1, 8] -%at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16> - -// Compute blocked offset vectors for SLM store -%blk_y=sg_idy*8 /16: index -%blk_in_y=sg_idy*8 %16: index -%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex> -%blk_x=%sg_idx_vec /16: vector<16xindex > -%blk_in_x=%sg_idx_vec %16: vector<16xindex > +// Compute blocked offset for each lane +%blk_y = sg_idy*8 / 16: index +%blk_in_y = sg_idy*8 % 16: index +%blk_x = (%sg_idx*32 + %lane_id) / 16: index +%blk_in_x = (%sg_idx*32 + %lane_id) % 16: index // calculate physic addresses with pre-computed strides of the blocked matrix. // [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16] -%offset_vec0 = %blk_y * 256+ + %blk_x * 512 + %blk_in_y + %blk_in_x*16 -xegpu.store %at0_t, %m, %offset_vec0 @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex> +%offset = %blk_y * 256+ + %blk_x * 512 + %blk_in_y + %blk_in_x*16 +%addr = %m + %offset : !llvm.ptr +llvm.store %at0_t, %addr: vector<8xf16>, !llvm.ptr // Repeat for second tile -%at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16> -%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex> -%blk_x2=%sg_idx_vec2 /16: vector<16xindex > -%blk_in_x2=%sg_idx_vec2 %16: vector<16xindex > -%offset_vec1 = %blk_y * 256+ + %blk_x2 * 512 + %blk_in_y+ %blk_in_x2*16 -xegpu.store %at1_t, %m, %offset_vec1: @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex> +%blk_x1 = (%sg_idx*32 + 16 + %lane_id) / 16: index +%blk_in_x1 = (%sg_idx*32 + 16 + %lane_id) % 16: index +%offset1 = %blk_y * 256+ + %blk_x1 * 512 + %blk_in_y+ %blk_in_x1*16 +%addr1 = %m + %offset1 : !llvm.ptr +llvm.store %at1_t, %m: @chunk_size=8: vector<8xf16>, !llvm.ptr gpu.barrier // ---------------------- Load 1D Block ---------------------- // Compute per-block physical offsets // pre-computed strides of the blocked matrix: [256x32] blocked as [16x2x16x16, strides=512x256x16x1] -// sg_idx*32 coord to blocked matrix ccord: sg_idx*32%32/16 (0), sg_idx*32%32%16 (0). 
%32 due matrix shape[1] is 32 -// sg_idy*32 coord to blocked matrix coord: sg_idy*32/16, sg_idy*32%16 (0) -// then map to physical addr using stride [2x16x16x16, strides=512x256x16x1], get sg_idy*32/16 *512 -%inst_start_offset0 = mul %sg_idy, 2 * 512 +// [sg_idy*32, sg_idx*32%32=0] coord to blocked matrix ccord: [sg_idy*32/16, 0, 0, 0] +// then map to physical addr using stride [2x16x16x16, strides=512x256x16x1], +// get sg_idy*32/16*512 = sg_idy*1024 +%inst_start_offset0 = mul %sg_idy, 1024 %inst_start_offset1 = add %inst_start_offset0, 256 %inst_start_offset2 = add %inst_start_offset0, 512 %inst_start_offset3 = add %inst_start_offset0, 768 +%addr0 = %m + %inst_start_offset0 : !llvm.ptr +%addr1 = %m + %inst_start_offset1 : !llvm.ptr +%addr2 = %m + %inst_start_offset2 : !llvm.ptr +%addr3 = %m + %inst_start_offset3 : !llvm.ptr -%a_dpas_0 = xegpu.load_nd %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> vector<256xf16> -%a_dpas_1 = xegpu.load_nd %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> vector<256xf16> -%a_dpas_2 = xegpu.load_nd %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> vector<256xf16> -%a_dpas_3 = xegpu.load_nd %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> vector<256xf16> -``` +%a_dpas_0 = xevm.blockload %m, %addr0 : !llvm.ptr -> vector<16xf16> +%a_dpas_1 = xevm.blockload %m, %addr1 : !llvm.ptr -> vector<16xf16> +%a_dpas_2 = xevm.blockload %m, %addr2 : !llvm.ptr -> vector<16xf16> +%a_dpas_3 = xevm.blockload %m, %addr3 : !llvm.ptr -> vector<16xf16> -## XeGPU Attributes to support Work Item Level semantics **Attribute xegpu.sg_map** From 4867a23d7804a2192ee25ae24851215b455f14f8 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 26 Sep 2025 17:54:22 +0000 Subject: [PATCH 24/27] minor fix --- docs/rfcs/XeGPU.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 781e3ddd9..5b3cb63e4 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -648,7 +648,7 @@ gpu.barrier %a_dpas_1 = xevm.blockload %m, %addr1 : !llvm.ptr -> vector<16xf16> %a_dpas_2 = xevm.blockload %m, %addr2 : !llvm.ptr -> vector<16xf16> %a_dpas_3 = xevm.blockload %m, %addr3 : !llvm.ptr -> vector<16xf16> - +``` **Attribute xegpu.sg_map** From 450b6b882a60f09a9216891fa49b2731e93d64af Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Fri, 26 Sep 2025 21:14:38 +0000 Subject: [PATCH 25/27] add reduction example --- docs/rfcs/XeGPU.md | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 5b3cb63e4..5d298c2de 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -650,6 +650,46 @@ gpu.barrier %a_dpas_3 = xevm.blockload %m, %addr3 : !llvm.ptr -> vector<16xf16> ``` +**Reduction using SLM Example** + +Below is a code example demonstrating reduction at the workgroup level, which requires performing reduction across subgroups. + +``` +#a_wg = { sg_layout = [8, 4], sg_data = [16, 16], order = [0, 1] } +#reduce_wg = slice<#a_wg, dim=[1]> + +%a = xegpu.load %src[%offset], %mask #a_wg : ui64, vector<128x64xindex>, vector<128x64xi1> -> vector<128x64xf16> +%redce = vector.reduce %a dim=[1]: vector<128x64xf16> #reduce_wg -> vector<128xf16> +``` + +***Workgroup-Level Reduction Using SLM*** + +Reduction across a workgroup requires coordination between subgroups. The typical pattern is: + +1. Each subgroup performs a partial reduction in registers. +2. Intermediate results are written to shared local memory (SLM). +3. 
All subgroups read the intermediate values from SLM and perform the final reduction. + +This approach ensures efficient and correct reduction across the entire workgroup by leveraging both register-level and SLM-based reductions. + +Below is an example illustrating this pattern: + +```mlir + +%sg_a = xegpu.load %src[%sg_offset], %sg_mask #a_wg : ui64, vector<16x16xindex>, vector<16x16xi1> -> vector<16x16xf16> +%sg_partial_reduced = vector.reduce_add %sg_a [1] : vector<16x16xf16> -> vector<16xf16> + +// Allocate SLM buffer and create mem_desc for the full matrix +%m = memref.alloca() {alignment = 1024} : memref<1024xi8, 3> +%mt = xegpu.create_mem_desc %m : memref<1024xi8, 3> -> mem_desc<128x4xf16, @stride=[1, 128], @block=[16, 4]> +xegpu.store_matrix %sg_partial_reduced, %mt[sg_y*16, sg_x] : vector<16xf16>, mem_desc<128x4xf16, @stride=[1, 128], @block=[16, 4]> +gpu.barrier + +%sg_partial_reduce_4lanes = xegpu.load_matrix %mt[sg_y*16, 0] : mem_desc<128x4xf16, @stride=[1, 128], @block=[16, 4]> -> vector<16x4xf16> +%sg_reduced = vector.reduce_add %sg_partial_reduce_4lanes [1] : vector<16x4xf16> -> vector<16xf16> + +``` + **Attribute xegpu.sg_map** xegpu.sg_map specifies how a 2D tensor (defined by the tensor descriptor) is partitioned among work items (WIs) within a subgroup. sg_map consists of two parameters: From d622bba5829f26e28740339cf216c07969f51eb1 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Tue, 14 Oct 2025 16:23:18 -0700 Subject: [PATCH 26/27] Update XeGPU.md --- docs/rfcs/XeGPU.md | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 5d298c2de..94038978c 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -350,30 +350,20 @@ This separation simplifies distribution and unrolling passes and enables systema **OP definition** -To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). The mem_desc_subview creates a subview on top of the mem_desc, inheriting all of its layout attributes. Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. +To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. Create_mem_desc initializes a mem_desc instance with memory layout attributes such as @block and @stride. These attributes define the blocking and striding parameters, which govern physical address computation when accessing shared local memory (SLM). Load_matrix and store_matrix perform data movement between SLM and vector registers. xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix. 
| Ops | Syntax | Example | | :--- | :---- | :--- | -|create_mem_desc | operation ::= xegpu.create_mem_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_mem_desc %m: memref<65536xi8, 3> -> mem_desc<256x128xbf16> | -|mem_desc_subview | operation ::= xegpu.mem_desc_subview $mdesc[$offsets] attr-dict : type(\$mdesc) -> type(\$mdesc) | %mdesc_coop = xegpu.mem_desc_subview %mdesc[128, 0]:mem_desc<256x256xbf16, @stride=[256,1], @block=[8, 16]> -> mem_desc<128x128xbf16, @stride=[256,1], @block=[8, 16]> | +|create_mem_desc | operation ::= xegpu.create_mem_desc $mref attr-dict :type($mref), type(\$mdesc) | %mdesc_a = xegpu.create_mem_desc %m: memref<65536xi8, 3> -> mem_desc<256x128xbf16, @stride=[256,1], @block=[8, 16]> | |load_matrix | operation ::= xegpu.load_matrix $mdesc[$offsets] attr-dict : type($mdesc), type(offsets) -> type($res) | %result = xegpu.load_matrix %mdesc[0, 0] : mem_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $val, $mdesc[$offsets] attr-dict : type($val), type($mdesc), type(offsets) | %result = xegpu.store_matrix %val %mdesc[0, 0] : vector<128x256xbf16>, mem_desc<128x256xbf16, @block=[8, 16]> | -Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (1D int8 memref with empty layout) and create a structured representation of the share local memory. The result mem_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. +Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memref and adds structure information to the representation of the share local memory. The result mem_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The result mem_desc must be 2D. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. ```mlir %mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> %mdesc_b = xegpu.create_mem_desc %m : memref<16384xi8, 3>-> mem_desc<32x256xf16, @strides=[1, 32]> ``` -Users can create a subview of a mem_desc to represent a sliced or partitioned view of the original matrix. Subviews may reduce the rank of the matrix, allowing users to extract a lower-dimensional matrix from a higher-dimensional one. Subview inherits memory layout attributes from the base mem_desc. For GEMM use case, matrix operations typically work on 2D mem_desc. If the original matrix is higher-dimensional, it can be subviewed to a 2D shape before it is used with these operations. - -```mlir -%mdesc_a = xegpu.mem_desc_subview %mdescs_a[%mma_cycle_i, 0, 0] - : mem_desc<3x256x128xbf16, @block=[8, 16]> -> mem_desc<256x128xbf16, @block=[8, 16]> - -%mdesc_coop_a = xegpu.mem_desc_subview %mdesc_a[0, %wg_id_x_in_cluster * 64] - : mem_desc<256x128xbf16, @strides=[128, 1]> -> mem_desc<256x64xbf16, @strides=[128, 1]> -``` Users can load a matrix from shared local memory into a vector value using the load_matrix operation. 
The result is a vector type in the IR, representing a tile stored in registers. ```mlir vec_a = xegpu.load_matrix mem_desc_a[0, 0]: mem_desc<256x128xbf16, @block=[8, 16]> -> vector<256x128xbf6> @@ -423,7 +413,7 @@ This example demonstrates a cooperative transpose pattern in which a matrix tile #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] } %at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16> -%a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16> +%a = vector.transpose %at {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16> %a_dpas = xegpu.conv_layout %2 <{from = #Coop_wg, to = #dpas_wg}>: vector<256x32xf16> ``` In this flow: @@ -532,7 +522,7 @@ gpu.barrier **Unrolling Guided by Inst_data** -This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This inst_data attributes ensures that each store operation writes within its assigned block boundary, respecting the @block attributes. On the load side, the mem_desc is subviewed into multiple 16×16 instruction tiles, which are then used in separate load_matrix operations. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. +This example illustrates how matrix loads and stores can be unrolled into smaller instruction tiles for better alignment with hardware capabilities. This inst_data attributes ensures that each store operation writes within its assigned block boundary, respecting the @block attributes. On the load side, the operation is decompsed into multiple 16×16 instructions. This breakdown enables explicit instruction-level unrolling, allowing each instruction to operate on a fixed tile size that aligns with DPAS or tensor-core instruction requirements. ```mlir %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] From f45f67b47db9deba24d0442fdcb807241b19ce00 Mon Sep 17 00:00:00 2001 From: Jianhui Li Date: Tue, 14 Oct 2025 16:41:32 -0700 Subject: [PATCH 27/27] remove vec_len and vec_dir attribute --- docs/rfcs/XeGPU.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/rfcs/XeGPU.md b/docs/rfcs/XeGPU.md index 94038978c..697e2c317 100644 --- a/docs/rfcs/XeGPU.md +++ b/docs/rfcs/XeGPU.md @@ -358,7 +358,9 @@ To represent a matrix stored in shared local memory (SLM), users must create a m |load_matrix | operation ::= xegpu.load_matrix $mdesc[$offsets] attr-dict : type($mdesc), type(offsets) -> type($res) | %result = xegpu.load_matrix %mdesc[0, 0] : mem_desc<128x256xbf16, @block=[8, 16]> -> vector<128x256xbf16> | |store_matrix | operation ::= xegpu.store_matrix $val, $mdesc[$offsets] attr-dict : type($val), type($mdesc), type(offsets) | %result = xegpu.store_matrix %val %mdesc[0, 0] : vector<128x256xbf16>, mem_desc<128x256xbf16, @block=[8, 16]> | -Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memref and adds structure information to the representation of the share local memory. The result mem_desc has proper information including shape, element type, and memory layout attributes (@block and @strides). The result mem_desc must be 2D. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. 
The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. +Users create a `mem_desc` to represent a matrix stored in shared local memory (SLM). This operation takes a `memref` and augments it with structural information describing the layout of the shared local memory. The input memref must be contiguous, and its size and base address alignment must meet the specific microarchitectural requirements. + +The resulting `mem_desc` includes metadata such as shape, element type, and memory layout attributes (`block` and `strides`). The `mem_desc` must describe a 2D matrix. The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads. ```mlir %mdesc_a = xegpu.create_mem_desc: mem_desc<256x128xbf16> @@ -377,7 +379,7 @@ xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, mem_d **Lane level attributes** At the lane level, a load_matrix operation retrieves a single element from the matrix in slm, with the element address determined by the lane’s offset. -If the `vec_len` and `vec_dir` attributes are present, the operation instead retrieves a vector of length `vec_len` along the direction specified by `vec_dir`. +When the return type is a vector, the operation retrieves multiple contiguous elements from SLM and returns them as a single vector value. If the `subgroupBlockIO` attribute is present, the load is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets, and returns the per-lane portion of the cooperatively loaded block. When @@ -385,13 +387,13 @@ When // Load a single element per lane %a = xegpu.load_matrix %ma[%sg_idy * 32, 0+%lane_id] : mem_desc<256x32xf16> -> f16 // Load a vector along the column direction -%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0+%lane_id] @vec_dir=col @vec_len=16: mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> -> vector<16xf16> +%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0+%lane_id] : mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> -> vector<16xf16> // Cooperative subgroup block load %a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16> ``` At the lane level, a store_matrix operation writes a single element to the matrix in slm, with the element address determined by the lane’s offset. -If the `vec_len` and `vec_dir` attributes are present, the operation instead writes a vector of length `vec_len` along the direction specified by `vec_dir`. +When the input is a vector, the operation writes the vector elements to contiguous addresses within the matrix stored in SLM. If the `subgroupBlockIO` attribute is present, the store is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets, and writes the per-lane portion of the data to the matrix cooperatively. 
When @@ -399,7 +401,7 @@ When // Store a single element per lane xegpu.store_matrix %a, %ma[%sg_idy * 32, 0+%lane_id] : f16, mem_desc<256x32xf16> // Store a vector along the column direction -xegpu.store_matrix %a_dpas, %ma[%sg_idy * 32, 0+%lane_id] @vec_dir=col @vec_len=16: vector<16xf16>, mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> +xegpu.store_matrix %a_dpas, %ma[%sg_idy * 32, 0+%lane_id] : vector<16xf16>, mem_desc<256x32xf16, @stride=[1, 16], @block=[16, 16]> // Cooperative subgroup block Store xegpu.store_matrix %a_dpas, %ma[%sg_idy * 32, 0] @subgroupBlockIO : vector<16xf16>, mem_desc<256x32xf16, @block=[16, 16]> ``` @@ -551,7 +553,7 @@ gpu.barrier **Subgroup to Lane distribution** -This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while `vec_len` and `vec_dir` indicate chunked loads. +This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while store_matrix with vector input represents chunked loads. ```mlir %tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32] @@ -561,9 +563,9 @@ This example illustrates how `load_matrix` and `store_matrix` operations are dis %at1 = vector.extract %at[8] : vector<16xf16> -> vector<8xf16> %m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3> %mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %at0, %mt[%sg_idy * 8, %sg_idx * 32 + %lane_id ] @vec_len=8 @vec_dir=col +xegpu.store_matrix %at0, %mt[%sg_idy * 8, %sg_idx * 32 + %lane_id ] : vector<8xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> -xegpu.store_matrix %at1, %mt[%sg_idy * 8, %sg_idx * 32 + 16 + %lane_id] @vec_len=8 @vec_dir=col +xegpu.store_matrix %at1, %mt[%sg_idy * 8, %sg_idx * 32 + 16 + %lane_id] : vector<8xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> gpu.barrier