
Commit e13db55

aegl authored and bp3tk0v committed
x86/resctrl: Introduce snc_nodes_per_l3_cache

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores and memory controllers on a socket into two or more groups. These are presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory, as the memory controller(s) in an SNC node are electrically closer to the CPU cores on that SNC node. This benefit may be offset by lower bandwidth, since the memory accesses for each core can only be interleaved between the memory controllers on the same SNC node.

Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks to track L3 cache occupancy and memory bandwidth. There is an MSR that controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC nodes on a socket, the lower-numbered half of the RMIDs is given to the first node and the remainder to the second node. This would be difficult to use with the Linux resctrl interface, as the specific RMID values assigned to resctrl groups are not visible to users.

RMID sharing mode divides the physical RMIDs evenly between SNC nodes but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example, a system with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two SNC nodes per L3 cache instance would have 100 logical RMIDs available for Linux to use. A task running on SNC node 0 with RMID 5 would accumulate LLC occupancy and MBM bandwidth data in physical RMID 5. Another task using RMID 5, but running on SNC node 1, would accumulate data in physical RMID 105.

Even with this renumbering, SNC mode requires several changes in resctrl behavior for correct operation.

Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate how many SNC domains share an L3 cache instance. Initialize this to "1". Runtime detection of SNC mode will adjust this value.

Update all places to take appropriate action when SNC mode is enabled:

1) The number of logical RMIDs per L3 cache available for use is the number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC nodes.
3) Add a function to convert from logical RMID values (assigned to tasks and loaded into the IA32_PQR_ASSOC MSR on context switch) to physical RMID values to load into the IA32_QM_EVTSEL MSR when reading counters on each SNC node.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
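
The logical-to-physical renumbering described in the commit message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: the function name lrmid_to_prmid() and its parameters are hypothetical, and the SNC node index is passed in directly, whereas the kernel derives it from cpu_to_node() and takes the per-node RMID count from r->num_rmid.

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the logical -> physical RMID mapping.
 *
 * lrmid:         logical RMID (the value loaded into IA32_PQR_ASSOC)
 * node_id:       NUMA node number of the CPU being monitored
 * snc_nodes:     SNC nodes sharing one L3 cache (1 means SNC is off)
 * logical_rmids: physical RMID count divided by snc_nodes
 */
static unsigned int lrmid_to_prmid(unsigned int lrmid, unsigned int node_id,
				   unsigned int snc_nodes,
				   unsigned int logical_rmids)
{
	assert(snc_nodes >= 1);
	assert(lrmid < logical_rmids);

	/* Without SNC the physical RMID equals the logical RMID. */
	if (snc_nodes == 1)
		return lrmid;

	/* Each SNC node owns a contiguous block of logical_rmids physical RMIDs. */
	return lrmid + (node_id % snc_nodes) * logical_rmids;
}
```

With the numbers from the commit message (200 physical RMIDs, two SNC nodes, so 100 logical RMIDs), lrmid_to_prmid(5, 0, 2, 100) yields 5 and lrmid_to_prmid(5, 1, 2, 100) yields 105.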
1 parent 1a17160 commit e13db55

1 file changed: +50 -6 lines changed

arch/x86/kernel/cpu/resctrl/monitor.c

Lines changed: 50 additions & 6 deletions
@@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
 
 #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
 
+static int snc_nodes_per_l3_cache = 1;
+
 /*
  * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
  * If rmid > rmid threshold, MBM total and local values should be multiplied
@@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
 	return entry;
 }
 
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+/*
+ * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
+ * "snc_nodes_per_l3_cache == 1") no translation of the RMID value is
+ * needed. The physical RMID is the same as the logical RMID.
+ *
+ * On a platform with SNC mode enabled, Linux enables RMID sharing mode
+ * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
+ * Resource Director Technology Architecture Specification" for a full
+ * description of RMID sharing mode).
+ *
+ * In RMID sharing mode there are fewer "logical RMID" values available
+ * to accumulate data ("physical RMIDs" are divided evenly between SNC
+ * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * each SNC node.
+ *
+ * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
+ *
+ * Data is collected independently on each SNC node and can be retrieved
+ * using the "physical RMID" value computed by this function and loaded
+ * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
+ *
+ * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
+ * cache. So a "physical RMID" may be read from any CPU that shares
+ * the L3 cache with the desired SNC node, not just from a CPU in
+ * the specific SNC node.
+ */
+static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+	if (snc_nodes_per_l3_cache == 1)
+		return lrmid;
+
+	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+}
+
+static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
 {
 	u64 msr_val;
 
@@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id eventid)
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
+	u32 prmid;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
 	if (am) {
 		memset(am, 0, sizeof(*am));
 
+		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 		/* Record any initial, non-zero count value. */
-		__rmid_read(rmid, eventid, &am->prev_msr);
+		__rmid_read_phys(prmid, eventid, &am->prev_msr);
 	}
 }
 
@@ -275,16 +316,19 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
+	u32 prmid;
 	int ret;
 
 	resctrl_arch_rmid_read_context_check();
 
 	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
-	ret = __rmid_read(rmid, eventid, &msr_val);
+	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
+	ret = __rmid_read_phys(prmid, eventid, &msr_val);
 	if (ret)
 		return ret;
 
@@ -1022,8 +1066,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
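
The two divisions in the rdt_get_mon_l3_config() hunk can be illustrated with a small standalone C sketch. The struct and function names below are hypothetical and exist only for this example; the inputs stand in for boot_cpu_data.x86_cache_max_rmid, boot_cpu_data.x86_cache_occ_scale and snc_nodes_per_l3_cache, and the scale value used in the usage note is made up for illustration.

```c
#include <assert.h>

/* Hypothetical container for the two values adjusted by the patch. */
struct mon_config {
	unsigned int num_rmid;	/* logical RMIDs available per L3 cache */
	unsigned int mon_scale;	/* occupancy scale factor after SNC adjustment */
};

/*
 * Mirror the arithmetic of the patched lines: both the RMID count and
 * the scale factor are divided by the number of SNC nodes per L3 cache.
 * Integer division is used, so in this sketch any remainder is discarded.
 */
static struct mon_config snc_mon_config(unsigned int max_rmid,
					unsigned int occ_scale,
					unsigned int snc_nodes)
{
	struct mon_config cfg;

	assert(snc_nodes >= 1);

	cfg.num_rmid = (max_rmid + 1) / snc_nodes;
	cfg.mon_scale = occ_scale / snc_nodes;
	return cfg;
}
```

For the example in the commit message, a maximum RMID of 199 with two SNC nodes gives num_rmid = 100; with an assumed scale of 65536 the adjusted mon_scale would be 32768. With snc_nodes set to 1, both values pass through unchanged.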
