
Refactor handling of historical data #1266


Merged (3 commits, Apr 4, 2022)
7 changes: 3 additions & 4 deletions docs/glossary.md
@@ -35,11 +35,10 @@ The following is a glossary of domain specific terminology. Although benchmarks
## Analysis

* **test result delta**: the difference between two test results for the same metric and test case.
* **significant test result delta**: a test result delta that is above some threshold that we have determined to be an actual change in performance and not noise.
* **noisy test case**: a test case for which the non-significant test result deltas are on average above some threshold. Non-noisy data tends to be very close to zero, and so a test case that produces non-significant test result deltas that are not close enough to zero on average are classified as noisy.
* **significance threshold**: the threshold at which a test result delta is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
* **significant test result delta**: a test result delta above the significance threshold. Significant test result deltas can be thought of as "statistically significant".
* **dodgy test case**: a test case whose significance threshold is unusually large, indicating a high amount of variability in the test; any result that falls too close to the significance threshold should therefore be viewed with some skepticism.
* **relevant test result delta**: a synonym for *significant test result delta* in situations where the term "significant" might be ambiguous and readers may potentially interpret *significant* as "large" or "statistically significant". For example, in try run results, we use the term relevant instead of significant.
* **highly variable test case**: a test case that produces significant test result deltas more frequently than some threshold. This differs from sensitive benchmarks in that the deltas aren't necessarily large; they just happen more often than one would expect, and it is unclear why. Note: highly variable test cases can be classified as noisy. They differ from classically noisy benchmarks in that they often produce deltas that are above the significance threshold. We therefore cannot be 100% sure whether they are variable because they are noisy or because the parts of the compiler exercised by these benchmarks simply change very often.
* **highly sensitive benchmark**: a test case that tends to produce large (over some threshold) significant test result deltas. This differs from highly variable test cases in that it is usually clear what type of changes tend to result in significant test result deltas. For example, some of the stress tests are highly sensitive; they produce significant test result deltas when parts of the compiler that they are stressing change and the percentage change is typically quite large.
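
The upper-IQR-fence calculation behind the significance threshold can be sketched in Rust roughly as follows. This is an illustrative sketch, not the actual code in `site/src/comparison.rs`: the helper names `quartiles` and `upper_fence` are made up, and the linear-interpolation quartile method is an assumption. The multiplier of 3.0 does match the `IQR_MULTIPLIER` constant introduced in this PR (the classic fence uses 1.5).

```rust
/// Return (Q1, Q3) of a sorted slice using linear interpolation.
/// (Illustrative helper; the PR's exact quartile method may differ.)
fn quartiles(sorted: &[f64]) -> (f64, f64) {
    let quartile = |p: f64| {
        let idx = p * (sorted.len() - 1) as f64;
        let lo = idx.floor() as usize;
        let hi = idx.ceil() as usize;
        sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - idx.floor())
    };
    (quartile(0.25), quartile(0.75))
}

/// Upper fence: Q3 + multiplier * IQR, computed over historical deltas.
/// Deltas above this fence are treated as significant rather than noise.
fn upper_fence(deltas: &mut Vec<f64>) -> f64 {
    deltas.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let (q1, q3) = quartiles(deltas);
    q3 + 3.0 * (q3 - q1)
}

fn main() {
    // Hypothetical historical percent changes for one test case.
    let mut deltas = vec![0.001, 0.002, 0.002, 0.003, 0.004, 0.010];
    println!("significance threshold: {:.4}", upper_fence(&mut deltas));
}
```

Because the fence is derived from each test case's own history, a stable benchmark gets a tight threshold while a volatile one gets a loose threshold, which is what makes a very large threshold a signal of a dodgy test case.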

## Other

106 changes: 23 additions & 83 deletions site/src/comparison.rs
@@ -606,7 +606,8 @@ async fn compare_given_commits(
let statistics_for_a = statistics_from_series(&mut responses);
let statistics_for_b = statistics_from_series(&mut responses);

let variances = BenchmarkVariances::calculate(ctxt, a.clone(), master_commits, stat).await?;
let mut historical_data =
HistoricalDataMap::calculate(ctxt, a.clone(), master_commits, stat).await?;
let statistics = statistics_for_a
.into_iter()
.filter_map(|(test_case, a)| {
@@ -616,7 +617,7 @@
benchmark: test_case.0,
profile: test_case.1,
scenario: test_case.2,
variance: variances.data.get(&test_case).cloned(),
historical_data: historical_data.data.remove(&test_case),
results: (a, b),
})
})
@@ -815,14 +816,13 @@ impl Comparison {
}
}

/// A description of the amount of variance a certain benchmark is historically
/// experiencing at a given point in time.
pub struct BenchmarkVariances {
/// Variance data on a per test case basis
pub data: HashMap<(Benchmark, Profile, Scenario), BenchmarkVariance>,
/// The historical data for a certain benchmark
pub struct HistoricalDataMap {
/// Historical data on a per test case basis
pub data: HashMap<(Benchmark, Profile, Scenario), HistoricalData>,
}

impl BenchmarkVariances {
impl HistoricalDataMap {
const NUM_PREVIOUS_COMMITS: usize = 100;
const MIN_PREVIOUS_COMMITS: usize = 50;

@@ -832,18 +832,18 @@ impl BenchmarkVariances {
master_commits: &[collector::MasterCommit],
stat: String,
) -> Result<Self, BoxedError> {
let mut variance_data = HashMap::new();
let mut historical_data = HashMap::new();

let previous_commits = Arc::new(previous_commits(
from,
Self::NUM_PREVIOUS_COMMITS,
master_commits,
));

// Return early if we don't have enough data to calculate variance.
// Return early if we don't have enough data for historical analysis
if previous_commits.len() < Self::MIN_PREVIOUS_COMMITS {
return Ok(Self {
data: variance_data,
data: historical_data,
});
}

@@ -860,37 +860,25 @@

for _ in previous_commits.iter() {
for (test_case, stat) in statistics_from_series(&mut previous_commit_series) {
variance_data.entry(test_case).or_default().push(stat);
historical_data.entry(test_case).or_default().push(stat);
}
}

// Only retain test cases for which we have enough data to calculate variance.
variance_data.retain(|_, v| v.data.len() >= Self::MIN_PREVIOUS_COMMITS);

for ((bench, _, _), results) in variance_data.iter_mut() {
log::trace!("Calculating variance for: {}", bench);
results.calculate_description();
}
historical_data.retain(|_, v| v.data.len() >= Self::MIN_PREVIOUS_COMMITS);

Ok(Self {
data: variance_data,
data: historical_data,
})
}
}

#[derive(Debug, Default, Clone, Serialize)]
pub struct BenchmarkVariance {
pub struct HistoricalData {
data: Vec<f64>,
description: BenchmarkVarianceDescription,
}

impl BenchmarkVariance {
/// The ratio of change that we consider significant.
const SIGNFICANT_DELTA_THRESHOLD: f64 = 0.01;
/// The percentage of significant changes that we consider too high
const SIGNFICANT_CHANGE_THRESHOLD: f64 = 5.0;
/// The ratio of change that constitutes noisy data
const NOISE_THRESHOLD: f64 = 0.001;
impl HistoricalData {
/// The multiple of the IQR above Q3 that signifies significance
const IQR_MULTIPLIER: f64 = 3.0;

@@ -954,36 +942,6 @@ impl BenchmarkVariance {
(q1, q3)
}

fn calculate_description(&mut self) {
self.description = BenchmarkVarianceDescription::Normal;

let percent_changes = self.percent_changes();
let non_significant = percent_changes
.iter()
.take_while(|&&c| c < Self::SIGNFICANT_DELTA_THRESHOLD)
.collect::<Vec<_>>();

let percent_significant_changes = ((percent_changes.len() - non_significant.len()) as f64
/ percent_changes.len() as f64)
* 100.0;
log::trace!(
"Percent significant changes: {:.1}%",
percent_significant_changes
);

if percent_significant_changes > Self::SIGNFICANT_CHANGE_THRESHOLD {
self.description = BenchmarkVarianceDescription::HighlyVariable;
return;
}

let delta_mean =
non_significant.iter().cloned().sum::<f64>() / (non_significant.len() as f64);
log::trace!("Ratio change: {:.4}", delta_mean);
if delta_mean > Self::NOISE_THRESHOLD {
self.description = BenchmarkVarianceDescription::Noisy;
}
}

// Absolute deltas between adjacent results
fn deltas(&self) -> impl Iterator<Item = f64> + '_ {
self.data
@@ -999,24 +957,6 @@
}
}

#[derive(Debug, Clone, Copy, Serialize)]
#[serde(tag = "type", content = "percent")]
pub enum BenchmarkVarianceDescription {
Normal,
/// A highly variable benchmark that produces many significant changes.
/// This might indicate a benchmark which is very sensitive to compiler changes.
HighlyVariable,
/// A noisy benchmark which is likely to see changes in performance simply between
/// compiler runs.
Noisy,
}

impl Default for BenchmarkVarianceDescription {
fn default() -> Self {
Self::Normal
}
}

/// Gets the previous commit
pub fn prev_commit<'a>(
artifact: &ArtifactId,
@@ -1048,14 +988,14 @@ pub struct TestResultComparison {
benchmark: Benchmark,
profile: Profile,
scenario: Scenario,
variance: Option<BenchmarkVariance>,
historical_data: Option<HistoricalData>,
results: (f64, f64),
}

impl TestResultComparison {
/// The amount of relative change considered significant when
/// we cannot determine from historical data
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002;
const DEFAULT_SIGNIFICANCE_THRESHOLD: f64 = 0.002;

pub fn benchmark(&self) -> Benchmark {
self.benchmark
@@ -1077,10 +1017,10 @@

/// Magnitude of change considered significant
fn significance_threshold(&self) -> f64 {
self.variance
self.historical_data
.as_ref()
.map(|v| v.significance_threshold())
.unwrap_or(Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD)
.map(|d| d.significance_threshold())
.unwrap_or(Self::DEFAULT_SIGNIFICANCE_THRESHOLD)
}

/// This is a numeric magnitude of a particular change.
@@ -1139,7 +1079,7 @@
}

fn is_dodgy(&self) -> bool {
self.variance
self.historical_data
.as_ref()
.map(|v| v.is_dodgy())
.unwrap_or(false)
@@ -1473,7 +1413,7 @@ mod tests {
benchmark: index.to_string().as_str().into(),
profile: Profile::Check,
scenario: Scenario::Empty,
variance: None,
historical_data: None,
results: (before, after),
});
}