Postmortem of fs.realpath() changes leading to userland breakage

@nodejs/collaborators 

I'm opening this issue so we can look critically at our processes and figure out how we might improve them to avoid breakage like this in the future.

I don't have the full picture of the issue, but here's the high-level view as I see it:
1. https://github.com/nodejs/node-v0.x-archive/issues/7902 -> https://github.com/nodejs/node/issues/2680 `fs.realpath()` is identified as being significantly slower than `realpath(3)`
2. https://github.com/nodejs/node/pull/3594 removes the JS implementation and replaces it with a new libuv implementation that calls `realpath(3)` directly. **This is deemed to be a breaking change, but _only_ because it replaces the `cache` argument with an `options` argument** (both are `Object`)
3. https://github.com/nodejs/node/pull/3594#issuecomment-210342608 change is finally landed on the 15th of April
4. https://github.com/nodejs/node/pull/3594#issuecomment-213060070 citgm picks up glob failure on the 22nd of April
5. Discussion and problem dissection take place. An attempt to fix glob is made at https://github.com/isaacs/node-glob/pull/259, but @isaacs ultimately deems it too big a breaking change and does not accept it, although that happened _well_ after v6 went out
6. v6.0.0 is released containing the change on the 27th of April
7. https://github.com/nodejs/citgm/pull/126 citgm is updated and glob is added as flaky, although the PR doesn't mention glob explicitly
8. Minor concerns are raised post-v6 in the original PR but no further movement is made and discussion ends
9. We get a string of Windows errors that @addaleax has identified as being rooted in this issue, see https://github.com/nodejs/node/issues/7175#issuecomment-227210966
10. https://github.com/nodejs/node/issues/7175 @isaacs opens an issue with a detailed rundown of the problems it's causing on the 6th of June. Discussion ensues, but decisive action is yet to be taken (this may change once we discuss it at the CTC meeting today)

We didn't intend to break anything more than the `cache` option (and even then, the breakage was not _your code won't run_ breakage, it just won't run quite how you expect unless you're doing something funky with the cache). We discovered that it broke glob's tests. The breakage of glob was marked as ignorable (flaky) and we proceeded with v6. Even though we've had a full postmortem of the problems by now and we've also been able to identify other breakages coming from this, we've still not acted.
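To make the `cache` point concrete, here is a minimal sketch of the kind of silent behaviour change described above. It assumes a Node.js version from v6 onward and a platform where creating symlinks is permitted. The directory names and the `legacyCache` object are hypothetical, chosen only for illustration. The second argument is shaped like the old `cache` map, but it is now read as `options`, so the seeded entry is not consulted and nothing throws.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolve the temp dir first so the comparison below is not skewed by a
// symlinked tmpdir (an assumption about the host, e.g. /tmp on some systems).
const base = fs.realpathSync(fs.mkdtempSync(path.join(os.tmpdir(), 'realpath-demo-')));
const target = path.join(base, 'target');
const link = path.join(base, 'link');
fs.mkdirSync(target);
fs.symlinkSync(target, link, 'dir');

// Hypothetical caller that seeds a path cache, as the pre-v6 signature
// fs.realpathSync(path, cache) allowed.
const legacyCache = { [link]: path.join(base, 'bogus') };

// With the options-based signature the object is treated as `options`;
// it has no `encoding` key, so it has no effect on the result.
const resolved = fs.realpathSync(link, legacyCache);

console.log('resolved:', resolved);
console.log('matches real target:', resolved === target);

// Clean up the temporary directory (fs.rmSync needs a reasonably recent Node.js).
fs.rmSync(base, { recursive: true, force: true });
```

The call succeeds either way, which is why this part of the change was not expected to stop code from running; only callers relying on the cache contents would see different results.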

Here are some questions to get us going. Let's try to leave discussion about the specific actions on this one to https://github.com/nodejs/node/issues/7175 and focus on process here:
1. Clarify the steps that occurred in the chain of actions: is there anything to change about what I've listed above?
2. Do we collectively think anything is broken in our processes?
3. If we have process breakage, what can we do to improve?

My personal judgement on (2) is that we have handled this poorly and that something is broken and needs to be identified and fixed. This has delivered a poor experience for Node.js users and we should consider this a black eye and something to avoid in the future, i.e. a mistake to learn from. I see the breakdown purely as process rather than about any action by individuals. I'm also concerned by the lack of decisiveness in taking action here: we've had v6 out for a couple of months now and we still haven't done _anything_ on this.

The [stdio issues](https://github.com/nodejs/node/issues/6980) bear some similarity, I think. In certain areas we're having difficulty acting decisively to address real user experience issues. One theme that I'm seeing, although not overt and certainly not from everyone, is a preference for correctness, performance and purity over other concerns. I don't know if this is a real problem because it comes out of the diversity of our collaborator group, but it's possible that having this in the discussion mix is partly responsible for our difficulty in making decisive headway in dealing with problems like this. Maybe we need to more clearly build priorities into the culture we have around core.

When critiquing our actions we should put things in perspective, because overall I think we've done an amazing job since v4 to lift standards and earned an impressive amount of trust and respect from our users. Let's just strive to always do better.


Postmortem of fs.realpath() changes leading to userland breakage #9
