Postmortem of fs.realpath() changes leading to userland breakage #9
Description
@nodejs/collaborators
I'm opening this issue so we can look critically at our processes and figure out how we might improve things to avoid things like this in the future.
I'm not totally across the issue, but here's the high-level view as I see it:
- fs.realpath 70x slower than native node-v0.x-archive#7902 -> fs.realpath 70x slower than native node#2680: fs.realpath() is identified as being significantly slower than realpath(3)
- fs: optimize realpath using uv_fs_realpath() node#3594 introduces a change that cuts the JS implementation and replaces it with a new libuv implementation that directly uses realpath(3); this is deemed to be a breaking change, but only because it replaces the cache argument with an options argument (both are Object)
- fs: optimize realpath using uv_fs_realpath() node#3594 (comment) change is finally landed on the 15th of April
- fs: optimize realpath using uv_fs_realpath() node#3594 (comment) citgm picks up glob failure on the 22nd of April
- Discussion and problem dissection takes place, an attempt to fix glob is made @ fix: catch ELOOP isaacs/node-glob#259 but is ultimately deemed by @isaacs to lead to too big a breaking change and is not accepted, although that happened well after v6 went out
- v6.0.0 is released containing the change on the 27th of April
- Granular flake support citgm#126 citgm is updated and glob is added as flaky, although the PR doesn't mention glob explicitly
- Minor concerns are raised post-v6 in the original PR but no further movement is made and discussion ends
- We get a string of Windows errors that @addaleax is identifying as being rooted in this issue, see Node 6 fs.realpath behavior changes node#7175 (comment)
- Node 6 fs.realpath behavior changes node#7175 @isaacs opens an issue with a detailed rundown of the problems it's causing on the 6th of June, discussion ensues, decisive action is yet to be taken (may change once we discuss at CTC meeting today)
We didn't intend to break anything more than the cache option (and even then, the breakage was not "your code won't run" breakage; it just won't run quite how you expect, unless you're doing something funky with the cache). We discovered that it broke glob's tests. The breakage of glob was marked as ignorable (flaky) and we proceeded with v6. Even though we've had a full postmortem of the problems by now and we've also been able to identify other breakages coming from this, we've still not acted.
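For anyone not familiar with the cache change, here is a rough sketch of what "won't run quite how you expect" could look like in practice. It assumes Node 6 or later, where the second argument to fs.realpathSync() is treated as options; the scratch file, the makeScratchFile() helper and the stale cache entry are made up for illustration and are not taken from glob or any real caller.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper for this sketch: creates a scratch file and returns
// its path, built on an already-canonical temp directory.
function makeScratchFile() {
  const dir = fs.realpathSync(
    fs.mkdtempSync(path.join(os.tmpdir(), 'realpath-demo-'))
  );
  const file = path.join(dir, 'target.txt');
  fs.writeFileSync(file, 'hello');
  return file;
}

const file = makeScratchFile();

// Pre-v6 calling style: the argument after the path was a cache object
// mapping paths to already-known real paths. This entry is deliberately wrong.
const legacyCache = {
  [file]: path.join(path.dirname(file), 'stale-entry.txt'),
};

// Assumption: on Node 6+, the same position is read as options, so the
// string-keyed cache entry is not consulted and no error is thrown.
const withLegacyCache = fs.realpathSync(file, legacyCache);

// The v6+ calling style passes options such as an encoding instead.
const withOptions = fs.realpathSync(file, { encoding: 'utf8' });

console.log(withLegacyCache === file);
console.log(withOptions === file);
```

The point of the sketch is that the old call shape still runs, but a caller relying on the cache to force a particular resolution silently gets the filesystem's answer instead.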
Here are some questions to get us going. Let's try to leave discussion about the specific actions on this one to nodejs/node#7175, focus on process here, and see if we can figure out:
1. Clarify the steps that occurred in the chain of actions: is there anything to change about what I've listed above?
2. Do we collectively think anything is broken in our processes?
3. If we have process breakage, what can we do to improve?
My personal judgement on (2) is that we have handled this poorly and that something is broken and needs to be identified and fixed. This has delivered a poor experience for Node.js users and we should consider this a black eye and something to avoid in the future, i.e. a mistake to learn from. I see the breakdown purely as process rather than about any action by individuals. I'm also concerned by the lack of decisiveness on taking action here; we've had v6 out for a couple of months now and we still haven't done anything on this.
The stdio issues bear some similarity, I think. In certain areas we're having difficulty acting decisively to address real user experience issues. One theme that I'm seeing, although not overt and certainly not by everyone, is a preference for correctness, performance and purity over other concerns. I don't know if this is a real problem because it comes out of the diversity of our collaborator group, but it's possible that having this in the discussion mix is partly responsible for our difficulty in making decisive headway in dealing with problems like this. Maybe we need to more clearly build priorities into the culture we have around core.
When critiquing our actions we should put things in perspective, because overall I think we've done an amazing job since v4 to lift standards and earned an impressive amount of trust and respect from our users. Let's just strive to always do better.