[RFC] Proposed development plan for Zephyr's POSIX subsystem #17706

@pfalcon

Summary

This RFC seeks to transform and extend Zephyr's POSIX subsystem, which was initially conceived to implement just a small embedded profile of POSIX, into a subsystem with wider coverage of the full POSIX standard. In doing so, it doesn't seek to establish a specific (sub)set of the POSIX standard to implement. Instead, it seeks to establish a process and criteria that allow incremental and gradual development and addition of new features, based on the needs of Zephyr stakeholders and the community.

Mission statement

There are 2 ways to develop software for a particular system:

  1. Write it from scratch specifically for a given system.
  2. Port existing software (developed for other systems and/or standard APIs).

Zephyr is a small, efficient RTOS, and thus p.1 was the initial scope. But the importance of p.2 should not be underestimated. The author of this RFC and the growing community of Zephyr users think that the inability to port existing software, or the extra hurdles in doing so, is becoming a growing blocker on the way to wider Zephyr adoption and usage.

This RFC seeks to remedy the situation and enable large-scale application porting to Zephyr by paying close attention to the implementation of the standard OS API (the POSIX standard). At the same time, it seeks to do so in a sustainable, manageable and lean way, following the principles of agile software development and an open-source, community-driven process.

Motivation

Zephyr includes many subsystems, which are largely disjoint. One such subsystem is the BSD Sockets(-like) subsystem, initially written by the author of this RFC. It was initially developed as a proof-of-concept, alternative networking API to Zephyr's own (ad hoc) networking API. There were 3 main reasons why adding a BSD Sockets(-like) API to Zephyr would be useful:

  1. To reuse programmers' existing knowledge and experience when developing Zephyr applications.
  2. To base Zephyr API on well-known, tested and tried design patterns.
  3. To allow porting existing applications to Zephyr.

Of these, p.1 was the initial motivation, while p.2 helped the BSD Sockets(-like) API achieve the status of the official networking API, once it became clear that it provides a good answer to kernel-vs-userspace separation challenges and resource protection needs.

However, leveraging p.3 took some time to gather momentum, with real work starting at the beginning of this year (2019). Even the first porting experiment (see the retrospective section below) exposed big issues with the BSD Sockets(-like) subsystem. As a reminder, this section starts with "Zephyr includes many subsystems, which are largely disjoint", and throughout the text the sockets subsystem is called "BSD Sockets(-like)". The existing sockets subsystem largely follows the spirit of the BSD Sockets API, but lacks a lot of functionality and has a lot of small-ish differences when compared to the letter of the POSIX BSD Sockets API. It also doesn't integrate well with other Zephyr subsystems, like the existing (also very incomplete) POSIX subsystem and the C library.

All that led to the following observed issues:

  • A lot of functionality has to be added to the BSD Sockets API (even if it is largely shallow, i.e. simple enough features).
  • There are still trivial/repetitive changes to be made to 3rd-party code, e.g. due to differences in header locations between "real POSIX" and "Zephyr's BSD Sockets(-like) subsystem".
  • Finally, existing applications never use just the BSD Sockets part of POSIX, but rely on various other APIs (and it should be remembered that the C standard library is also part of POSIX). Due to the "disintegrated" nature of Zephyr subsystems, trying to use various parts in the same application led to heavy conflicts and ugly workarounds (that is, if a programmer was patient enough to reach for them and not give up, writing off the Zephyr subsystem(s) involved as "completely broken").
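
To make the second bullet concrete, here is a minimal sketch of the kind of repetitive conditional include a ported application may end up carrying. It assumes that the Zephyr build defines `__ZEPHYR__` and that, on Zephyr, the sockets declarations (with POSIX-style names enabled) come from `<net/socket.h>`. Both are assumptions made for illustration, and `make_ipv4_addr` is a hypothetical helper, not part of any Zephyr or POSIX API.

```c
#include <stdint.h>
#include <string.h>

#ifdef __ZEPHYR__
/* Assumption: Zephyr-specific header location for the sockets API. */
#include <net/socket.h>
#else
/* Header locations mandated by POSIX. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#endif

/*
 * Hypothetical helper: fill an IPv4 socket address from a dotted-quad
 * string and a port in host byte order.
 * Returns 0 on success, -1 if the address string cannot be parsed.
 */
static int make_ipv4_addr(const char *ip, uint16_t port,
                          struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);

    /* inet_pton() returns 1 only when the string is a valid address. */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

If the POSIX headers were reachable at their standard locations on Zephyr as well, the `#ifdef` branch would not be needed and the 3rd-party source could stay unmodified.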

Over time, while fighting with the issues described above, the solution became apparent: different Zephyr subsystems should be integrated together under the auspices of the POSIX standard, following it closely to the letter, not just in spirit.

Implementation process

Developing a complete implementation of POSIX IEEE 1003.1 is no simple task, due to the breadth and depth of the standard. It would take dozens of man-years to finish that task. There are no such (formally allocatable) resources in the Zephyr community, so this RFC doesn't lay out a specific plan to implement "full POSIX". Instead, the proposal is to focus on the practical side of things. As the previous sections said, the main motivation is to be able to port/reuse existing application software on Zephyr, and that's why we're interested in POSIX, not the other way around. Thus, development of the POSIX subsystem should be primarily driven by active porting efforts:

  1. The Zephyr community selects POSIX-compatible project(s) to port to Zephyr in a particular streak (based on the community's needs/interests).
  2. Features missing from the POSIX subsystem are developed, and functionality that doesn't work properly or faithfully enough gets fixed/extended, all in an incremental manner.
  3. Work done in p.2 is submitted/merged upstream in an agile/streamlined manner, to parallelize the work, let other users benefit even from interim results, and motivate them to join the effort, providing a positive feedback cycle.
  4. The process repeats from p.1.

This is essentially a lean/agile development methodology, where development is driven by short-term needs. As long as the development goes in the right direction (more POSIX functionality gets implemented, even if not completely; CI passes; there are no obvious mistakes and no noticeable/avoidable technical debt is added), the change gets merged and the process immediately repeats with the next development iteration.

Of course, besides the community-driven new-feature process, there's also a maintenance process working in the usual way:

  1. As time permits, maintainers select a known piece of technical debt or a known issue to improve.
  2. Improvement is made.
  3. Improvement is merged.
  4. Repeats from p.1.

This process might run at a more background priority and lower intensity, but otherwise follows the same agile workflow as the feature process, with the same acceptance criteria (as long as a change improves the situation and doesn't deteriorate it, it's good to go).

Relationship to the existing Zephyr POSIX subsystem

This effort is supposed to be fully based on the existing POSIX subsystem, and is intended to continue its development in a continuous, seamless, sustainable fashion. There's no intention to replace it, tear it off, beat it with sticks, or anything like that. There may be a need for deep bug-fixes or wide refactors, but too-deep and too-wide cases should be rare, and each case would be handled on an as-needed basis, following the usual process (big non-trivial changes get RFCed and discussed, etc., while normal changes follow the agile process described above).

It should be noted that the elaboration of the existing POSIX subsystem has been going on for quite some time now, and this RFC effectively just captures this existing practice, so that the entire Zephyr community is kept in the loop.

During the initial discussion of the development process of the POSIX subsystem (i.e. the subject of this RFC), it was pointed out that the POSIX subsystem was initially intended to implement the PSE52 profile of POSIX. The author of this RFC (also a maintainer of the POSIX subsystem for the last half a year and the author of many changes to it) has to admit that such a claim came as a surprise. A lot of time while preparing this RFC was spent trying to understand this situation and how the historical plans for PSE52 profile development affect this RFC. Below, the situation with PSE52 is traced in detail:

  1. Let's try to search the official documentation for PSE52: https://docs.zephyrproject.org/latest/search.html?q=PSE52&check_keywords=yes&area=default . Only one hit, for the 1.11.0 changelog: https://docs.zephyrproject.org/latest/releases/release-notes-1.11.html?highlight=pse52
  2. Let's double-check by grepping doc sources in the tree:
zephyr/doc$ grep -r -i PSE52 *
releases/release-notes-1.11.rst:* POSIX PSE52 partial support.
releases/release-notes-1.11.rst:* POSIX PSE52 support:
releases/release-notes-1.11.rst:* :github:`1291` - Initial Posix PSE52 Support

So, the web search works well; there are no other references in the docs.
  3. The 1.11.0 changelog links to #1291, dated 2017-08-30, which indeed talks about implementing the PSE52 subset of POSIX.
  4. Is PSE52 mentioned anywhere else in the tree? The answer is implied by the doc search above, but let's double-check:

zephyr$ grep -r -i PSE52 *
doc/releases/release-notes-1.11.rst:* POSIX PSE52 partial support.
doc/releases/release-notes-1.11.rst:* POSIX PSE52 support:
doc/releases/release-notes-1.11.rst:* :github:`1291` - Initial Posix PSE52 Support

I.e., there are no further mentions of PSE52 in the tree, and in particular no config options specifically for PSE52.

So, how is the existing POSIX subsystem described by the config options?

  1. https://docs.zephyrproject.org/latest/reference/kconfig/CONFIG_POSIX_API.html :

Enable mostly-standards-compliant implementations of various POSIX (IEEE 1003.1) APIs.

I.e. the main POSIX option describes itself as implementing the "big" POSIX (IEEE 1003.1), not some subset (IEEE 1003.13). For full disclosure, this description comes from a patch by the author of this RFC, dated 2018-09-27: 8dc69e0 . This proves the point that the changes described in this document didn't start today or yesterday, but have been in progress for quite some time. (After making a number of changes like that to the POSIX subsystem, the author of this RFC volunteered to be a maintainer of the subsystem, to progress it along the vision now formally described in this document.)

However, I didn't mention IEEE 1003.1 off the top of my head; I essentially just copied it from the description of one of the previous POSIX patches: eb0aaca :

Add IEEE 1003.1 Posix Style file system API support.

That commit was made by one of the original authors of the POSIX subsystem.

Hopefully, the evidence presented is enough to make the following summary:

  1. Indeed, the implementation of the POSIX subsystem started with a request to implement a subset of POSIX functionality mandated by PSE52.
  2. However, even during the initial implementation, there was a leaning towards referring to the full POSIX standard, IEEE 1003.1. (We might pause here and start saying that it was a typo, but such a discussion would lead us sideways, with counter-claims that real-world engineers are less interested in obscure subsets of a well-known API from OpenGroup marketing materials than in the whole API, which allows working with real-world applications. Again, let's please not go there.)
  3. Beyond the initial implementation 2 years ago, the POSIX subsystem has been growing organically and gradually to subsume more POSIX functionality (along with a lot of bugfixes). This RFC does nothing but capture the existing process.
  4. In either case, the original PSE52 subset goal doesn't conflict in any way with the subject of this RFC. PSE52 is a subset of full POSIX. Full POSIX is a superset of PSE52. The original PSE52 implementation is not alien to the implementation discussed in this RFC. The wider POSIX implementation discussed in this RFC grows naturally from the original implementation.
  5. It might be a bit different if we had something like lib/pse52. The exact difference would be that this RFC would start with a proposal to rename it to lib/posix. But per p.2, it was done in a future-proof way from the start, so there's nothing to worry about here.

Location of the headers

Another question raised during pre-discussion of this RFC was the location of newly added POSIX headers. The previous section should provide a pretty obvious and natural response: the existing POSIX subsystem has its headers in include/posix, thus any extension to it would also have its headers in include/posix.

There was speculation that (some?) new headers might be put directly in include/. It would be quite inconsistent and unsustainable to have some POSIX headers under one path and others under another (also a 3rd, 4th, etc.?). An example was given based on a particular header which was in a subdirectory: include/posix/arpa/inet.h (re: https://github.com/zephyrproject-rtos/zephyr/pull/16621/files), speculating that it could just as well go into include/arpa/inet.h. But that's a very peculiar example. There are many POSIX headers, and the majority of them go into the top-level include directory, not a subdirectory like arpa/ above. Again, we don't want to make confusing rules about what goes where based on that kind of criteria. This won't be sustainable, and will lead to mistakes, conflicts, duplication, etc.

The interesting follow-up question, however, is whether using the include/posix/ location, as was done originally, was a good move, and whether it would make sense to revisit it now.

Generally, the header space should be structured and ordered. More specifically, there should be proper namespaces for native Zephyr headers vs POSIX headers. Failing that, there will be confusion and conflicts, again. We have actually had an example of that, so such claims come from actual experience: b4b108d .

The ideal situation would be that Zephyr native headers are namespaced, e.g. located in include/zephyr/ and included as #include <zephyr/...>, while POSIX headers are at their natural locations mandated by the standard. This would 100% resolve any risk of namespace conflicts (including with other 3rd-party projects). Unfortunately, the Zephyr TSC recently discussed this question and decided not to move Zephyr headers under zephyr/, so for the mid-term (2-3 years) we're locked out of the opportunity to resolve this issue completely, until further experience of using Zephyr in real-world conditions prompts another iteration on this matter.

Then, in the current conditions, sticking with the existing include/posix/ makes good sense: it's an already tried and tested solution, which doesn't give non-conflict guarantees as strong as the one described above, but still provides the bare-minimum required namespace separation. While this solution requires some overhead in managing include paths, and indeed some recent elaborations of that aren't merged yet (#15937), at least it's by now well understood that these elaborations are required and how to do them.

Retrospective: Existing 3rd-party applications porting projects

  • 2017-08 by @pfalcon (author of the RFC): MicroPython socket module. MicroPython was actually a testbed for prototyping Zephyr's BSD Sockets(-like) API. As such, it was largely and initially a from-scratch project, so POSIX compatibility matters became apparent only later, when it turned out that MicroPython had a "socket" module implementation for POSIX systems, and in the end had a module for Zephyr which shared probably ~80% of its code with the native POSIX module, with the rest ad hoc to Zephyr. That might be ok for MicroPython, due to its goal of detailed efficiency, but it already raised concerns that such an approach won't scale to porting many applications.
  • 2019-01 by @pfalcon: Porting of the OPC UA protocol library open62541. This is a true 3rd-party POSIX API project, which showed a lot of deficiencies in BSD Sockets vs POSIX vs libc integration. The work was completed, but with enormous time spent debugging build conflicts and with some dirty, unmergeable workarounds. The majority (dozen+) of patches were however cleaned up and merged into Zephyr, a large chunk of them for 1.14. Then, suddenly, the next chunk of patches (posix: Add headers related to BSD Sockets API #16621, posix: Clean up various headers #16626, lib: posix: Switch to use zephyr_interface_library_named cmake directive #15937), following the very same process as before, attracted blocking attention, which led to the creation of this RFC.
  • 2019-05 by @pfalcon: Porting/elaboration of the Google Cloud Platform IoT Embedded SDK (further called GoogleIoT). GoogleIoT is described as supporting Zephyr, but actually builds only for BOARD=native_posix, and while doing so, includes host-side POSIX headers. It's POSIX-based otherwise. Switching it to the proper Zephyr POSIX subsystem showed all the issues familiar from the open62541 porting work, which served as another motivation to revive, clean up, and submit the patches done while working on it (which now also apply to the GoogleIoT work).
  • 2019-05 by @PiotrZierhoffer, @tgorochowik, et al: Porting of CivetWeb to Zephyr ([RFC] Missing parts of libc required for CivetWeb #16683, Add civetweb HTTP sample #17019). The developers of the port immediately faced the POSIX vs newlib conflicts described above, starting with the open62541 port. Instead of trying to resolve them, they decided to use minimal libc instead. Of course, minimal libc is itself far from POSIX compliance, prompting the developers to import missing functionality from other libc projects, like musl. All of this leads to concerns about the maintainability of mixing pieces of different flavors of libc, instead of working on elaborating the existing one (newlib). This again served as a motivation to clean up and submit the previous patches that resolve POSIX subsys vs newlib compatibility.

Immediate scope of work

  • Making sure that the Zephyr BSD Sockets subsys and the Zephyr POSIX subsys properly integrate.
  • Making sure that the Zephyr POSIX subsys and newlib, Zephyr's official "full libc", properly integrate.
