Take action to stop the temporary directory being deleted due to lack of use #31732

@droberts195

Since #27609 Elasticsearch defaults to a per-run temporary directory. On Linux this is a directory created by the startup script under /tmp.

Linux systemd has functionality that can remove files and directories from /tmp that have not been used for a certain length of time. This functionality is described in man tmpfiles.d.

RHEL and CentOS 7 ship with a configuration for this functionality in /usr/lib/tmpfiles.d/tmp.conf that deletes files and directories that have not been touched for 10 days:

v /tmp 1777 root root 10d

(Note: If you read the man page you might think this doesn't delete old files, as the man page says:

  The age field only applies to lines starting with d, D, and x. If omitted or set to "-", no automatic clean-up is done.

However, the man page is wrong: cleanup by age also applies to other entry types, including v. In the code it applies to 7 type letters: https://github.com/systemd/systemd/blob/2479c4fe3fc3d0b631b93debbc2a83aa40a5f379/src/tmpfiles/tmpfiles.c#L1904

v is CREATE_SUBVOLUME in that switch.)

Currently the only part of Elasticsearch that uses java.io.tmpdir more than a few seconds after startup is ML. As a result, on a node running on RHEL or CentOS 7 where no ML job is started, the tmpfiles.d functionality removes the temporary directory 10 days after ES startup. Any ML job run on that node afterwards fails because the temporary directory does not exist.
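To make the failure mode concrete, here is a minimal sketch of a probe that reports whether a temporary file can still be created in a given directory. The class name TmpDirCheck and the method isUsable are hypothetical and are not part of Elasticsearch; the sketch assumes only standard java.nio.file calls. Note that a successful probe itself modifies the directory, so it is a diagnostic rather than a passive check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper, not an Elasticsearch class.
final class TmpDirCheck {

    private TmpDirCheck() {}

    /**
     * Returns true if a temporary file can be created in (and removed from) dir.
     * If dir has been deleted, for example by tmpfiles.d cleanup, file creation
     * fails with an IOException and this method returns false.
     */
    static boolean isUsable(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

It could be called as TmpDirCheck.isUsable(Paths.get(System.getProperty("java.io.tmpdir"))) before starting work that needs the directory.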

Because of the security manager, the ES JVM cannot recreate the temporary directory. The best solution therefore seems to be to periodically create and remove a file in the temporary directory. Doing this every 22 hours would keep the directory's modification time within the last day, even on days when the start of daylight saving time shortens the day to 23 hours. That would keep the directory alive even for a user who has configured tmpfiles.d to clean after 1 day.
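As an illustration of the proposal, here is a rough sketch of a periodic keep-alive using only the JDK. The class TmpDirKeepAlive and its methods are hypothetical names, not existing Elasticsearch code, and a real implementation would presumably use Elasticsearch's own thread pool and logging rather than a private scheduler and System.err. It also assumes the JVM is permitted to create and delete files inside the directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not an Elasticsearch class.
final class TmpDirKeepAlive implements AutoCloseable {

    private final Path tmpDir;
    private final ScheduledExecutorService scheduler;

    TmpDirKeepAlive(Path tmpDir) {
        this.tmpDir = tmpDir;
        // Daemon thread so the keep-alive never blocks JVM shutdown.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "tmpdir-keepalive");
            t.setDaemon(true);
            return t;
        });
    }

    /** Creates and immediately deletes a file, updating the directory's modification time. */
    static void touch(Path dir) throws IOException {
        Path marker = Files.createTempFile(dir, "keepalive", ".tmp");
        Files.delete(marker);
    }

    /** Touches the directory now and then once per period. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                touch(tmpDir);
            } catch (IOException | RuntimeException e) {
                // Swallow and report: an uncaught exception would cancel all future runs.
                System.err.println("Failed to touch " + tmpDir + ": " + e);
            }
        }, 0, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With the 22-hour interval suggested above, usage would look like new TmpDirKeepAlive(Paths.get(System.getProperty("java.io.tmpdir"))).start(22, TimeUnit.HOURS). The task catches its own exceptions because scheduleAtFixedRate suppresses all later executions once a run throws.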

Since ML is currently the only affected component, this periodic touching of the temporary directory could be done in the ML code. However, the problem could also affect third-party plugins that use java.io.tmpdir, so it would be nicer if the functionality to keep the temporary directory alive lived in core Elasticsearch. The ML team can implement it if we can get some advice on the best place in the code to put it.

Labels: :Core/Infra/Core, :ml, >bug
