Distributed Serialization segfault on 1.5.1 #37545

@mmattocks

Description

I have recently begun experiencing a strange issue where a long-running process with many workers will segfault after about an hour with the following output:

signal (11): Segmentation fault
in expression starting at /srv/git/rys_nucleosomes/nested_sampling/dif_pos_learner.jl:82
sig_match_fast at /buildworker/worker/package_linux64/build/src/gf.c:2250 [inlined]
jl_lookup_generic_ at /buildworker/worker/package_linux64/build/src/gf.c:2332 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2394
serialize_any at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:648
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:627 [inlined]
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:272
serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:2000
unknown function (ip: 0x7f303d394df5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
serialize_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:90
unknown function (ip: 0x7f303d392e05)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710 [inlined]
invokelatest at ./essentials.jl:709 [inlined]
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:185
send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:130 [inlined]
send_msg_now at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:125
deliver_result at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:111
unknown function (ip: 0x7f303d394ab8)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:302 [inlined]
#105 at ./task.jl:356
unknown function (ip: 0x7f303d390b6c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:707
unknown function (ip: (nil))
Allocations: 859407961 (Pool: 859209782; Big: 198179); GC: 217
fish: “julia” terminated by signal SIGSEGV (Address boundary error)

I really have no clue where to start producing an MWE because of the "unknown function" frames in the trace. Does anyone have a hint as to what might be happening here? I am confused by the appearance of "serialize" in this report, since none of the code the remote workers execute calls serialize() directly. Could this arise from remote workers calling remotecall_fetch(deserialize,...)? That is the only thing in my code that the remote workers execute that has anything to do with Serialization, so maybe the problem is at a lower level than that.

julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 2
