Skip to content

Conversation

@sebroy
Copy link

@sebroy sebroy commented May 1, 2025

Test restoring the pre-commit hook

osandov and others added 30 commits December 18, 2024 16:24
Ubuntu enables -Wformat-security by default, but upstream GCC doesn't
enable it even with -Wall. It caught some legitimate issues in the
module API branch, so let's enable it explicitly.

Signed-off-by: Omar Sandoval <[email protected]>
An upcoming commit needs to look up a few notes by name+type, so add a
common helper function for that.

Signed-off-by: Omar Sandoval <[email protected]>
This will be used for checking the CRC in .gnu_debuglink.

Signed-off-by: Omar Sandoval <[email protected]>
These will be used for GNU build IDs.

Signed-off-by: Omar Sandoval <[email protected]>
We are going to need to parse various structures that have different 64-
and 32-bit formats, may be byte-swapped, and may be unaligned (e.g.,
various ELF and link map structures). Provide a couple of convenience
macros for doing this.

Signed-off-by: Omar Sandoval <[email protected]>
This is a trivial wrapper to delete by entry instead of by key or
iterator.

Signed-off-by: Omar Sandoval <[email protected]>
This has been available since Linux kernel commit 0935288c6e00 ("kdump:
append kernel build-id string to VMCOREINFO") (in v5.9). Save it so we
can use it when loading debugging information.

Unfortunately, the build ID in VMCOREINFO was briefly broken in several
stable releases. 6.10 and 5.15 reached their end-of-life while broken
and so will remain broken forever. It feels like overkill to drop
support for those versions over this, so we work around it with a
version check.

Signed-off-by: Omar Sandoval <[email protected]>
No code changes other than moving it in the file to make upcoming diffs
cleaner.

Signed-off-by: Omar Sandoval <[email protected]>
We currently use debuginfod via libdwfl, but when we get rid of our
libdwfl dependency, we'll need to do the debuginfod calls ourselves. So,
let's add the scaffolding for using libdebuginfod. We provide three
choices at build time:

* No debuginfod (./configure --without-debuginfod).
* Soft dependency: load libdebuginfod with dlopen at runtime if
  available (./configure --with-debuginfod --enable-dlopen-debuginfod).
  This is the default and probably what we want distros to use.
* Hard dependency: link against libdebuginfod (./configure
  --with-debuginfod --disable-dlopen-debuginfod). This is intended for
  environments where dlopen can't be used (e.g., manylinux wheels).

The client handle will be created lazily, so for now this just sets up
some wrappers and doesn't do much with them.

Signed-off-by: Omar Sandoval <[email protected]>
We try to pick a good default, but it's not exactly the same as logging
so it needs extra flexibility. This will be used by upcoming debuginfod
integration.

Signed-off-by: Omar Sandoval <[email protected]>
drgn currently provides limited control over how debugging information
is found. drgn has hardcoded logic for where to search for debugging
information. The most the user can do is provide a list of files for
drgn to try in addition to the default locations (with the -s CLI option
or the drgn.Program.load_debug_info() method).

The implementation is also a mess. We use libdwfl, but its data model is
slightly different from what we want, so we have to work around it or
reimplement its functionality in several places: see commits
e5874ad ("libdrgn: use libdwfl"), e6abfea ("libdrgn:
debug_info: report userspace core dump debug info ourselves"), and
1d4854a ("libdrgn: implement optimized x86-64 ELF relocations") for
some examples. The mismatched combination of libdwfl and our own code is
difficult to maintain, and the lack of control over the whole debug info
pipeline has made it difficult to fix several longstanding issues.

The solution is a major rework removing our libdwfl dependency and
replacing it with our own model. This (huge) commit is that rework
comprising the following components:

- drgn.Module/struct drgn_module, a representation of a binary used by a
  program.
- Automatic discovery of the modules loaded in a program.
- Interfaces for manually creating and overriding modules.
- Automatic discovery of debugging information from the standard
  locations and debuginfod.
- Interfaces for custom debug info finders and for manually overriding
  debugging information.
- Tons of test cases.

A lot of care was taken to make these interfaces extremely flexible yet
cohesive. The existing interfaces are also reimplemented on top of the
new functionality to maintain backwards compatibility, with one
exception: drgn.Program.load_debug_info()/-s would previously accept
files that it didn't find loaded in the program. This turned out to be a
big footgun for users, so now this must be done explicitly (with
drgn.ExtraModule/--extra-symbols).

The API and implementation both owe a lot to libdwfl:

- The concepts of modules, module address ranges/section addresses, and
  file biases are heavily inspired by the libdwfl interfaces.
- Ideas for determining modules in userspace processes and core dumps
  were taken from libdwfl.
- Our implementation of ELF symbol table address lookups is based on
  dwfl_module_addrinfo().

drgn has taken these concepts and fine-tuned them based on lessons
learned.

Credit is also due to Stephen Brennan for early testing and feedback.

Closes #16, closes #25, closes osandov#332.

Signed-off-by: Omar Sandoval <[email protected]>
Overwriting a file that libelf has already mmap'd can confuse it and
cause it to crash. In particular, libelf/elf_begin.c::file_read_elf()
initializes Elf_Scn::rawdata_base and Elf_Scn::data_base from the mmap'd
file. libelf/elf_getdata.c::__libelf_set_rawdata_wrlock() also sets
Elf_Scn::rawdata_base from the mmap'd file. If the file changes between
those two events, then Elf_Scn::rawdata_base will change. Then, the
following line in libelf/elf_end.c::elf_end() will try to free an mmap'd
pointer:

	if (scn->data_base != scn->rawdata_base)
	  free (scn->data_base);

Stephen reported crashes like this from
test_gnu_debugaltlink_debug_directories() while testing a patch that
inadvertently caused debug info to be indexed on module creation.

Reported-by: Stephen Brennan <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Since commit 4e83130 ("Introduce module and debug info finder
APIs"), DWARF indexing is somewhat lazy. As a result, drgn_void_type()
may require DWARF indexing in order to determine the program's main
language. This is overkill for drgn_object_init(), which just needs to
initialize a valid dummy object that is usually reinitialized
immediately.

Signed-off-by: Omar Sandoval <[email protected]>
Fixes: 4e83130 ("Introduce module and debug info finder APIs")
Signed-off-by: Omar Sandoval <[email protected]>
Mention that `drgn` is shipped with Ubuntu except jammy, but it's an
older version

Signed-off-by: Michel Lind <[email protected]>
Since libkdumpfile commit 5b044292abe9 ("Clarify and fix attribute data
lifetime") changes the lifetime of attribute values retrieved with
kdump_attr_ref_get(), the extra reference would keep the PRSTATUS blob
around even after kdump_free().

However, the attribute hierarchy cannot change while iterating over the
PRSTATUS attributes, so it is not necessary to take an attribute reference
and we can use kdump_get_typed_attr().

The attribute blob itself should not change either, but it is a good idea
to keep its data pinned, because a raw pointer to it is stored in the
drgn_thread_set hash table. If some code tries to modify the PRSTATUS
attribute data, the attempt will fail with KDUMP_ERR_BUSY rather than leave
a dangling pointer in the hash table and possibly cause a UAF bug later.

The blob pin does not prevent freeing the blob when the blob reference
count reaches zero.

Signed-off-by: Petr Tesarik <[email protected]>
The kdump_get_typed_attr() function prototype changed in libkdumpfile
commit e182aeaf4d72 ("Make kdump_get_typed_attr() easier to use").

Signed-off-by: Petr Tesarik <[email protected]>
…on_name

Multiple people have lamented that StackFrame.name is None for functions
implemented in assembly or missing debug info for any other reason. With
DWARFless debugging, this will be way more common. My original hope was
that StackFrame.name would strictly be the function name from the
debugging information and that callers would fall back to getting the
symbol name themselves. However, the distinction isn't super meaningful
to users, so let's add the fallback directly to StackFrame.name and add
StackFrame.function_name with the old behavior of StackFrame.name.

Signed-off-by: Omar Sandoval <[email protected]>
This was previously used for testing internals via ctypes, but it's no
longer needed.

Fixes: 7d251fe ("Translate C lexer tests to C unit tests")
Signed-off-by: Omar Sandoval <[email protected]>
The x86-64 fallback unwinder currently has a special case for handling a
call to a NULL pointer. Other architectures need the same workaround. To
avoid code duplication, let's extract the null program counter check
into the generic stack tracing code and add a bad_call_unwind
architecture callback. This also gives us a centralized place to add
heuristics for detecting non-null bad calls.

Signed-off-by: Omar Sandoval <[email protected]>
This will mainly be useful for bad_unwind_call implementations that make
use of drgn_register_state_dup() and need to mark some registers as
unknown in the unwound frame.

Signed-off-by: Omar Sandoval <[email protected]>
Since AArch64 uses a link register rather than storing the return
address on the stack, this is a bit easier than on x86-64. Fixes osandov#462.

Signed-off-by: Omar Sandoval <[email protected]>
This is a shorter-term solution for anyone who can't run a version of
drgn with the previous fix.

Signed-off-by: Omar Sandoval <[email protected]>
PID 0 is not unique in the Linux kernel; there is a task with PID 0 for
each CPU. stack_trace(0) currently fails with a generic "task not found"
error message, which can be confusing; see osandov#462. Add a hint to use
idle_task() to the error message when the given PID is 0.

Signed-off-by: Omar Sandoval <[email protected]>
Alec Rivers noticed in osandov#461 that WITH_LIBKDUMPFILE is misspelled as
WITH_KDUMPFILE here. The whole ifdef block isn't actually needed, so
remove it.

Fixes: 4e330bb ("cli: indicate if drgn was compiled with libkdumpfile")
Signed-off-by: Omar Sandoval <[email protected]>
None of our setter functions handle deletions, which pass the value as
NULL. This results in a segfault when attempting to access the value.
Fix them all with a new convenience macro.

Fixes: 50e4ac8 ("libdrgn: allow overriding program default language")
Fixes: 4e83130 ("Introduce module and debug info finder APIs")
Signed-off-by: Omar Sandoval <[email protected]>
STRING_BUILDER has been really convenient, so let's do the same for
vectors and hash tables.

Signed-off-by: Omar Sandoval <[email protected]>
…lable

Since the module API was introduced, Program.load_debug_info() and the
drgn CLI's -s option match strictly based on build IDs. This fails when
the build ID is not available, specifically in the case of Linux kernel
core dumps without a usable build ID in VMCOREINFO (old versions and a
few buggy stable versions).

Before the module API, Program.load_debug_info() and -s used any vmlinux
file given to them. This caused confusion when the wrong file was given,
so we don't want to bring that behavior back. Instead, let's look for a
vmlinux file matching the Linux version from VMCOREINFO.

Fixes osandov#464.

Fixes: 4e83130 ("Introduce module and debug info finder APIs")
Signed-off-by: Omar Sandoval <[email protected]>
We do this so often I'm surprised I didn't add this sooner.

Signed-off-by: Omar Sandoval <[email protected]>
osandov and others added 24 commits April 11, 2025 01:42
Tweak style, split up some examples for clarification, and a couple more
small updates.

Signed-off-by: Omar Sandoval <[email protected]>
I'm not really even sure how or why the URLs changed. Again.

Signed-off-by: Stephen Brennan <[email protected]>
Fixes: adf6472 ("Add plugin system")
Signed-off-by: Omar Sandoval <[email protected]>
In some test environments, like Buck, it's not easy to reexecute the
Python interpreter and have it find the drgn module. Instead of using
subprocess, fork and call drgn.cli._main() directly (setting up the
necessary pipes ourselves). This gives us slightly less realistic
end-to-end coverage, but it makes the tests more robust.

Signed-off-by: Omar Sandoval <[email protected]>
We need to close the socket.

Signed-off-by: Omar Sandoval <[email protected]>
Our handling of rbp in ORC_TYPE_REGS is incorrect. First, we set the CFI
rule to restore rbp from the on-stack pt_regs. Then, we set it again
based on bp_reg, which is always ORC_REG_UNDEFINED, meaning to use the
previous rbp. The intention is to keep the value we just got from
pt_regs, but we interpret it as the previous frame's rbp, which is
incorrect.

Fix it by skipping the bp_reg handling for ORC_TYPE_REGS.

Reported-by: Leo Martins <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
…iles again

Since the fixes commit, the _drgn Python extension module links directly
to libpython. However, manylinux extensions are not supposed to link
against libpython [1], and the manylinux container images intentionally
omit libpython, so building drgn for manylinux fails.

Add a few modes of building libdrgn to address this:

* --disable-libdrgn --enable-python-extension: build _drgn.so (the
  Python extension module), which doesn't link against libpython, and
  don't build libdrgn.so. This is what setup.py uses since Python-only
  installations don't need libdrgn. It works in manylinux.
* --enable-python: build libdrgn.so with Python support, which links
  against libpython, and don't build _drgn.so. In this configuration,
  libdrgn.so also works as the Python extension module, so _drgn.so can
  be installed as a symlink to libdrgn.so.
* (No options, equivalent to --enable-libdrgn --disable-python
  --disable-python-extension): build libdrgn.so with no Python support,
  don't build _drgn.so. For backwards-compatibility, this is the default
  when building libdrgn directly, but I doubt anyone does that these
  days.
* --enable-libdrgn --enable-python-extension: build libdrgn.so with
  Python support, which links against libpython, and build _drgn.so,
  which doesn't link against libpython. I guess this works but I don't
  see a reason to prefer it over --enable-python + a symlink.

This also requires refactoring our libtool convenience libraries.
Hopefully this sucks less in Meson.

[1]: https://peps.python.org/pep-0513/#libpythonx-y-so-1

Fixes: 3d49360 ("libdrgn: combine libdrgn and _drgn Python extension into one .so")
Signed-off-by: Omar Sandoval <[email protected]>
The new version includes a new kdumpid binary with a dependency on BFD
that we don't want, so disable it.

Signed-off-by: Omar Sandoval <[email protected]>
Some versions of the kernel fail to build for ppc64 with GCC 12 and
CONFIG_KPROBES enabled due to -Wdangling-pointer. Backport the commit
that disabled that warning.

Fixes: 9b7297d ("vmtest.config: enable CONFIG_KPROBES for upcoming kmodify breakpoints")
Signed-off-by: Omar Sandoval <[email protected]>
The GitHub Actions Ubuntu 20.04 image has just been shut down:
actions/runner-images#11101. Update to Ubuntu 22.04. Since 22.04 doesn't
include Python 3.6 or 3.7, also drop those. I've already announced that
this upcoming release will be the last one with support for those Python
versions (osandov#467). I'll test them manually before cutting the release.

Signed-off-by: Omar Sandoval <[email protected]>
I've found that these flood the logs and often make me lose more
interesting stuff in the scrollback. The values are available in
Module.section_addresses, so just log where we got them.

Signed-off-by: Omar Sandoval <[email protected]>
…lugins

These options also disable directories added by plugins, so document
that.

Signed-off-by: Omar Sandoval <[email protected]>
dnf_debug_info_finder() needs a couple more comments, and
example_debug_info_finder() wouldn't actually run as is.

Signed-off-by: Omar Sandoval <[email protected]>
It's missing a blank line, which messes up the formatting. Also reword
it a tiny bit.

Fixes: 545aa52 ("mm, slab: Fix test failures on kernels with SLOB")
Signed-off-by: Omar Sandoval <[email protected]>
The major reworks for the module/debug info finder APIs and proper
debuginfod support substantially change the story for getting debugging
symbols. Revamp our documentation:

* Document how to use debuginfod on different distributions.
* Add flow charts with our recommendations on each distribution.
* Add openSUSE documentation.
* Also document how to build with debugging symbols on different build
  systems.

Closes osandov#380.

Signed-off-by: Omar Sandoval <[email protected]>
Fixes: adf6472 ("Add plugin system")
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
struct foo64 is only 20 bytes with i386's alignment requirements, so we
get the following compiler warning on i386:

  ‘memcpy’ forming offset [20, 23] is out of the bounds [0, 20] of object ‘foo’ with type ‘struct foo64’

and a crash when running the tests. Fix it by adding padding manually.

I verified that this only affects the test cases. All of the structs we
actually use deserialize_struct64 for are properly defined.

Fixes osandov#493.

Signed-off-by: Omar Sandoval <[email protected]>
We need to catch and print exceptions in the forked process. stdout and
stderr are also str, so we shouldn't be calling .decode().

Fixes: 22ac3f3 ("tests: don't shell out for CLI tests")
Signed-off-by: Omar Sandoval <[email protected]>
drgn -e '' should start drgn, execute the empty statement, and exit, not
enter interactive mode. Fix it by checking whether args.exec is None
instead of bool(args.exec).

Fixes: 150ee76 ("cli: add -e option to exec() code directly")
Signed-off-by: Omar Sandoval <[email protected]>
We don't have a nice automated way to run the C unit tests from the main
vmtest CLI yet (I'm probably going to wait until we convert to Meson to
see how that will work), but at least install check so it can be done
manually.

Signed-off-by: Omar Sandoval <[email protected]>
@sebroy sebroy force-pushed the dlpx/pr/sebroy/f784b91a-846d-425d-a843-377d986f524b branch from 6420980 to b067e43 Compare May 1, 2025 14:04
@prakashsurya
Copy link

Before we move forward with this, can we get a handle on why we're only updating os-upgrade, and not develop?

@prakashsurya
Copy link

Also, do we want to update os-upgrade, vs. keeping in sync with develop.. and then potentially doing this sync on develop, after the LTS "merge"? assuming there's a good reason not to do the sync on develop prior to the merge..?

@sebroy
Copy link
Author

sebroy commented May 1, 2025

I'm not going to move forward with this @prakashsurya unless you want to move forward with it. We can certainly take care of this merge on develop later.

@sebroy
Copy link
Author

sebroy commented May 1, 2025

We've decided to pause merge conflict resolution for drgn until after the os-upgrade branches have been merged to develop.

@sebroy sebroy closed this May 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

7 participants