This repository was archived by the owner on Mar 21, 2024. It is now read-only.
Debugging on AzureML
Anton Schwaighofer edited this page Sep 29, 2020 · 4 revisions
When creating the AzureML cluster, you need to tick the "Enable ssh" section. Pick your authentication method.
import logging
import rpdb

rpdb_port = 4444
# On SIGTRAP, rpdb starts a debugger that listens on this port
rpdb.handle_trap(port=rpdb_port)
logging.info(f"rpdb is handling traps. To debug: identify the main runner.py process, then as root: "
             f"kill -TRAP <process_id>; nc 127.0.0.1 {rpdb_port}")

The InnerEye toolbox already does this; it is included here for completeness.
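To make the trap mechanism concrete, here is a minimal sketch that uses only the Python standard library. It is not how rpdb is implemented; the handler `on_trap` and the list `trap_received` are hypothetical names chosen for illustration. It assumes a Unix system and that the code runs in the main thread.

```python
import os
import signal
import time

trap_received = []  # filled in by the handler so the main code can observe it


def on_trap(signum, frame):
    # A real debugger hook would start a debugger session here;
    # this sketch only records that the signal arrived.
    trap_received.append(signum)


# Register the handler, so that SIGTRAP no longer terminates the process.
signal.signal(signal.SIGTRAP, on_trap)

# Simulate `kill -TRAP <process_id>` by signalling our own process.
os.kill(os.getpid(), signal.SIGTRAP)

# Python runs signal handlers between bytecode instructions, so wait briefly.
deadline = time.monotonic() + 2.0
while not trap_received and time.monotonic() < deadline:
    time.sleep(0.01)

print("SIGTRAP handled:", bool(trap_received))
```

Without a registered handler, the default action for SIGTRAP terminates the process, which is why the handler has to be installed before anyone issues `kill -TRAP`.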
- From the "Details" tab in the run's page, note the Run ID, then click on the target name under "Compute target".
- Click on the "Nodes" tab, and identify the node whose "Current run ID" is that of your run.
- Copy the contents of the "Connection string" column for that node to the clipboard (`ssh user@...`) and execute it in a shell. You need to have `ssh` installed.
- Type "bash" for a nicer command shell (optional).
- Run `sudo docker ps` to see if Docker is running. You should see output that lists 1 Docker container ID.
- Identify the main python process with a command such as
    ps aux | grep 'python.*runner.py' | egrep -wv 'bash|grep'
  You may need to vary this if it does not yield exactly one line of output.
- Note the process identifier (the value in the PID column, generally the second one).
- Issue the commands
    kill -TRAP nnnn
    nc 127.0.0.1 4444
  where nnnn is the process identifier. If the python process is in a state where it can accept the connection, the "nc" command will print a prompt from which you can issue pdb commands.
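The process lookup and the final `kill`/`nc` step above can be sketched in Python as follows. This is an illustration under assumptions, not part of the InnerEye toolbox: `find_runner_pids` and `trap_and_connect` are hypothetical helpers, and the `ps aux` sample is made up for the demonstration. `trap_and_connect` is only defined here, not called, because it needs a live process with the trap handler installed.

```python
import os
import re
import signal
import socket


def find_runner_pids(ps_output):
    """Return the PIDs of `ps aux` lines that look like the runner process."""
    pids = []
    for line in ps_output.splitlines():
        # Same pattern as the grep in the steps above.
        if not re.search(r"python.*runner.py", line):
            continue
        # Mirror `egrep -wv 'bash|grep'`: drop lines containing the words bash or grep.
        if re.search(r"\b(bash|grep)\b", line):
            continue
        pids.append(int(line.split()[1]))  # the PID is the second column
    return pids


def trap_and_connect(pid, port=4444):
    """Send SIGTRAP to `pid`, then open a TCP connection to the debugger port."""
    os.kill(pid, signal.SIGTRAP)
    return socket.create_connection(("127.0.0.1", port), timeout=10)


# Made-up sample of `ps aux` output, for illustration only.
sample = """\
USER  PID %CPU %MEM VSZ RSS TTY   STAT START TIME COMMAND
root   17  0.0  0.0 100 100 ?     S    10:00 0:00 /bin/bash -c python runner.py
root   42 99.0  1.0 100 100 ?     R    10:00 5:00 python runner.py
root   99  0.0  0.0 100 100 pts/0 S+   10:05 0:00 grep python.*runner.py
"""
runner_pids = find_runner_pids(sample)
print(runner_pids)
```

If the returned list does not contain exactly one PID, the filter needs adjusting before any signal is sent.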
Notes:
- The last step (`kill` and `nc`) can be successfully issued at most once for a given process. Thus if you might want a colleague to carry out the debugging, think carefully before issuing these commands yourself.
Quick summary:
- `w` for `where`, full stack trace
- `u` and `d` for `up` and `down`, go one frame up/down
- `s` for `step`, execute one step
- `help`
When you exit the debugging session via Ctrl-C, the process stays stuck at the PDB prompt and we cannot reconnect, so the job has to be killed.
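The pdb commands from the quick summary can be tried locally, without a cluster. The sketch below drives a scripted pdb session by feeding the commands from a string and capturing the output; the functions `inner` and `outer` are hypothetical stand-ins for real training code.

```python
import io
import pdb


def inner(x):
    # Scripted session: where, up, down, continue.
    commands = io.StringIO("w\nu\nd\nc\n")
    transcript = io.StringIO()
    debugger = pdb.Pdb(stdin=commands, stdout=transcript,
                       nosigint=True, readrc=False)
    debugger.set_trace()
    result = x * 2  # the debugger stops in this function and runs the commands
    return result, transcript.getvalue()


def outer():
    return inner(21)


value, session = outer()
print(session)  # stack trace from `w`, then the frames visited by `u` and `d`
```

The final `c` lets the program continue normally, which is the clean way to leave a session, as opposed to interrupting the connection.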
- Run `sudo docker ps` to see the container ID.
- Run `sudo docker exec -it <containerID> /bin/bash` to start bash inside the container.
- Install additional tools:
    apt-get update
    apt-get install htop gdb vim netcat
- Run `htop` to see a multi-CPU utilization chart and info.
- Run the `kill`/`nc` commands as described above.
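As a small illustration of the first two steps above, the following sketch extracts the container ID from `docker ps` output and builds the matching `docker exec` command line. The helper names and the sample output are assumptions made for this example; nothing here calls Docker.

```python
def container_ids(docker_ps_output):
    """Extract container IDs (first column) from `docker ps` table output."""
    lines = [ln for ln in docker_ps_output.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]  # skip the header row


def single_container_id(docker_ps_output):
    """Return the only container ID, or fail if there is not exactly one."""
    ids = container_ids(docker_ps_output)
    if len(ids) != 1:
        raise ValueError(f"expected exactly 1 container, found {len(ids)}")
    return ids[0]


def exec_command(container_id):
    """Build the argument list for an interactive bash inside the container."""
    return ["sudo", "docker", "exec", "-it", container_id, "/bin/bash"]


# Made-up sample of `docker ps` output, for illustration only.
sample = """\
CONTAINER ID   IMAGE           COMMAND   CREATED       STATUS       PORTS   NAMES
0123456789ab   example/image   "bash"    2 hours ago   Up 2 hours           example
"""
command = exec_command(single_container_id(sample))
print(" ".join(command))
```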
- Go inside the Docker container as described above.
- Install `gdb`
- Install Cython: `pip install cython`
- Execute `which python`. This will print something like `/azureml-envs/azureml_1234abc/bin/python`
- Using `vim` or a reasonable editor, edit `~/.gdbinit` and add this line (replacing `azureml_1234abc` with the folder where your Python resides)
source /azureml-envs/azureml_1234abc/lib/python3.7/site-packages/Cython/Debugger/libpython.py
- Start `gdb` via `gdb python nnn`, where `nnn` is the process ID of the Python job (check via `top`)
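The `~/.gdbinit` line can be derived from the output of `which python`. The sketch below assumes the usual CPython layout, in which packages live under `lib/pythonX.Y/site-packages` inside the environment folder; `gdbinit_source_line` is a hypothetical helper, and the resulting path should be checked on the node before it is used.

```python
from pathlib import PurePosixPath


def gdbinit_source_line(python_executable, python_version="3.7"):
    """Build the `source ...` line for ~/.gdbinit from the Python executable path."""
    # /azureml-envs/<env>/bin/python -> /azureml-envs/<env>
    env_root = PurePosixPath(python_executable).parent.parent
    libpython = (env_root / "lib" / f"python{python_version}" / "site-packages"
                 / "Cython" / "Debugger" / "libpython.py")
    return f"source {libpython}"


line = gdbinit_source_line("/azureml-envs/azureml_1234abc/bin/python")
print(line)
```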
- `py-bt` to get a trace of where the process presently is
- `info th` to see which threads are running
- `thread 2` to switch to thread 2, then you can run `py-bt` to see where that thread is
- Traces are printed out with the innermost stack frame at the top
- Watch out for "Waiting for the GIL" at the top of the stack trace: this would indicate thread contention
- `py-up` to move up the stack
- `c` to continue running, Ctrl-C to interrupt
- Run `top` to see if there's a Python process running
- Run `nvidia-smi` to check GPU status