Retrieve core files¶
Set core file path on infrastructure¶
This article on Kubernetes and core file management and this other one provide some pointers on core file management.
First, on your Kubernetes nodes, set the core file path, named <core-path> (/tmp/coredump in the example below), and make it writable by every container user that may create core files:
COREPATH=/tmp/coredump
sudo mkdir $COREPATH
sudo chmod 777 $COREPATH
# Expand $COREPATH in the calling shell: inside a single-quoted 'sudo sh -c' it would be empty
echo "$COREPATH/core.%e.%p.%h.%t" | sudo tee /proc/sys/kernel/core_pattern
Note
For kind users on a development workstation, the above command must be run on the host; it is then propagated to the k8s node and to the pods.
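The specifiers used in the pattern follow the Linux core(5) conventions: %e is the executable name, %p the PID of the dumped process, %h the hostname and %t the time of the dump in seconds since the Epoch. The sketch below wraps the setup in two small functions. The function names are illustrative and not part of Qserv, and the pattern is piped through sudo tee so that the directory is expanded by the calling shell.

```shell
#!/bin/sh
# Build the kernel core pattern for a given directory.
# %e: executable name, %p: PID, %h: hostname, %t: dump time (epoch seconds).
core_pattern_for() {
    printf '%s/core.%%e.%%p.%%h.%%t\n' "$1"
}

# Create the directory and install the pattern.
# Hypothetical helper: requires sudo and must run on the node (or on the host for kind).
install_core_pattern() {
    corepath="$1"
    sudo mkdir -p "$corepath"
    sudo chmod 777 "$corepath"
    core_pattern_for "$corepath" | sudo tee /proc/sys/kernel/core_pattern
}

# Example (not run automatically):
#   install_core_pattern /tmp/coredump
#   cat /proc/sys/kernel/core_pattern
```

Reading /proc/sys/kernel/core_pattern back afterwards is a quick way to confirm that the directory was expanded as intended.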
Store core files in a persistent storage¶
Install Qserv operator and then install a Qserv instance dedicated to development:
kubectl apply -k https://github.com/lsst/qserv-operator/manifests/dev
Core files produced by any Qserv binary will be stored and available.
For additional information, see Fine-tune a Qserv development instance.
Download core files¶
Core files will be available on the node running the pods, inside the <core-path> directory.
Note
For kind users, use the docker command to get core files on the workstation (see demo below). For bare-metal Kubernetes clusters, scp or rsync should work fine.
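For a bare-metal node reachable over SSH, the copy could look like the sketch below. The host name, the fetch_cores function and the DRY_RUN variable are illustrative assumptions, not part of Qserv or qserv-operator.

```shell
#!/bin/sh
# Copy every core file from a node to a local directory with rsync.
# Usage: fetch_cores <user@node> <core-path> [destination]
# Set DRY_RUN to a non-empty value to print the command instead of running it.
fetch_cores() {
    node="$1"        # e.g. user@k8s-node-1 (hypothetical host)
    corepath="$2"    # e.g. /tmp/coredump
    dest="${3:-.}"   # local destination, defaults to the current directory
    # The trailing slash on the source copies the directory content, not the directory itself
    set -- rsync -av "${node}:${corepath}/" "${dest}/"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (not run automatically):
#   fetch_cores user@k8s-node-1 /tmp/coredump ./cores
```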
Demo¶
This demo relies on a Kubernetes cluster based on kind and the qserv-operator:
# Create a directory to store core files on the k8s node (kind-specific)
docker exec -it -- kind-control-plane sh -c "mkdir -p /tmp/coredump && chmod 777 /tmp/coredump"
# Install Qserv
kubectl apply -k qserv-operator/manifests/dev
# Check Qserv is running
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
qserv-dev-czar-0 3/3 Running 0 11s 10.244.0.54 kind-control-plane
qserv-dev-repl-ctl-0 1/1 Running 0 11s 10.244.0.46 kind-control-plane
qserv-dev-repl-db-0 1/1 Running 0 11s 10.244.0.56 kind-control-plane
qserv-dev-worker-0 5/5 Running 0 11s 10.244.0.53 kind-control-plane
qserv-dev-worker-1 5/5 Running 0 11s 10.244.0.51 kind-control-plane
qserv-dev-worker-2 5/5 Running 0 11s 10.244.0.55 kind-control-plane
qserv-dev-xrootd-redirector-0 2/2 Running 0 11s 10.244.0.44 kind-control-plane
qserv-dev-xrootd-redirector-1 2/2 Running 0 10s 10.244.0.45 kind-control-plane
qserv-dev-xrootd-redirector-2 2/2 Running 0 10s 10.244.0.47 kind-control-plane
qserv-dev-xrootd-redirector-3 2/2 Running 0 10s 10.244.0.48 kind-control-plane
qserv-operator-5467b89db4-hbwgc 1/1 Running 0 149m 10.244.0.5 kind-control-plane
# Kill replication controller
kubectl exec -it qserv-dev-repl-ctl-0 -- bash
bash-4.2$ ps -ef
UID PID PPID C STIME TTY TIME CMD
qserv 1 0 0 11:30 ? 00:00:00 /bin/sh /cm-start/start.sh
qserv 9 1 0 11:30 ? 00:00:00 qserv-replica-master-http --worker-evict-timeout=3600 --health-probe-interval=120 --replication-interval=1200 --config=mysql://qsreplica:@qserv-dev-repl-db:3306/qservReplica --qserv-db-password=CHANGEME
qserv 100 0 0 11:38 pts/0 00:00:00 bash
qserv 112 100 0 11:38 pts/0 00:00:00 ps -ef
bash-4.2$ kill -s SIGSEGV 9
bash-4.2$ command terminated with exit code 137
# List and retrieve core file (kind-specific)
docker exec -it kind-control-plane ls /tmp/coredump
core.qserv-replica-m.9.qserv-dev-repl-ctl-0.1597318703
# Retrieve corefile locally (docker cp does not work because /tmp is managed by tmpfs in kind)
docker exec kind-control-plane tar Ccf "/tmp/coredump" - . | tar Cxf . -
ls
core.qserv-replica-m.9.qserv-dev-repl-ctl-0.1597318703
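The file name above encodes the fields of the core.%e.%p.%h.%t pattern. Per core(5), %e is truncated to 15 characters, which is why the executable appears as qserv-replica-m rather than qserv-replica-master-http. The following sketch splits such a name back into its fields. The describe_core function is a hypothetical helper, and it assumes that the hostname contains no dot.

```shell
#!/bin/sh
# Split a core file name of the form core.<exe>.<pid>.<host>.<time>.
# Fields are peeled off from the right, so only the hostname must be dot-free.
describe_core() {
    name=$(basename "$1")
    case "$name" in
        core.*.*.*.*) ;;
        *) echo "unexpected core file name: $name" >&2; return 1 ;;
    esac
    rest=${name#core.}
    ts=${rest##*.};   rest=${rest%.*}   # %t: dump time, seconds since the Epoch
    host=${rest##*.}; rest=${rest%.*}   # %h: hostname (the pod name in this demo)
    pid=${rest##*.};  exe=${rest%.*}    # %p: PID, %e: executable name
    echo "executable=$exe pid=$pid host=$host time=$ts"
}

# Example (not run automatically):
#   for f in core.*; do describe_core "$f"; done
```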
Debug manually a process inside a Qserv container¶
Install a Qserv instance dedicated to development¶
Install Qserv operator and then install a Qserv instance dedicated to development:
kubectl apply -k https://github.com/lsst/qserv-operator/manifests/dev
Demo¶
Open a shell in the debugger container and list all processes running in the Pod:
kubectl exec -it qserv-worker-0 -c debugger -- bash
[root@qserv-worker-0 /]# ps -ef
UID PID PPID C STIME TTY TIME CMD
65535 1 0 0 13:06 ? 00:00:00 /pause
root 20 0 0 13:06 ? 00:00:00 /bin/sh /cm-start/start.sh
1000 28 20 0 13:06 ? 00:00:02 mysqld
1000 60 0 0 13:06 ? 00:00:00 sleep infinity
root 67 0 0 13:06 ? 00:00:00 /bin/sh /cm-start/start.sh
root 74 67 0 13:06 ? 00:00:00 su qserv -c sh /cm-start/wmgr.sh
1000 75 74 0 13:06 ? 00:00:00 sh /cm-start/wmgr.sh
1000 82 75 0 13:06 ? 00:00:00 python /qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/Linux64/qserv/2021.7.1-rc1+2c8521dd9c/bin/qservWmg
root 84 0 0 13:06 ? 00:00:00 /bin/sh /cm-start/start.sh -S cmsd
root 92 84 0 13:06 ? 00:00:00 su qserv -c /cm-start/xrd.sh -S cmsd
1000 93 92 0 13:06 ? 00:00:00 /bin/sh /cm-start/xrd.sh -S cmsd
1000 99 93 0 13:06 ? 00:00:00 cmsd -c /cm-etc/xrootd.cf -n worker -I v4 -l @libXrdSsiLog.so -+xrdssi /cm-etc/xrdssi.cf
root 221 0 0 13:06 ? 00:00:00 /bin/sh /cm-start/start.sh
root 232 221 0 13:06 ? 00:00:00 su qserv -c /cm-start/xrd.sh -S xrootd
1000 233 232 0 13:06 ? 00:00:00 /bin/sh /cm-start/xrd.sh -S xrootd
1000 238 233 0 13:06 ? 00:00:00 xrootd -c /cm-etc/xrootd.cf -n worker -I v4 -l @libXrdSsiLog.so -+xrdssi /cm-etc/xrdssi.cf
root 403 0 0 13:06 pts/0 00:00:00 /usr/bin/bash
root 689 0 0 13:19 pts/1 00:00:00 bash
root 761 689 0 13:22 pts/1 00:00:00 ps -ef
Attach gdb to the xrootd process:
# Helper to display gdb command line
[root@qserv-worker-0 /]# debugtools 238
2021/08/09 13:24:37 Path to executable: /proc/238/root/qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/Linux64/xrootd/affinity-flex-hash-g5b015dcebc/bin/xrootd
2021/08/09 13:24:37 gdb command-line: gdb -iex "set sysroot /proc/238/root" -iex "set auto-load safe-path /proc/238/root" -p 238 /proc/238/root/qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/Linux64/xrootd/affinity-flex-hash-g5b015dcebc/bin/xrootd
[root@qserv-worker-0 /]# gdb -iex "set sysroot /proc/238/root" -iex "set auto-load safe-path /proc/238/root" -p 238 /proc/238/root/qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/Linux64/xrootd/affinity-flex-hash-g5b015dcebc/bin/xrootd
...
Loaded symbols for /proc/238/root/qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-1eb92eb/lib/./libicui18n.so.67
0x00007f3e15fd3afb in do_futex_wait.constprop.1 () from /proc/238/root/lib64/libpthread.so.0
(gdb) bt
#0 0x00007f3e15fd3afb in do_futex_wait.constprop.1 () from /proc/238/root/lib64/libpthread.so.0
#1 0x00007f3e15fd3b8f in __new_sem_wait_slow.constprop.0 () from /proc/238/root/lib64/libpthread.so.0
#2 0x00007f3e15fd3c2b in sem_wait@@GLIBC_2.2.5 () from /proc/238/root/lib64/libpthread.so.0
#3 0x00005636d5b98959 in Wait (this=<optimized out>)
at /qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/EupsBuildDir/Linux64/xrootd-affinity-flex-hash-g5b015dcebc/xrootd-affinity-flex-hash-g5b015dcebc/src/./XrdSys/XrdSysPthread.hh:421
#4 mainAccept(void*) ()
at /qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/EupsBuildDir/Linux64/xrootd-affinity-flex-hash-g5b015dcebc/xrootd-affinity-flex-hash-g5b015dcebc/src/Xrd/XrdMain.cc:129
#5 0x00005636d5b8f5e2 in main (argc=<optimized out>, argv=<optimized out>)
at /qserv/stack/stack/miniconda3-py37_4.8.2-1eb92eb/EupsBuildDir/Linux64/xrootd-affinity-flex-hash-g5b015dcebc/xrootd-affinity-flex-hash-g5b015dcebc/src/Xrd/XrdMain.cc:213
Many additional debugging tools are available inside the debugtools image. Check the debugtools documentation for additional information.
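The gdb command line printed by the helper follows a regular shape: the target's root filesystem is reached through /proc/<pid>/root and used both as sysroot and as auto-load safe path. The sketch below rebuilds an equivalent command line for an arbitrary PID. It is not the debugtools implementation, the gdb_attach_cmd name is hypothetical, and it assumes a Linux /proc filesystem and permission to read /proc/<pid>/exe.

```shell
#!/bin/sh
# Print a gdb command line for attaching to a process that lives in another
# container of the same Pod (shared process namespace).
gdb_attach_cmd() {
    pid="$1"
    root="/proc/$pid/root"
    # Path of the executable as seen from inside the target container
    exe=$(readlink "/proc/$pid/exe") || return 1
    printf 'gdb -iex "set sysroot %s" -iex "set auto-load safe-path %s" -p %s %s%s\n' \
        "$root" "$root" "$pid" "$root" "$exe"
}

# Example (not run automatically), using the PID of xrootd from the listing above:
#   gdb_attach_cmd 238
```

Printing the command instead of running it keeps the sketch side-effect free; the output can be reviewed and then pasted into the debugger container shell.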
Fine-tune a Qserv development instance¶
Pre-requisites¶
First, download qserv-operator locally:
git clone https://github.com/lsst/qserv-operator
Core path¶
The core path can be set by editing the corepath parameter in the file qserv-operator/manifests/dev/qserv.yaml:
apiVersion: qserv.lsst.org/v1alpha1
kind: Qserv
metadata:
  name: qserv
spec:
  devel:
    corepath: "<core-path>"
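Since the manifest and the node setup both refer to the same <core-path>, it can be useful to check that the two agree. The sketch below extracts the corepath value with a naive text match (not a YAML parser) and compares it with the directory part of a core pattern. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the first corepath value found in a Qserv manifest.
manifest_corepath() {
    sed -n 's/^[[:space:]]*corepath:[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Succeed if the manifest corepath equals the directory of the given core pattern.
corepath_matches_pattern() {
    [ "$(manifest_corepath "$1")" = "$(dirname "$2")" ]
}

# Example (not run automatically), on the node or on the host for kind:
#   corepath_matches_pattern qserv-operator/manifests/dev/qserv.yaml \
#       "$(cat /proc/sys/kernel/core_pattern)" || echo "core path mismatch"
```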
Manual debugging with gdb¶
The component(s) to debug can be selected by editing the debug parameter in the file qserv-operator/manifests/dev/qserv.yaml:
apiVersion: qserv.lsst.org/v1alpha1
kind: Qserv
metadata:
  name: qserv
spec:
  ...
  replication:
    debug: "repl-ctl"
Values for the debug parameter are:
* all: both the replication controller and the replication workers start in debug mode.
* repl-ctl: the replication controller starts in debug mode.
* repl-wrk: all replication workers start in debug mode.
In the above example, the replication controller will not start, so that the user can open an interactive shell inside the container, start the replication controller process in debug mode and perform debugging operations. The container won’t restart if the replication controller crashes.
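As a compact summary of the values listed above, the following sketch maps a debug value to the components that start in debug mode and rejects anything else. The debug_targets function is a hypothetical helper for checking a value before editing the manifest; it is not part of qserv-operator.

```shell
#!/bin/sh
# Print the components placed in debug mode for a given debug value.
# Returns 1 for a value that is not listed in this document.
debug_targets() {
    case "$1" in
        all)      echo "repl-ctl repl-wrk" ;;
        repl-ctl) echo "repl-ctl" ;;
        repl-wrk) echo "repl-wrk" ;;
        *)        echo "unknown debug value: $1" >&2; return 1 ;;
    esac
}

# Example (not run automatically):
#   debug_targets repl-ctl && kubectl exec -it qserv-dev-repl-ctl-0 -- bash
```

The pod name in the example is taken from the demo earlier in this page and may differ in other deployments.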
Re-install Qserv¶
Once the file qserv-operator/manifests/dev/qserv.yaml is ready, (re-)install Qserv in the current namespace:
kubectl apply -k qserv-operator/manifests/dev