查证微服务死循环导致 CPU 100%问题
/ / 阅读数:2755摘要 :在 Linux 系统上,当 Python 进程出现 CPU 100% 的问题时,该如何查证?本文介绍了如何使用 gdb, gcore, strace, ltrace 等常见调试工具和具体的查证过程。
问题表现
监控系统报警某服务进程出现 CPU 100% 情况,该业务进程不响应任何请求,无日志输出,进程 IO 和内存占用都正常。
查证过程
起初怀疑是遇到 Eventlet 打日志死锁导致卡住 的问题,但经过一番排查,原来是代码 BUG 导致了死循环。
具体查证过程如下:
top 查看 CPU 和内存占用
# top -bn1p 6325 Tasks: 608 total, 9 running, 596 sleeping, 0 stopped, 3 zombie %Cpu(s): 20.5 us, 6.3 sy, 0.0 ni, 72.3 id, 0.5 wa, 0.0 hi, 0.5 si, 0.0 st KiB Mem: 65983744 total, 65372892 used, 610852 free, 574684 buffers KiB Swap: 4194300 total, 0 used, 4194300 free. 42109148 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 6325 cc 20 0 4840544 628972 69656 R 100.0 1.0 927:41.47 python2.7 |
显示进程 CPU 100%,内存占用正常。
strace 查看系统调用
# strace -p 6325
|
结果一直卡住,无任何输出,说明进程没有进行任何系统调用。
ltrace 查看库函数调用
# ltrace -cp 6325 ^C% time seconds usecs/call calls function ------ ----------- ----------- --------- -------------------- 71.84 10.207067 81 124827 memset 23.87 3.392006 81 41608 strchr 2.15 0.304948 82 3708 sem_wait 2.14 0.303884 81 3708 sem_post ------ ----------- ----------- --------- -------------------- 100.00 14.207905 173851 total # ltrace -p 6325 [pid 6325] memset(0x7f44e98309a8, '\0', 8) = 0x7f44e98309a8 [pid 6325] strchr("O|O:enumerate", ':') = ":enumerate" [pid 6325] memset(0x7f44e98309a8, '\0', 8) = 0x7f44e98309a8 [pid 6325] strchr("O|O:enumerate", ':') = ":enumerate" ... |
结果显示进程一直在调用memset
和strchr
,极有可能是遇到死循环了,而且死循环里面重复执行enumerate
函数。
gcore 生成 coredump 文件
为了避免gdb attach
进程造成的其他影响(比如可能出现进程异常退出,死锁突然恢复,影响线上服务等),最好将进程生成一个 coredump 文件,然后再慢慢分析。
# gcore 6325 # ls -lsh core.6325 2.7G -rw-r--r-- 1 root root 2.7G 4月 14 00:56 core.6325 |
生成完 coredump 文件,如果出问题进程无法从线上摘除,则可以直接停掉该进程了。
gdb 分析 coredump 文件
# gdb python core.6325 GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1 ... Reading symbols from python...Reading symbols from /usr/lib/debug/usr/bin/python2.7...done. done. ... Core was generated by `/xxx/service_xx/venv/bin/python2.7'. |
发现生成 Core 文件的Python
和系统默认的Python
不是同一个文件(可能期间有重新编译或升级过 Python 版本):
# md5sum /xxx/service_xx/venv/bin/python2.7 b93dd3c9f0dfc774db3ee726fc90ca13 /xxx/service_xx/venv/bin/python2.7 # md5sum $(which python) cc840097a91c3d18b05c9d15196cb267 /usr/bin/python # /xxx/service_xx/venv/bin/python2.7 Python 2.7.9 (default, Jun 29 2016, 13:08:31) # python Python 2.7.9 (default, Sep 25 2018, 20:42:16) |
如果直接用不同版本的 Python 去 gdb 调试,将无法看到正确的堆栈信息。于是改用虚拟环境的Python
进行调试:
# gdb /xxx/service_xx/venv/bin/python2.7 core.6325 ... Reading symbols from /xxx/service_xx/venv/bin/python2.7...(no debugging symbols found)...done. ... |
结果提示(no debugging symbols found)
,原来apt-get install python2.7-dbg
生成的 python-dbg 文件在/usr/lib/debug/usr/bin/python2.7
目录,
而我们 debug 的是虚拟环境的Python
,无法找到对应的带调试符号的Python
。
按 stackoverflow 上的建议 ,将/usr/lib/debug/usr/bin
软链接到虚拟环境.debug
目录,并且在~/.gdbinit
添加add-auto-load-safe-path
授权,然后继续 gdb。
$ ln -sf /usr/lib/debug/usr/bin /xxx/service_xx/venv/.debug $ cat ~/.gdbinit add-auto-load-safe-path /xxx/service_xx/venv/bin/.debug/python2.7-gdb.py |
然后再重新 gdb 调试:
$ gdb /xxx/service_xx/venv/bin/python2.7 core.6325 ... Reading symbols from /xxx/service_xx/venv/bin/python2.7... warning: the debug information found in "/xxx/service_xx/venv/bin/.debug/python2.7" does not match "/xxx/service_xx/venv/bin/python2.7" (CRC mismatch). ... |
结果还是提示Python
版本不匹配,于是从其他机器拷贝了一份和虚拟环境Python
版本相匹配的 python-dbg 文件到.debug
目录,然后再重新 gdb 调试:
$ rm -rf /xxx/service_xx/venv/bin/.debug $ mkdir /xxx/service_xx/venv/bin/.debug $ scp -r xxx:/usr/lib/debug/usr/bin/* /xxx/service_xx/venv/bin/.debug $ gdb /xxx/service_xx/venv/bin/python2.7 core.6325 GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1 ... Reading symbols from /xxx/service_xx/venv/bin/python2.7...Reading symbols from /xxx/service_xx/venv/bin/.debug/python2.7...done. done. ... |
现在 python-dbg 版本和虚拟环境的Python
版本是对应上了,可以继续分析:
(gdb) info threads Id Target Id Frame 25 Thread 0x7f8361666700 (LWP 6325) _PyCode_CheckLineNumber (bounds=<optimized out>, lasti=<optimized out>, co=<optimized out>) at ../Objects/codeobject.c:565 24 Thread 0x7f82f056a700 (LWP 7439) pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238 ...(省略)... 3 Thread 0x7f83161d4700 (LWP 6411) sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 2 Thread 0x7f83159d3700 (LWP 6410) sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 * 1 Thread 0x7f83151d2700 (LWP 6405) sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 |
使用info threads
查看当前进程的线程列表,发现大部分都在wait
信号,只有 25 号线程在做其他事情,切换到 25 号线程,分析调用栈:
(gdb) thread 25 [Switching to thread 25 (Thread 0x7f8361666700 (LWP 6325))] #0 _PyCode_CheckLineNumber (bounds=<optimized out>, lasti=<optimized out>, co=<optimized out>) at ../Objects/codeobject.c:565 565 ../Objects/codeobject.c: 没有那个文件或目录. (gdb) py-bt Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x58: Error occurred in Python command: Cannot access memory at address 0x58 |
使用py-bt
报内存访问错误,只能用原始bt
来分析了(添加full
参数可以看更详细的内容):
(gdb) bt full #0 _PyCode_CheckLineNumber (bounds=<optimized out>, ...) at ../Objects/codeobject.c:565 size = 46 p = 0x7f82acb60f3c "\006\001\006\001\r\001\006\001..." #1 maybe_call_line_trace (instr_prev=<optimized out>, ...) at ../Python/ceval.c:3743 line = 79 #2 PyEval_EvalFrameEx () at ../Python/ceval.c:1050 opcode = 6 #3 0x00000000004c7a59 in PyEval_EvalCodeEx () at ../Python/ceval.c:3265 f = <unknown at remote 0x849d0c8> retval = <code at remote 0x7f835a1b5db0> fastlocals = 0x2e tstate = 0xa2c0a0 u = <unknown at remote 0xf2> #4 0x00000000004cad3b in fast_function (nk=<optimized out>, ...) at ../Python/ceval.c:4129 co = 0x563c10 <trace_trampoline> globals = <unknown at remote 0xf2> argdefs = <unknown at remote 0xf2> #5 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054 func = <function at remote 0x7f835a1dbb90> w = <function at remote 0x7f835a1dbb90> nk = 682340552 n = 4 pfunc = 0x3276ea8 #6 PyEval_EvalFrameEx () at ../Python/ceval.c:2679 sp = 0x3276ec8 opcode = 2 #7 0x00000000004c996a in fast_function (nk=<optimized out>, ...) at ../Python/ceval.c:4119 f = Frame 0x3276cd0, for file ./service/recall/newuser.py, line 49, in get_gametype_anchor_by_sn (...(truncated) tstate = 0xa2c0a0 stack = 0xf2 co = 0x563c10 <trace_trampoline> globals = <unknown at remote 0xf2> argdefs = <unknown at remote 0xf2> #8 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054 func = <function at remote 0x7f835a1dbaa0> w = <function at remote 0x7f835a1dbaa0> nk = 682340552 n = 7 pfunc = 0x5b971d8 ...(省略)... #43 0x00007f83243691b0 in ?? () No symbol table info available. #44 0x0000000000000000 in ?? () No symbol table info available. |
分析Frame # 7
发现当前线程正在执行./service/recall/newuser.py, line 49, in get_gametype_anchor_by_sn
方法。
找到对应的源代码:
def get_predict_gametype_anchor(self, gt_scores, limit): ... while pool_can_use and curr_sample_cnt < sample_cnt: for idx, ent in enumerate(gt_pool): ... |
果然,这里有一个while
循环嵌套了enumerate
调用。通过仔细分析代码,发现在某种情况下确实会出现死循环情况,至此问题解决。
总结
- 遇到线上问题时,优先使用
gcore PID
来保存现场 - 再 使用 strace、ltrace 和
gdb
分析 - 如果没有什么线索,可以尝试 pyrasite-shell 或 lptrace
gdb
调试Python
进程的时候,运行进程的Python
版本和 python-dbg 一定要匹配
注意
当出现死循环或 IO block 导致 Eventlet 卡死时,使用lptrace
或pyrasite-shell
都会出现不响应情况。
使用lptrace
不响应,Ctrl + C 导致进程异常退出
lptrace
上去后一直卡住,没有任何反应,按 CTRL+C 中断后,出问题的进程意外退出:
# lptrace -p 6325 ^CTraceback (most recent call last): File "/usr/local/bin/lptrace", line 115, in <module> main() File "/usr/local/bin/lptrace", line 111, in main strace(pid) File "/usr/local/bin/lptrace", line 49, in strace runfile(pid, trace_code) File "/usr/local/bin/lptrace", line 40, in runfile inject(pid, cmd) File "/usr/local/bin/lptrace", line 84, in inject out, err = p.communicate() File "/usr/lib/python2.7/subprocess.py", line 799, in communicate return self._communicate(input) File "/usr/lib/python2.7/subprocess.py", line 1409, in _communicate stdout, stderr = self._communicate_with_poll(input) File "/usr/lib/python2.7/subprocess.py", line 1463, in _communicate_with_poll ready = poller.poll() KeyboardInterrupt |
在 stderr 日志有以下报错:
# tail -1 /xxx/log/supervisor/service_xx_stderr.log
Fatal Python error: This thread state must be current when releasing
|
使用 pyrasite-shell block 超时
在 stderr 日志有以下报错:
Traceback (most recent call last):
File "<string>", line 165, in run
Exception: pyrasite cannot establish reverse connection to localhost:39651