
Title: A Simple Method for Analyzing Deadlocks on Linux (2)

Author: look_w    Posted: 2018-4-22 16:15

Listing 4. Output of the first pstack run against the deadlocked process (pstack <PID>)
[dyu@xilinuxbldsrv purify]$ pstack 6721
Thread 5 (Thread 0x41e37940 (LWP 6722)):
#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d1a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003d1a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400a9b in func1() ()
#4  0x0000000000400ad7 in thread1(void*) ()
#5  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x42838940 (LWP 6723)):
#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d1a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003d1a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400a17 in func2() ()
#4  0x0000000000400a53 in thread2(void*) ()
#5  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x43239940 (LWP 6724)):
#0  0x0000003d19c9a541 in nanosleep () from /lib64/libc.so.6
#1  0x0000003d19c9a364 in sleep () from /lib64/libc.so.6
#2  0x00000000004009bc in thread3(void*) ()
#3  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x43c3a940 (LWP 6725)):
#0  0x0000003d19c9a541 in nanosleep () from /lib64/libc.so.6
#1  0x0000003d19c9a364 in sleep () from /lib64/libc.so.6
#2  0x0000000000400976 in thread4(void*) ()
#3  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b984ecabd90 (LWP 6721)):
#0  0x0000003d1a807b35 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000400900 in main ()




Listing 5. Output of the second pstack run against the deadlocked process (pstack <PID>)
[dyu@xilinuxbldsrv purify]$ pstack 6721
Thread 5 (Thread 0x40bd6940 (LWP 6722)):
#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d1a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003d1a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400a87 in func1() ()
#4  0x0000000000400ac3 in thread1(void*) ()
#5  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x415d7940 (LWP 6723)):
#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d1a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003d1a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400a03 in func2() ()
#4  0x0000000000400a3f in thread2(void*) ()
#5  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x41fd8940 (LWP 6724)):
#0  0x0000003d19c7aec2 in memset () from /lib64/libc.so.6
#1  0x00000000004009be in thread3(void*) ()
#2  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x429d9940 (LWP 6725)):
#0  0x0000003d19c7ae0d in memset () from /lib64/libc.so.6
#1  0x0000000000400982 in thread4(void*) ()
#2  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003d19cd40cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2af906fd9d90 (LWP 6721)):
#0  0x0000003d1a807b35 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000400900 in main ()




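Inspect the process's call stacks several times in a row. While the process is hung, run pstack repeatedly to capture its function call stacks. A deadlocked thread stays in the lock-waiting state the whole time, so by comparing the successive outputs you can identify which two (or more) threads never change and remain waiting for a lock.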
Output analysis:
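Comparing the two outputs shows that threads 2 and 3 moved from sleep in the first pstack output to memset in the second. Threads 4 and 5, however, stayed in the lock-waiting state (pthread_mutex_lock) in both outputs. Thread 1 (main) is also unchanged, but it is only waiting in pthread_join for the worker threads, not for a lock. We can therefore infer that threads 4 and 5 are deadlocked.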
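The snapshot comparison described above can be partly automated. The following C++ sketch was written for this explanation and is not part of the original article. It parses pstack text in the layout shown in Listings 4 and 5 and reports the LWPs whose innermost frame is __lll_lock_wait and whose stack is identical in every snapshot. The helper names parse_pstack and stuck_on_lock are hypothetical, and the parser assumes exactly the "Thread N (... (LWP NNNN)):" header and "#k  0xADDR in FUNC ..." frame format shown above; other pstack or gdb versions may print differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Parse pstack text into LWP -> function names, innermost frame first.
// Addresses are dropped; only the name between " in " and " (" is kept.
std::map<long, std::vector<std::string>> parse_pstack(const std::string& text) {
    std::map<long, std::vector<std::string>> stacks;
    std::istringstream in(text);
    std::string line;
    long current = -1;  // LWP of the thread whose frames we are reading
    while (std::getline(in, line)) {
        std::size_t lwp = line.find("(LWP ");
        if (line.rfind("Thread ", 0) == 0 && lwp != std::string::npos) {
            current = std::stol(line.substr(lwp + 5));  // stops at ")"
            stacks[current];                            // start an empty stack
        } else if (current != -1 && line.rfind("#", 0) == 0) {
            std::size_t in_pos = line.find(" in ");
            if (in_pos == std::string::npos) continue;
            std::size_t start = in_pos + 4;
            std::size_t end = line.find(" (", start);
            stacks[current].push_back(line.substr(start, end - start));
        }
    }
    return stacks;
}

// Deadlock suspects: threads blocked in __lll_lock_wait whose stack
// (function names only) is the same in every snapshot.
std::set<long> stuck_on_lock(const std::vector<std::string>& snapshots) {
    assert(!snapshots.empty());
    std::vector<std::map<long, std::vector<std::string>>> parsed;
    for (const auto& s : snapshots) parsed.push_back(parse_pstack(s));

    std::set<long> suspects;
    for (const auto& [lwp, stack] : parsed.front()) {
        if (stack.empty() || stack.front() != "__lll_lock_wait") continue;
        bool unchanged = true;
        for (std::size_t i = 1; i < parsed.size() && unchanged; ++i) {
            auto it = parsed[i].find(lwp);
            unchanged = (it != parsed[i].end() && it->second == stack);
        }
        if (unchanged) suspects.insert(lwp);
    }
    return suspects;
}
```

Frame addresses are deliberately ignored. The func1() frames in Listings 4 and 5 carry different return addresses (0x400a9b and 0x400a87) even though the thread is blocked at the same call, so comparing raw lines would miss the match.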
Examining the threads with gdb:
Listing 6. Then attach gdb to the deadlocked process
(gdb) info thread
  5 Thread 0x41e37940 (LWP 6722)  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  4 Thread 0x42838940 (LWP 6723)  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  3 Thread 0x43239940 (LWP 6724)  0x0000003d19c9a541 in nanosleep () from /lib64/libc.so.6
  2 Thread 0x43c3a940 (LWP 6725)  0x0000003d19c9a541 in nanosleep () from /lib64/libc.so.6
* 1 Thread 0x2b984ecabd90 (LWP 6721)  0x0000003d1a807b35 in pthread_join () from /lib64/libpthread.so.0




Listing 7. Output after switching to thread 5
(gdb) thread 5
[Switching to thread 5 (Thread 0x41e37940 (LWP 6722))]#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d1a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003d1a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000400a9b in func1 () at lock.cpp:18
#4  0x0000000000400ad7 in thread1 (arg=0x0) at lock.cpp:43
#5  0x0000003d1a80673d in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d19cd40cd in clone () from /lib64/libc.so.6




Listing 8. Output for threads 4 and 5
(gdb) f 3
#3  0x0000000000400a9b in func1 () at lock.cpp:18
18          pthread_mutex_lock(&mutex2);
(gdb) thread 4
[Switching to thread 4 (Thread 0x42838940 (LWP 6723))]#0  0x0000003d1a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) f 3
#3  0x0000000000400a17 in func2 () at lock.cpp:31
31          pthread_mutex_lock(&mutex1);
(gdb) p mutex1
$1 = {__data = {__lock = 2, __count = 0, __owner = 6722, __nusers = 1, __kind = 0,
__spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000B\032\000\000\001", '\000'
<repeats 26 times>, __align = 2}
(gdb) p mutex3
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}
(gdb) p mutex2
$3 = {__data = {__lock = 2, __count = 0, __owner = 6723, __nusers = 1,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000C\032\000\000\001", '\000'
<repeats 26 times>, __align = 2}
(gdb)
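The __owner values can be cross-checked against the raw __size bytes that gdb prints beside them. gdb shows the 40-byte array as a C string with octal escapes, so 'B' is 0x42 and \032 is 0x1A. This assumes the x86-64 glibc layout, in which __lock, __count and __owner are consecutive 4-byte integers at byte offsets 0, 4 and 8, stored little-endian. Under that assumption, bytes 8..11 of mutex1 (B, \032, \000, \000) give 0x1A42 = 6722, matching __owner = 6722, and mutex2's 'C' gives 0x1A43 = 6723. The C++ sketch below performs the same decoding. The helper names are hypothetical, and the code illustrates the arithmetic only; it is not a gdb feature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Read a little-endian 32-bit signed integer starting at byte `offset`.
std::int32_t read_le32(const std::string& bytes, std::size_t offset) {
    assert(bytes.size() >= offset + 4);
    std::uint32_t value = 0;
    for (int i = 3; i >= 0; --i) {  // most significant byte is stored last
        value = (value << 8) | static_cast<unsigned char>(bytes[offset + i]);
    }
    return static_cast<std::int32_t>(value);
}

// Assumed x86-64 glibc pthread_mutex_t layout (not portable):
// int __lock @0, unsigned __count @4, int __owner @8, unsigned __nusers @12.
std::int32_t mutex_lock_word(const std::string& raw) { return read_le32(raw, 0); }
std::int32_t mutex_owner(const std::string& raw) { return read_le32(raw, 8); }
```

To use it, copy the escaped __size string from gdb into a std::string of length 40 and call mutex_owner on it.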




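The output shows that thread 4 is trying to acquire mutex1, but mutex1 is already held by the thread whose LWP is 6722 (__owner = 6722). Thread 5 is trying to acquire mutex2, but mutex2 is already held by the thread whose LWP is 6723 (__owner = 6723). mutex3, by contrast, is free (__lock = 0, __owner = 0). The pstack output shows that LWP 6722 is thread 5 and LWP 6723 is thread 4. We can therefore conclude that threads 4 and 5 are deadlocked by cross-held locks: each holds the mutex the other is waiting for. The thread source code confirms that threads 4 and 5 both use mutex1 and mutex2 and request them in an inconsistent order. Thread 5 holds mutex1 and requests mutex2, while thread 4 holds mutex2 and requests mutex1.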
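The reasoning above amounts to finding a cycle in a wait-for graph. Each blocked thread points to the mutex it is requesting, and each mutex points to the thread in its __owner field. The C++ sketch below shows that check. It assumes the "waits for" and "owner" relations have already been collected by hand from gdb as in Listing 8. The struct and function names are hypothetical and do not come from the original lock.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Relations read off a gdb session (hypothetical container).
struct LockSnapshot {
    std::map<long, std::string> waits_for;  // blocked LWP -> mutex it requests
    std::map<std::string, long> owner;      // mutex -> __owner LWP (0 = free)
};

// Follow LWP -> requested mutex -> owning LWP -> ... from each blocked thread.
// Returns the LWPs on the first wait cycle found, or an empty vector.
std::vector<long> find_deadlock_cycle(const LockSnapshot& s) {
    for (const auto& entry : s.waits_for) {
        std::vector<long> path;
        std::set<long> seen;
        long cur = entry.first;
        while (true) {
            if (seen.count(cur)) {
                // cur was visited on this walk: the cycle starts at cur.
                auto it = std::find(path.begin(), path.end(), cur);
                assert(it != path.end());
                return std::vector<long>(it, path.end());
            }
            seen.insert(cur);
            path.push_back(cur);
            auto w = s.waits_for.find(cur);
            if (w == s.waits_for.end()) break;                // cur is not blocked
            auto o = s.owner.find(w->second);
            if (o == s.owner.end() || o->second == 0) break;  // mutex is free
            cur = o->second;
        }
    }
    return {};
}
```

With the values from Listing 8 (6723 waits for mutex1, 6722 waits for mutex2, mutex1 owned by 6722, mutex2 owned by 6723), the function returns LWPs 6722 and 6723, the same pair identified by hand.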
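Summary
This article has briefly introduced one method for analyzing deadlocks on Linux. It should help with diagnosing some deadlock problems. Once you understand why deadlocks occur, and in particular the four necessary conditions for deadlock (mutual exclusion, hold and wait, no preemption, and circular wait), you can avoid, prevent, and break deadlocks as far as possible. In system design, process scheduling, and related areas, therefore, make sure these four conditions cannot all hold at once. Choose a sound resource-allocation algorithm, and do not let processes occupy system resources indefinitely. Also prevent a process from holding resources while it is waiting. While the system is running, dynamically check every resource request that the system is able to satisfy, and decide from the result whether to grant it. If granting it could lead to deadlock, refuse it; otherwise, grant it. In short, resource allocation must be planned carefully. The ordered resource allocation method and the banker's algorithm are effective ways to avoid deadlock.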
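The ordered resource allocation method directly remedies the cross-locking found in this article. If every thread acquires mutex1 and mutex2 in the same global order, the wait cycle between threads 4 and 5 cannot form. The C++ sketch below uses std::mutex rather than the article's pthread_mutex_t and fixes the order by comparing addresses. The names lock_pair_ordered and run_demo are hypothetical, and the two workers only imitate func1/func2 by naming the mutexes in opposite orders.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Lock two distinct mutexes in one global order (lower address first),
// regardless of the order in which the caller names them.
void lock_pair_ordered(std::mutex& a, std::mutex& b) {
    assert(&a != &b);  // locking the same std::mutex twice would self-deadlock
    std::mutex* first = &a;
    std::mutex* second = &b;
    if (std::less<std::mutex*>{}(second, first)) std::swap(first, second);
    first->lock();
    second->lock();
}

// Two workers name the mutexes in opposite orders, like the deadlocked
// func1/func2. Both go through lock_pair_ordered, so they actually acquire
// the mutexes in the same order and no wait cycle can form.
int run_demo(int iterations) {
    std::mutex mutex1, mutex2;
    int counter = 0;  // shared state, modified only while both mutexes are held
    auto worker = [&](std::mutex& x, std::mutex& y) {
        for (int i = 0; i < iterations; ++i) {
            lock_pair_ordered(x, y);
            ++counter;
            y.unlock();  // unlock order does not affect deadlock freedom
            x.unlock();
        }
    };
    std::thread t1(worker, std::ref(mutex1), std::ref(mutex2));
    std::thread t2(worker, std::ref(mutex2), std::ref(mutex1));
    t1.join();
    t2.join();
    return counter;
}
```

In new C++ code, C++17's std::scoped_lock (like std::lock) reaches the same goal with a built-in deadlock-avoidance algorithm, so `std::scoped_lock guard(x, y);` is usually the simpler choice. The explicit version is shown because it maps directly onto a pair of pthread_mutex_lock calls.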




Source: ECCN Electronics Technology Forum (http://bbs.eccn.com/)