
Title: crash, kdump's close companion (3)

Author: look_w    Time: 2018-6-18 12:20

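A worked case
As described earlier, when the Linux kernel crashes, a mechanism such as kdump can capture the contents of memory at the time of the crash and write them to a dump file, vmcore. Kernel developers analyze the vmcore to diagnose the cause of the crash and then fix the operating system code. crash is a widely used tool for analyzing kernel crash dump files, and knowing how to use it well matters a great deal when tracking down a problem.
The case used here is a system crash on SLES that I found during real testing work. kdump was already configured and enabled on that system, so after the crash a vmcore file was generated under /var/crash/<that day's date>/. Let's analyze that file.
1. First, start crash
Listing 6. Starting crash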
# crash vmlinux-3.0.8-0.11-ppc64 vmcore

crash 5.1.9
Copyright (C) 2002-2011  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...
     KERNEL: vmlinux-3.0.8-0.11-ppc64         
   DUMPFILE: vmcore
       CPUS: 40
       DATE: Wed Nov 16 20:17:11 2011
     UPTIME: 10:37:23
LOAD AVERAGE: 60.00, 60.00, 60.00
      TASKS: 811
   NODENAME: eellp1
    RELEASE: 3.0.8-0.11-ppc64
    VERSION: #1 SMP Thu Nov 10 16:28:46 UTC 2011 (3cea58b)
    MACHINE: ppc64  (3550 Mhz)
     MEMORY: 4 GB
      PANIC: "Oops: Kernel access of bad area, sig: 11 [#1]" (check log for details)
        PID: 5563
    COMMAND: "sh"
       TASK: c0000000faac3700  [THREAD_INFO: c0000000f8ce0000]
        CPU: 36
      STATE: TASK_RUNNING (PANIC)

crash>




The banner shows that the kernel version is 3.0.8-0.11-ppc64, a development build of SLES 11 SP2.
2. Next, use the bt command to look at the stack
Listing 7. The bt command
crash> bt
PID: 5563   TASK: c0000000faac3700  CPU: 36  COMMAND: "sh"
#0 [c0000000f8ce31b0] .crash_kexec at c0000000001039f8
#1 [c0000000f8ce33b0] .die at c000000000020158
#2 [c0000000f8ce3450] .bad_page_fault at c000000000045004
#3 [c0000000f8ce34d0] handle_page_fault at c000000000005ec8
Data Access error  [300] exception frame:
R0:  0000000000130000    R1:  c0000000f8ce37c0    R2:  c000000000f876d8   
R3:  c000000001224dc8    R4:  0000000000000001    R5:  0000000000000000   
R6:  cfffffffffffffff    R7:  0000000002220000    R8:  2ffffffff1f10000   
R9:  d00000000e0f0000    R10: 0000000000000000    R11: 0000000100000000   
R12: 0000000082002424    R13: c000000001f06c00    R14: 000000001003e270   
R15: 0000000000000001    R16: 0000000000000001    R17: 0000000000000000   
R18: 0000000000000000    R19: c0000000f820b4b8    R20: c0000000f8ce3df8   
R21: c000000000fe2400    R22: 00000fffb53d0000    R23: fffffffffffff000   
R24: 0000000000000400    R25: 000000000000ed99    R26: 0000000000002000   
R27: 0000000000002e58    R28: c000000001224dc8    R29: c000000001224dc0   
R30: c000000000ef2658    R31: c0000000f8ce39a0   
NIP: c000000000255900    MSR: 8000000000009032    OR3: c000000000005278
CTR: c000000000263a08    LR:  c0000000002558dc    XER: 0000000000000001
CCR: 0000000022002444    MQ:  0000000000000001    DAR: 0000000100000008
DSISR: 0000000040000000     Syscall Result: 0000000000000000
.....
#4 [c0000000f8ce37c0] .get_vmalloc_info at c000000000255900
[Link Register ]  [c0000000f8ce37c0] .get_vmalloc_info at c0000000002558dc  (unreliable)
#5 [c0000000f8ce3850] .meminfo_proc_show at c000000000263ad8
#6 [c0000000f8ce3b40] .seq_read at c00000000020aa44
#7 [c0000000f8ce3c30] .proc_reg_read at c000000000258ccc
#8 [c0000000f8ce3ce0] .vfs_read at c0000000001dee60
#9 [c0000000f8ce3d80] .sys_read at c0000000001df06c
#10 [c0000000f8ce3e30] syscall_exit at c0000000000097ec
syscall  [c01] exception frame:
R0:  0000000000000003    R1:  00000ffff3cceb60    R2:  00000fffb5305c40   
R3:  0000000000000008    R4:  00000fffb53d0000    R5:  0000000000000400   
R6:  0000000000000001    R7:  00000fffb5249f88    R8:  800000000200f032   
R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000   
R12: 0000000000000000    R13: 00000fffb50b8110   
NIP: 00000fffb523d0c4    MSR: 800000000200f032    OR3: 0000000000000008
CTR: 00000fffb51dae70    LR:  00000fffb51daeac    XER: 0000000000000001
CCR: 0000000044002422    MQ:  0000000000000001    DAR: 00000fffb51dcd60
DSISR: 0000000040000000     Syscall Result: 00000fffb53d0000
crash>




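3. Frames #0 to #3 are the fault-handling path, so the last call before the crash is "#4 [c0000000f8ce37c0] .get_vmalloc_info at c000000000255900". Now use the dis command to disassemble that address
Listing 8. The dis command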
crash> dis -l c000000000255900
/usr/src/debug/kernel-ppc64-3.0.8/linux-3.0/fs/proc/mmu.c: 47
0xc000000000255900 <.get_vmalloc_info+112>:     ld      r10,8(r11)




4. The disassembly shows that the problem is at line 47 of mmu.c. Open the corresponding location in the Linux source
Listing 9. Linux source
21 void get_vmalloc_info(struct vmalloc_info *vmi)
22 {
23         struct vm_struct *vma;
……
46         for (vma = vmlist; vma; vma = vma->next) {
47                 unsigned long addr = (unsigned long) vma->addr;




Use the struct command to look at the data structure
Listing 10. The struct command
crash> struct -o vm_struct
struct vm_struct {
   [0] struct vm_struct *next;
   [8] void *addr;
  [16] long unsigned int size;
  [24] long unsigned int flags;
  [32] struct page **pages;
  [40] unsigned int nr_pages;
  [48] phys_addr_t phys_addr;
  [56] void *caller;
}
SIZE: 64
crash>




Comparing the source with the disassembly, we find that line 47 of the source corresponds to this instruction:
ld      r10,8(r11)    # load the doubleword at address r11 + 8 into register r10
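5. Since addr sits at offset 8 in vm_struct, r11 should hold a pointer to a vm_struct. Use struct again, with the R11 value from the exception frame, to check
Listing 11. The struct command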
crash> struct vm_struct 0000000100000000
struct: invalid kernel virtual address: 0000000100000000
crash>




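This shows that the contents of r11 have been corrupted and no longer point to a vm_struct.
From the analysis above, we infer the following: at line 46 of mmu.c, vma = vma->next picked up a bad address, which caused addr = (unsigned long) vma->addr on line 47 to trigger the kernel fault. The deeper cause still has to be found by analyzing the code logic to locate the root of this behavior.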



