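In an interview yesterday I was asked: "If a program you wrote hits something like a segmentation fault while running on Linux, how would you track it down?" I figured this was a question about core dumps and rattled off an answer. The interviewer followed up: "And if you forgot the -g debug option when compiling, how would you locate the segfault then?" Er, that one stumped me. I gave a hasty answer, but I kept thinking: even without debug information, as long as we have the memory address where things went wrong, there should be a way to locate the error. Is that guess right? And if so, how exactly is it done?
First, I'll borrow the segfault example from the CoolShell (酷壳) article "C语言结构体里的成员数组和指针" (member arrays and pointers in C structs):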
root@k8s:~/test# gcc -o crash_noDebug crash.c
crash.c: In function ‘main’:
crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
printf(f.a->s);
^
root@k8s:~/test# ulimit -c
0
root@k8s:~/test# ulimit -c unlimited
root@k8s:~/test# ulimit -c
unlimited
root@k8s:~/test# ./crash_noDebug
Segmentation fault (core dumped)
root@k8s:~/test# gdb crash_noDebug core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from crash_noDebug...(no debugging symbols found)...done.
[New LWP 18077]
Core was generated by `./crash_noDebug'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 strchrnul () at ../sysdeps/x86_64/strchr.S:32
32 ../sysdeps/x86_64/strchr.S: No such file or directory.
(gdb) where
#0 strchrnul () at ../sysdeps/x86_64/strchr.S:32
#1 0x00007f1ce8cb2208 in __find_specmb (format=0x4 <error: Cannot access memory at address 0x4>) at printf-parse.h:108
#2 _IO_vfprintf_internal (s=0x7f1ce902a620 <_IO_2_1_stdout_>, format=0x4 <error: Cannot access memory at address 0x4>,
ap=ap@entry=0x7ffccc8407e8) at vfprintf.c:1312
#3 0x00007f1ce8cba899 in __printf (format=<optimized out>) at printf.c:33
#4 0x000000000040055f in main ()
(gdb)
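There is no debug information to map the crash directly to a source line, but the address in the last frame, #4 0x000000000040055f, should still be valuable. I made a tentative assumption: whether or not -g is used does not affect where the code ends up in memory. If that holds, I only need to recompile the program with -g and then find some way to map 0x000000000040055f to its source information, and the problem is solved. To test the guess, I produced another core dump, this time from a build with debug information.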
root@k8s:~/test# gcc -o crash_withDebug -g crash.c
crash.c: In function ‘main’:
crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
printf(f.a->s);
^
root@k8s:~/test# ./crash_withDebug
Segmentation fault (core dumped)
root@k8s:~/test# gdb crash_withDebug core
Reading symbols from crash_withDebug...done.
[New LWP 23657]
Core was generated by `./crash_withDebug'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 strchrnul () at ../sysdeps/x86_64/strchr.S:32
32 ../sysdeps/x86_64/strchr.S: No such file or directory.
(gdb) where
#0 strchrnul () at ../sysdeps/x86_64/strchr.S:32
#1 0x00007ffabb54e208 in __find_specmb (format=0x4 <error: Cannot access memory at address 0x4>) at printf-parse.h:108
#2 _IO_vfprintf_internal (s=0x7ffabb8c6620 <_IO_2_1_stdout_>, format=0x4 <error: Cannot access memory at address 0x4>,
ap=ap@entry=0x7ffe3a0c4708) at vfprintf.c:1312
#3 0x00007ffabb556899 in __printf (format=<optimized out>) at printf.c:33
#4 0x000000000040055f in main (argc=1, argv=0x7ffe3a0c48e8) at crash.c:15
(gdb)
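gdb now places main's frame at line 15 of crash.c, and the address is again 0x000000000040055f, which indirectly supports the earlier assumption. Of course, in a real project a core dump is rarely this easy to reproduce, so we can't count on rebuilding with -g, re-running to get a fresh core dump, and debugging that one instead. So if the address 0x000000000040055f is all we have, how do we map it to a source location? I'll leave the question here for now. I want to study and organize what I know about compilation, memory, and related topics systematically, and the answer should fall out naturally along the way.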
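Compiling a C program goes through four stages: preprocessing, compilation (C to assembly), assembly, and linking.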
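So here is the question: which of these stages does gcc's -g option act on? Once you know what the debug information actually is, it is not hard to infer that -g acts on the compilation stage (C to assembly), because the debug information is simply extra directives and sections added to the generated assembly. Strictly speaking this is debug information rather than the symbol table: the symbol table is present even without -g, which is why gdb could still print main () for frame #4 in the first session.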
root@k8s:~/test# gcc -S crash.i -o crash_noDebug.S
crash.c: In function ‘main’:
crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
printf(f.a->s);
^
root@k8s:~/test# gcc -S crash.i -g -o crash_withDebug.S
crash.c: In function ‘main’:
crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
printf(f.a->s);
^
root@k8s:~/test# diff crash_noDebug.S crash_withDebug.S
2a3
> .Ltext0:
6a8,9
> .file 1 "crash.c"
> .loc 1 12 0
15a19
> .loc 1 13 0
16a21
> .loc 1 14 0
20a26
> .loc 1 15 0
26a33
> .loc 1 17 0
27a35
> .loc 1 18 0
33a42,380
> .Letext0:
> .section .debug_info,"",@progbits
> .Ldebug_info0:
...(omitted)...
> .section .debug_line,"",@progbits
> .Ldebug_line0:
> .section .debug_str,"MS",@progbits,1
> .LASF3:
> .string "unsigned int"
> .LASF13:
> .string "/root/test"
> .LASF0:
> .string "long unsigned int"
> .LASF8:
> .string "char"
> .LASF12:
> .string "crash.c"
> .LASF1:
> .string "unsigned char"
> .LASF14:
> .string "main"
> .LASF6:
> .string "long int"
> .LASF9:
> .string "argc"
> .LASF11:
> .string "GNU C11 5.4.0 20160609 -mtune=generic -march=x86-64 -g -fstack-protector-strong"
> .LASF2:
> .string "short unsigned int"
> .LASF4:
> .string "signed char"
> .LASF5:
> .string "short int"
> .LASF7:
> .string "sizetype"
> .LASF10:
> .string "argv"
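What remains is to use Linux binary analysis tools to find which piece of code the address 0x000000000040055f corresponds to. Commonly used tools include nm, objdump, and readelf:
nm: dedicated to listing the symbols in a binary. It only gives each symbol's start address, so it cannot show what sits at the target address. Pass.
objdump: displays all kinds of information about object files; for example, it can disassemble the .text section. Exactly what we need.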
root@k8s:~/test# objdump -d -j .text crash_withDebug
crash_withDebug: file format elf64-x86-64
Disassembly of section .text:
0000000000400526 <main>:
400526: 55 push %rbp
400527: 48 89 e5 mov %rsp,%rbp
40052a: 48 83 ec 20 sub $0x20,%rsp
40052e: 89 7d ec mov %edi,-0x14(%rbp)
400531: 48 89 75 e0 mov %rsi,-0x20(%rbp)
400535: 48 c7 45 f0 00 00 00 movq $0x0,-0x10(%rbp)
40053c: 00
40053d: 48 8b 45 f0 mov -0x10(%rbp),%rax
400541: 48 83 c0 04 add $0x4,%rax
400545: 48 85 c0 test %rax,%rax
400548: 74 15 je 40055f <main+0x39>
40054a: 48 8b 45 f0 mov -0x10(%rbp),%rax
40054e: 48 83 c0 04 add $0x4,%rax
400552: 48 89 c7 mov %rax,%rdi
400555: b8 00 00 00 00 mov $0x0,%eax
40055a: e8 a1 fe ff ff callq 400400 <printf@plt>
40055f: b8 00 00 00 00 mov $0x0,%eax
400564: c9 leaveq
400565: c3 retq
400566: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
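readelf: displays information about ELF files (executables, .o object files, shared libraries, and core dump files). I tried printing the symbols with -s, but just like nm it cannot pin down what is at the target address. Pass.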
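I regret not asking the interviewer what the expected answer was. Judging from what I have found so far:
·Determine the faulting address from the original core dump
·Recompile the executable with -g
·Use objdump to see how symbols and addresses in the code section correspond
This approach can roughly locate where the error occurred. Whether there is a more convenient or more efficient method, I honestly don't know. It's a long road ahead...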