Postmortem: Tracing a Bug in My Program Through the C++ STL Source

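While working on a program over the past few days I ran into a long-standing bug. It is simple but hard to spot, and I only solved it by reading the underlying library source. This post is a postmortem.

...

The symptom: when the server handled requests, certain values were correct on the first request, but on every later request they came back as the same fixed, non-random value.

This is a fairly common scenario. Some would say the software is carrying state.

Since the first request is handled correctly, the program logic itself must be fine, and the problem must lie in some state flag, or in the lifetime of some value that acts as state.

That line of thinking led the bug hunt straight down the wrong path.

Because it looked like leaked state, I checked and debugged the relevant constructors, destructors, and every piece of code touching object lifetimes, and found nothing wrong.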

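The code is not public, so I will skip all the upper-layer debugging, reasoning, and frustration, show no application code, and go straight to the gdb session that exposes the final problem.
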
[qianzichen@dev ~]$ ps -ef | grep -E '$regex...' | awk '{print $2}'
25497 
[qianzichen@dev ~]$ gdb -p 25497 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 25497
...
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
...
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
...
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f5c157fb700 (LWP 25531)]
[New Thread 0x7f5c161fc700 (LWP 25530)]
[New Thread 0x7f5c16bfd700 (LWP 25529)]
[New Thread 0x7f5c175fe700 (LWP 25528)]
[New Thread 0x7f5c17fff700 (LWP 25527)]
[New Thread 0x7f5c2cdfa700 (LWP 25526)]
[New Thread 0x7f5c2d7fb700 (LWP 25525)]
[New Thread 0x7f5c2e1fc700 (LWP 25524)]
[New Thread 0x7f5c2ebfd700 (LWP 25523)]
[New Thread 0x7f5c2f5fe700 (LWP 25522)]
[New Thread 0x7f5c2ffff700 (LWP 25521)]
[New Thread 0x7f5c48dfa700 (LWP 25520)]
[New Thread 0x7f5c497fb700 (LWP 25519)]
[New Thread 0x7f5c4a1fc700 (LWP 25518)]
[New Thread 0x7f5c4abfd700 (LWP 25517)]
[New Thread 0x7f5c4b5fe700 (LWP 25516)]
[New Thread 0x7f5c4bfff700 (LWP 25515)]
[New Thread 0x7f5c50f73700 (LWP 25514)]
[New Thread 0x7f5c51974700 (LWP 25513)]
[New Thread 0x7f5c5d3d2700 (LWP 25500)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
...
(gdb) b exit
Breakpoint 1 at 0x3ec7a35d40
(gdb) b abort
Breakpoint 2 at 0x3ec7a33f90
(gdb) b src/path/to/target_file/file.cc:...
Breakpoint 3 at 0x7f5c5ed8042d: file src/path/to/target_file/file.cc, line ....
(gdb) c
Continuing.
[Switching to Thread 0x7f5c16bfd700 (LWP 25529)]

Breakpoint 3, (omitted...)
(gdb) p ctx
$1 = {px = 0x7f5c080008e0, pn = {pi_ = 0x7f5c08001430}}
(gdb) p ctx.px.a_member_instance
$2 = {
...
too large to display, omitted...
...
}
(gdb) set print pretty on
(gdb) p ctx.px.dbg_data_
$3 = {
  url_param_string = {
    static npos = 18446744073709551615, 
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x7f5c6beb5578 "zichen"
    }
  }, 
  request = 0x0, 
  search_context = 0x0, 
  xxx = {
...
    yyy = {
...
      }, <No data fields>}, 
...
  }, 
  doc_response_str = {
    static npos = 18446744073709551615, 
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x7f5c6beb5578 "zichen"
    }
  }, 
...
too large to display, omitted...
...
}
(gdb)

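As shown above, whether it is an empty string in a vector, a string in a map, a string defined on the spot, or a string reached through any other container or access path, the _M_p pointer always points to the same address, and the value there is "zichen", the value passed to the server in the first request.

So the problem narrows down to this: c_str() on these strings returns a fixed address holding a fixed value.

Not having studied this properly, I could not see the cause at a glance, so it was time to RTFS (Read The Friendly Source).

Open the C++ standard library source for the compiler version in use: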

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/string
...
// You should have received a copy of the GNU General Public License and
// a copy of the GCC Runtime Library Exception along with this program;
// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
// <http://www.gnu.org/licenses/>.

/** @file include/string
 *  This is a Standard C++ Library header.
 */

//
// ISO C++ 14882: 21  Strings library
//

#ifndef _GLIBCXX_STRING
#define _GLIBCXX_STRING 1

#pragma GCC system_header

#include <bits/c++config.h>
#include <bits/stringfwd.h>
#include <bits/char_traits.h>  // NB: In turn includes stl_algobase.h
#include <bits/allocator.h>
#include <bits/cpp_type_traits.h>
#include <bits/localefwd.h>    // For operators >>, <<, and getline.
#include <bits/ostream_insert.h>
#include <bits/stl_iterator_base_types.h>
#include <bits/stl_iterator_base_funcs.h>
#include <bits/stl_iterator.h>
#include <bits/stl_function.h> // For less
#include <ext/numeric_traits.h>
#include <bits/stl_algobase.h>
#include <bits/range_access.h>
#include <bits/basic_string.h>
#include <bits/basic_string.tcc>
...

Look at stringfwd.h:

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/bits/stringfwd.h
...
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION

  /**
   *  @defgroup strings Strings
   *
   *  @{ 
  */

  template<class _CharT>
    struct char_traits;

  template<typename _CharT, typename _Traits = char_traits<_CharT>,
           typename _Alloc = allocator<_CharT> >
    class basic_string;

  template<> struct char_traits<char>;

  /// A string of @c char
  typedef basic_string<char>    string;   

#ifdef _GLIBCXX_USE_WCHAR_T
  template<> struct char_traits<wchar_t>;

  /// A string of @c wchar_t
  typedef basic_string<wchar_t> wstring;
...

As shown above, string is a typedef for basic_string<char>, and basic_string is a class template.
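The typedef can be checked from user code. The following is a minimal sketch, assuming a C++11 compiler; the helper name string_is_basic_string_char is made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// The standard defines std::string as an alias for std::basic_string<char>,
// so this compile-time check holds on any conforming library.
static_assert(std::is_same<std::string, std::basic_string<char> >::value,
              "std::string must be basic_string<char>");

// Hypothetical helper: exposes the same fact as a run-time value.
inline bool string_is_basic_string_char() {
  return std::is_same<std::string, std::basic_string<char> >::value;
}
```

If the static_assert compiles, the alias is confirmed before the program even runs.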

Now look at the implementation of basic_string:

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h

Find the c_str function:

/**
       *  @brief  Swap contents with another string.
       *  @param __s  String to swap with.
       *
       *  Exchanges the contents of this string with that of @a __s in constant
       *  time.
      */
      void
      swap(basic_string& __s);

      // String operations:
      /**
       *  @brief  Return const pointer to null-terminated contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      c_str() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }

      /**
       *  @brief  Return const pointer to contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      data() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }
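Both c_str() and data() simply return _M_data() here, so they should hand out the same pointer. A small sketch of that observation from user code, assuming C++11 or later (where the two are specified to be equivalent); c_str_matches_data is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: checks that c_str() and data() expose the same
// null-terminated buffer.
inline bool c_str_matches_data(const std::string& s) {
  const char* p = s.c_str();
  return p == s.data()                 // same pointer
      && p[s.size()] == '\0'           // terminator right after the contents
      && std::strlen(p) <= s.size();   // strlen stops at the first '\0'
}
```

This only reads through the pointer, which is the only use the "Do not modify" comment in the header permits.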

Read on:

 private:
      // Data Members (private):
      mutable _Alloc_hider      _M_dataplus;

      _CharT*
      _M_data() const
      { return  _M_dataplus._M_p; }

      _CharT*
      _M_data(_CharT* __p)
      { return (_M_dataplus._M_p = __p); }

So what gets returned is the _M_p member of the _M_dataplus member. Next, find the _Alloc_hider struct.

...
      // Use empty-base optimization: http://www.cantrip.org/emptyopt.html
      struct _Alloc_hider : _Alloc
      {    
        _Alloc_hider(_CharT* __dat, const _Alloc& __a) 
        : _Alloc(__a), _M_p(__dat) { }

        _CharT* _M_p; // The actual data.
      };   

    public:
...

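The __dat parameter of the _Alloc_hider constructor initializes the _M_p member. The _CharT in its type is the character type passed to the basic_string class template when string is instantiated, which is char here.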
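The comment in the header mentions the empty-base optimization: by inheriting from the (usually stateless) allocator instead of storing it as a member, _Alloc_hider stays the size of one pointer. Below is a toy sketch of the idea, not libstdc++ code; EmptyAlloc, HiderByInheritance, and HiderByMember are invented names, and the size equality is what mainstream ABIs produce rather than something this sketch proves in general.

```cpp
#include <cassert>
#include <cstddef>

struct EmptyAlloc {};                     // stand-in for a stateless allocator

struct HiderByInheritance : EmptyAlloc {  // same shape as _Alloc_hider
  char* p;
};

struct HiderByMember {                    // the alternative without the optimization
  EmptyAlloc alloc;                       // occupies at least one byte, plus padding
  char* p;
};

// Hypothetical helper: true when the empty base costs nothing while the
// empty member does.
inline bool ebo_saves_space() {
  return sizeof(HiderByInheritance) == sizeof(char*)
      && sizeof(HiderByMember) > sizeof(char*);
}
```

This is why a std::string object in this implementation is just one pointer wide: everything else lives behind _M_p.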

Now look at the basic_string constructors:

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ }
#endif
...

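The default constructor has two possible member initializers, selected by the preprocessor. Which one does the current environment use? Determining the value of _GLIBCXX_FULLY_DYNAMIC_STRING directly is not trivial. I was getting tired, so I switched to an elegant (lazy) approach and edited the header itself as shown below: put a silly token that no sane compiler setup would define, such as heihei (hehe...), into one branch of the preprocessor conditional.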

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ heihei }
#endif
...

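Next, write a standalone test. It can be as simple as something that only involves string, but compilation must get as far as actually parsing the header (running the preprocessor alone does not prove that this code is compiled).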

[qianzichen@dev ~]$ cat heihei.cc 
#include <string>
[qianzichen@dev ~]$

As shown above, the file contains a single line. Now compile it.

[qianzichen@dev ~]$ /usr/local/gcc-4.8.5/bin/g++ heihei.cc 
/usr/lib/../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
collect2: error: ld returned 1 exit status
[qianzichen@dev ~]$

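The only error is the linker complaining about the missing main; the compiler never saw the heihei in the #else branch. So the initializer in use is the first one, under #if.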

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ }
#endif
...

If in doubt, verify the other branch as well: change the source as shown above and compile again.

[qianzichen@dev ~]$ /usr/local/gcc-4.8.5/bin/g++ heihei.cc 
In file included from /usr/local/gcc-4.8.5/include/c++/4.8.5/string:52:0,
                 from heihei.cc:1:
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h: In constructor ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string()’:
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h:439:62: error: ‘heihei’ was not declared in this scope
       : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
                                                              ^
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h:439:69: error: expected ‘;’ before ‘}’ token
       : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
                                                                     ^
[qianzichen@dev ~]$

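As shown above, this time the error is reported inside the library header.

This confirms that in this environment the basic_string default constructor uses the first, simpler initializer.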
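The same question can also be asked from user code, without editing the installed headers. The sketch below is only an alternative I would try, not what was done in this investigation. It relies on the preprocessor rule that an undefined identifier inside #if evaluates to 0, and default_ctor_branch is a made-up name. Its answer is only meaningful for the libstdc++ reference-counted string shown in this post; on other standard libraries, or with a different string implementation, the macro says nothing.

```cpp
#include <cassert>
#include <cstring>
#include <string>   // pulls in the library's configuration macros, if any

// Hypothetical helper: mirrors the #if test in basic_string.h.
// An identifier that is not defined evaluates to 0 inside #if.
inline const char* default_ctor_branch() {
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
  return "shared empty rep";
#else
  return "fully dynamic";
#endif
}
```

Printing the returned text from a one-line main gives the answer without any risk of leaving a broken header behind.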

_S_empty_rep()._M_refdata() is the __dat argument mentioned earlier.

Look at the _S_empty_rep function:

...
void
      _M_leak_hard();

      static _Rep&
      _S_empty_rep()
      { return _Rep::_S_empty_rep(); }

    public:
...

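It returns a reference to a _Rep instance in static storage, namely the return value of the static member function _Rep::_S_empty_rep.

Go straight to the _Rep struct: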

...
      struct _Rep : _Rep_base
      {
        // Types:
        typedef typename _Alloc::template rebind<char>::other _Raw_bytes_alloc;

        // (Public) Data members:

        // The maximum number of individual char_type elements of an
...
      static _Rep&
        _S_empty_rep()
        {
          // NB: Mild hack to avoid strict-aliasing warnings.  Note that
          // _S_empty_rep_storage is never modified and the punning should
          // be reasonably safe in this case.
          void* __p = reinterpret_cast<void*>(&_S_empty_rep_storage);
          return *reinterpret_cast<_Rep*>(__p);
        }

        bool
        _M_is_leaked() const
        { return this->_M_refcount < 0; }
...

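As we can see, the static member function _S_empty_rep returns a reference to a _Rep instance that lives in static storage.

Here a gentle hacker applied a gentle hack to silence the compiler's strict-aliasing warnings.

reinterpret_cast reinterprets the bit pattern of its operand at a low level. The type changes, yet the compiler issues no warning or other diagnostic: when _S_empty_rep returns a reference built from the address of _S_empty_rep_storage, the cast explicitly asserts that the conversion is legitimate. Whoever uses the returned reference treats the object as a _Rep.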

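An old-style cast such as

char *pc = (char *)ip;

has the same effect as reinterpret_cast when ip points to an unrelated type such as int. The old-style cast in the minimal reproduction at the end of this post, (char *)test1.c_str(), only strips const, so there it behaves as a const_cast.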
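The two roles an old-style cast can play are easy to compare side by side. A minimal sketch with made-up helper names; it only compares the resulting pointers and never writes through them.

```cpp
#include <cassert>

// Hypothetical helper: between unrelated pointer types, an old-style cast
// performs the same conversion as reinterpret_cast.
inline bool old_style_equals_reinterpret(int* ip) {
  char* a = (char*)ip;
  char* b = reinterpret_cast<char*>(ip);
  return a == b;
}

// Hypothetical helper: when only constness differs, the same syntax acts as
// const_cast. This is the cast applied to the result of c_str() in the
// reproduction at the end of the post.
inline bool old_style_equals_const_cast(const char* cp) {
  char* a = (char*)cp;
  char* b = const_cast<char*>(cp);
  return a == b;
}
```

Because the old-style syntax silently picks whichever conversion compiles, it hides the fact that const is being removed, which is exactly what made the offending write possible.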

The returned address is that of _S_empty_rep_storage. Search for that symbol:

...
        // m = ((npos - sizeof(_Rep))/sizeof(CharT)) - 1
        // In addition, this implementation quarters this amount.
        static const size_type  _S_max_size;
        static const _CharT     _S_terminal;

        // The following storage is init'd to 0 by the linker, resulting
        // (carefully) in an empty string with one reference.
        static size_type _S_empty_rep_storage[];
...

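It is a static array member, independent of any particular string instance, and according to the comment its storage is initialized to 0 by the linker.

This explains why c_str() on these strings returns a fixed address holding a fixed value: every default-constructed (empty) string points into this one shared buffer.

Somewhere in the program, something must have written to that address and polluted this memory.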
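The mechanism can be modelled in a few lines. The class below is a toy, not libstdc++ code: TinyString, shared_empty_, and pollute_and_observe are invented names, and the shared buffer is deliberately a plain non-const char array so that the write in this model is well defined (writing through c_str() of a real std::string is not).

```cpp
#include <cassert>
#include <cstring>

// Toy model: every default-constructed TinyString points at one
// zero-initialized static buffer, like _S_empty_rep_storage.
class TinyString {
 public:
  TinyString() : p_(shared_empty_) {}
  const char* c_str() const { return p_; }
 private:
  static char shared_empty_[16];
  char* p_;
};
char TinyString::shared_empty_[16];  // static storage duration: zero-initialized

// Hypothetical helper: scribble through one instance's c_str() and return
// what a second, freshly constructed instance now reports.
inline const char* pollute_and_observe() {
  TinyString first;
  std::strcpy(const_cast<char*>(first.c_str()), "hug you");  // 8 bytes, fits in 16
  TinyString second;
  return second.c_str();  // points into the static buffer, so it outlives 'second'
}
```

One write through one "empty" object changes what every other empty object sees, for the rest of the process lifetime. That matches the symptom: correct on the first request, fixed value afterwards.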

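With the cause identified, I went back to hunting for the bug in the server.

I defined a throwaway string and used it to bisect my code for the faulty region. That narrowed it down to somewhere after the summary request, so I went into the summary module and kept searching... and finally found it in a serialization routine that took the c_str() address of a string and wrote through it. The author presumably wanted to reuse that buffer directly.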
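The throwaway string used for bisecting can be wrapped into a small probe. This is a sketch under the assumption that the corruption looks like the one in this post; empty_string_is_clean is a hypothetical name. In the layout shown above the length is stored in the header in front of the characters, so empty() may still report true after the pollution, and it is the first-character check that catches it.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper for bisecting: a freshly default-constructed
// std::string must be empty and its c_str() must start with '\0'.
inline bool empty_string_is_clean() {
  std::string probe;
  return probe.empty() && probe.c_str()[0] == '\0';
}
```

Dropping a call such as assert(empty_string_is_clean()) at successive checkpoints halves the suspect region each time, until the write that flips the result is found.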

After switching to a buffer owned by the program, the problem was solved.

Minimal reproduction:
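The real serialization code is not public, so the following only illustrates the shape of such a fix under my own assumptions: serialize_to_string and its payload parameter are hypothetical. The point is that all writes go into storage the function owns, and the string is built from that storage afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the serialization step: copy the payload into a
// buffer this function owns, then build the result string from that buffer.
inline std::string serialize_to_string(const char* payload) {
  const std::size_t len = std::strlen(payload);
  std::vector<char> buffer(len + 1);              // owned, writable storage
  std::memcpy(buffer.data(), payload, len + 1);   // includes the terminator
  return std::string(buffer.data(), len);
}
```

Nothing here ever writes through a pointer obtained from c_str(), so the shared empty representation stays untouched.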

[qianzichen@dev ~]$ vi heihei.cc
#include <string>
#include <iostream>

#include <string.h>

int main() {
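  std::string test1;
  // Undefined behavior, on purpose: cast away const and write into the
  // buffer returned by c_str() of an empty string.
  char *ptest1 = (char *)test1.c_str();

  strncpy(ptest1, "hug you", 8);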
  std::cout << " ptest1 = " << ptest1 << std::endl;

  std::string test2;
  const char *ptest2 = test2.c_str();

  std::string test3;
  const char *ptest3 = test3.c_str();

  std::cout << " ptest2 = " << ptest2 << std::endl;
  std::cout << " ptest3 = " << ptest3 << std::endl;

  std::cout << " address of ptest1 = " << (unsigned long)ptest1 << std::endl;
  std::cout << " address of ptest2 = " << (unsigned long)ptest2 << std::endl;
  std::cout << " address of ptest3 = " << (unsigned long)ptest3 << std::endl;

  return 0;
}

Run it:

[qianzichen@dev ~]$ ./a.out 
 ptest1 = hug you
 ptest2 = hug you
 ptest3 = hug you
 address of ptest1 = 261138363096
 address of ptest2 = 261138363096
 address of ptest3 = 261138363096
[qianzichen@dev ~]$

The output makes it plain: same address, same value.


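Looking back over the whole debugging process, here is what deserves reflection.

First, confirm whether the premise "the software behaves correctly the first time" is actually true. Otherwise you start off in the wrong direction and end up in a dead end.

During development I considered a workaround to sidestep the problem, but leaving the core issue unsolved does not work, and the workaround may well cost more time, all the more so in high-performance, high-concurrency scenarios. You have to get to the essence of the matter: "I will not return until Loulan is conquered."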


Linkerist
January 24, 2019, Jiuxianqiao
