Application and System Stability, Part 7: Using ASan to Fix Hard NDK Crashes Early

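Background: our backend crash monitoring reports many crashes that land inside system libraries. Open one of them and the backtrace is of little use, because every frame is in a system library. With only this information the problem cannot be solved. This article discusses how ASan helps with this class of problem.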

Process Name: 'com.antfortune.wealth'
Thread Name: 'tfortune.wealth'
signal 11 (SIGSEGV)  code 1 (SEGV_MAPERR)  fault addr fa882b80
#00  pc 00056d06  /system/lib/libc.so (arena_run_reg_alloc)
#01  pc 00056b3c  /system/lib/libc.so (je_arena_tcache_fill_small)
#02  pc 000769d0  /system/lib/libc.so (je_tcache_alloc_small_hard)

I. What is ASan?

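AddressSanitizer (ASan) is a fast compiler-based tool for detecting memory errors in native code. Android supports both regular ASan and hardware-assisted ASan (HWASan). HWASan is based on memory tagging and is only available on AArch64, because it relies on the Top-Byte-Ignore feature. These tools detect: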

  • Stack and heap buffer overflow/underflow.
  • Heap use after free.
  • Stack use outside scope.
  • Stack use after return (on Android devices, only with HWASan).
  • Double free/wild free.

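For details, see the official documentation: https://source.android.com/devices/tech/debug/asan https://github.com/google/sanitizers
Basic principle: the full algorithm is described in the wiki; here is a brief introduction. AddressSanitizer has two main parts: instrumentation and a run-time library. Instrumentation happens at the LLVM compiler level, where memory-access operations (store, load, alloca, and so on) are rewritten so that they are checked. The run-time library provides the more complex runtime functionality (such as poisoning/unpoisoning shadow memory) and hooks allocation functions such as malloc and free. The idea is simple: to catch buffer overflows, add a region (RedZone) to the right of each memory block (or on both sides, which catches both overflow and underflow), and mark the RedZone's Shadow Memory as poisoned so that any access to the RedZone is reported. The figure below illustrates this.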

[Figure: RedZones around a memory region and the corresponding Shadow Memory]

https://blog.csdn.net/pang241/article/details/76137969
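
The RedZone and Shadow Memory idea can be sketched with a toy model. This is a minimal illustration, not the real ASan runtime: the ToyHeap class, its arena offsets, and the 32-byte redzone are assumptions introduced here. It does follow the documented ASan encoding, in which one shadow byte describes 8 application bytes (0 means all 8 are addressable, k in 1..7 means only the first k are, and a negative value means poisoned). The real runtime finds the shadow byte at (addr >> 3) plus a platform-specific offset; the toy model uses arena offsets, so that offset is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: ToyHeap, its arena offsets, and the 32-byte redzone are
// assumptions made for illustration, not the real ASan runtime.
class ToyHeap {
public:
    static constexpr std::size_t kGranule = 8;   // application bytes per shadow byte
    static constexpr std::size_t kRedzone = 32;  // assumed redzone size in bytes

    // Every shadow byte starts poisoned (negative), so unallocated space is a redzone.
    explicit ToyHeap(std::size_t bytes) : shadow_(bytes / kGranule, -1) {}

    // Reserves `size` bytes behind a redzone and returns the block's arena offset.
    std::size_t Allocate(std::size_t size) {
        const std::size_t begin = next_ + kRedzone;
        const std::size_t rounded = (size + kGranule - 1) / kGranule * kGranule;
        assert((begin + rounded + kRedzone) / kGranule <= shadow_.size());
        const std::size_t first = begin / kGranule;
        for (std::size_t i = 0; i < size / kGranule; ++i) {
            shadow_[first + i] = 0;  // 0: all 8 bytes are addressable
        }
        if (size % kGranule != 0) {
            // k in 1..7: only the first k bytes of this granule are addressable
            shadow_[first + size / kGranule] = static_cast<std::int8_t>(size % kGranule);
        }
        next_ = begin + rounded;  // the poisoned bytes that follow act as the right redzone
        return begin;
    }

    // Mirrors the check inserted before an access of `size` <= 8 bytes that
    // stays inside one granule: true means the access would be reported.
    bool IsBadAccess(std::size_t addr, std::size_t size) const {
        assert(addr / kGranule < shadow_.size());
        const int k = shadow_[addr / kGranule];
        if (k == 0) return false;
        const int last = static_cast<int>(addr % kGranule + size - 1);
        return last >= k;
    }

private:
    std::vector<std::int8_t> shadow_;
    std::size_t next_ = 0;
};
```

For a 12-byte block (three 4-byte ints), 4-byte accesses at byte offsets 0, 4, and 8 pass the check, while offsets 12 and 16 are flagged: offset 12 falls in the partially addressable granule, and offset 16 lands in the poisoned region to the right of the block.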

The rest of this article briefly records how to use this tool to solve problems that look very hard to solve.

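II. Integrating ASan into an Android app

1. Create a new project

If you are using a version of Android Studio below 3.3, you can skip this section, because the IDE can create a project with JNI automatically. Below we create a new app project that uses JNI to call the native getStr function, which returns a string. Here is the demo implementation.
a. Create the Java class TestAsan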


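public class TestAsan {
    static {
        // Must match the library name declared in CMakeLists.txt (native-lib).
        System.loadLibrary("native-lib");
    }

    // JNI looks up a symbol named after the declaring class; the native code
    // in this article exports it for MainActivity, so declare it there to match.
    public static native String getStr();
}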

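b. Under the main directory, create a jni directory, and in it create native-lib.cpp (the code uses C++ features such as std::string, so it has to be compiled as C++)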

//
// Created by wangjing on 2019/5/5.
//

#include <jni.h>
#include <cstdlib>  // malloc
#include <string>

extern "C"
JNIEXPORT jstring JNICALL
Java_com_jingxun_asan_test_MainActivity_getStr( JNIEnv* env, jobject /* this */) {

    std::string hello = "jingxun hello";

    int *ptr = (int*)malloc(sizeof(int) * 3);

    ptr[4] = 6; // is this a problem?

    return env->NewStringUTF(hello.c_str());
}

c. Create CMakeLists.txt in the root directory

# For more information about using CMake with Android Studio, read the
# documentation: https://d.android.com/studio/projects/add-native-code.html

# Sets the minimum version of CMake required to build the native library.

cmake_minimum_required(VERSION 3.4.1)

# Creates and names a library, sets it as either STATIC
# or SHARED, and provides the relative paths to its source code.
# You can define multiple libraries, and CMake builds them for you.
# Gradle automatically packages shared libraries with your APK.

add_library( # Sets the name of the library.
             native-lib # name of the .so library; can be customized

             # Sets the library as a shared library.
             SHARED

             # Provides a relative path to your source file(s).
             src/main/jni/native-lib.cpp ) # path to the source file
# Searches for a specified prebuilt library and stores the path as a
# variable. Because CMake includes system libraries in the search path by
# default, you only need to specify the name of the public NDK library
# you want to add. CMake verifies that the library exists before
# completing its build.
find_library( # Sets the name of the path variable.
              log-lib
              # Specifies the name of the NDK library that
              # you want CMake to locate.
              log )
# Specifies libraries CMake should link to your target library. You
# can link multiple libraries, such as libraries you define in this
# build script, prebuilt third-party libraries, or system libraries.
target_link_libraries( # Specifies the target library.
                       native-lib # name of the .so library; can be customized
                       # Links the target library to the log library
                       # included in the NDK.
                       ${log-lib} )

d. Configure externalNativeBuild in gradle, click Build->Make Project to generate the .so, and then click Run.

 externalNativeBuild {
        cmake {
            path file('CMakeLists.txt')
        }
    }

2. Case A

    int *ptr = (int*)malloc(sizeof(int) * 3);
    ptr[4] = 6;

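As you can see, the program mallocs space for only 3 ints but then writes to ptr[4], which would be the fifth element. That address lies outside the allocated block, so the write corrupts whatever happens to live there, and an error may surface only later, far from the faulty line. With ASan integrated, this error is detected early, at the bad write itself.

3. Case B

As another example, suppose our JNI code looks like the following and frees the pointer p twice (double free). In real development we are unlikely to make this mistake so directly, but in a multithreaded environment it is hard to guarantee that we never do.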

//
// Created by wangjing on 2019/5/5.
//

#include <jni.h>
#include <string>
#include <malloc.h>

int myfree(int *p){
    free(p);
    p=NULL;
    return 0;
}


extern "C"
JNIEXPORT jstring JNICALL
Java_com_jingxun_asan_test_MainActivity_getStr( JNIEnv* env, jobject o) {

    std::string hello = "Hello from C++";

    int *ptr = (int*)malloc(sizeof(int) * 3);

    myfree(ptr);
    myfree(ptr);

    return env->NewStringUTF(hello.c_str());
}

Running this program crashes. Let's pull the tombstone and take a look.

root@generic:/data/tombstones # cat tombstone_00 |more                         
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'Android/sdk_google_phone_armv7/generic:6.0/MASTER/5056751:userdebug/test-keys'
Revision: '0'
ABI: 'arm'
pid: 4914, tid: 4922, name: FinalizerDaemon  >>> com.example.asantest <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa
    r0 ab1bfbb0  r1 00000001  r2 ab1bfbb0  r3 00000006
    r4 70c5f4c8  r5 12de5600  r6 ab1bfbb0  r7 ffffffff
    r8 70c24aa8  r9 ad577700  sl 12de9430  fp 12de5600
    ip b6e82657  sp b4236560  lr 73213e97  pc b6e8267c  cpsr 60000170
    d0  b4e0f460ed86aff7  d1  b4d7f33c000001f4
    d2  b4d7f368b4e33300  d3  b4e2c14000400000
    d4  b4a2fd09731cb9cd  d5  7108590870e944c8
    d6  731cba71731cb9cd  d7  b6d1ce40000000a4
    d8  0000000000000000  d9  0000000000000000
    d10 0000000000000000  d11 0000000000000000
    d12 0000000000000000  d13 0000000000000000
    d14 0000000000000000  d15 0000000000000000
    d16 0000000000000000  d17 0000000000000000
    d18 0000000000000000  d19 0000000000000000
    d20 0000000000000000  d21 0000000000000000
    d22 0000000000000000  d23 0000000000000000
    d24 0000000000000000  d25 0000000000000000           
    d26 0000000000000000  d27 0000000000000000
    d28 0000000000000000  d29 0000000000000000
    d30 0000000000000000  d31 0000000000000000
    scr 80000010
         
backtrace:
    #00 pc 000af67c  /system/lib/libandroid_runtime.so
    #01 pc 0097bccc  /data/dalvik-cache/arm/system@framework@boot.art

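This backtrace tells us essentially nothing. Imagine how hard it would be to solve such a problem from logs alone once it reaches production! Now let's use ASan to see how these errors can be tracked down.

4. Integrating ASan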

Let's focus on what goes into build.gradle.

1. Declare variables in the build.gradle at the project root

project.ext {
      useASAN = true
      ndkDir = properties.getProperty('ndk.dir')
}

2. Add ASan support in CMakeLists.txt

if(USEASAN)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fsanitize=address")
set(CMAKE_STATIC_LINKER_FLAGS "${CMAKE_STATIC_LINKER_FLAGS} -fsanitize=address")
endif(USEASAN)

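3. In the app's build.gradle, generate the wrap.sh helper script and copy the ASan runtime .so (libclang_rt.asan-<arch>-android.so) from the NDK

// The generate tasks will depend on this task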
task createAsanHelpUtilScript(dependsOn: copyAsanLibs) {
    for (String abi : SupportedABIs) {
        def dir = new File("app/asan/res/lib/" + abi)
        dir.mkdirs()
        def helpFile = new File(dir, "wrap.sh")
        generateHelpUtil(helpFile, abi)
        println "helpFile file path " + helpFile.path
    }
}

The task above depends on the copyAsanLibs task, which copies the ASan .so files into the directory we specify

task copyAsanLibs(type:Copy) {
    def libDir = file("$rootProject.ext.ndkDir").absolutePath + "/toolchains/llvm/prebuilt/"
    for (String abi : SupportedABIs) {
        def dir = new File("app/asan/libs/" + abi)
        dir.mkdirs()
        if(abi == 'armeabi-v7a' || abi == 'armeabi')
            abi = "arm"
        if(abi == "arm64-v8a")
            abi = "aarch64"
        FileTree tree = fileTree(dir: libDir).include("**/*asan*${abi}*.so")
        tree.each { File file ->
            from file
            into dir.absolutePath
        }
    }
}

Generating the wrap.sh helper script

static def generateHelpUtil(file, abi) {
    if(abi == "armeabi" || abi == "armeabi-v7a")
        abi = "arm"
    if(abi == "arm64-v8a")
        abi = "aarch64"
    file.withWriter { writer ->
        writer.write('#!/system/bin/sh\n')
        writer.write('HERE="$(cd "$(dirname "$0")" && pwd)"\n')
        writer.write('export ASAN_OPTIONS=log_to_syslog=false,allow_user_segv_handler=1\n')
        writer.write('export ASAN_ACTIVATION_OPTIONS=include_if_exists=/data/local/tmp/asan.options.b\n')
        // LD_PRELOAD tells the dynamic linker to load the specified shared library first
        writer.write("export LD_PRELOAD=\$HERE/libclang_rt.asan-${abi}-android.so\n")
        writer.write('\$@\n')
    }
}

With the above written, some configuration is needed in defaultConfig. It specifies the C++ standard, which is optional

def abiFiltersForWrapScript = []
def SupportedABIs = ['arm64-v8a']
  externalNativeBuild {
            cmake {
                cppFlags "-std=c++11 -frtti -fexceptions -g"
                abiFilters.addAll( SupportedABIs )
            }
        }

The buildTypes configuration in the android block

 buildTypes {
        debug {
            externalNativeBuild {
                cmake {
                    if (rootProject.ext.useASAN)
                    // Two compiler toolchains are available to CMake: clang and gcc
                        arguments "-DUSEASAN=ON", "-DANDROID_TOOLCHAIN=clang"
                }
            }
            packagingOptions {
                doNotStrip "**.so"
                if (rootProject.ext.useASAN && abiFiltersForWrapScript) {
                    def exclude_abis = ["armeabi", "armeabi-v7a", "arm64-v8a", "x86", "x86_64", "mips", "mips64"]
                            .findAll { !(it in abiFiltersForWrapScript) }
                            .collect { "**/" + it + "/wrap.sh" }
                    excludes += exclude_abis
                }
            }

            if (rootProject.ext.useASAN) {
                sourceSets {
                    main {
                        jniLibs {
                            srcDir {
                                "asan/libs"
                            }
                        }
                        resources {
                            srcDir {
                                "asan/res"
                            }
                        }
                    }
                }
            }
        }

That completes the integration. Now let's run it and see the result!

III. Analysis

1. Case A analysis

The log at the time of the crash is as follows.

2019-05-10 15:49:00.741 14053-14053/? I/wrap.sh: =================================================================
2019-05-10 15:49:00.741 14053-14053/? I/wrap.sh: ==14057==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x005f0b2adae0 at pc 0x005df0b8273c bp 0x007fdf2982f0 sp 0x007fdf2982e8
2019-05-10 15:49:00.741 14053-14053/? I/wrap.sh: WRITE of size 4 at 0x005f0b2adae0 thread T0 (ngxun.asan.test)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #0 0x5df0b82738  (/data/app/com.jingxun.asan.test-urz4gTktXt-wT-ZZtL1mrQ==/lib/arm64/libnative-lib.so+0xe738)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #1 0x5e07efe9e0  (/system/lib64/libart.so+0x5659e0)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #2 0x5e07ef5988  (/system/lib64/libart.so+0x55c988)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #3 0x5e07a69520  (/system/lib64/libart.so+0xd0520)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #4 0x5e07c19b90  (/system/lib64/libart.so+0x280b90)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #5 0x5e07c13ba4  (/system/lib64/libart.so+0x27aba4)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #6 0x5e07ec52dc  (/system/lib64/libart.so+0x52c2dc)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #7 0x5e07ee8014  (/system/lib64/libart.so+0x54f014)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #8 0x5e07bed8a8  (/system/lib64/libart.so+0x2548a8)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #9 0x5e07eb5b90  (/system/lib64/libart.so+0x51cb90)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #10 0x5e07efeafc  (/system/lib64/libart.so+0x565afc)
2019-05-10 15:49:00.762 14053-14053/? I/wrap.sh:     #11 0x75710644  (/system/framework/arm64/boot-framework.oat+0x1606644)
2019-05-10 15:49:00.763 14053-14053/? I/wrap.sh: 0x005f0b2adae0 is located 4 bytes to the right of 12-byte region [0x005f0b2adad0,0x005f0b2adadc)
2019-05-10 15:49:00.763 14053-14053/? I/wrap.sh: allocated by thread T0 (ngxun.asan.test) here:
2019-05-10 15:49:00.763 14053-14053/? I/wrap.sh:     #0 0x7e8c82b758  (/data/app/com.jingxun.asan.test-urz4gTktXt-wT-ZZtL1mrQ==/lib/arm64/libclang_rt.asan-aarch64-android.so+0xc9758)
2019-05-10 15:49:00.763 14053-14053/? I/wrap.sh:     #1 0x5df0b826c8  (/data/app/com.jingxun.asan.test-urz4gTktXt-wT-ZZtL1mrQ==/lib/arm64/libnative-lib.so+0xe6c8)
2019-05-10 15:49:00.763 14053-14053/? I/wrap.sh:     #2 0x5e07efe9e0  (/system/lib64/libart.so+0x5659e0)
2019-05-10 15:49:00.780 14053-14053/? I/wrap.sh:     #3 0x130a781c  (/dev/ashmem/dalvik-main space (region space) (deleted)+0x4a781c)

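Next, use addr2line to find the exact location. The usual options are -C to demangle function names, -f to show function names, -i to also show inlined frames, and -e to specify the input file.

$ addr2line -Cife /Users/wangjing/study/Asan/AddressSanitize/app/build/intermediates/cmake/debug/obj/arm64-v8a/libnative-lib.so  0xe738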
Java_com_jingxun_asan_test_MainActivity_getStr
/Users/wangjing/study/Asan/AddressSanitize/app/.externalNativeBuild/cmake/debug/arm64-v8a/../../../../src/main/cpp/native-lib.cpp:16

We can see that the error occurred at line 16 of native-lib.cpp, and that the cause is heap-buffer-overflow.
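
The report line "located 4 bytes to the right of 12-byte region" can be checked by hand against Case A. The following sketch redoes that arithmetic. It assumes a 4-byte int (true for the Android ABIs used here); the helper names RegionSize and BytesPastEnd are hypothetical, introduced only for this illustration. The three addresses are copied from the log above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: int is 4 bytes, which holds for the Android ABIs used in this article.
constexpr std::size_t kIntSize = 4;

// Bytes occupied by an allocation of `count` ints.
constexpr std::size_t RegionSize(std::size_t count) { return count * kIntSize; }

// Distance in bytes from the end of a `count`-int region to the start of
// element `index`. Only meaningful for an out-of-bounds index (index >= count).
inline std::size_t BytesPastEnd(std::size_t count, std::size_t index) {
    assert(index >= count);
    return index * kIntSize - RegionSize(count);
}

// Addresses copied from the ASan report for Case A.
constexpr std::uint64_t kRegionBegin = 0x005f0b2adad0;
constexpr std::uint64_t kRegionEnd = 0x005f0b2adadc;
constexpr std::uint64_t kFaultAddr = 0x005f0b2adae0;

static_assert(kRegionEnd - kRegionBegin == RegionSize(3),
              "malloc(sizeof(int) * 3) yields a 12-byte region");
static_assert(kFaultAddr - kRegionBegin == 4 * kIntSize,
              "the faulting address is where ptr[4] starts");
static_assert(kFaultAddr - kRegionEnd == 4,
              "4 bytes to the right of the region");
```

So the reported address is exactly where ptr[4] begins: 16 bytes from the start of a 12-byte block, which is 4 bytes past its end.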

2. Case B analysis

2019-05-10 15:58:37.163 14765-14765/? I/wrap.sh: =================================================================
2019-05-10 15:58:37.163 14765-14765/? I/wrap.sh: ==14769==ERROR: AddressSanitizer: attempting double-free on 0x003e30a7e8d0 in thread T0 (ngxun.asan.test):
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #0 0x7a5410c3fc  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libclang_rt.asan-aarch64-android.so+0xc93fc)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #1 0x79b948750c  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libnative-lib.so+0xe50c)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #2 0x79b9487700  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libnative-lib.so+0xe700)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #3 0x79d089e9e0  (/system/lib64/libart.so+0x5659e0)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #4 0x79d0895988  (/system/lib64/libart.so+0x55c988)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #5 0x79d0409520  (/system/lib64/libart.so+0xd0520)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #6 0x79d05b9b90  (/system/lib64/libart.so+0x280b90)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #7 0x79d05b3ba4  (/system/lib64/libart.so+0x27aba4)
2019-05-10 15:58:37.183 14765-14765/? I/wrap.sh:     #8 0x79d08652dc  (/system/lib64/libart.so+0x52c2dc)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #9 0x79d0888014  (/system/lib64/libart.so+0x54f014)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #10 0x79d058d8a8  (/system/lib64/libart.so+0x2548a8)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #11 0x79d0855b90  (/system/lib64/libart.so+0x51cb90)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #12 0x79d089eafc  (/system/lib64/libart.so+0x565afc)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #13 0x75710644  (/system/framework/arm64/boot-framework.oat+0x1606644)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh: 0x003e30a7e8d0 is located 0 bytes inside of 12-byte region [0x003e30a7e8d0,0x003e30a7e8dc)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh: freed by thread T0 (ngxun.asan.test) here:
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #0 0x7a5410c3fc  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libclang_rt.asan-aarch64-android.so+0xc93fc)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #1 0x79b948750c  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libnative-lib.so+0xe50c)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #2 0x79b94876f0  (/data/app/com.jingxun.asan.test-3xyuXAOR5HBGmy7bb0bXHQ==/lib/arm64/libnative-lib.so+0xe6f0)
2019-05-10 15:58:37.184 14765-14765/? I/wrap.sh:     #3 0x79d089e9e0  (/system/lib64/libart.so+0x5659e0)
2019-05-10 15:58:37.202 14765-14765/? I/wrap.sh:     #4 0x1312a51c  (/dev/ashmem/dalvik-main space (region space) (deleted)+0x52a51c)

Again, use addr2line to find the exact location

$ addr2line -Cife /Users/wangjing/study/Asan/AddressSanitize/app/build/intermediates/cmake/debug/obj/arm64-v8a/libnative-lib.so 0xe50c
myfree(int*)
/Users/wangjing/study/Asan/AddressSanitize/app/.externalNativeBuild/cmake/debug/arm64-v8a/../../../../src/main/cpp/native-lib.cpp:10

We can see that the error occurred at line 10 of native-lib.cpp, inside the myfree function, and that the cause is double-free.
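
The underlying reason is that myfree receives the pointer by value, so p=NULL clears only the local copy while the caller's ptr still holds the freed address. One possible fix, offered here as an assumption and not as the author's own, is to pass the pointer by reference so that a second call becomes free(nullptr), which is a no-op. FreeTwiceSafely is a hypothetical helper that only demonstrates the behaviour.

```cpp
#include <cassert>
#include <cstdlib>

// Takes the pointer by reference so that clearing it affects the caller.
int myfree(int *&p) {
    free(p);      // free(nullptr) is defined to do nothing
    p = nullptr;  // clears the caller's pointer, not a local copy
    return 0;
}

// Hypothetical demo: frees the same pointer variable twice without a double free.
bool FreeTwiceSafely() {
    int *ptr = static_cast<int *>(malloc(sizeof(int) * 3));
    myfree(ptr);
    assert(ptr == nullptr);  // the caller's pointer was cleared by the first call
    myfree(ptr);             // now free(nullptr), a no-op
    return ptr == nullptr;
}
```

This protects only the one pointer variable that is passed in. Copies of the pointer held elsewhere, or frees racing across threads, still need clear ownership rules, and those are the cases where ASan's report is most useful.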

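Both cases above were solved with nothing more than addr2line. For more complex problems you can also debug with gdb or capture a coredump for analysis. The key is to first find where the error occurs, and then fix it.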
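
When a report has many frames, picking out the module path and offset that addr2line needs can be scripted. The sketch below is one way to do it, assuming frames always end in the "(module+0xOFFSET)" form seen in the logs above. The Frame struct and the ParseAsanFrame function are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// One symbolizable frame: the module path and the offset inside it.
struct Frame {
    std::string module;
    std::uint64_t offset = 0;
};

// Parses a log line ending in "(module+0xOFFSET)". Returns false for lines
// that do not contain such a part, e.g. the "thread T0 (...)" lines.
bool ParseAsanFrame(const std::string &line, Frame *out) {
    assert(out != nullptr);
    const std::size_t open = line.find('(');
    const std::size_t close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close <= open) {
        return false;
    }
    // The module path may itself contain spaces and parentheses, so take
    // everything between the first '(' and the last ')'.
    const std::string inner = line.substr(open + 1, close - open - 1);
    const std::size_t plus = inner.rfind("+0x");
    if (plus == std::string::npos) {
        return false;
    }
    try {
        out->offset = std::stoull(inner.substr(plus + 3), nullptr, 16);
    } catch (const std::exception &) {
        return false;  // not a valid hexadecimal offset
    }
    out->module = inner.substr(0, plus);
    return true;
}
```

The resulting offset (for example 0xe738 for frame #0 of Case A) is the value passed to addr2line, together with the unstripped copy of the module from the build output directory.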

https://github.com/google/sanitizers/wiki/AddressSanitizerOnAndroid
