The debate sparked by pthread_kill

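A piece of code recently submitted for testing hit a strange bug: it kept crashing inside pthread_kill, and not on every run.
The calling logic was roughly as follows, using pthread_kill to check whether a thread is still running: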

pthread_kill(pthread_[i], 0)

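On the final crash stack I checked the pthread_ pointer and i above. Both were normal and could be accessed without problems.
The executable had been built correctly.
The overall logic of the code also looked fine, and every access to pthread_ that needed a lock was done under one.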

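I was quite stuck, but luckily I asked a colleague to help analyze it in time. He immediately pointed out that a pthread_t value is very likely converted to a pointer and used as one internally. If the memory region at the address it holds has been freed, pthread_kill may well crash, even though the memory where the caller stores the pthread_t value itself has not been freed. The code flow also had a path where pthread_try_join (presumably glibc's pthread_tryjoin_np) was called on a pthread_t and pthread_kill was then called on the same value. Freed memory may also remain usable if nothing reclaims it right away, which explains why the crash did not happen every time. Very helpful indeed.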

A simple diagram to illustrate:

[Figure: pthread_t]

With this lead, I did a quick check on the crash stack:

gdb>p *(pthread_ + i)
0x**********
gdb>x/10a 0x**********
can't access 

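Bingo, that was pretty much the problem. Once it was found, the fix was simple (just stop calling pthread_kill on a pthread_t after pthread_try_join has been called on it). The build was resubmitted for testing and passed.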
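
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: struct worker and the worker_* functions are hypothetical names, the thread body is a placeholder, and callers are assumed to serialize access to each struct worker with their own lock. It uses pthread_tryjoin_np, which is glibc-specific and needs _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one worker thread. */
struct worker {
    pthread_t tid;
    bool joined;   /* true once a join has succeeded; tid is dead after that */
};

/* Placeholder thread body; a real worker would do its job here. */
static void *worker_body(void *arg)
{
    return arg;
}

/* Start the worker and mark its handle as live. Returns 0 on success. */
static int worker_start(struct worker *w)
{
    assert(w != NULL);
    w->joined = false;
    return pthread_create(&w->tid, NULL, worker_body, NULL);
}

/* Try to reap the worker without blocking. Returns true once it is joined. */
static bool worker_try_reap(struct worker *w)
{
    assert(w != NULL);
    if (!w->joined && pthread_tryjoin_np(w->tid, NULL) == 0)
        w->joined = true;
    return w->joined;
}

/* Send a signal only while the handle is still valid. */
static int worker_signal(struct worker *w, int signo)
{
    assert(w != NULL);
    if (w->joined)
        return ESRCH;   /* never hand a joined pthread_t to pthread_kill */
    return pthread_kill(w->tid, signo);
}
```

The point of the joined flag is that the program itself remembers when the handle's lifetime has ended, rather than asking pthread_kill to find out.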

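With the problem solved, I had long meant to write a test case of my own to verify the pthread_t issue above, since the pthread_kill man page says that an invalid pthread_t should yield ESRCH, not a crash. After some googling I found that many people have hit this. The answer to the Stack Overflow question "Segmentation fault caused by pthread_kill" explains it well and gives two related links, which are quite interesting:
1、pthread_kill() Segmentation Fault when TID is invalid
2、pthread_t and similar types (the blog content is pasted at the end, in case the link becomes inaccessible later)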

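The link "pthread_kill() Segmentation Fault when TID is invalid" is a bug report that someone filed about this with the glibc NPTL developers. The author rejected the bug rather bluntly, and the two started arguing. Judging by the argument alone it does feel like a bug. The gist is that neither the documentation nor the standard says so, so how can it just crash? I then read the blog post the author wrote about this issue, "pthread_t and similar types", which explains from the POSIX design perspective why it is designed this way and is also an interesting read.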

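Overall my impression is that designers and users see things from different angles. From the user's side, an interface should behave as documented and should not hide obscure logic, or should at least describe the relevant situations clearly. From the designer's side, the goal is greater flexibility and generality, without implementing special-case logic inside the interface. On the implementation I lean toward the designer's view, but I still do not agree with the man page wording, which really was unclear. The description has since been updated, though: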

http://man7.org/linux/man-pages/man3/pthread_kill.3.html
The glibc implementation returns this error in the cases where an invalid thread ID can be detected. But note also that POSIX says that an attempt to use a thread ID whose lifetime has ended produces undefined behavior, and an attempt to use an invalid thread ID in a call to pthread_kill() can, for example, cause a segmentation fault.

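For pthread_kill this means that ESRCH is returned if the implementation detects that the pthread_t is invalid, but that does not mean every invalid pthread_t can be detected, because the standard places no explicit restriction on the type used to implement pthread_t. I looked up one version of glibc's pthread_kill and found that it returns ESRCH only when tid <= 0. When exactly tid <= 0 occurs I have yet to check, although the comment in the source says the kernel clears pd->tid when a thread exits (for the differences among tid, pthread_t, pid and tgid, see "Difference between pid and tid"). Different implementations and versions may also behave differently, so from this angle using pthread_kill to check whether a thread is running seems pointless...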

#include <errno.h>
#include <signal.h>
#include <pthreadP.h>
#include <tls.h>
#include <sysdep.h>
#include <unistd.h>


int
__pthread_kill (pthread_t threadid, int signo)
{
  struct pthread *pd = (struct pthread *) threadid;

  /* Make sure the descriptor is valid.  */
  if (DEBUGGING_P && INVALID_TD_P (pd))
    /* Not a valid thread handle.  */
    return ESRCH;

  /* Force load of pd->tid into local variable or register.  Otherwise
     if a thread exits between ESRCH test and tgkill, we might return
     EINVAL, because pd->tid would be cleared by the kernel.  */
  pid_t tid = atomic_forced_read (pd->tid);
  if (__glibc_unlikely (tid <= 0))
    /* Not a valid thread handle.  */
    return ESRCH;

  /* Disallow sending the signal we use for cancellation, timers,
     for the setxid implementation.  */
  if (signo == SIGCANCEL || signo == SIGTIMER || signo == SIGSETXID)
    return EINVAL;

  /* We have a special syscall to do the work.  */
  INTERNAL_SYSCALL_DECL (err);

  pid_t pid = __getpid ();

  int val = INTERNAL_SYSCALL_CALL (tgkill, err, pid, tid, signo);
  return (INTERNAL_SYSCALL_ERROR_P (val, err)
          ? INTERNAL_SYSCALL_ERRNO (val, err) : 0);
}
strong_alias (__pthread_kill, pthread_kill)
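
To make the shape of that check easier to see, here is a toy model of it. It is not glibc code: toy_thread, toy_handle, toy_kill and toy_demo are hypothetical names, and the struct keeps only a tid field. The sketch shows that the handle has to be dereferenced before tid <= 0 can be noticed, so ESRCH is only possible while the descriptor's memory is still readable.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Toy stand-in for the thread descriptor; only the field we need. */
struct toy_thread {
    pid_t tid;   /* kernel thread ID; assumed to be cleared to 0 on exit */
};

/* Hypothetical opaque handle, playing the role of pthread_t. */
typedef uintptr_t toy_handle;

/* Same order of operations as the listing above: cast, read tid, then test. */
static int toy_kill(toy_handle h)
{
    const struct toy_thread *pd = (const struct toy_thread *) h;
    pid_t tid = pd->tid;   /* a freed or unmapped descriptor faults here */
    if (tid <= 0)
        return ESRCH;      /* detectable case: descriptor readable, tid cleared */
    return 0;              /* the real code would go on to call tgkill */
}

/* Both descriptors below still exist, so the check is able to run. */
static void toy_demo(void)
{
    struct toy_thread live = { .tid = 1234 };
    struct toy_thread exited = { .tid = 0 };

    assert(toy_kill((toy_handle) &live) == 0);
    assert(toy_kill((toy_handle) &exited) == ESRCH);
}
```

A handle whose descriptor has been freed is deliberately not exercised here, because reading through it is exactly the undefined behavior under discussion.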

end~

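Follow-up
Later I read a few more of udrepper's blog posts and only then realized that udrepper is the author of the paper "How to Write Shared Libraries". Wow...
Bookmarking the blog address: https://www.akkadia.org/drepper/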

pthread_t and similar types
Constantly people complain that the runtime does not catch their mistakes. They are hiding behind this requirement in the POSIX specification (for pthread_join in this case, also applies to pthread_kill and similar functions):

   The pthread_join() function shall fail if:
   [...]
  ESRCH  No thread could be found corresponding to that specified by the given thread ID.

The glibc implementation follows this requirement to the letter. IFF we can detect that the thread descriptor is invalid we do return ESRCH.

But: the above does not mean that all uses of invalid thread descriptors must result in ESRCH errors. The reason is simple: the standard does not restrict the implementation in any way in the definition of the type pthread_t. It does not even have to be an arithmetic type. This means it is valid to use a pointer type and this is just what NPTL does.

Nobody argues that functions like strcpy should not dump a core in case the buffer is invalid. The same for pthread_attr_t references passed to pthread_attr_init etc. The use of pthread_t when defined as a pointer is no different. The only complication is in the understanding that pthread_t can be a pointer type. This is obvious for void* etc.

In the POSIX committee we discussed several times changing the pthread_join and pthread_kill man pages. The ESRCH errors could be marked as may fail. But

1、this really is not necessary, see above.
2、it would mean we have to go through the entire specification and treat every other place where this is an issue the same way.

If somebody wants to do the work associated with the second step above and we have confidence in the results, we (= Austin Group) might make the change at some later date. But it is a rather high risk for no real gain. Programmers have to educate themselves anyway.

What remains is the question: how can programs avoid these mistakes? It is actually pretty simple: the program should make sure that no calls to pthread_kill, for instance, can happen when the thread is exiting. One way to solve this problem is:

1、Associate a variable running of some sort and a mutex with each thread.
2、In the function started by pthread_create (the thread function) set running to true.
3、Before returning from the thread function or calling pthread_exit or in a cancellation handler acquire the mutex, set running to false, unlock the mutex, and proceed.
4、Any thread trying to use pthread_kill etc first must get the mutex for the target thread, if running is true call pthread_kill, and finally unlock the mutex.
This ensures that no invalid descriptor is used. But I can already hear people complain:
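
The four steps above could look roughly like this in C. It is only a sketch under assumptions: struct tracked_thread and the tracked_* functions are hypothetical names, the thread's real work is left as a comment, and the running flag is set under the mutex so that readers and the writer never race on it.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

/* Step 1: a running flag and a mutex associated with each thread. */
struct tracked_thread {
    pthread_t tid;
    pthread_mutex_t lock;
    bool running;
};

static void *tracked_main(void *arg)
{
    struct tracked_thread *t = arg;

    pthread_mutex_lock(&t->lock);
    t->running = true;               /* step 2 */
    pthread_mutex_unlock(&t->lock);

    /* ... the thread's real work would go here ... */

    pthread_mutex_lock(&t->lock);
    t->running = false;              /* step 3, done before returning */
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Initialize the record and start the thread. Returns 0 on success. */
static int tracked_start(struct tracked_thread *t)
{
    assert(t != NULL);
    int rc = pthread_mutex_init(&t->lock, NULL);
    if (rc != 0)
        return rc;
    t->running = false;
    rc = pthread_create(&t->tid, NULL, tracked_main, t);
    if (rc != 0)
        pthread_mutex_destroy(&t->lock);
    return rc;
}

/* Step 4: only call pthread_kill while holding the lock and running is true. */
static int tracked_kill(struct tracked_thread *t, int signo)
{
    assert(t != NULL);
    int rc = ESRCH;                  /* reported when the thread is not running */
    pthread_mutex_lock(&t->lock);
    if (t->running)
        rc = pthread_kill(t->tid, signo);
    pthread_mutex_unlock(&t->lock);
    return rc;
}
```

While tracked_kill holds the lock with running still true, the target thread cannot get past step 3, so it has not exited and its pthread_t is still valid. A thread that has not yet reached step 2 is simply reported as not running.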

This is too expensive!
That is ridiculous. The implementation would have to do something similar if it would try to catch bad thread descriptors. In fact, it would have to do more. What is important is to recognize that this price would have to be paid by every program, not just the buggy ones. This is wrong. Only those people who need this extra protection should pay the price.

But I don't have control over the code calling pthread_create!
Boo hoo, cry me a river. Don't expect sympathy for using proprietary software. I will never allow good free software to be shackled because of proprietary code. If you cannot get this changed in the code you pay good money for this just means it is time to find a new supplier or, even better, use free software.

In summary, this is entirely a problem of the programs which experience them. Existing Linux systems are proof that it is possible to write complex programs without requiring the implementation to help incompetent programmers. We will have a few more words in the next revision of the POSIX specification which talk about this issue. But I expect they will be ignored anyway and all focus remains on the shall fail errors of pthread_kill etc.
