好家伙，到底谁在用TBtools

写在前面

之前写了publish or perish的使用方法。那么问题来了，我们获取了这些数据能拿来干什么？于是萌生了一个想法，不如统计一下引用TBtools的文章，统计一下每年文章发表数量（截止12月初），发表的层次，用户都用了哪些功能等等。当初的想法是我一个滑铲就全部搞定了，但实现起来嘛……一干就是两个周，并且短期内我是再也不想看基因挖掘相关的任何东西了……统计到最后实在是想吐了，如果你发现你发的文章被错误的统计了，或者没统计到……欢迎私信告诉我，我一定不改（

数据导入和准备工作

这里的话数据我已经整理好了，如果你也感兴趣的话点这里进行下载(https://cowtransfer.com/s/6f63b578a5e34f)。
首先载入数据

#Wipe the environment
rm(list=ls())
#Set the work directory
setwd("/Users/yangshao/OneDrive")
#Import the data
dta <- read.csv("clean_cite.csv", header = T)
#Take a peek at the data
skim(dta)
#Dimension of the data
dim(dta)

PS: 这里是我的地址，不要无脑复制……
之后载入需要用到的包，接下来就可以进行分析了！如果没有安装对应的包，先让电脑安装就好了。实在不会可以让你师兄先来学，然后教你（

library(skimr)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggthemes)
library(tidytext)
library(stringr)
library(wordcloud)
library(networkD3)

越来越多的文章使用并引用TBtools

首先是每年引用文章数的可视化

#Citation per year
cite_trend <- sort(table(dta$Year), decreasing = TRUE)
cite_trend <- as.data.frame(cite_trend)
#Visualization
colnames(cite_trend)  <- c('Year','Num.')
ggbarplot(cite_trend, x = "Year", fill = "Year", order = c('2018','2019','2020'), y = "Num.", palette = 'jco') + 
  scale_y_continuous(expand = c(0, 0))

从2018年至2020年，TBtools的每年引用数呈现上升趋势（废话）。截止2020年12月初，2020年的引用数已经超过500次。需要说明的是，这里的数据是相对比较保守的数据，实际引用数可能远不止这些。

TBtools的用户都喜欢把文章发在谁家？

使用以下代码完成可视化，别问我为啥用ggpubr……

#Sort the top 15 publisher
cite_publisher <- head(sort(table(dta$Publisher), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_publisher <- as.data.frame(cite_publisher)
#Rename the col names
colnames(cite_publisher)  <- c('Publisher','Count')
#Visualization
ggbarplot(cite_publisher, x = "Publisher", fill = "Publisher", y = "Count") + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE,
  legend = 'none',
  x.text.angle = 45
)

Cite_Publisher

截止2020年12月初，mdpi发表了多数文章，施普林格（包含BMC）和爱斯维尔紧随其后。
三个出版商一共发表了538篇，占总数的65.1%。

cite_publisher %>% head(3) %>% select(Count) %>% sum()
cite_publisher %>% head(3) %>% select(Count) %>% sum()/count(dta)

那么引用了TBtools的文章发表在了哪些杂志上呢？同样的方式完成可视化，别问我为什么不用ggpubr……

#Sort the top 15 Journal
cite_source <- head(sort(table(dta$Source), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_source <- as.data.frame(cite_source)
#Rename the col
colnames(cite_source)  <- c('Journal Name','Count')
#Visualization
ggplot(cite_source,aes(x = cite_source$`Journal Name`, y = Count, fill = cite_source$`Journal Name`)) + 
  geom_bar(stat='identity') + 
  scale_y_continuous(expand = c(0, 0)) + 
  xlab("Journal Name") +
  guides(fill=guide_legend(title="Journal Name")) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE
) + rremove("x.text")

Cite_journal

从结果上看，多数文章发在了OA杂志上，例如来自mdpi的International Journal of Molecular Sciences，例如来自Springer的BMC Plant Biology和BMC Genomics，例如来自Frontier的Frontier in Genetics和Frontiers in Physiology，例如PeerJ, PloS等……按你胃，有钱发OA真好……

用户最爱用热图

每篇文章都会在文章里提到使用了TBtools的哪些功能。很遗憾这部分功能只能手动登录了，因为很多情况下，大家对同一个功能叙述存在差异，例如同样是一个热图功能有的人说'the DEGs were visualized by TBtools'有的人会说'the heatmaps were used to show/draw 。。。。(TBtools， gihub)'或者是'the correlation coefficients were visualized by TBtools' 等等。还有Circos图，很多时候难以分辨到底是Circos画的还是用TBtools画的，这种情况提到了TBtools索性都用Synteny代替了。还有一些懒得索性就说'the data were visualized by aaa, bbb, and TBtools' 的，这种也只能自己手动找。笔者已经很认真的过滤了，但人工做这方面工作难免会出错，有些时候甚至看到某几个图就把几个常用功能一列完事了……所以如果你发现我列举的功能跟文章提到的有出入，李姐万岁……
在这里分析的时候需要先做成长透视表（pivot_longer），然后选择数据开始作图，这里因为需要做三次同样的图，干脆就把表格的命令做成一个函数，直接调用就好了。这样显得不那么弱智（

#Create a pivot table
dta %>% select(Year,17:25) %>% pivot_longer(!Year,
                                            names_to = 'Name',
                                            values_to = 'Fun') -> Fun_count_pivot
#Function for counting
Fun_count <- function(year, Pivot){
  table(Pivot %>% filter(Year == year) %>% .$Fun) %>% 
  as.data.frame() %>% .[-1,] -> Count_Year
  colnames(Count_Year) <-c('Function', 'Count')
  top_three <- tail(Count_Year[order(Count_Year$Count),],3)
  ggbarplot(Count_Year, x = "Function", fill = "Function", y = "Count") + 
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
  theme_pubr(
  base_family = "",
  border = FALSE,
  margin = TRUE
) + rremove("x.text") + geom_text(top_three, mapping = aes(label = Function), nudge_y = top_three$Count*.2) + geom_text(top_three, mapping = aes(label = Count), nudge_y = top_three$Count*.1)
}
#Visualization the mentioned functions in papers
Fun_count("2018",Fun_count_pivot)
Fun_count("2019",Fun_count_pivot)
Fun_count("2020",Fun_count_pivot)

2018年

2019年

2020年

从2018年到2020年，用户使用TBtools的姿势越来越多：2018年，主要提及的功能只有15个，而到了2020年文章中提及的功能有29个。从2019年开始，TBtools用户群体最多使用的三个功能为:'Heatmap', 'motif'和'Gene structure'。

	2018	Count	2019	Count	2020	Count
1	motif	8	Heatmap	71	Heatmap	315
2	Synteny	8	motif	66	motif	183
3	Gene structure	7	Gene structure	45	Gene structure	171
4	GO	7	Synteny	41	Synteny	102
5	Heatmap	6	Gene location	30	Phylogenetic tree	96
6	Gene location	5	Phylogenetic tree	22	Gene location	88
7	KEGG	5	GO	15	Venn diagram	43
8	Venn diagram	4	Venn diagram	14	GO	37
9	Fasta extraction	3	Fasta extraction	9	Seqlogo	35
10	Phylogenetic tree	3	MEME	8	Fasta extraction	21
11	Seqlogo	3	blastv	7	Hierarchical clustering	21
12	blast	2	Ks&Ka	6	Ks&Ka	21
13	MEME	2	Seqlogo	6	KEGG	13
14	eFP	1	KEGG	5	Upset plot	13
15	Ks&Ka	1	Hierarchical clustering	3	MEME	12

最多提及的功能可以通过以下代码查看

Fun_sum <- function(year, Pivot){
table(Pivot %>% filter(Year == year) %>% .$Fun) %>% 
  as.data.frame() %>% .[-1,] -> Fun_sum
  colnames(Fun_sum) <- c('Function', 'Count')
  Fun_sum <- head(Fun_sum[order(-Fun_sum$Count),],15)
  return(Fun_sum)
}
Count_2018 <- Fun_sum(2018,Fun_count_pivot)
Count_2019 <- Fun_sum(2019,Fun_count_pivot)
Count_2020 <- Fun_sum(2020,Fun_count_pivot)
data.frame("2018" = Count_2018,"2019" = Count_2019, "2020" = Count_2020) %>% view()

与2019年相比，今年的TBtools用户更喜欢用TBtools做进化树，韦恩图和fasta文件的相关操作。在2020年，TBtools的多数功能确实都有用户在用。尤其是陈博士重点照顾的Heatmap，gene structure annotation，motif和phylogenetic tree。常用到你不一起做就觉得少了点什么的程度。当然，这也侧面反映了今年家族挖掘的研究还是很流行……

使用了TBtools的文章都发在了什么水平杂志上

我知道IF这东西不好做横向对比。之前也有想法做个TBtools到底有没有帮助做着把文章发在更好的杂志上。但最后还是放弃了。说到底我认为一篇文章能否发在更好的杂志上还得看你的文章是否有真东西。工具是帮助你更好的展现你的想法，讲好你的故事而不是帮你做一些花里胡哨的图糊弄reviewer的。
所以在这里做一下TBtools这三年的IF趋势吧。

IF_by_year <- dta %>% select(Year, Source, ImpactFactor)
IF_by_year$Year <- as.character(IF_by_year$Year)
IF_by_year$ImpactFactor <- as.numeric(IF_by_year$ImpactFactor)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8, facet.by = "Year",xticks.by = 1)

有意思的是，这三年TBtools的文章主要落在了IF=3-4的区间。而在2020年，IF的两端出现了凸起。：一方面，拿家族基因挖掘当财富密码的人越来越多。这块相互竞争激烈，一些文章不得不选择低IF的杂志发表文章。而另一方面也有不少文章发表在了老牌期刊上，而这些期刊的IF常常落在6-7的区间。当然这里有个例外：'Genomics'，根据学之策公众号的实时IF统计显示，'Genomics'的实时IF实际在3.2左右。我能说什么呢，套路深啊……

dta %>% select(Year,Source, ImpactFactor) %>% 
  filter(ImpactFactor < 7 & ImpactFactor > 6 & Source != 'Genomics') %>% group_by(Source) %>%
  summarise(Count = n()) %>% ggplot(aes(x=Source, y=Count, fill = Source)) + 
  geom_bar(stat='identity',show.legend = FALSE) + 
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) + 
  coord_flip() + theme_few()

为了减少这部分的影响，这里剔除了这个'Genomics'期刊。

IF_by_year %>% 
  filter(Source != 'Genomics') %>%
gghistogram(., x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)

dta %>% select(Year,Source, ImpactFactor,17:25) %>% 
  filter(ImpactFactor > 6& Source != 'Genomics') %>% view()

这里再次看了下6分以上的文章都喜欢用TBtools的什么功能。一看不得了，这可不是随便用的啊，一看就是有备而来。最多用的功能还是Heatmap。广州这个地方肯定很热……嗯，希望没什么奇怪的联想。这里就不做图了，感兴趣的不妨自己试试。

TBtools用户都喜欢研究什么

这个问题说真的我很难回答，毕竟做这个我不专业。大概琢磨了下分词，弄了个词云出来看看。

dta %>% select(Year, Title) %>% 
  unnest_tokens(output = 'word', input = 'Title', token = "words", to_lower = TRUE) %>% #Text segmentation
  select(word) %>% 
  anti_join(stop_words) %>% #delete the stop words 
  count(word, sort = TRUE) -> word_count

set.seed(2020) #for reproducibility 
wordcloud(words = word_count$word, 
          freq = word_count$n, 
          min.freq = 1,
          max.words = 500, 
          random.order = FALSE, 
          rot.per = 0.35,            
          colors=brewer.pal(8, "Dark2")
          )

wordcloud

hmm,genome-wid and gene family expression analysis/identification...这还能说什么呢。可涨点薪吧。

最后，因为植物口出身所以看一下都用了什么植物做研究……

get_source <- function(name){
dta %>% select(Title, ImpactFactor) %>% 
  mutate(lower_title = tolower(Title)) %>% 
  filter(str_detect(lower_title, name)) %>% 
  select(ImpactFactor)
}

list <- c('cotton','arabidopsis', 'tea','potato','rice','tomato','wheat','soybean','cucumber','barley','maize','tobacco','brachypodium','pepper','radish')

list_fruit <- c('pear','apple','citrus','cucumis','sugarcane','strawberry','dimocarpus','grape','litchi','banana','watermelon','jujube','peach','prunus')

index <- data.frame(list.i. = '', n = '')
for (i in 1:length(list)){
  count <- get_source(list[i]) %>% count()
  index_i <- data.frame(list[i],count)
  index <- rbind(index, index_i)
}
index <- index[-1,]
colnames(index) <- c('Species name','Count')
index$Count <- as.numeric(index$Count)
index <- index[order(-index$Count),]
view(index)

Species name	Count
wheat	56
tea	36
cotton	27
rice	26
soybean	18
cucumber	18
potato	15
arabidopsis	12
tomato	10
barley	10
maize	9
tobacco	6
brachypodium	5
pepper	5
radish	5

嗨，我觉得该改标题了，就叫什么？研究小麦的居然还不知道TBtools?好了……本来还想一个滑铲把家族基因统计了，我怕又铲到沟里去，算了……另外还有个包含水果的list（我知道这么分类不专业，别骂了别骂了），感兴趣的可以自己看看。

写在最后

没啥想说的了，最后搞个山鸡图和热图祝大家科研顺利吧……

#Create a new dataset
dta %>% filter(Source != 'Genomics') %>% 
  select(Year, Publisher, Source, ImpactFactor, 17:25) %>% 
  pivot_longer(cols = starts_with('Fun_'),
               names_to = 'Name', 
               values_to = 'Fun') %>% 
  select(Year, Publisher, Source, ImpactFactor, Fun) %>% 
  filter(Fun != '') -> sankey_dta
#Add an IF column
sankey_dta %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General Impact',
                     ifelse(ImpactFactor < 2, 'Have Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific Impact',"no")))) %>% 
  select(Year,Publisher,Source,ImpactFactor, IF,Fun) -> sankey_dta
sankey_dta %>% group_by(Source,ImpactFactor) %>% filter(ImpactFactor>5|n()>10) -> sankey_dta

sankey_dta %>% select(Year,Publisher) %>% 
  group_by(Year,Publisher) %>% 
  summarise(value = n()) -> sankey_index
colnames(sankey_index) <- c('source','target','value')
sankey_index$source <- as.character(sankey_index$source)  

sankey_index <- sankey_dta %>% select(Publisher,Source) %>% 
  group_by(Publisher,Source) %>% 
  summarise(value = n()) %>% 
  rename(.,source = Publisher, target = Source) %>% 
  rbind(sankey_index,.)

sankey_dta %>% select(Source,ImpactFactor) %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
                     ifelse(ImpactFactor < 2, 'Have_Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>% 
  group_by(Source,IF) %>% 
  summarise(value = n()) %>% rename(.,source = Source, target = IF) %>% 
  rbind(sankey_index,.) -> sankey_index

sankey_dta %>% select(ImpactFactor,Fun) %>% 
  mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
                     ifelse(ImpactFactor < 2, 'Have_Impact',
                            ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>% 
  group_by(IF,Fun) %>% 
  summarise(value = n()) %>% rename(.,source = IF, target = Fun) %>% 
  rbind(sankey_index,.) -> sankey_index

sankey_index[order(sankey_index$source,sankey_index$target),]

nodes <- data.frame(
  name=c(as.character(sankey_index$source), as.character(sankey_index$target)) %>% 
    unique()
)

sankey_index$IDsource <- match(sankey_index$source, nodes$name)-1 
sankey_index$IDtarget <- match(sankey_index$target, nodes$name)-1
sankey_index[order(sankey_index$IDsource,sankey_index$IDtarget),]
p <- sankeyNetwork(Links = sankey_index, Nodes = nodes, Source = "IDsource", Target = "IDtarget", 
              Value = "value", NodeID = "name", fontSize = 12, nodeWidth = 20,LinkGroup="source")
p

sankey

热图当然用TBtools做了。最近天气好冷，希望能热起来。嗯

heat.jpg

最后编辑于：2020.12.22 10:07:15

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 160,646评论 4赞 366
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,979评论 1赞 301
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 110,391评论 0赞 250
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,356评论 0赞 215
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,740评论 3赞 293
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,836评论 1赞 224
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 32,022评论 2赞 315
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,764评论 0赞 204
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,487评论 1赞 246
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,728评论 2赞 252
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,200评论 1赞 263
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,548评论 3赞 260
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,217评论 3赞 241
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,134评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,921评论 0赞 201
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,919评论 2赞 283
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,766评论 2赞 274