# 【r<-基础|统计】频数检验

## 方案

1. 频数分布与预期或者理论的分布（比如50%的yes，50%的no）符合吗？（拟合优度检验）
2. 两组或多组之间的频率分布有差异吗？（独立检验）

``````data <- read.table(header=TRUE, text='
condition result
control      0
control      0
control      0
control      0
treatment      1
control      0
control      0
treatment      0
treatment      1
control      1
treatment      1
treatment      1
treatment      1
treatment      1
treatment      0
control      0
control      1
control      0
control      1
treatment      0
treatment      1
treatment      0
treatment      0
control      0
treatment      1
control      0
control      0
treatment      1
treatment      0
treatment      1
')
``````

### 拟合优度检验 （期望频率）

#### 卡方检验

``````#　为result列创建列联表，包含0和1两个值
#　注意“０”和“１”是列名而不是实际的值

ct <- table(data\$result)
ct
#>
#>  0  1
#> 17 13

# 也可以手动创建表格
#ct <- matrix(c(17,13), ncol=2)
#colnames(ct1) <- c("0", "1")

# 执行卡方检验
chisq.test(ct)
#>
#>  Chi-squared test for given probabilities
#>
#> data:  ct
#> X-squared = 0.53333, df = 1, p-value = 0.4652
``````

``````# 概率表 —— 和必须为1
pt <- c(.75, .25)
chisq.test(ct, p=pt)
#>
#>  Chi-squared test for given probabilities
#>
#> data:  ct
#> X-squared = 5.3778, df = 1, p-value = 0.02039
``````

``````chi_res <- chisq.test(ct, p=pt)
# View all the parts that can be extracted
str(chi_res)
#> List of 9
#>  \$ statistic: Named num 5.38
#>   ..- attr(*, "names")= chr "X-squared"
#>  \$ parameter: Named num 1
#>   ..- attr(*, "names")= chr "df"
#>  \$ p.value  : num 0.0204
#>  \$ method   : chr "Chi-squared test for given probabilities"
#>  \$ data.name: chr "ct"
#>  \$ observed : 'table' int [1:2(1d)] 17 13
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  \$ expected : Named num [1:2] 22.5 7.5
#>   ..- attr(*, "names")= chr [1:2] "0" "1"
#>  \$ residuals: table [1:2(1d)] -1.16 2.01
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  \$ stdres   : table [1:2(1d)] -2.32 2.32
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  - attr(*, "class")= chr "htest"

# 获取卡方值
chi_res\$statistic
#> X-squared
#>  5.377778

# 获取p值
chi_res\$p.value
#> [1] 0.02039484
``````

#### 精确二项检验

``````ct <- table(data\$result)
ct
#>
#>  0  1
#> 17 13

binom.test(ct, p=0.5)
#>
#>  Exact binomial test
#>
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.5847
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success
#>              0.5666667

# 使用75%的期望概率——注意1在第二列，所以只需要令p=0.25
binom.test(ct, p=0.25)
#>
#>  Exact binomial test
#>
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.0002157
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success
#>              0.5666667
``````

``````bin_res <- binom.test(ct, p=0.25)
# View all the parts that can be extracted
str(bin_res)
#> List of 9
#>  \$ statistic  : Named num 17
#>   ..- attr(*, "names")= chr "number of successes"
#>  \$ parameter  : Named num 30
#>   ..- attr(*, "names")= chr "number of trials"
#>  \$ p.value    : Named num 0.000216
#>   ..- attr(*, "names")= chr "0"
#>  \$ conf.int   : atomic [1:2] 0.374 0.745
#>   ..- attr(*, "conf.level")= num 0.95
#>  \$ estimate   : Named num 0.567
#>   ..- attr(*, "names")= chr "probability of success"
#>  \$ null.value : Named num 0.25
#>   ..- attr(*, "names")= chr "probability of success"
#>  \$ alternative: chr "two.sided"
#>  \$ method     : chr "Exact binomial test"
#>  \$ data.name  : chr "ct"
#>  - attr(*, "class")= chr "htest"

# 获取p值
bin_res\$p.value
#>            0
#> 0.0002156938

# 获取95%置信区间
bin_res\$conf.int
#> [1] 0.3742735 0.7453925
#> attr(,"conf.level")
#> [1] 0.95
``````

### 独立检验（比较组间）

#### 卡方检验

``````ct <- table(data\$condition, data\$result)
ct
#>
#>              0  1
#>   control   11  3
#>   treatment  6 10

chisq.test(ct)
#>
#>  Pearson's Chi-squared test with Yates' continuity correction
#>
#> data:  ct
#> X-squared = 3.593, df = 1, p-value = 0.05802

chisq.test(ct, correct=FALSE)
#>
#>  Pearson's Chi-squared test
#>
#> data:  ct
#> X-squared = 5.1293, df = 1, p-value = 0.02353
``````

#### Fisher精确检验

``````ct <- table(data\$condition, data\$result)
ct
#>
#>              0  1
#>   control   11  3
#>   treatment  6 10

fisher.test(ct)
#>
#>  Fisher's Exact Test for Count Data
#>
#> data:  ct
#> p-value = 0.03293
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>   0.966861 45.555016
#> sample estimates:
#> odds ratio
#>   5.714369
``````

#### Cochran-Mantel-Haenszel test

Cochran-Mantel-Haenszel 检验 （或称为 Mantel-Haenszel 检验)）用于检验重复测量两离散变量的独立性。通常使用 2x2xK列表表示，K是测量条件的次数。比如你想要指导是否一个处理（C vs. D）是否影响了恢复的概率（yes or no）。假设该处理一天监控测量三次——早上、中午和晚上，而你想要你的检验能够控制它。那么你可以使用CMH检验对2x2x3列联表进行操作，第三个变量是你想要控制的变量。

R中的CMH检验可以处理比2x2xK维度更高的数据，例如你处理3x3xK列联表。

``````fish <- read.table(header=TRUE, text='
Location Allele   Habitat Count
tillamook     94    marine    56
tillamook     94 estuarine    69
tillamook non-94    marine    40
tillamook non-94 estuarine    77
yaquina     94    marine    61
yaquina     94 estuarine   257
yaquina non-94    marine    57
yaquina non-94 estuarine   301
alsea     94    marine    73
alsea     94 estuarine    65
alsea non-94    marine    71
alsea non-94 estuarine    79
umpqua     94    marine    71
umpqua     94 estuarine    48
umpqua non-94    marine    55
umpqua non-94 estuarine    48
')
``````

``````# 制造一个3维的列联表，最后一个变量时要控制的Location变量
ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
ct
#> , , Location = alsea
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            65     73
#>   non-94        79     71
#>
#> , , Location = tillamook
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            69     56
#>   non-94        77     40
#>
#> , , Location = umpqua
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            48     71
#>   non-94        48     55
#>
#> , , Location = yaquina
#>
#>         Habitat
#> Allele   estuarine marine
#>   94           257     61
#>   non-94       301     57

# This prints ct in a "flat" format
ftable(ct)
#>                  Location alsea tillamook umpqua yaquina
#> Allele Habitat
#> 94     estuarine             65        69     48     257
#>        marine                73        56     71      61
#> non-94 estuarine             79        77     48     301
#>        marine                71        40     55      57

# 按指定方式进行变量输出
ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
#>                  Habitat estuarine marine
#> Location  Allele
#> alsea     94                    65     73
#>           non-94                79     71
#> tillamook 94                    69     56
#>           non-94                77     40
#> umpqua    94                    48     71
#>           non-94                48     55
#> yaquina   94                   257     61
#>           non-94               301     57

mantelhaen.test(ct)
#>
#>  Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022
``````

``````# 下面两个看似不同的列联表，实际检验结果相同
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
mantelhaen.test(ct.1)
#>
#>  Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct.1
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022
mantelhaen.test(ct.2)
#>
#>  Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct.2
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022

# 把Allele放到最后，结果不同了
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
mantelhaen.test(ct.3)
#>
#>  Cochran-Mantel-Haenszel test
#>
#> data:  ct.3
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
mantelhaen.test(ct.4)
#>
#>  Cochran-Mantel-Haenszel test
#>
#> data:  ct.4
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16

# 把Habitat放最后，结果也不同
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
mantelhaen.test(ct.5)
#>
#>  Cochran-Mantel-Haenszel test
#>
#> data:  ct.5
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
mantelhaen.test(ct.6)
#>
#>  Cochran-Mantel-Haenszel test
#>
#> data:  ct.6
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
``````

### McNemar检验

McNemar检验概念上是频数数据的一个被试内检验。例如，假设你想要检验是否一个处理增加了一个人对某个问题反应“yes”的概率，而且你只有每个人处理前和处理后的数据。标准的卡方检验将不合适，因为它假设了组别是独立的。取而代之，我们可以使用McNemar检验。该检验仅适用于当存在一个独立变量的两次测量时。用于McNemar的列联表与用于卡方检验的非常相似，但结构上是不同的。

``````data <- read.table(header=TRUE, text='
subject time result
1  pre      0
1 post      1
2  pre      1
2 post      1
3  pre      0
3 post      1
4  pre      1
4 post      0
5  pre      1
5 post      1
6  pre      0
6 post      1
7  pre      0
7 post      1
8  pre      0
8 post      1
9  pre      0
9 post      1
10  pre      1
10 post      1
11  pre      0
11 post      0
12  pre      1
12 post      1
13  pre      0
13 post      1
14  pre      0
14 post      0
15  pre      0
15 post      1
')
``````

``````library(tidyr)

data_wide
#>    subject post pre
#> 1        1    1   0
#> 2        2    1   1
#> 3        3    1   0
#> 4        4    0   1
#> 5        5    1   1
#> 6        6    1   0
#> 7        7    1   0
#> 8        8    1   0
#> 9        9    1   0
#> 10      10    1   1
#> 11      11    0   0
#> 12      12    1   1
#> 13      13    1   0
#> 14      14    0   0
#> 15      15    1   0
``````

``````ct <- table( data_wide[,c("pre","post")] )
ct
#>    post
#> pre 0 1
#>   0 2 8
#>   1 1 4

# 下面是用于标准卡方检验的列联表，注意差别喔

# table(data[,c("time","result")])
#       result
# time    0  1
#   post  3 12
#   pre  10  5
``````

``````mcnemar.test(ct)
#>
#>  McNemar's Chi-squared test with continuity correction
#>
#> data:  ct
#> McNemar's chi-squared = 4, df = 1, p-value = 0.0455
``````

``````library(exact2x2)
mcnemar.exact(ct)
#>
#>  Exact McNemar test (with central confidence intervals)
#>
#> data:  ct
#> b = 8, c = 1, p-value = 0.03906
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>    1.072554 354.981246
#> sample estimates:
#> odds ratio
#>          8
``````

17.2万字 · 11.6万阅读 · 415人关注