R_for_Data_Science_Workflow&transform

这是英文版的4、5章节

4 Workflow: basics

  • 括号好东西,能帮你在赋值的时候同时打印变量,这点在Rmarkdown的时候也很有帮助

    y <- seq(1, 10, length.out = 5)
    y #
    > [1] 1.00 3.25 5.50 7.75 10.00
    
    (y <- seq(1, 10, length.out = 5))
    #> [1] 1.00 3.25 5.50 7.75 10.00
    
  • 快捷键 Alt + Shift + K. 会帮你把所有R studio的快捷键呈现出来

    有一点要注意的是,R studio的一些快捷键在输入法状态下是不能用的

5.1 Introduction

  • 这章讲了5个函数

    • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).

    • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).

    • Collapse many values down to a single summary (summarise()).
  • All verbs work similarly:

    1. The first argument is a data frame.

    2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

      # tab可以自动打出变量名(比如这里的year)的,这点很方便,但似乎只有在这种情况下才可以
      
      flights %>% 
        select(., year)
      
      # 这种似乎就不行
      select(flights, year)
      
      # 还有就是记住变量名是没有引号的
      filter(flights, month == 1, day == 1)
      
    3. The result is a new data frame.

  • 5个函数可以和group_by 函数联用,which changes the scope of each function from operating on the entire dataset to operating on it group-by-group

但有时候得记得ungroup

Is ungroup() recommended after every group_by()?

5.2 Filter rows with filter()

  • 浮点数的问题

    sqrt(2) ^ 2 == 2
    #> [1] FALSE
    1 / 49 * 49 == 1
    #> [1] FALSE
    

这是因为计算机是用有限精度来处理这些问题的。碰到这些问题的时候,考虑使用near函数

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
  • comparison operators(比较运算符) 以及 Logical operators(逻辑运算符) 再次记住这两个英文,方便你有问题查询google

  • 这个图片挺好的

    image
  • 在执行 | 的时候有一个有意思的事情

    # 我们会看到下面的输出结果是month==1的时候,而不是我们预期的是11或者12
    > head(filter(flights, month == (11 | 12)), n = 3)
    # A tibble: 3 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    # ... with 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #   distance <dbl>, hour <dbl>, minute <dbl>,
    #   time_hour <dttm>
    
    # 是因为首先
    > 11 | 12
    [1] TRUE
    
    # 然后TRUE在数字环境的时候,就会变成1
    # 那么结果就变成了 month == 1
    > TRUE == 1
    [1] TRUE
    
    
    > head(filter(flights, month == 1), n = 3)
    # A tibble: 3 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    # ... with 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #   distance <dbl>, hour <dbl>, minute <dbl>,
    #   time_hour <dttm>
    
  • x %in% y 是挺好的函数。This will select every row where x is one of the values in y .

    nov_dec <- filter(flights, month %in% c(11, 12))
    

    这会让你的函数会更加的压缩,比如这里有4,11,12的话,相比于 == 而言会更好写一点

    filter(flights, month %in% c(4, 11, 12))
    filter(flights, month == 4 |  month == 11 | month == 12)
    
  • remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

    # 这里原书有错误:weren’t delayed (on arrival or departure) by more than two hours
    # 原书是或,但这里表述的是且
    filter(flights, !(arr_delay > 120 | dep_delay > 120))
    filter(flights, arr_delay <= 120, dep_delay <= 120)
    
  • && 和 || 我们在会后面遇到,记住这里不要在这里用!我们应该用的是 | 和 &

  • Missing Value,即 NA (“not availables”) 。我能想到的NA场景就是在RNA-Seq这类的p-value里面的结果遇到。

    # NA是具有传染性的
    # 因为NA代表着有这个数据,但你不知道是什么,所以对于NA的一切操作结果,你都只能是不知道,即NA
    # 当然,也有例子,见下面
    NA > 5
    #> [1] NA
    10 == NA
    #> [1] NA
    NA + 10
    #> [1] NA
    NA / 2
    #> [1] NA
    
    ------------------------------------------------
    NA == NA
    #> [1] NA
    
    # It’s easiest to understand why this is true with a bit more context:
    # Let x be Mary's age. We don't know how old she is.
    x <- NA
    
    # Let y be John's age. We don't know how old he is.
    y <- NA
    
    # Are John and Mary the same age?
    x == y
    #> [1] NA
    # We don't know!
    --------------------------------------------------
    
    
    is.na(x)
    #> [1] TRUE
    
    # filter只会包含condition是TRUE的,并不会保留 NA 的
    df <- tibble(x = c(1, NA, 3))
    filter(df, x > 1)
    #> # A tibble: 1 x 1
    #> x
    #> <dbl>
    #> 1 3
    
    # 想要NA结果的话得可以自己搞一个
    filter(df, is.na(x) | x > 1)
    #> # A tibble: 2 x 1
    #> x
    #> <dbl>
    #> 1 NA
    #> 2 3
    

  • Exercise 5.2.2

    Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

    关于between的

    Description
    This is a shortcut for x >= left & x <= right, implemented efficiently in C++ for local values, and translated to the appropriate SQL for remote tables.

    Usage
    between(x, left, right)

    # 之前我们写
    filter(flights, month %in% 7:9)
    
    # 现在我们还可以用between
    filter(flights, between(month, 7,9))
    
    # 其实我感觉filter就是把下面步骤整合了
    flights[between(flights$month, 7, 9),]
    
    # 对于between,我们还可以验证标准正态分布的-1到1占据了68.3%的面积
    > x <- rnorm(1e7)
    > length(x[between(x, -1, 1)]) / 1e7
    [1] 0.6827031
    
  • Exercise 5.2.3

    How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

    filter(flights, is.na(dep_time))
    #> # A tibble: 8,255 x 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    #> 1  2013     1     1       NA           1630        NA       NA
    #> 2  2013     1     1       NA           1935        NA       NA
    #> 3  2013     1     1       NA           1500        NA       NA
    #> 4  2013     1     1       NA            600        NA       NA
    #> 5  2013     1     2       NA           1540        NA       NA
    #> 6  2013     1     2       NA           1620        NA       NA
    #> # … with 8,249 more rows, and 12 more variables: sched_arr_time <int>,
    #> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    #> #   minute <dbl>, time_hour <dttm>
    
  • Exercise 5.2.4

    Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counter example!)

# 任何数的0次方都是1
# 这个特性对于scale的时候很有用,因为有时候你对于所有数都是均等进行scale的话,会由于方差是0(因为方差是分母)而无法返回值,这时候你就可以 x - mean(x) / (sd(x) ^ 0) 来解决这个问题了
> NA ^ 0
  [1] 1

# 因为
# anything and FALSE is always FALSE.
# anything or TRUE is always TRUE
> NA | TRUE
[1] TRUE
> FALSE & NA
[1] FALSE

# 而这里就不是了,我们不确定NA是什么,而结果又是会根据NA值的不同而变化的,所以返回值就是NA了
> NA | FALSE
[1] NA
> NA & TRUE
[1] NA


# 关于这个问题,我更喜欢Quaro的回答
# https://www.quora.com/In-R-why-is-NA*0-not-equal-to-0

# 即NA可以代表任何值,可以代表0也可以代表NaN。
# 而 x * 0 == 0 这一特性只在值是有限的时候,
# 而在无限的时候结果则是NaN,即无意义数
# 所以这里等于用NA代表了两种结果
> NA * 0
[1] NA

> Inf * 0
[1] NaN

NaN 和 Null 要区分, NaN代表无意义,即 not a number。而 Null 代表的是一种特殊的对象,表示函数没有被赋予任何内容。

R中NA,NaN,Inf什么意思

R语言初级教程(12): NA、Inf、NaN、NULL 特殊值


5.3 Arrange rows with arrange()

  • arrange排序的时候,缺失值在最后

    df <- tibble(x = c(5, 2, NA))
    arrange(df, x)
    #> # A tibble: 3 x 1
    #> x
    #> <dbl>
    #> 1 2
    #> 2 5
    #> 3 NA
    arrange(df, desc(x))
    #> # A tibble: 3 x 1
    #> x
    #> <dbl>
    #> 1 5
    #> 2 2
    #> 3 NA
    
    

  • Exercise 5.3.1

    How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

    > (df <- tibble(x = c(5, 2, 8, NA, NA, NA),
    +               y = c(4, 3, 1, 4, 3, 8)))
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1     5     4
    2     2     3
    3     8     1
    4    NA     4
    5    NA     3
    6    NA     8
    
    # 默认排序总是会把NA排在最后
    # 然后我们还应该注意的一点是NA对应的4、3、8其实是不会排序的
    #(虽然理论上来说如果x列相同,y列会再一次排序)
    # 我们看起来NA和NA是“一样的”,但前面的知识告诉我们,NA == NA的结果是NA,即不知道,所以这里并没有二次排序
    > arrange(df, x)
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1     2     3
    2     5     4
    3     8     1
    4    NA     4
    5    NA     3
    6    NA     8
    
    # 我看了solution之后,明白了
    # is.na返回的是TRUE或者FALSE,而TRUE在与FALSE比较的时候,TRUE > FALSE
    # 然后再按desc(降序排列,从大到小排序)的话,就是 NA 的在前面了
    # 在is.na排完之后,我们就不再排序了,还是按照原来的顺序 5 -> 2 -> 8
    > arrange(df, desc(is.na(x)))
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1    NA     4
    2    NA     3
    3    NA     8
    4     5     4
    5     2     3
    6     8     1
    
    # 如果你想在TRUE和FALSE排完之后,再做一波排序的话,可以在加上x
    # 那就是先排is.na(x)的结果,is.na(x)排完,再排一次x
    > arrange(df, desc(is.na(x)),x)
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1    NA     4
    2    NA     3
    3    NA     8
    4     2     3
    5     5     4
    6     8     1
    
  • Exercise 5.3.3

    Sort flights to find the fastest flights.

    # 还可以加上数学表达式的结果排序
    flights %>% 
      arrange(., desc(distance / air_time))
    

5.4 Select columns with select()

  • select挑选

    # 你既可以挑选某几列
    # 也可以去掉某几列
    select(flights, -(year:day))
    #> # A tibble: 336,776 x 16
    #>   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
    #>      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
    #> 1      517            515         2      830            819        11 UA     
    #> 2      533            529         4      850            830        20 UA     
    #> 3      542            540         2      923            850        33 AA     
    #> 4      544            545        -1     1004           1022       -18 B6     
    #> 5      554            600        -6      812            837       -25 DL     
    #> 6      554            558        -4      740            728        12 UA     
    #> # … with 3.368e+05 more rows, and 9 more variables: flight <int>,
    #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>, time_hour <dttm>
    
  • There are a number of helper functions you can use within select():

    • starts_with("abc"): matches names that begin with “abc”.

      > flights %>% 
      +   select(., - starts_with("ye"))
      # A tibble: 336,776 x 18
         month   day dep_time sched_dep_time dep_delay arr_time
         <int> <int>    <int>          <int>     <dbl>    <int>
       1     1     1      517            515         2      830
       2     1     1      533            529         4      850
       3     1     1      542            540         2      923
       4     1     1      544            545        -1     1004
       5     1     1      554            600        -6      812
       6     1     1      554            558        -4      740
       7     1     1      555            600        -5      913
       8     1     1      557            600        -3      709
       9     1     1      557            600        -3      838
      10     1     1      558            600        -2      753
      # ... with 336,766 more rows, and 12 more variables:
      #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
      #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
      #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
      #   time_hour <dttm>
      
    • ends_with("xyz"): matches names that end with “xyz”.

    • contains("ijk"): matches names that contain “ijk”.

    • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

    • num_range("x", 1:3): matches x1, x2 and x3.

    See ?select for more details.

  • rename 和 select的区别

    # 可以看到rename好用多了
    > rename(as_tibble(iris), petal_length = Petal.Length)
    # A tibble: 150 x 5
       Sepal.Length Sepal.Width petal_length Petal.Width Species
              <dbl>       <dbl>        <dbl>       <dbl> <fct>  
     1          5.1         3.5          1.4         0.2 setosa 
     2          4.9         3            1.4         0.2 setosa 
     3          4.7         3.2          1.3         0.2 setosa 
     4          4.6         3.1          1.5         0.2 setosa 
     5          5           3.6          1.4         0.2 setosa 
     6          5.4         3.9          1.7         0.4 setosa 
     7          4.6         3.4          1.4         0.3 setosa 
     8          5           3.4          1.5         0.2 setosa 
     9          4.4         2.9          1.4         0.2 setosa 
    10          4.9         3.1          1.5         0.1 setosa 
    # ... with 140 more rows
    
    # select只会保留你选的
    > select(as_tibble(iris), petal_length = Petal.Length)
    # A tibble: 150 x 1
       petal_length
              <dbl>
     1          1.4
     2          1.4
     3          1.3
     4          1.5
     5          1.4
     6          1.7
     7          1.4
     8          1.5
     9          1.4
    10          1.5
    # ... with 140 more rows
    
  • everything也是很好用的

    # 可以把某几列调到前面
    select(flights, time_hour, air_time, everything())
    #> # A tibble: 336,776 x 19
    #>   time_hour           air_time  year month   day dep_time sched_dep_time
    #>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
    #> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
    #> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
    #> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
    #> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
    #> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
    #> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
    #> # … with 3.368e+05 more rows, and 12 more variables: dep_delay <dbl>,
    #> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
    #> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>
    

  • Exercise 5.4.1

    Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

    来自solution

    # 关于变量名加不加引号这点,我一直没怎么搞清楚
    # 不过后面我搞清楚了一点点
    select(flights, dep_time, dep_delay, arr_time, arr_delay)
    select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
    
    select(flights, 4, 6, 7, 9)
    
    variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
    select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
    select(flights, one_of(variables))
    
    select(flights, starts_with("dep_"), starts_with("arr_"))
    
  • Exercise 5.4.2

    What happens if you include the name of a variable multiple times in a select() call?

    # 这点对于我们的everything特别有用
    # 即select变量名冗余会只保留一个
    select(flights, year, month, day, year, year)
    #> # A tibble: 336,776 x 3
    #>    year month   day
    #>   <int> <int> <int>
    #> 1  2013     1     1
    #> 2  2013     1     1
    #> 3  2013     1     1
    #> 4  2013     1     1
    #> 5  2013     1     1
    #> 6  2013     1     1
    #> # … with 3.368e+05 more rows
    
    # arr_delay这个变量是冗余的
    select(flights, arr_delay, everything())
    #> # A tibble: 336,776 x 19
    #>   arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
    #>       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
    #> 1        11  2013     1     1      517            515         2      830
    #> 2        20  2013     1     1      533            529         4      850
    #> 3        33  2013     1     1      542            540         2      923
    #> 4       -18  2013     1     1      544            545        -1     1004
    #> 5       -25  2013     1     1      554            600        -6      812
    #> 6        12  2013     1     1      554            558        -4      740
    #> # … with 3.368e+05 more rows, and 11 more variables: sched_arr_time <int>,
    #> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
    #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
    #> #   time_hour <dttm>
    
  • Exercise 5.4.3

    What does the one_of() function do? Why might it be helpful in conjunction with this vector?

    这个solution部分的解答非常值得好好看

    vars <- c("year", "month", "day", "dep_delay", "arr_delay")
    select(flights, one_of(vars))
    #> # A tibble: 336,776 x 5
    #>    year month   day dep_delay arr_delay
    #>   <int> <int> <int>     <dbl>     <dbl>
    #> 1  2013     1     1         2        11
    #> 2  2013     1     1         4        20
    #> 3  2013     1     1         2        33
    #> 4  2013     1     1        -1       -18
    #> 5  2013     1     1        -6       -25
    #> 6  2013     1     1        -4        12
    #> # … with 3.368e+05 more rows
    
    select(flights, vars)
    #> # A tibble: 336,776 x 5
    #>    year month   day dep_delay arr_delay
    #>   <int> <int> <int>     <dbl>     <dbl>
    #> 1  2013     1     1         2        11
    #> 2  2013     1     1         4        20
    #> 3  2013     1     1         2        33
    #> 4  2013     1     1        -1       -18
    #> 5  2013     1     1        -6       -25
    #> 6  2013     1     1        -4        12
    #> # … with 3.368e+05 more rows
    

    如果 vars 是flight里面的变量,那么就会返回名字是vars的变量列,如果不是flight里面的变量,就会查找 vars的值,对应的变量列

    # 举个例子
    > year <- "month"
    > flights %>% 
    +   select(., year) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
       year
      <int>
    1  2013
    2  2013
    
    > year_another <- "month"
    > flights %>% 
    +   select(., year_another) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
      month
      <int>
    1     1
    2     1
    
    > year_second <- "month_another"
    > flights %>% 
    +   select(., year_second) %>% 
    +   head(., n = 2)
    Error: Unknown column `month_another` 
    Run `rlang::last_error()` to see where the error occurred.
    

    如果要消除这种“混淆”,利用 !!!

    > flights %>% 
    +   select(., !!!year) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
      month
      <int>
    1     1
    2     1
    

    This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.

  • Exercise 5.4.4

    Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))
    #> # A tibble: 336,776 x 6
    #>   dep_time sched_dep_time arr_time sched_arr_time air_time
    #>      <int>          <int>    <int>          <int>    <dbl>
    #> 1      517            515      830            819      227
    #> 2      533            529      850            830      227
    #> 3      542            540      923            850      160
    #> 4      544            545     1004           1022      183
    #> 5      554            600      812            837      116
    #> 6      554            558      740            728      150
    #> # … with 3.368e+05 more rows, and 1 more variable: time_hour <dttm>
    

    这是因为大小写不敏感的问题

    select(flights, contains("TIME", ignore.case = FALSE))
    #> # A tibble: 336,776 x 0
    

5.5 Add new variables with mutate()

  • 我喜欢Mutate这个特性

    Note that you can refer to columns that you’ve just created:

    flights_sml <- select(flights, 
      year:day, 
      ends_with("delay"), 
      distance, 
      air_time
    )
    mutate(flights_sml,
      gain = dep_delay - arr_delay,
      speed = distance / air_time * 60
    )
    #> # A tibble: 336,776 x 9
    #>    year month   day dep_delay arr_delay distance air_time  gain speed
    #>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
    #> 1  2013     1     1         2        11     1400      227    -9  370.
    #> 2  2013     1     1         4        20     1416      227   -16  374.
    #> 3  2013     1     1         2        33     1089      160   -31  408.
    #> 4  2013     1     1        -1       -18     1576      183    17  517.
    #> 5  2013     1     1        -6       -25      762      116    19  394.
    #> 6  2013     1     1        -4        12      719      150   -16  288.
    #> # … with 3.368e+05 more rows
    
    mutate(flights_sml,
      gain = dep_delay - arr_delay,
      hours = air_time / 60,
      gain_per_hour = gain / hours
    )
    #> # A tibble: 336,776 x 10
    #>    year month   day dep_delay arr_delay distance air_time  gain hours
    #>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
    #> 1  2013     1     1         2        11     1400      227    -9  3.78
    #> 2  2013     1     1         4        20     1416      227   -16  3.78
    #> 3  2013     1     1         2        33     1089      160   -31  2.67
    #> 4  2013     1     1        -1       -18     1576      183    17  3.05
    #> 5  2013     1     1        -6       -25      762      116    19  1.93
    #> 6  2013     1     1        -4        12      719      150   -16  2.5 
    #> # … with 3.368e+05 more rows, and 1 more variable: gain_per_hour <dbl>
    
  • 如果你只想保留新产生的变量

    transmute(flights,
      gain = dep_delay - arr_delay,
      hours = air_time / 60,
      gain_per_hour = gain / hours
    )
    #> # A tibble: 336,776 x 3
    #>    gain hours gain_per_hour
    #>   <dbl> <dbl>         <dbl>
    #> 1    -9  3.78         -2.38
    #> 2   -16  3.78         -4.23
    #> 3   -31  2.67        -11.6 
    #> 4    17  3.05          5.57
    #> 5    19  1.93          9.83
    #> 6   -16  2.5          -6.4 
    #> # … with 3.368e+05 more rows
    
  • 有许多函数你都可以和mutate联用,从而创造新的变量。这些函数的关键特性就是向量化,即你输入一个向量,输出也是一个向量,输入输出向量里面包含的值是对应的。

    • Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

      Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.

    • Modular arithmetic(所谓的模运算): %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:

      transmute(flights,
        dep_time,
        hour = dep_time %/% 100,
        minute = dep_time %% 100
      )
      #> # A tibble: 336,776 x 3
      #>   dep_time  hour minute
      #>      <int> <dbl>  <dbl>
      #> 1      517     5     17
      #> 2      533     5     33
      #> 3      542     5     42
      #> 4      544     5     44
      #> 5      554     5     54
      #> 6      554     5     54
      #> # … with 3.368e+05 more rows
      
    • Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive(累乘关系变成累加关系,最大似然估计那边会用到吧), a feature we’ll come back to in modelling.

      All else being equal, I recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.(对数标度下的数值增加 1 个单位,意味着初始数值加倍;减少 1 个单位,则意味着初始数值减半。常见的就是log2FoldChange了)

    • Offsets: lead() and lag() allow you to refer to leading or lagging values.(感觉是整体数据往前或者往后挪一格) This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

      (x <- 1:10)
      #>  [1]  1  2  3  4  5  6  7  8  9 10
      lag(x)
      #>  [1] NA  1  2  3  4  5  6  7  8  9
      lead(x)
      #>  [1]  2  3  4  5  6  7  8  9 10 NA
      
    • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

      x
      #>  [1]  1  2  3  4  5  6  7  8  9 10
      
      # 累积求和
      cumsum(x)
      #>  [1]  1  3  6 10 15 21 28 36 45 55
      
      # 累积求平均
      cummean(x)
      #>  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
      

      RcppRoll包的使用(我觉得RcppRoll包对于一些基因组的数据应该很有用)

      install.packages("RcppRoll")
      library(RcppRoll)
      
      # 看参数说明
      # n 滚动求和窗口的大小
      # by 表示每次窗口移动的距离
      
      # 也就是说如果 n = by的话,就是我们常见的bw压缩或者分bin求值的基因组了吧
      
      (x <- 1:10)
      [1]  1  2  3  4  5  6  7  8  9 10
      
      > roll_sum(x, n = 3, by = 3)
      [1]  6 15 24
      
      > roll_sum(x, n = 3, by = 2)
      [1]  6 12 18 24
      
      > roll_sum(x, n = 3, by = 1)
      [1]  6  9 12 15 18 21 24 27
      
      

      又让我想起了Sliding window 和 fixed windows。
      我估计是sliding windows会有bin(这里的bin经常和window是混用的)(bin就是RcppRoll的n)和step(Step就是RcppRoll的by)两个参数。但fixed windows只有一个参数,即bin,应该就没有step了,或者说step等于window。

      或者说 sliding windows 又会分overlap or be disjoint。overlap就是我们step < bin,而 disjoint 就是 step >= bin 了。

      参考

  • Logical comparisons, <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

  • Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

    y <- c(1, 2, 2, NA, 3, 4)
    min_rank(y)
    #> [1]  1  2  2 NA  4  5
    min_rank(desc(y))
    #> [1]  5  3  3 NA  2  1
    

    If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.

    row_number(y) 
    # row_number(): equivalent to rank(ties.method = "first")
    #> [1]  1  2  3 NA  4  5
    
    dense_rank(y) # dense_rank(): like min_rank(), but with no gaps between ranks
    #> [1]  1  2  2 NA  3  4
     
    percent_rank(y) # 这是分位数(向量化版本)
    #> [1] 0.00 0.25 0.25   NA 0.75 1.00
    cume_dist(y) # 这是累积分布函数(向量化版本)
    #> [1] 0.2 0.6 0.6  NA 0.8 1.0
    

    Exercise 5.5.4的solution有一个解释我觉得蛮好的

    rankme <- tibble(
      x = c(10, 5, 1, 5, 5)
    )
    
    rankme <- mutate(rankme,
      x_row_number = row_number(x),
      x_min_rank = min_rank(x),
      x_dense_rank = dense_rank(x)
    )
    arrange(rankme, x)
    #> # A tibble: 5 x 4
    #>       x x_row_number x_min_rank x_dense_rank
    #>   <dbl>        <int>      <int>        <int>
    #> 1     1            1          1            1
    #> 2     5            2          2            2
    #> 3     5            3          2            2
    #> 4     5            4          2            2
    #> 5    10            5          5            3
    

  • Exercise 5.5.1

    Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

    看了solution才理解,这题意思应该是,现在 dep_time 里面的时间,表示很方便(517代表了5:17), 但计算很麻烦(517-417是100,但代表的不是100min,而是60分钟)。所以我们应该进行转换。转换成距离午夜12点的时间

    # 以1504(15:04)举例
    # 这里产生的904,就是距离午夜12点(24:00)904min
    1504 %/% 100 * 60 + 1504 %% 100
    #> [1] 904
    
    # 然后有一个小问题就是,距离午夜1440分钟,恰好是午夜12点,刚好一个轮回
    # 那么应该是距离午夜12点 0分钟,而不是1440
    # 所以最后是变成了
    (1504 %/% 100 * 60 + 1504 %% 100) %% 1440
    
    
    # 转换的话
    flights %>% 
      select(., dep_time, sched_dep_time) %>% 
      mutate(., dep_time_minus = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440, 
             sched_dep_time_minus = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
    # A tibble: 336,776 x 4
       dep_time sched_dep_time dep_time_minus sched_dep_time_minus
          <int>          <int>          <dbl>                <dbl>
     1      517            515            317                  315
     2      533            529            333                  329
     3      542            540            342                  340
     4      544            545            344                  345
     5      554            600            354                  360
     6      554            558            354                  358
     7      555            600            355                  360
     8      557            600            357                  360
     9      557            600            357                  360
    10      558            600            358                  360
    # ... with 336,766 more rows
    
    # 还可以写成函数
    time2mins <- function(x) {
      (x %/% 100 * 60 + x %% 100) %% 1440
    }
    
  • Exercise 5.5.6

    What trigonometric functions does R provide?

    这部分solution写的特别详细,大家有需要的自己去看就行


5.6 Grouped summaries with summarise()

  • It collapses a data frame to a single row

    summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

    summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
    #> # A tibble: 1 x 1
    #>   delay
    #>   <dbl>
    #> 1  12.6
    
    
    # 记得NA.rm,有传染性
    flights %>% 
      group_by(year, month, day) %>% 
      summarise(mean = mean(dep_delay))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day  mean
    #>   <int> <int> <int> <dbl>
    #> 1  2013     1     1    NA
    #> 2  2013     1     2    NA
    #> 3  2013     1     3    NA
    #> 4  2013     1     4    NA
    #> 5  2013     1     5    NA
    #> 6  2013     1     6    NA
    #> # … with 359 more rows
    
    

    我有时候感觉group就相当于把观测变得有层次了,比如上面就是先根据year划一层,然后根据month划一层,最后根据day划一层。

  • Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data.

  • Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:

    • Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

      It’s sometimes useful to combine aggregation with logical subsetting. We haven’t talked about this sort of subsetting yet, but you’ll learn more about it in subsetting.

      not_cancelled %>% 
        group_by(year, month, day) %>% 
        summarise(
          avg_delay1 = mean(arr_delay),
          avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
        )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day avg_delay1 avg_delay2
      #>   <int> <int> <int>      <dbl>      <dbl>
      #> 1  2013     1     1      12.7        32.5
      #> 2  2013     1     2      12.7        32.0
      #> 3  2013     1     3       5.73       27.7
      #> 4  2013     1     4      -1.93       28.3
      #> 5  2013     1     5      -1.53       22.6
      #> 6  2013     1     6       4.24       24.4
      #> # … with 359 more rows
      
    • Measures of spread: sd(x), IQR(x)(IQR就是boxplot里面的那个上四分位点 - 下四分位点), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

      
      # Why is distance to some destinations more variable than to others?
      not_cancelled %>% 
      group_by(dest) %>% 
        summarise(distance_sd = sd(distance)) %>% 
        arrange(desc(distance_sd))
      #> # A tibble: 104 x 2
      #>   dest  distance_sd
      #>   <chr>       <dbl>
      #> 1 EGE         10.5 
      #> 2 SAN         10.4 
      #> 3 SFO         10.2 
      #> 4 HNL         10.0 
      #> 5 SEA          9.98
      #> 6 LAS          9.91
      #> # … with 98 more rows
      

      mad函数是做什么用的?

    • Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

      
      # When do the first and last flights leave each day?
      not_cancelled %>% 
      group_by(year, month, day) %>% 
        summarise(
        first = min(dep_time),
          last = max(dep_time)
      )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day first  last
      #>   <int> <int> <int> <int> <int>
      #> 1  2013     1     1   517  2356
      #> 2  2013     1     2    42  2354
      #> 3  2013     1     3    32  2349
      #> 4  2013     1     4    25  2358
      #> 5  2013     1     5    14  2357
      #> 6  2013     1     6    16  2355
      #> # … with 359 more rows
      
    • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

      not_cancelled %>% 
        group_by(year, month, day) %>% 
        summarise(
          first_dep = first(dep_time), 
        last_dep = last(dep_time)
        )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day first_dep last_dep
      #>   <int> <int> <int>     <int>    <int>
      #> 1  2013     1     1       517     2356
      #> 2  2013     1     2        42     2354
      #> 3  2013     1     3        32     2349
      #> 4  2013     1     4        25     2358
      #> 5  2013     1     5        14     2357
      #> 6  2013     1     6        16     2355
      #> # … with 359 more rows
      

      These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

      not_cancelled %>% 
        group_by(year, month, day) %>% 
        mutate(r = min_rank(desc(dep_time))) %>% 
        filter(r %in% range(r))
      #> # A tibble: 770 x 20
      #> # Groups:   year, month, day [365]
      #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
      #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
      #> 1  2013     1     1      517            515         2      830            819
      #> 2  2013     1     1     2356           2359        -3      425            437
      #> 3  2013     1     2       42           2359        43      518            442
      #> 4  2013     1     2     2354           2359        -5      413            437
      #> 5  2013     1     3       32           2359        33      504            442
      #> 6  2013     1     3     2349           2359       -10      434            445
      #> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
      #> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
      #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
      
      # 上面的不太清楚
      # 其实这就等于是把最大的dep_time和最小的dep_time同时输出了
      > not_cancelled %>%
      +   group_by(year, month, day) %>% 
      +   select(., year:day, dep_time) %>% 
      +   mutate(r = min_rank(desc(dep_time))) %>%
      +   filter(r %in% range(r))
      # A tibble: 770 x 5
      # Groups:   year, month, day [365]
          year month   day dep_time     r
         <int> <int> <int>    <int> <int>
       1  2013     1     1      517   831
       2  2013     1     1     2356     1
       3  2013     1     2       42   928
       4  2013     1     2     2354     1
       5  2013     1     3       32   900
       6  2013     1     3     2349     1
       7  2013     1     4       25   908
       8  2013     1     4     2358     1
       9  2013     1     4     2358     1
      10  2013     1     5       14   717
      # ... with 760 more rows
      
    • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).(这函数可以唉,类似于Linux里面的uniq -c?)

      # Which destinations have the most carriers?
      not_cancelled %>% 
        group_by(dest) %>% 
        summarise(carriers = n_distinct(carrier)) %>% 
        arrange(desc(carriers))
      #> # A tibble: 104 x 2
      #>   dest  carriers
      #>   <chr>    <int>
      #> 1 ATL          7
      #> 2 BOS          7
      #> 3 CLT          7
      #> 4 ORD          7
      #> 5 TPA          7
      #> 6 AUS          6
      #> # … with 98 more rows
      

      Counts are so useful that dplyr provides a simple helper if all you want is a count:

      # 你可以看到not_cancelled是没有group的
      > not_cancelled
      # A tibble: 327,346 x 19
      
      # count自动完成了group_by+n()
      not_cancelled %>% 
        count(dest)
      #> # A tibble: 104 x 2
      #>   dest      n
      #>   <chr> <int>
      #> 1 ABQ     254
      #> 2 ACK     264
      #> 3 ALB     418
      #> 4 ANC       8
      #> 5 ATL   16837
      #> 6 AUS    2411
      #> # … with 98 more rows
      
      # 我其实感觉 n() 就类似于 length() 吧
      # 来自solution的exercise 5.6.2
      > not_cancelled %>%
      +   group_by(dest) %>%
      +   summarise(n = length(dest)) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        dest      n
        <chr> <int>
      1 ABQ     254
      2 ACK     264
      3 ALB     418
      
      > not_cancelled %>%
      +   group_by(dest) %>%
      +   summarise(n = n()) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        dest      n
        <chr> <int>
      1 ABQ     254
      2 ACK     264
      3 ALB     418
      
      

      You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

      之前是数每个group下的个数,这里是对每个group下,你指定的varible进行求和

      
      # 数个数
      > not_cancelled %>% 
      +   count(tailnum) %>% 
      +   head(., n = 3)
      # A tibble: 3 x 2
        tailnum     n
        <chr>   <int>
      1 D942DN      4
      2 N0EGMQ    352
      3 N10156    145
      
      > not_cancelled %>% 
      +   group_by(tailnum) %>% 
      
      +   summarise(n = n()) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        tailnum     n
        <chr>   <int>
      1 D942DN      4
      2 N0EGMQ    352
      3 N10156    145
      
      
      # 求和
      > not_cancelled %>%
      +   count(tailnum, wt = distance) %>% 
      +   head(., n = 3)
      # A tibble: 3 x 2
        tailnum      n
        <chr>    <dbl>
      1 D942DN    3418
      2 N0EGMQ  239143
      3 N10156  109664
      
      > not_cancelled %>% 
      +   select(tailnum, distance) %>% 
      +   group_by(tailnum) %>%
      +   summarise(n = sum(distance)) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        tailnum      n
        <chr>    <dbl>
      1 D942    3418
      2 N0EGMQ  239143
      3 N10156  109664
      
      
    • Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

    这个操作是很好的,举个最简单的例子

    # 下面这个例子可以用来说明
    # 正态分布的中位数是0
    > set.seed(19960203)
    > mean(rnorm(1E6) > 0)
    [1] 0.500266
    
    > set.seed(19960203)
    > sum(rnorm(1E6) > 0) / 1E6
    [1] 0.500266
    
    
    # How many flights left before 5am? (these usually indicate   delayed
    # flights from the previous day)
    not_cancelled %>% 
      group_by(year, month, day) %>% 
        summarise(n_early = sum(dep_time < 500))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day n_early
    #>   <int> <int> <int>   <int>
    #> 1  2013     1     1       0
    #> 2  2013     1     2       3
    #> 3  2013     1     3       4
    #> 4  2013     1     4       3
    #> 5  2013     1     5       3
    #> 6  2013     1     6       2
    #> # … with 359 more rows
    
    # What proportion of flights are delayed by more than an hour?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(hour_prop = mean(arr_delay > 60))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day hour_prop
    #>   <int> <int> <int>     <dbl>
    #> 1  2013     1     1    0.0722
    #> 2  2013     1     2    0.0851
    #> 3  2013     1     3    0.0567
    #> 4  2013     1     4    0.0396
    #> 5  2013     1     5    0.0349
    #> 6  2013     1     6    0.0470
    #> # … with 359 more rows
    
  • 有意思的功能:当使用多个变量进行分组时,每次的摘要统计会用掉一个分组变量。这样就可以轻松地对数据集进行循序渐进的分析:

    daily <- group_by(flights, year, month, day)
    (per_day   <- summarise(daily, flights = n()))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day flights
    #>   <int> <int> <int>   <int>
    #> 1  2013     1     1     842
    #> 2  2013     1     2     943
    #> 3  2013     1     3     914
    #> 4  2013     1     4     915
    #> 5  2013     1     5     720
    #> 6  2013     1     6     832
    #> # … with 359 more rows
    (per_month <- summarise(per_day, flights = sum(flights)))
    #> # A tibble: 12 x 3
    #> # Groups:   year [1]
    #>    year month flights
    #>   <int> <int>   <int>
    #> 1  2013     1   27004
    #> 2  2013     2   24951
    #> 3  2013     3   28834
    #> 4  2013     4   28330
    #> 5  2013     5   28796
    #> 6  2013     6   28243
    #> # … with 6 more rows
    (per_year  <- summarise(per_month, flights = sum(flights)))
    #> # A tibble: 1 x 2
    #>    year flights
    #>   <int>   <int>
    #> 1  2013  336776
    

在循序渐进地进行摘要分析时,需要小心:使用求和与计数操作是没问题的,但如果想要使用加权平均和方差的话,就要仔细考虑一下,在基于秩的统计数据(如中位数)上是无法进行这些操作的。换句话说,对分组求和的结果再求和就是对整体求和,但分组中位数的中位数可不是整体的中位数。(这话来自中文版)

我觉得意思就是

# 整体求和 和 把整体分割成组,然后分组求和的结果是一样的
# 这个你可以通过自己列公式看出来
> sum(1:8)
[1] 36
> sum(sum(1:4),sum(5:6),sum(7:8))
[1] 36

# 但整体求中位数 和 把整体分割成组,然后分组求中位数,再求中位数的结果是不一样的
> median(1:8)
[1] 4.5
> median(median(1:4),median(5:6),median(7:8))
[1] 2.5

# 均值同理
> mean(1:8)
[1] 4.5
> mean(mean(1:4),mean(5:6),mean(7:8))
[1] 2.5

  • Exercise 5.6.2

    Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

    not_cancelled <- flights %>%
      filter(!is.na(dep_delay), !is.na(arr_delay))
    
    > not_cancelled %>%
    +   count(dest) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   summarise(n = length(dest)) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    --------------------------------------------------------------------------------------------------
    
    # 其实group_by加summarise我觉得可以认为是下面这个操作的多次循环
    # 以ABQ为例
    > not_cancelled %>% 
    +   filter(., dest == "ABQ") %>% 
    +   nrow()
    [1] 254
    
    
    ------------------------------------------------------------------------------------------------------
    
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   summarise(n = n()) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    # Another alternative to count() is to use the combination of the group_by() and tally() verbs. 
    # In fact, count() is effectively a short-cut for group_by() followed by tally().
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   tally() %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
    <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
> not_cancelled %>%
+   count(tailnum, wt = distance) %>% 
+   head(n = 3)
# A tibble: 3 x 2
  tailnum      n
  <chr>    <dbl>
1 D942DN    3418
2 N0EGMQ  239143
3 N10156  109664

> not_cancelled %>%
+   group_by(tailnum) %>% 
+   summarise(n = sum(distance)) %>% 
+   head(n = 3)
# A tibble: 3 x 2
  tailnum      n
  <chr>    <dbl>
1 D942DN    3418
2 N0EGMQ  239143
3 N10156  109664

# 同样举个例子
> not_cancelled %>%
+   filter(., tailnum == "D942DN") %>% 
+   pull(distance) %>% 
+   sum()
[1] 3418

# Like the previous example, we can also use the combination group_by() and tally(). 
# Any arguments to tally() are summed.
> not_cancelled %>%
+   group_by(tailnum) %>%
+   tally(distance) %>% 
+   head(n = 3)
# A tibble: 3 x 2
  tailnum      n
  <chr>    <dbl>
1 D942DN    3418
2 N0EGMQ  239143
3 N10156  109664
  • Exercise 5.6.6

    What does the sort argument to count() do? When might you use it?

    > not_cancelled %>%
    +   count(dest, sort = T) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ATL   16837
    2 ORD   16566
    3 LAX   16026
    
    # 等价于
    
    > not_cancelled %>%
    +   count(dest) %>% 
    +   arrange(., desc(n)) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ATL   16837
    2 ORD   16566
    3 LAX   16026
    

5.7 Grouped mutates (and filters)

  • 虽然与 summarize() 函数结合起来使用是最有效的,但分组也可以与 mutate() 和 filter()函数结合,以完成非常便捷的操作。(中文版这段话)

    flights_sml <- select(flights, 
      year:day, 
      ends_with("delay"), 
      distance, 
      air_time
    )
    
    # Find the worst members of each group:
    # 这里对arr_delay进行排序,按从大到小排。然后再得到rank,rank越靠前就是arr_delay越大的
    # 这里就是得到每组前10多延迟的
    flights_sml %>% 
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10) %>% 
      head(n = 3)
    # A tibble: 3 x 7
    # Groups:   year, month, day [1]
       year month   day dep_delay arr_delay distance air_time
      <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    1  2013     1     1       853       851      184       41
    2  2013     1     1       290       338     1134      213
    3  2013     1     1       260       263      266       46
    
    
    # Find all groups bigger than a threshold:
    popular_dests <- flights %>% 
      group_by(dest) %>% 
      filter(n() > 365)
    
    # Standardise to compute per group metrics:
    # 这里等于是计算了百分比
    # 始终记住:向量化,向量化
    popular_dests %>% 
      filter(arr_delay > 0) %>% 
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
      select(year:day, dest, arr_delay, prop_delay)
    #> # A tibble: 131,106 x 6
    #> # Groups:   dest [77]
    #>    year month   day dest  arr_delay prop_delay
    #>   <int> <int> <int> <chr>     <dbl>      <dbl>
    #> 1  2013     1     1 IAH          11  0.000111 
    #> 2  2013     1     1 IAH          20  0.000201 
    #> 3  2013     1     1 MIA          33  0.000235 
    #> 4  2013     1     1 ORD          12  0.0000424
    #> 5  2013     1     1 FLL          19  0.0000938
    #> 6  2013     1     1 ORD           8  0.0000283
    #> # … with 1.311e+05 more rows
    
    # 这样可能直观一点
    > popular_dests %>% 
    +   filter(arr_delay > 0) %>% 
    +   mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
    +   select(year:day, dest, arr_delay, prop_delay) %>% 
    +   arrange(dest)
    # A tibble: 131,106 x 6
    # Groups:   dest [77]
        year month   day dest  arr_delay prop_delay
       <int> <int> <int> <chr>     <dbl>      <dbl>
     1  2013     1     1 ALB          40   0.00418 
     2  2013     1     1 ALB          44   0.00459 
     3  2013     1     2 ALB          71   0.00741 
     4  2013     1     2 ALB          82   0.00856 
     5  2013     1     3 ALB          40   0.00418 
     6  2013     1     4 ALB          30   0.00313 
     7  2013     1     6 ALB          95   0.00992 
     8  2013     1     6 ALB           4   0.000418
     9  2013     1     7 ALB          41   0.00428 
    10  2013     1    10 ALB         120   0.0125  
    # ... with 131,096 more rows
    
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 159,458评论 4 363
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 67,454评论 1 294
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 109,171评论 0 243
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 44,062评论 0 207
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 52,440评论 3 287
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 40,661评论 1 219
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 31,906评论 2 313
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 30,609评论 0 200
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 34,379评论 1 246
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 30,600评论 2 246
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 32,085评论 1 261
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 28,409评论 2 254
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 33,072评论 3 237
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 26,088评论 0 8
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 26,860评论 0 195
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 35,704评论 2 276
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 35,608评论 2 270