# Data Science with R in 4 Weeks - Week 1 - Day 1

#### Week 1 - Day 1

Viewing Data

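(1) Then put it in your root directory; any other directory also works, but you will have to specify the path later.

(2) In the terminal, run  sudo R CMD INSTALL package.tar.gz; remember to replace package with the name of the package you want to use.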
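The same installation can also be done from inside an R session. This is a minimal sketch, assuming the archive is called package.tar.gz as above and sits in the current working directory; the path is a placeholder, not a real package.

```r
# Placeholder path: replace it with the archive you actually downloaded.
pkg_path <- "package.tar.gz"

# repos = NULL tells install.packages() to install from a local file
# instead of a repository; type = "source" marks it as a source archive.
if (file.exists(pkg_path)) {
  install.packages(pkg_path, repos = NULL, type = "source")
} else {
  message("Archive not found: ", pkg_path)
}
```

If the archive lives elsewhere, pass the full path in pkg_path, which is the "specify the path" step mentioned in (1).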

Analyzing Data

x <- count(temp, vars=c("playerID"))  - how many times each playerID appears (count() with a vars argument comes from the plyr package)

nrow(x)  - the number of unique playerIDs
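If plyr is not installed, the same two numbers can be obtained with base R. The sketch below uses a small made-up data frame in place of temp; the player IDs and years are hypothetical and only illustrate the shape of the data.

```r
# Hypothetical stand-in for temp: one row per player-season.
temp <- data.frame(
  playerID = c("a01", "a01", "b02", "c03", "c03", "c03"),
  year     = c(1937, 1938, 1937, 1937, 1938, 1939)
)

# Rows per playerID, comparable to counting on a single column.
per_player <- as.data.frame(table(playerID = temp$playerID),
                            responseName = "freq")

# Number of distinct playerIDs, comparable to nrow(x) above.
n_players <- length(unique(temp$playerID))

print(per_player)
print(n_players)
```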

x <- aggregate(temp, by=list(temp\$playerID), length)

dim(x)

test <- aggregate(temp, by=list(temp\$year), length)

x <- aggregate(temp["year"], by=list(temp\$year), length), which gives:

Group.1 year

1    1937  142

2    1938  124

3    1939  112

4    1940  92

5    1941  95

6    1942  68

y <- aggregate(temp["playerID"], by=list(temp\$year), length)

Group.1 playerID

1    1937      142

2    1938      124

3    1939      112

4    1940      92

5    1941      95

6    1942      68
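The two results above are identical because length counts the rows in each year group, whichever column is passed in. The formula interface of aggregate() expresses the same grouping with named columns instead of Group.1. A minimal sketch on made-up data (the IDs are hypothetical), which also counts distinct players per year in case a player appears more than once in a year:

```r
# Hypothetical stand-in for temp.
temp <- data.frame(
  playerID = c("a01", "b02", "c03", "a01", "c03", "c03"),
  year     = c(1937, 1937, 1937, 1938, 1938, 1939)
)

# Rows per year: length() is applied to the playerID values of each year.
rows_per_year <- aggregate(playerID ~ year, data = temp, FUN = length)

# Distinct players per year.
players_per_year <- aggregate(playerID ~ year, data = temp,
                              FUN = function(ids) length(unique(ids)))

print(rows_per_year)
print(players_per_year)
```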

http://stats.nba.com/player/#!/201599/?p=deandre-jordan

teamdata <- as.data.frame(team) - convert to a data frame

teamdata["new_column"] <- NA   - add a new column filled with NA

teamdata\$new_column <- teamdata\$o_pts / teamdata\$games - compute o_pts per game

newdata <- teamdata[order(-teamdata\$new_column),] - sort by new_column in descending order

teamdata\$new_column <- ifelse(teamdata\$games == 0, NA, teamdata\$o_pts / teamdata\$games) - if games is 0, the result is NA rather than Inf (or NaN for 0 / 0)

summary(teamdata\$new_column) - the maximum turns out to be 126.50

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's

0.00  94.77  102.60  98.32  109.50  126.50      90
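The per-game steps can be tried end to end on a tiny data frame. The team names and numbers below are invented for illustration; only the column names o_pts, games and new_column follow the notes.

```r
# Hypothetical data: Team C has played no games.
teamdata <- data.frame(
  name  = c("Team A", "Team B", "Team C"),
  o_pts = c(8200, 9020, 0),
  games = c(82, 82, 0)
)

# Guard against division by zero: 0 / 0 would be NaN and x / 0 would be Inf.
teamdata$new_column <- ifelse(teamdata$games == 0, NA,
                              teamdata$o_pts / teamdata$games)

# Sort descending; order() places NA values last by default.
newdata <- teamdata[order(-teamdata$new_column), ]

print(newdata)
summary(teamdata$new_column)
```

Computing the guarded column before sorting keeps teams with zero games at the bottom of newdata rather than mixed in with the real values.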

stas <- newdata[,c("name", "new_column")]

name new_column

234          Oakland Oaks  126.4872

555        Denver Nuggets  126.4756

152 Philadelphia Warriors  125.4375

198    Philadelphia 76ers  125.2222

130        Boston Celtics  124.4933

601        Denver Nuggets  123.7439

http://stackoverflow.com/questions/24831580/return-row-of-data-frame-based-on-value-in-a-column-r

y <- stas[which(stas\$new_column == max(stas\$new_column, na.rm = TRUE)), ]

name new_column

234 Oakland Oaks  126.4872

y <- stas[which(stas\$new_column == min(stas\$new_column, na.rm = TRUE)), ]

name new_column

89 Baltimore Bullets          0
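The which() wrapper matters when the column contains NA. A small sketch with invented teams shows the difference: comparing NA to the maximum yields NA, and indexing a data frame with a logical NA produces an extra all-NA row, while which() keeps only the positions that are TRUE.

```r
# Hypothetical data with one missing value.
stas <- data.frame(
  name       = c("Team A", "Team B", "Team C", "Team D"),
  new_column = c(101.5, 126.4, NA, 0)
)

top <- max(stas$new_column, na.rm = TRUE)

# which() drops the NA comparison, so only the matching row is returned.
best  <- stas[which(stas$new_column == top), ]
worst <- stas[which(stas$new_column == min(stas$new_column, na.rm = TRUE)), ]

# Without which(), the NA comparison adds a row filled with NA.
with_na <- stas[stas$new_column == top, ]

print(best)
print(worst)
print(with_na)
```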

teamdata <- as.data.frame(temp)

teamdata\$new_column <- ifelse(teamdata\$games == 0, NA, teamdata\$d_pts / teamdata\$games)

stats <- teamdata[, c("name","year","new_column")]

y <- stats[which(stats\$new_column == max(stats\$new_column, na.rm = TRUE)), ]

y

name year new_column

769 Denver Nuggets 1990  130.7683

teamdata <- as.data.frame(temp)

teamdata\$new_column <- ifelse(teamdata\$games == 0, NA, teamdata\$won / teamdata\$games)

stats <- teamdata[, c("name","year","new_column")]

y <- stats[which(stats\$new_column == max(stats\$new_column, na.rm = TRUE)), ]

name year new_column

1445    Chicago Gears 1947          1

1448 Houston Mavericks 1947          1
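Two rows come back here because both teams tie for the maximum. The comparison with max() returns every tied row, whereas which.max() returns only the first one. A minimal sketch on invented records (team names, years and win counts are hypothetical):

```r
# Hypothetical records: Team A and Team B both won every game they played.
stats <- data.frame(
  name  = c("Team A", "Team B", "Team C"),
  year  = c(1947, 1947, 1948),
  won   = c(8, 3, 20),
  games = c(8, 3, 40)
)
stats$new_column <- ifelse(stats$games == 0, NA, stats$won / stats$games)

# All rows that share the maximum win ratio.
all_best <- stats[which(stats$new_column == max(stats$new_column, na.rm = TRUE)), ]

# Only the first row with the maximum win ratio.
first_best <- stats[which.max(stats$new_column), ]

print(all_best)
print(first_best)
```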

http://stackoverflow.com/questions/1536590/how-to-select-rows-from-data-frame-with-2-conditions

teamdata <- as.data.frame(temp)

teamdata\$new_column <- ifelse(teamdata\$games == 0, NA, teamdata\$won / teamdata\$games)

stats <- teamdata[, c("name","year","new_column")]

z <- subset(stats, stats\$name == "Chicago Bulls")

name year new_column

193 Chicago Bulls 1966  0.4074074

214 Chicago Bulls 1967  0.3536585

238 Chicago Bulls 1968  0.4024390

263 Chicago Bulls 1969  0.4756098

289 Chicago Bulls 1970  0.6219512

317 Chicago Bulls 1971  0.6951220
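The linked question is about filtering on two conditions, which only needs the conditions joined with &. A sketch under assumptions: the three Chicago Bulls rows reuse values from the output above, while "Team X" and the 1970 cutoff are made up for illustration.

```r
# Small example table; the "Team X" row is hypothetical.
stats <- data.frame(
  name       = c("Chicago Bulls", "Chicago Bulls", "Chicago Bulls", "Team X"),
  year       = c(1969, 1970, 1971, 1970),
  new_column = c(0.4756098, 0.6219512, 0.6951220, 0.5)
)

# subset() evaluates the condition inside the data frame,
# so columns can be named without the stats$ prefix.
z2 <- subset(stats, name == "Chicago Bulls" & year >= 1970)

# Equivalent bracket form.
z3 <- stats[stats$name == "Chicago Bulls" & stats$year >= 1970, ]

print(z2)
```

subset() also drops rows where the condition evaluates to NA, which the bracket form does not, so the two forms can differ on data with missing values.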