# Main text

Scaling and normalization are easy to confuse because the terms are sometimes used interchangeably and because the two operations are very similar. In both cases, you transform the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that scaling changes the range of your data, while normalization changes the shape of its distribution. Let's look at each option in more depth.

# Concepts

## Scale

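Scaling means transforming your data so that it fits within a specified range, such as 1-100 or 0-1. You need scaling when you use a method that depends on the magnitude of the values, such as SVM or KNN.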

Figure: scaling example
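As a rough illustration of rescaling to a fixed range, here is a minimal min-max sketch in base R. The helper name `rescale_range` and the sample values are assumptions made for this example; they do not come from the original post or from any package.

```r
# Min-max scaling: map a numeric vector onto [new_min, new_max].
# `rescale_range` is a hypothetical helper name, not a base R function.
rescale_range <- function(v, new_min = 0, new_max = 1) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    stop("cannot rescale a constant vector")
  }
  (v - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min) + new_min
}

heights_cm <- c(150, 160, 170, 180, 190)          # made-up example values
scaled_01  <- rescale_range(heights_cm)           # 0.00 0.25 0.50 0.75 1.00
scaled_100 <- rescale_range(heights_cm, 1, 100)   # 1.00 25.75 50.50 75.25 100.00
```

The shape of the data is unchanged: the points keep their relative spacing and only the range moves.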

## Normalization

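Scaling only changes the range of your data; normalization is a more radical transformation.
The goal of normalization is to transform your data so that it follows a normal distribution, so that it can be used in downstream analyses (t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes).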
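One possible way to see a change in distribution shape is a log transform applied to simulated right-skewed data. This is only a sketch under assumptions: the post does not say which transformation it has in mind, the data here are simulated with `rlnorm()`, and `skewness` is a small helper defined for this example rather than a base R function.

```r
set.seed(42)                                     # fixed seed so the sketch is repeatable
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)   # simulated right-skewed data

# Sample skewness; a helper defined here for illustration, not base R.
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

normalized <- log(skewed)              # the log of log-normal data is normal

skew_before <- skewness(skewed)        # strongly positive
skew_after  <- skewness(normalized)    # close to 0
```

A log transform suits this simulated case because the data were generated as log-normal; real data may call for a different transformation.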


# Working in R

Running `?scale` in R displays the following description:
The value of center determines how column centering is performed. If center is a numeric-alike vector with length equal to the number of columns of x, then each column of x has the corresponding value from center subtracted from it. If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns, and if center is FALSE, no centering is done.
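The three forms of `center` described in this paragraph can be sketched on a small matrix. The choice of the first row as the custom centering vector is arbitrary and made only for this example.

```r
x <- matrix(1:20, ncol = 4)

# Numeric vector: subtract a custom value from each column (here, row 1 of x).
custom_center <- scale(x, center = x[1, ], scale = FALSE)

# center = TRUE: subtract the column means.
mean_center <- scale(x, center = TRUE, scale = FALSE)

# center = FALSE (with scale = FALSE): the values are left unchanged.
no_center <- scale(x, center = FALSE, scale = FALSE)
```

In `custom_center` every column becomes 0, 1, 2, 3, 4, because each column's first value was subtracted from it.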

The value of scale determines how column scaling is performed (after centering). If scale is a numeric-alike vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from scale. If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise. If scale is FALSE, no scaling is done.
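A short sketch of the `scale` argument follows. The divisors 1, 10, 100, 1000 are arbitrary values picked for illustration.

```r
x <- matrix(1:20, ncol = 4)

# Numeric vector: divide each column by its own divisor (arbitrary values here).
custom_scale <- scale(x, center = FALSE, scale = c(1, 10, 100, 1000))

# center = TRUE and scale = TRUE: divide the centered columns by their
# standard deviations, so every resulting column has standard deviation 1.
z_scores <- scale(x, center = TRUE, scale = TRUE)
col_sds  <- apply(z_scores, 2, sd)
```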

The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
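To check the root-mean-square formula against what `scale()` actually uses, the following sketch compares the stored `"scaled:scale"` attribute with a manual calculation, and then scales by the standard deviation without centering, as the help text suggests.

```r
x <- matrix(1:20, ncol = 4)

# Without centering, scale = TRUE divides by the root mean square.
rms_scaled <- scale(x, center = FALSE, scale = TRUE)
rms_used   <- attr(rms_scaled, "scaled:scale")

# The same quantity computed by hand from the documented formula.
rms_manual <- sqrt(colSums(x^2) / (nrow(x) - 1))

# The column standard deviations are different numbers.
col_sds <- apply(x, 2, sd)

# Scaling by the standard deviation without centering.
sd_scaled <- scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE))
```

For the first column the root mean square is `sqrt(55 / 4)`, about 3.71, while the standard deviation is about 1.58, so the two choices give visibly different results.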


    > x <- matrix(1:20, ncol = 4)
    > x
         [,1] [,2] [,3] [,4]
    [1,]    1    6   11   16
    [2,]    2    7   12   17
    [3,]    3    8   13   18
    [4,]    4    9   14   19
    [5,]    5   10   15   20
    > scale(x, center = TRUE, scale = TRUE)
               [,1]       [,2]       [,3]       [,4]
    [1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
    [2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
    [3,]  0.0000000  0.0000000  0.0000000  0.0000000
    [4,]  0.6324555  0.6324555  0.6324555  0.6324555
    [5,]  1.2649111  1.2649111  1.2649111  1.2649111
    attr(,"scaled:center")
    [1]  3  8 13 18
    attr(,"scaled:scale")
    [1] 1.581139 1.581139 1.581139 1.581139
    > scale(x, center = TRUE, scale = FALSE)
         [,1] [,2] [,3] [,4]
    [1,]   -2   -2   -2   -2
    [2,]   -1   -1   -1   -1
    [3,]    0    0    0    0
    [4,]    1    1    1    1
    [5,]    2    2    2    2
    attr(,"scaled:center")
    [1]  3  8 13 18
    > scale(x, center = TRUE, scale = FALSE)/sd(scale(x, center = TRUE, scale = FALSE)[1:5])
               [,1]       [,2]       [,3]       [,4]
    [1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
    [2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
    [3,]  0.0000000  0.0000000  0.0000000  0.0000000
    [4,]  0.6324555  0.6324555  0.6324555  0.6324555
    [5,]  1.2649111  1.2649111  1.2649111  1.2649111
    attr(,"scaled:center")
    [1]  3  8 13 18
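The last command in the session above divides every column by the standard deviation of the first five elements, which is the first column. That reproduces `scale()` here only because all four columns of `x` happen to share the same standard deviation. A sketch with columns of different spread, using made-up values and `sweep()` to divide each column by its own standard deviation:

```r
# Columns with different spreads (made-up values).
y <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(10, 20, 30, 40, 50),
           c = c(2, 2, 4, 8, 9))

centered <- scale(y, center = TRUE, scale = FALSE)

# Divide every column by its own standard deviation.
manual  <- sweep(centered, 2, apply(y, 2, sd), "/")
builtin <- scale(y, center = TRUE, scale = TRUE)

# The two agree once the attributes added by scale() are ignored.
same <- isTRUE(all.equal(manual, builtin, check.attributes = FALSE))

# Dividing everything by the first column's SD does not standardize column b.
shortcut <- centered / sd(centered[, 1])
sd_b <- sd(shortcut[, "b"])   # 10, not 1
```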


    data <- runif(100, min = 10, max = 100)                   # 100 uniform draws in [10, 100]

    plot(1:100, data)                                         # raw values
    plot(1:100, scale(data, center = TRUE, scale = FALSE))    # centered only
    plot(1:100, scale(data, center = TRUE, scale = TRUE))     # centered and scaled

Figures (the three plots produced by the code above): raw_data, data_center, data_center_scale
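The effect shown in the three plots can also be checked numerically. The sketch below is an illustration with its own simulated values; the `set.seed()` call and the variable names are additions for repeatability and are not part of the original code.

```r
set.seed(1)                                # fixed seed, added for repeatability
values <- runif(100, min = 10, max = 100)

centered        <- scale(values, center = TRUE, scale = FALSE)
centered_scaled <- scale(values, center = TRUE, scale = TRUE)

# scale() returns a one-column matrix, so as.vector() drops the dimensions.
summary_table <- data.frame(
  version = c("raw", "centered", "centered + scaled"),
  mean    = c(mean(values), mean(centered), mean(centered_scaled)),
  sd      = c(sd(values), sd(as.vector(centered)), sd(as.vector(centered_scaled)))
)
print(summary_table)
```

Centering moves the mean to 0 and leaves the standard deviation alone; adding `scale = TRUE` then brings the standard deviation to 1.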

# Conclusion

The `center` and `scale` arguments of R's `scale()` function have to be set correctly for your data to be processed the way you intend.