TensorFlow Linear Model Tutorial#

In this tutorial, we will use the TF.Learn API in TensorFlow to solve a binary classification problem: Given census data about a person such as age, gender, education and occupation (the features), we will try to predict whether or not the person earns more than 50,000 dollars a year (the target label). We will train a logistic regression model, and given an individual's information our model will output a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.

Setup#

To try the code for this tutorial:
Install TensorFlow if you haven't already.
Download the tutorial code.
Install the pandas data analysis library. tf.learn doesn't require pandas, but it does support it, and this tutorial uses pandas. To install pandas:

Get pip:

# Ubuntu/Linux 64-bit
$ sudo apt-get install python-pip python-dev

# Mac OS X
$ sudo easy_install pip
$ sudo easy_install --upgrade six

Use pip to install pandas:

$ sudo pip install pandas

If you have trouble installing pandas, consult the instructions on the pandas site.

Execute the tutorial code with the following command to train the linear model described in this tutorial:

$ python wide_n_deep_tutorial.py --model_type=wide

Read on to find out how this code builds its linear model.

Reading The Census Data#

The dataset we'll be using is the Census Income Dataset. You can download the training data and test data manually or use code like this:

import tempfile
import urllib
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

Once the CSV files are downloaded, let's read them into Pandas dataframes:

import pandas as pd
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)

Since the task is a binary classification problem, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise.

LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

Next, let's take a look at the dataframe and see which columns we can use to predict the target label. The columns can be grouped into two types—categorical and continuous columns:

  • A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
  • A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.

CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]

Here's a list of columns available in the Census Income dataset:

|Column Name|Type|Description|
| ------------- |---------| -----------|
|age|Continuous|The age of the individual|
|workclass|Categorical|The type of employer the individual has (government, military, private, etc.).|
|fnlwgt|Continuous|The number of people the census takers believe that observation represents (sample weight). This variable will not be used.|
|education|Categorical|The highest level of education achieved by the individual.|
|education_num|Continuous|The highest level of education in numerical form.|
|marital_status|Categorical|Marital status of the individual.|
|occupation|Categorical|The occupation of the individual.|
|relationship|Categorical|Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.|
|race|Categorical|White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.|
|gender|Categorical|Female, Male.|
|capital_gain|Continuous|Capital gains recorded.|
|capital_loss|Continuous|Capital Losses recorded.|
|hours_per_week|Continuous|Hours worked per week.|
|native_country|Categorical|Country of origin of the individual.|
|income|Categorical|">50K" or "<=50K", meaning whether the person makes more than $50,000 annually.|

Converting Data into Tensors#

When building a TF.Learn model, the input data is specified by means of an Input Builder function. This builder function will not be called until it is later passed to TF.Learn methods such as fit and evaluate.
The purpose of this function is to construct the input data, which is represented in the form of Tensors or SparseTensors. In more detail, the Input Builder function returns the following as a pair:

  1. feature_cols: A dict from feature column names to Tensors or SparseTensors.
  2. label: A Tensor containing the label column.

The keys of the feature_cols will be used to construct columns in the next section. Because we want to call the fit and evaluate methods with different data, we define two different input builder functions, train_input_fn and test_input_fn, which are identical except that they pass different data to input_fn. Note that input_fn will be called while constructing the TensorFlow graph, not while running the graph. What it is returning is a representation of the input data as the fundamental unit of TensorFlow computations, a Tensor (or SparseTensor).

Our model represents the input data as constant tensors, meaning that the tensor represents a constant value, in this case the values of a particular column of df_train or df_test. This is the simplest way to pass data into TensorFlow. Another more advanced way to represent input data would be to construct an Input Reader that represents a file or other data source, and iterates through the file as TensorFlow runs the graph. Each continuous column in the train or test dataframe will be converted into a Tensor, which in general is a good format to represent dense data. For categorical data, we must represent the data as a SparseTensor. This data format is good for representing sparse data.

import tensorflow as tf

def input_fn(df):
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols.items() + categorical_cols.items())
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)

Selecting and Engineering Features for the Model#

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw columns in the original dataframe (let's call them base feature columns), or any new columns created based on some transformations defined over one or multiple base columns (let's call them derived feature columns). Basically, "feature column" is an abstract concept of any raw or derived variable that can be used to predict the target label.

Base Categorical Feature Columns##

To define a feature column for a categorical feature, we can create a SparseColumn using the TF.Learn API. If you know the set of all possible feature values of a column and there are only a few of them, you can use sparse_column_with_keys. Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the gender column we can assign the feature string "Female" to an integer ID of 0 and "Male" to 1 by doing:

gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["Female", "Male"])

What if we don't know the set of possible values in advance? Not a problem. We can use sparse_column_with_hash_bucket instead:

education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)

What will happen is that each possible value in the feature column education will be hashed to an integer ID as we encounter them in training. See an example illustration below:

|ID|Feature|
|-|-|
|...||
|9|"Bachelors"|
|...||
|103|"Doctorate"|
|...||
|375|"Masters"|
|...||

No matter which way we choose to define a SparseColumn, each feature string will be mapped into an integer ID by looking up a fixed mapping or by hashing. Note that hashing collisions are possible, but may not significantly impact the model quality. Under the hood, the LinearModel class is responsible for managing the mapping and creating tf.Variable to store the model parameters (also known as model weights) for each feature ID. The model parameters will be learned through the model training process we'll go through later.

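To build intuition for how a string gets mapped to a feature ID by hashing, here is a minimal sketch of hash-based bucketing in plain Python. This is an illustration only: the hash function TensorFlow actually uses internally is different, and hash_bucket_fn is a hypothetical name.

import hashlib

def hash_bucket_fn(feature_string, hash_bucket_size):
  # Hash the feature string to a stable integer, then fold it into
  # the range [0, hash_bucket_size) to obtain the feature ID.
  digest = hashlib.md5(feature_string.encode("utf-8")).hexdigest()
  return int(digest, 16) % hash_bucket_size

print hash_bucket_fn("Bachelors", 1000)  # some ID in [0, 1000)
print hash_bucket_fn("Doctorate", 1000)  # usually different, but collisions are possible
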
We'll do the same trick to define the other categorical features:

relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

Base Continuous Feature Columns##

Similarly, we can define a RealValuedColumn for each continuous feature column that we want to use in the model:

age = tf.contrib.layers.real_valued_column("age")
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

Making Continuous Features Categorical through Bucketization##

Sometimes the relationship between a continuous feature and the label is not linear. As a hypothetical example, a person's income may grow with age in the early stage of one's career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of the three cases:

  1. Income always increases at some rate as age grows (positive correlation),
  2. Income always decreases at some rate as age grows (negative correlation), or
  3. Income stays the same no matter at what age (no correlation)

If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a bucketized_column over age as:

age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

where the boundaries is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).

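As a quick check on the boundary arithmetic, here is a small sketch in plain Python (not TensorFlow's implementation) showing which bucket ID a raw age falls into with these boundaries:

import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucket_id(age):
  # bisect_right counts how many boundaries are <= age, which is
  # exactly the bucket index: age 17 -> 0, age 18 -> 1, ..., age 65 -> 10.
  return bisect.bisect_right(boundaries, age)

for age in [17, 18, 24, 25, 64, 65, 80]:
  print age, bucket_id(age)  # 10 boundaries produce 11 buckets (IDs 0-10)
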
Intersecting Multiple Columns with CrossedColumn##

Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning > 50,000 dollars) may be different for different occupations. Therefore, if we only learn a single model weight for education="Bachelors" and education="Masters", we won't be able to capture every single education-occupation combination (e.g. distinguishing between education="Bachelors" AND occupation="Exec-managerial" and education="Bachelors" AND occupation="Craft-repair"). To learn the differences between different feature combinations, we can add crossed feature columns to the model.

education_x_occupation = tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4))

We can also create a CrossedColumn over more than two columns. Each constituent column can be either a base feature column that is categorical (SparseColumn), a bucketized real-valued feature column (BucketizedColumn), or even another CrossedColumn. Here's an example:

age_buckets_x_education_x_occupation = tf.contrib.layers.crossed_column([age_buckets, education, occupation], hash_bucket_size=int(1e6))
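
Conceptually, a crossed column hashes the combination of its constituent features' values into a single ID space, so each combination can get its own model weight. A rough sketch of the idea (illustrative only; TensorFlow's actual crossing and hashing scheme differs, and crossed_id is a hypothetical name):

import hashlib

def crossed_id(feature_values, hash_bucket_size):
  # Join the constituent values into one key, then hash the key
  # into the crossed column's ID space.
  key = "_X_".join(feature_values)
  digest = hashlib.md5(key.encode("utf-8")).hexdigest()
  return int(digest, 16) % hash_bucket_size

# Each education-occupation combination maps to its own ID (modulo
# collisions), so the model can learn a separate weight for each.
print crossed_id(["Bachelors", "Exec-managerial"], int(1e4))
print crossed_id(["Bachelors", "Craft-repair"], int(1e4))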

Defining The Logistic Regression Model##

After processing the input data and defining all the feature columns, we're now ready to put them all together and build a Logistic Regression model. In the previous section we've seen several types of base and derived feature columns, including:

  • SparseColumn
  • RealValuedColumn
  • BucketizedColumn
  • CrossedColumn

All of these are subclasses of the abstract FeatureColumn class, and can be added to the feature_columns field of a model:

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[
  gender, native_country, education, occupation, workclass, marital_status, race,
  age_buckets, education_x_occupation, age_buckets_x_education_x_occupation],
  model_dir=model_dir)

The model also automatically learns a bias term, which controls the prediction one would make without observing any features (see the section "How Logistic Regression Works" for more explanations). The learned model files will be stored in model_dir.

Training and Evaluating Our Model#

After adding all the features to the model, now let's look at how to actually train the model. Training a model is just a one-liner using the TF.Learn API:

m.fit(input_fn=train_input_fn, steps=200)

After the model is trained, we can evaluate how good our model is at predicting the labels of the holdout data:

results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
  print "%s: %s" % (key, results[key])

The first line of the output should be something like accuracy: 0.83557522, which means the accuracy is 83.6%. Feel free to try more features and transformations and see if you can do even better!
If you'd like to see a working end-to-end example, you can download our [example code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py) and set the model_type flag to wide.

Adding Regularization to Prevent Overfitting#

Regularization is a technique used to avoid overfitting. Overfitting happens when your model does well on the data it is trained on, but worse on test data that the model has not seen before, such as live traffic. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observed training data. Regularization allows you to control your model's complexity and makes the model more generalizable to unseen data.

In the Linear Model library, you can add L1 and L2 regularizations to the model as:

m = tf.contrib.learn.LinearClassifier(feature_columns=[
  gender, native_country, education, occupation, workclass, marital_status, race,
  age_buckets, education_x_occupation, age_buckets_x_education_x_occupation],
  optimizer=tf.train.FtrlOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=1.0,
    l2_regularization_strength=1.0),
  model_dir=model_dir)

One important difference between L1 and L2 regularization is that L1 regularization tends to make model weights stay at zero, creating sparser models, whereas L2 regularization also tries to make the model weights closer to zero but not necessarily zero. Therefore, if you increase the strength of L1 regularization, you will have a smaller model size because many of the model weights will be zero. This is often desirable when the feature space is very large but sparse, and when there are resource constraints that prevent you from serving a model that is too large.

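To see why L1 pushes weights to exactly zero while L2 only shrinks them, here is a small illustrative sketch in plain Python (a toy shrinkage step, not the FtrlOptimizer's actual update rule):

def l2_shrink(w, lam, lr=0.1):
  # The gradient of the L2 penalty is proportional to w, so the weight
  # shrinks multiplicatively: it approaches zero but rarely lands exactly on it.
  return w - lr * lam * w

def l1_shrink(w, lam, lr=0.1):
  # The L1 penalty exerts a constant-size pull toward zero (soft-thresholding):
  # weights smaller than the pull are snapped exactly to zero, giving sparsity.
  if abs(w) <= lr * lam:
    return 0.0
  return w - lr * lam * (1 if w > 0 else -1)

print l2_shrink(0.05, lam=1.0)  # 0.045: smaller, but still nonzero
print l1_shrink(0.05, lam=1.0)  # 0.0: the weight drops out of the model
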
In practice, you should try various combinations of L1 and L2 regularization strengths and find the parameters that best control overfitting and give you a desirable model size.

How Logistic Regression Works#

Finally, let's take a minute to talk about what the Logistic Regression model actually looks like in case you're not already familiar with it. We'll denote the label as $Y$, and the set of observed features as a feature vector $x = [x_1, x_2, ..., x_d]$. We define $Y = 1$ if an individual earned more than 50,000 dollars and $Y = 0$ otherwise. In Logistic Regression, the probability of the label being positive ($Y = 1$) given the features $x$ is given as:

$$P(Y=1|x) = \frac{1}{1 + \exp(-(w^T x + b))}$$

where $w = [w_1, w_2, ..., w_d]$ are the model weights for the features $x = [x_1, x_2, ..., x_d]$, and $b$ is a constant that is often called the bias of the model. The equation consists of two parts: a linear model and a logistic function.

Linear Model: First, we can see that $w^T x + b = b + w_1 x_1 + ... + w_d x_d$ is a linear model where the output is a linear function of the input features $x$. The bias $b$ is the prediction one would make without observing any features. The model weight $w_i$ reflects how the feature $x_i$ is correlated with the positive label. If $x_i$ is positively correlated with the positive label, the weight $w_i$ increases, and the probability $P(Y=1|x)$ will be closer to 1. On the other hand, if $x_i$ is negatively correlated with the positive label, then the weight $w_i$ decreases and the probability $P(Y=1|x)$ will be closer to 0.

Logistic Function: Second, we can see that there's a logistic function (also known as the sigmoid function) $S(t) = 1/(1 + \exp(-t))$ being applied to the linear model. The logistic function is used to convert the output of the linear model $w^T x + b$ from any real number into the range of $[0, 1]$, which can be interpreted as a probability.
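
To make the two parts concrete, here is a minimal sketch in plain Python (independent of TensorFlow) that computes $P(Y=1|x)$; the values of w, b, and x below are hypothetical:

import math

def predict_proba(w, b, x):
  # Linear model: z = w^T x + b
  z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
  # Logistic function: squash z from any real number into (0, 1)
  return 1.0 / (1.0 + math.exp(-z))

print predict_proba(w=[0.8, -0.5], b=-0.2, x=[1.0, 2.0])  # ~0.40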

Model training is an optimization problem: The goal is to find a set of model weights (i.e. model parameters) to minimize a loss function defined over the training data, such as logistic loss for Logistic Regression models. The loss function measures the discrepancy between the ground-truth label and the model's prediction. If the prediction is very close to the ground-truth label, the loss value will be low; if the prediction is very far from the label, then the loss value would be high.

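For reference, here is a sketch of the logistic (log) loss described above for a single example, again in plain Python rather than TensorFlow's implementation:

import math

def logistic_loss(y, p):
  # y is the ground-truth label (0 or 1); p is the predicted probability of y == 1.
  # The loss is low when p agrees with y and grows as they diverge.
  return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print logistic_loss(1, 0.9)  # ~0.105: prediction close to the label, low loss
print logistic_loss(1, 0.1)  # ~2.303: prediction far from the label, high loss
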
Learn Deeper#

If you're interested in learning more, check out our Wide & Deep Learning Tutorial, where we'll show you how to combine the strengths of linear models and deep neural networks by jointly training them using the TF.Learn API.
