[0.2] Tensorflow踩坑记之头疼的tf.data

今天尝试总结一下 tf.data 这个API的一些用法吧。之所以会用到这个API，是因为需要处理的数据量很大，而且数据均是分布式的存储在多台服务器上，所以没有办法采用传统的喂数据方式，而是运用了 tf.data 对数据进行了相应的预处理，并且最近正赶上总结需要，尝试写一下关于 tf.data 的一些用法，有错误的地方一定告诉我哈。

Tensorflow的数据读取

先来看一下Tensorflow的数据读取机制吧

知乎专栏-何之源：十图详解tensorflow数据读取机制

这一篇文章对于 tensorflow的数据读取机制 讲解得很不错，大噶可以先看一下，有一个了解。

Dataset API是怎么用的呢

虽然上面的资料关于 tf.data 讲解得都很好，但是我没有找到一个很完整滴运用 tf.data.TextLineDataset() 和 tf.data.TFRecordDataset() 的例子，所以才想尝试写一写这篇总结。

MNIST的经典例子

本篇博客结合 mnist 的经典例子，针对不同的源数据：csv数据和tfrecord数据，分别运用 tf.data.TextLineDataset() 和 tf.data.TFRecordDataset() 创建不同的 Dataset 并运用四种不同的 Iterator ，分别是 单次，可初始化，可重新初始化，以及可馈送迭代器 的方式实现对源数据的预处理工作。

我将相关的资料放在了澜子的Github 上，欢迎互粉哇（星星眼）。其中包括了所需的 后缀名为csv和tfrecords的源数据 (data的文件夹)，以及在 jupyter notebook实现的具体代码 (tf_dataset_learn.ipynb)。

如果有需要的同学可以直接
git clone https://github.com/lanhongvp/tensorflow_dataset_learn.git
然后用 jupyter 跑一跑看看输出，这样可以有一个比较直观的认识。关于 Git和Github 的使用，大噶可以看我VSCODE_GIT这一篇博客啦。接下来，针对MNIST例子做一个简单的说明吧。

tf.data.TFRecordDataset() & make_one_shot_iterator()

tf.data.TFRecordDataset() 输入参数直接是后缀名为tfrecords的文件路径，正因如此，即可解决数据量过大，导致无法单机训练的问题。本篇博客中，文件路径即为/Users/honglan/Desktop/train_output.tfrecords，此处是我自己电脑上的路径，大家可以 根据自己的需要修改为对应的文件路径。
make_one_shot_iterator() 即为单次迭代器，是最简单的迭代器形式，仅支持对数据集进行一次迭代，不需要显式初始化。
配合 MNIST数据集以及tf.data.TFRecordDataset()，实现代码如下。

# Validate tf.data.TFRecordDataset() using make_one_shot_iterator()
import tensorflow as tf
import numpy as np

num_epochs = 2
num_class = 10
sess = tf.Session()

# Use `tf.parse_single_example()` to extract data from a `tf.Example`
# protocol buffer, and perform any additional per-record preprocessing.
def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64,
                                    default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"],tf.uint8)
    images = tf.reshape(images,[28,28,1])
    labels = tf.cast(parsed['label'], tf.int32)
    labels = tf.one_hot(labels,num_class)
    pixels = tf.cast(parsed['pixels'], tf.int32)
    print("IMAGES",images)
    print("LABELS",labels)
    
    return {"image_raw": images}, labels


filenames = ["/Users/honglan/Desktop/train_output.tfrecords"] 
# replace the filenames with your own path
dataset = tf.data.TFRecordDataset(filenames)
print("DATASET",dataset)

# Use `Dataset.map()` to build a pair of a feature dictionary and a label
# tensor for each example.
dataset = dataset.map(parser)
print("DATASET_1",dataset)
dataset = dataset.shuffle(buffer_size=10000)
print("DATASET_2",dataset)
dataset = dataset.batch(32)
print("DATASET_3",dataset)
dataset = dataset.repeat(num_epochs)
print("DATASET_4",dataset)
iterator = dataset.make_one_shot_iterator()

# `features` is a dictionary in which each value is a batch of values for
# that feature; `labels` is a batch of labels.
features, labels = iterator.get_next()

print("FEATURES",features)
print("LABELS",labels)
print("SESS_RUN_LABELS \n",sess.run(labels))

tf.data.TFRecordDataset() & Initializable iterator

make_initializable_iterator() 为可初始化迭代器，运用此迭代器首先需要先运行显式 iterator.initializer 操作，然后才能使用。并且，可运用 可初始化迭代器实现训练集和验证集的切换。
配合 MNIST数据集 实现代码如下。

# Validate tf.data.TFRecordDataset() using make_initializable_iterator()
# In order to switch between train and validation data
num_epochs = 2
num_class = 10

def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64,
                                    default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)
    
    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"],tf.uint8)
    images = tf.reshape(images,[28,28,1])
    labels = tf.cast(parsed['label'], tf.int32)
    labels = tf.one_hot(labels,10)
    pixels = tf.cast(parsed['pixels'], tf.int32)
    print("IMAGES",images)
    print("LABELS",labels)
    
    return {"image_raw": images}, labels


filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parser) # Parse the record into tensors
# print("DATASET",dataset)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
print("DATASET",dataset)
iterator = dataset.make_initializable_iterator()
features, labels = iterator.get_next()
print("ITERATOR",iterator)
print("FEATURES",features)
print("LABELS",labels)


# Initialize `iterator` with training data.
training_filenames = ["/Users/honglan/Desktop/train_output.tfrecords"] 
# replace the filenames with your own path
sess.run(iterator.initializer,feed_dict={filenames: training_filenames})
print("TRAIN\n",sess.run(labels))
# print(sess.run(features))

# Initialize `iterator` with validation data.
validation_filenames = ["/Users/honglan/Desktop/val_output.tfrecords"] 
# replace the filenames with your own path
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
print("VAL\n",sess.run(labels))

tf.data.TextLineDataset() & Reinitializable iterator

tf.data.TextLineDataset()，输入参数可以是后缀名为csv或者是txt的源数据的文件路径。
此处用的迭代器是 Reinitializable iterator，即为可重新初始化迭代器。官方定义如下。配合 MNIST数据集 实现代码见第二部分。

可重新初始化迭代器可以通过多个不同的 Dataset 对象进行初始化。例如，您可能有一个训练输入管道，它会对输入图片进行随机扰动来改善泛化；还有一个验证输入管道，它会评估对未修改数据的预测。这些管道通常会使用不同的 Dataset 对象，这些对象具有相同的结构（即每个组件具有相同类型和兼容形状）。

# validate tf.data.TextLineDataset() using Reinitializable iterator
# In order to switch between train and validation data

def decode_line(line):
    # Decode the line to tensor
    record_defaults = [[1.0] for col in range(785)]
    items = tf.decode_csv(line, record_defaults)
    features = items[1:785]
    label = items[0]

    features = tf.cast(features, tf.float32)
    features = tf.reshape(features,[28,28,1])
    label = tf.cast(label, tf.int64)
    label = tf.one_hot(label,num_class)
    return features,label


def create_dataset(filename, batch_size=32, is_shuffle=False, n_repeats=0):
    """create dataset for train and validation dataset"""
    dataset = tf.data.TextLineDataset(filename).skip(1)
    if n_repeats > 0:
        dataset = dataset.repeat(n_repeats)         # for train
    # dataset = dataset.map(decode_line).map(normalize)  
    dataset = dataset.map(decode_line)    
    # decode and normalize
    if is_shuffle:
        dataset = dataset.shuffle(10000)            # shuffle
    dataset = dataset.batch(batch_size)
    return dataset


training_filenames = ["/Users/honglan/Desktop/train.csv"] 
# replace the filenames with your own path
validation_filenames = ["/Users/honglan/Desktop/val.csv"] 
# replace the filenames with your own path

# Create different datasets
training_dataset = create_dataset(training_filenames, batch_size=32, \
                                  is_shuffle=True, n_repeats=num_epochs) # train_filename
validation_dataset = create_dataset(validation_filenames, batch_size=32, \
                                  is_shuffle=True, n_repeats=num_epochs) # val_filename

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
features, labels = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Using reinitializable iterator to alternate between training and validation.
sess.run(training_init_op)
print("TRAIN\n",sess.run(labels))
# print(sess.run(features))

# Reinitialize `iterator` with validation data.
sess.run(validation_init_op)
print("VAL\n",sess.run(labels))

tf.data.TextLineDataset() & Feedable iterator.

数据集读取方式同上一部分一样，运用tf.data.TextLineDataset()此处运用的迭代器是 可馈送迭代器，其可以与 tf.placeholder 一起使用，通过熟悉的 feed_dict 机制选择每次调用 tf.Session.run 时所使用的 Iterator。并使用 tf.data.Iterator.from_string_handle定义一个可让在两个数据集之间切换的可馈送迭代器，结合 MNIST数据集 的代码如下

# validate tf.data.TextLineDataset() using two different iterator
# In order to switch between train and validation data

def decode_line(line):
    # Decode the line to tensor
    record_defaults = [[1.0] for col in range(785)]
    items = tf.decode_csv(line, record_defaults)
    features = items[1:785]
    label = items[0]

    features = tf.cast(features, tf.float32)
    features = tf.reshape(features,[28,28])
    label = tf.cast(label, tf.int64)
    label = tf.one_hot(label,num_class)
    return features,label


def create_dataset(filename, batch_size=32, is_shuffle=False, n_repeats=0):
    """create dataset for train and validation dataset"""
    dataset = tf.data.TextLineDataset(filename).skip(1)
    if n_repeats > 0:
        dataset = dataset.repeat(n_repeats)         # for train
    # dataset = dataset.map(decode_line).map(normalize)  
    dataset = dataset.map(decode_line)    
    # decode and normalize
    if is_shuffle:
        dataset = dataset.shuffle(10000)            # shuffle
    dataset = dataset.batch(batch_size)
    return dataset


training_filenames = ["/Users/honglan/Desktop/train.csv"] 
# replace the filenames with your own path
validation_filenames = ["/Users/honglan/Desktop/val.csv"] 
# replace the filenames with your own path

# Create different datasets
training_dataset = create_dataset(training_filenames, batch_size=32, \
                                  is_shuffle=True, n_repeats=num_epochs) # train_filename
validation_dataset = create_dataset(validation_filenames, batch_size=32, \
                                  is_shuffle=True, n_repeats=num_epochs) # val_filename

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())

# Using different handle to alternate between training and validation.
print("TRAIN\n",sess.run(labels, feed_dict={handle: training_handle}))
# print(sess.run(features))

# Initialize `iterator` with validation data.
sess.run(validation_iterator.initializer)
print("VAL\n",sess.run(labels, feed_dict={handle: validation_handle}))

小结

运用tfrecords处理数据的速度明显加快
可以根据自身需要选择不同的iterator方式对源数据进行预处理
单机训练时也可以采用 tf.data中API的相应处理方式

最后编辑于：2018.08.23 14:39:26

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 160,585评论 4赞 365
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,923评论 1赞 301
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 110,314评论 0赞 248
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,346评论 0赞 214
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,718评论 3赞 291
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,828评论 1赞 223
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 32,020评论 2赞 315
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,758评论 0赞 204
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,486评论 1赞 246
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,722评论 2赞 251
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,196评论 1赞 262
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,546评论 3赞 258
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,211评论 3赞 240
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,132评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,916评论 0赞 200
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,904评论 2赞 283
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,758评论 2赞 274