• How to upload a paper to arXiv

    Uploading your paper to arXiv looks troublesome, but it is actually quite simple.

    Steps

    Create a new folder and put mwe.tex and mwe.bib into it.

    File mwe.bib:

    @Book{Goossens,
      author    = {Goossens, Michel and Mittelbach, Frank and 
                   Samarin, Alexander},
      title     = {The LaTeX Companion},
      edition   = {1},
      publisher = {Addison-Wesley},
      location  = {Reading, Mass.},
      year      = {1994},
    }
    @Book{adams,
      title     = {The Restaurant at the End of the Universe},
      author    = {Douglas Adams},
      series    = {The Hitchhiker's Guide to the Galaxy},
      publisher = {Pan Macmillan},
      year      = {1980},
    }
    

    File mwe.tex:

    \documentclass[10pt,a4paper]{article}
    
    \usepackage{hyperref} % for better urls
    
    
    \begin{document}
    
    This is text with \cite{Goossens} and \cite{adams}.
    
    \nocite{*} % to test all bib entries
    \bibliographystyle{unsrt}
    \bibliography{mwe} % file mwe.bib
    
    \end{document}
    

    First run pdflatex mwe.tex. This produces a number of new files, e.g. mwe.aux.

    Then run bibtex mwe. BibTeX produces two new files, mwe.bbl and mwe.blg: mwe.blg is the log file of the BibTeX run, and mwe.bbl is the file needed for the submission.

    Contents of the file mwe.bbl:

    \begin{thebibliography}{1}
    
    \bibitem{Goossens}
    Michel Goossens, Frank Mittelbach, and Alexander Samarin.
    \newblock {\em The LaTeX Companion}.
    \newblock Addison-Wesley, 1 edition, 1994.
    
    \bibitem{adams}
    Douglas Adams.
    \newblock {\em The Restaurant at the End of the Universe}.
    \newblock The Hitchhiker's Guide to the Galaxy. Pan Macmillan, 1980.
    
    \end{thebibliography}
    

    Run pdflatex mwe.tex a second time so that the citations and numbering are resolved correctly.

    Copy mwe.tex to mwe-arxiv.tex, remove the BibTeX-related commands, and either paste the contents of mwe.bbl into mwe-arxiv.tex or \input the mwe.bbl file.

    Option 1: paste mwe.bbl directly into mwe-arxiv.tex (good when there are few references; the upload then only needs to contain mwe-arxiv.tex):

    \documentclass[10pt,a4paper]{article}
    
    \usepackage{hyperref} % for better urls
    
    
    \begin{document}
    
    This is text with \cite{Goossens} and \cite{adams}.
    
    \nocite{*} % to test all bib entries
    %\bibliographystyle{unsrt} % <======================== no longer needed!
    %\bibliography{\jobname} % <========================== no longer needed!
    \begin{thebibliography}{1} % <================================== mwe.bbl
    
    \bibitem{Goossens}
    Michel Goossens, Frank Mittelbach, and Alexander Samarin.
    \newblock {\em The LaTeX Companion}.
    \newblock Addison-Wesley, 1 edition, 1994.
    
    \bibitem{adams}
    Douglas Adams.
    \newblock {\em The Restaurant at the End of the Universe}.
    \newblock The Hitchhiker's Guide to the Galaxy. Pan Macmillan, 1980.
    
    \end{thebibliography} % <======================================= mwe.bbl
    
    \end{document}
    

    Option 2: use \input{mwe.bbl} (the upload must then contain both mwe-arxiv.tex and mwe.bbl):

    \documentclass[10pt,a4paper]{article}
    
    \usepackage{hyperref} % for better urls
    
    
    \begin{document}
    
    This is text with \cite{Goossens} and \cite{adams}.
    
    \nocite{*} % to test all bib entries
    %\bibliographystyle{unsrt} % <======================== no longer needed!
    %\bibliography{\jobname} % <========================== no longer needed!
    \input{mwe.bbl} % <============================================= mwe.bbl
    
    \end{document}
    

    Summary

    1. Run pdflatex mwe.tex and bibtex mwe to generate the necessary files.
    2. Run pdflatex mwe.tex again so that the numbering is correct.
    3. Pull the mwe.bbl file into mwe.tex (paste its contents or \input it).
    4. Upload the mwe.tex and mwe.bbl files (figure files are included as usual).

    Tips

    Supported figure formats: see the arXiv submission help (reference 2 below).

    File names may only contain the characters a-z A-Z 0-9 _ + - . , = and are case sensitive: Figure1.PDF and figure1.pdf are not the same file.

    Schedule (all times Eastern US):

    Submissions received between     | Will be announced | Mailed to subscribers
    Monday 14:00 – Tuesday 14:00     | Tuesday 20:00     | Tuesday night / Wednesday morning
    Tuesday 14:00 – Wednesday 14:00  | Wednesday 20:00   | Wednesday night / Thursday morning
    Wednesday 14:00 – Thursday 14:00 | Thursday 20:00    | Thursday night / Friday morning
    Thursday 14:00 – Friday 14:00    | Sunday 20:00      | Sunday night / Monday morning
    Friday 14:00 – Monday 14:00      | Monday 20:00      | Monday night / Tuesday morning

    References

    1. https://tex.stackexchange.com/questions/329198/how-to-obtain-and-use-the-bbl-file-in-my-tex-document-for-arxiv-submission

    2. https://arxiv.org/help/submit

  • An introduction to TensorFlow's TFRecord and QueueRunner

    Datasets we download usually come as compressed archives; after extraction there are several folders such as train, test, and val, and the number of files can run into the tens of thousands or even millions. A dataset in this form is not only complicated and slow to read, it also wastes disk space. This is where a binary format shows its strengths: we can store the whole dataset as a single binary file, so the train, test, and val folders disappear. More importantly, the data occupies a single block of memory instead of having to be loaded file by file, so a binary file is more efficient to work with.

    Did you think TensorFlow had already wrapped up reading, writing, and parsing such binary files for you? Yes, it has ~ This post shows how to convert data into the TFRecord format.

    The CIFAR-10 dataset

    This post uses CIFAR-10 as the example dataset. What is CIFAR-10? See here => Image datasets ~

    Assume you already have the following data:

    [figure: the CIFAR-10 data files]

    Writing: saving data in TFRecord format

    Some constants we define (plus the imports used throughout this post):

    # imports used by the snippets in this post
    import os
    import pickle
    import sys

    import matplotlib.pyplot as plt
    import numpy as np
    import tensorflow as tf

    _NUM_TRAIN_FILES = 5
    # The height and width of each image.
    _IMAGE_SIZE = 32
    # The names of the classes.
    _CLASS_NAMES = [
        'airplane',
        'automobile',
        'bird',
        'cat',
        'deer',
        'dog',
        'frog',
        'horse',
        'ship',
        'truck',
    ]
    

    Here we create two split files, one holding the training data and one holding the test data:

    dataset_dir = 'data'
    if not tf.gfile.Exists(dataset_dir):
        tf.gfile.MakeDirs(dataset_dir)
    training_filename = _get_output_filename(dataset_dir, 'train')
    testing_filename = _get_output_filename(dataset_dir, 'test')
    

    The _get_output_filename function generates the output file name:

    def _get_output_filename(dataset_dir, split_name):
        """Creates the output filename.
        Args:
          dataset_dir: The dataset directory where the dataset is stored.
          split_name: The name of the train/test split.
        Returns:
          An absolute file path.
        """
        return '%s/cifar10_%s.tfrecord' % (dataset_dir, split_name)
    

    Then, process the training data:

    # First, process the training data:
    with tf.python_io.TFRecordWriter(training_filename) as tfrecord_writer:
        offset = 0
        for i in range(_NUM_TRAIN_FILES):
            filename = os.path.join('./cifar-10-batches-py', 'data_batch_%d' % (i + 1))
            offset = _add_to_tfrecord(filename, tfrecord_writer, offset)
    

    That is, we read the data_batch_? files one by one and call _add_to_tfrecord to write their contents in TFRecord format.

    def _add_to_tfrecord(filename, tfrecord_writer, offset=0):
        """Loads data from the cifar10 pickle files and writes files to a TFRecord.
        Args:
          filename: The filename of the cifar10 pickle file.
          tfrecord_writer: The TFRecord writer to use for writing.
          offset: An offset into the absolute number of images previously written.
        Returns:
          The new offset.
        """
        with tf.gfile.Open(filename, 'rb') as f:
            data = pickle.load(f, encoding='bytes')
        images = data[b'data']
        num_images = images.shape[0]
    
        images = images.reshape((num_images, 3, 32, 32))
        labels = data[b'labels']
        with tf.Graph().as_default():
            image_placeholder = tf.placeholder(tf.uint8)
            encoded_image = tf.image.encode_png(image_placeholder)
            with tf.Session() as sess:
                for j in range(num_images):
                    sys.stdout.write('\r>> Reading file [%s] image %d/%d' % (filename, offset + j + 1, offset + num_images))
                    sys.stdout.flush()
                    image = np.squeeze(images[j]).transpose((1, 2, 0))
                    label = labels[j]
                    png_string = sess.run(encoded_image, feed_dict={image_placeholder: image})
                    example = image_to_tfexample(png_string, b'png', _IMAGE_SIZE, _IMAGE_SIZE, label)
                    tfrecord_writer.write(example.SerializeToString())
        return offset + num_images
    

    Because the images in the CIFAR-10 pickle files come as a 10000x3072 NumPy array, they first have to be reshaped into the [height, width, channels] layout that tf.image.encode_png expects. tf.image.encode_png returns the encoded string, and we also store the image's height, width, and format. image_to_tfexample packs all of this into a tf.train.Example:

    def image_to_tfexample(image_data, image_format, height, width, class_id):
        return tf.train.Example(features=tf.train.Features(feature={
            'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data])),
            'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_format])),
            'image/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[class_id])),
            'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
            'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        }))
    

    TensorFlow stores the data as a tf.train.Example protobuf object: an Example contains a Features message, which holds a dict that maps keys to individual Feature messages, and each Feature holds a FloatList, a BytesList, or an Int64List. Note that keys such as image/encoded and image/format can in principle be anything; the ones used here are the keys TensorFlow's standard image datasets use, so we stick to that convention.

    Once we have the Example, serializing it to a string and writing it out completes the TFRecord file:

    tfrecord_writer.write(example.SerializeToString())
    

    The test set is built the same way:

    # Next, process the testing data:
    with tf.python_io.TFRecordWriter(testing_filename) as tfrecord_writer:
        filename = os.path.join('./cifar-10-batches-py', 'test_batch')
        _add_to_tfrecord(filename, tfrecord_writer)
    

    In the end we get two files:

    [figure: the two generated .tfrecord files]

    These are the final TFRecord files, in binary form.
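
    As a quick sanity check, we can count the records in a generated file with tf.python_io.tf_record_iterator (a small sketch; the expected count of 50000 assumes the CIFAR-10 training split written above):

    num_records = sum(1 for _ in tf.python_io.tf_record_iterator('./data/cifar10_train.tfrecord'))
    print(num_records)  # 50000 for the CIFAR-10 training split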

    The simplest way to read them back is to iterate over the records directly:

    reconstructed_images = []
    record_iterator = tf.python_io.tf_record_iterator(path='./data/cifar10_train.tfrecord')
    for string_iterator in record_iterator:
        example = tf.train.Example()
        example.ParseFromString(string_iterator)
        height = example.features.feature['image/height'].int64_list.value[0]
        width = example.features.feature['image/width'].int64_list.value[0]
        png_string = example.features.feature['image/encoded'].bytes_list.value[0]
        label = example.features.feature['image/class/label'].int64_list.value[0]
        with tf.Session() as sess:
            image_placeholder = tf.placeholder(dtype=tf.string)
            decoded_img = tf.image.decode_png(image_placeholder, channels=3)
            reconstructed_img = sess.run(decoded_img, feed_dict={image_placeholder: png_string})
        reconstructed_images.append((reconstructed_img, label))
    

    This is just the inverse of writing: create an Example, parse the serialized string into it, and then fetch each value from features by its key. The image is decoded with tf.image.decode_png, the inverse of tf.image.encode_png.

    After reading, the images can be displayed directly:

    plt.imshow(reconstructed_images[0][0])
    plt.title(_CLASS_NAMES[reconstructed_images[0][1]])
    plt.show()
    

    This approach is straightforward: read the file and parse it record by record. But it becomes slow for larger datasets, and it ignores distribution, queues, and multithreading. Alternatively, we can read through a filename queue.

    Queues

    # first construct a queue containing a list of filenames.
    # this lets a user split up their dataset into multiple files to keep
    # size down
    filename_queue = tf.train.string_input_producer(['./data/cifar10_train.tfrecord'])
    # Unlike the TFRecordWriter, the TFRecordReader is symbolic, i.e. its operations do not execute immediately
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features={
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
        'image/height': tf.FixedLenFeature((), tf.int64),
        'image/width': tf.FixedLenFeature((), tf.int64),
        'image/class/label': tf.FixedLenFeature([], tf.int64, default_value=tf.zeros([], dtype=tf.int64))
    })
    image = tf.image.decode_png(features['image/encoded'], channels=3)
    image = tf.image.resize_image_with_crop_or_pad(image, 32, 32)
    label = features['image/class/label']
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    with tf.Session() as sess:
        sess.run(init_op)
        tf.train.start_queue_runners()
        # grab examples back.
        # first example from file
        image_val_1, label_val_1 = sess.run([image, label])
        # second example from file
        image_val_2, label_val_2 = sess.run([image, label])
        print(image_val_1, label_val_1)
        print(image_val_2, label_val_2)
        plt.imshow(image_val_1)
        plt.title(_CLASS_NAMES[label_val_1])
        plt.show()
    

    First we define the filename queue filename_queue, which holds a list of file names; this lets us split a large dataset across several smaller files so that no single file gets too big (here there is only one file). We then read from it with a TFRecordReader. TensorFlow graphs contain state variables that let the TFRecordReader remember how far into a tfrecord file it has read and where to resume, which is why we need sess.run(init_op) to initialize that state. Unlike tf.python_io.tf_record_iterator, a TFRecordReader always works on a filename queue: it pops a file name, reads records from that file until it is exhausted, and then moves on to the next file name.

    So how does the filename queue get filled? That is the job of QueueRunners. A QueueRunner is essentially a thread that keeps running enqueue operations through the session; TensorFlow wraps this up in the tf.train.QueueRunner object. Most of the time, though, the QueueRunner is a low-level detail we never touch directly; in this example it is created for us by tf.train.string_input_producer.
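
    For illustration, here is a rough sketch of what tf.train.string_input_producer builds under the hood: a FIFOQueue of file names plus a QueueRunner that keeps refilling it. This is my own simplified approximation, not the library's actual implementation:

    filenames = tf.constant(['./data/cifar10_train.tfrecord'])
    queue = tf.FIFOQueue(capacity=32, dtypes=tf.string)
    enqueue_op = queue.enqueue_many(filenames)      # push all file names into the queue
    qr = tf.train.QueueRunner(queue, [enqueue_op])  # a thread that keeps re-running enqueue_op
    tf.train.add_queue_runner(qr)                   # so tf.train.start_queue_runners() will start it

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run(queue.dequeue()))            # b'./data/cifar10_train.tfrecord'
        coord.request_stop()
        coord.join(threads)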

    At this point we still have to tell TensorFlow to start the threads that execute the QueueRunners; otherwise the code blocks forever, waiting for data to be enqueued. That is what tf.train.start_queue_runners() does: the threads are created as soon as this line runs. Note that it must be called after the initialization op has been run (sess.run(init_op)).

    tf.parse_single_example parses each record according to the features specification we defined. In the end, image_val_1 is a single image from the dataset, with shape (32, 32, 3).

    The flow of queue-based reading looks like this:

    [figure: flow diagram of the queue-based input pipeline]

    Batch

    In the example above, the image and label we get back correspond to a single Example, i.e. one record of the dataset. We never train on one record at a time, so how do we form batches?

    images_batch, labels_batch = tf.train.shuffle_batch(
        [image, label], batch_size=128,
        capacity=2000,
        min_after_dequeue=1000)
    
    with tf.Session() as sess:
        sess.run(init_op)
        tf.train.start_queue_runners()
        labels, images = sess.run([labels_batch, images_batch])
        print(labels.shape)
    

    Here tf.train.shuffle_batch turns the single image/label examples into batches. Internally, tf.train.shuffle_batch builds another kind of QueueRunner around a RandomShuffleQueue. The RandomShuffleQueue accumulates single image/label pairs until it holds batch_size + min_after_dequeue of them, then returns batch_size randomly chosen elements; the return value of shuffle_batch is therefore just the result of calling dequeue_many on that RandomShuffleQueue.

    If a tensor has shape [x, y, z], the corresponding tensor returned by shuffle_batch has shape [batch_size, x, y, z]. In this example labels and images have shapes (128,) and (128, 32, 32, 3) respectively.
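
    For illustration, here is a rough, simplified sketch of the queue that tf.train.shuffle_batch sets up internally (my own approximation, not the actual library code); image and label are the single-example tensors defined above:

    q = tf.RandomShuffleQueue(capacity=2000, min_after_dequeue=1000,
                              dtypes=[tf.uint8, tf.int64], shapes=[[32, 32, 3], []])
    enqueue_op = q.enqueue([image, label])
    # several threads keep filling the queue with single examples
    tf.train.add_queue_runner(tf.train.QueueRunner(q, [enqueue_op] * 4))
    # dequeue_many returns a whole batch at once
    images_batch, labels_batch = q.dequeue_many(128)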

    DatasetDataProvider

    If we use tf.contrib.slim, the reading pipeline can be wrapped even more elegantly.

    We define our dataset in cifar10.py; after the code above, the code below should need no extra explanation ~

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import os
    import tensorflow as tf
    
    slim = tf.contrib.slim
    
    _FILE_PATTERN = 'cifar10_%s.tfrecord'
    
    SPLITS_TO_SIZES = {'train': 50000, 'test': 10000}
    
    _NUM_CLASSES = 10
    
    _ITEMS_TO_DESCRIPTIONS = {
        'image': 'A [32 x 32 x 3] color image.',
        'label': 'A single integer between 0 and 9',
    }
    
    
    def get_split(split_name, dataset_dir, file_pattern=None, reader=None):
        """Gets a dataset tuple with instructions for reading cifar10.
        Args:
          split_name: A train/test split name.
          dataset_dir: The base directory of the dataset sources.
          file_pattern: The file pattern to use when matching the dataset sources.
            It is assumed that the pattern contains a '%s' string so that the split
            name can be inserted.
          reader: The TensorFlow reader type.
        Returns:
          A `Dataset` namedtuple.
        Raises:
          ValueError: if `split_name` is not a valid train/test split.
        """
        if split_name not in SPLITS_TO_SIZES:
            raise ValueError('split name %s was not recognized.' % split_name)
    
        if not file_pattern:
            file_pattern = _FILE_PATTERN
        file_pattern = os.path.join(dataset_dir, file_pattern % split_name)
    
        # Allowing None in the signature so that dataset_factory can use the default.
        if not reader:
            reader = tf.TFRecordReader
    
        keys_to_features = {
            'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
            'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
            'image/class/label': tf.FixedLenFeature(
                [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
        }
    
        items_to_handlers = {
            'image': slim.tfexample_decoder.Image(shape=[32, 32, 3]),
            'label': slim.tfexample_decoder.Tensor('image/class/label'),
        }
    
        decoder = slim.tfexample_decoder.TFExampleDecoder(
            keys_to_features, items_to_handlers)
    
        labels_to_names = None
        if has_labels(dataset_dir):
            labels_to_names = read_label_file(dataset_dir)
    
        return slim.dataset.Dataset(
            data_sources=file_pattern,
            reader=reader,
            decoder=decoder,
            num_samples=SPLITS_TO_SIZES[split_name],
            items_to_descriptions=_ITEMS_TO_DESCRIPTIONS,
            num_classes=_NUM_CLASSES,
            labels_to_names=labels_to_names)
    
    
    def has_labels(dataset_dir, filename='labels.txt'):
        """Specifies whether or not the dataset directory contains a label map file.
        Args:
          dataset_dir: The directory in which the labels file is found.
          filename: The filename where the class names are written.
        Returns:
          `True` if the labels file exists and `False` otherwise.
        """
        return tf.gfile.Exists(os.path.join(dataset_dir, filename))
    
    
    def read_label_file(dataset_dir, filename='labels.txt'):
        """Reads the labels file and returns a mapping from ID to class name.
        Args:
          dataset_dir: The directory in which the labels file is found.
          filename: The filename where the class names are written.
        Returns:
          A map from a label (integer) to class name.
        """
        labels_filename = os.path.join(dataset_dir, filename)
        with tf.gfile.Open(labels_filename, 'rb') as f:
            lines = f.read().decode()
        lines = lines.split('\n')
        lines = filter(None, lines)
    
        labels_to_class_names = {}
        for line in lines:
            index = line.index(':')
            labels_to_class_names[int(line[:index])] = line[index + 1:]
        return labels_to_class_names
    

    Reading then becomes very simple:

    dataset = cifar10.get_split('train', DATA_DIR)
    provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
    [image, label] = provider.get(['image', 'label'])
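
    The image and label tensors returned by the provider are again single examples, so they can be fed to the same batching ops as before. A minimal sketch (assuming the TFRecord files generated earlier):

    images, labels = tf.train.batch([image, label], batch_size=128)
    with tf.Session() as sess:
        sess.run(tf.group(tf.global_variables_initializer(),
                          tf.local_variables_initializer()))
        tf.train.start_queue_runners()
        images_val, labels_val = sess.run([images, labels])
        print(images_val.shape)  # (128, 32, 32, 3)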
    

    All of the code above can be found under slim in the tensorflow/models repository ~

    Summary

    The overall flow (a minimal end-to-end sketch follows the list):

    1. Generate the TFRecord file(s)
    2. Define a record reader to parse the TFRecord file
    3. Define a batcher
    4. Build the network model
    5. Initialize all ops
    6. Start the queue runners
    7. Run the training loop
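
    A minimal end-to-end sketch of these steps, under the assumptions of this post (TensorFlow 1.x and the cifar10_train.tfrecord generated above); the "model" here is a toy linear layer purely for illustration:

    filename_queue = tf.train.string_input_producer(['./data/cifar10_train.tfrecord'])
    _, serialized = tf.TFRecordReader().read(filename_queue)              # 2. record reader
    features = tf.parse_single_example(serialized, features={
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/class/label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_png(features['image/encoded'], channels=3)
    image = tf.image.resize_image_with_crop_or_pad(image, 32, 32)
    image = tf.cast(image, tf.float32) / 255.0
    label = features['image/class/label']

    images, labels = tf.train.shuffle_batch(                              # 3. batcher
        [image, label], batch_size=128, capacity=2000, min_after_dequeue=1000)

    logits = tf.layers.dense(tf.reshape(images, [-1, 32 * 32 * 3]), 10)   # 4. toy model
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.group(tf.global_variables_initializer(),              # 5. init
                          tf.local_variables_initializer()))
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)    # 6. queue runners
        for step in range(100):                                           # 7. training loop
            _, loss_val = sess.run([train_op, loss])
            if step % 10 == 0:
                print('step %d, loss %.4f' % (step, loss_val))
        coord.request_stop()
        coord.join(threads)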

    References

    1. Tfrecords Guide
    2. TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues
    3. 关于tensorflow 的数据读取线程管理QueueRunner
  • Character sequence prediction with an LSTM

    In "An Introduction to Recurrent Neural Networks" we saw what an RNN is. In this post we build a very simple character prediction model with TensorFlow and walk through the code in detail, so that I don't forget it later ( ╯□╰ ).

    First, define the RNN network:

    # imports assumed by the snippets in this post (TensorFlow 1.x)
    import random

    import numpy as np
    import tensorflow as tf
    from tensorflow.contrib import rnn

    class RNN:
        def __init__(self,
                     in_size,
                     cell_size,
                     num_layers,
                     out_size,
                     sess,
                     lr=0.003):
            self.in_size = in_size  # input size: the number of distinct characters (uppercase letters are lowercased to shrink the vocabulary and speed up training)
            self.cell_size = cell_size  # number of units in each LSTM cell
            self.num_layers = num_layers  # number of stacked LSTM layers
            self.out_size = out_size  # output size, again the number of characters
            self.sess = sess  # session
            self.lr = lr  # learning rate
    
            # stores the previous state, used at prediction/test time
            self.last_state = np.zeros([num_layers * 2 * cell_size])
    
            # input data, (batch, time_step, in_size)
            self.inputs = tf.placeholder(tf.float32, shape=[None, None, in_size])
            self.lstm_cells = [
                rnn.BasicLSTMCell(cell_size, state_is_tuple=False)
                for _ in range(num_layers)
            ]
            self.lstm = rnn.MultiRNNCell(self.lstm_cells, state_is_tuple=False)
    
            self.init_state = tf.placeholder(tf.float32, [None, num_layers * 2 * cell_size])
    
            # build the recurrent neural network
            outputs, self.new_state = tf.nn.dynamic_rnn(
                self.lstm,
                self.inputs,
                initial_state=self.init_state,
                dtype=tf.float32)
    
            # a final fully connected layer produces the logits; w and b are its parameters
            w = tf.Variable(tf.random_normal([cell_size, out_size], stddev=0.01))
            b = tf.Variable(tf.random_normal([out_size], stddev=0.01))
    
            reshaped_outputs = tf.matmul(tf.reshape(outputs, [-1, cell_size]), w) + b
            # convert the outputs into probabilities
            fc = tf.nn.softmax(reshaped_outputs)
    
            shape = tf.shape(outputs)
            self.final_outputs = tf.reshape(fc, [shape[0], shape[1], out_size])
    
            # labels, (batch, time_step, out_size)
            self.targets = tf.placeholder(tf.float32, [None, None, out_size])
    
            # loss
            self.cost = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    logits=reshaped_outputs,
                    labels=tf.reshape(self.targets, [-1, out_size])))
            self.optimizer = tf.train.RMSPropOptimizer(lr).minimize(self.cost)
    
        def train(self, inputs, targets):
            # init_state,(batch_size, num_layers * 2 * self.cell_size)
            init_state = np.zeros(
                [inputs.shape[0], self.num_layers * 2 * self.cell_size])
    
            _, loss = self.sess.run(
                [self.optimizer, self.cost],
                feed_dict={
                    self.inputs: inputs,
                    self.targets: targets,
                    self.init_state: init_state
                })
    
            return loss
    
        def get_next_char_pro(self, x, init=False):
            """根据输入字符x, 预测下一个字符并返回,x 的 shape 为(1, in_size), 即为单个字符的 one-hot 形式
            """
            if init:
                init_state = np.zeros([self.num_layers * 2 * self.cell_size])
            else:
                init_state = self.last_state
    
            out, next_state = self.sess.run(
                [self.final_outputs, self.new_state],
                feed_dict={self.inputs: [x], self.init_state: [init_state]})
                # store the current state for the next prediction, so the context is carried over
            self.last_state = next_state[0]
            return out[0][0]
    

    Prepare the data:

    def generate_one_hot_data(data, vocabulary):
        data_ = np.zeros([len(data), len(vocabulary)])
    
        count = 0
        for char in data:
            i = vocabulary.index(char)
            data_[count, i] = 1.0
            count += 1
        return data_
    
    
    path = 'data/data.txt'
    
    with open(path, 'r') as f:
        data = f.read()
    
    data = data.lower()  # lowercase everything to reduce the vocabulary size
    
    vocabulary = list(set(data))  # list of distinct characters
    
    one_hot_data = generate_one_hot_data(data, vocabulary)  # one_hot_data has shape (len(data), len(vocabulary))
    

    Define the parameters:

    in_size = out_size = len(vocabulary)
    
    cell_size = 128
    num_layers = 2
    batch_size = 64
    time_steps = 128
    
    NUM_TRAIN_BATCHES = 5000
    
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.InteractiveSession(config=config)
    
    model = RNN(in_size=in_size,
                cell_size=cell_size,
                num_layers=num_layers,
                out_size=out_size,
                sess=sess)
    
    sess.run(tf.global_variables_initializer())
    
    inputs = np.zeros([batch_size, time_steps, in_size])
    targets = np.zeros([batch_size, time_steps, out_size])
    
    possible_batch_start_ids = range(len(data) - time_steps - 1)
    

    Generate the training data and train:

    for i in range(NUM_TRAIN_BATCHES):
        batch_start_ids = random.sample(possible_batch_start_ids, batch_size)
    
        for j in range(time_steps):
            inputs_ids = [k + j for k in batch_start_ids]
            targets_ids = [k + j + 1 for k in batch_start_ids]
    
            inputs[:, j, :] = one_hot_data[inputs_ids, :]
            targets[:, j, :] = one_hot_data[targets_ids, :]
    
        loss = model.train(inputs, targets)
    
        if i % 100 == 0:
            print('loss: {:.5f} of batch {}'.format(loss, i))
    
    

    Testing:

    # generate a sentence starting with 'we '
    TEST_PREFIX = 'we '
    
    for i in range(len(TEST_PREFIX)):
        out = model.get_next_char_pro(
            generate_one_hot_data(TEST_PREFIX[i], vocabulary), i == 0)
    
    gen_str = TEST_PREFIX
    
    for i in range(200):
        index = np.random.choice(range(len(vocabulary)), p=out)
        pred = vocabulary[index]
        # print(index, len(out), len(vocabulary))
        gen_str += pred
    
        out = model.get_next_char_pro(generate_one_hot_data(pred, vocabulary))
    
    print(gen_str)
    

    Training output:

    loss: 3.12114 of batch 100
    loss: 3.06464 of batch 200
    loss: 2.49030 of batch 300
    loss: 2.20478 of batch 400
    loss: 2.02495 of batch 500
    loss: 1.88005 of batch 600
    loss: 1.74067 of batch 700
    loss: 1.70199 of batch 800
    loss: 1.69426 of batch 900
    loss: 1.54315 of batch 1000
    loss: 1.55390 of batch 1100
    loss: 1.48157 of batch 1200
    loss: 1.49068 of batch 1300
    loss: 1.48002 of batch 1400
    loss: 1.44718 of batch 1500
    loss: 1.44463 of batch 1600
    loss: 1.44997 of batch 1700
    loss: 1.41875 of batch 1800
    loss: 1.39939 of batch 1900
    loss: 1.40289 of batch 2000
    loss: 1.36561 of batch 2100
    loss: 1.37862 of batch 2200
    loss: 1.39195 of batch 2300
    loss: 1.36577 of batch 2400
    loss: 1.33385 of batch 2500
    loss: 1.36705 of batch 2600
    loss: 1.28989 of batch 2700
    loss: 1.35604 of batch 2800
    loss: 1.30150 of batch 2900
    loss: 1.30576 of batch 3000
    loss: 1.30832 of batch 3100
    loss: 1.28968 of batch 3200
    loss: 1.28770 of batch 3300
    loss: 1.27296 of batch 3400
    loss: 1.31154 of batch 3500
    loss: 1.30212 of batch 3600
    loss: 1.29709 of batch 3700
    loss: 1.28448 of batch 3800
    loss: 1.28766 of batch 3900
    loss: 1.25970 of batch 4000
    loss: 1.27008 of batch 4100
    loss: 1.29763 of batch 4200
    loss: 1.25666 of batch 4300
    loss: 1.29813 of batch 4400
    loss: 1.26807 of batch 4500
    loss: 1.23903 of batch 4600
    loss: 1.21010 of batch 4700
    loss: 1.27084 of batch 4800
    loss: 1.27161 of batch 4900
    loss: 1.25789 of batch 5000
    loss: 1.22986 of batch 5100
    loss: 1.24404 of batch 5200
    loss: 1.27089 of batch 5300
    loss: 1.23036 of batch 5400
    loss: 1.25348 of batch 5500
    loss: 1.23626 of batch 5600
    loss: 1.21493 of batch 5700
    loss: 1.20419 of batch 5800
    loss: 1.23771 of batch 5900
    loss: 1.20754 of batch 6000
    loss: 1.23489 of batch 6100
    loss: 1.20233 of batch 6200
    loss: 1.20366 of batch 6300
    loss: 1.23586 of batch 6400
    loss: 1.21687 of batch 6500
    loss: 1.19479 of batch 6600
    loss: 1.21297 of batch 6700
    loss: 1.23598 of batch 6800
    loss: 1.19476 of batch 6900
    loss: 1.21584 of batch 7000
    loss: 1.22816 of batch 7100
    loss: 1.19449 of batch 7200
    loss: 1.19346 of batch 7300
    loss: 1.23466 of batch 7400
    loss: 1.18541 of batch 7500
    loss: 1.19469 of batch 7600
    loss: 1.21069 of batch 7700
    loss: 1.19641 of batch 7800
    loss: 1.15550 of batch 7900
    loss: 1.19861 of batch 8000
    loss: 1.22582 of batch 8100
    loss: 1.19766 of batch 8200
    loss: 1.19041 of batch 8300
    loss: 1.15410 of batch 8400
    loss: 1.13109 of batch 8500
    loss: 1.16434 of batch 8600
    loss: 1.19457 of batch 8700
    loss: 1.18558 of batch 8800
    loss: 1.18043 of batch 8900
    loss: 1.17171 of batch 9000
    loss: 1.17663 of batch 9100
    loss: 1.14107 of batch 9200
    loss: 1.20001 of batch 9300
    loss: 1.16926 of batch 9400
    loss: 1.14761 of batch 9500
    loss: 1.15305 of batch 9600
    loss: 1.20601 of batch 9700
    loss: 1.16141 of batch 9800
    loss: 1.15704 of batch 9900
    

    Some generated samples:

    prefix:she
    sheated here do in the weary blood
    unless he single thou mayst break your ancients dren
    lads, then i in to thy fair age.
    so i say the nursed with mine eyes from my sleep.
    
    clarence:
    how is it pale and mo
    
    
    prefix:the
    the tears of mine.
    hark! they may providy to the extremement,
    make an office of day, like to be so run out.
    dost than the heaven and your pardon did forght
    the crown: all shall be too close hather change
    prefix:i
    in heavy part:
    in your comforts of our state exposed
    for in the kin a phalteres times disprison.
    
    isabella:
    dispatch and great to do a plance:
    and now they do repent us to be pitteth caser
    which in my 
    
    
    prefix:i
    it confined him to his necessities,
    but to an e this young belonged,'
    as it susminine of france of vices of folly,
    when though 'i trust up me not, our reward,
    sir, you shall be absent disposed.
    
    warwic
    
    
    prefix:me
    me;
    for will is angelo, take it,--and thou knows he did:
    one catesby!
    
    pronirerd:
    uppecured for me hencefully, you make one flight,
    making down thee thought in who thy by the world,
    but name, though hat
    
  • Broadcasting in TensorFlow and NumPy

    TensorFlow adopts NumPy's broadcasting mechanism to handle arithmetic between tensors of different shapes, saving memory and improving computational efficiency.

    NumPy array operations are normally computed element by element, which requires the two arrays to have exactly the same shape:

    >>> a = np.array([1.0, 2.0, 3.0])
    >>> b = np.array([2.0, 2.0, 2.0])
    >>> a * b
    array([ 2.,  4.,  6.])
    

    Broadcasting lifts this restriction: as long as the two shapes satisfy certain conditions, arrays of different shapes can still take part in arithmetic. The simplest case is multiplying an array by a scalar:

    >>> a = np.array([1.0, 2.0, 3.0])
    >>> b = 2.0
    >>> a * b
    array([ 2.,  4.,  6.])
    

    The result is the same as in the first example, where b was an array. You can think of the scalar b as being stretched into an array with the same shape as a, each element a copy of the original scalar; formally this matches the first example, so the result is of course the same. But the stretching is only conceptual: no copy is actually made, the scalar's value is simply used during the computation. The second example is therefore more efficient, because it saves memory.
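
    If you want to see the conceptual stretching made explicit, np.broadcast_to (available in newer NumPy versions) returns the stretched view; note that it still does not copy the data, and the result is read-only:

    >>> np.broadcast_to(2.0, (3,))
    array([ 2.,  2.,  2.])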

    Broadcasting rules

    When two arrays are combined in an arithmetic operation, NumPy compares their shapes element-wise, starting from the trailing dimension and working backwards. The shapes are broadcast-compatible if, for every compared pair of dimensions, one of the following holds:

    1. they are equal
    2. one of them is 1

    If the condition is not met, a ValueError: frames are not aligned exception is raised. Each dimension of the result's shape is the maximum of the two compared dimensions.

    The two arrays do not even need to have the same number of dimensions. For example, if a 256x256x3 array stores an RGB image and we want to scale each color channel by a different factor, we can multiply it by a one-dimensional array of shape (3,). Because the shapes are compared from the trailing dimension backwards, 3 == 3 and the broadcasting rule is satisfied:

    Image  (3d array): 256 x 256 x 3
    Scale  (1d array):             3
    Result (3d array): 256 x 256 x 3
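
    A concrete sketch of this example (the scale factors are made up for illustration):

    >>> image = np.ones((256, 256, 3))
    >>> scale = np.array([1.0, 0.5, 0.25])
    >>> (image * scale).shape
    (256, 256, 3)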
    

    When one of the compared dimensions is 1, it is "stretched" to the size of the other, i.e. its values are "copied" (again without an actual copy) to match the other array, for example:

    A      (4d array):  8 x 1 x 6 x 1
    B      (3d array):      7 x 1 x 5
    Result (4d array):  8 x 7 x 6 x 5
    

    More examples:

    A      (2d array):  5 x 4
    B      (1d array):      1
    Result (2d array):  5 x 4
    
    A      (2d array):  5 x 4
    B      (1d array):      4
    Result (2d array):  5 x 4
    
    A      (3d array):  15 x 3 x 5
    B      (3d array):  15 x 1 x 5
    Result (3d array):  15 x 3 x 5
    
    A      (3d array):  15 x 3 x 5
    B      (2d array):       3 x 5
    Result (3d array):  15 x 3 x 5
    
    A      (3d array):  15 x 3 x 5
    B      (2d array):       3 x 1
    Result (3d array):  15 x 3 x 5
    

    Some counterexamples (shapes that do not satisfy the broadcasting rules):

    A      (1d array):  3
    B      (1d array):  4 # trailing dimensions do not match
    
    A      (2d array):      2 x 1
    B      (3d array):  8 x 4 x 3 # second from last dimensions mismatched
    

    In practice:

    >>> x = np.arange(4)
    >>> xx = x.reshape(4,1)
    >>> y = np.ones(5)
    >>> z = np.ones((3,4))
    
    >>> x.shape
    (4,)
    
    >>> y.shape
    (5,)
    
    >>> x + y
    <type 'exceptions.ValueError'>: shape mismatch: objects cannot be broadcast to a single shape
    
    >>> xx.shape
    (4, 1)
    
    >>> y.shape
    (5,)
    
    >>> (xx + y).shape
    (4, 5)
    
    >>> xx + y
    array([[ 1.,  1.,  1.,  1.,  1.],
           [ 2.,  2.,  2.,  2.,  2.],
           [ 3.,  3.,  3.,  3.,  3.],
           [ 4.,  4.,  4.,  4.,  4.]])
    
    >>> x.shape
    (4,)
    
    >>> z.shape
    (3, 4)
    
    >>> (x + z).shape
    (3, 4)
    
    >>> x + z
    array([[ 1.,  2.,  3.,  4.],
           [ 1.,  2.,  3.,  4.],
           [ 1.,  2.,  3.,  4.]])
    

    Another example:

    >>> a = np.array([0.0, 10.0, 20.0, 30.0])
    >>> b = np.array([1.0, 2.0, 3.0])
    >>> a[:, np.newaxis] + b
    array([[  1.,   2.,   3.],
           [ 11.,  12.,  13.],
           [ 21.,  22.,  23.],
           [ 31.,  32.,  33.]])
    

    The newaxis operation inserts a new axis into a, turning it into a two-dimensional 4x1 array; adding a 4x1 array to a (3,) array therefore yields a 4x3 array.
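
    Although all of the examples above use NumPy, TensorFlow follows the same rules for its arithmetic ops. A minimal sketch (assuming TensorFlow 1.x with a Session) mirroring the newaxis example:

    import tensorflow as tf

    a = tf.constant([0.0, 10.0, 20.0, 30.0])  # shape (4,)
    b = tf.constant([1.0, 2.0, 3.0])          # shape (3,)
    result = tf.expand_dims(a, 1) + b         # (4, 1) + (3,) broadcasts to (4, 3)

    with tf.Session() as sess:
        print(sess.run(result))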

    References

    1. https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
  • Common code snippets

    Install the latest Nginx on Ubuntu:

    sudo -s
    nginx=stable # use nginx=development for latest development version
    add-apt-repository ppa:nginx/$nginx
    apt-get update
    apt-get install nginx
    

    Check the size of each MySQL database:

    SELECT table_schema                                        "DB Name", 
       Round(Sum(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB" 
    FROM   information_schema.tables 
    GROUP  BY table_schema; 
    

    Install Shadowsocks (SS):

    sudo apt-get update
    sudo apt-get install python3-pip
    sudo pip3 install shadowsocks
    sudo mkdir /var/ss
    sudo vim /var/ss/ss.json
    # put the following JSON into /var/ss/ss.json, then start the server:
    ssserver -c /var/ss/ss.json -d start
    
    {
        "server":"0.0.0.0",
        "server_port":8388,
        "local_address": "127.0.0.1",
        "local_port":1080,
        "password":"mypassword",
        "timeout":300,
        "method":"aes-256-cfb",
        "fast_open": false
    }
    

    Useful

    # NVIDIA GPUs 
    nvidia-smi
    
    # folder size. -h: human readable, -s: for summary
    du -hs /path/to/directory
    # check the CUDA version
    cat /usr/local/cuda/version.txt
    # check the cuDNN version
    cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
    

    Create / extract tar archives

    tar -cvf myfolder.tar myfolder
    tar -xf archive.tar -C /target/directory
    tar -xvzf archive.tar.gz -C /target/directory
    

    Files

    # count the files in the current directory
    find . -type f | wc -l
    # count the lines in a file
    wc -l a.txt
    # count the words in a file
    wc -w a.txt
    # print the whole file with a line number in front of each line
    cat -n a.txt 
    # check free disk space
    df -hl
    # size of a directory
    du -sh <directory>
    

    SSH

    # Forward port 127.0.0.1:6006 on the remote server <remote-ip> to local port 16006, i.e. opening
    # localhost:16006 locally actually reaches 127.0.0.1:6006 on the remote server.
    ssh -L 16006:127.0.0.1:6006 <username>@<remote-ip> -p <port>
    
    ssh -N -f -L localhost:16006:localhost:6006 <username>@<remote-ip>
    -N : no remote commands
    -f : put ssh in the background
    -L <machine1>:<portA>:<machine2>:<portB> : forward <machine2>:<portB> (remote scope) to <machine1>:<portA> (local scope)
    

    Copy & Sync

    # -a keeps permissions etc., and -h makes the output human readable.
    # --info=progress2 shows the overall progress percentage
    rsync -ah --info=progress2 source destination
    

    Install pyenv

    sudo apt-get install --no-install-recommends make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
    git clone https://github.com/pyenv/pyenv.git ~/.pyenv
    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
    echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
    echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n  eval "$(pyenv init -)"\nfi' >> ~/.zshrc
    exec "$SHELL"
    pyenv install 3.6.5
    pyenv global 3.6.5
    exec "$SHELL"
    

    Install cocoapi

    git clone https://github.com/cocodataset/cocoapi.git
    cd cocoapi/PythonAPI
    python setup.py build_ext install