The Distributed TensorFlow Guide

Nov. 16, 2017, 8:19 p.m. By: TechLeer Admin

Distributed TensorFlow

Distributed TensorFlow is part of "regular" TensorFlow and occupies its own subdirectory within the main repository. "The distributed version of TensorFlow is supported by a high performance, open source, general RPC framework that puts mobile and HTTP/2 first called gRPC for inter-process communication," as noted in that directory's README.

A few features of Distributed TensorFlow include:

  • CPU/GPU Scheduling

  • Cost-based Optimization

  • Fault Tolerance

  • Scalability

To get started, a TensorFlow server binary (grpc_tensorflow_server) and a gRPC-based client need to be built from source, as the pre-built binaries available at the time did not support distributed processing. Simple instructions for doing so with Google's own publicly available build tool, Bazel, are also provided. That covers the basic introduction to Distributed TensorFlow.
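With the TensorFlow >= 1.2 releases that the guide itself requires, a gRPC server can also be started in-process via the TF 1.x `tf.train.Server` API, without the standalone binary. A minimal sketch, in which the host/port addresses are placeholder assumptions:

```python
# Minimal sketch of starting a gRPC TensorFlow server in-process.
# TensorFlow 1.x-style API; the addresses below are placeholder assumptions.

CLUSTER = {
    "ps": ["localhost:2222"],                        # one parameter-server task
    "worker": ["localhost:2223", "localhost:2224"],  # two worker tasks
}

def start_server(job_name, task_index):
    # Imported lazily so the cluster definition above stands on its own.
    import tensorflow as tf
    cluster = tf.train.ClusterSpec(CLUSTER)
    # Every process in the cluster runs one server for its own task.
    return tf.train.Server(cluster, job_name=job_name, task_index=task_index)
```

Each process in the cluster would call, for example, `start_server("worker", 0)` with its own job name and task index, then `server.join()` to begin serving gRPC requests.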

Now, moving on to the Distributed TensorFlow Guide itself: what it is for and what it has to offer.

The Distributed TensorFlow Guide is a collection of basic distributed TensorFlow tutorials and distributed training examples that can serve as boilerplate code. Many of the examples focus on implementing well-known distributed training schemes, and almost all of them use data parallelism (between-graph replication), so they can also be run on a single machine with only a CPU.
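Between-graph replication means each worker process builds its own copy of the graph, with variables pinned to parameter-server tasks. A rough sketch using the TF 1.x API, where the cluster addresses and the toy model are placeholder assumptions rather than code from the guide:

```python
# Rough sketch of between-graph replication (data parallelism):
# every worker runs this same script with a different task_index.
# TF 1.x-style API; addresses and model are placeholder assumptions.

CLUSTER = {"ps": ["localhost:2222"],
           "worker": ["localhost:2223", "localhost:2224"]}

def build_and_train(task_index, steps=1000):
    import tensorflow as tf  # imported lazily; TF 1.x-style API
    cluster = tf.train.ClusterSpec(CLUSTER)
    server = tf.train.Server(cluster, job_name="worker", task_index=task_index)
    # Variables are placed on the ps task(s); ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        w = tf.Variable(0.0)
        loss = tf.square(w - 3.0)  # toy objective
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    # MonitoredTrainingSession handles init/recovery; task 0 acts as chief.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0)) as sess:
        for _ in range(steps):
            sess.run(train_op)
```

Because each worker applies its own updates to the shared variables, this scheme is asynchronous by default; the workers never wait for one another.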


The motivation for this guide is rooted in the current state of distributed deep learning. Deep learning papers today typically demonstrate successful new architectures on some benchmark, but rarely show how those models can actually be trained with 1,000 times the data, which is what industry now requires.

On top of that, most successful distributed cases rely on state-of-the-art hardware, and there has been very little research showing the potential of asynchronous training, which is why this guide provides so many examples of it. With so much going on in the world of TensorFlow these days, these reasons, together with the lack of documentation for distributed TensorFlow, are what motivated the start of this project.
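The core idea behind asynchronous training can be illustrated without TensorFlow at all. In this toy, pure-Python sketch (not taken from the guide), several "workers" minimize f(w) = (w - 3)^2 by applying gradient updates to a shared parameter without any synchronization, so each worker may compute its gradient from a slightly stale value:

```python
# Toy illustration of asynchronous data-parallel SGD: multiple worker
# threads update a shared parameter without locks or barriers.
import threading

w = [0.0]  # shared parameter, playing the role of a parameter server

def worker(steps, lr=0.05):
    for _ in range(steps):
        grad = 2.0 * (w[0] - 3.0)  # gradient of (w - 3)^2, possibly stale
        w[0] -= lr * grad          # unsynchronized (asynchronous) update

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite stale reads, the shared parameter still converges toward w = 3.
```

Even with racy, unsynchronized updates, the parameter contracts toward the optimum; tolerating this kind of staleness in exchange for never blocking is exactly the trade-off asynchronous training schemes make.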

TensorFlow is a great tool that prides itself on its scalability, but unfortunately there are very few examples that show how to make a model scale with data size. The ultimate aim of this guide is to aid anyone interested in distributed deep learning, from beginners to researchers, and to provide them with everything needed to do so.

The specific requirements:

  • Python 2.7

  • TensorFlow >= 1.2

For More Information: GitHub