“TensorFlow - Install CUDA, CuDNN & TensorFlow in AWS EC2 P2”

This article was written in 2017, so some of the information needs updating by now. In particular, the Amazon AMI is now free of charge. Nevertheless, building an AMI for your own software platform is sometimes necessary, so I will leave this article as is.

We deploy our TensorFlow applications to AWS EC2 instances with Nvidia GPUs. In the first half, we use an Amazon Deep Learning AMI. In the second half, we build a TensorFlow platform from scratch using Ubuntu.

EC2 P2 spot instance with Deep Learning AMI

Amazon EC2 instances with GPU are expensive. Therefore, we pick the spot instances to reduce the cost. Nevertheless, spot instances can be terminated when your maximum bidding price falls below the current price. To mitigate that, we can create and attach another EBS volume to the instance. This additional volume stores training data, results and checkpoints. When an instance is terminated, we create a new instance and re-attach the EBS volume to the new instance. The training can later be resumed from the checkpoints.
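The re-attach step can also be scripted with the AWS CLI instead of the console. The sketch below assumes the AWS CLI is installed and configured; the volume ID, instance ID and the helper name reattach_volume are hypothetical placeholders. By default it only prints the command, so nothing is changed until RUN=1 is set. The volume and the new instance must be in the same availability zone.

```shell
# Hypothetical IDs: replace with your own data volume and new spot instance.
VOLUME_ID="vol-0123456789abcdef0"
INSTANCE_ID="i-0123456789abcdef0"

# Re-attach the data volume to an instance.
# $1 = volume ID, $2 = instance ID, $3 = device name (defaults to /dev/sdf)
# Prints the AWS CLI command unless RUN=1 is set, in which case it runs it.
reattach_volume() {
  set -- "$1" "$2" "${3:-/dev/sdf}"
  if [ "${RUN:-0}" = "1" ]; then
    aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
  else
    echo "aws ec2 attach-volume --volume-id $1 --instance-id $2 --device $3"
  fi
}

reattach_volume "$VOLUME_ID" "$INSTANCE_ID"
```

On the P2 instances used in this article, a volume attached as /dev/sdf shows up inside the instance as xvdf.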

Pricing differs between AWS regions. N. Virginia, Ohio and Oregon are usually cheaper. Terminate the instance when it is not needed.

The cheapest instance with a GPU is p2.xlarge. For a p2.xlarge instance in 2017, the lowest AMI charge is $0.9/hour while the lowest spot instance charge is $0.2/hour.
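To put the two hourly rates in perspective, the sketch below multiplies them out for a single run. The 10-hour duration is an assumption chosen for illustration (roughly an overnight run), and run_cost is a hypothetical helper name; the rates are the 2017 figures quoted above.

```shell
# Cost of one run in USD.
# $1 = hourly rate in USD, $2 = number of hours
run_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

run_cost 0.9 10   # at the $0.9/hour rate
run_cost 0.2 10   # at the $0.2/hour spot rate
```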

p2.xlarge and Deep Learning AMI

In the AWS console, we create a new p2.xlarge spot instance.

Then locate the Deep learning AMI.

Network setting

We keep the default for the network.

Different regions may have different network configurations.

Storage

On the next page, we keep the default settings for the storage and instance details. A 50 GiB EBS volume is created.

We can attach another EBS volume to store permanent data later.

Keypair and security

We configure a key and a security group for the new instance:

Create a new key pair:

Create a security group:

Review

The final configuration should look similar to:

ssh access

We copy our private key tf.pem into the home directory of our local machine and configure the ssh to access the spot instance.

cd ~/.ssh
mv ../tf.pem .    # Change the location of ../tf.pem accordingly
chmod 400 tf.pem    # ssh refuses a private key that other users can read

Edit the ssh configuration

vi config

Replace the hostname with your public IP address of the spot instance:

Host tf
    Hostname xx.xx.xx.xx
    Port 22
    User ec2-user
    ServerAliveInterval 60
    IdentityFile ~/.ssh/tf.pem

To access the spot instance:

ssh tf
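A replacement spot instance typically comes up with a different public IP address, so the Host entry has to be rewritten each time. As a minimal sketch, the hypothetical helper ssh_host_stanza below prints the same stanza for a given address; 203.0.113.10 is a documentation-only placeholder, not a real instance.

```shell
# Print an ssh config stanza for the instance.
# $1 = host alias, $2 = public IP, $3 = login user, $4 = private key path
ssh_host_stanza() {
  printf 'Host %s\n' "$1"
  printf '    Hostname %s\n' "$2"
  printf '    Port 22\n'
  printf '    User %s\n' "$3"
  printf '    ServerAliveInterval 60\n'
  printf '    IdentityFile %s\n' "$4"
}

# The key path is single-quoted so that ssh, not the shell, expands the "~".
ssh_host_stanza tf 203.0.113.10 ec2-user '~/.ssh/tf.pem'
```

Redirecting the output with >> ~/.ssh/config appends the stanza; an older entry for the same alias should be removed first, since ssh uses the first match it finds.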

Testing

We should see K80 as the GPU when we run nvidia-smi on the spot instance.

$ nvidia-smi
Tue Sep 19 18:36:16 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   64C    P8    30W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

To find the CUDA version (version 8):

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

To find the CuDNN version (version 5.1):

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
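The three defines can also be collapsed into a single version string, which is easier to compare against the version a TensorFlow release expects. This is a sketch only: cudnn_version is a hypothetical helper, and it assumes the header keeps the "#define NAME VALUE" layout shown above.

```shell
# Print the cuDNN version as MAJOR.MINOR.PATCHLEVEL from a cudnn.h header.
# $1 = path to cudnn.h (defaults to the location used above)
cudnn_version() {
  awk '
    $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
    $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
    $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
    END { print major "." minor "." patch }
  ' "${1:-/usr/local/cuda/include/cudnn.h}"
}

# Only query the default location when the header is actually installed.
if [ -f /usr/local/cuda/include/cudnn.h ]; then
  cudnn_version
fi
```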

Run a TensorFlow application from the TensorFlow distribution:

git clone https://github.com/tensorflow/tensorflow.git

Run the MNIST application. It takes a couple of minutes for the application to download the MNIST dataset.

python3 tensorflow/tensorflow/examples/tutorials/mnist/fully_connected_feed.py
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
2017-09-19 18:39:57.043575: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-19 18:39:57.043605: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-19 18:39:57.043611: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-19 18:39:57.043615: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-19 18:39:57.043619: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-19 18:39:58.214521: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-09-19 18:39:58.215033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-09-19 18:39:58.215056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-09-19 18:39:58.215062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y
2017-09-19 18:39:58.215069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
Step 0: loss = 2.32 (7.246 sec)
Step 100: loss = 2.18 (0.002 sec)
Step 200: loss = 1.95 (0.002 sec)
Step 300: loss = 1.75 (0.002 sec)
Step 400: loss = 1.29 (0.002 sec)
Step 500: loss = 1.09 (0.002 sec)
Step 600: loss = 0.87 (0.001 sec)
Step 700: loss = 0.77 (0.001 sec)
Step 800: loss = 0.67 (0.002 sec)
Step 900: loss = 0.54 (0.002 sec)
Training Data Eval:
  Num examples: 55000  Num correct: 46762  Precision @ 1: 0.8502
Validation Data Eval:
  Num examples: 5000  Num correct: 4274  Precision @ 1: 0.8548
Test Data Eval:
  Num examples: 10000  Num correct: 8560  Precision @ 1: 0.8560
Step 1000: loss = 0.45 (0.007 sec)
Step 1100: loss = 0.50 (0.101 sec)
Step 1200: loss = 0.48 (0.002 sec)
Step 1300: loss = 0.46 (0.001 sec)
Step 1400: loss = 0.46 (0.001 sec)
Step 1500: loss = 0.53 (0.002 sec)
Step 1600: loss = 0.54 (0.002 sec)
Step 1700: loss = 0.39 (0.001 sec)
Step 1800: loss = 0.50 (0.002 sec)
Step 1900: loss = 0.45 (0.001 sec)
Training Data Eval:
  Num examples: 55000  Num correct: 49169  Precision @ 1: 0.8940
Validation Data Eval:
  Num examples: 5000  Num correct: 4526  Precision @ 1: 0.9052
Test Data Eval:
  Num examples: 10000  Num correct: 9013  Precision @ 1: 0.9013

Build TensorFlow on Ubuntu

Instead of creating an EC2 instance from a Deep Learning AMI, this section creates an EC2 P2 instance from an Ubuntu AMI.

The Ubuntu AMI id can be located here by searching “64 us ebs 16.04 HVM” for the US region.

US East (Ohio)	us-east-2
US East (N. Virginia)	us-east-1
US West (N. California)	us-west-1
US West (Oregon)	us-west-2

Spot instance information:

Also increase the SSD storage size

Since this is Ubuntu, we need to change the user from ec2-user to ubuntu in the ssh configuration.

Host tf
    Hostname xx.xx.xx.xx
    Port 22
    User ubuntu
    ServerAliveInterval 60
    IdentityFile ~/.ssh/tf.pem

After creating the instance, we ssh into the system and verify that we have an Nvidia GPU.

ssh tf

$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Install & upgrade prerequisite packages

sudo apt-get update
sudo apt-get --assume-yes upgrade
sudo apt-get --assume-yes install build-essential

CUDA 8

To install a new version of TensorFlow, we need to verify the version of CUDA and CuDNN that it supports. Go to the Nvidia developer web site to locate the new download if it is different from below.

Install CUDA 8.0

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

Add CUDA to PATH and LD_LIBRARY_PATH

cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
source ~/.bashrc
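The ${PATH:+:${PATH}} form in the lines above is a shell parameter expansion: it expands to a colon followed by the old value only when the variable is already set and non-empty. That matters mostly for LD_LIBRARY_PATH, which is often unset on a fresh machine; a plain "dir:$LD_LIBRARY_PATH" would then leave a trailing colon, and an empty entry is treated as the current directory. The hypothetical helper prepend_path below isolates the idiom:

```shell
# Prepend a directory to a colon-separated path value.
# $1 = directory to prepend, $2 = current value (may be empty)
prepend_path() {
  echo "$1${2:+:$2}"
}

prepend_path /usr/local/cuda-8.0/lib64 ""            # prints /usr/local/cuda-8.0/lib64
prepend_path /usr/local/cuda-8.0/bin /usr/bin:/bin   # prints /usr/local/cuda-8.0/bin:/usr/bin:/bin
```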

Verify CUDA installation

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

$ nvidia-smi
Wed Sep 20 01:26:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |      0MiB / 11439MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

Upgrade again and reboot

sudo apt-get --assume-yes dist-upgrade
sudo reboot

CuDNN 6

Download CuDNN here to your local machine and copy the archive to the server (assuming ssh is configured correctly for tf):

scp cudnn-8.0-linux-x64-v6.0.tgz ubuntu@tf:~

In the spot instance:

cd
tar -xzf cudnn-8.0-linux-x64-v6.0.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/

To speed up the performance:

sudo nvidia-smi -pm 1     # Configure the GPU setting to be persistent
sudo nvidia-smi --auto-boost-default=0 # Disable autoboost
sudo nvidia-smi -ac 2505,875           # Set GPU clock speed to the highest
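The pair 2505,875 (memory clock, graphics clock, in MHz) is specific to the K80. On a different GPU, the values can be read from the driver rather than hard-coded. The sketch below assumes the installed nvidia-smi supports the clocks.max.memory and clocks.max.graphics query fields and uses only the first GPU; clocks_arg is a hypothetical helper. It prints the command instead of running it.

```shell
# Turn one CSV line "MEMORY_MHZ, GRAPHICS_MHZ" (read from stdin) into the
# "MEMORY,GRAPHICS" argument expected by "nvidia-smi -ac".
clocks_arg() {
  awk -F',' '{ gsub(/ /, "", $1); gsub(/ /, "", $2); print $1 "," $2 }'
}

# Only query the driver when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  arg=$(nvidia-smi --query-gpu=clocks.max.memory,clocks.max.graphics \
          --format=csv,noheader,nounits | head -n 1 | clocks_arg)
  echo "sudo nvidia-smi -ac $arg"
fi
```

Running nvidia-smi -q -d SUPPORTED_CLOCKS lists every clock pair the GPU accepts, which is a useful cross-check before applying the setting.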

Verify the installation again:

nvidia-smi

Anaconda3

We install Anaconda3 for the data science libraries.

cd
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
bash Anaconda3-4.4.0-Linux-x86_64.sh -b
cat >> ~/.bashrc << 'EOF'
export PATH=$HOME/anaconda3/bin:${PATH}
EOF
source ~/.bashrc

conda upgrade -y --all

Note, we are running in Python 3.

Create a new environment, tf, and activate it.

conda create --name tf
source activate tf

Note: whenever we log out of a session, we need to activate the environment again.

source activate tf

It is important to run source activate tf so that the correct environment is active before running the TensorFlow application.

All environments, with the current one marked by an asterisk, can be listed with:

(tf) $ conda info --envs
# conda environments:
#
tf                    *  /home/ubuntu/anaconda3/envs/tf
root                     /home/ubuntu/anaconda3

To end the session if we are done:

source deactivate

Install TensorFlow

source activate tf  # if we are not in tf env.
pip install tensorflow-gpu

Run the TensorFlow application. (This will take a few minutes to first download the dataset.)

git clone https://github.com/tensorflow/tensorflow.git
python tensorflow/tensorflow/examples/tutorials/mnist/fully_connected_feed.py

Build TensorFlow

As shown below, the official TensorFlow release was built with generic settings rather than settings optimized for your computer.

The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

However, in our GAN application, we do not see significant performance improvement. (Details later)

Here, we build TensorFlow manually. Check out the TensorFlow source if you have not done so already. (This was done in the previous steps.)

$ cd
$ git clone https://github.com/tensorflow/tensorflow.git

Switch to v1.3.1

$ cd ~/tensorflow
$ git checkout v1.3.1

Install bazel (the build tool)

conda install -c conda-forge bazel

Configure the build

Make sure to enable CUDA

(tf) $ ./configure
...
Do you wish to build TensorFlow with CUDA support? [y/N] y

(tf) $ ./configure
Extracting Bazel installation...
.........
You have bazel 0.4.5- installed.
Please specify the location of python. [Default is /home/ubuntu/anaconda3/bin/python]:
Found possible Python library paths:
  /home/ubuntu/anaconda3/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/home/ubuntu/anaconda3/lib/python3.6/site-packages]

Using python library path: /home/ubuntu/anaconda3/lib/python3.6/site-packages
Do you wish to build TensorFlow with MKL support? [y/N] y
MKL support will be enabled for TensorFlow
Do you wish to download MKL LIB from the web? [Y/n]
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? [Y/n]
jemalloc enabled
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N]
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N]
No XLA support will be enabled for TensorFlow
Do you wish to build TensorFlow with VERBS support? [y/N]
No VERBS support will be enabled for TensorFlow
Do you wish to build TensorFlow with OpenCL support? [y/N]
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] y
CUDA support will be enabled for TensorFlow
Do you want to use clang as CUDA compiler? [y/N]
nvcc will be used as CUDA compiler
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]:
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]:
Please specify the location where cuDNN 6 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.7"]:
Do you wish to build TensorFlow with MPI support? [y/N]
MPI support will not be enabled for TensorFlow
Configuration finished

Install libcupti-dev

sudo apt-get install libcupti-dev

Build TensorFlow (this takes about an hour)

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 

Replace the installed TensorFlow

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/

cd
ls -al *.whl
pip uninstall tensorflow-gpu
pip install tensorflow-1.3.1-cp36-cp36m-linux_x86_64.whl

cd
python tensorflow/tensorflow/examples/tutorials/mnist/fully_connected_feed.py

Create a new AMI

We can use the new instance to create a new AMI, and then build further instances from it without going through the previous steps.

Once a new instance is created from the new AMI, we need to reapply the GPU settings:

sudo nvidia-smi -pm 1     # Configure the GPU setting to be persistent
sudo nvidia-smi --auto-boost-default=0 # Disable autoboost
sudo nvidia-smi -ac 2505,875           # Set GPU clock speed to the highest

This AMI is publicly available as ami-5db19338 in the us-east-2 region (Ohio).

Attach an EBS volume for permanent data

In the AWS console, we create a new volume under Elastic Block Store.

Then we attach the volume to our instance.

$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0  30G  0 disk
└─xvda1 202:1    0  30G  0 part /
xvdf    202:80   0  20G  0 disk

Our new drive is called xvdf above.

Check whether it is empty. (If the output is “/dev/xvdf: data”, it is empty.)

$ sudo file -s /dev/xvdf
/dev/xvdf: data

Format the drive if it is empty. WARNING: This will erase data on xvdf.

sudo mkfs -t ext4 /dev/xvdf
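Because mkfs is destructive, the emptiness check can be turned into a guard that runs before any formatting. The sketch below is one way to do it, assuming the device name /dev/xvdf from above; is_blank_volume is a hypothetical helper that inspects the text printed by file -s. It only reports what it finds and never formats anything itself.

```shell
# Succeed only when "file -s" reports plain "data", i.e. no filesystem signature.
# $1 = the output line of "sudo file -s DEVICE"
is_blank_volume() {
  case "$1" in
    *": data") return 0 ;;
    *)         return 1 ;;
  esac
}

device=/dev/xvdf
# Skip the check entirely when the device is not attached.
if [ -b "$device" ]; then
  if is_blank_volume "$(sudo file -s "$device")"; then
    echo "$device looks empty; it can be formatted with: sudo mkfs -t ext4 $device"
  else
    echo "$device already contains data; not formatting"
  fi
fi
```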

Create a new directory for your application and data.

mkdir /home/ubuntu/gan

Note: If you want to run multiple applications on this instance, create a directory like “/data” instead.

Mount the drive to the new directory

sudo mount /dev/xvdf /home/ubuntu/gan

Change the directory ownership

sudo chown -R ubuntu:ubuntu /home/ubuntu/gan

Verify

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             30G     0   30G   0% /dev
tmpfs           6.0G  8.9M  6.0G   1% /run
/dev/xvda1       30G   16G   14G  54% /
tmpfs            30G     0   30G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            30G     0   30G   0% /sys/fs/cgroup
tmpfs           6.0G     0  6.0G   0% /run/user/1000
/dev/xvdf        20G   44M   19G   1% /home/ubuntu/gan

To enable auto-mount on reboot, we need to locate the UUID for the new EBS.

$ sudo cp /etc/fstab /etc/fstab.bak
$ sudo blkid
/dev/xvda1: LABEL="cloudimg-rootfs" UUID="5ba6ba49-ee5b-42b0-93f7-2f2b89fc1c71" TYPE="ext4" PARTUUID="ac6c9084-01"
/dev/xvdf: UUID="abe58d1a-8037-477d-9034-777e1b31fa35" TYPE="ext4"

Edit the auto-mount file (/etc/fstab).

sudo vi /etc/fstab

Replace the UUID with the one for your EBS volume and add the line below to the file:

UUID=abe58d1a-8037-477d-9034-777e1b31fa35 /home/ubuntu/gan ext4 noatime,defaults 0 0

Note: consult with your system admin for the mount options and flags.
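To avoid copy-and-paste mistakes in the UUID, the fstab line can be generated rather than typed. This is a sketch using the same mount options as the line above; fstab_line is a hypothetical helper, and the UUID shown is the one from this article's blkid output, not yours.

```shell
# Print an /etc/fstab line for an ext4 volume.
# $1 = filesystem UUID (as reported by blkid), $2 = mount point
fstab_line() {
  printf 'UUID=%s %s ext4 noatime,defaults 0 0\n' "$1" "$2"
}

fstab_line abe58d1a-8037-477d-9034-777e1b31fa35 /home/ubuntu/gan
```

On the instance, the UUID alone can be read with sudo blkid -s UUID -o value /dev/xvdf. After editing /etc/fstab, running sudo mount -a mounts everything listed in the file, so a mistake shows up immediately instead of at the next reboot.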

Performance comparison

No doubt, a machine with a GPU runs a TensorFlow application much faster. Our TensorFlow application, a generative adversarial network (GAN), runs 25 times faster on the P2 instance than on a local Mac. However, this is a poor comparison because our 4-year-old Mac has a slower CPU. Nevertheless, the speed improvement is significant, since we need to run our GAN application overnight on P2 to show even some reasonable results. Running a GAN on a machine without a GPU is not feasible.

Is it worth building a custom environment rather than using the Amazon Deep Learning AMI? The Amazon Deep Learning AMI creates an environment in minutes with all the important packages, but it charges extra per hour for the software. With our own custom system, we can use the latest versions of CUDA, CuDNN and the Python libraries. However, unexpected issues often pop up and take time to resolve, and the first build may take hours to finish. Once an AMI is built, though, there is not much difference in creating a new instance.

Is it worth rebuilding TensorFlow with the CPU optimizations for the computer? Our GAN application makes heavy use of CNN, ReLU and fully connected networks. From end to end, we see no speed improvement. We suspect that most of the important work is done on the GPU, where the CPU optimizations have no impact. We do notice that variable initialization and checkpoint saving are much faster in our custom build, but these are not frequent enough to make a difference. Whether a custom TensorFlow build is faster in general remains inconclusive. In our situation, the difference is not significant for our GAN application.