Next, we create a test Python script to check whether TensorFlow sees the GPUs on the GPU node. We will name
it test.py:
    # Test whether TensorFlow can be imported
    print("Importing TensorFlow", flush=True)
    import tensorflow as tf

    # Test whether TensorFlow sees the GPU
    print("\nlisting physical devices:", flush=True)
    print(tf.config.list_physical_devices())
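For production jobs it can be handy to have the check fail loudly when no GPU is visible. The following is a minimal sketch of that idea, not part of the original test.py: the function names count_gpus and main and the exit codes are assumptions chosen for illustration. It relies only on tf.config.list_physical_devices(), the same call used above, and on the device_type field of each returned device.

```python
def count_gpus(devices):
    """Count the entries whose device_type field is 'GPU'."""
    return sum(1 for d in devices if getattr(d, "device_type", None) == "GPU")


def main():
    """Return 0 if TensorFlow sees a GPU, 1 if it sees none, 2 if the import fails."""
    try:
        import tensorflow as tf  # imported here so a failure can be reported cleanly
    except ImportError:
        print("TensorFlow could not be imported", flush=True)
        return 2

    devices = tf.config.list_physical_devices()
    n_gpus = count_gpus(devices)
    print(f"GPUs visible to TensorFlow: {n_gpus}", flush=True)
    return 0 if n_gpus > 0 else 1
```

To use this as a script, one would call main() at the bottom of the file and pass its return value to sys.exit(), so that the job's exit status reflects whether a GPU was found.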
We'll then configure our conda environment for the job. This can be accomplished using Anaconda's
robust environment.yaml system [2]. Because we want to use TensorFlow on a GPU node, we will install only the
tensorflow-gpu package. If other packages were required, they could be added as new entries under dependencies.
The environment.yaml we will use is below:
We'll finish our job setup by writing a PBS submission script to take care
of setting up the TensorFlow environment and running the Python script. We will name this file job.pbs:
    #!/bin/bash
    #PBS -N TensorFlow-Test
    #PBS -j oe
    #PBS -l nodes=1
    #PBS -l ppn=1
    #PBS -l walltime=00:00:10:00
    #PBS -q GOF

    # ===================
    # CONFIG FOR THIS JOB
    # ===================

    # condaModule is taken directly from the list of installed modules.
    # The full list can be found by running `module avail`
    condaModule="python/anaconda3-5.2.0"

    # Name of the environment file we created
    tfEnvironmentFile="environment.yaml"

    # Name of the python script we want to run
    pythonScriptFile="test.py"

    # =======
    # RUN JOB
    # =======

    # Create the environment and install TensorFlow
    module load $condaModule
    conda env create -f $tfEnvironmentFile --force &> conda_install_log.txt

    # Activate the TensorFlow environment we created
    environmentName=$(grep "name:" $tfEnvironmentFile | head -n1 | awk '{print $NF}')
    source activate $environmentName

    # Run TensorFlow
    python $pythonScriptFile &> python_log.txt
The above script will run on a single GPU node with a single CPU core reserved (the job will still have access to the
GPU resource), for at most 10 minutes. To make the script easier to modify for production work, we have included
three shell variables in the configuration section:
condaModule: The name of the desired Anaconda module, which can be found by running `module avail`.
tfEnvironmentFile: The name of the environment file we want to use. In this case, it is the environment.yaml
file we created earlier.
pythonScriptFile: The name of the Python script we want to run. In this case, it is the test.py script we
created earlier.
The job will then run, first installing all the packages specified in environment.yaml. A log of the installation will
be saved to conda_install_log.txt. Note the use of the --force option; this will overwrite any existing conda
environment with the same name as the one specified in environment.yaml.
The job will then activate the environment and run the Python script. The Python script's stdout and stderr will be
written to python_log.txt.
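The environment name is not hard-coded in job.pbs; it is read from the environment file with grep, head and awk. As a rough illustration of what that pipeline does, here is a Python sketch that mimics it. The function name environment_name is hypothetical, and the sample file contents (including the name tf-test) are an assumption made for this example, not the actual environment.yaml.

```python
def environment_name(yaml_text):
    """Mimic: grep "name:" FILE | head -n1 | awk '{print $NF}'

    Returns the last whitespace-separated field of the first line
    containing "name:", or None if no such line exists.
    """
    for line in yaml_text.splitlines():
        if "name:" in line:
            return line.split()[-1]
    return None


# Hypothetical environment file contents; the name "tf-test" is an assumption.
sample = """\
name: tf-test
dependencies:
  - tensorflow-gpu
"""

print(environment_name(sample))  # prints "tf-test"
```

Like the shell pipeline, this is a plain text match rather than a YAML parse, so it assumes the name: entry is the first line in the file that contains the string "name:".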
After the job completes, we will see several files in the job directory:
conda_install_log.txt contains a log of the conda installation process, showing which libraries were downloaded.
This is useful for debugging and for aiding the reproducibility of results.
shell.out contains anything that would have been printed to the screen in the job.
python_log.txt contains a log of the stdout and stderr from the Python job. It may look something like:
Importing TensorFlow
2021-03-25 00:32:43.766494: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
listing physical devices:
2021-03-25 00:32:57.337563: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-25 00:32:57.340276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-25 00:32:58.027658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 00:32:58.028227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-25 00:32:58.028260: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-03-25 00:32:58.176165: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-25 00:32:58.176246: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-03-25 00:32:58.323200: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-25 00:32:58.549187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-25 00:32:58.679114: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-25 00:32:58.768426: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-03-25 00:32:59.036819: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-03-25 00:32:59.037031: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 00:32:59.037623: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 00:32:59.038052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
In this output, we can see that a single GPU was found: an NVIDIA Tesla V100-SXM2-16GB. This is the GPU
that TensorFlow would use to perform its calculations.
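When many jobs are run, reading each log by hand becomes tedious. The sketch below shows one way the check could be automated by scanning the text of python_log.txt. The function names gpus_in_log and gpu_models are hypothetical, and the regular expressions assume the log format shown above, which may differ between TensorFlow versions.

```python
import re


def gpus_in_log(log_text):
    """Return the names of PhysicalDevice entries whose device_type is 'GPU'."""
    pattern = r"PhysicalDevice\(name='([^']+)', device_type='GPU'\)"
    return re.findall(pattern, log_text)


def gpu_models(log_text):
    """Return the model names reported in the 'Found device' property lines."""
    return re.findall(r"name: (.+?) computeCapability:", log_text)


# Two lines copied from the example log above.
sample_log = (
    "pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0\n"
    "[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), "
    "PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]\n"
)

print(gpus_in_log(sample_log))  # prints ['/physical_device:GPU:0']
print(gpu_models(sample_log))   # prints ['Tesla V100-SXM2-16GB']
```

An empty list from gpus_in_log would indicate that TensorFlow did not report any GPU in that run.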