PyTorch

Introduction

PyTorch is a popular machine learning framework.

We use PyTorch at commit f5788898a928cb2489926c1a5418c94c598c361b and study three models: resnet50, bert, and deepwave.

We use the following commands to build PyTorch from source.

spack install miniconda3

conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

conda install -c pytorch magma-cuda110

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export USE_CUDA=1
export REL_WITH_DEB_INFO=1
export MAX_JOBS=16
export USE_NINJA=OFF 
python setup.py install
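
After the build finishes, it can be useful to confirm that the interpreter picks up the freshly built package rather than some other installation. The sketch below is not part of GVProf; the function name describe_torch_build is made up for illustration. It reports the version, the git commit recorded at build time, and whether CUDA is usable, so the commit can be compared against the one named in the introduction.

```python
def describe_torch_build():
    """Return a small report on the PyTorch build visible to this interpreter.

    Hypothetical helper for illustration only; it is not shipped with GVProf.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not importable from this environment.
        return {"installed": False}

    return {
        "installed": True,
        "version": torch.__version__,
        # Commit hash recorded when the package was built (None if missing).
        "git_commit": getattr(torch.version, "git_version", None),
        # CUDA toolkit version the build was compiled against (None for CPU-only builds).
        "cuda_toolkit": torch.version.cuda,
        # True only if a usable GPU and driver are present at run time.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Run it from a directory other than the PyTorch source tree, so that Python imports the installed package instead of the local torch folder.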
  • resnet

We get the resnet example from the pytorch benchmark repo.

To simplify setup, we provide 1-spatial-convolution-model.py and 1-spatial-convolution-unit.py for checking layer-wise and end-to-end performance.

  • deepwave

We provide the instructions for installing deepwave here.

To make the problematic kernel easier to check, we provide the 2-replication-pad3d.py script, which contains only a single ReplicationPad3d kernel.
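
For readers who want a feel for what a single-kernel script looks like, here is a minimal sketch that runs one ReplicationPad3d forward pass. It is an illustrative stand-in, not the contents of 2-replication-pad3d.py; the function name, padding, and input shape are assumptions chosen to keep the example small.

```python
import importlib.util

# True when a torch package can be located; lets the file load without PyTorch.
TORCH_AVAILABLE = importlib.util.find_spec("torch") is not None


def run_replication_pad3d(padding=1, shape=(1, 1, 2, 2, 2)):
    """Run one ReplicationPad3d forward pass and return the output shape.

    Hypothetical example: padding and shape are illustrative values only.
    """
    import torch  # imported lazily so the module loads where PyTorch is absent

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pad = torch.nn.ReplicationPad3d(padding).to(device)

    # ReplicationPad3d expects (N, C, D, H, W) input.
    x = torch.randn(*shape, device=device)
    y = pad(x)

    if device == "cuda":
        # Wait for the kernel to finish before returning.
        torch.cuda.synchronize()
    return tuple(y.shape)


if __name__ == "__main__":
    if TORCH_AVAILABLE:
        print(run_replication_pad3d())
    else:
        print("PyTorch is not installed in this environment")
```

Each of the three spatial dimensions grows by twice the padding, so the default input of 2 x 2 x 2 produces a 4 x 4 x 4 output.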

  • bert

We get the bert example from the pytorch benchmark repo.

To make the problematic kernel easier to check, we provide the 3-embedding-unit.py script, which contains only a single Embedding kernel.

Profiling

Profiling a Python application requires a few more steps than profiling a normal application. The FAQ page has a general guide to profiling applications.

An example profiling command is attached below for reference:

LD_LIBRARY_PATH=/path/to/python/install/lib/python<version>/site-packages/torch:$LD_LIBRARY_PATH \
hpcrun -e gpu=nvidia,data_flow \
  -ck HPCRUN_SANITIZER_READ_TRACE_IGNORE=1 \
  -ck HPCRUN_SANITIZER_DATA_FLOW_HASH=0 \
  -ck HPCRUN_SANITIZER_GPU_ANALYSIS_BLOCKS=1 \
  -ck HPCRUN_SANITIZER_GPU_PATCH_RECORD_NUM=131072 \
  python ./<pytorch-script>.py
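
Because the command is long and differs only in the script name from run to run, it may be convenient to assemble it programmatically. The sketch below does this with the Python standard library, using exactly the event and control knobs from the example command. The helper names build_hpcrun_command and build_environment are hypothetical and are not part of GVProf or HPCToolkit.

```python
import os
import shlex

# Control knobs copied verbatim from the example profiling command.
CONTROL_KNOBS = {
    "HPCRUN_SANITIZER_READ_TRACE_IGNORE": "1",
    "HPCRUN_SANITIZER_DATA_FLOW_HASH": "0",
    "HPCRUN_SANITIZER_GPU_ANALYSIS_BLOCKS": "1",
    "HPCRUN_SANITIZER_GPU_PATCH_RECORD_NUM": "131072",
}


def build_hpcrun_command(script, event="gpu=nvidia,data_flow", knobs=None):
    """Return the hpcrun argument list for profiling a Python script."""
    knobs = CONTROL_KNOBS if knobs is None else knobs
    argv = ["hpcrun", "-e", event]
    for name, value in knobs.items():
        # Each knob is passed as its own "-ck NAME=VALUE" pair.
        argv += ["-ck", f"{name}={value}"]
    argv += ["python", script]
    return argv


def build_environment(torch_lib_dir, base_env=None):
    """Return a copy of the environment with torch_lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    previous = env.get("LD_LIBRARY_PATH", "")
    # Avoid a trailing ":" when LD_LIBRARY_PATH was previously unset or empty.
    env["LD_LIBRARY_PATH"] = f"{torch_lib_dir}:{previous}" if previous else torch_lib_dir
    return env


if __name__ == "__main__":
    # Print the command for inspection instead of launching it.
    print(shlex.join(build_hpcrun_command("./3-embedding-unit.py")))
```

To launch the profiler from Python, pass the two results to subprocess.run(argv, env=env), with torch_lib_dir set to the site-packages/torch directory of your own installation.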

Optimization

We don't provide an automated performance testing suite for PyTorch in GVProf, because recompiling PyTorch after even a small code change still takes a long time and is painful on low-end servers.

  • data_flow - redundant values

Please refer to this issue

  • data_flow - redundant values - value_pattern - single zeros

Please refer to these two: issue1 and issue2