Deepwave

Introduction

Deepwave is wave propagation software implemented on top of PyTorch.

We study deepwave at commit 1154692258da342accd21df02f7fa9ddd008f75f. The input for deepwave is included in GVProf’s samples.

We first add -lineinfo -g to the compiler flags in the _make_cuda_extension function in setup.py, and then add -g to the compiler flags in the _make_cpp_extension function. Next, we run pip install . to install deepwave.
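
The flag changes described above can be pictured with a small helper. This is only a sketch, not the actual contents of deepwave's setup.py: the function name add_debug_flags and the pre-existing -O3 flags are hypothetical, and it assumes the compile arguments are kept in the {'cxx': [...], 'nvcc': [...]} dictionary form that PyTorch's extension builder accepts for extra_compile_args.

```python
def add_debug_flags(extra_compile_args, is_cuda):
    """Return a copy of the compile args with the debug flags appended.

    Hypothetical helper: CUDA extensions get -lineinfo -g on the nvcc
    flag list, C++ extensions get -g on the cxx flag list.
    """
    # Copy so the caller's dictionary is left untouched.
    args = {key: list(flags) for key, flags in extra_compile_args.items()}
    target = "nvcc" if is_cuda else "cxx"
    wanted = ["-lineinfo", "-g"] if is_cuda else ["-g"]
    flags = args.setdefault(target, [])
    for flag in wanted:
        if flag not in flags:  # avoid adding a flag twice
            flags.append(flag)
    return args


# Assumed starting flags, for illustration only.
cuda_args = add_debug_flags({"cxx": ["-O3"], "nvcc": ["-O3"]}, is_cuda=True)
cpp_args = add_debug_flags({"cxx": ["-O3"]}, is_cuda=False)
print(cuda_args)
print(cpp_args)
```

In practice the same effect is obtained by editing the flag lists in setup.py by hand; the helper only makes explicit which list each flag belongs to.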

Note that this pip should be the one installed by conda, since we use conda across all the Python samples.

To run the deepwave example in GVProf, we need to install matplotlib by conda install matplotlib.

Profiling

Currently, using gvprof to profile Python applications directly is intricate. Instead, we use HPCToolkit to profile and analyze deepwave separately. Please refer to the FAQ page for the complete guide.

With the default configuration, this example takes a relatively long time to run. To shorten it, we can change num_epochs to 1 and break out of the loop after the first batch finishes. This deepwave application incurs higher profiling overhead (150-200x) than other applications (~20x) because its kernels access millions of memory addresses with many gaps between them. As a result, we cannot merge all of the memory access ranges on the GPU, and we spend a long time both copying memory addresses from the GPU to the host and updating host memory.
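
To see why gaps between addresses matter, consider the following host-side illustration. It is a minimal sketch in plain Python, not GVProf's GPU implementation: the function merge_ranges, the 4-byte access size, and the 4096-byte stride are all made up for the example. Contiguous accesses collapse into a single range, whereas widely spaced accesses stay as separate ranges that each have to be recorded and copied back.

```python
def merge_ranges(addresses, access_size=4):
    """Merge byte ranges [a, a + access_size) that touch or overlap."""
    merged = []
    for start in sorted(addresses):
        end = start + access_size
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # A gap precedes this access: start a new range.
            merged.append([start, end])
    return [tuple(r) for r in merged]


# 1000 back-to-back 4-byte accesses versus 1000 accesses 4096 bytes apart.
contiguous = [4 * i for i in range(1000)]
strided = [4096 * i for i in range(1000)]

print(len(merge_ranges(contiguous)))  # 1
print(len(merge_ranges(strided)))     # 1000
```

With the same number of accesses, the strided pattern leaves a thousand times more ranges to transfer, which is the effect that drives up the overhead for kernels with gapped access patterns.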

For value pattern profiling, we monitor the most expensive propagate kernel using the following options.

LD_LIBRARY_PATH=/path/to/python/install/lib/python<version>/site-packages/torch:$LD_LIBRARY_PATH hpcrun -e gpu=nvidia,value_pattern@10000 -ck HPCRUN_SANITIZER_WHITELIST=./whitelist -ck HPCRUN_SANITIZER_KERNEL_SAMPLING_FREQUENCY=100000 python ./Deepwave_SEAM_example1.py
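
The /path/to/python/install/... placeholder in the command above has to be replaced with the torch directory of the active conda environment. One way to look it up is sketched below; the helper names package_dir and prepend_library_path are hypothetical, and the sketch assumes torch is installed as a regular package in the Python interpreter that runs it.

```python
import importlib.util
import os


def package_dir(name):
    """Return the install directory of a package, or None if it is missing."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


def prepend_library_path(directory, env=None):
    """Return a copy of env with directory placed first in LD_LIBRARY_PATH."""
    env = dict(os.environ if env is None else env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = directory + (":" + current if current else "")
    return env


torch_dir = package_dir("torch")  # None when torch is not installed
if torch_dir is not None:
    print(prepend_library_path(torch_dir)["LD_LIBRARY_PATH"])
```

The printed value can be pasted in place of the LD_LIBRARY_PATH assignment in the hpcrun command line.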

For data flow profiling, we set the following options to accelerate the profiling process.

LD_LIBRARY_PATH=/path/to/python/install/lib/python<version>/site-packages/torch:$LD_LIBRARY_PATH hpcrun -e gpu=nvidia,data_flow -ck HPCRUN_SANITIZER_READ_TRACE_IGNORE=1 -ck HPCRUN_SANITIZER_DATA_FLOW_HASH=0 -ck HPCRUN_SANITIZER_GPU_ANALYSIS_BLOCKS=1 -ck HPCRUN_SANITIZER_GPU_PATCH_RECORD_NUM=131072 python ./Deepwave_SEAM_example1.py

# Setting the number of OpenMP threads gives additional speedup
# export OMP_NUM_THREADS=16

More information about accelerating data flow and value pattern profiling can be found on the FAQ page.

Optimization

Please refer to the replication_pad3d issue in PyTorch.