BMBF Project »High Performance Deep Learning Framework«

The goal of the BMBF project »High Performance Deep Learning Framework« (HP-DLF) is to provide researchers and developers in the »Deep Learning« domain with easy access to current and future high-performance computing systems.

How does a self-driving car recognize pedestrians and other road users? How does speech recognition for everyday use work? How does an Internet search engine recognize people in photos? The answer is: with machine learning algorithms. In recent years, considerable progress has been made in the field of machine learning. A significant part of this success is due to the further development of so-called »Deep Learning« algorithms.

In the course of this development, ever larger and more complex artificial neural networks are being designed and trained. This approach has been successful for many practical applications, but it requires enormous computing effort and a great deal of training data. The further progress of »Deep Learning« therefore depends on methods and infrastructures that keep the training of increasingly complex neural networks computationally tractable in the future.

Within HP-DLF we have developed the open-source framework Tarantella, which is based on state-of-the-art technologies from both deep learning and HPC and enables the scalable training of deep neural networks on supercomputers.

Scheme »High Performance Deep Learning Framework« (© Fraunhofer ITWM)

Goals of the Project

  • Support a new, large group of users in adopting HPC right from the start with innovative tools.
  • Hide the complexity of the hardware from the users and lead them to a highly scalable and energy-efficient solution.
  • Not only make existing HPC methods accessible to new users, but also gain insight into the system requirements of what will become an important class of HPC applications.

To this end, a new software framework is being developed that automates the highly complex parallelization of training large neural networks on heterogeneous computing clusters.

Image segmentation, for example, is an extremely active research direction that drives the development of such complex models and poses some of the most challenging requirements in terms of computation and memory. Detecting tumors in 3D MRI images, for instance, relies on models that span tens of gigabytes and process equally large data samples.

Even though deep learning networks are traditionally trained on GPUs within a single machine, such limited resources may be insufficient to store the model and process the data. Frameworks designed for training such models therefore have to manage distributed, large-scale executions on HPC clusters, despite the inherent performance penalty introduced by the need to communicate between nodes. To mitigate this drawback, efficient distributed training is an essential prerequisite for speeding up the training process.

The HP-DLF project targets multiple aspects of distributed neural network training.

(1) It aims at providing deep learning researchers with an easy-to-use, intuitive interface for distributed learning. Whether they focus on data parallelism, model parallelism, or pipelining, users will have the possibility to select the most suitable approach and test it at large scale.

(2) Our framework Tarantella provides flexible components that enable hybrid approaches, combining and taking advantage of multiple distributed training mechanisms.

(3) The project will provide the building blocks for research in the area of distribution strategies for neural networks. To this end, we will devise an extensible interface for practitioners to experiment with new optimization algorithms and to explore the parameter space for existing optimizers.
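
As a first intuition for what point (1) automates, the following minimal sketch shows the core mechanism of data-parallel training: every worker computes gradients on its own shard of the batch, the gradients are averaged across workers (the role an allreduce plays on a real cluster), and all workers apply the identical update. This is a plain NumPy illustration of the concept with a made-up toy linear model and synthetic data, not Tarantella's implementation.

    import numpy as np

    # Conceptual sketch of data-parallel training for a toy linear model y = X @ w.
    # Each "worker" holds one shard of the global batch; gradients are averaged
    # across workers, which is the role an allreduce plays on a real cluster.

    rng = np.random.default_rng(0)
    num_workers = 4
    w = np.zeros(3)                          # model parameters, identical on all workers
    X = rng.normal(size=(64, 3))             # global batch of inputs
    y = X @ np.array([1.0, -2.0, 0.5])       # synthetic targets

    shards_X = np.array_split(X, num_workers)
    shards_y = np.array_split(y, num_workers)

    for step in range(100):
        local_grads = []
        for Xi, yi in zip(shards_X, shards_y):                   # one pass per worker
            residual = Xi @ w - yi
            local_grads.append(2.0 * Xi.T @ residual / len(yi))  # local MSE gradient
        grad = np.mean(local_grads, axis=0)                      # "allreduce": average gradients
        w -= 0.1 * grad                                          # same update applied everywhere

    print("learned weights:", w)                                 # converges towards [1.0, -2.0, 0.5]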

Our framework builds on top of state-of-the-art technologies from both fields, deep learning and HPC:

  • TensorFlow: the most widely used machine learning platform in both research and production
  • GPI-2: an asynchronous, flexible, and scalable communication library for parallel applications, developed in our division
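
From the user's point of view, the intended workflow stays as close as possible to a plain TensorFlow/Keras script, with distribution handled behind the scenes. The following sketch illustrates that idea; the import name, the tnt.Model wrapper, and the launcher command in the closing comment are illustrative assumptions and may differ from the exact interface documented for Tarantella.

    # Illustrative sketch only: the tarantella import name, tnt.Model, and the
    # launcher shown in the final comment are assumptions, not verified API.
    from tensorflow import keras
    import tarantella as tnt  # assumed import name

    # A standard Keras model, written exactly as in a single-node script.
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Conceptually, wrapping the model is the only Tarantella-specific step;
    # compile() and fit() keep their usual Keras signatures, while gradient
    # communication over GPI-2 happens behind the scenes.
    model = tnt.Model(model)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=2)

    # On a cluster, the unchanged script would then be started through
    # Tarantella's launcher, e.g. something like: tarantella -n 4 -- python train.py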

Speed-up with Tarantella

Tarantella provides strong scalability using data parallelism on models such as ResNet-50 and Transformer. It reaches speed-ups of up to 50x on GPU and CPU clusters, as the figures below show:
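
Speed-up is meant here in the usual strong-scaling sense: the same training problem is run on N devices, and the speed-up is the ratio of the single-device runtime to the N-device runtime, S(N) = T(1) / T(N), with parallel efficiency E(N) = S(N) / N. A speed-up of 50x therefore means that the same training run finishes fifty times faster than on a single device.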

ResNet-50 on NVIDIA V100 GPUs (© Fraunhofer ITWM)
ResNet-50 on Intel Skylake CPUs (© Fraunhofer ITWM)
Transformer on NVIDIA V100 GPUs (© Fraunhofer ITWM)