BMBF Project »High Performance Deep Learning Framework«

The goal of the BMBF project »High Performance Deep Learning Framework« (HP-DLF) is to provide researchers and developers in the »Deep Learning« domain with easy access to current and future high-performance computing systems.

How does a self-driving car recognize pedestrians and other road users? How does speech recognition for everyday use work? How does an Internet search engine recognize people in photos? The answer is: with machine learning algorithms. In recent years, considerable progress has been made in the field of machine learning. A significant part of this success is due to the further development of so-called »Deep Learning« algorithms.

In the course of this development, ever larger and more complex artificial neural networks are being designed and trained. This approach has been successful for many practical applications, but it requires enormous computing effort and a great deal of training data. The further progress of »Deep Learning« therefore depends on methods and infrastructures that keep the training of increasingly complex neural networks computationally tractable in the future.

Within HP-DLF we have developed the open-source framework Tarantella, which is based on state-of-the-art technologies from both deep learning and HPC and enables the scalable training of deep neural networks on supercomputers.

Scheme »High Performance Deep Learning Framework« (© Fraunhofer ITWM)

Goals of the Project

  • Support a new, large group of users in adopting HPC right from the start with innovative tools.
  • Hide the complexity of the hardware from the users and lead them to a highly scalable and energy-efficient solution.
  • Not only make existing HPC methods accessible to new users, but also gain insight into the system requirements of what will become an important class of HPC applications.

To this end, a new software framework is being developed that automates the highly complex parallelization of training large neural networks on heterogeneous computing clusters.

Image segmentation, for example, is an extremely active research direction that drives the development of such complex models and poses some of the most challenging requirements in terms of computation and memory. Detecting tumors in 3D MRI images, for instance, relies on models that span tens of gigabytes and process equally large data samples.

Even though deep learning networks are traditionally trained on GPUs within a single machine, such limited resources may be insufficient to store the model and process the data. Frameworks designed for training such models therefore have to manage distributed, large-scale executions on HPC clusters, despite the inherent performance penalty introduced by the need to communicate between nodes. To mitigate this drawback, efficient distributed training is an essential prerequisite for speeding up the training process.

The HP-DLF project targets multiple aspects of distributed neural network training.

(1) It aims at providing deep learning researchers with an easy-to-use, intuitive interface for distributed learning. Whether they focus on data parallelism, model parallelism, or pipelining, users will have the possibility to select the most suitable approach and test it at large scale.

(2) Our framework Tarantella provides flexible components that enable hybrid approaches, combining and taking advantage of multiple distributed training mechanisms.

(3) The project will provide the building blocks for research in the area of distribution strategies for neural networks. To this end, we will devise an extensible interface for practitioners to experiment with new optimization algorithms and to explore the parameter space for existing optimizers.
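
As a first intuition for what point (1) automates, the following minimal sketch shows the core mechanism of data-parallel training: every worker computes gradients on its own shard of the batch, the gradients are averaged across workers (the role an allreduce plays on a real cluster), and all workers apply the identical update. This is a plain NumPy illustration of the concept with a made-up toy linear model and synthetic data, not Tarantella's implementation.

    import numpy as np

    # Conceptual sketch of data-parallel training for a toy linear model y = X @ w.
    # Each "worker" holds one shard of the global batch; gradients are averaged
    # across workers, which is the role an allreduce plays on a real cluster.

    rng = np.random.default_rng(0)
    num_workers = 4
    w = np.zeros(3)                          # model parameters, identical on all workers
    X = rng.normal(size=(64, 3))             # global batch of inputs
    y = X @ np.array([1.0, -2.0, 0.5])       # synthetic targets

    shards_X = np.array_split(X, num_workers)
    shards_y = np.array_split(y, num_workers)

    for step in range(100):
        local_grads = []
        for Xi, yi in zip(shards_X, shards_y):                   # one pass per worker
            residual = Xi @ w - yi
            local_grads.append(2.0 * Xi.T @ residual / len(yi))  # local MSE gradient
        grad = np.mean(local_grads, axis=0)                      # "allreduce": average gradients
        w -= 0.1 * grad                                          # same update applied everywhere

    print("learned weights:", w)                                 # converges towards [1.0, -2.0, 0.5]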

Our framework builds on top of state-of-the-art technologies from both fields, deep learning and HPC:

  • TensorFlow: the most widely used machine learning platform in both research and production
  • GPI-2: an asynchronous, flexible, and scalable communication library for parallel applications, developed in our division
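
From the user's point of view, the intended workflow stays as close as possible to a plain TensorFlow/Keras script, with distribution handled behind the scenes. The following sketch illustrates that idea; the import name, the tnt.Model wrapper, and the launcher command in the closing comment are illustrative assumptions and may differ from the exact interface documented for Tarantella.

    # Illustrative sketch only: the tarantella import name, tnt.Model, and the
    # launcher shown in the final comment are assumptions, not verified API.
    from tensorflow import keras
    import tarantella as tnt  # assumed import name

    # A standard Keras model, written exactly as in a single-node script.
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Conceptually, wrapping the model is the only Tarantella-specific step;
    # compile() and fit() keep their usual Keras signatures, while gradient
    # communication over GPI-2 happens behind the scenes.
    model = tnt.Model(model)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=2)

    # On a cluster, the unchanged script would then be started through
    # Tarantella's launcher, e.g. something like: tarantella -n 4 -- python train.py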

Speed-up with Tarantella

Tarantella provides strong scalability using data parallelism on models such as ResNet-50 and Transformer. It reaches speed-ups of up to 50x on GPU and CPU clusters, as the figures below show:
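
Speed-up is meant here in the usual strong-scaling sense: the same training problem is run on N devices, and the speed-up is the ratio of the single-device runtime to the N-device runtime, S(N) = T(1) / T(N), with parallel efficiency E(N) = S(N) / N. A speed-up of 50x therefore means that the same training run finishes fifty times faster than on a single device.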

ResNet-50 on NVIDIA V100 GPUs (© Fraunhofer ITWM)
ResNet-50 on Intel Skylake CPUs (© Fraunhofer ITWM)
Transformer on NVIDIA V100 GPUs (© Fraunhofer ITWM)