Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016)
Alsmadi, I., Khreishah, A., Xu, D.: Network slicing to improve multicasting in HPC clusters. Clust. Comput. 21(3), 1493–1506 (2018)
Awan, A.A., Bedorf, J., Chu, C.-H., Subramoni, H., Panda, D.K.: Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: characterization, designs, and performance evaluation (2018). arXiv:1810.11112
Awan, A.A., Chu, C.-H., Subramoni, H., Panda, D.K.: Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–9 (2018)
Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52(4), 65:1–65:43 (2019)
Castelló, A., Catalán, M., Dolz, M.F., Mestre, J.I., Quintana-Ortí, E.S., Duato, J.: Evaluation of MPI Allreduce for distributed training of convolutional neural networks. In: 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (2021)
Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Analysis of model parallelism for distributed neural networks. In: Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI ’19, New York, NY, USA (2019). Association for Computing Machinery
Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Theoretical scalability analysis of distributed deep convolutional neural networks. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 534–541 (2019)
Chan, E., Heimlich, M., Purkayastha, A., van de Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput. 19(13), 1749–1783 (2007)
Clarke, L., Glendinning, I., Hempel, R.: The MPI message passing interface standard. In: Programming Environments for Massively Parallel Distributed Systems, pp. 213–218. Springer, Berlin (1994)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Hasanov, K., Lastovetsky, A.: Hierarchical redesign of classic MPI reduction algorithms. J. Supercomput. 73(2), 713–725 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Inc.: TensorFlow benchmarks. https://github.com/tensorflow/benchmarks
Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., Hoefler, T.: Data movement is all you need: a case study on optimizing transformers (2020). arXiv:2007.00072
Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Full-duplex inter-group all-to-all broadcast algorithms with optimal bandwidth. In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–10 (2018)
Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Scalable algorithms for MPI intergroup Allgather and Allgatherv. Parallel Comput. 85, 220–230 (2019)
Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Towards an optimized distributed deep learning framework for a heterogeneous multi-gpu cluster. Clust. Comput. 23(3), 2287–2300 (2020)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Sciences, University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc. (2012)
Kurnosov, M., Tokmasheva, E.: Shared memory based MPI broadcast algorithms for NUMA systems. In: Russian Supercomputing Days, pp. 473–485. Springer, Berlin (2020)
Li, S., Hoefler, T., Chungjin, H., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014)
Nguyen, T.T., Wahib, M., Takano, R.: Hierarchical distributed-memory multi-leader MPI_Allreduce for deep learning workloads. In: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 216–222. IEEE (2018)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)
Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow (2018). arXiv:1802.05799
Shalf, J.: HPC interconnects at the end of Moore’s Law. In: 2019 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3 (2019)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: The Complete Reference. The MIT Press, New York (1996)
Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 19(1), 49–66 (2005)
Worringen, J.: Pipelining and overlapping for MPI collective operations. In: 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 548–557. IEEE (2003)
Zhao, Y., Wang, L., Wu, W., Bosilca, G., Vuduc, R., Ye, J., Tang, W., Xu, Z.: Efficient communications in training large scale neural networks. In: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 110–116 (2017)
Zhong, D., Cao, Q., Bosilca, G., Dongarra, J.: Using advanced vector extensions AVX-512 for MPI reductions. In: 27th European MPI Users’ Group Meeting, pp. 1–10 (2020)