Accelerating distributed deep neural network training with pipelined MPI allreduce

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Castelló, Adrián es_ES
dc.contributor.author Quintana-Ortí, Enrique S. es_ES
dc.contributor.author Duato Marín, José Francisco es_ES
dc.date.accessioned 2022-09-16T18:04:22Z
dc.date.available 2022-09-16T18:04:22Z
dc.date.issued 2021-12 es_ES
dc.identifier.issn 1386-7857 es_ES
dc.identifier.uri http://hdl.handle.net/10251/186225
dc.description.abstract [EN] TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter. es_ES
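
The two MPI-level ideas summarized in the abstract can be illustrated with standard MPI calls. The C sketch below is not the authors' implementation: it only shows (a) how a blocking MPI_Allreduce can be replaced by MPI_Iallreduce followed by MPI_Wait, preserving blocking semantics while using the non-blocking primitive, and (b) a segmented (pipelined) variant that keeps several partial reductions in flight. The segment count NSEG, the buffer size, and the helper names are illustrative assumptions, not values taken from the paper.

/* Minimal sketch (assumed names and sizes), using only standard MPI-3 calls. */
#include <mpi.h>
#include <stdlib.h>

#define NSEG 8  /* illustrative number of pipeline segments */

/* (a) Blocking semantics obtained from the non-blocking primitive. */
static void allreduce_nonblocking(float *buf, int count, MPI_Comm comm) {
    MPI_Request req;
    MPI_Iallreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* synchronize: behaves like MPI_Allreduce */
}

/* (b) Pipelined allreduce: split the buffer into NSEG segments and
 *     keep all partial non-blocking reductions in flight. */
static void allreduce_pipelined(float *buf, int count, MPI_Comm comm) {
    MPI_Request reqs[NSEG];
    int base = count / NSEG, rem = count % NSEG, offset = 0;
    for (int s = 0; s < NSEG; ++s) {
        int len = base + (s < rem ? 1 : 0);   /* distribute the remainder */
        MPI_Iallreduce(MPI_IN_PLACE, buf + offset, len, MPI_FLOAT, MPI_SUM,
                       comm, &reqs[s]);
        offset += len;
    }
    MPI_Waitall(NSEG, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    const int n = 1 << 20;                 /* 1M gradient elements, illustrative */
    float *grad = calloc(n, sizeof(float));
    allreduce_nonblocking(grad, n, MPI_COMM_WORLD);
    allreduce_pipelined(grad, n, MPI_COMM_WORLD);
    free(grad);
    MPI_Finalize();
    return 0;
}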
dc.description.sponsorship Project TIN2017-82972-R of the Spanish Ministerio de Ciencia, Innovación y Universidades; Agencia Valenciana de la Innovación; Juan de la Cierva-Formación project FJC2019-039222-I of the Ministerio de Ciencia, Innovación y Universidades; PRACE Preparatory Access project #2010PA5531. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof Cluster Computing es_ES
dc.rights All rights reserved es_ES
dc.subject Message Passing Interface (MPI) es_ES
dc.subject Collective communication primitives es_ES
dc.subject Allreduce es_ES
dc.subject Deep learning es_ES
dc.subject Distributed training es_ES
dc.subject.classification COMPUTER ARCHITECTURE AND TECHNOLOGY es_ES
dc.title Accelerating distributed deep neural network training with pipelined MPI allreduce es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s10586-021-03370-9 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MCIU//TIN2017-82972-R/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/PRACE//2010PA5531/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AGENCIA ESTATAL DE INVESTIGACION//FJC2019-039222-I//AYUDA JUAN DE LA CIERVA FORMACION-CASTELLO GIMENO, ADRIAN/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.description.bibliographicCitation Castelló, A.; Quintana-Ortí, ES.; Duato Marín, JF. (2021). Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing. 24(4):3797-3813. https://doi.org/10.1007/s10586-021-03370-9 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s10586-021-03370-9 es_ES
dc.description.upvformatpinicio 3797 es_ES
dc.description.upvformatpfin 3813 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 24 es_ES
dc.description.issue 4 es_ES
dc.relation.pasarela S\448457 es_ES
dc.contributor.funder AGENCIA ESTATAL DE INVESTIGACION es_ES
dc.contributor.funder Ministerio de Ciencia, Innovación y Universidades es_ES
dc.contributor.funder Partnership for Advanced Computing in Europe AISBL es_ES
dc.description.references Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016) es_ES
dc.description.references Alsmadi, I., Khreishah, A., Dianxiang, X.: Network slicing to improve multicasting in HPC clusters. Clust. Comput. 21(3), 1493–1506 (2018) es_ES
dc.description.references Awan, A.A., Bedorf, J., Chu, C.-H., Subramoni, H., Panda, D.K.: Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: characterization, designs, and performance evaluation (2018). arXiv:1810.11112 es_ES
dc.description.references Awan, A.A., Chu, C.-H., Subramoni, H., Panda, D.K.: Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–9 (2018) es_ES
dc.description.references Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52(4), 65:1–65:43 (2019) es_ES
dc.description.references Castelló, A., Catalán, M., Dolz, M.F., Mestre, J.I., Quintana-Ortí, E.S., Duato, J.: Evaluation of MPI Allreduce for distributed training of convolutional neural networks. In: 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (2021) es_ES
dc.description.references Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Analysis of model parallelism for distributed neural networks. In: Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI ’19, New York, NY, USA (2019). Association for Computing Machinery es_ES
dc.description.references Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Theoretical scalability analysis of distributed deep convolutional neural networks. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 534–541 (2019) es_ES
dc.description.references Chan, E., Heimlich, M., Purkayastha, A., van de Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput. 19(13), 1749–1783 (2007) es_ES
dc.description.references Clarke, L., Glendinning, I., Hempel, R.: The MPI message passing interface standard. In: Programming Environments for Massively Parallel Distributed Systems, pp. 213–218. Springer, Berlin (1994) es_ES
dc.description.references Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) es_ES
dc.description.references Hasanov, K., Lastovetsky, A.: Hierarchical redesign of classic MPI reduction algorithms. J. Supercomput. 73(2), 713–725 (2017) es_ES
dc.description.references He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) es_ES
dc.description.references Google Inc.: TensorFlow benchmarks. https://github.com/tensorflow/benchmarks es_ES
dc.description.references Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., Hoefler, T.: Data movement is all you need: a case study on optimizing transformers (2020). arXiv:2007.00072 es_ES
dc.description.references Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Full-duplex inter-group all-to-all broadcast algorithms with optimal bandwidth. In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–10 (2018) es_ES
dc.description.references Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W-k: Scalable algorithms for MPI intergroup Allgather and Allgatherv. Parallel Comput. 85, 220–230 (2019) es_ES
dc.description.references Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Clust. Comput. 23(3), 2287–2300 (2020) es_ES
dc.description.references Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Sciences, University of Toronto (2009) es_ES
dc.description.references Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105, Curran Associates Inc. (2012) es_ES
dc.description.references Kurnosov, M., Tokmasheva, E.: Shared memory based MPI broadcast algorithms for NUMA systems. In: Russian Supercomputing Days, pp. 473–485. Springer, Berlin (2020) es_ES
dc.description.references Li, S., Hoefler, T., Chungjin, H., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014) es_ES
dc.description.references Nguyen, T.T., Wahib, M., Takano, R.: Hierarchical distributed-memory multi-leader MPI_Allreduce for deep learning workloads. In: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 216–222. IEEE (2018) es_ES
dc.description.references Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019) es_ES
dc.description.references Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow (2018). arXiv:1802.05799 es_ES
dc.description.references Shalf, J.: HPC interconnects at the end of Moore’s Law. In: 2019 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3 (2019) es_ES
dc.description.references Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556 es_ES
dc.description.references Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: The Complete Reference. The MIT Press, New York (1996) es_ES
dc.description.references Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017) es_ES
dc.description.references Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 19(1), 49–66 (2005) es_ES
dc.description.references Worringen, J.: Pipelining and overlapping for MPI collective operations. In: 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 548–557. IEEE (2003) es_ES
dc.description.references Zhao, Y., Wang, L., Wu, W., Bosilca, G., Vuduc, R., Ye, J., Tang, W., Xu, Z.: Efficient communications in training large scale neural networks. In: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 110–116 (2017) es_ES
dc.description.references Zhong, D., Cao, Q., Bosilca, G., Dongarra, J.: Using advanced vector extensions AVX-512 for MPI reductions. In: 27th European MPI Users' Group Meeting, pp. 1–10 (2020) es_ES

