Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016)
Alsmadi, I., Khreishah, A., Xu, D.: Network slicing to improve multicasting in HPC clusters. Clust. Comput. 21(3), 1493–1506 (2018)
Awan, A.A., Bedorf, J., Chu, C.-H., Subramoni, H., Panda, D.K.: Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: characterization, designs, and performance evaluation (2018). arXiv:1810.11112
Awan, A.A., Chu, C.-H., Subramoni, H., Panda, D.K.: Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–9 (2018)
Ben-Nun, T., Hoefler, T.: Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52(4), 65:1–65:43 (2019)
Castelló, A., Catalán, M., Dolz, M.F., Mestre, J.I., Quintana-Ortí, E.S., Duato, J.: Evaluation of MPI Allreduce for distributed training of convolutional neural networks. In: 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP) (2021)
Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Analysis of model parallelism for distributed neural networks. In: Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI ’19, New York, NY, USA (2019). Association for Computing Machinery
Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., Duato, J.: Theoretical scalability analysis of distributed deep convolutional neural networks. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 534–541 (2019)
Chan, E., Heimlich, M., Purkayastha, A., van de Geijn, R.: Collective communication: theory, practice, and experience. Concurr. Comput. 19(13), 1749–1783 (2007)
Clarke, L., Glendinning, I., Hempel, R.: The MPI message passing interface standard. In: Programming Environments for Massively Parallel Distributed Systems, pp. 213–218. Springer, Berlin (1994)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Hasanov, K., Lastovetsky, A.: Hierarchical redesign of classic MPI reduction algorithms. J. Supercomput. 73(2), 713–725 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Inc.: TensorFlow benchmarks. https://github.com/tensorflow/benchmarks
Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., Hoefler, T.: Data movement is all you need: a case study on optimizing transformers (2020). arXiv:2007.00072
Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Full-duplex inter-group all-to-all broadcast algorithms with optimal bandwidth. In: Proceedings of the 25th European MPI Users’ Group Meeting, pp. 1–10 (2018)
Kang, Q., Träff, J.L., Al-Bahrani, R., Agrawal, A., Choudhary, A., Liao, W.-k.: Scalable algorithms for MPI intergroup Allgather and Allgatherv. Parallel Comput. 85, 220–230 (2019)
Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Towards an optimized distributed deep learning framework for a heterogeneous multi-gpu cluster. Clust. Comput. 23(3), 2287–2300 (2020)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Sciences, University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc. (2012)
Kurnosov, M., Tokmasheva, E.: Shared memory based MPI broadcast algorithms for NUMA systems. In: Russian Supercomputing Days, pp. 473–485. Springer, Berlin (2020)
Li, S., Hoefler, T., Chungjin, H., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014)
Nguyen, T.T., Wahib, M., Takano, R.: Hierarchical distributed-memory multi-leader MPI_Allreduce for deep learning workloads. In: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 216–222. IEEE (2018)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026–8037 (2019)
Sergeev, A., Del Balso, M.: Horovod: fast and easy distributed deep learning in TensorFlow (2018). arXiv:1802.05799
Shalf, J.: HPC interconnects at the end of Moore’s Law. In: 2019 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3 (2019)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.: MPI: The Complete Reference. The MIT Press, New York (1996)
Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
Thakur, R., Rabenseifner, R., Gropp, W.: Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 19(1), 49–66 (2005)
Worringen, J.: Pipelining and overlapping for MPI collective operations. In: 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 548–557. IEEE (2003)
Zhao, Y., Wang, L., Wu, W., Bosilca, G., Vuduc, R., Ye, J., Tang, W., Xu, Z.: Efficient communications in training large scale neural networks. In: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 110–116 (2017)
Zhong, D., Cao, Q., Bosilca, G., Dongarra, J.: Using advanced vector extensions AVX-512 for MPI reductions. In: 27th European MPI Users’ Group Meeting, pp. 1–10 (2020)