
Micro-kernels for portable and efficient matrix multiplication in deep learning

RiuNet: Institutional Repository of the Universitat Politècnica de València



Show simple item record


dc.contributor.author Alaejos-López, Guillermo es_ES
dc.contributor.author Castelló, Adrián es_ES
dc.contributor.author Martínez, Héctor es_ES
dc.contributor.author Alonso-Jordá, Pedro es_ES
dc.contributor.author Igual, Francisco D. es_ES
dc.contributor.author Quintana-Ortí, Enrique S. es_ES
dc.date.accessioned 2023-11-13T19:03:46Z
dc.date.available 2023-11-13T19:03:46Z
dc.date.issued 2023-05 es_ES
dc.identifier.issn 0920-8542 es_ES
dc.identifier.uri http://hdl.handle.net/10251/199584
dc.description.abstract [EN] We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily customized to different processor architectures and micro-kernel dimensions. These generic templates employ vector intrinsics to exploit the SIMD (single instruction, multiple data) units in current general-purpose processors and, for the particular type of gemm problems encountered in deep learning, deliver a floating-point throughput rate on par with or even higher than that obtained with conventional, carefully tuned implementations of gemm in current linear algebra libraries (e.g., BLIS, AMD AOCL, ARMPL). Our work exposes the structure of the template-based micro-kernels for ARM Neon (128-bit SIMD), ARM SVE (variable-length SIMD) and Intel AVX512 (512-bit SIMD), showing considerable performance on an NVIDIA Carmel processor (ARM Neon), a Fujitsu A64FX processor (ARM SVE), and an AMD EPYC 7282 processor (256-bit SIMD). es_ES
dc.description.sponsorship Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported by the research projects PID2020-113656RB-C22, RTI2018-093684-B-I00, and PID2021-126576NB-I00, of MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", and CM via Multiannual Agreement with Complutense University in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT under projects PR65/19-22445 and CM S2018/TCS-4423. A. Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. H. Martínez is a postdoctoral fellow supported by the Consejería de Transformación Económica, Industria, Conocimiento y Universidades de la Junta de Andalucía. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof The Journal of Supercomputing es_ES
dc.rights Attribution (by) es_ES
dc.subject Matrix multiplication es_ES
dc.subject Linear algebra libraries es_ES
dc.subject High performance es_ES
dc.subject Vector intrinsics es_ES
dc.subject SIMD units es_ES
dc.subject.classification COMPUTER ARCHITECTURE AND TECHNOLOGY es_ES
dc.subject.classification COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE es_ES
dc.title Micro-kernels for portable and efficient matrix multiplication in deep learning es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s11227-022-05003-3 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113656RB-C22/ES/COMPUTACION Y COMUNICACIONES DE ALTAS PRESTACIONES CONSCIENTES DEL CONSUMO ENERGETICO. APLICACIONES AL APRENDIZAJE PROFUNDO COMPUTACIONAL - UPV/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MCIU//FJC2019-039222-I/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI//PID2021-126576NB-I00/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.description.bibliographicCitation Alaejos-López, G.; Castelló, A.; Martínez, H.; Alonso-Jordá, P.; Igual, FD.; Quintana-Ortí, ES. (2023). Micro-kernels for portable and efficient matrix multiplication in deep learning. The Journal of Supercomputing. 79:8124-8147. https://doi.org/10.1007/s11227-022-05003-3 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s11227-022-05003-3 es_ES
dc.description.upvformatpinicio 8124 es_ES
dc.description.upvformatpfin 8147 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 79 es_ES
dc.relation.pasarela S\482443 es_ES
dc.contributor.funder Comunidad de Madrid es_ES
dc.contributor.funder Junta de Andalucía es_ES
dc.contributor.funder Agencia Estatal de Investigación es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Universitat Politècnica de València es_ES
dc.contributor.funder Ministerio de Ciencia, Innovación y Universidades es_ES
dc.description.references Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17 es_ES
dc.description.references Goto K, van de Geijn RA (2008) Anatomy of a high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1-12:25 es_ES
dc.description.references Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1-14:33 es_ES
dc.description.references Xianyi Z, Qian W, Yunquan Z (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS) es_ES
dc.description.references Smith TM, van de Geijn RA (2019) The MOMMS family of matrix multiplication algorithms. CoRR, vol. abs/1904.05717. [Online]. Available: http://arxiv.org/abs/1904.05717 es_ES
dc.description.references Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA (2004) A family of high-performance matrix multiplication algorithms. In: Proc. 7th Int. Conf. on Applied Parallel Computing: State of the Art in Scientific Computing, ser. PARA’04, pp 256-265 es_ES
dc.description.references Castelló A, Igual FD, Quintana-Ortí ES (2022) Anatomy of the BLIS family of algorithms for matrix multiplication. In: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp 92–99 es_ES
dc.description.references Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International Workshop on Frontiers in Handwriting Recognition es_ES
dc.description.references Barrachina S, Dolz MF, San Juan P, Quintana-Ortí ES (2022) Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors. J Parallel Distrib Comput 167(C):240–254 es_ES
dc.description.references Low TM, Igual FD, Smith TM, Quintana-Ortí ES (2016) Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw 43(2):12:1-12:18 es_ES
dc.description.references Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4):65–76 es_ES
dc.description.references Dowd K, Severance CR (1998) High performance computing, 2nd ed. O’Reilly es_ES
dc.description.references Zee FGV, Smith TM, Marker B, Low TM, Geijn RAVD, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19 es_ES
dc.description.references Smith TM, van de Geijn R, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp. ser. IPDPS’14, pp 1049–1059 es_ES
dc.description.references He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 es_ES
dc.description.references Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 es_ES
dc.description.references Szegedy C, et al. (2014) Going deeper with convolutions, CoRR, vol. abs/1409.4842, [Online]. Available: http://arxiv.org/abs/1409.4842 es_ES

