
Micro-kernels for portable and efficient matrix multiplication in deep learning

RiuNet: Institutional Repository of the Universitat Politècnica de València



Show simple item record


dc.contributor.author Alaejos-López, Guillermo es_ES
dc.contributor.author Castelló, Adrián es_ES
dc.contributor.author Martínez, Héctor es_ES
dc.contributor.author Alonso-Jordá, Pedro es_ES
dc.contributor.author Igual, Francisco D. es_ES
dc.contributor.author Quintana-Ortí, Enrique S. es_ES
dc.date.accessioned 2023-11-13T19:03:46Z
dc.date.available 2023-11-13T19:03:46Z
dc.date.issued 2023-05 es_ES
dc.identifier.issn 0920-8542 es_ES
dc.identifier.uri http://hdl.handle.net/10251/199584
dc.description.abstract [EN] We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily customized to different processor architectures and micro-kernel dimensions. These generic templates employ vector intrinsics to exploit the SIMD (single instruction, multiple data) units in current general-purpose processors and, for the particular type of gemm problems encountered in deep learning, deliver a floating-point throughput rate on par with or even higher than that obtained with conventional, carefully tuned implementations of gemm in current linear algebra libraries (e.g., BLIS, AMD AOCL, ARMPL). Our work exposes the structure of the template-based micro-kernels for ARM Neon (128-bit SIMD), ARM SVE (variable-length SIMD) and Intel AVX512 (512-bit SIMD), showing considerable performance on an NVIDIA Carmel processor (ARM Neon), a Fujitsu A64FX processor (ARM SVE), and an AMD EPYC 7282 processor (256-bit SIMD). es_ES
dc.description.sponsorship Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported by the research projects PID2020-113656RB-C22, RTI2018-093684-B-I00, and PID2021-126576NB-I00, of MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", and CM via Multiannual Agreement with Complutense University in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT under projects PR65/19-22445 and CM S2018/TCS-4423. A. Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. H. Martínez is a postdoctoral fellow supported by the Consejería de Transformación Económica, Industria, Conocimiento y Universidades de la Junta de Andalucía. es_ES
dc.language English es_ES
dc.publisher Springer-Verlag es_ES
dc.relation.ispartof The Journal of Supercomputing es_ES
dc.rights Attribution (by) es_ES
dc.subject Matrix multiplication es_ES
dc.subject Linear algebra libraries es_ES
dc.subject High performance es_ES
dc.subject Vector intrinsics es_ES
dc.subject SIMD units es_ES
dc.subject.classification COMPUTER ARCHITECTURE AND TECHNOLOGY es_ES
dc.subject.classification COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE es_ES
dc.title Micro-kernels for portable and efficient matrix multiplication in deep learning es_ES
dc.type Article es_ES
dc.identifier.doi 10.1007/s11227-022-05003-3 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113656RB-C22/ES/COMPUTACION Y COMUNICACIONES DE ALTAS PRESTACIONES CONSCIENTES DEL CONSUMO ENERGETICO. APLICACIONES AL APRENDIZAJE PROFUNDO COMPUTACIONAL - UPV/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MCIU//FJC2019-039222-I/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI//PID2021-126576NB-I00/ es_ES
dc.rights.accessRights Open access es_ES
dc.contributor.affiliation Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.description.bibliographicCitation Alaejos-López, G.; Castelló, A.; Martínez, H.; Alonso-Jordá, P.; Igual, FD.; Quintana-Ortí, ES. (2023). Micro-kernels for portable and efficient matrix multiplication in deep learning. The Journal of Supercomputing. 79:8124-8147. https://doi.org/10.1007/s11227-022-05003-3 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1007/s11227-022-05003-3 es_ES
dc.description.upvformatpinicio 8124 es_ES
dc.description.upvformatpfin 8147 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 79 es_ES
dc.relation.pasarela S\482443 es_ES
dc.contributor.funder Comunidad de Madrid es_ES
dc.contributor.funder Junta de Andalucía es_ES
dc.contributor.funder Agencia Estatal de Investigación es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Universitat Politècnica de València es_ES
dc.contributor.funder Ministerio de Ciencia, Innovación y Universidades es_ES
dc.description.references Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17 es_ES
dc.description.references Goto K, van de Geijn RA (2008) Anatomy of a high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1-12:25 es_ES
dc.description.references Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1-14:33 es_ES
dc.description.references Xianyi Z, Qian W, Yunquan Z (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS) es_ES
dc.description.references Smith TM, van de Geijn RA (2019) The MOMMS family of matrix multiplication algorithms. CoRR, vol. abs/1904.05717. [Online]. Available: http://arxiv.org/abs/1904.05717 es_ES
dc.description.references Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA (2004) A family of high-performance matrix multiplication algorithms. In: Proc. 7th Int. Conf. on Applied Parallel Computing: State of the Art in Scientific Computing, ser. PARA’04, pp 256-265 es_ES
dc.description.references Castelló A, Igual FD, Quintana-Ortí ES (2022) Anatomy of the BLIS family of algorithms for matrix multiplication. In: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp 92–99 es_ES
dc.description.references Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International Workshop on Frontiers in Handwriting Recognition es_ES
dc.description.references Barrachina S, Dolz MF, San Juan P, Quintana-Ortí ES (2022) Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors. J Parallel Distrib Comput 167(C):240–254 es_ES
dc.description.references Low TM, Igual FD, Smith TM, Quintana-Ortí ES (2016) Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw 43(2):12:1-12:18 es_ES
dc.description.references Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4):65–76 es_ES
dc.description.references Dowd K, Severance CR (1998) High performance computing, 2nd ed. O’Reilly es_ES
dc.description.references Zee FGV, Smith TM, Marker B, Low TM, Geijn RAVD, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19 es_ES
dc.description.references Smith TM, van de Geijn R, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp. ser. IPDPS’14, pp 1049–1059 es_ES
dc.description.references He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 es_ES
dc.description.references Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 es_ES
dc.description.references Szegedy C, et al. (2014) Going deeper with convolutions, CoRR, vol. abs/1409.4842, [Online]. Available: http://arxiv.org/abs/1409.4842 es_ES

