Show simple item record
dc.contributor.author | Alaejos-López, Guillermo | es_ES |
dc.contributor.author | Castelló, Adrián | es_ES |
dc.contributor.author | Martínez, Héctor | es_ES |
dc.contributor.author | Alonso-Jordá, Pedro | es_ES |
dc.contributor.author | Igual, Francisco D. | es_ES |
dc.contributor.author | Quintana-Ortí, Enrique S. | es_ES |
dc.date.accessioned | 2023-11-13T19:03:46Z | |
dc.date.available | 2023-11-13T19:03:46Z | |
dc.date.issued | 2023-05 | es_ES |
dc.identifier.issn | 0920-8542 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10251/199584 | |
dc.description.abstract | [EN] We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily customized to different processor architectures and micro-kernel dimensions. These generic templates employ vector intrinsics to exploit the SIMD (single instruction, multiple data) units in current general-purpose processors and, for the particular type of gemm problems encountered in deep learning, deliver a floating-point throughput rate on par with or even higher than that obtained with conventional, carefully tuned implementations of gemm in current linear algebra libraries (e.g., BLIS, AMD AOCL, ARMPL). Our work exposes the structure of the template-based micro-kernels for ARM Neon (128-bit SIMD), ARM SVE (variable-length SIMD) and Intel AVX512 (512-bit SIMD), showing considerable performance on an NVIDIA Carmel processor (ARM Neon), a Fujitsu A64FX processor (ARM SVE) and an AMD EPYC 7282 processor (256-bit SIMD). | es_ES |
dc.description.sponsorship | Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was supported by the research projects PID2020-113656RB-C22, RTI2018-093684-B-I00, and PID2021-126576NB-I00, of MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", and CM via Multiannual Agreement with Complutense University in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT under projects PR65/19-22445 and CM S2018/TCS-4423. A. Castelló is an FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. H. Martínez is a postdoctoral fellow supported by the Consejería de Transformación Económica, Industria, Conocimiento y Universidades de la Junta de Andalucía. | es_ES |
dc.language | English | es_ES |
dc.publisher | Springer-Verlag | es_ES |
dc.relation.ispartof | The Journal of Supercomputing | es_ES |
dc.rights | Attribution (by) | es_ES |
dc.subject | Matrix multiplication | es_ES |
dc.subject | Linear algebra libraries | es_ES |
dc.subject | High performance | es_ES |
dc.subject | Vector intrinsics | es_ES |
dc.subject | SIMD units | es_ES |
dc.subject.classification | COMPUTER ARCHITECTURE AND TECHNOLOGY | es_ES |
dc.subject.classification | COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE | es_ES |
dc.title | Micro-kernels for portable and efficient matrix multiplication in deep learning | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.1007/s11227-022-05003-3 | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113656RB-C22/ES/COMPUTACION Y COMUNICACIONES DE ALTAS PRESTACIONES CONSCIENTES DEL CONSUMO ENERGETICO. APLICACIONES AL APRENDIZAJE PROFUNDO COMPUTACIONAL - UPV/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MCIU//FJC2019-039222-I/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI//PID2021-126576NB-I00/ | es_ES |
dc.rights.accessRights | Open access | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors | es_ES |
dc.description.bibliographicCitation | Alaejos-López, G.; Castelló, A.; Martínez, H.; Alonso-Jordá, P.; Igual, FD.; Quintana-Ortí, ES. (2023). Micro-kernels for portable and efficient matrix multiplication in deep learning. The Journal of Supercomputing. 79:8124-8147. https://doi.org/10.1007/s11227-022-05003-3 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | https://doi.org/10.1007/s11227-022-05003-3 | es_ES |
dc.description.upvformatpinicio | 8124 | es_ES |
dc.description.upvformatpfin | 8147 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 79 | es_ES |
dc.relation.pasarela | S\482443 | es_ES |
dc.contributor.funder | Comunidad de Madrid | es_ES |
dc.contributor.funder | Junta de Andalucía | es_ES |
dc.contributor.funder | Agencia Estatal de Investigación | es_ES |
dc.contributor.funder | European Regional Development Fund | es_ES |
dc.contributor.funder | Universitat Politècnica de València | es_ES |
dc.contributor.funder | Ministerio de Ciencia, Innovación y Universidades | es_ES |
dc.description.references | Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17 | es_ES |
dc.description.references | Goto K, van de Geijn RA (2008) Anatomy of a high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1-12:25 | es_ES |
dc.description.references | Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1-14:33 | es_ES |
dc.description.references | Xianyi Z, Qian W, Yunquan Z (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS) | es_ES |
dc.description.references | Smith TM, van de Geijn RA (2019) The MOMMS family of matrix multiplication algorithms. CoRR, vol. abs/1904.05717. [Online]. Available: http://arxiv.org/abs/1904.05717 | es_ES |
dc.description.references | Gunnels JA, Gustavson FG, Henry GM, van de Geijn RA (2004) A family of high-performance matrix multiplication algorithms. In: Proc. 7th Int. Conf. on Applied Parallel Computing: State of the Art in Scientific Computing, ser. PARA’04, pp 256-265 | es_ES |
dc.description.references | Castelló A, Igual FD, Quintana-Ortí ES (2022) Anatomy of the BLIS family of algorithms for matrix multiplication. In: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp 92–99 | es_ES |
dc.description.references | Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International Workshop on Frontiers in Handwriting Recognition | es_ES |
dc.description.references | Barrachina S, Dolz MF, San Juan P, Quintana-Ortí ES (2022) Efficient and portable GEMM-based convolution operators for deep neural network training on multicore processors. J Parallel Distrib Comput 167(C):240–254 | es_ES |
dc.description.references | Low TM, Igual FD, Smith TM, Quintana-Ortí ES (2016) Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw 43(2):12:1-12:18 | es_ES |
dc.description.references | Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4):65–76 | es_ES |
dc.description.references | Dowd K, Severance CR (1998) High performance computing, 2nd ed. O’Reilly | es_ES |
dc.description.references | Van Zee FG, Smith TM, Marker B, Low TM, van de Geijn RA, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19 | es_ES |
dc.description.references | Smith TM, van de Geijn RA, Smelyanskiy M, Hammond JR, Van Zee FG (2014) Anatomy of high-performance many-threaded matrix multiplication. In: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp., ser. IPDPS'14, pp 1049–1059 | es_ES |
dc.description.references | He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 | es_ES |
dc.description.references | Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 | es_ES |
dc.description.references | Szegedy C, et al. (2014) Going deeper with convolutions, CoRR, vol. abs/1409.4842, [Online]. Available: http://arxiv.org/abs/1409.4842 | es_ES |
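
The abstract recorded above describes generic, intrinsics-based micro-kernel templates for gemm. As a purely illustrative aside, and not the authors' template code from the article, the following minimal C sketch shows what a 4x4 gemm micro-kernel written with ARM Neon (AArch64) intrinsics can look like; the function name, the packed layouts of A and B, and the column-major storage of C with leading dimension ldc are assumptions made only for this example.

/* Illustrative 4x4 gemm micro-kernel sketch with ARM Neon intrinsics.
 * Assumptions (not from the article): A is packed as kc columns of 4
 * consecutive floats, B as kc rows of 4 consecutive floats, and C is
 * column-major with leading dimension ldc. */
#include <arm_neon.h>

void gemm_ukernel_4x4(int kc, const float *A, const float *B,
                      float *C, int ldc)
{
    /* One Neon register holds one column (4 rows) of the 4x4 block of C. */
    float32x4_t c0 = vld1q_f32(&C[0 * ldc]);
    float32x4_t c1 = vld1q_f32(&C[1 * ldc]);
    float32x4_t c2 = vld1q_f32(&C[2 * ldc]);
    float32x4_t c3 = vld1q_f32(&C[3 * ldc]);

    for (int p = 0; p < kc; p++) {
        /* Load one packed column of A and one packed row of B. */
        float32x4_t a = vld1q_f32(&A[4 * p]);
        float32x4_t b = vld1q_f32(&B[4 * p]);

        /* Rank-1 update: column j of C accumulates a * b[j] via fused
         * multiply-add on a broadcast lane of b. */
        c0 = vfmaq_laneq_f32(c0, a, b, 0);
        c1 = vfmaq_laneq_f32(c1, a, b, 1);
        c2 = vfmaq_laneq_f32(c2, a, b, 2);
        c3 = vfmaq_laneq_f32(c3, a, b, 3);
    }

    /* Write the updated 4x4 block back to C. */
    vst1q_f32(&C[0 * ldc], c0);
    vst1q_f32(&C[1 * ldc], c1);
    vst1q_f32(&C[2 * ldc], c2);
    vst1q_f32(&C[3 * ldc], c3);
}

A production micro-kernel of the kind benchmarked in the article would additionally handle partial tiles, beta scaling and prefetching, and would be instantiated for the architecture-specific register block sizes; those aspects are omitted here for brevity.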