Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

San Juan-Sebastian, Pablo; Rodríguez-Sánchez, Rafael; Igual, Francisco D.; Alonso-Jordá, Pedro; Quintana-Ortí, Enrique S.

doi:10.1007/s11227-021-03636-4

Identificarse

Buscar en RiuNet

Listar

Todo RiuNet
Esta colección

Mi cuenta

Acceder

Estadísticas

Ver Estadísticas de uso

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

Mostrar el registro sencillo del ítem

Ficheros en el ítem

Nombre: Low precision matrix ...

Tamaño: 540.2Kb

Formato: PDF

Descripción: Versión del Autor.

Abrir

Nombre: SanRodriguez-Sanc ...

Tamaño: 517.2Kb

Formato: PDF

Descripción: Versión editorial

Solicitar una copia al autor

dc.contributor.author	San Juan-Sebastian, Pablo	es_ES
dc.contributor.author	Rodríguez-Sánchez, Rafael	es_ES
dc.contributor.author	Igual, Francisco D.	es_ES
dc.contributor.author	Alonso-Jordá, Pedro	es_ES
dc.contributor.author	Quintana-Ortí, Enrique S.	es_ES
dc.date.accessioned	2022-11-10T19:02:43Z
dc.date.available	2022-11-10T19:02:43Z
dc.date.issued	2021-10	es_ES
dc.identifier.issn	0920-8542	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/189610
dc.description.abstract	[EN] We introduce a high performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision)/queryKindly check and confirm whether the corresponding author is correctly identified. floating point operands. Our code is especially designed for efficient machine learning inference (and to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest though still relevant.	es_ES
dc.description.sponsorship	This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovacion y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana.	es_ES
dc.language	Inglés	es_ES
dc.publisher	Springer-Verlag	es_ES
dc.relation.ispartof	The Journal of Supercomputing	es_ES
dc.rights	Reserva de todos los derechos	es_ES
dc.subject	Deep learning	es_ES
dc.subject	Matrix multiplication	es_ES
dc.subject	High performance	es_ES
dc.subject	NVIDIA Carmel system-on-chip (SoC)	es_ES
dc.subject.classification	CIENCIAS DE LA COMPUTACION E INTELIGENCIA ARTIFICIAL	es_ES
dc.subject.classification	ARQUITECTURA Y TECNOLOGIA DE COMPUTADORES	es_ES
dc.title	Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors	es_ES
dc.type	Artículo	es_ES
dc.identifier.doi	10.1007/s11227-021-03636-4	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423 /	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F109//COMUNICACION Y COMPUTACION INTELIGENTES Y SOCIALES/	es_ES
dc.rights.accessRights	Abierto	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica	es_ES
dc.description.bibliographicCitation	San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, FD.; Alonso-Jordá, P.; Quintana-Ortí, ES. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing. 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	https://doi.org/10.1007/s11227-021-03636-4	es_ES
dc.description.upvformatpinicio	11257	es_ES
dc.description.upvformatpfin	11269	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	77	es_ES
dc.description.issue	10	es_ES
dc.relation.pasarela	S\448133	es_ES
dc.contributor.funder	Comunidad de Madrid	es_ES
dc.contributor.funder	Generalitat Valenciana	es_ES
dc.contributor.funder	Agencia Estatal de Investigación	es_ES
dc.description.references	Deng L et al (2013) Recent advances in deep learning for speech research at Microsoft. In: 2013 IEEE international conference on acoustics, speech and signal processing, May, pp 8604–8608	es_ES
dc.description.references	Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems—vol 1, ser. NIPS’12. Curran Associates Inc., USA, pp 1097–1105	es_ES
dc.description.references	Zhang J, Zong C (2015) Deep neural networks in machine translation: an overview. IEEE Intell Syst 30(5):16–25	es_ES
dc.description.references	Devlin J et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 conference North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1, pp 4171–4186	es_ES
dc.description.references	Sze V et al (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329	es_ES
dc.description.references	Vaswani A et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30, pp 5998–6008	es_ES
dc.description.references	Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International workshop on frontiers in handwriting recognition, available as INRIA-00112631 report from https://hal.inria.fr/inria-00112631	es_ES
dc.description.references	Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1–14:33	es_ES
dc.description.references	Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17	es_ES
dc.description.references	Goto K, van de Geijn R (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25	es_ES
dc.description.references	Low TM, Igual FD, Smith TM, Quintana-Orti ES (2016) Analytical modeling is enough for high-performance blis. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2925987	es_ES
dc.description.references	Fabeiro JF, Andrade D, Fraguela BB (2016) Writing a performance-portable matrix multiplication. Parallel Comput 52:65–77	es_ES
dc.description.references	Zee FGV, Smith TM, Marker B, Low TM, Geijn RAVD, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19. https://doi.org/10.1145/2755561	es_ES
dc.description.references	Smith TM, van de Geijn R, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: IPDPS ’14: Proceedings of the international parallel and distributed processing symposium (to appear)	es_ES
dc.description.references	Catalán S et al (2016) Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Comput 19(3):1037–1051	es_ES
dc.description.references	Hennessy JL, Patterson DA (2003) Computer architecture: a quantitative approach. Morgan Kaufmann Pub, San Francisco	es_ES
dc.description.references	San Juan P, Castelló PS, Dolz MF, Alonso-Jordá P, Quintana-Ortí ES (2020) High performance and portable convolution operators for multicore processors. In: Proceedings of 32nd international Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp 91–98	es_ES
dc.description.references	BLIS Performance benchmarks (2020). https://github.com/flame/blis/blob/master/docs/Performance.md	es_ES

Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Buscar en RiuNet

Listar

Todo RiuNet

Esta colección

Mi cuenta

Estadísticas

Ayuda RiuNet

Admin. UPV

Compartir/Enviar a

Citas

Estadísticas

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

Ficheros en el ítem

Este ítem aparece en la(s) siguiente(s) colección(ones)