dc.contributor.author |
San Juan-Sebastian, Pablo
|
es_ES |
dc.contributor.author |
Rodríguez-Sánchez, Rafael
|
es_ES |
dc.contributor.author |
Igual, Francisco D.
|
es_ES |
dc.contributor.author |
Alonso-Jordá, Pedro
|
es_ES |
dc.contributor.author |
Quintana-Ortí, Enrique S.
|
es_ES |
dc.date.accessioned |
2022-11-10T19:02:43Z |
|
dc.date.available |
2022-11-10T19:02:43Z |
|
dc.date.issued |
2021-10 |
es_ES |
dc.identifier.issn |
0920-8542 |
es_ES |
dc.identifier.uri |
http://hdl.handle.net/10251/189610 |
|
dc.description.abstract |
[EN] We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision) floating point operands. Our code is especially designed for efficient machine learning inference (and to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest though still relevant. |
es_ES |
dc.description.sponsorship |
This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovación y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana. |
es_ES |
dc.language |
English |
es_ES |
dc.publisher |
Springer-Verlag |
es_ES |
dc.relation.ispartof |
The Journal of Supercomputing |
es_ES |
dc.rights |
All rights reserved |
es_ES |
dc.subject |
Deep learning |
es_ES |
dc.subject |
Matrix multiplication |
es_ES |
dc.subject |
High performance |
es_ES |
dc.subject |
NVIDIA Carmel system-on-chip (SoC) |
es_ES |
dc.subject.classification |
CIENCIAS DE LA COMPUTACION E INTELIGENCIA ARTIFICIAL |
es_ES |
dc.subject.classification |
ARQUITECTURA Y TECNOLOGIA DE COMPUTADORES |
es_ES |
dc.title |
Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors |
es_ES |
dc.type |
Article |
es_ES |
dc.identifier.doi |
10.1007/s11227-021-03636-4 |
es_ES |
dc.relation.projectID |
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/ |
es_ES |
dc.relation.projectID |
info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423 / |
es_ES |
dc.relation.projectID |
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/ |
es_ES |
dc.relation.projectID |
info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/ |
es_ES |
dc.relation.projectID |
info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F109//COMUNICACION Y COMPUTACION INTELIGENTES Y SOCIALES/ |
es_ES |
dc.rights.accessRights |
Open |
es_ES |
dc.contributor.affiliation |
Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica |
es_ES |
dc.description.bibliographicCitation |
San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, FD.; Alonso-Jordá, P.; Quintana-Ortí, ES. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing. 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4 |
es_ES |
dc.description.accrualMethod |
S |
es_ES |
dc.relation.publisherversion |
https://doi.org/10.1007/s11227-021-03636-4 |
es_ES |
dc.description.upvformatpinicio |
11257 |
es_ES |
dc.description.upvformatpfin |
11269 |
es_ES |
dc.type.version |
info:eu-repo/semantics/publishedVersion |
es_ES |
dc.description.volume |
77 |
es_ES |
dc.description.issue |
10 |
es_ES |
dc.relation.pasarela |
S\448133 |
es_ES |
dc.contributor.funder |
Comunidad de Madrid |
es_ES |
dc.contributor.funder |
Generalitat Valenciana |
es_ES |
dc.contributor.funder |
Agencia Estatal de Investigación |
es_ES |
dc.description.references |
Deng L et al (2013) Recent advances in deep learning for speech research at Microsoft. In: 2013 IEEE international conference on acoustics, speech and signal processing, May, pp 8604–8608 |
es_ES |
dc.description.references |
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems—vol 1, ser. NIPS’12. Curran Associates Inc., USA, pp 1097–1105 |
es_ES |
dc.description.references |
Zhang J, Zong C (2015) Deep neural networks in machine translation: an overview. IEEE Intell Syst 30(5):16–25 |
es_ES |
dc.description.references |
Devlin J et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 conference North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1, pp 4171–4186 |
es_ES |
dc.description.references |
Sze V et al (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329 |
es_ES |
dc.description.references |
Vaswani A et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30, pp 5998–6008 |
es_ES |
dc.description.references |
Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International workshop on frontiers in handwriting recognition, available as INRIA-00112631 report from https://hal.inria.fr/inria-00112631 |
es_ES |
dc.description.references |
Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1–14:33 |
es_ES |
dc.description.references |
Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17 |
es_ES |
dc.description.references |
Goto K, van de Geijn R (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25 |
es_ES |
dc.description.references |
Low TM, Igual FD, Smith TM, Quintana-Ortí ES (2016) Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2925987 |
es_ES |
dc.description.references |
Fabeiro JF, Andrade D, Fraguela BB (2016) Writing a performance-portable matrix multiplication. Parallel Comput 52:65–77 |
es_ES |
dc.description.references |
Van Zee FG, Smith TM, Marker B, Low TM, van de Geijn RA, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19. https://doi.org/10.1145/2755561 |
es_ES |
dc.description.references |
Smith TM, van de Geijn R, Smelyanskiy M, Hammond JR, Van Zee FG (2014) Anatomy of high-performance many-threaded matrix multiplication. In: IPDPS ’14: Proceedings of the international parallel and distributed processing symposium (to appear) |
es_ES |
dc.description.references |
Catalán S et al (2016) Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Comput 19(3):1037–1051 |
es_ES |
dc.description.references |
Hennessy JL, Patterson DA (2003) Computer architecture: a quantitative approach. Morgan Kaufmann Pub, San Francisco |
es_ES |
dc.description.references |
San Juan P, Castelló PS, Dolz MF, Alonso-Jordá P, Quintana-Ortí ES (2020) High performance and portable convolution operators for multicore processors. In: Proceedings of 32nd international Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp 91–98 |
es_ES |
dc.description.references |
BLIS Performance benchmarks (2020). https://github.com/flame/blis/blob/master/docs/Performance.md |
es_ES |