- -

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Compartir/Enviar a

Citas

Estadísticas

  • Estadisticas de Uso

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

Mostrar el registro completo del ítem

San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, FD.; Alonso-Jordá, P.; Quintana-Ortí, ES. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing. 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4

Por favor, use este identificador para citar o enlazar este ítem: http://hdl.handle.net/10251/189610

Ficheros en el ítem

Metadatos del ítem

Título: Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors
Autor: San Juan-Sebastian, Pablo Rodríguez-Sánchez, Rafael Igual, Francisco D. Alonso-Jordá, Pedro Quintana-Ortí, Enrique S.
Entidad UPV: Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica
Fecha difusión:
Resumen:
[EN] We introduce a high performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision)/queryKindly check and confirm whether the corresponding author is ...[+]
Palabras clave: Deep learning , Matrix multiplication , High performance , NVIDIA Carmel system-on-chip (SoC)
Derechos de uso: Reserva de todos los derechos
Fuente:
The Journal of Supercomputing. (issn: 0920-8542 )
DOI: 10.1007/s11227-021-03636-4
Editorial:
Springer-Verlag
Versión del editor: https://doi.org/10.1007/s11227-021-03636-4
Código del Proyecto:
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2017-82972-R/ES/TECNICAS ALGORITMICAS PARA COMPUTACION DE ALTO RENDIMIENTO CONSCIENTE DEL CONSUMO ENERGETICO Y RESISTENTE A ERRORES/
info:eu-repo/grantAgreement/CAM//S2018%2FTCS-4423 /
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093684-B-I00/ES/HETEROGENEIDAD Y ESPECIALIZACION EN LA ERA POST-MOORE/
info:eu-repo/grantAgreement/CAM//PR65%2F19-22445/
info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F109//COMUNICACION Y COMPUTACION INTELIGENTES Y SOCIALES/
Agradecimientos:
This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovacion y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, ...[+]
Tipo: Artículo

References

Deng L et al (2013) Recent advances in deep learning for speech research at Microsoft. In: 2013 IEEE international conference on acoustics, speech and signal processing, May, pp 8604–8608

Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems—vol 1, ser. NIPS’12. Curran Associates Inc., USA, pp 1097–1105

Zhang J, Zong C (2015) Deep neural networks in machine translation: an overview. IEEE Intell Syst 30(5):16–25 [+]
Deng L et al (2013) Recent advances in deep learning for speech research at Microsoft. In: 2013 IEEE international conference on acoustics, speech and signal processing, May, pp 8604–8608

Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems—vol 1, ser. NIPS’12. Curran Associates Inc., USA, pp 1097–1105

Zhang J, Zong C (2015) Deep neural networks in machine translation: an overview. IEEE Intell Syst 30(5):16–25

Devlin J et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 conference North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1, pp 4171–4186

Sze V et al (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329

Vaswani A et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30, pp 5998–6008

Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: International workshop on frontiers in handwriting recognition, available as INRIA-00112631 report from https://hal.inria.fr/inria-00112631

Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):14:1–14:33

Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17

Goto K, van de Geijn R (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):12:1–12:25

Low TM, Igual FD, Smith TM, Quintana-Orti ES (2016) Analytical modeling is enough for high-performance blis. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2925987

Fabeiro JF, Andrade D, Fraguela BB (2016) Writing a performance-portable matrix multiplication. Parallel Comput 52:65–77

Zee FGV, Smith TM, Marker B, Low TM, Geijn RAVD, Igual FD, Smelyanskiy M, Zhang X, Kistler M, Austel V, Gunnels JA, Killough L (2016) The BLIS framework: experiments in portability. ACM Trans Math Softw 42(2):1–19. https://doi.org/10.1145/2755561

Smith TM, van de Geijn R, Smelyanskiy M, Hammond JR, Zee FGV (2014) Anatomy of high-performance many-threaded matrix multiplication. In: IPDPS ’14: Proceedings of the international parallel and distributed processing symposium (to appear)

Catalán S et al (2016) Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Comput 19(3):1037–1051

Hennessy JL, Patterson DA (2003) Computer architecture: a quantitative approach. Morgan Kaufmann Pub, San Francisco

San Juan P, Castelló PS, Dolz MF, Alonso-Jordá P, Quintana-Ortí ES (2020) High performance and portable convolution operators for multicore processors. In: Proceedings of 32nd international Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp 91–98

BLIS Performance benchmarks (2020). https://github.com/flame/blis/blob/master/docs/Performance.md

[-]

recommendations

 

Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro completo del ítem