[EN] We address the efficient realization of matrix multiplication (gemm), with application in the convolution operator for machine learning, for the RISC-V core present in the GreenWaves GAP8 processor. Our approach ...
Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato Marín, José Francisco(Springer-Verlag, 2021-12)
[EN] TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce ...
Castelló, Adrián; Mayo Gual, Rafael; Seo, Sangmin; Balaji, Pavan; Quintana Ortí, Enrique Salvador; Peña, Antonio J.(Institute of Electrical and Electronics Engineers, 2020-09-01)
[EN] With the appearance of multi-/many-core machines, applications and runtime systems have evolved in order to exploit the new on-node concurrency brought by new software paradigms. POSIX threads (Pthreads) was widely adopted ...
[EN] For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance ...
Del Campo Calvo, Francisco Javier(Universitat Politècnica de València, 2023-09-25)
[ES] The adoption of neural networks in practically every scientific field is driving their use on a wide variety of devices. These devices can be of very diverse nature: from large ...
[EN] Tuning and optimising the operations executed in deep learning frameworks is a fundamental task in accelerating the processing of deep neural networks (DNNs). However, this optimisation usually requires extensive ...
Maciá Lillo, Antonio(Universitat Politècnica de València, 2022-09-06)
[ES] We live in an interconnected world, with microcontrollers and ultra-low-power processors (MCUs) embedded in smart watches and appliances, voice assistants, mobile phones, and all ...
[EN] We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, ...
Castelló-Gimeno, Adrián; Peña Monferrer, Antonio José; Mayo Gual, Rafael; Planas, Judit; Quintana Ortí, Enrique Salvador; Balaji, Pavan(Springer-Verlag, 2018-11)
[EN] Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use ...
Alaejos López, Guillermo(Universitat Politècnica de València, 2022-04-27)
[ES] The rise in the application of deep neural networks (DNNs) across a wide variety of scientific fields has led to their use not only on compute servers but also on low-power devices. The computations ...
Castelló, Adrián; Barrachina, Sergio; Dolz Zaragozá, Manuel Francisco; Quintana-Ortí, Enrique S.; San Juan-Sebastian, Pablo; Tomás Domínguez, Andrés Enrique(Elsevier, 2022-04)
[EN] We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors ...
Solaz Vivó, Celia(Universitat Politècnica de València, 2024-10-13)
[ES] The main objective of this research is to study the portability of the Python library JAX, developed by Google, across different hardware components (CPU, GPU, and TPU). JAX is a ...
Reaño González, Carlos; Silla Jiménez, Federico; Castello Gimeno, Adrián; Peña Monferrer, Antonio José; Mayo Gual, Rafael; Quintana Ortí, Enrique Salvador; Duato Marín, José Francisco(Wiley, 2015-09-25)
Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) ...
[EN] We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily ...
[EN] We address the efficient implementation of the convolution operator on the GAP8 parallel ultra-low power platform (PULP), a heterogeneous multi-core processor equipped with a fabric controller (FC); a cluster of eight ...
[EN] In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) ...
[EN] We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded ...
Iserte Agut, Sergio; Castello Gimeno, Adrián; Mayo Gual, Rafael; Quintana Ortí, Enrique Salvador; Silla Jiménez, Federico; Duato Marín, José Francisco; Reaño González, Carlos; Prades Gasulla, Javier(IEEE, 2014-10-22)
SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing ...