Browse by author "Castelló Gimeno, Adrián"

Showing items 1-19 of 19

A BLIS-like matrix multiplication for machine learning in the RISC-V ISA-based GAP8 processor

Ramírez-Betancourth, Cristian; Castelló, Adrián; Quintana-Ortí, Enrique S. (Springer-Verlag, 2022-11)

[EN] We address the efficient realization of matrix multiplication (gemm), with application in the convolution operator for machine learning, for the RISC-V core present in the GreenWaves GAP8 processor. Our approach ...
Accelerating distributed deep neural network training with pipelined MPI allreduce

Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato Marín, José Francisco (Springer-Verlag, 2021-12)

[EN] TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce ...
Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

Alaejos-López, Guillermo; Castelló, Adrián; Alonso-Jordá, Pedro; Igual, Francisco D.; Martínez, Héctor; Quintana-Ortí, Enrique S. (Association for Computing Machinery, 2024-03)
Analysis of threading libraries for high performance computing

Castelló, Adrián; Mayo Gual, Rafael; Seo, Sangmin; Balaji, Pavan; Quintana Ortí, Enrique Salvador; Peña, Antonio J. (Institute of Electrical and Electronics Engineers, 2020-09-01)

[EN] With the appearance of multi-/many core machines, applications and runtime systems have evolved in order to exploit the new on-node concurrency brought by new software paradigms. POSIX threads (Pthreads) was widely-adopted ...
Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks

Castelló, Adrián; Catalán, Mar; Dolz, Manuel F.; Quintana-Ortí, Enrique S.; Duato, José (Springer-Verlag, 2023-05)

[EN] For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance ...
Auto-generación de núcleos computacionales para redes neuronales sobre GPUs

Del Campo Calvo, Francisco Javier (Universitat Politècnica de València, 2023-09-25)

[ES] The adoption of neural networks in practically every scientific field is driving their use on a wide variety of devices. These devices can be of very diverse nature: from large ...
BestOf: an online implementation selector for the training and inference of deep neural networks

Barrachina, Sergio; Castelló, Adrián; Dolz, Manuel F.; Tomás Domínguez, Andrés Enrique (Springer-Verlag, 2022-05-20)

[EN] Tuning and optimising the operations executed in deep learning frameworks is a fundamental task in accelerating the processing of deep neural networks (DNNs). However, this optimisation usually requires extensive ...
Diseño de algoritmos eficientes para aprendizaje automático sobre MCUs

Maciá Lillo, Antonio (Universitat Politècnica de València, 2022-09-06)

[ES] We live in an interconnected world, with microcontrollers and ultra-low-power processors (MCUs) embedded in smart watches and appliances, voice assistants, mobile phones, and all ...
Efficient and Portable Winograd Convolutions for Multi-core Processors

Dolz Zaragozá, Manuel Francisco; Martínez, Héctor; Castelló, Adrián; Alonso-Jordá, Pedro; Quintana-Ortí, Enrique S. (Springer-Verlag, 2023-02-12)

[EN] We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, ...
Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models

Castelló-Gimeno, Adrián; Peña Monferrer, Antonio José; Mayo Gual, Rafael; Planas, Judit; Quintana Ortí, Enrique Salvador; Balaji, Pavan (Springer-Verlag, 2018-11)

[EN] Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use ...
Generación automática de núcleos computacionales para redes neuronales

Alaejos López, Guillermo (Universitat Politècnica de València, 2022-04-27)

[ES] The boom in the application of deep neural networks (DNNs) across a wide variety of scientific fields has led to their use not only on compute servers but also on low-power devices. The computations ...
High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS

Castelló, Adrián; Barrachina, Sergio; Dolz Zaragozá, Manuel Francisco; Quintana-Ortí, Enrique S.; San Juan-Sebastian, Pablo; Tomás Domínguez, Andrés Enrique (Elsevier, 2022-04)

[EN] We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors ...
Implementación de núcleos computacionales para redes neuronales con JAX

Solaz Vivó, Celia (Universitat Politècnica de València, 2024-10-13)

[ES] The main objective of this research is to study the portability of the Python library JAX, developed by Google, across different hardware components (CPU, GPU and TPU). JAX is a ...
Improving the User Experience of the rCUDA Remote GPU Virtualization Framework

Reaño González, Carlos; Silla Jiménez, Federico; Castello Gimeno, Adrián; Peña Monferrer, Antonio José; Mayo Gual, Rafael; Quintana Ortí, Enrique Salvador; Duato Marín, José Francisco (Wiley, 2015-09-25)

Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) ...
Micro-kernels for portable and efficient matrix multiplication in deep learning

Alaejos-López, Guillermo; Castelló, Adrián; Martínez, Héctor; Alonso-Jordá, Pedro; Igual, Francisco D.; Quintana-Ortí, Enrique S. (Springer-Verlag, 2023-05)

[EN] We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily ...
Parallel GEMM-based convolution for deep learning on multicore RISC-V processors

Ramírez-Betancourth, Cristian; Castelló, Adrián; Martínez, Héctor; Quintana-Ortí, Enrique S. (Springer-Verlag, 2024-02)

[EN] We address the efficient implementation of the convolution operator on the GAP8 parallel ultra-low power platform (PULP), a heterogeneous multi-core processor equipped with a fabric controller (FC); a cluster of eight ...
Performance energy trade-offs of deep learning convolution algorithms on ARM processors

Dolz Zaragozá, Manuel Francisco; Barrachina, Sergio; Martínez, Héctor; Castelló, Adrián; Maciá, Antonio; Fabregat, Germán; Tomás Domínguez, Andrés Enrique (Springer-Verlag, 2023-01-21)

[EN] In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) ...
Programming parallel dense matrix factorizations with look-ahead and OpenMP

Catalán, Sandra; Castelló, Adrián; Igual, Francisco D.; Rodríguez-Sánchez, Rafael; Quintana Ortí, Enrique Salvador (Springer-Verlag, 2020-03)

[EN] We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded ...
SLURM Support for Remote GPU Virtualization: Implementation and Performance Study

Iserte Agut, Sergio; Castello Gimeno, Adrián; Mayo Gual, Rafael; Quintana Ortí, Enrique Salvador; Silla Jiménez, Federico; Duato Marín, José Francisco; Reaño González, Carlos; Prades Gasulla, Javier (IEEE, 2014-10-22)

SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing ...