- -

The RECIPE approach to challenges in deeply heterogeneous high performance systems

RiuNet: Repositorio Institucional de la Universidad Politécnica de Valencia

Compartir/Enviar a

Citas

Estadísticas

  • Estadisticas de Uso

The RECIPE approach to challenges in deeply heterogeneous high performance systems

Mostrar el registro sencillo del ítem

Ficheros en el ítem

dc.contributor.author Agosta, Giovanni es_ES
dc.contributor.author Fornaciari, William es_ES
dc.contributor.author Atienza, David es_ES
dc.contributor.author Canal, Ramon es_ES
dc.contributor.author Cilardo, Alessandro es_ES
dc.contributor.author Flich Cardo, José es_ES
dc.contributor.author Hernández Luz, Carles es_ES
dc.contributor.author Kulczewski, Michal es_ES
dc.contributor.author Massari, Giuseppe es_ES
dc.contributor.author Tornero-Gavilá, Rafael es_ES
dc.contributor.author Zapater, Marina es_ES
dc.date.accessioned 2021-05-28T03:34:27Z
dc.date.available 2021-05-28T03:34:27Z
dc.date.issued 2020-09 es_ES
dc.identifier.issn 0141-9331 es_ES
dc.identifier.uri http://hdl.handle.net/10251/166916
dc.description.abstract [EN] RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computation that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximizing hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware reliability obtained respectively through timing analysis and modeling thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case. es_ES
dc.description.sponsorship The activities described in this article received funding from the European Union's Horizon 2020 research and innovation programme under the FETHPC grant agreement no. 801137 RECIPE: REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems. es_ES
dc.language Inglés es_ES
dc.publisher Elsevier es_ES
dc.relation.ispartof Microprocessors and Microsystems es_ES
dc.rights Reconocimiento - No comercial - Sin obra derivada (by-nc-nd) es_ES
dc.subject HPC es_ES
dc.subject Heterogeneous computing es_ES
dc.subject Run-time management es_ES
dc.subject.classification ARQUITECTURA Y TECNOLOGIA DE COMPUTADORES es_ES
dc.title The RECIPE approach to challenges in deeply heterogeneous high performance systems es_ES
dc.type Artículo es_ES
dc.identifier.doi 10.1016/j.micpro.2020.103185 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/801137/EU/REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems/ es_ES
dc.rights.accessRights Abierto es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.description.bibliographicCitation Agosta, G.; Fornaciari, W.; Atienza, D.; Canal, R.; Cilardo, A.; Flich Cardo, J.; Hernández Luz, C.... (2020). The RECIPE approach to challenges in deeply heterogeneous high performance systems. Microprocessors and Microsystems. 77:1-13. https://doi.org/10.1016/j.micpro.2020.103185 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1016/j.micpro.2020.103185 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 13 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 77 es_ES
dc.relation.pasarela S\431096 es_ES
dc.contributor.funder European Commission es_ES
dc.description.references Flich, J., Agosta, G., Ampletzer, P., Alonso, D. A., Brandolese, C., Cappe, E., … Zoni, D. (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems, 61, 154-170. doi:10.1016/j.micpro.2018.05.011 es_ES
dc.description.references https://euroexa.eu. es_ES
dc.description.references https://www.altera.com/products/sip/memory/stratix-10-mx/overview.html. es_ES
dc.description.references http://www.mango-project.eu. es_ES
dc.description.references https://www.infinibandta.org/infiniband-roadmap/. es_ES
dc.description.references Reghenzani, F., Massari, G., & Fornaciari, W. (2018). chronovise: Measurement-Based Probabilistic Timing Analysis framework. Journal of Open Source Software, 3(28), 711. doi:10.21105/joss.00711 es_ES
dc.description.references Abella, J., Padilla, M., Castillo, J. D., & Cazorla, F. J. (2017). Measurement-Based Worst-Case Execution Time Estimation Using the Coefficient of Variation. ACM Transactions on Design Automation of Electronic Systems, 22(4), 1-29. doi:10.1145/3065924 es_ES
dc.description.references https://lanl.gov/projects/trinity/specifications.php. es_ES
dc.description.references https://www.bsc.es/marenostrum/marenostrum/technical-information. es_ES
dc.description.references https://www.olcf.ornl.gov/olcf-resources/compute-systems/titan/. es_ES
dc.description.references Bellasi, P., Massari, G., & Fornaciari, W. (2015). Effective Runtime Resource Management Using Linux Control Groups with the BarbequeRTRM Framework. ACM Transactions on Embedded Computing Systems, 14(2), 1-17. doi:10.1145/2658990 es_ES
dc.description.references Egwutuoha, I. P., Levy, D., Selic, B., & Chen, S. (2013). A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing, 65(3), 1302-1326. doi:10.1007/s11227-013-0884-0 es_ES
dc.description.references Lee, K., & Wong, S. S. (2017). Fault-Tolerant FPGA with Column-Based Redundancy and Power Gating Using RRAM. IEEE Transactions on Computers, 66(6), 946-956. doi:10.1109/tc.2016.2634533 es_ES
dc.description.references Cheatham, J. A., Emmert, J. M., & Baumgart, S. (2006). A survey of fault tolerant methodologies for FPGAs. ACM Transactions on Design Automation of Electronic Systems, 11(2), 501-533. doi:10.1145/1142155.1142167 es_ES
dc.description.references Parris, M. G., Sharma, C. A., & Demara, R. F. (2011). Progress in autonomous fault recovery of field programmable gate arrays. ACM Computing Surveys, 43(4), 1-30. doi:10.1145/1978802.1978810 es_ES
dc.description.references A. Iranfar, F. Terraneo, W.A. Simon, L. Dragic, I. Pilji, M. Zapater Sancho, W. Fornaciari, M. Kovac, D. Atienza Alonso, Thermal characterization of next-generation workloads on heterogeneous MPSoCs (2017). es_ES
dc.description.references Zoni, D., & Fornaciari, W. (2015). Modeling DVFS and Power-Gating Actuators for Cycle-Accurate NoC-Based Simulators. ACM Journal on Emerging Technologies in Computing Systems, 12(3), 1-24. doi:10.1145/2751561 es_ES
dc.description.references Curtsinger, C., & Berger, E. D. (2013). STABILIZER. ACM SIGARCH Computer Architecture News, 41(1), 219-228. doi:10.1145/2490301.2451141 es_ES
dc.description.references Kormann, J., Rodríguez, J. E., Gutierrez, N., Ferrer, M., Rojas, O., de la Puente, J., … Cela, J. M. (2016). Toward an automatic full-wave inversion: Synthetic study cases. The Leading Edge, 35(12), 1047-1052. doi:10.1190/tle35121047.1 es_ES
dc.description.references Fusi, M., Mazzocchetti, F., Farres, A., Kosmidis, L., Canal, R., Cazorla, F. J., & Abella, J. (2020). On the Use of Probabilistic Worst-Case Execution Time Estimation for Parallel Applications in High Performance Systems. Mathematics, 8(3), 314. doi:10.3390/math8030314 es_ES
dc.description.references D.W. Wright, R.A. Richardson, W. Edeling, J. Lakhlili, R.C. Sinclair, V. Jacauskas, D. Suleimenova, B. Bosak, M. Kulczewski, T. Piontek, P. Kopta, I. Chirca, H. Arabnejad, O.O. Luk, O. Hoenen, J. Weglarz, D. Crommelin, D. Groen, Building confidence in simulation: Application of easyvvuq, Submitted to Journal of Advanced Theory and Simulations on 12/12/2019. es_ES


Este ítem aparece en la(s) siguiente(s) colección(ones)

Mostrar el registro sencillo del ítem