Selected Publications

Journal Papers

  • Collective Mind: Towards practical and collaborative auto-tuning


    Grigori Fursin, Renato Miceli, Anton Lokhmotov, Michael Gerndt, Marc Baboulin, Allen D. Malony, Zbigniew Chamski, Diego Novillo, Davide Del Vento
    Scientific Programming Journal, special issue on "Automatic Performance Tuning for HPC Architectures", IOS Press. vol 22, issue 4, pp 309-329. July 2014
    doi: 10.3233/SPR-140396

    Abstract: Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving these problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share over the Internet whole auto-tuning setups with all related artifacts and their software and hardware dependencies, not just performance data. It also allows users to gradually structure, systematize and describe all available research material, including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-descriptions to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve the complex behavior of all existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community in a reproducible way for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We have also decided to release all material at c-mind.org/repo to set an example of collaborative and reproducible research, as well as of our new publication model in computer engineering where experimental results are continuously shared and validated by the community.
    Keywords: high performance computing; systematic auto-tuning; systematic benchmarking; big data driven optimization; modeling of computer behavior; performance prediction; collaborative knowledge management; public repository of knowledge; NoSQL repository; code and data sharing; specification sharing; collaborative experimentation; machine learning; data mining; multi-objective optimization; model driven optimization; agile development; plugin-based tuning; performance regression buildbot; open access publication model; reproducible research


  • Business-driven short-term management of a hybrid IT infrastructure


    Paulo Ditarso Maciel Jr., Francisco Brasileiro, Ricardo Araújo Santos, David Candeia, Raquel Lopes, Marcus Carvalho, Renato Miceli, Nazareno Andrade, Miranda Mowbray
    Journal of Parallel and Distributed Computing (JPDC), Elsevier. vol 72, issue 2, pp 106-119. February 2012
    doi: 10.1016/j.jpdc.2011.11.001


    Abstract: We consider the problem of managing a hybrid computing infrastructure whose processing elements comprise in-house dedicated machines, virtual machines acquired on-demand from a cloud computing provider through short-term reservation contracts, and virtual machines made available by the remote peers of a best-effort peer-to-peer (P2P) grid. Each of these resources has a different cost basis and associated quality of service guarantees. The applications that run in this hybrid infrastructure are characterized by a utility function: the utility gained with the completion of an application depends on the time taken to execute it. We take a business-driven approach to manage this infrastructure, aiming at maximizing the profit yielded, that is, the utility produced as a result of the applications that are run minus the cost of the computing resources that are used to run them. We propose a heuristic to be used by a contract planner agent that establishes the contracts with the cloud computing provider to balance the cost of running an application and the utility that is obtained with its execution, with the goal of producing a high overall profit. Our analytical results show that the simple heuristic proposed achieves very high relative efficiency in the use of the hybrid infrastructure. We also demonstrate that the ability to estimate the grid behaviour is an important condition for making contracts that allow such relative efficiency values to be achieved. On the other hand, our simulation results with realistic prediction errors show only a modest improvement in the profit achieved by the simple heuristic proposed, when compared to a heuristic that does not consider the grid when planning contracts, but uses it, and another that is completely oblivious to the existence of the grid. This calls for the development of more accurate predictors for the availability of P2P grids, and more elaborate heuristics that can better deal with the several sources of non-determinism present in this hybrid infrastructure.
    Keywords: Cloud computing; Grid computing; Peer-to-peer; Business-driven IT management; Short-term management; Capacity planning
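
    Code sketch: a minimal, hypothetical illustration of the profit objective described in the abstract (utility of completing an application minus the cost of the resources used to run it). The linearly decaying utility function, the prices and the sunk-cost assumptions below are invented for the example and are not the heuristic proposed in the paper.

      #include <stdio.h>

      /* Hypothetical utility: full value if the application finishes by its
       * soft deadline, decaying linearly to zero at twice the deadline. */
      static double utility(double finish_time, double deadline, double max_utility)
      {
          if (finish_time <= deadline)
              return max_utility;
          if (finish_time >= 2.0 * deadline)
              return 0.0;
          return max_utility * (2.0 - finish_time / deadline);
      }

      /* Profit = utility obtained minus the cost of the resources used.
       * In-house machines are treated as a fixed (sunk) cost and P2P-grid
       * resources as free, so only cloud machine-hours enter the cost here. */
      static double profit(double finish_time, double deadline, double max_utility,
                           double cloud_hours, double price_per_hour)
      {
          return utility(finish_time, deadline, max_utility)
                 - cloud_hours * price_per_hour;
      }

      int main(void)
      {
          /* Example: renting 20 cloud machine-hours at 0.50 each to finish at
           * t = 11 h against a 10 h deadline and a maximum utility of 100. */
          printf("profit = %.2f\n", profit(11.0, 10.0, 100.0, 20.0, 0.50));
          return 0;
      }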



Conference Papers

  • Investigating Performance Benefits from OpenACC Kernel Directives


    Benjamin Eagan, Gilles Civario, Renato Miceli
    Proceedings of the 2013 International Conference on Parallel Computing (ParCo 2013). Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing book series, IOS Press. Munich, Germany. vol 25, pp 616-625. March 2014
    doi: 10.3233/978-1-61499-381-0-616
    Link: pdf slides

    Abstract: OpenACC is a high-level programming model that uses directives for offloading computation to accelerators. This paper explores the benefit of using OpenACC performance tuning directives to manually specify GPU scheduling, versus the scheduling OpenACC applies by default. We performed manual scheduling using gang and vector clauses in a directive, and applied them to matrix-matrix multiply and Classical Gram-Schmidt orthonormalisation test cases. We then tested on the NVIDIA M2090 and K20 GPGPUs, in conjunction with both the PGI and CAPS implementations of OpenACC. The speedup realised by tuning the gang and vector values ranged from 1.0 to 3.1 in the test cases examined. This shows that the gang and vector values can have a large impact on performance, although in some cases the compilers are able to automatically select ideal gang and vector values.
    Keywords: OpenACC; GPGPU Computing; Software Tuning
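
    Code sketch: a minimal illustration, not taken from the paper, of the kind of gang/vector tuning the abstract describes, here for a matrix-matrix multiply. The num_gangs and vector_length values are placeholders that a tuning sweep would vary; an OpenACC-capable compiler (e.g. PGI/NVHPC) is assumed, and other compilers will simply ignore the pragmas and run the loops on the CPU.

      #include <stdlib.h>

      #define N 1024

      /* C = A * B with an OpenACC parallel loop region. The gang and vector
       * clauses below override the compiler's default GPU scheduling; the
       * values 64 and 128 are placeholders a tuning sweep would explore. */
      void matmul(const float *restrict A, const float *restrict B,
                  float *restrict C)
      {
          #pragma acc data copyin(A[0:N*N], B[0:N*N]) copyout(C[0:N*N])
          {
              #pragma acc parallel loop gang num_gangs(64) vector_length(128)
              for (int i = 0; i < N; ++i) {
                  #pragma acc loop vector
                  for (int j = 0; j < N; ++j) {
                      float sum = 0.0f;
                      for (int k = 0; k < N; ++k)
                          sum += A[i*N + k] * B[k*N + j];
                      C[i*N + j] = sum;
                  }
              }
          }
      }

      int main(void)
      {
          float *A = malloc(sizeof(float) * N * N);
          float *B = malloc(sizeof(float) * N * N);
          float *C = malloc(sizeof(float) * N * N);
          for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }
          matmul(A, B, C);
          free(A); free(B); free(C);
          return 0;
      }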


  • AutoTune: A Plugin-Driven Approach to the Automatic Tuning of Parallel Applications


    Renato Miceli, Gilles Civario, Anna Sikora, Eduardo César, Michael Gerndt, Houssam Haitof, Carmen Navarrete, Siegfried Benkner, Martin Sandrieser, Laurent Morin, François Bodin
    Proceedings of the 11th International Workshop on the State-of-the-Art in Scientific and Parallel Computing (PARA 2012). Applied Parallel and Scientific Computing, Springer Lecture Notes in Computer Science (LNCS). Helsinki, Finland. vol 7782, pp 328-342. February 2013
    doi: 10.1007/978-3-642-36803-5_24
    Link: pdf slides


    Abstract: Performance analysis and tuning is an important step in programming multicore- and manycore-based parallel architectures. While there are several tools to help developers analyze application performance, no tool provides recommendations about how to tune the code. The AutoTune project is extending Periscope, an automatic distributed performance analysis tool developed by Technische Universität München, with plugins for performance and energy efficiency tuning. The resulting Periscope Tuning Framework will be able to tune serial and parallel codes for multicore and manycore architectures and return tuning recommendations that can be integrated into the production version of the code. The whole tuning process -- both performance analysis and tuning -- will be performed automatically during a single run of the application.
    Keywords: multicore-based parallel architectures; parallel and distributed applications; automatic analysis; automatic performance; energy tuning; performance optimization and tuning


  • Transitioning a message passing interface wavefront sensor model to a graphics processor environment


    Michael T. Browne, Renato Miceli
    Proceedings of the Symposium on Integrated Modeling of Complex Optomechanical Systems. Kiruna, Sweden. Proc. SPIE 8336, 833610. August 2011
    doi: 10.1117/12.915921


    Abstract: Previous work produced a parallel and moderately scalable wavefront sensor model as part of a larger integrated telescope model. This relied on traditional high performance computing (HPC) techniques, using optimised C and MPI-based parallelism to marry maximum performance with the productive high-level modelling environment of MATLAB. In the intervening period the computational power and flexibility offered by graphics processors (GPUs) has increased dramatically. This presents both new options in terms of the level of hardware required to perform simulations and new capabilities in terms of the scope of such simulations. We present a discussion of the currently available approaches and test case performance results based on a port to a GPU platform.
    Keywords: Computer hardware; Graphics processing units; Interfaces; MATLAB; Modeling; Simulations; Telescopes; Visualization; Wavefront sensors


  • Predicting the Quality of Service of a Peer-to-Peer Desktop Grid


    Marcus Carvalho, Renato Miceli, Paulo Ditarso Maciel Jr., Francisco Brasileiro, Raquel Lopes
    Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid 2010). Melbourne, Australia. pp 649-654. May 2010
    doi: 10.1109/CCGRID.2010.50
    Link: pdf slides


    Abstract: Peer-to-peer (P2P) desktop grids have been proposed as an economical way to increase the processing capabilities of information technology (IT) infrastructures. In a P2P grid, a peer donates its idle resources to the other peers in the system, and, in exchange, can use the idle resources of other peers when its processing demand surpasses its local computing capacity. Despite their cost-effectiveness, scheduling of processing demands on IT infrastructures that encompass P2P desktop grids is more difficult. At the root of this difficulty is the fact that the quality of the service provided by P2P desktop grids varies significantly over time. The research we report in this paper tackles the problem of estimating the quality of service of P2P desktop grids. We base our study on the OurGrid system, which implements an autonomous incentive mechanism based on reciprocity, called the Network of Favours (NoF). In this paper we propose a model for predicting the quality of service of a P2P desktop grid that uses the NoF incentive mechanism. The model proposed is able to estimate the amount of resources that is available for a peer in the system at future instants of time. We also evaluate the accuracy of the model by running simulation experiments fed with field data. Our results show that in the worst scenario the proposed model is able to predict how much of a given demand for resources a peer is going to obtain from the grid with a mean prediction error of only 7.2%.
    Keywords: grid computing; peer-to-peer; performance prediction; quality of service; resource availability
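
    Code sketch: a deliberately simplified, hypothetical predictor of the fraction of a peer's demand that the grid will satisfy, using an exponential moving average over past rounds. It only illustrates the kind of quantity the paper's model estimates; it is not the NoF-based prediction model proposed in the paper, and the smoothing factor is an arbitrary example value.

      #include <stdio.h>

      /* Exponentially weighted moving average over the fraction of demand
       * that the peer actually obtained from the grid in past rounds. */
      static double predict_obtained_fraction(const double *obtained,
                                              const double *demanded,
                                              int rounds, double alpha)
      {
          double estimate = 0.0;
          for (int t = 0; t < rounds; ++t) {
              double fraction = demanded[t] > 0.0 ? obtained[t] / demanded[t] : 0.0;
              estimate = (t == 0) ? fraction
                                  : alpha * fraction + (1.0 - alpha) * estimate;
          }
          return estimate;  /* expected fraction of the next demand satisfied */
      }

      int main(void)
      {
          double demanded[] = { 100.0, 100.0, 100.0, 100.0 };
          double obtained[] = {  60.0,  70.0,  65.0,  75.0 };
          double f = predict_obtained_fraction(obtained, demanded, 4, 0.3);
          printf("predicted fraction of demand obtained: %.2f\n", f);
          return 0;
      }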



  • Generating Mock-Based Tests Automatically


    Sabrina F. Souto, Renato Miceli, Dalton Serey
    Proceedings of the Third Latin American Workshop on Aspect-Oriented Software Development (LA-WASP 2009). Fortaleza, Brazil. pp 65-66. October 2009
    Link: LA-WASP 2009 proceedings available, digital copy of the paper

    Abstract: Mock objects are used to improve both the efficiency and effectiveness of unit testing. They can completely isolate objects under test from the rest of the application, allowing easier root cause analysis of defects. Writing tests that use mocks, however, can be a tedious, costly task and may lead to the inclusion of defects. Furthermore, mock-based unit tests are known to be short-lived: they are usually discarded due to design changes to the system. In this paper, we propose a technique that generates mock-based tests to address these drawbacks. Based on the analysis of execution traces, interactions between a target object and its collaborators are captured using Aspect-Oriented Programming. We also present Automock, a proof-of-concept tool developed to evaluate the feasibility of the technique.
    Keywords: Software testing; aspect oriented programming; mock objects; test automation




Posters

  • AutoTune: Plugin-based Tuning of Parallel Codes


    Renato Miceli
    The 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC 2013). Berlin, Germany. January 2013
    Link: poster on the AutoTune website

    Abstract: The increasing complexity of parallel architectures for HPC has made it extremely difficult to develop programs that exploit the full capability of the hardware. Application developers have to go through several cycles of program analysis and tuning after the code has been written and debugged. Thus, the development process has become cumbersome and reveals a huge productivity gap. While some tools aid developers in performance analysis, no tool supports the code tuning stage.
    The AutoTune project’s goal is to develop an extensible tuning environment that automates the application tuning process. The framework, named the Periscope Tuning Framework (PTF), will identify tuning recommendations in special application tuning runs, using plugins for performance and energy efficiency tuning of parallel codes for multicore and manycore architectures. The tuning recommendations generated by PTF can then be manually or automatically applied to optimize the code for later production runs.


  • AutoTune: Automatic Online Code Tuning


    Renato Miceli, Gilles Civario, François Bodin
    NVIDIA GPU Technology Conference 2012 (GTC 2012). San Jose, USA. May 2012
    Link: poster on NVIDIA On-Demand website
    Achievements: this poster was featured as the example poster in the Call for Posters for GTC 2013 and 2014

    Abstract: Performance analysis and tuning is an important step in programming multicore and manycore architectures. There are several tools to help developers analyze application performance; still, no tool provides recommendations about how to tune the code. AutoTune will extend Periscope, an automatic online and distributed performance analysis tool developed by Technische Universität München, with plugins for performance and energy efficiency tuning. The resulting Periscope Tuning Framework will be able to tune serial and parallel codes with and without GPU kernels; in addition, it will return tuning recommendations that can be integrated into the production version of the code.
    Keywords: Development Tools & Libraries; GTC 2012 - ID P0400




Technical Presentations

  • Opportunities and Strategies for I/O Auto-Tuning


    Renato Miceli
    Talk at the Dagstuhl Seminar 13401, "Automatic Application Tuning for HPC Architectures". Dagstuhl, Germany. October 2013
    doi: 10.4230/DagRep.3.9.214
    Link: pdf slides

    Abstract: In the HPC community, I/O issues are among the most common bottlenecks and scalability-limiting factors in codes, and also the ones scientists and developers alike are least aware of. For this reason, automating I/O tuning is likely to lead to significant performance improvements, especially since the corresponding manual tuning is a complex task often out of reach of users and code developers. In this talk we will discuss the opportunities and possible approaches for automatically tuning the I/O of HPC software. We will introduce some of the most common I/O issues one can encounter while developing or using an HPC code, examine their relationship to the machine hardware, OS and filesystem settings, and explore prospective automatic tuning strategies adapted to address each of these issues.
    Keywords: Parallel Computing, Programming Tools, Performance Analysis and Tuning
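
    Code sketch: a hypothetical example of one knob an I/O auto-tuner could explore, namely MPI-IO hints such as the striping and collective-buffering hints understood by ROMIO-based MPI implementations. The hint names are standard, but the chosen values are placeholders a tuner would sweep rather than recommendations from the talk.

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define LOCAL_COUNT (1 << 20)   /* doubles written per rank */

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          double *buf = malloc(LOCAL_COUNT * sizeof(double));
          for (int i = 0; i < LOCAL_COUNT; ++i) buf[i] = (double)rank;

          /* Candidate hint values an auto-tuner might try for this file and
           * filesystem; ROMIO-based MPI-IO implementations recognise them. */
          MPI_Info info;
          MPI_Info_create(&info);
          MPI_Info_set(info, "striping_factor", "8");     /* Lustre stripe count */
          MPI_Info_set(info, "romio_cb_write", "enable"); /* collective buffering */
          MPI_Info_set(info, "cb_nodes", "4");            /* aggregator count */

          MPI_File fh;
          double t0 = MPI_Wtime();
          MPI_File_open(MPI_COMM_WORLD, "tuned_output.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
          MPI_Offset offset = (MPI_Offset)rank * LOCAL_COUNT * sizeof(double);
          MPI_File_write_at_all(fh, offset, buf, LOCAL_COUNT, MPI_DOUBLE,
                                MPI_STATUS_IGNORE);
          MPI_File_close(&fh);
          double elapsed = MPI_Wtime() - t0;

          if (rank == 0)
              printf("collective write took %.3f s\n", elapsed);

          MPI_Info_free(&info);
          free(buf);
          MPI_Finalize();
          return 0;
      }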


  • Real-Time Risk Simulation: The GPU Revolution In Profit Margin Analysis


    Gilles Civario, Renato Miceli
    Session at NVIDIA GPU Technology Conference 2012 (GTC 2012). San Jose, USA. May 2012
    Link: presentation audio/video and pdf slides
    Achievements: this work received an HPCwire Readers' Choice Award 2012 for the best use of HPC in financial services

    Abstract: Discover how ICHEC helped a world-leading company in its sector to dramatically speed up and improve the quality of its real-time risk management tool chain. In this session, we present the method used for porting the core part of the simulation engines to GPUs using CUDA. This porting was realized on two very different simulation algorithms and resulted in speed-ups of 2 to 3 orders of magnitude, allowing much greater accuracy of the results in a real-time environment.
    Keywords: Finance; GTC 2012 - ID S0034




White Papers

  • Performance Improvement in Kernels by Guiding Compiler Auto-Vectorization Heuristics


    William Killian, Renato Miceli, EunJung Park, Marco Alvarez Vega, John Cavazos
    Deliverable for the PRACE Project, Second Implementation Phase, Work Package 12. September 2014
    Link: full paper on the PRACE website

    Abstract: Vectorization support in hardware continues to expand and grow even as we continue to rely on superscalar architectures. Unfortunately, compilers are not always able to generate optimal code for the hardware; detecting and generating vectorized code is extremely complex. Programmers can use a number of tools to aid in development and tuning, but most of these tools require expert or domain-specific knowledge to use. In this work we aim to provide techniques for determining the best way to optimize certain codes, with the end goal of guiding the compiler into generating optimized code without requiring expert knowledge from the developer. Initially, we study how to combine vectorization reports with iterative compilation and code generation, and we summarize our insights and patterns on how the compiler vectorizes code. Our utilities for iterative compilation and code generation can further be used by non-experts in the generation and analysis of programs. Finally, we leverage the obtained knowledge to design a Support Vector Machine classifier to predict the speedup of a program given a sequence of optimizations. We show that our classifier is able to predict the speedup of 56% of the inputs within 15% overprediction and 50% underprediction, with 82% of these accurate within 15% both ways.
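
    Code sketch: a hypothetical example of the kind of per-loop guidance that iterative compilation over vectorization reports can suggest. The omp simd directive and its simdlen clause are standard OpenMP, but the chosen width of 8 is a placeholder that a search (or the classifier described above) would select, not a value from the whitepaper.

      #include <stdio.h>

      #define N 4096

      /* A kernel the compiler may not auto-vectorize aggressively on its own.
       * The OpenMP simd directive asserts the loop is safe to vectorize and
       * suggests a SIMD width; compile with e.g. -O3 -fopenmp-simd and compare
       * the compiler's vectorization report with and without the pragma. */
      void saxpy_like(float *restrict y, const float *restrict x, float a)
      {
          #pragma omp simd simdlen(8)
          for (int i = 0; i < N; ++i)
              y[i] = a * x[i] + y[i];
      }

      int main(void)
      {
          static float x[N], y[N];
          for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
          saxpy_like(y, x, 3.0f);
          printf("y[0] = %.1f\n", y[0]);  /* expect 5.0 */
          return 0;
      }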


  • The State-of-the-Art in Directive-Guided Auto-Tuning for Accelerator and Heterogeneous Many-Core Architectures


    Renato Miceli, François Bodin
    Deliverable for the PRACE Project, Second Implementation Phase, Work Package 12. March 2013
    Link: full paper on the PRACE website

    Abstract: In this whitepaper we discuss the latest achievements in the field of auto-tuning of applications for accelerator and heterogeneous many-core architectures guided by programming directives. We provide both an academic perspective, presenting preliminary results obtained by the EU FP7 AutoTune project, and an industrial point of view, demonstrated by the commercial uptake by a leader in compiler technology and services, CAPS Entreprise.


  • Business-Driven Management of Hybrid IT Infrastructures


    Paulo Ditarso Maciel Jr., Marcus Carvalho, Nazareno Andrade, Francisco Brasileiro, Ricardo Araújo, David Maia, Renato Miceli, Raquel Lopes, Miranda Mowbray
    Deliverable for the Hybrid Clouds Project. August 2009
    Link: full paper on the LSD website

    Abstract: With the emergence of the cloud computing paradigm and the continuous search to reduce the cost of running Information Technology (IT) infrastructures, we are currently experiencing an important change in the way these infrastructures are assembled, configured and managed. In this research we consider the problem of managing a hybrid high-performance computing infrastructure whose processing elements comprise in-house dedicated machines, virtual machines acquired from cloud computing providers, and remote virtual machines made available by a best-effort peer-to-peer (P2P) grid. Each of these resources has a different cost basis. The applications that run in this hybrid infrastructure are characterised by a utility function: the utility yielded by the completion of an application depends on the time taken to execute it. We take a business-driven approach to managing this infrastructure, aiming to maximise profit, that is, the utility produced as a result of the applications that are run minus the cost of the computing resources that are used to run them. We assume that the cost of computing resources from the local in-house machines is unavoidable, i.e., the in-house infrastructure has a fixed cost whether or not its resources are used. We also assume that the cost of computing resources from the P2P grid (when they are available) is negligible, because the grid is based on the exchange of spare resources between peers. Applications are run using computing power just from these two sources whenever possible. Any extra capacity required to improve the profitability of the infrastructure is purchased from the cloud computing market. We assume that this extra capacity is reserved for future use through short-term contracts which are negotiated without human intervention. The cost per unit of computing resource may vary significantly between contracts, with more urgent contracts normally being more expensive. However, due to the uncertainty inherent in the best-effort grid, it may not be possible to know in advance exactly how much computing resource will be needed from the cloud computing market. Overestimation of the amount of resources required leads to the reservation of more than is necessary, while underestimation leads to the need to negotiate additional contracts later on to acquire the remaining required capacity. We propose heuristics to be used by a contract planning agent in order to balance the cost of running the applications and the utility that is achieved with their execution, with the aim of producing a high overall profit. We demonstrate that the ability to estimate the grid behaviour is an important condition for making contracts that produce high efficiency in the use of the hybrid infrastructure. We propose a model for predicting the behaviour of a P2P grid that uses a particular incentive mechanism, and assess the suitability of this model using field data. Our results show that the proposed model is able to predict the grid behaviour with an average error that is not larger than 16% for the scenarios evaluated, leading to a worst-case efficiency of 85.32%.
    Keywords: Cloud computing; Grid computing; Peer-to-peer; Business-driven IT management; Capacity planning




Project Reports