Optimizations for Energy Efficiency Within Distributed Memory Programming Models

Author: Siddhartha Jana

Publisher:

Published: 2016

Total Pages:

ISBN-13:

With the breakdown of Dennard scaling and Moore's law, power consumption appears to be a primary challenge on the pathway to exascale computing. Extreme-scale research reports indicate that the energy consumed in moving data off-chip is orders of magnitude higher than that of movement within a chip. The direct outcome of this has been a rising concern about the energy and power consumption of large-scale applications that rely on various communication libraries and parallelism constructs for distributed computing. While innovative hardware designs set the upper bounds for power consumption, the software must also adapt itself to achieve maximum efficiency at minimal joules. This work presents detailed analyses of multiple factors within the software stack that affect the energy consumption of large-scale distributed memory HPC applications and programming environments. As part of this empirical analysis, we isolate multiple constraints imposed by the communication, memory, and execution models that affect the energy profiles of such applications. With regard to the communication model, the empirical analyses in this thesis reveal a significant impact from constraints such as the size of the data payload being transferred, the number of data fragments, the overhead of memory management, the use of additional OS threads, and the hardware design of the underlying processor. Additional software design characteristics shown to have a significant impact on communication-intensive kernels include the design of remote data-access patterns (greater than 40% energy savings), the transport-layer protocols (25X improvement in bytes/joule), and the choice of the interconnect (760X improvement in bytes/joule). This dissertation also revisits a two-decade-old programming paradigm, Active Messages, and presents empirical evidence suggesting that integrating it within current SPMD execution models leads to significant gains in performance and energy efficiency. It is hoped that the work presented in this dissertation paves the way for taking software design into consideration while designing current and future large-scale energy-efficient systems operating within a power budget.
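
The Active Messages idea revisited in the dissertation can be summarized as: a message carries an identifier of a handler to run at the target along with a small payload, and the receiving process dispatches that handler on arrival instead of buffering the data for a later matching receive. The C sketch below only illustrates that dispatch pattern under assumed names (am_msg, am_dispatch, accumulate_handler); it is not code or an API from the dissertation.

```c
/* Illustrative sketch of the active-message idea: a message carries a
 * handler index plus a small payload, and the receiving process invokes
 * the handler directly instead of staging the data for a later receive.
 * All names here (am_msg, handlers, am_dispatch) are hypothetical. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t handler_id;   /* which handler to run at the target */
    uint32_t len;          /* payload length in bytes */
    uint8_t  payload[64];  /* small inline payload */
} am_msg;

typedef void (*am_handler)(const uint8_t *data, uint32_t len);

static void accumulate_handler(const uint8_t *data, uint32_t len) {
    /* e.g., fold the incoming bytes into a local partial result */
    uint64_t sum = 0;
    for (uint32_t i = 0; i < len; i++) sum += data[i];
    printf("accumulated %llu\n", (unsigned long long)sum);
}

static am_handler handlers[] = { accumulate_handler };

/* Called when a message arrives at the target process. */
static void am_dispatch(const am_msg *m) {
    handlers[m->handler_id](m->payload, m->len);
}

int main(void) {
    am_msg m = { .handler_id = 0, .len = 4, .payload = {1, 2, 3, 4} };
    am_dispatch(&m);   /* serial stand-in for delivery over the interconnect */
    return 0;
}
```

In an SPMD setting, am_dispatch would be driven by the communication library's progress engine rather than called directly as in this serial stand-in.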


Memory Optimizations of Embedded Applications for Energy Efficiency

Author: Jong Soo Park

Publisher: Stanford University

Published: 2011

Total Pages: 177

ISBN-13:

Current embedded processors often cannot satisfy the increasingly demanding computation requirements of embedded applications with acceptable energy efficiency, whereas application-specific integrated circuits incur excessive design costs. In the Stanford Elm project, it was identified that instruction and data delivery, not computation, dominate the energy consumption of embedded processors. Consequently, the energy efficiency of delivering instructions and data must be sufficiently improved to close the efficiency gap between application-specific integrated circuits and programmable embedded processors. This dissertation demonstrates that the compiler and run-time system can play a crucial role in improving the energy efficiency of delivering instructions and data. Regarding instruction delivery, I present a compiler algorithm that manages L0 instruction scratch-pad memories that reside between processor cores and L1 caches. Despite the lack of tags, scratch-pad memories managed by our algorithm can achieve lower miss rates than caches of the same capacity, saving significant instruction-delivery energy. Regarding data delivery, I present methods that minimize memory-space requirements for parallelizing stream applications, a class of applications commonly found in the embedded domain. When stream applications are parallelized through pipelining, sufficiently large buffers are required between pipeline stages to sustain throughput (e.g., double buffering). For static stream applications, where the production and consumption rates of stages are close to compile-time constants, a compiler analysis is presented that computes the minimum buffer capacity maximizing throughput. Based on this analysis, a new static stream-scheduling algorithm is developed that yields considerable speed-up and data-delivery energy savings compared to a previous algorithm. For dynamic stream applications, I present a dynamically sized, array-based queue design that achieves speed-up and data-delivery energy savings compared to a linked-list-based queue design.
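
As a rough illustration of the contrast drawn here between array-based and linked-list queues for dynamic stream applications, the sketch below shows a dynamically resized, array-backed FIFO: elements stay contiguous and the buffer grows by copying rather than allocating a node per element. The structure, names, and doubling policy are assumptions for illustration, not the dissertation's actual design.

```c
/* Illustrative array-backed FIFO with dynamic resizing, the general kind of
 * structure contrasted with linked-list queues: no per-element allocation,
 * contiguous storage, growth by doubling. Names and policy are assumptions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int    *buf;
    size_t  cap, head, count;
} array_queue;

static void aq_init(array_queue *q, size_t cap) {
    q->buf = malloc(cap * sizeof *q->buf);
    q->cap = cap; q->head = 0; q->count = 0;
}

static void aq_push(array_queue *q, int v) {
    if (q->count == q->cap) {                 /* full: double the capacity */
        int *nb = malloc(2 * q->cap * sizeof *nb);
        for (size_t i = 0; i < q->count; i++)
            nb[i] = q->buf[(q->head + i) % q->cap];
        free(q->buf);
        q->buf = nb; q->head = 0; q->cap *= 2;
    }
    q->buf[(q->head + q->count) % q->cap] = v;
    q->count++;
}

static int aq_pop(array_queue *q) {
    int v = q->buf[q->head];                  /* caller ensures count > 0 */
    q->head = (q->head + 1) % q->cap;
    q->count--;
    return v;
}

int main(void) {
    array_queue q;
    aq_init(&q, 2);
    for (int i = 0; i < 5; i++) aq_push(&q, i);   /* forces one resize */
    while (q.count) printf("%d ", aq_pop(&q));    /* prints 0 1 2 3 4 */
    printf("\n");
    free(q.buf);
    return 0;
}
```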


Advanced Memory Optimization Techniques for Low-Power Embedded Processors

Author: Manish Verma

Publisher: Springer Science & Business Media

Published: 2007-06-20

Total Pages: 192

ISBN-13: 1402058977

This book proposes novel memory hierarchies and software optimization techniques for their optimal utilization. It presents a wide range of optimizations that progressively increase in the complexity of the analysis and of the memory hierarchies targeted. The final chapter covers optimization techniques for applications consisting of multiple processes, which are found in most modern embedded devices.


Optimizing HPC Applications with Intel Cluster Tools

Author: Alexander Supalov

Publisher: Apress

Published: 2014-10-09

Total Pages: 291

ISBN-13: 1430264977

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models, using the Message Passing Interface (MPI) for communication between processes and OpenMP for multi-threading within them, to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters. The book focuses on optimization for clusters consisting of the Intel® Xeon processor, but the optimization methodologies also apply to the Intel® Xeon Phi™ coprocessor and heterogeneous clusters mixing both architectures. Besides the tutorial and reference content, the authors address and refute many myths and misconceptions surrounding the topic. The text is augmented and enriched by descriptions of real-life situations.
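
A minimal hybrid MPI + OpenMP skeleton of the kind such optimization work targets is sketched below: MPI distributes loop iterations across ranks, OpenMP threads them within each rank, and a reduction collects the result. This is a generic illustration under assumed parameters, not an example taken from the book.

```c
/* Minimal hybrid MPI + OpenMP skeleton: MPI spreads work across cluster
 * nodes, OpenMP threads the work within each node. Generic illustration,
 * not an example from the book itself. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* MPI_THREAD_FUNNELED: only the main thread of each rank calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    /* Each rank takes a strided share of the iterations, threaded by OpenMP. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (i + 1.0);

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```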


Constructing and Evaluating Weak Memory Models

Author: Sizhuo Zhang

Publisher:

Published: 2019

Total Pages: 224

ISBN-13:

A memory model for an instruction set architecture (ISA) specifies all the legal multithreaded-program behaviors and consequently constrains processor implementations. Weak memory models are a consequence of architects' desire to preserve the flexibility of implementing optimizations that are used in uniprocessors while building a shared-memory multiprocessor. Commercial weak memory models like ARM and POWER are extremely complicated: it has taken over a decade to formalize their definitions. These formalization efforts are mostly empirical, trying to capture empirically observed behaviors in commercial processors, and they do not provide any insight into the reasons for the complications in weak-memory-model definitions. This thesis takes a constructive approach to studying weak memory models. We first construct a base model for weak memory models by considering how a multiprocessor is formed by connecting uniprocessors to a shared memory system. We minimize the constraints in the base model while ensuring that it enforces single-threaded correctness and matches the common assumptions made in multithreaded programs. With the base model, we can show not only the differences among weak memory models but also the implications of these differences, e.g., more definitional complexity, more implementation flexibility, or failures to match programming assumptions. The construction of the base model also reveals that allowing load-store reordering (i.e., executing a younger store before an older load) is the source of definitional complexity in weak memory models. We construct a new weak memory model, WMM, that disallows load-store reordering and consequently has a much simpler definition. We show that WMM has almost the same performance as existing weak memory models. To evaluate the performance/power/area (PPA) of weak memory models versus that of strong memory models like TSO, we build an out-of-order superscalar cache-coherent multiprocessor. Our evaluation considers out-of-order multiprocessors of small sizes and benchmark programs written using portable multithreaded libraries and compiler built-ins. We find that the PPA of an optimized TSO implementation can match the PPA of implementations of weak memory models. These results provide a key insight: load execution in TSO processors can be as aggressive as, or even more aggressive than, that in weak-memory-model processors. Based on this insight, we further conjecture that weak memory models cannot provide better performance than TSO in the case of high-performance out-of-order processors. However, whether weak memory models have advantages over TSO in the case of energy-efficient in-order processors or embedded microcontrollers remains an open question.
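
The load-store reordering singled out above is commonly illustrated with a "load buffering" litmus test: two threads each load one shared variable and then store to the other, and the outcome in which both loads return 1 is possible only if a younger store may be performed before an older load. The sketch below shows the shape of that test in plain C; it uses ordinary (racy) int accesses for brevity, is not code from the thesis, and is not guaranteed to exhibit the reordering on any particular machine.

```c
/* Classic "load buffering" litmus test. The outcome r1 == 1 && r2 == 1 is
 * possible only if a younger store may execute before an older load; TSO
 * (and WMM, per the thesis) forbid it. Plain ints are used here purely to
 * show the shape of the test, not to reliably reproduce the behavior. */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;
int r1, r2;

void *thread1(void *arg) {
    r1 = x;   /* older load  */
    y  = 1;   /* younger store: may it be performed before the load? */
    return arg;
}

void *thread2(void *arg) {
    r2 = y;
    x  = 1;
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("r1=%d r2=%d (1/1 requires load-store reordering)\n", r1, r2);
    return 0;
}
```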


Fast, Efficient and Predictable Memory Accesses

Author: Lars Wehmeyer

Publisher: Springer Science & Business Media

Published: 2006-09-08

Total Pages: 263

ISBN-13: 140204822X

Speed improvements in memory systems have not kept pace with those of processors, leading to embedded systems whose performance is limited by memory. This book presents design techniques for fast, energy-efficient, and timing-predictable memory systems. In addition, the use of scratchpad memories significantly improves the timing predictability of the entire system, leading to tighter worst-case execution time bounds.


Modeling and Optimization of Parallel and Distributed Embedded Systems

Author: Arslan Munir

Publisher: John Wiley & Sons

Published: 2016-02-08

Total Pages: 399

ISBN-13: 1119086418

This book introduces the state-of-the-art in research in parallel and distributed embedded systems, which have been enabled by developments in silicon technology, micro-electro-mechanical systems (MEMS), wireless communications, computer networking, and digital electronics. These systems have diverse applications in domains including military and defense, medical, automotive, and unmanned autonomous vehicles. The emphasis of the book is on the modeling and optimization of emerging parallel and distributed embedded systems in relation to the three key design metrics of performance, power and dependability.

Key features:
- Includes an embedded wireless sensor networks case study to help illustrate the modeling and optimization of distributed embedded systems.
- Provides an analysis of multi-core/many-core based embedded systems to explain the modeling and optimization of parallel embedded systems.
- Features an application metrics estimation model; Markov modeling for fault tolerance and analysis; and queueing theoretic modeling for performance evaluation.
- Discusses optimization approaches for distributed wireless sensor networks; high-performance and energy-efficient techniques at the architecture, middleware and software levels for parallel multicore-based embedded systems; and dynamic optimization methodologies.
- Highlights research challenges and future research directions.

The book is primarily aimed at researchers in embedded systems; however, it will also serve as an invaluable reference to senior undergraduate and graduate students with an interest in embedded systems research.


Energy-Efficient Distributed Computing Systems

Author: Albert Y. Zomaya

Publisher: John Wiley & Sons

Published: 2012-07-26

Total Pages: 605

ISBN-13: 1118342003

The energy consumption issue in distributed computing systems raises various monetary, environmental, and system performance concerns. Electricity consumption in the US doubled from 2000 to 2005. From a financial and environmental standpoint, reducing the consumption of electricity is important, yet these reforms must not lead to performance degradation of the computing systems. These conflicting constraints create a suite of complex problems that need to be resolved in order to lead to 'greener' distributed computing systems. This book brings together a group of outstanding researchers who investigate the different facets of green and energy-efficient distributed computing.

Key features:
- One of the first books of its kind
- Features the latest research findings on emerging topics by well-known scientists
- Valuable research for grad students, postdocs, and researchers
- Research that will greatly feed into other technologies and application domains


Study of Performance on SMP and Distributed Memory Architectures Using a Shared Memory Programming Model

Author:

Publisher:

Published: 1997

Total Pages: 26

ISBN-13:

In this paper we examine the use of a shared memory programming model to address the problem of portability of application codes between distributed memory and shared memory architectures. We do this with an extension of the Parallel C Preprocessor. The extension, borrowed from Split-C and AC, uses type qualifiers instead of storage class modifiers to declare variables that are shared among processors. The type qualifier declaration supports an abstract shared memory facility on distributed memory machines while making direct use of hardware support on shared memory architectures. Our benchmarking study spans a wide range of shared memory and distributed memory platforms. Benchmarks include Gaussian elimination with back substitution, a two-dimensional fast Fourier transform, and a matrix-matrix multiply. We find that the type-qualifier-based shared memory programming model is capable of efficiently spanning both distributed memory and shared memory architectures. Although the resulting shared memory programming model is portable, it does not remove the need to arrange for overlapped or blocked remote memory references on platforms that require these tuning measures in order to obtain good performance.
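
The extension described here marks shared data with a type qualifier, in the spirit of Split-C and AC, rather than with a storage-class modifier. The fragment below is a hypothetical sketch of what such a declaration and a remote read might look like; the `shared` keyword, array layout, and NPROCS macro are illustrative assumptions, not the paper's exact syntax for the Parallel C Preprocessor extension.

```c
/* Hypothetical sketch of a type-qualifier-based shared declaration, in the
 * spirit of Split-C/AC, as contrasted with a storage-class modifier. The
 * 'shared' qualifier is a placeholder so the sketch compiles serially;
 * the paper's actual syntax may differ. */
#define shared              /* stand-in for the real qualifier */
#define NPROCS 4

shared double a[NPROCS][256];   /* one block of the shared array per processor */
double local_copy[256];         /* ordinary private memory */

void fetch_row(int p) {
    /* On a shared-memory machine this is a plain load; on a distributed-
     * memory machine the preprocessor/runtime turns it into a remote get,
     * which is where the blocking/overlap tuning noted in the paper matters. */
    for (int i = 0; i < 256; i++)
        local_copy[i] = a[p][i];
}

int main(void) { fetch_row(1); return 0; }
```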


Ultra-Low Energy Domain-Specific Instruction-Set Processors

Author: Francky Catthoor

Publisher: Springer Science & Business Media

Published: 2010-08-05

Total Pages: 416

ISBN-13: 9048195284

Modern consumers carry many electronic devices, such as a mobile phone, digital camera, GPS, PDA, and MP3 player. The functionality of each of these devices has gone through an important evolution over recent years, with a steep increase both in the number of features and in the quality of the services that they provide. However, providing the required compute power to support (an uncompromised combination of) all this functionality is highly non-trivial. Designing processors that meet the demanding requirements of future mobile devices requires optimization of the embedded system in general and of the embedded processors in particular, as they should strike the correct balance between flexibility, energy efficiency, and performance. In general, a designer will try to minimize the energy consumption (as far as needed) for a given performance, with sufficient flexibility. However, achieving this goal is already complex when looking at the processor in isolation, but, in reality, the processor is a single component in a more complex system. In order to design such a complex system successfully, critical decisions during the design of each individual component should take into account their effect on the other parts, with the clear goal of moving toward a global Pareto optimum in the complete multi-dimensional exploration space. In the complex, global design of battery-operated embedded systems, the focus of Ultra-Low Energy Domain-Specific Instruction-Set Processors is on the energy-aware architecture exploration of domain-specific instruction-set processors and the co-optimization of the datapath architecture, foreground memory, and instruction memory organisation, with a link to the required mapping techniques or compiler steps at the early stages of the design. By performing an extensive energy breakdown experiment for a complete embedded platform, both energy and performance bottlenecks have been identified, together with the important relations between the different components. Based on this knowledge, architecture extensions are proposed for all the bottlenecks.