Deals constructively with recognized software problems. Focuses on the unreliability of computer programs and offers state-of-the-art solutions. Covers software development, software testing, structured programming, composite design, language design, proofs of program correctness, and mathematical reliability models. Written in an informal style for anyone whose work is affected by the unreliability of software. Examples illustrate key ideas, and the text includes over 180 references.
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons directly applicable to your organization. This book is divided into four sections. Introduction: learn what site reliability engineering is and why it differs from conventional IT industry practices. Principles: examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE). Practices: understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems. Management: explore Google's best practices for training, communication, and meetings that your organization can use.
This book presents current methods for dealing with software reliability, illustrating the advantages and disadvantages of each method. The description of the techniques is intended for a non-expert audience with minimal technical background. It also describes some advanced techniques aimed at researchers and practitioners in software engineering. This reference serves as an introduction to formal methods and techniques and as a source for learning about ways to enhance software reliability. Projects and exercises give readers hands-on experience with the formal methods and tools.
Computer systems, whether hardware or software, are subject to failure. What, precisely, is a failure? It is defined as: The inability of a system or system component to perform a required function within specified limits. A failure may be produced when a fault is encountered and a loss of the expected service to the user results [IEEE/AIAA P1633]. This raises the question: what is a fault? A fault is a defect in the hardware or computer code that can be the cause of one or more failures. Software-based systems have become the dominant player in the computer systems world. Since it is imperative that computer systems operate reliably, considering the criticality of software, particularly in safety-critical systems, the IEEE and AIAA commissioned the development of the Recommended Practice on Software Reliability. This tutorial serves as a companion document with the purpose of elaborating on key software reliability process practices in more detail than can be specified in the Recommended Practice. However, since other subjects like maintainability and availability are also covered, the tutorial can be used as a stand-alone document. While the focus of the Recommended Practice is software reliability, software and hardware do not operate in a vacuum. Therefore, both software and hardware are addressed in this tutorial in an integrated fashion. The narrative of the tutorial is augmented with illustrative solved problems. The recommended practice [IEEE P1633] is a composite of models and tools and describes the "what and how" of software reliability engineering. It is important for an organization to have a disciplined process if it is to produce high-reliability software. This process uses a life cycle approach to software reliability that takes into account the risk to reliability due to requirements changes. A requirements change may induce ambiguity and uncertainty in the development process that cause errors in implementing the changes.
Subsequently, these errors may propagate through later phases of development and maintenance. In view of the life cycle ramifications of the software reliability process, maintenance is included in this tutorial. Furthermore, because reliability and maintainability determine availability, the latter is also included.
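The statement that reliability and maintainability determine availability is often summarized by the steady-state relation A = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The following is a minimal Python sketch, assuming that common textbook relation rather than any specific model from the tutorial; the function name and the sample figures are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable system is operational.

    Assumes the textbook relation A = MTBF / (MTBF + MTTR); this is an
    illustration, not a formula quoted from the tutorial.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: a failure every 990 hours on average (reliability)
# and 10 hours on average to restore service (maintainability).
availability = steady_state_availability(mtbf_hours=990.0, mttr_hours=10.0)
print(f"{availability:.3f}")  # prints 0.990
```

Under this relation, availability improves either by failing less often (larger MTBF) or by repairing faster (smaller MTTR), which is why the two properties are treated together.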
Focuses on the core systems engineering tasks of writing, managing, and tracking requirements for reliability, maintainability, and supportability that are most likely to satisfy customers and lead to success for suppliers. This book helps systems engineers lead the development of systems and services whose reliability, maintainability, and supportability meet and exceed the expectations of their customers and promote success and profit for their suppliers. This book is organized into three major parts: reliability, maintainability, and supportability engineering. Within each part, there is material on requirements development, quantitative modelling, statistical analysis, and best practices in each of these areas. Heavy emphasis is placed on correct use of language. The author discusses the use of various sustainability engineering methods and techniques in crafting requirements that are focused on the customers’ needs, unambiguous, easily understood by the requirements’ stakeholders, and verifiable. Part of each major division of the book is devoted to statistical analyses needed to determine when requirements are being met by systems operating in customer environments. To further support systems engineers in writing, analyzing, and interpreting sustainability requirements, this book also: contains “Language Tips” to help systems engineers learn the different languages spoken by specialists and non-specialists in the sustainability disciplines; provides exercises in each chapter, allowing the reader to try out some of the ideas and procedures presented in the chapter; and delivers end-of-chapter summaries of the current reliability, maintainability, and supportability engineering best practices for systems engineers. Reliability, Maintainability, and Supportability is a reference for systems engineers and graduate students hoping to learn how to effectively determine and develop appropriate requirements so that designers may fulfil the intent of the customer.
Highly selected from submissions and rigorously reviewed, 44 papers cover models and trends in digital product evolution, whether software could and should be more reliable than the world in which it is used, predicting and estimating reliability, improving process, maintaining software, reliability and testing, modelling and validating reliability, test planning and automation, simulation, special test methods, diagnosing faults, analyzing and optimizing reliability, evolutionary software, code defect classification and metrics, and safety-critical software and fault injection. In addition, materials from panel discussions cover the next generation of dependability standards, achieving adequate levels of reliability in practice, and assessing reliability in emerging techniques. No subject index. Annotation copyrighted by Book News, Inc., Portland, OR.
Revised and updated for professional software engineers, systems analysts and project managers, this highly acclaimed book provides key concepts of software reliability and practical solutions for measuring reliability.
This practical resource presents the basic probabilistic and statistical methods and tools used to extract information from reliability data and make sound decisions. It consolidates and condenses the reliability data analysis methods most often used in everyday practice into an easy-to-follow guide, while also providing a solid foundation from which to explore more complex methods if desired. The book provides mathematical and Excel spreadsheet formulas to estimate parameters and confidence bounds (uncertainty) for the most common probability distributions used in reliability analysis. Several other Excel tools are provided to aid users without access to expensive, dedicated, commercial tools. The book and its tools were developed by the authors after many years of teaching the fundamentals of reliability data analysis to a broad range of technical and non-technical military and civilian personnel, making it useful for both novice and experienced engineers.
Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In this book, experts from Google share best practices to help your organization design scalable and reliable systems that are fundamentally secure. Two previous O’Reilly books from Google, Site Reliability Engineering and The Site Reliability Workbook, demonstrated how and why a commitment to the entire service lifecycle enables organizations to successfully build, deploy, monitor, and maintain software systems. In this latest guide, the authors offer insights into system design, implementation, and maintenance from practitioners who specialize in security and reliability. They also discuss how building and adopting their recommended best practices requires a culture that’s supportive of such change. You’ll learn about secure and reliable systems through: design strategies; recommendations for coding, testing, and debugging practices; strategies to prepare for, respond to, and recover from incidents; and cultural best practices that help teams across your organization collaborate effectively.