By Alex Yakyma and Harry Koehnemann

Version 4.0 of the Scaled Agile Framework® (SAFe®) was released earlier this year and has generated keen interest from complex systems builders and large software enterprises. The new version provides specific depth for organizations that build the world’s largest and most critical solutions. Version 4.0 includes a new, optional layer with a set of practices for building large, complex systems in a Lean and Agile manner. It provides better modularity and addresses a number of challenges that large systems builders encounter when organizing around the flow of value, such as coordinating Agile Release Trains in a large Value Stream, or coordinating multiple value streams in the portfolio.

In this article we will discuss the challenges large systems builders face and key approaches for addressing them. But before we start down that path, let’s first understand what makes complex systems development so complex.

The Roots of Complexity

So, what makes the development process so much harder in the world of large, complex systems? Let’s consider some common factors:

  1. The multidisciplinary nature of the systems. These systems require collaboration and integrated components from a broad set of deep engineering skills that include software, firmware, hardware, electrical, mechanics, optics, and hydraulics, to name just a few.
  2. The architectural complexity of the systems. Even in the case of pure software, complex solutions may contain hundreds, or even thousands, of interconnected components, subsystems, services, and applications.
  3. Legacy technologies. Modern systems leverage existing solutions, often initially developed decades ago, that carry a high cost of change and offer poor support for modern approaches to engineering and testing.
  4. Complex production environments. Our complex operational environments challenge the creation of equivalent environments for integration, testing, and demonstration of results.

These factors are sometimes mistakenly considered impediments to adopting Agile and Lean. However, let’s explore where we stand with traditional methodologies.

Traditional Methods Are Not Up to the Task

Traditional, phase-gate methods fail to cope with these complexities. Despite their weight and apparent robustness, they suffer from a fundamental flaw: the assumption that complex systems can be “figured out” in a speculative, detailed, up-front manner. As a result, enterprises prematurely commit to detailed requirements and design before any actual learning begins, thereby dismissing a broader variety of more economically beneficial solutions that may emerge later. They fail to ensure quality and fitness for purpose because feedback is heavily inhibited. And they run over schedule and budget for lack of objective measures of progress, measures that would rely on tangible increments of the solution rather than on the amount of effort already spent or other poor proxies for value.

Despite all the problems with traditional methods, transition to Lean and Agile is often impeded by a number of myths that surround complex systems development. Let’s consider the most common ones:

Myth 1: Frequent integration and testing is not possible in the case of hardware development (or other non-software domains).

Myth 2: People from different domains (SW, FW, HW, etc.) can’t work together.

Myth 3: Complex systems development must follow the phase-gate process model to be successful.

Myth 4: Non-software disciplines cannot produce meaningful value in small increments.

Let’s take a more pragmatic view and address these myths from a more generic perspective, one that spans all engineering disciplines involved in product development.

The Problems May Be Different But the Principles Are Universal

In his article on Agile in hardware development, Ken Rubin concludes that the applicability of Agile methods is not a binary choice, but rather is influenced by the cost of change of the underlying system. Hardware, he argues, has a higher cost of change than software. This dictates certain adjustments but does not exclude agility; quite the opposite, in fact, since the cost of error is also often higher for hardware. Dantar P. Oosterwal, in his book The Lean Machine, shares the experience of Harley-Davidson in their search for a better product development method. The great epiphany, he points out, was the realization that the success of a new development initiative did not depend on the success of individual phases in the process. Instead, projects that went through multiple consecutive design cycles supported by actual system integration were significantly more successful than their “waterfalled” counterparts.

These and other examples suggest that complex systems development is governed by a leaner and more Agile set of principles. Furthermore, these principles hold true across different business contexts, engineering disciplines, and solutions. SAFe considers nine such principles.

Having stated a set of governing principles, now is the time to consider specific practices and patterns for complex systems development.

Putting the Principles to Work

We will split the discussion of implementing the practices into three sections:

1 – Organizing Around Value

Large, physical systems often organize around functional areas. Organizing around function or discipline helps ensure technical integrity, but it also contributes to handoffs, delays, and waiting across team boundaries. SAFe adopts a more pragmatic approach: organizing around value. This paradigm aims at the key goal of establishing the shortest sustainable lead time in value delivery, and it achieves it by building organizational units in such a way as to contain most dependencies inside each unit rather than spreading them across different units.
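To make the dependency-containment idea concrete, here is a minimal sketch comparing a functional split with a value-based split for a hypothetical product. All component names, dependency pairs, and team assignments are invented for illustration; the point is only that the value-based assignment leaves fewer dependencies crossing unit boundaries.

```python
# Hypothetical component dependencies: each pair must coordinate closely.
dependencies = [
    ("engine_ctrl", "sensor_fw"), ("sensor_fw", "sensor_hw"),
    ("ui_app", "cloud_api"), ("cloud_api", "engine_ctrl"),
]

def cross_unit(deps, assignment):
    """Count dependencies that cross organizational-unit boundaries."""
    return sum(assignment[a] != assignment[b] for a, b in deps)

# Split by engineering discipline (function).
by_function = {"engine_ctrl": "sw", "sensor_fw": "fw", "sensor_hw": "hw",
               "ui_app": "sw", "cloud_api": "sw"}

# Split by value stream: each unit owns a capability end to end.
by_value = {"engine_ctrl": "powertrain", "sensor_fw": "powertrain",
            "sensor_hw": "powertrain", "ui_app": "connectivity",
            "cloud_api": "connectivity"}

print(cross_unit(dependencies, by_function))  # 2 cross-boundary handoffs
print(cross_unit(dependencies, by_value))     # 1 cross-boundary handoff
```

Fewer cross-boundary dependencies means fewer handoffs and less waiting, which is exactly the lead-time argument made above.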

Building fully cross-functional and cross-domain Agile Teams (of 6 – 9 people) may not be feasible in many cases. However, creating an Agile Release Train—a self-organized team of teams that usually consists of 50 – 125 people—that includes all functions and aspects of engineering is absolutely feasible in most cases and should be done whenever it meets the objective of establishing a sustainable, fast flow of value.

Let’s consider an example: a product that involves software, firmware, and hardware development, with hundreds of engineers in all domains. How should we organize the trains?

For that we need more context. Let’s say that in our particular case, hardware is tightly coupled with firmware, which in turn creates an abstraction layer for the software operating system and the specific applications and services that run on top of it. This gives us the first cue: putting hardware and firmware engineers on one Agile Release Train is probably a good idea. And if there are too many people for one train, we might create multiple FW-HW trains, each organized around a subset of the key system capabilities. But should software teams be on these same trains as well, or should they be separate?

In the case of our system, hardware and firmware changes are released relatively infrequently, while software teams should be able to produce over-the-air updates every few weeks. Given both the decoupled interfaces and the decoupled release schedules in this particular case, we might benefit from having the software teams on a separate train.

2 – Synchronize Development

Once organized in a structure that supports value creation, we need to establish the actual rhythm of development. Aligning on a common cadence creates such a rhythm by focusing large programs on the work in the current Program Increment, or PI (8 – 12 weeks), reducing excess work-in-process (WIP), and turning otherwise unpredictable events into a predictable routine. This common cadence aligns the diverse value stream participants found in these large, complex systems. We expect practitioners from different engineering domains, as well as Suppliers and the Customer, to participate in PI Planning so they can understand what we are building as a set of ARTs in a larger value stream. SAFe® 4.0 provides additional mechanisms of alignment via Pre- and Post-PI Planning, in addition to the standard PI Planning routine practiced by Agile Release Trains.

Teams on the train also execute on a cadence of short Iterations, each providing a demo of an integrated increment within the ART’s area of concern.

To close the loop, PI boundaries provide Inspect and Adapt opportunities that build on the objective measure of progress—the integrated, end-to-end Solution Demo. Thus the cadence provides the “meter” for incremental Solution development from end to end—the direct opposite of the phase-gate process model.

In order to support such a cadence, teams and trains in the value stream need to learn to frequently integrate and test.

3 – Frequent Integration and Testing

Delivering value quickly is challenging for large systems because of the lead time needed to acquire and then integrate their functional parts. Despite those challenges, we strive to demonstrate value quickly by continually providing ever-closer approximations of the end-to-end, integrated solution. We create these approximations at least every PI, and possibly every iteration, to demonstrate progress and provide an objective evaluation of the current solution for stakeholder feedback.

Achieving these frequent learning Milestones may be difficult when aiming only at a full, end-to-end integration of all subsystems and components. We might need to look for ways to approximate the subsystems of the real solution so that the cost of integration becomes low enough to integrate frequently. When we do so, and replace a subsystem with a simpler “proxy,” we reduce the cost of integration but may degrade the quality of feedback from that integration. For example, replacing the subsystem with a primitive stub may incur significant cost later, when we learn that we made a number of false assumptions based on a very shallow proxy for our subsystem. In the general case, we are dealing with an economic trade-off, as the picture below suggests.

[Figure: cost of integration vs. cost of inaccurate feedback across the spectrum of proxies]

The horizontal axis represents different possible ways to “proxy” the subsystem behavior. It starts with the subsystem itself (as an ideal, perfectly accurate proxy) and follows a range of cyber-physical proxies through to pure software alternatives and all the way to the simplest possible stub. The vertical axis represents the cost associated with the process. The blue line shows decreasing cost of integration, while the grey line represents increasing cost of inaccurate feedback as we move from more complex and accurate approximations to more lightweight and primitive ones. Somewhere in the intersection of the two curves lies the optimum choice for our subsystem.
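This trade-off can be sketched numerically. The cost figures below are purely illustrative assumptions, not data from any real program; in practice they would come from a program’s own economics. The sketch simply finds the proxy whose combined cost is lowest.

```python
# Proxy options, ordered from the real subsystem down to a primitive stub.
proxies = ["real subsystem", "hardware-in-the-loop", "software-in-the-loop",
           "model-in-the-loop", "primitive stub"]

# Invented, illustrative costs for each proxy option.
integration_cost = [100, 60, 30, 15, 5]   # falls as the proxy simplifies
inaccuracy_cost = [0, 10, 25, 55, 90]     # rises as fidelity drops

# Total cost per option; the optimum sits where the two pressures balance.
total_cost = [i + f for i, f in zip(integration_cost, inaccuracy_cost)]
best = min(range(len(proxies)), key=total_cost.__getitem__)

for name, t in zip(proxies, total_cost):
    print(f"{name:22s} total cost = {t}")
print("Optimum proxy:", proxies[best])  # software-in-the-loop in this sketch
```

With these made-up numbers the optimum lands on the software-in-the-loop proxy; different cost profiles for a different subsystem would shift the optimum left or right along the spectrum, which is exactly the per-subsystem decision discussed next.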

If we look at the entire solution, such a decision generally needs to be made for every subsystem. Each subsystem has its own correlation between cost of integration and feedback fidelity. Therefore, we need a balanced approach to identifying the best point on the spectrum of possible proxies for each one. The integration of the entire solution then relies on the respective choices for each subsystem: some may be high-fidelity proxies, some may be actual subsystems, and some may be just stubs, as the picture above suggests.

It is also useful to consider a more “incremental” approach to picking the right spot on the spectrum for different subsystems. It is not uncommon for hardware engineers to separate logical control from their physical device to create a closed testing loop. Initially, when no functional behavior exists, stubs that simulate the subsystem’s request-response behavior serve as proxies for integration and testing. Also, early in the process teams may use models (model in the loop, or MIL) that later evolve to software (SIL) and eventually hardware (HIL) proxies. Each step more closely approximates the sweet spot.
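One way to support that stub-to-model-to-hardware evolution, sketched here in Python purely for illustration, is to put every proxy behind a single request-response interface so that integration checks stay unchanged as fidelity increases. All names here (SensorProxy, StubSensor, ModelSensor) are hypothetical.

```python
from abc import ABC, abstractmethod

class SensorProxy(ABC):
    """Common request-response interface every proxy implements."""
    @abstractmethod
    def read_temperature(self) -> float: ...

class StubSensor(SensorProxy):
    """Earliest proxy: a canned response with near-zero integration cost."""
    def read_temperature(self) -> float:
        return 20.0

class ModelSensor(SensorProxy):
    """Model-in-the-loop proxy: a crude first-order heating model."""
    def __init__(self, ambient: float = 20.0, rate: float = 0.5) -> None:
        self.temp = ambient
        self.rate = rate
    def read_temperature(self) -> float:
        self.temp += self.rate  # the device warms a little on each read
        return self.temp

def within_limit(sensor: SensorProxy, limit: float = 22.0) -> bool:
    """Integration check that is proxy-agnostic: it only sees the interface."""
    return sensor.read_temperature() <= limit

print(within_limit(StubSensor()))  # True: the stub always reads 20.0
model = ModelSensor()
for _ in range(5):
    model.read_temperature()       # let the model warm up
print(within_limit(model))         # False: the model has warmed past the limit
```

Because `within_limit` depends only on the interface, swapping the stub for a model (and later a software- or hardware-in-the-loop proxy) requires no change to the integration logic; only the fidelity of the feedback improves.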

Summary

In this article we considered the key challenges that make complex systems development so complex. While some may appear, at first glance, to be impediments to the adoption of Lean and Agile practices, it is all the more critical to apply Lean and Agile where the cost of error may be incredibly high. We explored the myths that surround Lean-Agile in a complex, multidisciplinary world and tried to take a balanced view based on SAFe principles: immutable, universally applicable “laws of physics” that govern product development. We put those principles to work by considering specific implementations of the core practices around organizational structure, synchronized development, and integration and testing. We showed how practices can be adjusted to provide improved results in different contexts based on selective optimization.

Scaled Agile Framework and SAFe are registered trademarks of Scaled Agile, Inc.


Join the Discussion

    • Armond Mehrabian

      Alex and Harry,

      Excellent article guys. I feel like we’re just scraping the surface of this huge topic. I currently have three clients that are wrestling with these issues.

      What are your thoughts on adopting the integration strategy (graph in section 3) at the right side of horizontal axis (stub) at the beginning of a green-field project and progressively replacing it with mechanisms to the left until we have the real-time subsystem? This way we are able to have objective milestones right from the start and move progressively towards the real (higher fidelity) solution.

      However, if we’re building on top of an existing subsystem (i.e. it’s part of the solution context), is this graph still accurate? Wouldn’t it cost the same whether we’re building on top of a stub or an existing subsystem?

    • Alex Yakyma

      Armond,

      In the case of a greenfield system, everything is a stub in the very beginning, simply because you don’t have anything, by and large. Now, most systems builders don’t build new systems completely from scratch. There are always subsystems and components that exist but may require certain changes. That, BTW, provides a venue for another set of practices sometimes referred to as a “canary build” – an integration based on older, existing subsystem(s). So, for instance, you would run a new network OS on your old router model, possibly scaffolded, to validate a broad set of “commonalities” that those two models have. In this case, it’s the “variability” part that will largely constitute the Cost of Inaccurate Feedback, as the figure suggests. Now, an “older version of the system” is simply an example of a proxy, so it lands somewhere on the spectrum.

      As for the gradual evolution of the subsystems, there are two different cases to consider: 1) the subsystem evolves from nothing to something, which doesn’t really mean that we are moving right-to-left on the spectrum; it rather means that the spectrum is expanding; and 2) the subsystem is already in development but the cost of integration is too high, so we start with something simpler but immediately pave the path towards higher fidelity in the future. As for case 2, the way the system is designed may well determine our options. Well-segregated interfaces (e.g., hardware abstraction layers, or HALs) allow us to move relatively easily from a simple stub to a software-in-the-loop, then a hardware-in-the-loop option.

    • Frank Schophuizen

      Indeed, as Armond states we are just scratching the surface, and as Alex states complex systems hardly ever start from scratch. I would like to add some other complicating factors. (1) Complex system development often involves multiple release trains (teams-of-teams) that need to align and sync somehow to contribute to a business epic (also called an initiative, theme, proposition, or portfolio epic). (2) Release trains may be contributing to multiple business epics, e.g. products being deployed in multiple product lines or systems. (3) Some groups may be too “small” to be organized as release trains (e.g. 2 app development teams of 10 people, one for Android and one for iOS) but should – or should not – be organized as release trains for better alignment with other release trains. And (4) combining Agile and non-Agile (stage-gating) teams, including external partners.

      And let’s not forget: not only do complex systems hardly ever start from scratch, complex organizations never do. Deploying a new paradigm in a complex organization, with existing people steeped in current traditions and tooling infrastructure aligned with current processes, cannot happen overnight. It takes many years, during which people, organization, and infrastructure are in a hybrid transition and learning state.

      Growing from small to large by scaling up Scrum to SAFe may seem simple compared to growing from one large and complex universe (traditional, stage-gating) to the next (agile, SAFe).
