The journey of Operational Excellence is one of continuous improvement, where each stage builds on the last to drive greater efficiency and effectiveness across workload design, implementation, and support.
At its core, it's about streamlining key practices like deployment, monitoring, testing, and automation. The journey begins with a strong foundation: a shared vocabulary, standardized practices, and a DevOps mindset that encourages collaboration and stability. From there, standardization introduces consistency and predictability into processes. As teams grow more proficient, individual tasks evolve into integrated workflows, supported by production-ready capabilities such as automated testing, intelligent monitoring, and continuous integration.
When systems go live in production, operations become even more advanced. Teams are equipped to manage change quickly and reliably, meeting quality benchmarks and implementing feature requests from product owners with confidence.
The most mature stage is all about optimization and innovation. Here, teams operate at scale, continuously adapting systems in real time to meet evolving business needs and technological shifts. However, this isn't a fixed destination; it's a dynamic mindset of always improving, always adapting.
The model is structured into five distinct maturity levels, each with a primary goal and a set of core strategies. The sections that follow explore each level. Be sure to also review the highlighted tradeoffs and associated risks as you progress.
Emphasize teamwork and unity in problem-solving to establish a strong foundation that creates consistent and stable operations in later stages.
Establish a DevOps mindset at Level 1 to ensure the success of future strategies. Implement well-established DevOps methodologies to enhance process efficiency. Focus on building essential and common vocabulary, processes, and tools for stable operations.
Key strategies
✓ Encourage collaboration and foster a blameless culture
Align team efforts with business needs while fostering a collaborative culture.
Members from centralized teams, full-time staff dedicated to workload functionality, partners, or vendors often manage workload operations. These individuals should function as a collective force, with mutual respect and acknowledgment for each other's expertise. If teams operate as independent parts, complexities and friction can occur. Independent teams undermine the goal of functioning as a single, efficient system that drives business outcomes.
To reduce an isolated sense of ownership, advocate for a unified approach to problem-solving. All efforts should cater to the needs of the business. View both successes and failures as shared outcomes.
Begin with industry-proven tools and software development lifecycle (SDLC) processes that suit your workload and enhance development efficiency. Don't diverge from proven methods and avoid custom methodologies because they often introduce higher friction.
Popular choices include Agile frameworks such as Scrum and Kanban. Most experienced developers, DevOps engineers, and product owners are familiar with these approaches, which minimizes the learning curve for new hires.
Initially, adopt established industry standards to introduce standardization, and optimize processes later. Ensure that the tools that you select can grow with your needs without requiring a premature switch to cutting-edge solutions.
✓ Set up source control processes
Based on the scale of the application, decide how to structure source code. For larger systems, each team should have its own processes for building and deploying components they're responsible for. They should have clearly defined interfaces that allow for component discoverability and sharing with other parts of the system. Select a source control technology and set up processes to ensure team members don't interfere with each other's work.
In contrast, a single deployment pipeline might be more effective for smaller-scale applications. It simplifies coordination and might also improve reliability. However, it can make updating or migrating specific parts of the system more challenging.
✓ Use infrastructure as code (IaC) as your primary deployment approach
Use a declarative approach as the standard for deployments to ensure consistency, repeatability, and long-term benefits like automation, self-documentation, and change history.
Prefer IaC deployments over portal deployments to avoid risks from inconsistent configurations and lack of testing. Avoid compiled languages or proprietary formats that are restricted to specific programs.
Start with a good foundation by using tools that Azure natively supports, like Bicep and Terraform. Evaluate tools to ensure that they simplify your future journey. Ensure that the technology provider has good documentation and a reliable service support program.
Risk: Missed modernization opportunities are a risk. For example, tools and processes carried over from on-premises solutions often require hard-to-manage custom scripts in the cloud and can cause problems if you don't modernize them.
To mitigate this risk, explore modern technology options and update on-premises processes.
One of the goals for adopting IaC is consistency. Make templates flexible enough to deploy across various environments. Use parameters, variables, and configuration files to modify resource settings for each environment. Abstract only the necessary settings, and avoid over-abstraction of settings that rarely change. Also, avoid overcomplicating solutions by relying on extensive template libraries. This practice can lead to maintenance challenges.
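As a simplified illustration of this approach, the following sketch assumes the Azure CLI, a hypothetical main.bicep template, and one parameter file per environment. The same template is deployed to every environment, and only the parameter file and target resource group change:

```python
import subprocess

# Hypothetical layout: one shared template plus one parameter file per environment.
ENVIRONMENTS = {
    "dev": {"resource_group": "rg-workload-dev", "parameters": "parameters/dev.json"},
    "prod": {"resource_group": "rg-workload-prod", "parameters": "parameters/prod.json"},
}

def deploy(environment: str) -> None:
    """Deploy the same Bicep template to an environment by using its parameter file."""
    env = ENVIRONMENTS[environment]
    subprocess.run(
        [
            "az", "deployment", "group", "create",
            "--resource-group", env["resource_group"],
            "--template-file", "main.bicep",
            "--parameters", f"@{env['parameters']}",
        ],
        check=True,
    )

if __name__ == "__main__":
    deploy("dev")
```

Keeping the environment-specific values in small parameter files makes the differences between environments explicit and easy to review.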
Establish a solid IaC foundation to create more opportunities for deployment and system management optimization in future levels. For example, you can add desired state configuration or GitOps.
✓ Prioritize security from the start
Prioritize security even at this early stage. Security measures are often based on segmentation, such as roles, resources, and networking, which introduces complexities. The team must acknowledge these complexities, build security measures early on, and plan on investing in security over time. This approach avoids deferring security implementations to later stages.
Risk: Friction between development, support, and operations processes can undermine security efforts. Teams often start with good intentions, but security work still faces resistance over time.
To mitigate the risk, add security tasks to backlogs. This practice ensures accountability within the team and makes progress trackable alongside development tasks.
Make tools and processes transparent to easily detect vulnerabilities through audits and peer reviews. Explore industry-standard tools that support vulnerability scanning and security controls, even if you don't fully implement them yet.
Ensure that your tools and deployment practices use the same identity provider as your production environments to minimize different identity control planes.
Standardize foundational processes. This approach streamlines decision-making responsibilities and defines the requirements for system deployment and monitoring.
At Level 2, the team should adopt a more structured approach and focus development activities on the core functionality of the workload. Establishing consistency early on helps minimize operational burdens in later stages.
Key strategies
✓ Define team roles and decision-making responsibilities
Adopt a product mindset. Instead of viewing the workload as integration of tools, technologies, or job functions, treat it as a cohesive product with a clear focus on the end goal. In Level 2, apply a more structured approach where each role is clearly defined and respected.
Expertise within the team often varies. This diversity can be useful in distributing decision-making among various job functions. For example, specific team members might excel at making technical decisions, while other team members might be experts in defining business outcomes to remain competitive in the ecosystem.
Risk: Some workload teams adopt a consensus-driven culture, where they commit to tasks only when everyone agrees. This culture promotes inclusivity, but it often stifles initiatives when full consensus isn't achieved.
Ensure a well-structured decision-making process by using the following principles:
Designate a directly responsible individual to ensure that decision-making is distributed among team members and aligned with their areas of expertise, instead of being centralized with one person.
Document who the decision-makers are, and include this information in the onboarding materials for new employees.
Consider adopting a decision-making methodology that clearly defines specific roles and responsibilities. Be mindful that these approaches can create division and shift focus away from product goals. Establish checks and balances to prevent siloed decision-making and reduce friction.
✓ Strive to make improvements, no matter how small
Fostering a continuous improvement mindset means making decisions today with the understanding that they can be refined tomorrow.
Delaying changes can cause the team to miss present improvement opportunities. Avoid overthinking and indecision. Striving for a perfect solution might hinder small yet meaningful progress. Focus on making improvements now while continuously seeking ways to improve.
Technical debt is a strategic tool in development for capturing short-term decisions. It can serve as a motivator for incremental updates, which prevents unnecessary accumulation. Treat technical debt as a recurring task in the backlog.
✓ Standardize foundational processes
Different classes of workloads have unique process requirements tailored to their specific characteristics. For example, AI workloads rely on machine learning operations and generative AI operations to drive data pipelines to the model. Mission-critical workloads prioritize real-time monitoring dashboards that site reliability engineers can quickly act on.
Within a workload class, strive for standardization to enhance consistency and reduce operational burden. For AI workloads that include both discriminative and generative models, standardize the processes around data operations. These operations include data access, cleansing, and transformation before it's used to train models or ground generative AI models.
Standardization is recommended for the following use cases:
| Process | Benefit |
| --- | --- |
| Problem tracking and management | Facilitates better communication across roles, helps in prioritization, and is needed for historical analysis of past problems |
| Communication tools and processes, especially to handle incidents | Minimizes the risk of miscommunication and improves coordination among team members to resolve problems faster |
| Code styles, resource naming conventions, and documentation standards | Enhances code readability and maintainability by establishing guidelines |
| Testing procedures | Ensures that all changes go through a selected set of tests, which provides quality assurance |
| Continuous integration and continuous deployment | Ensures automated testing, integration, and deployment of code changes, which results in more reliable releases |
Risk: Continuous improvement and innovation often occur when a team slightly deviates from established standards to explore better approaches. These deviations should be encouraged but structured. For example, hosting innovation days allows the team to focus on pre-selected improvement projects, which fosters fresh ideas and experimentation.
Standardized processes include the necessary tools for effective implementation. At this level, prioritize off-the-shelf solutions instead of custom-built solutions, which you can reconsider later for specialized use cases.
Day-to-day tools for workloads include development, testing, monitoring, and deployment tools. Purchased tools streamline workflows and ensure consistency. This consistency allows teams to focus on delivering features without the complexity of developing and maintaining custom solutions.
Risk: When considering tools, there's often a tendency to overemphasize the tool's extensibility and future potential instead of its core functionality. At this stage, focus on tools that are practical, address your current problems, and fit your current workflow.
✓ Adopt automation across the workload
As you develop a new or existing workload, seek opportunities to integrate automation. Designing a new workload with automation in mind from the start makes future adoption seamless. Similarly, incorporating automation into existing workloads, or brownfield workloads, early in their life cycle helps you gain efficiency and maintain consistency over time.
To streamline adoption, use mature, familiar off-the-shelf tools that are compatible with your cloud platform instead of building solutions from scratch. Explore native automation tools from your cloud provider to simplify the design. For example, many Azure services support autoscaling for performance and failover capabilities for disaster recovery. When you assess non-Microsoft tools, factor in your team's expertise and any relevant business standards.
The following areas can benefit from automation:
Routine operational tasks, like monitoring and alerting, and update management
Software development life cycle tasks, like deployments and testing
Workload performance optimizations, like resource scaling
Security and governance mechanisms, like scans and policy enforcement
Backup and recovery activities
Cost optimizations, like resource deallocations and shutdowns
Risk: In the early stages of your workload development, be careful about focusing too much on building or integrating automation because it can divert attention from delivering the workload to production. Take a measured approach to ensure that your workload is manageable while maintaining your development velocity.
Trade-off: If a task can be done infrequently, efficiently, and safely by humans, it might not be worth automating. For example, automating the annual refresh of a certificate might not justify the investment of development cycles.
In Level 1, the focus is on adopting infrastructure-as-code (IaC) tools to deploy infrastructure and pipelines for application code. At Level 2, extend that practice to include configuration and management of that deployed infrastructure and applications.
Use a desired state configuration approach to bootstrap your resources and avoid configuration drift. Different tasks and platforms require different automation tools. For example, Ansible is suitable for managing desired state configuration for virtual machines (VMs), while a GitOps solution, such as Flux, is suitable for Kubernetes clusters.
Determine the right level of automation for your post-deployment tasks to minimize your operational burden while keeping the design simple. Tasks like installing certificates, configuring the OS, and seeding databases are all good candidates for automation. Also, consider extending your automation to include deploying and configuring your app on newly deployed VMs or container hosts.
Risk: Avoid unnecessary tool sprawl. Developers or development teams that use different approaches and technology can result in a fractured tooling ecosystem. Standardize on a select number of tools for your workload that meet your requirements, and ensure that your workload team is trained on those tools. Likewise, be selective about adopting organizational standards for tooling. If your organization suggests tools that add excessive risk to your workload, evaluate alternative tools that are more suitable.
✓ Define your workload's deployment strategy
A deployment strategy is a critical component of operational excellence. A well-designed deployment strategy ensures that services remain available to users by reducing or eliminating downtime during updates or changes. Gain consensus from stakeholders on how and when changes are deployed to production. Consider the following points:
Define tolerated downtime. Determine if the workload can support downtime without causing significant problems or financial loss. Clearly specify if zero downtime is a requirement for routine deployments.
Establish deployment frequency. Determine deployment frequency based on feature development. Agree on a schedule, whether it's daily, weekly, quarterly, or another suitable approach. When possible, prioritize smaller, more frequent deployments if they align with your scenario.
Plan for emergency deployments. Develop a plan for implementing procedures that manage emergency deployments, such as security hotfixes. This approach ensures that team members understand their responsibilities and can act quickly when needed.
Design a repeatable deployment system that can be automated to minimize errors and ensure consistency. Include provisions for rollback to restore the system to a functional state if errors occur in the last deployment.
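As a minimal sketch of such a system, the following example wraps a deployment in validation and rollback logic. The run_deployment and run_smoke_tests helpers are hypothetical placeholders for your own pipeline and test suite:

```python
def run_deployment(version: str) -> None:
    """Hypothetical placeholder: invoke your deployment pipeline or IaC tooling here."""
    print(f"Deploying version {version}")

def run_smoke_tests() -> bool:
    """Hypothetical placeholder: run post-deployment validation and return the result."""
    return True

def deploy_with_rollback(new_version: str, last_known_good: str) -> bool:
    """Deploy a new version; restore the last functional state if validation fails."""
    try:
        run_deployment(new_version)
        if not run_smoke_tests():
            raise RuntimeError("Smoke tests failed")
        return True
    except Exception as error:
        print(f"Deployment of {new_version} failed: {error}. Rolling back.")
        run_deployment(last_known_good)
        return False

if __name__ == "__main__":
    deploy_with_rollback(new_version="1.4.0", last_known_good="1.3.2")
```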
✓ Design the workload monitoring stack
Designing a monitoring system requires you to select what to monitor and understand the importance of those metrics for users.
Start by collecting logs and metrics from all components in the workload. Take advantage of platform-provided monitoring tools. These tools are integrated with the services and provide functional and operational insights with little configuration. Securely store this data in a dependable storage solution that can be queried for analysis.
Risk: Avoid collecting excessive data because it can create noise and increase costs. Start with basic metrics like CPU, memory usage, and storage usage. Add useful application health metrics over time.
Based on the initial analysis, work with the stakeholders to define what both healthy and unhealthy states mean for the workload. You use this information in later stages to develop a health model that accurately reflects that health status.
Risk: The monitoring pipeline can be misused as a tool for collecting business metrics, such as chargebacks, transaction service-level agreements, capacity assurances, and sales totals. Maintain a clear distinction between workload health metrics and business metrics.
Collect business metrics as an application feature instead of through monitoring configurations. Monitoring data streams can be sampled and aren't usually recoverable in a disaster. Treat business-critical data as workload data and keep it separate from workload health signals.
Reduce the risk of deployment errors that could result in downtime. If downtime does occur, ensure that site reliability engineers can focus on critical problems without wasting time collecting metrics for analysis.
In the earlier levels, you standardize the software development life cycle and make key decisions about deployment methods, testing, and telemetry collection.
At Level 3, the operations should mature to boost confidence in deployment. Testing becomes a go-live requirement to ensure safe and stable deployments. The deployment processes also evolve, embracing automated pipelines to build and deploy workloads to production. To minimize the blast radius of risks, proper segmentation is maintained between foundational resources and application code.
Monitoring practices also mature. The monitoring system is extended to implement a health model that turns operational knowledge into actionable insights. Alerts are streamlined to fit the business context, so irrelevant alerts don't create false alarms or erode trust in the monitoring system.
Key strategies
At this level, it's important to establish release promotion as a formal change control protocol before going live. This process progresses proposed changes through various stages with quality gates. Each stage undergoes thorough testing, and changes advance only if they pass these checks and receive approval.
Create distinct environments for different stages like dev/test, quality assurance, user acceptance testing (UAT), and production. This approach ensures thorough validation before moving to the next stage. The idea is to manage risks in the lower environments because errors should be caught early to minimize the blast radius.
At this level, decide on a strategy where testing is prioritized for the critical parts of the system. Use the following steps to ensure a robust testing strategy:
Define test cases. Create test cases for application code, infrastructure templates, and configuration.
Conduct various types of tests. Perform different types of tests in each environment. Start testing high-risk changes in low-risk environments. Gradually move high-risk changes to higher-risk environments as confidence grows.
Set up separate testing environments. Create distinct environments for different testing activities. For example, if you do UI testing, create a dedicated environment separate from both development and production environments.
Ideally, keep development and test environments as similar to the production environment as possible. Ensure that test data closely matches production data, even if it's only a sample. Resources should also be comparable. For example, small test databases that have a few megabytes of data might be sufficient in some cases, but it's essential to validate features by using a near production-size database that contains several terabytes to assess real-world performance.
Trade-off: It's important to keep these environments close, but there are trade-offs to consider. Deploying the full scale of the production environment in development isn't feasible.
Find balance between proximity to the production environment and having a cost-efficient non-production environment. For example, consider creating ephemeral dev/test environments, which are torn down after a test pass.
Size environments according to the test case. Unit testing might only require matching library and OS versions in a smaller setup. Resiliency testing might need a pre-production environment with identical service SKUs. Performance testing might demand a production-scale deployment.
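One way to keep validation repeatable across these environments is to parameterize the same test suite by target environment. The following sketch assumes pytest and a hypothetical TARGET_BASE_URL variable that the pipeline sets for each environment:

```python
# Minimal pytest sketch: the same smoke tests run against dev/test, QA, or UAT,
# selected by a hypothetical TARGET_BASE_URL environment variable set by the pipeline.
import os
import urllib.request

import pytest

BASE_URL = os.environ.get("TARGET_BASE_URL", "http://localhost:8080")

@pytest.mark.parametrize("path", ["/health", "/api/orders"])
def test_endpoint_responds(path: str) -> None:
    """Each critical endpoint should return HTTP 200 in every environment."""
    with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=10) as response:
        assert response.status == 200
```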
✓ Automate testing and other quality checks when possible
Automation is great for ensuring consistency and quickly spotting unexpected configuration changes. Determine which changes can be checked through automated testing. For example, security scans are perfect for automation when you move to a developer environment. Not all tests need to be automated, and some tests can't be automated. For instance, UAT requires comparing the implementation with users' expectations and getting their approval.
Note
Even in a highly automated environment, decision-makers are essential. The level of human involvement might vary, but someone is always responsible for transitioning to the next environment. This process can include defining rules for automated checks or conducting manual reviews. In later environments that have lower risk tolerance, manual checks might be necessary.
Various testing tools are available to automate and streamline different types of testing, such as Azure Load Testing and Azure Pipelines for orchestrating automated testing.
✓ Set up approval processes
As your deployment moves through different environments, it's important to use testing and approval processes to validate the changes. This validation should mature and become more rigorous over time. For example, when you move from a developer workstation to a shared developer environment, basic security scans and peer reviews are usually sufficient. But later, you might need to include business stakeholders. Consider the following guidelines to ensure a structured and efficient deployment process:
Approval for regular releases: Moving from staging to production requires its own set of criteria. Decisions about releases, like moving to production, need approval from stakeholders, clear documentation, or both. The workload team should define who is part of the approval process and their responsibilities. In some regulatory cases, auditors might also be included in the decision-making process. These roles and responsibilities should already be established at Level 2.
Have a separate process for hotfixes: For critical situations like deploying security patches, you might need an expedited deployment process. Create an emergency process to accelerate these high-priority fixes. This process could include approval from only key stakeholders and technical members. Alternatively, you can develop a pipeline that bypasses some approvals for quicker deployment.
Consider the risk to the business or customers when determining which steps to skip. High-investment and low-risk tests can be skipped in emergencies, while high-risk and low-investment tests should always be run. The directly responsible individual (DRI) should make the final decisions with input from key stakeholders and technical decision-makers.
✓ Implement automated deployments
Maintain separate deployment cycles for different layers based on their expected rate of change. In some cases, combining cycles might be necessary, depending on interdependencies and downtime requirements. However, in most cases, strive for granularity by independently managing each layer with least privilege. Ensure that changes in one layer don't affect the others.
For example, networking infrastructure changes should be less frequent than application code changes. Manage these changes separately through a streamlined process via quality control. To create a process, build deployment pipelines that are aligned with the workload layers. Run tests on infrastructure-as-code assets in a controlled environment before you deploy them to production.
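As one possible quality gate for the infrastructure pipeline, the following sketch assumes the Azure CLI with the Bicep extension, a hypothetical main.bicep template, and a test resource group. It compiles the template and previews the changes before anything is promoted to production:

```python
import subprocess

def validate_infrastructure(template: str = "main.bicep",
                            test_resource_group: str = "rg-workload-test") -> None:
    """Compile the template and preview its effect in a controlled environment."""
    # Catch syntax and linter errors by compiling the Bicep template.
    subprocess.run(["az", "bicep", "build", "--file", template], check=True)

    # Preview what the deployment would change in the test resource group.
    subprocess.run(
        [
            "az", "deployment", "group", "what-if",
            "--resource-group", test_resource_group,
            "--template-file", template,
        ],
        check=True,
    )

if __name__ == "__main__":
    validate_infrastructure()
```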
✓ Develop a health model
After you have a basic monitoring system in place, combine business context with monitoring data to quantify the health status of individual workload components and of the workload as a whole. This exercise, known as health modeling, involves contextualizing monitoring values from infrastructure and applications with business context. To build an effective health model, consider the following key principles:
Give system-observed data context. Setting thresholds is an important part of health modeling. To ensure that thresholds are meaningful to your workload, provide numeric values with context. For example, while high CPU usage between 70% and 90% might generally indicate an unhealthy status in one workload, it could be healthy for another workload that efficiently uses available resources.
Alert on changes. Changes in these values indicate shifts in health status and should prompt action from DRIs. Therefore, alerting is another core component of health modeling. Avoid simply enabling every default metric alert and routing all alerts to a support center. Instead, raise alerts based on changes in your health model. The information contained in the alerts must be meaningful and actionable, notifying the right teams about specific problems that need investigation or corrective action (a minimal sketch follows this list).
Visualize your health model. Use visualization tools from your monitoring platform to easily share workload health signals with different stakeholders. Some stakeholders might require specific statistics, like application availability. And operations teams need access to all health signals. Setting up different dashboards can help provide each team with the information that they need.
Over time, develop a strategy to track and analyze historical workload health trends.
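To make these principles concrete, here's a minimal health model sketch with hypothetical components and thresholds. It contextualizes raw metrics into Healthy, Degraded, and Unhealthy states and raises an alert only when a component's state changes:

```python
from dataclasses import dataclass, field

@dataclass
class HealthRule:
    """Hypothetical thresholds that give a raw metric workload-specific context."""
    metric: str
    degraded_above: float
    unhealthy_above: float

@dataclass
class HealthModel:
    rules: dict[str, HealthRule]
    last_states: dict[str, str] = field(default_factory=dict)

    def evaluate(self, component: str, value: float) -> str:
        rule = self.rules[component]
        if value > rule.unhealthy_above:
            state = "Unhealthy"
        elif value > rule.degraded_above:
            state = "Degraded"
        else:
            state = "Healthy"
        # Alert on changes in health state, not on every raw metric sample.
        if self.last_states.get(component) != state:
            print(f"ALERT: {component} changed to {state} ({rule.metric}={value})")
        self.last_states[component] = state
        return state

model = HealthModel(rules={
    # For this workload, sustained CPU above 90% is unhealthy; another workload
    # might treat the same value as healthy use of available resources.
    "api": HealthRule(metric="cpu_percent", degraded_above=70, unhealthy_above=90),
    "database": HealthRule(metric="latency_ms", degraded_above=200, unhealthy_above=500),
})

model.evaluate("api", 75)   # ALERT: api changed to Degraded (cpu_percent=75)
model.evaluate("api", 78)   # no alert: state unchanged
```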
✓ Set up your incident management process
When your workload is in production, it's normal to deal with incidents like platform outages, data corruption, and more. The key is to handle these incidents effectively within the targets you've set for your workload. So, before going to production, have a structured support plan as the first step in incident management.
Formalize the support operations by defining who is on call, what their duties are, and how handoffs between on-call personnel are managed. Ideally, there shouldn't be gaps in on-call rotation, and everyone knows who is responsible for handling incidents at any given time.
For each type of incident, make sure you have a well-defined playbook or response process. For example, you can use the results from your failure mode analysis (FMA) or security baseline.
Use the health model to run basic tests of incident management readiness, where an operator monitors the impact of a simulated issue while an on-call developer troubleshoots it.
Ensure that the system meets the quality standards promised to users and prevents violations of service-level agreements.
At previous levels, the workload team focuses on building features and getting the system into production. At this level, the focus shifts from building features to maintaining and improving a live system. With real users now relying on it, the priority becomes change management through efficient day-2 operations such as triage, maintenance, upgrades, and troubleshooting.
The main strategy is to use real-world experience to improve operations. Testing also becomes a non-negotiable practice to reduce risks associated with changes. You must integrate testing into every part of development, from fixing bugs to adding features and refining incident response. Without it, serious problems might go undetected until they reach production.
At this level, technical debt becomes a real concern. Implementations that are less than ideal might go live, which can complicate maintenance. Teams should analyze the maintenance burden and focus on reducing it.
Key strategies
✓ Use safe deployment practices
After the workload is in production, changes typically fall into three key types: routine updates, new feature updates, and emergency updates. Use safe deployment practices to keep the system stable during these changes. Regardless of the type of change, treat every change as a potential point of failure for the workload's users.
Integrate the following strategies into your change control process:
Validate continuously and comprehensively. Test early and often throughout the development life cycle and as changes progress through different environments. Ideally, each time an artifact changes, create tests focused on those changes. Then run the full test suite to validate flows end-to-end. Test results provide validation data, but business stakeholders should still approve these changes.
Trade-off: Running the entire test suite builds confidence in deployments. However, it might not be practical for all changes because of time and cost. Balance thorough testing with cost considerations. Tailor the approval process based on the impact of changes. Minor changes should have a simplified procedure, while significant changes, like new features, require thorough review.
At this level, you can adopt advanced operational concepts such as regional failovers. The goal is to fully automate these processes, with a focus on self-healing in most scenarios. These processes must also be tested extensively.
Implement versioning for your APIs. Manage changes to your data model carefully to ensure backward compatibility. An API versioning strategy helps existing systems continue to run smoothly after changes are deployed. Retrospective versioning can be difficult, so establish a strategy early.
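As an illustration of backward compatibility through versioning, the following sketch assumes FastAPI as the web framework and a hypothetical orders endpoint. The existing contract stays available under /v1 while the changed data model ships under /v2:

```python
# Minimal API versioning sketch, assuming FastAPI and a hypothetical orders endpoint.
from fastapi import APIRouter, FastAPI

app = FastAPI()
v1 = APIRouter(prefix="/v1")
v2 = APIRouter(prefix="/v2")

@v1.get("/orders/{order_id}")
def get_order_v1(order_id: int) -> dict:
    # Existing contract: callers depend on this shape, so it stays unchanged.
    return {"id": order_id, "status": "shipped"}

@v2.get("/orders/{order_id}")
def get_order_v2(order_id: int) -> dict:
    # The new data model is introduced under a new version instead of breaking v1.
    return {"id": order_id, "status": "shipped", "statusHistory": []}

app.include_router(v1)
app.include_router(v2)
```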
Roll out incremental updates. By Level 3, deployment processes are standardized by using automated pipelines across all environments. At Level 4 maturity, the workload is in production. The focus shifts to refining incremental updates, including managing release cycles.
Deploy small, frequent updates to simplify validation for a small set of changes. Automate validation tasks like load testing, deployment to test environments, and A/B testing.
Note
Safe deployment patterns, like canary and blue-green deployments, provide flexibility and reliability through side-by-side deployments. For example, in blue-green deployments, a new environment is built, traffic is shifted, and the old environment is decommissioned. Other deployment techniques include feature flags and dark launches. These approaches allow testing in production before changes are rolled out to all users. This capability is available with specific Azure services such as Azure App Service, where updates can be rolled out by gradually swapping between deployment slots.
Recover from deployment errors. Expect some updates to fail. With incremental updates, troubleshooting becomes faster when problems occur. If a failure occurs, stop the system to prevent further damage and implement changes to fix the problem. Restoring from backups is acceptable if it maintains continuity. The goal is to move forward to a stable version instead of relying solely on rollback procedures.
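The following sketch combines these ideas into a progressive rollout. The shift_traffic and is_healthy helpers are hypothetical placeholders for your traffic-routing mechanism and health model. Exposure increases in small steps, and traffic moves back to the stable version if health degrades:

```python
import time

def shift_traffic(version: str, percent: int) -> None:
    """Hypothetical placeholder: update your load balancer, slots, or feature flags."""
    print(f"Routing {percent}% of traffic to {version}")

def is_healthy(version: str) -> bool:
    """Hypothetical placeholder: query the health model for the new version."""
    return True

def progressive_rollout(new_version: str, stable_version: str,
                        steps: tuple[int, ...] = (5, 25, 50, 100)) -> bool:
    """Increase exposure in small increments; fall back to the stable version on failure."""
    for percent in steps:
        shift_traffic(new_version, percent)
        time.sleep(1)  # in practice, a soak period long enough to observe real traffic
        if not is_healthy(new_version):
            shift_traffic(stable_version, 100)
            return False
    return True

if __name__ == "__main__":
    progressive_rollout("2.1.0", "2.0.3")
```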
✓ Optimize build operations
At Level 3, you should have separate deployment cycles for different layers of the architecture based on their rate of change. At a minimum, keep separate pipelines for infrastructure and application code.
Now that the workload is in production, revisit the layering approach. If possible, further decouple architectural components to enable more flexible release cadences. This approach reduces delays and minimizes failures in individual components. Also, run tests and long-running processes as parallel jobs to save time and enhance developer productivity.
✓ Validate incident response processes
At Level 3, you establish an on-call support system with playbooks to define responses to incidents. However, having a playbook is only the first step. Now that the workload is in production, you must validate and enhance the effectiveness of your incident management process and develop a robust communication plan. Consider the following practices:
Test responses to incidents. Incorporate responses from technology, people, and processes. To introduce realism into your validation efforts, we recommend that you run game days. Game days are planned events where faults are introduced to test the team's ability to detect and resolve problems. This approach ensures that the team has the right tools, resources, and procedures in place. Chaos engineering is another valuable technique that introduces controlled disruptions to observe the outcomes. Alternatively, manual methods such as disabling back ends on a global load balancer or performing a database failover can also be used to test the response.
Develop a communication plan. Clearly define communication responsibilities across the workload team, support teams, and emergency response personnel. Standardizing the cadence and format of internal status updates to business stakeholders fosters transparency and trust. In specific scenarios, such as security breaches, responsible disclosure to end users is required. Ensure that the appropriate type and level of information are clearly defined in these external communications.
Conduct an incident review. Treat every incident as an opportunity to learn from production. Use this process to identify weaknesses in your deployment and development processes, and commit to making system improvements.
✓ Optimize operations by using monitoring data from production
At Level 4, advanced monitoring should emit, correlate, and analyze metrics within a business context. At this level, improve its accuracy by learning from production. Use monitoring data to refine processes that were built on best guesses. Consider the following key examples:
The primary focus in Level 3 is developing a health model for the workload. At Level 4, fine-tune the alerting system and set realistic objectives and service-level indicators.
As part of day-2 operations, minimizing configuration drift should be a key priority. Without this focus, the runtime environment might gradually diverge from its intended state. Begin by capturing a snapshot of the known-good configuration. Then take advantage of observability metrics from production to compare current behavior against that baseline, as sketched after this list. This approach ensures ongoing alignment with the intended system state.
This level is ideal for introducing feedback loops to better understand how the system behaves under specific stressors and to predict the impact of new features. System telemetry guides these feedback loops by providing key insights that help forecast workload changes and shape proactive solutions to potential problems. You can also use this data to help you prioritize technical debt.
As a general practice, fine-tune the monitoring stack based on observability data and patterns that you see in production.
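For the drift check described above, one lightweight approach is to compare the configuration observed in production against the stored known-good snapshot. A minimal sketch, assuming hypothetical JSON snapshot files:

```python
import json

def detect_drift(baseline_path: str, observed_path: str) -> dict:
    """Compare observed production configuration against a known-good snapshot."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(observed_path) as f:
        observed = json.load(f)

    drift = {}
    for key, expected in baseline.items():
        actual = observed.get(key)
        if actual != expected:
            drift[key] = {"expected": expected, "actual": actual}
    return drift

if __name__ == "__main__":
    # Hypothetical files: a snapshot captured at release time and the current export.
    changes = detect_drift("baseline-config.json", "observed-config.json")
    for setting, values in changes.items():
        print(f"Drift in {setting}: expected {values['expected']}, found {values['actual']}")
```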
✓ Automate maintenance
At Level 3, automation efforts primarily focus on deploying to production. By Level 4, teams have significantly reduced manual work by automating build, test, and deployment processes by using continuous integration and continuous delivery pipelines. Like with quality gates, specific approvals might also be managed through automated workflows.
At Level 4, operational automation should be driven by real-world production experience and focused on addressing technical debt.
Consider the following day-2 automation examples.
| Process | Benefit |
| --- | --- |
| Automate rotation of certificates, API keys, and other secrets. | Automation guarantees timely rotations and eliminates the need for manual intervention, which saves time and reduces the likelihood of human error. |
| Automate routine maintenance of infrastructure. | Routine infrastructure maintenance requires extensive testing and coordination. Automation can expedite these tasks, reducing manual effort and minimizing risks. |
| Automate the emergency response process. | Without proper automation, people might resort to hasty, uncoordinated actions during an emergency release, potentially leading to further problems. |
| Automate scaling of resources on load spikes and drops. | Autoscaling allocates resources dynamically based on demand. When demand decreases, resources are deallocated without excessive operational overhead, which results in more efficient use of resources. |
| Automate data retrieval and delivery. | Reduces the time and effort required to fulfill data requests from users. Instead of someone manually accessing databases, scripts retrieve the relevant data and send it to the user. |
| Automate the creation of developer environments based on specific criteria. | Ensures that environments are consistently created to facilitate safe changes in the workload as part of the team's day-2 operations. |
Note
When you develop a deployment automation strategy, start with known and predictable tasks. Factor in common points of failure. After these points are automated, extend coverage to handle unforeseen problems, some of which might require manual intervention. For example, start by automating routine tasks like infrastructure updates because they're more manageable. Then tackle emergency hot fixes because they might include unknown failure scenarios.
For example, a team might routinely deploy a workload by using controlled exposure to users across all geographies. This process might take several days to complete. They also need the ability to deploy hot fixes sooner by skipping specific steps. The automation process should account for those expedited deployments.
The primary goal is to identify repetitive, human-driven tasks that might have been overlooked in earlier stages because of deadlines. But you shouldn't automate everything. Return on investment should guide automation. Prefer using existing technologies and knowledge instead of starting with entirely new tools. If lightweight tooling is needed, evaluate its life cycle and maintenance requirements.
At Level 4 maturity, focus on gaining operational efficiency by evaluating engineering assets and processes. Identify which assets are essential but not core to your business.
For these assets, consider the following points:
Use shared tools already available in your organization.
Consider non-Microsoft software for specific tasks, like data conversion.
Prebuilt assets come with support channels and can replace custom solutions. This approach reduces the operational burden of your team-created solutions. Evaluate how well these resources meet your needs and identify any remaining gaps.
Explore the following areas of the workload:
Evaluate your custom code. Instead of writing custom code for tasks like parsing, evaluate open-source solutions that are considered industry standard. Using these tools can reduce the need for code maintenance and result in a smaller code base. Explore options already available within your organization. There might be existing libraries that you can integrate into your workload to handle routine tasks such as authentication.
Evaluate your tool chain. Assess areas where you can rely on other teams that use similar tools. Adjust your use of libraries, templates, and modules accordingly. Align infrastructure-as-code tools across the organization to streamline operations.
Evaluate your processes. Identify centralized processes that can do tasks that you might have implemented yourself, such as security scanning. Instead of managing your own quarantine process for NuGet packages, use the organization's existing security team's process by informing them of the modules used in your workload.
Supportability is another key area. Early on, development teams often handle support themselves by monitoring metrics and fixing live problems. At this stage, consider setting up dedicated roles like on-call engineers. If your organization has a shared support team, use it to reduce the support load on developers.
Note
If possible, transition day-to-day support to external vendors. Vendors don't have deep context like the development team or the architects who bring the workload to production. Before you hand off tasks to a vendor, make sure that the system is stable in production and clearly define management tasks. Vendors need key elements to succeed. Define thresholds in the health model that represent Healthy, Unhealthy, and Degraded states. Train vendors on playbooks, tools, and other troubleshooting resources. If they can't identify causes, set up well-defined pathways for escalating and routing problems to the workload team.
✓ Manage technical debt at a regular cadence
Technical debt is the result of shortcuts that you take during development to meet deadlines, which can result in implementations that are less than ideal. Teams should work on reducing this debt by analyzing maintenance complexity and time. If technical debt isn't addressed, systems can become more complex and harder to maintain or scale. This complexity slows innovation as developers spend more time fixing problems instead of working on new features.
Consider the following tactical recommendations for handling technical debt:
Track technical debt alongside feature work.
Reserve capacity in every sprint for addressing technical debt, separate from feature development. Occasionally, dedicate entire sprints to addressing technical debt.
Add the proposed resolution to the backlog right away if you plan to incur new technical debt for new features.
Technical debt is a normal part of development and an opportunity for improvement. As new features are added, debt accumulates. Balance the effort of paying off old debt with the new debt from developing new features.
Keep improving and adapting, but never assume that you're done.
You establish a well-architected workload on Azure as you progress through Levels 1 to 4. Your design sets up operational practices like monitoring, testing, deployment, and automation, which prepares the system for production. After the system is in production, you focus on stabilizing operations and managing changes to protect the user experience.
In the final stage, maturity means operating the workload at scale, which keeps it reliable and up-to-date. It also requires proficiency in identifying areas where the current design has reached its limits and preparing for changes in business requirements. Assumptions about constraints can become outdated as the ecosystem evolves. Staying static might result in regression because the environments, practices, and tools continue to evolve. Without investment in Level 5 approaches, your workload risks falling behind.
Level 5 isn't an end goal or a technical checkpoint. It's a mindset shift focused on modernizing culture, processes, tools, and technology. Operations are treated with the same rigor, investment, and innovation that the workload's application receives.
Key strategies
✓ Spot rearchitecture opportunities based on observed growth and future potential
Workload architectures are intentionally designed with constraints and have limited lifespans. Through Level 4, your architecture likely meets business requirements effectively. At this level, evaluate how the system handles new usage patterns, technologies, growth, and operational challenges.
For example, a solution that works for thousands of users might fail when scaled to tens of thousands. Increased data volume can introduce atomicity problems, while greater demands for performance and compliance can push your architecture to its limits. These pressures can prevent the delivery of new features and create scaling bottlenecks.
At this level, recognize tipping points and identify areas where you might need to rearchitect. Consider the following areas:
Your initial choice of tools, frameworks, and platform services might have suited your business needs. However, as your system evolves, these tools might become rigid or tightly coupled, libraries might lack extensibility, and platform services might reach end-of-life, which breaks existing dependencies.
Explore tools and services within the broader ecosystem that provide strong support and widespread community adoption. Select a modular, loosely coupled architecture for easier replacement or upgrades.
At earlier levels, the workload team might manage the entire technical stack. This approach can be effective initially, but it often results in increased pressure and operational overhead as the system scales. To ease this burden, consider offloading responsibilities to specialized teams, like those focused on networking, security, or observability. Their expertise enables the core workload team to focus on delivering product value.
Managing customer-dedicated (or single-tenant) instances can increase both cost and operational overhead. This challenge often signals the need to adopt a multitenant architecture. This shift requires broader architectural and operational investments, such as handling tenant onboarding and data access isolation.
Rethink your deployment strategy to scale predictably and handle failure isolation and performance at scale. Explore proven practices like the Deployment Stamps pattern.
A scale unit is a logical group of resources that scale together to handle increased load while being monitored independently. These units are automatically deployed as repeatable blocks based on defined thresholds and can be replicated when needed.
However, this approach can lead to overscaling of specific services, which results in added costs.
For more information, see Scale-unit architecture.
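As a simplified illustration of threshold-driven stamp replication, the following sketch uses a hypothetical tenants-per-stamp capacity and a deploy_stamp placeholder for the IaC pipeline that deploys one repeatable unit:

```python
import math

TENANTS_PER_STAMP = 500  # hypothetical capacity threshold for one scale unit

def deploy_stamp(stamp_index: int) -> None:
    """Hypothetical placeholder: run the IaC pipeline that deploys one repeatable stamp."""
    print(f"Deploying stamp {stamp_index}")

def reconcile_stamps(active_tenants: int, current_stamps: int) -> int:
    """Replicate the scale unit when the defined threshold is exceeded."""
    required = max(1, math.ceil(active_tenants / TENANTS_PER_STAMP))
    for index in range(current_stamps, required):
        deploy_stamp(index)
    return required

if __name__ == "__main__":
    reconcile_stamps(active_tenants=1800, current_stamps=3)  # deploys stamp index 3 (the fourth unit)
```

In practice, the threshold and the reconciliation trigger come from your health model and capacity planning rather than a fixed constant.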
Immutable environments are deployment setups where systems are never modified after deployment. Instead of updating a deployment stamp in place, tear down the old stamp and redeploy it with the necessary changes. This approach requires side-by-side deployment models such as blue-green deployments. To adopt this practice, pipelines must be prepared to automatically deploy a new stamp to handle hyperscaling or replace unhealthy units.
Your Level 4 safe deployment practices should prepare you for this transition. There should be no regression, and users should transition to the new stamp smoothly.
✓ Use automation to further reduce friction
Transitioning from Level 4 to Level 5 isn't only about increasing automation. It's also about using automation to reduce friction and deciding when these tasks should run.
Consider the following examples:
Automated creation of local developer environments: Provide a standardized experience by configuring each developer workstation with the same set of tools, library versions, and other development assets. This consistency gives all developers access to a uniform environment, regardless of experience level. It also facilitates better collaboration, knowledge sharing, and onboarding. Use virtualized development environments to simplify this process.
Automated alert triage and resolution workflows: Use automation to categorize and prioritize alerts based on predefined rules, and trigger resolution workflows. For high-priority alerts, the system can notify the relevant team and start the resolution process to speed up response times. A minimal sketch follows this list.
Automated incident remediation: Implement self-healing scripts that automatically restart services or shift workloads when problems are detected. For example, if a web server crashes, a script can restart the service or reroute traffic to a backup server to minimize downtime.
Automated resource scheduling: As maturity in other pillars progresses, you likely need to introduce automation to support their goals. For example, you can shut down noncritical resources and environments during off-peak hours to reduce costs and optimize resource usage.
Automated tenant management: Streamline the onboarding and offboarding of tenants in multitenant environments. When a new tenant signs up, automation can create user accounts, provision resources, and configure settings.
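The following minimal sketch illustrates the alert triage and incident remediation examples. The rules, alert shape, and handlers are hypothetical; in a real workload they map to your alerting platform and runbooks:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Alert:
    source: str
    severity: str   # for example, "critical" or "warning"
    message: str

def restart_service(alert: Alert) -> None:
    """Hypothetical self-healing action: restart the failed service or reroute traffic."""
    print(f"Restarting {alert.source} after: {alert.message}")

def notify_on_call(alert: Alert) -> None:
    """Hypothetical placeholder: page the on-call engineer through your alerting tool."""
    print(f"Paging on-call team for {alert.source}: {alert.message}")

# Predefined triage rules: match a condition to a resolution workflow.
TRIAGE_RULES: list[tuple[Callable[[Alert], bool], Callable[[Alert], None]]] = [
    (lambda a: a.severity == "critical" and a.source == "web-server", restart_service),
    (lambda a: a.severity == "critical", notify_on_call),
]

def triage(alert: Alert) -> None:
    """Apply the first matching rule; fall back to the routine queue otherwise."""
    for matches, handler in TRIAGE_RULES:
        if matches(alert):
            handler(alert)
            return
    print(f"Queued for routine review: {alert.message}")

triage(Alert(source="web-server", severity="critical", message="Process crashed"))
```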
✓ Create developer environments for each feature or change
Day-2 operations focus on implementing safe changes. Each change goes through different environments that are set up for specific purposes, such as development, testing, and security. If you followed the Level 4 guidance, the creation of those environments is already automated. At this level, extend that automation so that an isolated environment can be provisioned for each feature or change, which lets the team validate work independently before it merges.
✓ Share knowledge and contribute to organizational maturity
Levels 1 through 4 focus on internal readiness and managing workload change. Level 5 is different because it creates an opportunity for workload teams to share their designs, successes, and failures with other teams across the organization. A team doesn't have to experience failure firsthand if it can learn from others who have experienced similar challenges.
This shift involves transitioning from Operational Excellence principles at the individual workload level to the organization that invests in centralized operations. These centralized teams are composed of dedicated engineering groups that build deployment tools equipped with guardrails, automation, observability, and testing.
Your workload team should share techniques, tools, and insights that have been effective with peer teams. Begin by engaging with teams that your workload depends on. Then connect with teams that depend on your workload. As you share experiences and outcomes, learn from those teams. You should see increased knowledge sharing through shared platforms, documentation, and community engagement.
✓ Enable self-service capabilities for your workload's various job functions
As you experience repetitive, unplanned tasks within the workload's routine development responsibilities, evaluate whether they should instead be delivered as scripted solutions that team members can responsibly invoke when needed. Build and maintain self-service capabilities for these unplanned tasks to enhance team agility and autonomy.
Start with time-consuming or error-prone unplanned tasks that are cost-effective to deliver to the workload team as a self-service solution. The solution must optimize team members' time and build consistency into these tasks. This approach results in a return on investment over the lifespan of the self-service solution.
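For instance, a recurring unplanned task can be wrapped in a small script that any team member can invoke on demand. The following sketch assumes a hypothetical test-database refresh task:

```python
# Minimal self-service sketch: a recurring unplanned task (here, a hypothetical
# test-database refresh) wrapped in a script that team members can invoke on demand.
import argparse

def refresh_test_database(environment: str, sample_size: int) -> None:
    """Hypothetical placeholder: copy a sanitized sample of production data."""
    print(f"Refreshing {environment} test database with {sample_size} sampled records")

def main() -> None:
    parser = argparse.ArgumentParser(description="Self-service test data refresh")
    parser.add_argument("--environment", choices=["dev", "qa"], default="dev")
    parser.add_argument("--sample-size", type=int, default=1000)
    args = parser.parse_args()
    refresh_test_database(args.environment, args.sample_size)

if __name__ == "__main__":
    main()
```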
Follow the approach and capabilities described in Start your platform engineering journey.