Course: Network Admin to Site Reliability Engineer (incl. guidance)

$949.00
$1,148.29 incl. VAT

Duration: 80 hours

Language: English (US)

Access duration: 365 days

Details

Site Reliability Engineering combines software engineering and IT operations to help create scalable and reliable software systems. In this learning path, you will explore the skills required to progress from Network Admin to DevOps Engineer, then Chaos Engineer, and finally Site Reliability Engineer.

When you choose this learning path you get:

  • Access to the courses Part 1: Network Admin, Part 2: DevOps Engineer, Part 3: Chaos Engineer, and Part 4: Site Reliability Engineer. You will also get access to many more courses.
  • Guidance from our Learning & Development team. Together with you, we set goals, create a schedule, and monitor your progress.

This program is divided into four parts, all focused on teaching you the necessary skills to become a Site Reliability Engineer.

Part 1: Network Admin

The first part covers crucial aspects of Network Administration, with a focus on OS deployment, backup and recovery, monitoring distributed systems, and SRE scenario planning.

Part 2: DevOps Engineer

This part starts with build and release engineering best practices, followed by automation and simplicity best practices for SRE. You then move on to SRE postmortem culture best practices and finish with cloud and container architectures for the SRE.

Part 3: Chaos Engineer

Now you will learn about troubleshooting, emergency response and incident handling. Furthermore, you will learn about testing for reliability, load balancing, overload and cascading failures, distributed reliability, data pipelines and integrity.

Part 4: Site Reliability Engineer

In the last part, you will learn about scaling the SRE team, operational loads, communication and collaboration, managing software reliability metrics, and the SRE engagement model.

Result

After completing all the chapters of this learning path, you will have solid knowledge of all the important aspects of becoming a site reliability engineer.

Prerequisites

No formal prerequisites. However, it is recommended to be familiar with Site Reliability Engineering, Networking and DevOps.

Target audience

System Administrator, Network Administrator

Content

Network Admin to Site Reliability Engineer (incl. guidance)

80 hours

Site Reliability: Engineering

Site Reliability Engineers are often considered the link between software development and operations. In this course, you'll explore the principles of site reliability engineering as well as common concerns such as measuring and managing risk, and risk tolerance. You'll also learn how to ensure a satisfactory level of service by implementing Service Level Objectives, Service Level Agreements, and Service Level Indicators.
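
As a rough illustration of how a Service Level Indicator can be checked against a Service Level Objective, the sketch below computes an availability SLI and the remaining error budget. The function name, the event counts, and the 99.9% target are hypothetical values chosen for the example, not figures from the course.

```python
def error_budget_report(good_events: int, total_events: int, slo_target: float) -> dict:
    """Compare an availability SLI with an SLO target (all inputs are hypothetical)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events                    # measured success ratio
    allowed_failures = (1.0 - slo_target) * total_events  # the error budget
    actual_failures = total_events - good_events
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "actual_failures": actual_failures,
        "budget_remaining": allowed_failures - actual_failures,
    }


# Assumed example: one million requests, 500 of them failed, 99.9% SLO.
report = error_budget_report(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(report)
```

Under these assumed numbers the service meets its objective and has used about half of its error budget.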

Site Reliability: Tools & Automation

There are numerous tools available to Site Reliability Engineers to help with planning, managing, deploying, automating, and monitoring services and infrastructure. In this course, you'll explore these tools as well as some of the benefits of automation and the automation process. You'll also discover common pitfalls and failures, as well as how to manage incident postmortems.

OS Deployment Strategies: Upgrading & Maintaining Systems

When it comes to production environments, administrators are typically responsible for the deployment, management, and continuous updating of client and server systems. In this course, you'll learn about deploying and updating systems, and Windows 10 upgrade and migration considerations. You'll then explore upgrading the edition of Windows 10, as well as the supported Windows 10 upgrade paths. Next, you'll learn about the options available for each of modern, dynamic, and traditional deployments. You'll then examine how to migrate files and settings, the Windows Assessment and Deployment Kit, and the Microsoft Deployment Toolkit. Lastly, you'll explore Windows To Go, Windows Updates, Windows 10 servicing and support features, and the Windows Servicing Channels feature.

OS Deployment Strategies: Deploying Modern Systems

Cloud services are rapidly changing the nature of how technology services are implemented, and migrating toward a cloud-based model can provide many benefits to an organization. In this course, you'll explore the various cloud computing deployment models to understand the flexibility, speed, and infrastructure benefits of moving to a cloud solution. You'll also discover the benefits of cloud services models such as Infrastructure as a Service, Platform as a Service, Software as a Service, as well as Identity as a Service and Network as a Service.

OS Deployment Strategies: Maintaining & Managing Modern Systems

Keeping your systems current is a primary concern of any organization, not only to ensure that your systems do not create vulnerabilities or expose weaknesses that could be exploited by attackers, but also to ensure reliable and stable day-to-day operations. In this course, you'll explore the numerous Windows features available to administrators to simplify the management and maintenance of client and server systems. You'll also learn about features such as Group Policy and Windows Server Update Services that allow administrators to centralize management and configuration of operating systems and user settings. Finally, you'll learn how to deploy and maximize these administrative features, as well as others that come standard with Windows.

Backup & Recovery: Business Continuity & Disaster Recovery

Disasters can occur at any time and to any sized organization, so administrators should invest the time and resources to properly plan for business continuity and disaster recovery. In this course, you'll learn how to plan for business continuity, assess risk, and perform business impact assessments. You'll also learn about system resilience, sensitive data types, and data classifications. Lastly, you'll see a comparison of Recovery Time Objective and Recovery Point Objective, and examine what to include when preparing a disaster recovery training plan.

Backup & Recovery: Enterprise Backup Strategies

Critical information must be backed up and protected for a company's survival. In this course, you'll learn about onsite and offsite backup and recovery solutions. You'll examine the three main cloud providers - Amazon Web Services, Microsoft Azure, and Google. You'll then learn about considerations for local backup and bring your own device backups. Finally, you'll explore the cultural impact involved in moving to the cloud and how employee communication and inclusion could be vital to a successful migration.

Backup & Recovery: Windows Client Backup and Recovery Tools

For the vitality of any company, data protection solutions are essential. There are numerous types of built-in backup and recovery tools available in the Windows 10 operating system. In this course, you'll learn about features such as File History, System Image Backups, and OneDrive and how they can be used to keep data safe and secure. Next, you'll examine how to repair a Windows 10 PC using the Advanced Startup options, enable volume shadow copies, and create a recovery drive for access to the Advanced Startup options. Finally, you'll learn about the various restore features, such as System Restore, that can be used to restore a system to a previously known working version.

Describing Distributed Systems

A distributed system involves numerous computers that work together but appear as a single computer to the operator. In this course, you'll learn how distributed systems can provide numerous benefits, including performance, availability, and autonomy. You'll also explore distributed systems in greater detail, and learn strategies and best practices for monitoring them.

Monitoring Distributed Systems

Principles and techniques are key in building a successful monitoring and alerting system. In this course, you'll explore the 'four golden signals' of monitoring while learning how to differentiate between symptoms and causes. You'll also learn about the guidelines for designing a monitoring system, questions to ask when creating rules for monitoring, and how to monitor for the long term.

Site Reliability Engineering: Scenario Planning

Scenario planning helps site reliability engineers strategically prepare for uncertainties that may disrupt or negatively affect services. In this course, you'll explore scenario planning use cases and the strategies utilized to prepare for disasters. You'll examine the functions of Disaster Recovery Testing (DiRT) and Customer Reliability Engineering teams, which help manage the impact of a disaster or disruption. Next, you'll identify disaster recovery testing events and recognize how to plan and design tests for DiRT. You'll move on to describe the production incident lifecycle and how to minimize production incidents. You'll identify unmanaged responses, how to rectify untrained responses, and the activities used to train response teams. Finally, you'll examine how to test people and how they self-organize and interact using various role-playing and test scenarios.

Final Exam: Network Admin

Final Exam: Network Admin will test your knowledge and application of the topics presented throughout the Network Admin track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

Build & Release Engineering Best Practices: Release Engineering

It's important to know why the roles, philosophy, and principles behind release engineering - a relatively new discipline of software engineering - are used for building and delivering software. In this course, you'll learn about the automated release system called Rapid, and how it can be used to provide a framework for delivering reliable software builds and releases. You'll also learn about configuration management and the importance of collaboration between release engineers and site reliability engineers.

Build & Release Engineering Best Practices: Release Management

Release management can guide your software development efforts from planning to deployment, resulting in better customer satisfaction with the end product. In this course, you'll learn about the benefits of using a release management process to manage and improve the development of a software build. You'll then move on to explore key concepts and principles that apply to release management, as well as common considerations and potential challenges to be aware of. Lastly, you'll learn about common toolsets used by release engineers and best practices related to continuous integration and release deployment.

Best Practices for the SRE: Automation

It has been proven that the automation of processes and systems commonly results in higher production rates and increased productivity. In this course, you'll learn the basics of automation, including benefits such as consistency, efficiency, problem-solving, and cost-savings. You'll examine the potential challenges of automation, including integration, complexity, and security. Lastly, you'll learn the value of automation for a Site Reliability Engineer and how SREs are using automation to improve daily operations and overcome obstacles.

Best Practices for the SRE: Use Cases for Automation

Site Reliability Engineers often use automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient. In this course, you'll learn about common use cases for automating systems and processes. You'll examine PowerShell capabilities that can be used to automate a variety of Windows administrative tasks including user creation, patching and updating, bulk enrollment, and software installations. Lastly, you'll learn about cluster turnup automation, reliability, and enabling failure at scale.

SRE Simplicity: Software System Complexity

Simple systems and software are proven to be easier to develop, understand, maintain, and test. For site reliability engineers, simplicity should be an end-to-end goal and cover all aspects of the software life cycle. In this course, you'll explore the importance of simple systems and software code. You'll identify the different types of software complexity, such as structural complexity, organizational complexity, complexity of use, and theoretical complexity, and learn how to differentiate between complex and complicated code. You'll move on to recognize how to measure complexity using various metrics, such as cyclomatic complexity, the Halstead metric, and the maintainability index. Lastly, you'll examine class coupling, using NPATH to measure the complexity of a piece of code, and prioritizing the simplification of projects and resources.
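
To make one of these metrics concrete, here is a minimal sketch that approximates cyclomatic complexity for a piece of Python source by counting decision points with the standard `ast` module. It is a simplification made for illustration (it treats the whole snippet as one unit and ignores constructs such as comprehensions and `match` statements), and the `SAMPLE` function is a made-up example.

```python
import ast

# Node types counted as one extra decision point each in this simplified sketch.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a or b or c" adds two extra branches, one per additional operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical code under measurement.
SAMPLE = '''
def classify(n):
    if n < 0 or n > 100:
        return "out of range"
    for d in (2, 3):
        if n % d == 0:
            return "divisible"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))
```

In this sample the count comes from two `if` statements, one `or`, and one `for` loop on top of the base path.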

SRE Simplicity: Simple Software Systems

When creating a simple software system, it is essential to identify and remove any unwanted complexity, whether accidental or essential. By eliminating complexity, site reliability engineers can ensure the final software product is more stable and reliable. In this course, you'll learn to differentiate between agility and stability and explore the importance of stability testing. You'll learn about key metrics and methods, such as production analysis and agile process metrics, which can be used by software development teams to ensure business goals are met. Lastly, you'll learn how to avoid introducing potential defects and bugs by limiting the number of negative lines of code in a project.

SRE Postmortems: Blameless Postmortem Culture Creation

There are various frequently used premortem and postmortem techniques adopted by site reliability engineers (SREs) to diagnose issues and come up with problem resolution ideas and alternative approaches. To do this effectively, SREs need to account for several factors at play, including the workplace culture and work collaboration. In this course, you'll learn how to promote a blameless culture - one without finger-pointing and animated language. You'll explore the key characteristics of good and bad postmortems, and discover the benefits of reviewing postmortems, sharing knowledge, giving feedback, and rewarding positive behavior. You'll then learn how to respond to postmortem culture implementation failure. Lastly, you'll discover how using the right postmortem templates and postmortem management tools can improve how you write postmortems and manage their associated data.

Cloud and Containers for the SRE: Cloud Architectures & Solutions

When deploying a medium- to large-sized cloud solution, there are many factors to consider, such as the numerous cloud environments to choose from and the different levels of management and security they each require. In this course, you'll explore these environments in detail, with a specific focus on their application in SRE. You'll examine the features, purpose, benefits, and potential drawbacks of services such as Software as a Service (SaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Anything as a Service (XaaS). You'll then investigate private, public, hybrid, and community clouds and on and off-premises software. Moving on, you'll delve into cloud architecture-related topics, such as orchestration, automation, elasticity, and cloud bursting. Lastly, you'll study cloud payment models, resource allocation, and on-demand self-service.

Cloud and Containers for the SRE: Containers

Containers in cloud computing are a form of operating system virtualization that allows users or administrators to deploy and run applications without the need for virtual machines. Containers can be deployed and run virtually anywhere, and support Linux, Windows, and Mac operating systems. In this course, you'll explore the various types of container solutions, including Kubernetes, Docker, and AWS. You'll outline how containers enable a more efficient continuous integration and delivery system and why they're needed for SRE. You'll also examine container storage, security, and migration. You'll list the high-availability solutions available for containers and investigate the Containers as a Service concept. Lastly, you'll recognize how the container ecosystem is revolutionizing software delivery, and identify the role of Docker and Kubernetes in container orchestration.

Cloud and Containers for the SRE: Implementing Container Solutions

Although containerization technologies such as Docker and Kubernetes can function independently, they can also benefit significantly from one another. Furthermore, open source automation tools such as Jenkins can be used to increase resource utilization and efficiency through pipelines. In this course, you'll explore the many benefits of pipelines, and learn how to use them to build code. You'll outline the benefits of Git and GitHub for revision control and identify the distributed version control tools that can be used to manage source code history. You'll then work with Jenkinsfiles to write pipeline-as-code and code to use at the build stage, after the build and test stages, and for recording failures. Next, you'll use the Jenkins Pipeline to set the environment variables and outline the key steps and factors needed in your code review. Lastly, you'll learn how to use Kubernetes to deploy applications with high availability, scalability, and resilience.

Final Exam: SRE DevOps Engineer

Final Exam: DevOps Engineer will test your knowledge and application of the topics presented throughout the DevOps Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

SRE Troubleshooting Processes

Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.

SRE Troubleshooting: Tools

Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use for resolving system issues. You'll start by examining the techniques of logging and whitebox and blackbox monitoring used to monitor system events. You'll then work with the various built-in Windows troubleshooting tools, namely the Event Viewer, Resource Monitor, and System Information tools. Next, you'll use Google Cloud Dataflow to process logs, before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems, and the open standards tool, Prometheus, for instrumenting software and exposing metrics.

SRE Emergency & Incident Response: Responding to Emergencies

Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of and how to respond to test-induced, change-induced, and process-induced emergencies and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps to correctly documenting emergencies, including the history of outages and mistakes. You'll then differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.

SRE Emergency & Incident Response: Incident Response

A well-prepared and organized approach is key to addressing and managing the aftermath of a system failure, security breach, or cyberattack. In this course, you'll explore the fundamental principles an SRE needs to be familiar with when responding to and managing incidents. You'll identify the goals, requirements, best practices, and key players involved in incident management. You'll learn how to deal with managed and unmanaged incidents and what's involved in an incident response plan. You'll identify incident response roles and responsibilities, and how to use incident metrics to manage incidents at scale. You'll outline what's involved in establishing a computer security incident response team (CSIRT), including each key team member's roles and responsibilities. Lastly, you'll examine what goes into an incident response policy.

SRE Testing Tasks: Software Reliability & Testing

Site reliability engineers (SREs) can use various testing techniques to ensure software operations are as failure-free as possible for a specified time in a specified environment. In this course, you'll explore multiple testing techniques, their purposes, and the tasks involved in their execution. You'll start by examining traditional software testing approaches, such as unit tests, integration tests, and system tests. Next, you'll investigate the components and use cases of various reliability metrics applied to SRE testing, including mean time to failure (MTTF), mean time to recover (MTTR), and mean time between failures (MTBF). Lastly, you'll outline several software testing approaches, such as stress, configuration, integration, acceptance, production, and canary testing, among others. You'll identify when, how, and by whom each of these testing types is carried out.
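
The relationship between these three metrics can be shown with a short calculation. The sketch below assumes one common convention for repairable systems, MTBF = MTTF + MTTR, and uses made-up uptime and repair durations. Other definitions exist, so treat it as an illustration rather than the course's own formula.

```python
def reliability_metrics(uptimes_hours: list, repairs_hours: list) -> dict:
    """Derive MTTF, MTTR, MTBF, and availability from observed durations."""
    if not uptimes_hours or not repairs_hours:
        raise ValueError("both duration lists must be non-empty")

    mttf = sum(uptimes_hours) / len(uptimes_hours)   # mean time to failure
    mttr = sum(repairs_hours) / len(repairs_hours)   # mean time to recover
    mtbf = mttf + mttr                               # assumed convention, see lead-in
    availability = mttf / mtbf                       # fraction of time in service
    return {"mttf": mttf, "mttr": mttr, "mtbf": mtbf, "availability": availability}


# Hypothetical observations: three runs before failure and three repairs, in hours.
metrics = reliability_metrics(uptimes_hours=[100, 200, 300], repairs_hours=[2, 4, 6])
print(metrics)
```

With these assumed inputs, the service runs for 200 hours on average before failing, takes 4 hours on average to recover, and is available roughly 98% of the time.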

SRE Testing Tasks: Testing Considerations

Site reliability engineers (SREs) need to create a healthy test and build environment to ensure that products being distributed integrate and function as expected. In this course, you'll explore the fundamentals of creating a robust SRE test and build environment, looking at the standard tools and techniques available for testing at scale. You'll examine disaster and statistical testing, and learn about working with deadlines and production configurations. You'll investigate the topic of test failures, identifying why an SRE should expect specific tests to fail and how results for test failures can help maximize knowledge about operations and end-users. Lastly, you'll look at the why and how of incorporating break glass procedures, integration testing configuration files, and fake back-end versions into your testing procedures.

SRE Load Balancing Techniques: Front-end Load Balancing

Today's distributed systems can consist of hundreds or even thousands of servers, and getting them to work together efficiently is a challenge. Load balancing is a multifaceted concept whose many techniques can help SREs face this challenge. In this course, you'll explore how front-end load balancing works and its associated techniques, concepts, and capabilities. You'll examine the characteristics of load balancers, their use in application delivery and security, and the use of DNS load balancers. You'll outline strategies for virtual IP load balancing, cloud load balancing, and handling overload. Finally, you'll learn how the Google Front End Service, Andromeda virtualization stack, Maglev network load balancing service, and the Envoy edge and service proxy are used for load balancing-related tasks.

SRE Load Balancing Techniques: Data Center Load Balancing

A Site Reliability Engineer (SRE) must know how to perform load balancing within the data center, both internally and externally. In this course, you'll learn about load balancing, including various methods for balancing loads in the data center. You'll begin by examining what data center load balancing is and its importance to performance, as well as load balancing policies. You'll then learn how to deal with unhealthy tasks using flow control, and tips and tricks for optimizing load balancing. Next, you'll examine methods for limiting connection pools with subsetting, and the various load balancing components. Lastly, you'll learn how to balance loads internally and externally using HTTPS and TCP/UDP, and how to balance loads using SSL and TCP proxy load balancing.

Site Reliability Engineer: Managing Overloads

Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also include types of occupational stress, which negatively affect an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate workloads and prioritize work before recognizing the specific consequences of overloads. You'll then describe how to manage client-side traffic using per-customer limits and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.
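
One way to picture client-side throttling is the adaptive scheme described in Google's Site Reliability Engineering book, where a client rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below is a simplified illustration of that formula: the class name is hypothetical, it uses cumulative counters instead of a sliding time window, and whether the course teaches this exact scheme is an assumption.

```python
import random


class AdaptiveThrottle:
    """Simplified client-side throttle; counters are cumulative, not windowed."""

    def __init__(self, k: float = 2.0):
        self.k = k          # multiplier: higher values throttle less aggressively
        self.requests = 0   # requests attempted by this client
        self.accepts = 0    # requests the backend accepted

    def rejection_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self, rng=random.random) -> bool:
        """Return False when the request should be rejected locally."""
        return rng() >= self.rejection_probability()

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1


# Hypothetical overloaded backend that accepted only 20 of 100 requests.
throttle = AdaptiveThrottle(k=2.0)
for i in range(100):
    throttle.record(accepted=(i < 20))
print(round(throttle.rejection_probability(), 3))
```

While the backend accepts most requests the probability stays at zero, so the client only starts shedding its own traffic once rejections accumulate.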

Site Reliability Engineer: Managing Cascading Failures

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.

Distributed Reliability: SRE Critical State Management

Anticipating failures that will affect your company's systems is a crucial site reliability engineer duty. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.

Distributed Reliability: SRE Distributed Periodic Scheduling

A distributed system requires constant maintenance to ensure failures don't interfere with its reliability and availability. Using periodic scheduling and replication, site reliability engineers can minimize the effect failures may have on a system's performance. One way to automate this process is to utilize the system daemon, cron. In this course, you'll explore how to use cron for task scheduling, the purpose, components, and operators involved in cron jobs, and the format and characters of cron syntax. You'll outline how cron works with distributed periodic scheduling and idempotency, and in large-scale deployments. Next, you'll review the Paxos distributed consensus algorithm, best practices for its use, and how it applies to distributed replication. Lastly, you'll practice scheduling a cron job and using cron syntax generators.
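
As a small illustration of cron syntax, the sketch below checks whether a timestamp matches a standard five-field expression (minute, hour, day of month, month, day of week). It is a deliberately minimal matcher written for this example: it supports only `*`, lists, ranges, and `/step` values, with no month or weekday names, and the helper names are hypothetical.

```python
from datetime import datetime

# Allowed (min, max) for: minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, lo: int, hi: int) -> set:
    """Expand one cron field such as '*/15', '1-5', or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if step < 1 or not (lo <= start <= end <= hi):
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return values


def matches(expression: str, when: datetime) -> bool:
    """Return True if 'when' falls on a minute selected by the cron expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five space-separated fields")
    minute, hour, dom, month, dow = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    cron_weekday = (when.weekday() + 1) % 7  # Python: Monday = 0; cron: Sunday = 0
    dom_ok, dow_ok = when.day in dom, cron_weekday in dow
    # Traditional cron treats the two day fields as OR when both are restricted.
    day_ok = (dom_ok or dow_ok) if fields[2] != "*" and fields[4] != "*" else (dom_ok and dow_ok)
    return when.minute in minute and when.hour in hour and when.month in month and day_ok


# "Every 15 minutes during the 02:00 hour, Monday through Friday"
print(matches("*/15 2 * * 1-5", datetime(2024, 1, 1, 2, 30)))
```

Because a distributed scheduler may launch the same job more than once after a failover, the job that such an expression triggers should ideally be idempotent.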

SRE Data Pipelines & Integrity: Data Pipelines

Site reliability engineers often find data processing complex as demands for faster, more reliable, and more cost-effective results continue to evolve. In this course, you'll explore techniques and best practices for managing a data pipeline. You'll start by examining the various pipeline application models and their recommended uses. You'll then learn how to define and measure service level objectives, plan for dependency failures, and create and maintain pipeline documentation. Next, you'll outline the phases of a pipeline development lifecycle's typical release flow before investigating more challenging topics such as managing data processing pipelines, using big data with simple data pipelines, and using periodic pipeline patterns. Lastly, you'll delve into the components of Google Workflow and recognize how to work with this system.

SRE Data Pipelines & Integrity: Pipeline Design

Site reliability engineers (SREs) encounter numerous and varied pipeline technologies and frameworks in their work. When building a pipeline, SREs need to invest considerable time during the design phase to ensure the results work best for the specific case. In this course, you'll explore the numerous features of a pipeline, such as latency, high availability, development, and operations. You'll also examine the two different pipeline mutations: idempotent and two-phase, as well as the checkpointing technique and various code patterns. You'll then investigate the five core characteristics of the pipeline maturity matrix and outline how they should be used to design the pipeline technology. You'll then identify potential failure modes, outage causes, and different prevention and response techniques. Finally, you'll outline event delivery system design and operations and how to plan for customer integration and support.

SRE Data Pipelines & Integrity: Data Integrity

Data integrity is vital as it ensures end-user data accuracy and consistency in conjunction with an adequate level of service and availability. In this course, you'll learn how to choose a strategy for data integrity, including how to account for any potential upsides and tradeoffs. You'll explore various types of failures that lead to data loss and the existence of the many data failure modes. You'll also identify data integrity challenges. Next, you'll examine in detail the soft deletion, backup and recovery, and early detection layers of defense-in-depth, before investigating the data integrity challenges a cloud developer may encounter in high-velocity environments. Finally, you'll outline considerations for implementing out-of-band data validation and successful data recovery and identify how the primary SRE principles apply to data integrity.

SRE Products at Scale: Product Launches

Site Reliability Engineers (SREs) often contribute to the launch of new products and features. These launches can occur in rapid iterations and at scale, so SREs need to be prepared to help them succeed. In this course, you'll examine launch coordination engineering to build and release reliable and fast products. You'll identify the criteria for a successful product launch and how to develop and use launch checklists to reduce failure and ensure consistency and completeness. Next, you'll outline the techniques used for reliable launches and how launch coordination engineers can help mitigate the repetition of launch mistakes. You'll investigate the production readiness review model used to identify a service's reliability needs. Lastly, you'll outline the characteristics of SRE engagement and early engagement models, as well as SRE engagement frameworks.

Final Exam: Chaos Engineer

Final Exam: Chaos Engineer will test your knowledge and application of the topics presented throughout the Chaos Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

SRE Team Management: Scaling the Team

When adding a new site reliability engineer (SRE) to your team, it's important that the new member not only has the required skills but also receives the proper training. This allows the new SRE to fit into the team and get up to speed as quickly as possible. In this course, you'll learn about the best practices for onboarding a new SRE team member, including methods and tools that can be used during the onboarding process. Next, you'll explore the technical skills that an SRE requires, including the ability to reverse engineer an application to determine the root cause of a problem. Finally, you'll examine the skills and knowledge an SRE requires when on-call, including those needed to provide support and manage support issues.

SRE Team Management: Managing Operational Loads

To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operation activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level and using support ticketing systems and service level objectives. Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.

SRE Team Management: Operational Overload

Site reliability engineers (SREs) are responsible for many administrative tasks, often splitting their time between reactive ops work and special projects. To ensure teams do not become overloaded, SREs may be transferred to a team in order to prevent or help mitigate overload. In this course, you will learn how to deal with operational overload. You'll start by examining ops mode, which is an approach used to ensure services are properly maintained and optimized. You'll discover factors that contribute to team morale and stress. In addition, you will outline emergency planning strategies and best practices, as well as learn how to categorize emergencies and prepare detailed emergency plans. Next, you'll explore how knowledge sharing relates to emergency preparedness, the key to writing successful postmortems, the importance of service level objectives, and how an appropriate level of detail is required to properly explain your findings. Lastly, you'll discover the key factors and attributes of successful teams. You'll examine a team-first approach and differentiate between questioning techniques such as open/closed, funnel, probing, and leading.

Core Skills for Site Reliability Engineers: SRE Collaboration & Communication

Collaboration is key to getting the most out of your team and ensuring your clients receive their desired service. In this course, you'll learn to collaborate and communicate effectively as an SRE. You'll learn how to run traditional and virtual meetings to ensure maximum effectiveness and productivity, whether it's with customers, internal or external team members, or distributed teams. You'll examine how to plan, carry out, and post-analyze meetings using best practices and sufficient preparation, tailoring these methods to suit the participants and the end goal. You'll delve into the unique characteristics of different meeting types, such as those for problem-solving or innovation. You'll explore the advantages and challenges of SRE pair programming. You'll then end the course by investigating some helpful collaboration and communication tools.

SRE Metric Management: Software Reliability Metrics

To improve the chances of creating, monitoring, and maintaining a successful software development project, site reliability engineers and all team members must be aware of which metrics to measure. They also need a working knowledge of both automated and manual testing methods. In this course, you'll learn how to manage and select SRE metrics and how various testing methods work. You'll begin by learning what metrics need to be measured for project management, software development, and APIs, examining in detail CI/CD, cloud API, and software project metrics, to name a few. Next, you'll compare manual and automated testing methods and the goals of each. Lastly, you'll investigate automated testing frameworks and platforms, test cases and types, and best practices and pitfalls to consider.

SRE Metric Management: Software Reliability Monitoring and Reporting

Once SRE metrics have been identified, site reliability engineers (SREs) must know how to perform fault analysis on a system, classify defects, and monitor and report data. In this course, you'll explore the tools and best practices for carrying out these procedures. You'll begin by identifying various fault analysis methods and tools. You'll then classify software defects and bugs with a focus on severity and priority. Next, you'll investigate strategies for monitoring APIs and explore some tools used for this task. You'll then examine in detail several tools for collecting, analyzing, and reporting metric data using a customizable dashboard, including those that comprise the ELK Stack: Elasticsearch, Logstash, and Kibana. Furthermore, you'll explore the data collection tool Beats and the beneficial use cases for Elasticsearch notifications.

SRE Engagement: Production Readiness Review

Production Readiness Review (PRR), the standard first step of SRE engagement, uses a series of phases to identify a service's reliability needs. The concept of "early engagement" is then used to evolve the Simple PRR model. In this course, you'll investigate SRE engagement, early engagement, and Production Readiness Review. You'll start by delving into each phase of the SRE Production Readiness Review (PRR) model, namely, engagement, analysis, refactoring, training, onboarding, and continuous improvement. Next, you'll learn how early engagement can be used to evolve the Simple PRR model. You'll then examine how SRE platforms and frameworks can provide structural solutions. Finally, you'll learn how to use the SRE engagement model to manage software projects, comparing it to the traditional System Development Life Cycle (SDLC) model.

SRE Engagement: The SRE Engagement Model

The SRE engagement model and SRE service lifecycle have noteworthy similarities and differences to the traditional software development life cycle. In this course, you'll explore these differences and investigate the SRE engagement model's components and how to work with it in various circumstances. You'll learn the steps for setting up and building SRE service relationships and establishing a roadmap for sprints and communication. You'll examine how to measure the impact of SRE engagement, set ground rules for SRE teams, and sustain effective relationships with other SREs and developers. Next, you'll study the steps to take for scaling SRE to larger environments and for ending an engagement. Lastly, you'll review case studies to see how others have used the SRE engagement model in real life.

Final Exam: Site Reliability Engineer

Final Exam: Site Reliability Engineer will test your knowledge and application of the topics presented throughout the Site Reliability Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

Course options

We offer several optional training products to enhance your learning experience. If you are planning to use our training course in preparation for an official exam, then we highly recommend using these optional training products to ensure an optimal learning experience. Sometimes only a practice exam and/or practice lab is available.

Optional practice exam (trial exam)

To supplement this training course you may add a special practice exam. This practice exam comprises a number of trial exams which are very similar to the real exam, both in terms of form and content. This is the ultimate way to test whether you are ready for the exam. 

Optional practice lab

To supplement this training course you may add a special practice lab. You perform the tasks on real hardware and/or software applicable to your lab. The labs are fully hosted in our cloud. The only thing you need to use our practice labs is a web browser. In the LiveLab environment you will find exercises which you can start immediately. The lab environments consist of complete networks containing, for example, clients and servers. This is the ultimate way to gain extensive hands-on experience.

Why Icttrainingen.nl

Save up to 80% on training with our learning concept.

Start learning whenever you want. You set your own pace.

Exchange ideas with fellow students and establish yourself as an authority in your field.

Receive the official Icttrainingen.nl certificate of participation after successfully completing your course.

Gain insight into detailed progress information for yourself or your employees.

Build knowledge with interactive e-learning and extensive practical assignments from certified teachers.

Order process

Once we have processed your order and payment, we will give you access to your courses. If you still have any questions about our ordering process, please refer to the button below.

read more about the order process

What is included?

Certificate of participation: Yes
Monitor progress: Yes
Award-winning e-learning: Yes
Mobile ready: Yes
Sharing knowledge: Unlimited access to our IT professionals community
Study advice: Our consultants are here to advise you on your study career and options
Study materials: Certified teachers with in-depth knowledge of the subject
Service: World's best service

Platform

After ordering your training, you get access to our innovative learning platform. Here you will find all the training courses you have purchased (or followed), you can create student accounts if needed, and you get access to detailed progress information.

Life Long Learning

Taking multiple courses? Read more about our Life Long Learning concept.


Contact us

Need training advice? Contact us!

