Course: Network Admin to Site Reliability Engineer – Part 3: Chaos Engineer

$529.00
$640.09 incl. VAT

Duration: 29 hours

Language: English (US)

Access duration: 180 days

Details

This course focuses on troubleshooting and creating order out of chaos as a site reliability engineer. You will learn how to identify and address system issues effectively, compare the different troubleshooting approaches, and simplify and streamline the process while avoiding common pitfalls. With a focus on issue reporting, examination, diagnosis, and testing, you’ll learn to observe recent changes and locate probable causes, making your troubleshooting more efficient. Additionally, you’ll delve into a variety of troubleshooting tools, including logging, monitoring techniques, and cloud-based solutions like Google Cloud Dataflow. By mastering these tools, you can swiftly diagnose and resolve system issues, ensuring optimal performance and reliability.

Next, you will learn proactive planning techniques and response strategies to prepare for unexpected emergencies. You will explore different emergency types, understand the importance of documentation, and develop incident response plans to minimize downtime and mitigate risks effectively. Furthermore, you’ll expand your expertise in software reliability testing by exploring various testing techniques and reliability metrics to ensure failure-free software operations. Lastly, you’ll gain insights into coordinating and executing successful product launches, emphasizing reliability, scalability, and consistency.

Result

After completing this course, you will be ready to create order out of chaos as a site reliability engineer. You will have a solid understanding of topics such as emergency response and incident handling, testing for reliability, load balancing, overload and cascading failures, distributed reliability, data pipelines and integrity, and deploying products at scale.

Prerequisites

No formal prerequisites. However, familiarity with Site Reliability Engineering, networking, and DevOps is recommended.

It is also recommended to first follow Parts 1 and 2 of the learning path "Network Admin to Site Reliability Engineer".

  • Part 1: Network Admin
  • Part 2: DevOps Engineer

Target audience

System Administrator, Network Administrator

Content

Network Admin to Site Reliability Engineer – Part 3: Chaos Engineer

29 hours

SRE Troubleshooting Processes

Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.

SRE Troubleshooting: Tools

Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use for resolving system issues. You'll start by examining the techniques of logging and whitebox and blackbox monitoring used to monitor system events. You'll then work with the various built-in Windows troubleshooting tools, namely the Event Viewer, Resource Monitor, and System Information tools. Next, you'll use Google Cloud Dataflow to process logs, before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems, and the open standards tool, Prometheus, for instrumenting software and exposing metrics.

SRE Emergency & Incident Response: Responding to Emergencies

Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of and how to respond to test-induced, change-induced, and process-induced emergencies and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps to correctly documenting emergencies, including the history of outages and mistakes. You'll then differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.

SRE Emergency & Incident Response: Incident Response

A well-prepared and organized approach is key to addressing and managing the aftermath of a system failure, security breach, or cyberattack. In this course, you'll explore the fundamental principles an SRE needs to be familiar with when responding to and managing incidents. You'll identify the goals, requirements, best practices, and key players involved in incident management. You'll learn how to deal with managed and unmanaged incidents and what's involved in an incident response plan. You'll identify incident response roles and responsibilities, and how to use incident metrics to manage incidents at scale. You'll outline what's involved in establishing a computer security incident response team (CSIRT), including each key team member's roles and responsibilities. Lastly, you'll examine what goes into an incident response policy.

SRE Testing Tasks: Software Reliability & Testing

Site reliability engineers (SREs) can use various testing techniques to ensure software operations are as failure-free as possible for a specified time in a specified environment. In this course, you'll explore multiple testing techniques, their purposes, and the tasks involved in their execution. You'll start by examining traditional software testing approaches, such as unit tests, integration tests, and system tests. Next, you'll investigate the components and use cases of various reliability metrics applied to SRE testing, including mean time to failure (MTTF), mean time to recover (MTTR), and mean time between failures (MTBF). Lastly, you'll outline several software testing approaches, such as stress, configuration, integration, acceptance, production, and canary testing, among others. You'll identify when, how, and by whom each of these testing types is carried out.
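As a rough illustration of how the reliability metrics named above relate to one another, the sketch below derives MTTF, MTTR, and MTBF from a list of outage intervals. The function name `reliability_metrics`, the sample outage data, and the convention MTBF = MTTF + MTTR are assumptions made for this example and are not taken from the course material.

```python
def reliability_metrics(observation_hours, outages):
    """Compute basic reliability metrics for one observation window.

    observation_hours: total length of the window, in hours.
    outages: non-overlapping (start_hour, end_hour) tuples inside the window.
    """
    failures = len(outages)
    if failures == 0:
        raise ValueError("at least one outage is needed to compute the metrics")

    downtime = sum(end - start for start, end in outages)
    uptime = observation_hours - downtime

    mttf = uptime / failures    # mean time to failure: average uptime per failure
    mttr = downtime / failures  # mean time to recover: average outage length
    mtbf = mttf + mttr          # mean time between failures (assumed convention)
    availability = mttf / mtbf  # fraction of the window the system was up

    return {"mttf": mttf, "mttr": mttr, "mtbf": mtbf, "availability": availability}


# Hypothetical data: three outages in a 1000-hour window.
metrics = reliability_metrics(1000, [(100, 102), (400, 403), (900, 905)])
```

With this made-up data, the system was down for 10 of 1000 hours, so the computed availability is 0.99.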

SRE Testing Tasks: Testing Considerations

Site reliability engineers (SREs) need to create a healthy test and build environment to ensure that products being distributed integrate and function as expected. In this course, you'll explore the fundamentals of creating a robust SRE test and build environment, looking at the standard tools and techniques available for testing at scale. You'll examine disaster and statistical testing, and learn about working with deadlines and production configurations. You'll investigate the topic of test failures, identifying why an SRE should expect specific tests to fail and how results for test failures can help maximize knowledge about operations and end-users. Lastly, you'll look at the why and how of incorporating break glass procedures, integration testing configuration files, and fake back-end versions into your testing procedures.

SRE Load Balancing Techniques: Front-end Load Balancing

Today's distributed systems can consist of hundreds or even thousands of servers, and getting them to work together efficiently is a challenge. Load balancing is a multifaceted concept whose many techniques can help SREs face this challenge. In this course, you'll explore how front-end load balancing works and its associated techniques, concepts, and capabilities. You'll examine the characteristics of load balancers, their use in application delivery and security, and the use of DNS load balancers. You'll outline strategies for virtual IP load balancing, cloud load balancing, and handling overload. Finally, you'll learn how the Google Front End Service, Andromeda virtualization stack, Maglev network load balancing service, and the Envoy edge and service proxy are used for load balancing-related tasks.

SRE Load Balancing Techniques: Data Center Load Balancing

A Site Reliability Engineer (SRE) must know how to perform load balancing within the data center, both internally and externally. In this course, you'll learn about load balancing, including various methods for balancing loads in the data center. You'll begin by examining what data center load balancing is and its importance to performance, as well as load balancing policies. You'll then learn how to deal with unhealthy tasks using flow control, and tips and tricks for optimizing load balancing. Next, you'll examine methods for limiting connection pools with subsetting, and the various load balancing components. Lastly, you'll learn how to balance loads internally and externally using HTTPS and TCP/UDP, and how to balance loads using SSL and TCP proxy load balancing.

Site Reliability Engineer: Managing Overloads

Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also comprise types of occupational stress, which invariably negatively affect an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate workloads and prioritize work before recognizing the specific consequences of overloads. You'll then describe how to manage client-side traffic using per-customer limits and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.
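To give a feel for client-side throttling, here is a minimal sketch of one well-known approach, the adaptive throttling formula described in Google's Site Reliability Engineering book: a client rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). Whether the course teaches this exact formula is an assumption. The class name `AdaptiveThrottle` and the multiplier `k=2.0` are illustrative choices, and the sliding time window a real client would use is omitted for brevity.

```python
import random


class AdaptiveThrottle:
    """Simplified client-side throttle (no sliding window; illustration only)."""

    def __init__(self, k=2.0):
        self.k = k         # how many times the accepted rate the client may attempt
        self.requests = 0  # all attempts, including ones rejected locally
        self.accepts = 0   # attempts the backend accepted

    def reject_probability(self):
        # Grows once attempts exceed k times the number of accepted requests.
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self, rng=random.random):
        # Decide locally whether to send the next request to the backend.
        return rng() >= self.reject_probability()

    def record(self, accepted):
        # Call once per attempt; pass False for backend and local rejections alike.
        self.requests += 1
        if accepted:
            self.accepts += 1
```

While the backend accepts everything, the rejection probability stays at zero; as rejections pile up, the client sheds a growing share of its own traffic before it ever reaches the overloaded backend.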

Site Reliability Engineer: Managing Cascading Failures

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.

Distributed Reliability: SRE Critical State Management

Anticipating failures that will affect your company's systems is a crucial site reliability engineer duty. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.

Distributed Reliability: SRE Distributed Periodic Scheduling

A distributed system requires constant maintenance to ensure failures don't interfere with that system's reliability and availability. Using periodic scheduling and replication, site reliability engineers can minimize the effect failures may have on a system's performance. One way to automate this process is to utilize the system daemon, cron. In this course, you'll explore how to use cron for task scheduling, the purpose, components, and operators involved in cron jobs, and the format and characters of cron syntax. You'll outline how cron works with distributed periodic scheduling and idempotency, and in large-scale deployments. Next, you'll review the Paxos distributed consensus algorithm, best practices for its use, and how it applies to distributed replication. Lastly, you'll practice scheduling a cron job and using cron syntax generators.
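For readers new to cron syntax, the sketch below splits a standard five-field crontab schedule into its named parts. The helper name `split_cron` and the sample schedule are hypothetical and only illustrate the field order (minute, hour, day of month, month, day of week). The helper does not validate field values or expand operators such as `*/15`.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Map the five whitespace-separated fields of a cron schedule to their names."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# Hypothetical schedule: every 15 minutes during the 02:00 hour, Monday to Friday.
schedule = split_cron("*/15 2 * * 1-5")
```

Because a job on such a schedule can fire more than once or be skipped when a machine fails, the idempotency topic mentioned above matters: a periodic job is safest when running it twice has the same effect as running it once.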

SRE Data Pipelines & Integrity: Data Pipelines

Site reliability engineers often find data processing complex as demands for faster, more reliable, and more cost-effective results continue to evolve. In this course, you'll explore techniques and best practices for managing a data pipeline. You'll start by examining the various pipeline application models and their recommended uses. You'll then learn how to define and measure service level objectives, plan for dependency failures, and create and maintain pipeline documentation. Next, you'll outline the phases of a pipeline development lifecycle's typical release flow before investigating more challenging topics such as managing data processing pipelines, using big data with simple data pipelines, and using periodic pipeline patterns. Lastly, you'll delve into the components of Google Workflow and recognize how to work with this system.

SRE Data Pipelines & Integrity: Pipeline Design

Site reliability engineers (SREs) encounter numerous and varied pipeline technologies and frameworks in their work. When building a pipeline, SREs need to invest considerable time during the design phase to ensure the results work best for the specific case. In this course, you'll explore the numerous features of a pipeline, such as latency, high availability, development, and operations. You'll also examine the two different pipeline mutations: idempotent and two-phase, as well as the checkpointing technique and various code patterns. You'll then investigate the five core characteristics of the pipeline maturity matrix and outline how they should be used to design the pipeline technology. You'll then identify potential failure modes, outage causes, and different prevention and response techniques. Finally, you'll outline event delivery system design and operations and how to plan for customer integration and support.

SRE Data Pipelines & Integrity: Data Integrity

Data integrity is vital as it ensures end-user data accuracy and consistency in conjunction with an adequate level of service and availability. In this course, you'll learn how to choose a strategy for data integrity, including how to account for any potential upsides and tradeoffs. You'll explore the various types of failures that lead to data loss and the many data failure modes that exist. You'll also identify data integrity challenges. Next, you'll examine in detail the soft deletion, backup and recovery, and early detection layers of defense-in-depth, before investigating the data integrity challenges a cloud developer may encounter in high-velocity environments. Finally, you'll outline considerations for implementing out-of-band data validation and successful data recovery and identify how the primary SRE principles apply to data integrity.

SRE Products at Scale: Product Launches

Site Reliability Engineers (SREs) often contribute to the launch of new products and features. These launches can occur in rapid iterations and at scale, so SREs need to be prepared to help them succeed. In this course, you'll examine launch coordination engineering to build and release reliable and fast products. You'll identify the criteria for a successful product launch and how to develop and use launch checklists to reduce failure and ensure consistency and completeness. Next, you'll outline the techniques used for reliable launches and how launch coordination engineers can help mitigate the repetition of launch mistakes. You'll investigate the production readiness review model used to identify a service's reliability needs. Lastly, you'll outline the characteristics of SRE engagement and early engagement models, as well as SRE engagement frameworks.

Final Exam: Chaos Engineer

Final Exam: Chaos Engineer will test your knowledge and application of the topics presented throughout the Chaos Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

Course options

We offer several optional training products to enhance your learning experience. If you are planning to use our training course in preparation for an official exam, then we highly recommend using these optional training products to ensure an optimal learning experience. Sometimes only a practice exam and/or a practice lab is available.

Optional practice exam (trial exam)

To supplement this training course you may add a special practice exam. This practice exam comprises a number of trial exams which are very similar to the real exam, both in terms of form and content. This is the ultimate way to test whether you are ready for the exam. 

Optional practice lab

To supplement this training course you may add a special practice lab. You perform the tasks on real hardware and/or software applicable to your lab. The labs are fully hosted in our cloud. The only thing you need to use our practice labs is a web browser. In the LiveLab environment you will find exercises which you can start immediately. The lab environments consist of complete networks containing, for example, clients, servers, etc. This is the ultimate way to gain extensive hands-on experience.

Why Icttrainingen.nl

Save up to 80% on training with our learning concept.

Start learning whenever you want. You set your own pace.

Exchange ideas with fellow students and establish yourself as an authority in your field.

After successfully completing your course, receive the official certificate of participation from Icttrainingen.nl.

Get insight into detailed progress information for yourself or your employees.

Gain knowledge through interactive e-learning and extensive practical assignments from certified instructors.

Order process

Once we have processed your order and payment, we will give you access to your courses. If you still have any questions about our ordering process, please refer to the button below.

read more about the order process

What is included?

Certificate of participation: Yes
Monitor progress: Yes
Award-winning e-learning: Yes
Mobile ready: Yes
Sharing knowledge: Unlimited access to our IT professionals community
Study advice: Our consultants are here to advise you on your study career and options
Study materials: Certified teachers with in-depth knowledge of the subject
Service: World's best service

Platform

After ordering your training, you get access to our innovative learning platform. Here you will find all the courses you have purchased (or followed), you can create student accounts if needed, and you get access to detailed progress information.

Life Long Learning

Follow multiple courses? Read more about our Life Long Learning concept


Contact us

Need training advice? Contact us!

