Course Kingdom


Site Reliability Engineer



School of Cloud Computing

17 September, 2025

Advance your tech career by learning to design, deploy, and maintain reliable, scalable systems through this Nanodegree. Featuring real-world projects, practical tools, and personalized expert feedbac...

$89.00 FREE

Course 1: Welcome!

Welcome! We're so glad you're here. Join us in learning a bit more about what to expect in this program and ways to succeed.

45 minutes

An Introduction to Your Nanodegree Program
Welcome! We're so glad you're here. Join us in learning a bit more about what to expect and ways to succeed.

Getting Help
You are starting a challenging but rewarding journey! Take 5 minutes to read how to get help with projects and content.


Course 2: Establishing a Foundation in Observability

In this course, we will learn about the founding concepts of observability in terms of people and tools.

14 hours

Introduction to Establishing a Foundation in Observability
This lesson will introduce you to the course, including what SRE is and why it matters.

SRE Roles and Responsibilities in Enterprise
In this lesson, we will learn how to distinguish unique SRE roles and responsibilities within an enterprise.

Improving Enterprise Workflows with SRE Best Practices
In this lesson, we will use cost-benefit analysis to investigate enterprise workflows that can be improved with common SRE practices.

SRE Teams
In this lesson, we will learn how to define an optimal SRE team structure and work allocation given business needs.

Monitoring System Performance
By the end of this lesson, you will have a fully functional monitoring system that uses some of the most popular tools in the industry.

Deploying System Observability
In this project, you will apply the skills you have acquired in the Establishing a Foundation in Observability course to configure a monitoring software stack.


Course 3: Planning for High Availability and Incident Response

In this course, we will look at how SREs view availability and reliability for their infrastructure. We'll learn how to create effective monitoring using SLOs and SLIs, and we will create dashboards in Grafana. Next, we'll identify all our IT assets and ensure they are configured for high availability. Then we will craft a disaster recovery plan to make sure failover is seamless and automated. After that, we'll deploy the infrastructure to AWS using Terraform. We'll learn the benefits of infrastructure as code and see how easy it is to deploy to multiple regions. Finally, we'll learn how to make databases highly available and ready for disaster recovery. We'll look at recovery strategies and implement them in AWS via Terraform.

16 hours

Course Introduction
Introduction to the course. We will look at how the topics all tie into being an SRE and what skills we'll learn and apply.

SLOs and SLIs
In this lesson, we will learn how SREs monitor using SLOs and SLIs. We will create queries in Prometheus and dashboards in Grafana.

IT Assets, Availability and Disaster Recovery
In this lesson, we will identify all IT assets, make those assets highly available, and put together a disaster recovery plan for those assets.

Creating and Deploying HA and DR Infrastructure Using Terraform
In this lesson, we will deploy our HA/DR infrastructure to AWS using Terraform.

High Availability and DR of Databases
In this lesson, we'll learn about database reliability and availability and how we can make databases more available. We will then deploy a replicated database cluster to AWS and observe a failover.

Deploying High Availability Infrastructure
In this project, you will apply the skills you've learned in this course by defining and implementing a resilient infrastructure in a cloud platform.


Course 4: Self-Healing Architectures

A self-healing architecture is resilient enough to withstand failure and resolve issues through automation, without human intervention. In this course, you'll gain skills in self-healing architecture design strategies, deployment strategies, and cloud automation.

11 hours

Introduction to Self-Healing Architectures
Welcome to Self-Healing Architectures! In this lesson, you'll learn more about the course and the topic.

Self-Healing System Design Fundamentals
In this lesson, you'll learn about self-healing system design fundamentals like single points of failure, tiered architecture, automation strategies, and microservice design.

Self-Healing Deployment Strategies
In this lesson, you'll learn about and implement several self-healing deployment strategies.

Cloud Automation
In this lesson, you'll learn about several different self-healing cloud automation configurations for microservices and virtual machines.

Deployment Roulette
In this project, you'll put everything you learned in the course into practice by playing the role of an SRE fixing and deploying applications using self-healing strategies.


Course 5: Establishing a Culture of Reliability

This course is all about how to foster a culture that is based on reliability. We will learn how to apply best practices in several key areas of being a Site Reliability Engineer (SRE) and how they contribute to a culture of reliability. We will cover how to have balanced and effective on-call rotations as well as how to handle incidents. Next, we will discuss how to review your system throughout its lifecycle to find and mitigate any potential risk factors. Managing system capacity at all phases of a system's lifecycle is another major component of ensuring that everything is operating at maximum reliability. We will round out this course by discussing a thorn in every SRE's side: toil. We will discuss how to identify and reduce toil to maximize time spent performing operational work.

18 hours

Introduction to Establishing a Culture of Reliability
In this lesson, we cover some introductory material to help you start with a solid foundation.

Improving On-Call Effectiveness
Having a solid on-call process is very important to achieving peak reliability. This lesson discusses how to have balanced on-call shifts with a solid incident management process that your team can follow.

Reliability Reviews
In this lesson, we learn how to review your system from the start to prepare for a release. It is important that you have systems in place to find potential risks and develop mitigations for them.

Managing System Capacity
System capacity is an essential part of ensuring reliability. This lesson discusses how to balance system capacity with costs to ensure that resources and money are not being wasted.

Toil Reduction
Toil is the bane of every SRE team, and this lesson is all about how to reduce toil to allow your team to focus on operational work that improves reliability.

Plan, Reduce, Repeat
To wrap everything up, you will complete the final project, in which you will work through three scenarios that tie together everything you have learned.


Course 6: Congratulations!

Congratulations on finishing your program!

10 minutes

Congratulations!
Congratulations on your graduation from this program! Please join us in celebrating your accomplishments.

