Site Reliability Engineering (SRE)
Benefits
Strike the balance between speed and reliability
Reap the benefits of speed
Automate end to end, from writing code to running services in production. Align dev and ops around shared goals to go faster. Connect to the tools you love, including incident management, as you minimize toil.
Improve reliability with proven SRE principles
Leverage SRE principles developed at Google and proven to work at scale. Easily implement SRE best practices with Google Cloud’s operations suite to speed up problem resolution and improve reliability.
We meet you where you are in your SRE journey
Drive higher software delivery performance, irrespective of company size, industry, or whether you are using VMs, Kubernetes, or serverless. Choose from free tools or paid offerings to jump-start your SRE journey.
Key features
SRE tools and resources to make your operations and SRE teams run better
Monitor service health using SRE principles
Monitor the health of your services and work with developers to increase the velocity of changes using built-in support for service monitoring. Select metrics for service-level indicators (SLIs), set service-level objectives (SLOs), and track error budgets to mitigate risk for your service. Use powerful dashboards to aggregate metrics and logs, including golden signals, to reduce mean time to resolution (MTTR) and quickly answer questions about service health.
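To make the relationship between SLIs, SLOs, and error budgets concrete, here is a minimal sketch of the arithmetic for a request-based SLO. It is illustrative only and does not call Cloud Monitoring; the class name `ErrorBudget`, its fields, and the sample numbers are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: error-budget arithmetic for a request-based SLO."""

    slo_target: float      # target fraction of good requests, e.g. 0.999
    total_requests: int    # requests observed in the compliance window
    failed_requests: int   # requests that counted as "bad" for the SLI

    @property
    def sli(self) -> float:
        """Measured fraction of good requests (the SLI)."""
        if self.total_requests == 0:
            return 1.0  # no traffic means nothing has failed
        good = self.total_requests - self.failed_requests
        return good / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed


# Example with assumed numbers: a 99.9% SLO over one million requests.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {budget.sli:.5f}")
print(f"Allowed failures: {budget.allowed_failures:.0f}")
print(f"Budget remaining: {budget.budget_remaining:.0%}")
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget. A team might use the remaining fraction to decide whether to keep shipping changes or to slow down and focus on reliability.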
Out-of-the-box integrations to increase automation and reduce toil
Use our built-in integrations with the tools you love to troubleshoot incidents quickly. Implement progressive rollouts and roll back changes safely. Pre-built integrations with Cloud Build are available to allow you to build, test, and deploy artifacts to Google Kubernetes Engine, App Engine, Cloud Functions, Firebase, and Cloud Run as part of your CI/CD.
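The idea behind a progressive rollout with safe rollback can be sketched in a few lines. This is a generic illustration, not the Cloud Build or Google Kubernetes Engine API: the function `progressive_rollout` and the `set_traffic` and `is_healthy` callbacks are hypothetical stand-ins for whatever your deployment tooling and health checks provide.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[int],
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Shift traffic to a new version step by step, rolling back on failure.

    steps       -- increasing traffic percentages, e.g. [5, 25, 100]
    set_traffic -- hypothetical callback that routes N% of traffic to the new version
    is_healthy  -- hypothetical callback that reports whether the new version
                   is meeting its health criteria at the current step
    Returns True if the rollout reached the final step, False if it rolled back.
    """
    for percent in steps:
        set_traffic(percent)
        if not is_healthy():
            # Roll back: send all traffic to the previous version again.
            set_traffic(0)
            return False
    return True


# Example with stand-in callbacks that only record what would happen.
history: list[int] = []
succeeded = progressive_rollout([5, 25, 100], history.append, lambda: True)
print(succeeded, history)
```

In practice the health check would typically wait for a soak period and compare the new version's SLIs against its SLO before advancing to the next step.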
One integrated view for faster resolution
Get one unified view across logs, events, metrics, and SLOs. Get in-context observability data right within the service consoles of Google Kubernetes Engine, Cloud Run, Compute Engine, Anthos, and other runtimes. Collect metrics, traces, and logs with zero setup. Sub-second ingestion latency and terabyte-per-second ingestion rates let you perform real-time log management and analysis at scale.
Get extra help from Google Cloud SRE specialists
If you would like more hands-on help along the way, we offer additional services, including Google consulting services. Reach out to sales to see which option would work for your organization. Learn from our Customer Reliability Engineering (CRE) team and from customer success stories about how Google Cloud tools and practices have helped other companies implement SRE in their organizations.
Drive SRE/developer collaboration to “shift-left” observability
With OpenTelemetry (OT) packages and the Google exporter, developers can instrument their code and export trace data to Cloud Trace. Our new unified Ops agent (in preview) collects metrics and logs and also supports OpenTelemetry to capture and transport metrics. We are working to build OT libraries into many of our cloud products as out-of-the-box features. Cloud SQL Insights is one example of this effort.
Related services
SRE integrations and products
Build and deploy new cloud applications, store artifacts, and monitor app security and reliability on Google Cloud.
Documentation
Learn how to implement SRE at your organization with these resources
Google Site Reliability Engineering
Access the SRE books, hear from SREs, and learn how we practice SRE at Google.
Creating an SLO
To monitor a service, you need at least one service-level objective (SLO). Learn step by step how to create your first SLO in Cloud Monitoring.
Hands-on labs: Troubleshooting workloads on GKE for SREs
Learn how to navigate resource pages of GKE, use the GKE dashboard, create logs-based metrics, create an SLO, and define an alert to notify SRE staff of incidents.
Engineering for reliability
Learn how to define and defend your SLOs in Google Cloud's operations suite and improve observability of your applications running in Google Cloud.
SRE: Measuring and managing reliability
This course teaches the theory of service-level objectives (SLOs), a principled way of describing and measuring the desired reliability of a service.
Developing a Google SRE culture
This course introduces key practices of Google SRE and the important role IT and business leaders play in the success of SRE organizational adoption.