Google Cloud Service Health

Summary

On Monday, February 27, 2023, Google Cloud Networking's production network experienced significant packet loss starting at 04:58 US/Pacific and lasting up to seven minutes. This caused errors and failures in several downstream Google Cloud and Google Workspace services that took up to an additional six minutes to recover. To our customers who were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you. We have conducted an internal investigation and are taking immediate steps to improve the platform's performance and availability.

Root Cause

Google's production network has several levels of redundancy and several systems that ensure optimal bandwidth routing. One of our back-end control plane elements, responsible for calculating optimal paths for bandwidth, consumes snapshot data from a critical element that provides detailed network modeling, including topology, statistics, and forwarding table information. During a routine update to the critical element's snapshot data, an incomplete snapshot was inadvertently shared, which removed several sites from the topology map. This caused traffic originating from those sites to be discarded until recovery mechanisms kicked in and correctly reprogrammed all paths. The packet loss caused errors and failures in multiple downstream Google Cloud and Google Workspace services.
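To make the failure mode concrete, here is a minimal, purely illustrative Python sketch (invented site names, not Google's actual control plane) of how a path calculator that trusts an incomplete snapshot ends up discarding traffic sourced from the missing sites until a complete snapshot is programmed on the next cycle:

```python
# Illustrative only: a path calculator that consumes topology snapshots.
# Sites absent from the snapshot get no programmed paths, so traffic
# originating there is discarded until a complete snapshot is pushed.

COMPLETE_SNAPSHOT = {"site-a": ["site-b"], "site-b": ["site-a"], "site-c": ["site-a"]}
INCOMPLETE_SNAPSHOT = {"site-a": ["site-b"], "site-b": ["site-a"]}  # site-c missing

def program_paths(snapshot: dict[str, list[str]]) -> dict[str, list[str]]:
    """Derive forwarding paths only for sites present in the snapshot."""
    return {site: neighbors for site, neighbors in snapshot.items()}

def route(paths: dict[str, list[str]], src: str) -> str:
    # A site missing from the programmed paths has nowhere to send traffic.
    if src not in paths:
        return "DROP"          # the packet loss observed by downstream services
    return f"forward via {paths[src][0]}"

paths = program_paths(INCOMPLETE_SNAPSHOT)
print(route(paths, "site-c"))  # DROP, until the next programming cycle
paths = program_paths(COMPLETE_SNAPSHOT)
print(route(paths, "site-c"))  # forward via site-a
```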

Remediation and Prevention

Google's automation systems mitigated this failure by pushing a complete topology snapshot during the next programming cycle. The missing sites were restored to the topology and the network converged by 05:05 US/Pacific.

Google is committed to preventing this type of disruption from reoccurring and is taking the following actions:

  • Input validation in the control plane has been fully activated for the most critical components and is rolling out to the remaining components. The bandwidth routing system is now more robust against unexpectedly large changes in the topology (a rough sketch of such a guard follows this list).
  • Deploy a fix to the topology system to prevent it from producing incomplete snapshots of this kind.
  • Shard the system providing topology input to the control plane so that it does not span multiple regions. This will ensure that an erroneous input, if any, will not impact traffic originating from many regions at the same time.
  • Implement safer sequencing when removing sites from the topology map.
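As a rough illustration of the first item, the guard below sketches one way input validation might reject a snapshot that removes an unexpectedly large share of known sites. The threshold, names, and structure are assumptions for illustration, not Google's implementation:

```python
# Minimal sketch of snapshot input validation; threshold is an assumption.

class SuspectSnapshotError(Exception):
    """Raised when a snapshot changes the topology more than expected."""

def validate_snapshot(previous_sites: set[str], new_sites: set[str],
                      max_removal_fraction: float = 0.05) -> None:
    """Reject snapshots that silently drop a large share of known sites."""
    removed = previous_sites - new_sites
    if previous_sites and len(removed) / len(previous_sites) > max_removal_fraction:
        # Keep the last known-good topology instead of programming the
        # suspect one; large removals require explicit, safe sequencing.
        raise SuspectSnapshotError(
            f"snapshot removes {len(removed)} of {len(previous_sites)} sites")

validate_snapshot({"a", "b", "c", "d"}, {"a", "b", "c", "d"})   # passes
try:
    validate_snapshot({"a", "b", "c", "d"}, {"a", "b"})          # 50% removed
except SuspectSnapshotError as e:
    print("rejected:", e)
```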

Detailed Description of Impact

On Monday, 27 February 2023, from 04:58 to 05:12 US/Pacific unless otherwise noted:

Google Cloud Platform Services:

Apigee: Up to 20% of requests in southamerica-east1, us-east1, and us-central1 experienced timeouts and elevated 5xx error rates.

Virtual Private Cloud: Affected customers/users experienced increased packet loss for cross-region traffic between affected regions: asia-east1, asia-southeast1, europe-north1, europe-west1, europe-west4, us-central1, us-central2, us-east1, us-east4, us-west1, us-west4.

Cloud Interconnect: Affected customers/users in us-east4, asia-southeast1, southamerica-east1, us-east7, and us-west4 experienced increased packet loss on their interconnects.

BigQuery: Affected customers/users running queries in affected BigQuery regions experienced increased latency and elevated UNAVAILABLE error rates (retriable 503 errors) between 04:58 and 05:10. Regions affected were: aws-us-east-1, azure-eastus2, southamerica-east1, us-multiregion, us-east4, us-east5, us-east7, us-south1.

Cloud Dataflow: Overall, two regions were impacted: us-central1 and us-east4. ~20% of Dataflow jobs in us-central1 experienced streaming data disruption: no data passed through the Dataflow pipelines for about 13 minutes. There was no visible impact on Dataflow Batch jobs in us-central1. In us-east4, the impact was on 55% of Dataflow Streaming jobs and 75% of Dataflow Batch jobs, for about 13 minutes.

Cloud Bigtable: Affected customers/users in us-central1, us-east4, and us-west1 experienced unavailable (retriable 503) or deadline-exceeded (504) errors. 11.3% of customer projects were affected by the issue.

Cloud Key Management Service (KMS): Affected customers/users in us-east4, us-east5, us-central1, multi-region us, multi-region nam7, and global experienced reduced availability in the form of retriable 503 (unavailable) and 504 (deadline exceeded) errors across two categories of KMS keys: software and hardware keys. For software keys, 2.4% of customers were affected, across all the regions mentioned above. For hardware keys, the impact was limited to us-central1, multi-region us, and global, where 0.7% of customers were affected. During the outage, 0.1% of software-key requests and 0.78% of hardware-key requests returned 503 and 504 errors. Requests that could not reach our servers likely received a 503 (unavailable), in which case retries eventually succeeded, or a 504 (deadline exceeded) error.
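Since the 503s above are explicitly retriable, a client retrying with backoff would typically have ridden out the short impact window. The sketch below shows a generic jittered-backoff retry pattern; it is not the actual Cloud KMS client library, and the names are illustrative:

```python
# Generic retry-with-backoff sketch for retriable 503/504 responses.
import itertools
import random
import time

RETRIABLE = {503, 504}

def call_with_backoff(request, max_attempts: int = 5):
    """Retry a callable returning (status, body) on retriable statuses."""
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRIABLE:
            return status, body
        # Jittered exponential backoff: ~0.5s, 1s, 2s, ... capped at 8s.
        time.sleep(min(0.5 * 2 ** attempt, 8) * random.uniform(0.5, 1.5))
    return status, body

# Simulated outage: two 503s before the network converges.
responses = itertools.chain([(503, ""), (503, "")], itertools.repeat((200, "ok")))
print(call_with_backoff(lambda: next(responses)))  # (200, 'ok')
```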

Cloud Monitoring: Affected customers/users experienced elevated latency on Cloud Monitoring dashboards. Customers writing metric data through the Monitoring API experienced elevated error rates.

Persistent Disk: Affected customers/users would have seen reads, writes, and unmaps stall. Approximately 0.12% of devices were affected globally. Affected regions were: us-central1, us-east4, southamerica-east1, us-west4, asia-southeast1.

Cloud SQL: Affected customers/users may have seen intermittent connectivity issues for ~10 minutes, from 05:00 to 05:10 US/Pacific (us-central1, us-east4, us-west4, southamerica-east1). Retrying should have succeeded, as the success rate was ~80-95% depending on region and method.

Cloud Workflows: ~1% of (global) requests to the workflow executions API failed. 75% of requests failed in us-west4, 46% in us-east4, ~3.5% in us-central1, and ~2.5% in southamerica-east1 and northamerica-northeast2.

Cloud Console: Affected customers/users experienced an elevated number of GUI failures. Up to 9% of page views were affected during the impact period.

Cloud Load Balancing: Affected customers/users experienced elevated 500 error rates from load balancers for traffic passing through asia-southeast1, us-central1, and us-west4.

Google Compute Engine: Affected customers/users experienced elevated 500 error rates when sending HTTP requests to Google Compute Engine APIs. This also included timeouts when reading or writing GCE metadata guest attributes. Additionally, affected users would have experienced increased latency and elevated UNAVAILABLE error rates (retriable 503 errors) between 05:00 and 05:10 US/Pacific for Compute Engine Frontend UI pages.

Cloud Run: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud App Engine: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud Functions: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud VPN: Affected customers/users experienced elevated packet loss in us-east4, asia-southeast1, southamerica-east1, us-east7, and us-west4.

Identity and Access Management: Affected customers/users experienced NOT_FOUND, PERMISSION_DENIED, DEADLINE_EXCEEDED, and UNAVAILABLE errors.

Cloud Pub/Sub: Affected customers/users experienced unavailability and increased latency for Publish operations, with the most severe impact in us-east1, us-east4, and us-east5, where approximately 40% of projects experienced at least one minute in which fewer than 99% of requests succeeded. In most regions, <10% of projects experienced impact. Availability of Subscribe operations was impacted in a similar pattern. Combined with the system's inability to move message data between locations as normal, this led to increased end-to-end message delivery latency for approximately 20% of subscriptions.

Google Cloud Storage: On 2023-02-27, from 04:58 to 05:06 (8 minutes), some GCS customers/users in us-east4 experienced reduced availability in the form of Service Unavailable (retryable) errors at a rate of about 1% overall. Less than 1% of customer projects had an error rate of more than 1%. Some GCS customers in the us-multiregion, us-central1, and us-east1 configurations also experienced an elevated error rate during the impact window that did not exceed 1%.

Cloud Firestore: Affected customers/users would have seen increased unavailability errors with the Firestore and Datastore APIs in various regions, including southamerica-east1, us-west4, us-east4, us-east1, and nam5. Globally, around 33% of active customers and 0.38% of active requests received 502 (unavailable) errors or experienced increased latency between 04:58 and 05:10 US/Pacific.

Cloud Data Loss Prevention: No user impact.

Cloud Memorystore:

  • Control Plane: Between 04:58 and 05:09 US/Pacific, some customers/users issuing control plane requests (like GetInstance, CreateInstance, etc.) experienced a significant increase in latency and, in ~35% of cases, request failures with 5xx error codes. The issue was most pronounced in the us-east4, us-west4, us-east1, and asia-southeast1 cloud regions, but was noticeable in other regions as well.

  • Data Plane: Between 05:00 and 05:04 US/Pacific, a number of instances in us-east4, us-west4, southamerica-east1, us-central1, and europe-west2 experienced 1-5 minutes of unavailability, during which customers were unable to connect to their Redis server. In many STANDARD-tier instances, this resulted in a failover.

Cloud Spanner: On 2023-02-27, from 04:57 to 05:06 US/Pacific (9 minutes), some customers/users in us-east4, nam3, nam7, nam9, nam11, nam12, and nam-eur-asia3 experienced reduced availability in the form of deadline exceeded (retryable) errors, as well as an increase in latency. Less than 1% of customer projects had an error rate of more than 1%.

Cloud Build: Cloud Build API customers/users in two regions (southamerica-east1 and prod-global) experienced high latency and DEADLINE_EXCEEDED responses. The availability SLO for {Get,List}WorkerPool in southamerica-east1 dropped to 13% for 3 minutes and consumed 25% of the 30-day error budget. The availability SLO for ReceiveGitHubDotComWebhook dropped to 28% for 3 minutes and consumed 9% of the 30-day error budget.
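For readers unfamiliar with error-budget accounting, the arithmetic below shows how a 3-minute dip to 13% availability can consume a quarter of a 30-day budget. The 99.975% SLO target is an assumption chosen to roughly reproduce the reported 25%, not a published Cloud Build target:

```python
# Rough error-budget arithmetic; the SLO target is an assumption.
SLO_TARGET = 0.99975
WINDOW_MINUTES = 30 * 24 * 60                       # 43,200 minutes in 30 days
budget = (1 - SLO_TARGET) * WINDOW_MINUTES          # 10.8 error-minutes allowed

availability_during_incident = 0.13
incident_minutes = 3
error_minutes = (1 - availability_during_incident) * incident_minutes  # 2.61

print(f"budget consumed: {error_minutes / budget:.0%}")  # ~24%, near the reported 25%
```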

Container Registry: Container Registry customers/users in the global region experienced high latency and HTTP 504 responses. The availability SLO for manifests_get consumed 6% of the 30-day error budget. The availability SLO for ping_and_token_availability consumed 15% of the 30-day error budget.

Cloud Tasks: Cloud Tasks customers/users in the us-central1 region experienced high latency and DEADLINE_EXCEEDED responses for CreateTasks requests. The Remote Procedure Call (RPC) error rate increased from 0 to 10% for 3 minutes.

Google Kubernetes Engine (GKE): GKE customers/users may have experienced service degradation and elevated 500 errors in affected locations.

Google Workspace Services:

Gmail: Affected customers/users would have experienced unavailability, 502 errors when accessing Gmail, and email delivery delays and failures between 04:58 and 05:06 US/Pacific.

Google Calendar: Affected customers/users experienced general unavailability when accessing Calendar.

Google Chat: Affected customers/users in affected locations experienced errors when attempting to access and use Google Chat.

Google Meet: Affected customers/users experienced failure rates of up to 14% when attempting to start or join a new meeting.

Google Docs: Customers/users in affected locations would have experienced errors when loading or accessing documents.

Google Drive: Up to 10% of customers/users accessing Google Drive during the time window experienced unavailability (HTTP 500 errors).

Google Tasks: Affected customers/users experienced availability issues with Tasks.

Google Voice: Affected customers/users experienced up to a 2.9% error rate when interacting with the Voice API. Up to 2% of ongoing Google Voice calls may have been dropped. Up to 13% of desk phones may not have been able to make or receive calls during the window, and any outgoing calls on these would have been dropped.
