Backup retry storms: How you can improve backup reliability

Retries are a diagnostic signal of policy health rather than an effective treatment: Acronis H2 2025 telemetry shows policies with many retries typically suffer persistent customer-side failures (quota exhaustion, VSS writer failures, connectivity issues) that retries do not fix. Uncapped retries create “retry storms” that waste CPU, I/O and network resources, degrade performance for users and neighboring tenants, and should be limited with meaningful delays and alerts. #Acronis #AzureBackup

Key points

  • Telemetry from Acronis H2 2025 shows policies with 10+ daily runs produced ~4.7 million errors in October 2025, with always_full policies having the highest per-job error rate (10.01%).
  • The top failure causes during business hours were SpaceQuotaReachedHard (15.9%), MemoErrorVssWriterFail (11.48%), and ConnectToPlatformFailedNetwork (9.11%), all persistent issues not resolved by retries.
  • Industry guidance (Google Cloud, Microsoft Azure, AWS) treats retries as a controlled response to transient faults and emphasizes limits, observability, and escalation rather than unlimited retries.
  • Retries consume scheduler capacity and server resources (VSS snapshots, disk reads, network uploads, agent metadata writes) and can produce heavy invisible infrastructure waste despite a single “failed” status in the console.
  • Statistics show selection bias: higher run-count buckets reflect failing policies (more attempts occur because failures happened), so comparing error rates across run-count buckets does not prove retries are harmful.
  • Practical recommendations: cap retries by backup type (e.g., always_incremental: max 2 retries; always_full: max 1 retry), set delays, alert when a policy exceeds four runs/day, and investigate recurring failures instead of increasing retries.
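
The recommendations above can be sketched as a small retry-decision function. This is an illustrative sketch, not Acronis configuration or API: the function name, the `Decision` type, the 15-minute base delay, the doubling backoff, the fallback cap for other policy types, and the choice to skip retries and raise an alert for the three persistent failure codes are all assumptions. Only the per-type caps, the four-runs-per-day alert threshold, and the failure code names come from the article.

```python
from dataclasses import dataclass

# Retry caps per backup type, as recommended in the article.
MAX_RETRIES = {"always_incremental": 2, "always_full": 1}
DEFAULT_MAX_RETRIES = 1  # assumption: fallback for policy types not listed

# Failure codes the article describes as persistent (not fixed by retrying).
PERSISTENT_ERRORS = {
    "SpaceQuotaReachedHard",
    "MemoErrorVssWriterFail",
    "ConnectToPlatformFailedNetwork",
}

ALERT_RUNS_PER_DAY = 4    # article: alert when a policy exceeds four runs/day
BASE_DELAY_SECONDS = 900  # assumption: 15-minute base delay, doubled per retry


@dataclass
class Decision:
    retry: bool         # schedule another attempt?
    delay_seconds: int  # wait before the next attempt (0 if no retry)
    alert: bool         # notify an operator?


def decide(policy_type: str, error_code: str,
           retries_so_far: int, runs_today: int) -> Decision:
    """Decide whether a failed backup job should be retried (hypothetical helper)."""
    persistent = error_code in PERSISTENT_ERRORS
    # Alert on the run-count threshold; alerting on persistent errors
    # is an added assumption so that they get investigated.
    alert = runs_today > ALERT_RUNS_PER_DAY or persistent

    cap = MAX_RETRIES.get(policy_type, DEFAULT_MAX_RETRIES)
    if persistent or retries_so_far >= cap:
        return Decision(retry=False, delay_seconds=0, alert=alert)

    # Exponential backoff: 900 s, 1800 s, ...
    delay = BASE_DELAY_SECONDS * 2 ** retries_so_far
    return Decision(retry=True, delay_seconds=delay, alert=alert)
```

For example, `decide("always_full", "SpaceQuotaReachedHard", 0, 1)` declines to retry and raises an alert, because a hard quota limit will not clear on its own, while a first failure with an unlisted error code on an always_incremental policy is retried after the base delay.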

MITRE Techniques

  • [None ] No MITRE ATT&CK techniques are mentioned in the article.

Indicators of Compromise

  • [Domain ] Reference and documentation sources cited – cloud.google.com, docs.aws.amazon.com (and learn.microsoft.com referenced).
  • [Error/Event name ] Acronis failure codes and diagnostics – SpaceQuotaReachedHard, MemoErrorVssWriterFail, and other error identifiers like ConnectToPlatformFailedNetwork.
  • [Policy name ] Backup policy identifiers used in analysis and recommendations – always_full, always_incremental (and weekly_full_daily_inc).


Read more: https://www.acronis.com/en/tru/posts/backup-retry-storms-how-you-can-improve-backup-reliability/