[Openstack] [cinder] Pruning Old Volume Backups with Ceph Backend

I back up my volumes daily, using incremental backups to minimize
network traffic and storage consumption. I want to periodically remove
old backups, and during this pruning operation, avoid entering a state
where a volume has no recent backups. Ceph RBD appears to support this
workflow, but unfortunately, Cinder does not. I can only delete the
*latest* backup of a given volume, and this precludes any reasonable
way to prune backups. Here, I'll show you.

Let's make three backups of the same volume:
openstack volume backup create --name backup-1 --force volume-foo
openstack volume backup create --name backup-2 --force volume-foo
openstack volume backup create --name backup-3 --force volume-foo

Cinder reports the following via `volume backup show`:
- backup-1 is not an incremental backup, but backup-2 and backup-3 are
- All but the latest backup have dependent backups (`has_dependent_backups`).
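
For anyone who wants to check these two properties across many backups without reading each table by hand, here is a rough sketch. It assumes the JSON printed by `openstack volume backup show -f json` carries `name`, `is_incremental` and `has_dependent_backups` keys; the sample record is made up for illustration and trimmed to those fields.

```python
import json


def summarize_backup(show_json: str) -> str:
    """Summarize one backup from `openstack volume backup show -f json` output.

    Assumes the keys `name`, `is_incremental` and `has_dependent_backups`
    are present; missing keys are treated as false/None.
    """
    record = json.loads(show_json)
    kind = "incremental" if record.get("is_incremental") else "full"
    deps = "has dependents" if record.get("has_dependent_backups") else "no dependents"
    return f"{record.get('name')}: {kind}, {deps}"


# Hypothetical record, not real output from my cloud.
sample = '{"name": "backup-2", "is_incremental": true, "has_dependent_backups": true}'
print(summarize_backup(sample))
```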

We take a backup every day, and after a week we're on backup-7. We
want to start deleting older backups so that we don't keep
accumulating backups forever! What happens when we try?

# openstack volume backup delete backup-1
Failed to delete backup with name or ID 'backup-1': Invalid backup:
Incremental backups exist for this backup. (HTTP 400)

We can't delete backup-1 because Cinder considers it a "base" backup
which `has_dependent_backups`. What about backup-2? Same story. Adding
the `--force` flag just gives a slightly different error message. The
*only* backup that Cinder will delete is backup-7 -- the very latest
one. This means that if we want to remove the oldest backups of a
volume, *we must first remove all newer backups of the same volume*,
i.e. delete literally all of our backups.
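
To make the conflict concrete, here is a toy model in Python. It is not Cinder code; `wanted_for_pruning` and `cinder_deletable` are hypothetical helpers that only encode the behaviour described above (a keep-the-newest-N retention policy versus "only the backup without dependents may be deleted").

```python
def wanted_for_pruning(backups, keep_last):
    """Backups a simple retention policy wants gone: all but the newest keep_last.

    `backups` is ordered oldest first.
    """
    cutoff = max(len(backups) - keep_last, 0)
    return backups[:cutoff]


def cinder_deletable(backups):
    """Backups Cinder will delete, per the behaviour observed above:
    only the newest backup in the chain has no dependents."""
    return backups[-1:]


chain = [f"backup-{n}" for n in range(1, 8)]  # backup-1 .. backup-7, oldest first
wanted = wanted_for_pruning(chain, keep_last=3)
allowed = cinder_deletable(chain)
blocked = [b for b in wanted if b not in allowed]

print("want to prune:", wanted)
print("Cinder allows:", allowed)
print("blocked:      ", blocked)
```

Every backup the policy wants to prune ends up in `blocked`, and the one backup Cinder will delete is the one we most want to keep.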

Also, we cannot force creation of another *full* (non-incremental)
backup in order to free all of the earlier backups for removal.
(Omitting the `--incremental` flag has no effect; you still get an
incremental backup.)

Can we hope for better? Let's reach behind Cinder to the Ceph backend.
Volume backups are represented as a "base" RBD image with a snapshot
for each incremental backup:

# rbd snap ls volume-e742c4e2-e331-4297-a7df-c25e729fdd83.backup.base
   577 backup.e3c1bcff-c1a4-450f-a2a5-a5061c8e3733.snap.1535046973.43 10240 MB Thu Aug 23 10:57:48 2018
   578 backup.93fbd83b-f34d-45bc-a378-18268c8c0a25.snap.1535047520.44 10240 MB Thu Aug 23 11:05:43 2018
   579 backup.b6bed35a-45e7-4df1-bc09-257aa01efe9b.snap.1535047564.46 10240 MB Thu Aug 23 11:06:47 2018
   580 backup.10128aba-0e18-40f1-acfb-11d7bb6cb487.snap.1535048513.71 10240 MB Thu Aug 23 11:22:23 2018
   581 backup.8cd035b9-63bf-4920-a8ec-c07ba370fb94.snap.1535048538.72 10240 MB Thu Aug 23 11:22:47 2018
   582 backup.cb7b6920-a79e-408e-b84f-5269d80235b2.snap.1535048559.82 10240 MB Thu Aug 23 11:23:04 2018
   583 backup.a7871768-1863-435f-be9d-b50af47c905a.snap.1535048588.26 10240 MB Thu Aug 23 11:23:31 2018
   584 backup.b18522e4-d237-4ee5-8786-78eac3d590de.snap.1535052729.52 10240 MB Thu Aug 23 12:32:43 2018
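
The snapshot names look like they follow a `backup.<uuid>.snap.<timestamp>` pattern. Here is a small sketch for pulling them apart. Two assumptions on my part, neither confirmed: that the UUID is the Cinder backup ID, and that the trailing number is a Unix timestamp in seconds. The second at least lines up with the dates in the listing.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the listing above, not taken from Cinder's source.
SNAP_RE = re.compile(
    r"^backup\.(?P<backup_id>[0-9a-f-]{36})\.snap\.(?P<ts>\d+(?:\.\d+)?)$"
)


def parse_snap_name(name):
    """Split an RBD backup snapshot name into (backup_id, creation time in UTC)."""
    match = SNAP_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized snapshot name: {name!r}")
    created = datetime.fromtimestamp(float(match.group("ts")), tz=timezone.utc)
    return match.group("backup_id"), created


backup_id, created = parse_snap_name(
    "backup.e3c1bcff-c1a4-450f-a2a5-a5061c8e3733.snap.1535046973.43"
)
print(backup_id, created.isoformat())
```

With the creation time in hand, choosing snapshots older than some cutoff is a simple comparison.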

It seems that each snapshot stands alone and doesn't depend on others.
Ceph lets me delete the older snapshots.

# rbd snap rm volume-e742c4e2-e331-4297-a7df-c25e729fdd83.backup.base@backup.e3c1bcff-c1a4-450f-a2a5-a5061c8e3733.snap.1535046973.43
Removing snap: 100% complete...done.
# rbd snap rm volume-e742c4e2-e331-4297-a7df-c25e729fdd83.backup.base@backup.10128aba-0e18-40f1-acfb-11d7bb6cb487.snap.1535048513.71
Removing snap: 100% complete...done.
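
`rbd snap rm` takes a snapshot spec of the form `image@snapshot`. If you wanted to repeat this experiment for several snapshots, a dry-run helper along these lines could print the commands for review before anything is run. `rbd_snap_rm_argv` is a hypothetical name, and the sketch deliberately executes nothing.

```python
import shlex


def rbd_snap_rm_argv(image, snap):
    """Build the argument list for `rbd snap rm image@snap` (not executed here)."""
    return ["rbd", "snap", "rm", f"{image}@{snap}"]


image = "volume-e742c4e2-e331-4297-a7df-c25e729fdd83.backup.base"
snaps = [
    "backup.e3c1bcff-c1a4-450f-a2a5-a5061c8e3733.snap.1535046973.43",
    "backup.10128aba-0e18-40f1-acfb-11d7bb6cb487.snap.1535048513.71",
]

commands = [rbd_snap_rm_argv(image, snap) for snap in snaps]
for argv in commands:
    # Print only; hand argv to subprocess.run yourself once you trust the list.
    print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.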

Now that we nuked backup-1 and backup-4, can we still restore from
backup-7 and launch an instance with it?

openstack volume create --size 10 --bootable volume-foo-restored
openstack volume backup restore backup-7 volume-foo-restored
openstack server create --volume volume-foo-restored --flavor medium1

Yes! We can SSH to the instance and it appears intact.

Perhaps each snapshot in Ceph stores a complete diff from the base RBD
image (rather than each successive snapshot depending on the last). If
this is true, then Cinder is unnecessarily protective of older
backups. Cinder represents these as "with dependents" and doesn't let
us touch them, even though Ceph will let us delete older RBD
snapshots, apparently without disrupting newer snapshots of the same
volume. If we could remove this limitation, Cinder backups would be
significantly more useful for us. We mostly host servers with
non-cloud-native workloads (IaaS for research scientists). For these,
full-disk backups at the infrastructure level are an important
supplement to file-level or application-level backups.

It would be great if someone else could confirm or disprove what I'm
seeing here. I'd also love to hear from anyone else using Cinder
backups this way.


Chris Martin at CyVerse