/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2017 by Delphix. All rights reserved.
 */

/*
 * Storage Pool Checkpoint
 *
 * A storage pool checkpoint can be thought of as a pool-wide snapshot or
 * a stable version of extreme rewind that guarantees no blocks from the
 * checkpointed state will have been overwritten. It remembers the entire
 * state of the storage pool (e.g. snapshots, dataset names, etc.) from the
 * point that it was taken and the user can rewind back to that point even if
 * they applied destructive operations on their datasets or even enabled new
 * zpool on-disk features. If a pool has a checkpoint that is no longer
 * needed, the user can discard it.
 *
 * == On disk data structures used ==
 *
 * - The pool has a new feature flag and a new entry in the MOS. The feature
 *   flag is set to active when we create the checkpoint and remains active
 *   until the checkpoint is fully discarded. The entry in the MOS config
 *   (DMU_POOL_ZPOOL_CHECKPOINT) is populated with the uberblock that
 *   references the state of the pool when we take the checkpoint. The entry
 *   remains populated until we start discarding the checkpoint or we rewind
 *   back to it.
 *
 * - Each vdev contains a vdev-wide space map while the pool has a checkpoint,
 *   which persists until the checkpoint is fully discarded. The space map
 *   contains entries that have been freed in the current state of the pool
 *   but we want to keep around in case we decide to rewind to the checkpoint.
 *   [see vdev_checkpoint_sm]
 *
 * - Each metaslab's ms_sm space map behaves the same as without the
 *   checkpoint, with the only exception being the scenario when we free
 *   blocks that belong to the checkpoint. In this case, these blocks remain
 *   ALLOCATED in the metaslab's space map and they are added as FREE in the
 *   vdev's checkpoint space map.
 *
 * - Each uberblock has a field (ub_checkpoint_txg) which holds the txg at
 *   which the uberblock was checkpointed. For normal uberblocks this field
 *   is 0.
 *
 * == Overview of operations ==
 *
 * - To create a checkpoint, we first wait for the current TXG to be synced,
 *   so we can use the most recently synced uberblock (spa_ubsync) as the
 *   checkpointed uberblock. Then we use an early synctask to place that
 *   uberblock in the MOS config, increment the feature flag for the
 *   checkpoint (marking it active), and set spa_checkpoint_txg (see its use
 *   below) to the TXG of the checkpointed uberblock. We use an early
 *   synctask for the aforementioned operations to ensure that no blocks were
 *   dirtied between the current TXG and the TXG of the checkpointed
 *   uberblock (e.g. the previous txg).
 *
 * - When a checkpoint exists, we need to ensure that the blocks that
 *   belong to the checkpoint are freed but never reused. This means that
 *   these blocks should never end up in the ms_allocatable or the ms_freeing
 *   trees of a metaslab. Therefore, whenever there is a checkpoint the new
 *   ms_checkpointing tree is used in addition to the aforementioned ones.
 *
 *   Whenever a block is freed and we find out that it is referenced by the
 *   checkpoint (we find out by comparing its birth to spa_checkpoint_txg),
 *   we place it in the ms_checkpointing tree instead of the ms_freeing
 *   tree. This way, we divide the blocks that are being freed into
 *   checkpointed and not-checkpointed blocks.
 *
 *   In order to persist these frees, we write the extents from the
 *   ms_freeing tree to the ms_sm as usual, and the extents from the
 *   ms_checkpointing tree to the vdev_checkpoint_sm. This way, these
 *   checkpointed extents will remain allocated in the metaslab's ms_sm space
 *   map, and therefore won't be reused [see metaslab_sync()]. In addition,
 *   when we discard the checkpoint, we can find the entries that have
 *   actually been freed in vdev_checkpoint_sm.
 *   [see spa_checkpoint_discard_thread_sync()]
 *
 * - To discard the checkpoint we use an early synctask to delete the
 *   checkpointed uberblock from the MOS config, set spa_checkpoint_txg to 0,
 *   and wake up the discarding zthr thread (an open-context async thread).
 *   We use an early synctask to ensure that the operation happens before any
 *   new data end up in the checkpoint's data structures.
 *
 *   Once the synctask is done and the discarding zthr is awake, we discard
 *   the checkpointed data over multiple TXGs by having the zthr prefetch
 *   entries from vdev_checkpoint_sm and then start a synctask that places
 *   them as free blocks into their respective ms_allocatable and ms_sm
 *   structures.
 *   [see spa_checkpoint_discard_thread()]
 *
 *   When there are no entries left in the vdev_checkpoint_sm of all
 *   top-level vdevs, a final synctask runs that decrements the feature flag.
 *
 * - To rewind to the checkpoint, we first use the current uberblock and
 *   open the MOS so we can access the checkpointed uberblock from the MOS
 *   config. After we retrieve the checkpointed uberblock, we use it as the
 *   current uberblock for the pool by writing it to disk with an updated
 *   TXG, opening its version of the MOS, and moving on as usual from there.
 *   [see spa_ld_checkpoint_rewind()]
 *
 *   An important note on rewinding to the checkpoint has to do with how we
 *   handle ZIL blocks. In the scenario of a rewind, we clear out any ZIL
 *   blocks that have not been claimed by the time we took the checkpoint
 *   as they should no longer be valid.
 *   [see comment in zil_claim()]
 *
 * == Miscellaneous information ==
 *
 * - In the hypothetical event that we take a checkpoint, remove a vdev,
 *   and attempt to rewind, the rewind would fail as the checkpointed
 *   uberblock would reference data in the removed device. For this reason
 *   and others of similar nature, we disallow the following operations that
 *   can change the config while a checkpoint exists:
 *   vdev removal and attach/detach, mirror splitting, and pool reguid.
 *
 * - As most of the checkpoint logic is implemented in the SPA and doesn't
 *   distinguish datasets when it comes to space accounting, having a
 *   checkpoint can potentially break the boundaries set by dataset
 *   reservations.
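 *
 * - For reference, these operations surface to userland through the
 *   zpool(8) subcommands; the sketch below reflects the documented
 *   interface and is not defined in this file:
 *       zpool checkpoint <pool>                        (create)
 *       zpool checkpoint -d <pool>                     (discard)
 *       zpool import --rewind-to-checkpoint <pool>     (rewind)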
 */

#include <sys/dmu_tx.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/metaslab_impl.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/spa_checkpoint.h>
#include <sys/vdev_impl.h>
#include <sys/zap.h>
#include <sys/zfeature.h>

/*
 * The following parameter limits the amount of memory to be used for the
 * prefetching of the checkpoint space map done on each vdev while
 * discarding the checkpoint.
 *
 * It exists because top-level vdevs with long checkpoint space maps can
 * potentially take up a lot of memory depending on the amount of
 * checkpointed data that has been freed within them while the pool had a
 * checkpoint.
 */
static uint64_t zfs_spa_discard_memory_limit = 16 * 1024 * 1024;

int
spa_checkpoint_get_stats(spa_t *spa, pool_checkpoint_stat_t *pcs)
{
	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_NO_CHECKPOINT));

	memset(pcs, 0, sizeof (pool_checkpoint_stat_t));

	int error = zap_contains(spa_meta_objset(spa),
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT);
	ASSERT(error == 0 || error == ENOENT);

	if (error == ENOENT)
		pcs->pcs_state = CS_CHECKPOINT_DISCARDING;
	else
		pcs->pcs_state = CS_CHECKPOINT_EXISTS;

	pcs->pcs_space = spa->spa_checkpoint_info.sci_dspace;
	pcs->pcs_start_time = spa->spa_checkpoint_info.sci_timestamp;

	return (0);
}

static void
spa_checkpoint_discard_complete_sync(void *arg, dmu_tx_t *tx)
{
	spa_t *spa = arg;

	spa->spa_checkpoint_info.sci_timestamp = 0;

	spa_feature_decr(spa, SPA_FEATURE_POOL_CHECKPOINT, tx);
	spa_notify_waiters(spa);

	spa_history_log_internal(spa, "spa discard checkpoint", tx,
	    "finished discarding checkpointed state from the pool");
}

typedef struct spa_checkpoint_discard_sync_callback_arg {
	vdev_t *sdc_vd;
	uint64_t sdc_txg;
	uint64_t sdc_entry_limit;
} spa_checkpoint_discard_sync_callback_arg_t;

static int
spa_checkpoint_discard_sync_callback(space_map_entry_t *sme, void *arg)
{
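	/*
	 * Called once per (non-debug) space map entry while incrementally
	 * destroying the checkpoint space map: return the extent to its
	 * metaslab's ms_freeing tree and update the checkpoint space
	 * accounting.
	 */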
	spa_checkpoint_discard_sync_callback_arg_t *sdc = arg;
	vdev_t *vd = sdc->sdc_vd;
	metaslab_t *ms = vd->vdev_ms[sme->sme_offset >> vd->vdev_ms_shift];
	uint64_t end = sme->sme_offset + sme->sme_run;

	if (sdc->sdc_entry_limit == 0)
		return (SET_ERROR(EINTR));

	/*
	 * Since the space map is not condensed, we know that
	 * none of its entries crosses the boundaries of its
	 * respective metaslab.
	 *
	 * That said, there is no fundamental requirement that
	 * the checkpoint's space map entries should not cross
	 * metaslab boundaries. So if needed we could add code
	 * that handles metaslab-crossing segments in the future.
	 */
	VERIFY3U(sme->sme_type, ==, SM_FREE);
	VERIFY3U(sme->sme_offset, >=, ms->ms_start);
	VERIFY3U(end, <=, ms->ms_start + ms->ms_size);

	/*
	 * At this point we should not be processing any
	 * other frees concurrently, so the lock is technically
	 * unnecessary. We use the lock anyway though to
	 * potentially save ourselves from future headaches.
	 */
	mutex_enter(&ms->ms_lock);
	if (range_tree_is_empty(ms->ms_freeing))
		vdev_dirty(vd, VDD_METASLAB, ms, sdc->sdc_txg);
	range_tree_add(ms->ms_freeing, sme->sme_offset, sme->sme_run);
	mutex_exit(&ms->ms_lock);

	ASSERT3U(vd->vdev_spa->spa_checkpoint_info.sci_dspace, >=,
	    sme->sme_run);
	ASSERT3U(vd->vdev_stat.vs_checkpoint_space, >=, sme->sme_run);

	vd->vdev_spa->spa_checkpoint_info.sci_dspace -= sme->sme_run;
	vd->vdev_stat.vs_checkpoint_space -= sme->sme_run;
	sdc->sdc_entry_limit--;

	return (0);
}

#ifdef ZFS_DEBUG
static void
spa_checkpoint_accounting_verify(spa_t *spa)
{
	vdev_t *rvd = spa->spa_root_vdev;
	uint64_t ckpoint_sm_space_sum = 0;
	uint64_t vs_ckpoint_space_sum = 0;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *vd = rvd->vdev_child[c];

		if (vd->vdev_checkpoint_sm != NULL) {
			ckpoint_sm_space_sum +=
			    -space_map_allocated(vd->vdev_checkpoint_sm);
			vs_ckpoint_space_sum +=
			    vd->vdev_stat.vs_checkpoint_space;
			ASSERT3U(ckpoint_sm_space_sum, ==,
			    vs_ckpoint_space_sum);
		} else {
			ASSERT0(vd->vdev_stat.vs_checkpoint_space);
		}
	}
	ASSERT3U(spa->spa_checkpoint_info.sci_dspace, ==, ckpoint_sm_space_sum);
}
#endif

static void
spa_checkpoint_discard_thread_sync(void *arg, dmu_tx_t *tx)
{
	vdev_t *vd = arg;
	int error;

	/*
	 * The space map callback is applied only to non-debug entries.
	 * Because the number of debug entries is less than or equal to
	 * the number of non-debug entries, we want to ensure that we only
	 * read what we prefetched from open-context.
	 *
	 * Thus, we set the maximum entries that the space map callback
	 * will be applied to be half the entries that could fit in the
	 * imposed memory limit.
	 *
	 * Note that since this is a conservative estimate we also
	 * assume the worst case scenario in our computation where each
	 * entry is two words.
	 */
	uint64_t max_entry_limit =
	    (zfs_spa_discard_memory_limit / (2 * sizeof (uint64_t))) >> 1;

	/*
	 * Iterate from the end of the space map towards the beginning,
	 * placing its entries on ms_freeing and removing them from the
	 * space map. The iteration stops if one of the following
	 * conditions is true:
	 *
	 * 1] We reached the beginning of the space map. At this point
	 *    the space map should be completely empty and
	 *    space_map_incremental_destroy should have returned 0.
	 *    The next step would be to free and close the space map
	 *    and remove its entry from its vdev's top zap. This allows
	 *    spa_checkpoint_discard_thread() to move on to the next vdev.
	 *
	 * 2] We reached the memory limit (amount of memory used to hold
	 *    space map entries in memory) and space_map_incremental_destroy
	 *    returned EINTR. This means that there are entries remaining
	 *    in the space map that will be cleared in a future invocation
	 *    of this function by spa_checkpoint_discard_thread().
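	 *
	 * As a worked example, assuming the default value of the
	 * zfs_spa_discard_memory_limit tunable: with a 16M limit and a
	 * worst-case two-word (16-byte) entry, max_entry_limit above
	 * comes out to (16M / 16) >> 1 = 524288 entries per synctask.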
	 */
	spa_checkpoint_discard_sync_callback_arg_t sdc;
	sdc.sdc_vd = vd;
	sdc.sdc_txg = tx->tx_txg;
	sdc.sdc_entry_limit = max_entry_limit;

	uint64_t words_before =
	    space_map_length(vd->vdev_checkpoint_sm) / sizeof (uint64_t);

	error = space_map_incremental_destroy(vd->vdev_checkpoint_sm,
	    spa_checkpoint_discard_sync_callback, &sdc, tx);

	uint64_t words_after =
	    space_map_length(vd->vdev_checkpoint_sm) / sizeof (uint64_t);

#ifdef ZFS_DEBUG
	spa_checkpoint_accounting_verify(vd->vdev_spa);
#endif

	zfs_dbgmsg("discarding checkpoint: txg %llu, vdev id %lld, "
	    "deleted %llu words - %llu words are left",
	    (u_longlong_t)tx->tx_txg, (longlong_t)vd->vdev_id,
	    (u_longlong_t)(words_before - words_after),
	    (u_longlong_t)words_after);

	if (error != EINTR) {
		if (error != 0) {
			zfs_panic_recover("zfs: error %lld was returned "
			    "while incrementally destroying the checkpoint "
			    "space map of vdev %llu\n",
			    (longlong_t)error, vd->vdev_id);
		}
		ASSERT0(words_after);
		ASSERT0(space_map_allocated(vd->vdev_checkpoint_sm));
		ASSERT0(space_map_length(vd->vdev_checkpoint_sm));

		space_map_free(vd->vdev_checkpoint_sm, tx);
		space_map_close(vd->vdev_checkpoint_sm);
		vd->vdev_checkpoint_sm = NULL;

		VERIFY0(zap_remove(spa_meta_objset(vd->vdev_spa),
		    vd->vdev_top_zap, VDEV_TOP_ZAP_POOL_CHECKPOINT_SM, tx));
	}
}

static boolean_t
spa_checkpoint_discard_is_done(spa_t *spa)
{
	vdev_t *rvd = spa->spa_root_vdev;

	ASSERT(!spa_has_checkpoint(spa));
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT));

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		if (rvd->vdev_child[c]->vdev_checkpoint_sm != NULL)
			return (B_FALSE);
		ASSERT0(rvd->vdev_child[c]->vdev_stat.vs_checkpoint_space);
	}

	return (B_TRUE);
}

boolean_t
spa_checkpoint_discard_thread_check(void *arg, zthr_t *zthr)
{
	(void) zthr;
	spa_t *spa = arg;

	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (B_FALSE);

	if (spa_has_checkpoint(spa))
		return (B_FALSE);
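
	/*
	 * The feature is still active but no checkpoint exists in the
	 * MOS, which means a discard is in progress and the zthr has
	 * work to do.
	 */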
	return (B_TRUE);
}

void
spa_checkpoint_discard_thread(void *arg, zthr_t *zthr)
{
	spa_t *spa = arg;
	vdev_t *rvd = spa->spa_root_vdev;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *vd = rvd->vdev_child[c];

		while (vd->vdev_checkpoint_sm != NULL) {
			space_map_t *checkpoint_sm = vd->vdev_checkpoint_sm;
			int numbufs;
			dmu_buf_t **dbp;

			if (zthr_iscancelled(zthr))
				return;

			ASSERT3P(vd->vdev_ops, !=, &vdev_indirect_ops);

			uint64_t size = MIN(space_map_length(checkpoint_sm),
			    zfs_spa_discard_memory_limit);
			uint64_t offset =
			    space_map_length(checkpoint_sm) - size;

			/*
			 * Ensure that the part of the space map that will
			 * be destroyed by the synctask is prefetched into
			 * memory before the synctask runs.
			 */
			int error = dmu_buf_hold_array_by_bonus(
			    checkpoint_sm->sm_dbuf, offset, size,
			    B_TRUE, FTAG, &numbufs, &dbp);
			if (error != 0) {
				zfs_panic_recover("zfs: error %d was returned "
				    "while prefetching checkpoint space map "
				    "entries of vdev %llu\n",
				    error, vd->vdev_id);
			}

			VERIFY0(dsl_sync_task(spa->spa_name, NULL,
			    spa_checkpoint_discard_thread_sync, vd,
			    0, ZFS_SPACE_CHECK_NONE));

			dmu_buf_rele_array(dbp, numbufs, FTAG);
		}
	}

	VERIFY(spa_checkpoint_discard_is_done(spa));
	VERIFY0(spa->spa_checkpoint_info.sci_dspace);
	VERIFY0(dsl_sync_task(spa->spa_name, NULL,
	    spa_checkpoint_discard_complete_sync, spa,
	    0, ZFS_SPACE_CHECK_NONE));
}


static int
spa_checkpoint_check(void *arg, dmu_tx_t *tx)
{
	(void) arg;
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	if (!spa_feature_is_enabled(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ENOTSUP));

	if (!spa_top_vdevs_spacemap_addressable(spa))
		return (SET_ERROR(ZFS_ERR_VDEV_TOO_BIG));

	if (spa->spa_removing_phys.sr_state == DSS_SCANNING)
		return (SET_ERROR(ZFS_ERR_DEVRM_IN_PROGRESS));

	if (spa->spa_raidz_expand != NULL)
		return (SET_ERROR(ZFS_ERR_RAIDZ_EXPAND_IN_PROGRESS));

	if (spa->spa_checkpoint_txg != 0)
		return (SET_ERROR(ZFS_ERR_CHECKPOINT_EXISTS));

	if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_DISCARDING_CHECKPOINT));

	return (0);
}

static void
spa_checkpoint_sync(void *arg, dmu_tx_t *tx)
{
	(void) arg;
	dsl_pool_t *dp = dmu_tx_pool(tx);
	spa_t *spa = dp->dp_spa;
	uberblock_t checkpoint = spa->spa_ubsync;

	/*
	 * At this point, there should not be a checkpoint in the MOS.
	 */
	ASSERT3U(zap_contains(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_ZPOOL_CHECKPOINT), ==, ENOENT);

	ASSERT0(spa->spa_checkpoint_info.sci_timestamp);
	ASSERT0(spa->spa_checkpoint_info.sci_dspace);

	/*
	 * Since the checkpointed uberblock is the one that just got synced
	 * (we use spa_ubsync), its txg must be one less than the txg that
	 * we are currently syncing.
	 */
	ASSERT3U(checkpoint.ub_txg, ==, spa->spa_syncing_txg - 1);

	/*
	 * Once the checkpoint is in place, we need to ensure that none of
	 * its blocks will be marked for reuse after it has been freed.
	 * When there is a checkpoint and a block is freed, we compare its
	 * birth txg to the txg of the checkpointed uberblock to see if the
	 * block is part of the checkpoint or not. Therefore, we have to set
	 * spa_checkpoint_txg before any frees happen in this txg (which is
	 * why this is done as an early_synctask as explained in the comment
	 * in spa_checkpoint()).
	 */
	spa->spa_checkpoint_txg = checkpoint.ub_txg;
	spa->spa_checkpoint_info.sci_timestamp = checkpoint.ub_timestamp;

	checkpoint.ub_checkpoint_txg = checkpoint.ub_txg;
	VERIFY0(zap_add(spa->spa_dsl_pool->dp_meta_objset,
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT,
	    sizeof (uint64_t), sizeof (uberblock_t) / sizeof (uint64_t),
	    &checkpoint, tx));

	/*
	 * Increment the feature refcount and thus activate the feature.
	 * Note that the feature will be deactivated when we've
	 * completely discarded all checkpointed state (both vdev
	 * space maps and uberblock).
	 */
	spa_feature_incr(spa, SPA_FEATURE_POOL_CHECKPOINT, tx);

	spa_history_log_internal(spa, "spa checkpoint", tx,
	    "checkpointed uberblock txg=%llu", (u_longlong_t)checkpoint.ub_txg);
}

/*
 * Create a checkpoint for the pool.
 */
int
spa_checkpoint(const char *pool)
{
	int error;
	spa_t *spa;

	error = spa_open(pool, &spa, FTAG);
	if (error != 0)
		return (error);

	mutex_enter(&spa->spa_vdev_top_lock);

	/*
	 * Wait for the current syncing txg to finish so the latest synced
	 * uberblock (spa_ubsync) has all the changes that we expect
	 * to see if we were to revert later to the checkpoint. In other
	 * words we want the checkpointed uberblock to include/reference
	 * all the changes that were pending at the time that we issued
	 * the checkpoint command.
	 */
	txg_wait_synced(spa_get_dsl(spa), 0);

	/*
	 * As the checkpointed uberblock references blocks from the previous
	 * txg (spa_ubsync) we want to ensure that we are not freeing any of
	 * these blocks in the same txg that the following synctask will
	 * run. Thus, we run it as an early synctask, so the dirty changes
	 * that are synced to disk afterwards during zios and other synctasks
	 * do not reuse checkpointed blocks.
	 */
	error = dsl_early_sync_task(pool, spa_checkpoint_check,
	    spa_checkpoint_sync, NULL, 0, ZFS_SPACE_CHECK_NORMAL);

	mutex_exit(&spa->spa_vdev_top_lock);

	spa_close(spa, FTAG);
	return (error);
}

static int
spa_checkpoint_discard_check(void *arg, dmu_tx_t *tx)
{
	(void) arg;
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_NO_CHECKPOINT));

	if (spa->spa_checkpoint_txg == 0)
		return (SET_ERROR(ZFS_ERR_DISCARDING_CHECKPOINT));

	VERIFY0(zap_contains(spa_meta_objset(spa),
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT));

	return (0);
}

static void
spa_checkpoint_discard_sync(void *arg, dmu_tx_t *tx)
{
	(void) arg;
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	VERIFY0(zap_remove(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_ZPOOL_CHECKPOINT, tx));

	spa->spa_checkpoint_txg = 0;

	zthr_wakeup(spa->spa_checkpoint_discard_zthr);

	spa_history_log_internal(spa, "spa discard checkpoint", tx,
	    "started discarding checkpointed state from the pool");
}

/*
 * Discard the checkpoint from a pool.
 */
int
spa_checkpoint_discard(const char *pool)
{
	/*
	 * Similarly to spa_checkpoint(), we want our synctask to run
	 * before any pending dirty data are written to disk so they
	 * won't end up in the checkpoint's data structures (e.g.
	 * ms_checkpointing and vdev_checkpoint_sm) and re-create any
	 * space maps that the discarding open-context thread has
	 * deleted.
	 * [see spa_checkpoint_discard_sync() and
	 * spa_checkpoint_discard_thread()]
	 */
	return (dsl_early_sync_task(pool, spa_checkpoint_discard_check,
	    spa_checkpoint_discard_sync, NULL, 0,
	    ZFS_SPACE_CHECK_DISCARD_CHECKPOINT));
}

EXPORT_SYMBOL(spa_checkpoint_get_stats);
EXPORT_SYMBOL(spa_checkpoint_discard_thread);
EXPORT_SYMBOL(spa_checkpoint_discard_thread_check);

ZFS_MODULE_PARAM(zfs_spa, zfs_spa_, discard_memory_limit, U64, ZMOD_RW,
	"Limit for memory used in prefetching the checkpoint space map done "
	"on each vdev while discarding the checkpoint");
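
/*
 * A minimal sketch of how an in-kernel consumer might query checkpoint
 * state through spa_checkpoint_get_stats() above (hypothetical caller,
 * not part of this file; locking and error handling elided):
 *
 *	pool_checkpoint_stat_t pcs;
 *	if (spa_checkpoint_get_stats(spa, &pcs) == 0)
 *		zfs_dbgmsg("checkpoint state %d uses %llu bytes",
 *		    (int)pcs.pcs_state, (u_longlong_t)pcs.pcs_space);
 */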