/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*
 * Copyright (c) 2012, 2018 by Delphix. All rights reserved.
 */

#include <sys/zfs_context.h>
#include <sys/vdev_impl.h>
#include <sys/spa_impl.h>
#include <sys/zio.h>
#include <sys/avl.h>
#include <sys/dsl_pool.h>
#include <sys/metaslab_impl.h>
#include <sys/spa.h>
#include <sys/abd.h>

/*
 * ZFS I/O Scheduler
 * ---------------
 *
 * ZFS issues I/O operations to leaf vdevs to satisfy and complete zios.  The
 * I/O scheduler determines when and in what order those operations are
 * issued.  The I/O scheduler divides operations into five I/O classes
 * prioritized in the following order: sync read, sync write, async read,
 * async write, and scrub/resilver.  Each queue defines the minimum and
 * maximum number of concurrent operations that may be issued to the device.
 * In addition, the device has an aggregate maximum.  Note that the sum of the
 * per-queue minimums must not exceed the aggregate maximum.  If the
 * sum of the per-queue maximums exceeds the aggregate maximum, then the
 * number of active i/os may reach zfs_vdev_max_active, in which case no
 * further i/os will be issued regardless of whether all per-queue
 * minimums have been met.
 *
 * For many physical devices, throughput increases with the number of
 * concurrent operations, but latency typically suffers.  Further, physical
 * devices typically have a limit at which more concurrent operations have no
 * effect on throughput or can actually cause it to decrease.
 *
 * The scheduler selects the next operation to issue by first looking for an
 * I/O class whose minimum has not been satisfied.  Once all are satisfied and
 * the aggregate maximum has not been hit, the scheduler looks for classes
 * whose maximum has not been satisfied.  Iteration through the I/O classes is
 * done in the order specified above.  No further operations are issued if the
 * aggregate maximum number of concurrent operations has been hit or if there
 * are no operations queued for an I/O class that has not hit its maximum.
 * Every time an i/o is queued or an operation completes, the I/O scheduler
 * looks for new operations to issue.
 *
 * All I/O classes have a fixed maximum number of outstanding operations
 * except for the async write class.  Asynchronous writes represent the data
 * that is committed to stable storage during the syncing stage for
 * transaction groups (see txg.c).  Transaction groups enter the syncing state
 * periodically so the number of queued async writes will quickly burst up and
 * then bleed down to zero.  Rather than servicing them as quickly as possible,
 * the I/O scheduler changes the maximum number of active async write i/os
 * according to the amount of dirty data in the pool (see dsl_pool.c).  Since
 * both throughput and latency typically increase with the number of
 * concurrent operations issued to physical devices, reducing the burstiness
 * in the number of concurrent operations also stabilizes the response time of
 * operations from other -- and in particular synchronous -- queues.  In broad
 * strokes, the I/O scheduler will issue more concurrent operations from the
 * async write queue as there's more dirty data in the pool.
 *
 * Async Writes
 *
 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent
 *
 * Until the amount of dirty data exceeds a minimum percentage of the dirty
 * data allowed in the pool, the I/O scheduler will limit the number of
 * concurrent operations to the minimum.  As that threshold is crossed, the
 * number of concurrent operations issued increases linearly to the maximum at
 * the specified maximum percentage of the dirty data allowed in the pool.
 *
 * Ideally, the amount of dirty data on a busy pool will stay in the sloped
 * part of the function between zfs_vdev_async_write_active_min_dirty_percent
 * and zfs_vdev_async_write_active_max_dirty_percent.  If it exceeds the
 * maximum percentage, this indicates that the rate of incoming data is
 * greater than the rate that the backend storage can handle.  In this case, we
 * must further throttle incoming writes (see dmu_tx_delay() for details).
 */

/*
 * The maximum number of i/os active to each device.  Ideally, this will be >=
 * the sum of each queue's max_active.
 */
uint_t zfs_vdev_max_active = 1000;

/*
 * Per-queue limits on the number of i/os active to each device.  If the
 * number of active i/os is < zfs_vdev_max_active, then the min_active comes
 * into play.  We will send min_active from each queue round-robin, and then
 * send from queues in the order defined by zio_priority_t up to max_active.
 * Some queues have additional mechanisms to limit the number of active I/Os in
 * addition to min_active and max_active, see below.
 *
 * In general, smaller max_active's will lead to lower latency of synchronous
 * operations.  Larger max_active's may lead to higher overall throughput,
 * depending on underlying storage.
 *
 * The ratio of the queues' max_actives determines the balance of performance
 * between reads, writes, and scrubs.  E.g., increasing
 * zfs_vdev_scrub_max_active will cause the scrub or resilver to complete
 * more quickly, but reads and writes to have higher latency and lower
 * throughput.
 */
static uint_t zfs_vdev_sync_read_min_active = 10;
static uint_t zfs_vdev_sync_read_max_active = 10;
static uint_t zfs_vdev_sync_write_min_active = 10;
static uint_t zfs_vdev_sync_write_max_active = 10;
static uint_t zfs_vdev_async_read_min_active = 1;
/*  */ uint_t zfs_vdev_async_read_max_active = 3;
static uint_t zfs_vdev_async_write_min_active = 2;
/*  */ uint_t zfs_vdev_async_write_max_active = 10;
static uint_t zfs_vdev_scrub_min_active = 1;
static uint_t zfs_vdev_scrub_max_active = 3;
static uint_t zfs_vdev_removal_min_active = 1;
static uint_t zfs_vdev_removal_max_active = 2;
static uint_t zfs_vdev_initializing_min_active = 1;
static uint_t zfs_vdev_initializing_max_active = 1;
static uint_t zfs_vdev_trim_min_active = 1;
static uint_t zfs_vdev_trim_max_active = 2;
static uint_t zfs_vdev_rebuild_min_active = 1;
static uint_t zfs_vdev_rebuild_max_active = 3;

/*
 * When the pool has less than zfs_vdev_async_write_active_min_dirty_percent
 * dirty data, use zfs_vdev_async_write_min_active.  When it has more than
 * zfs_vdev_async_write_active_max_dirty_percent, use
 * zfs_vdev_async_write_max_active.  The value is linearly interpolated
 * between min and max.
 */
uint_t zfs_vdev_async_write_active_min_dirty_percent = 30;
uint_t zfs_vdev_async_write_active_max_dirty_percent = 60;

/*
 * For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
 * the number of concurrently-active I/O's is limited to *_min_active, unless
 * the vdev is "idle".  When there are no interactive I/Os active (sync or
 * async), and zfs_vdev_nia_delay I/Os have completed since the last
 * interactive I/O, then the vdev is considered to be "idle", and the number
 * of concurrently-active non-interactive I/O's is increased to *_max_active.
 */
static uint_t zfs_vdev_nia_delay = 5;

/*
 * Some HDDs tend to prioritize sequential I/O so high that concurrent
 * random I/O latency reaches several seconds.  On some HDDs it happens
 * even if sequential I/Os are submitted one at a time, and so setting
 * *_max_active to 1 does not help.  To prevent non-interactive I/Os, like
 * scrub, from monopolizing the device, no more than zfs_vdev_nia_credit
 * I/Os can be sent while there are outstanding incomplete interactive
 * I/Os.  This enforced wait ensures the HDD services the interactive I/O
 * within a reasonable amount of time.
 */
static uint_t zfs_vdev_nia_credit = 5;

/*
 * To reduce IOPs, we aggregate small adjacent I/Os into one large I/O.
 * For read I/Os, we also aggregate across small adjacency gaps; for writes
 * we include spans of optional I/Os to aid aggregation at the disk even when
 * they aren't able to help us aggregate at this level.
 */
static uint_t zfs_vdev_aggregation_limit = 1 << 20;
static uint_t zfs_vdev_aggregation_limit_non_rotating = SPA_OLD_MAXBLOCKSIZE;
static uint_t zfs_vdev_read_gap_limit = 32 << 10;
static uint_t zfs_vdev_write_gap_limit = 4 << 10;

/*
 * Define the queue depth percentage for each top-level vdev.  This percentage
 * is used in conjunction with zfs_vdev_async_write_max_active to determine
 * how many allocations a specific top-level vdev should handle.  Once the
 * queue depth reaches zfs_vdev_queue_depth_pct *
 * zfs_vdev_async_write_max_active / 100, the allocator will stop allocating
 * blocks on that top-level device.
 * The default kernel setting is 1000% which will yield 100 allocations per
 * device.  For userland testing, the default setting is 300% which equates
 * to 30 allocations per device.
 */
#ifdef _KERNEL
uint_t zfs_vdev_queue_depth_pct = 1000;
#else
uint_t zfs_vdev_queue_depth_pct = 300;
#endif

/*
 * When performing allocations for a given metaslab, we want to make sure that
 * there are enough IOs to aggregate together to improve throughput.  We want
 * to ensure that there are at least 128k worth of IOs that can be aggregated,
 * and we assume that the average allocation size is 4k, so we need the queue
 * depth to be 32 per allocator to get good aggregation of sequential writes.
 */
uint_t zfs_vdev_def_queue_depth = 32;

static int
vdev_queue_offset_compare(const void *x1, const void *x2)
{
	const zio_t *z1 = (const zio_t *)x1;
	const zio_t *z2 = (const zio_t *)x2;

	int cmp = TREE_CMP(z1->io_offset, z2->io_offset);

	if (likely(cmp))
		return (cmp);

	return (TREE_PCMP(z1, z2));
}

/*
 * I/O timestamps are bucketed into 2^VDQ_T_SHIFT ns intervals (2^29 ns is
 * roughly half a second) so that LBA-sorted queues stay offset-ordered
 * within a bucket without starving older I/Os; see vdev_queue_to_compare()
 * and vdev_queue_io_to_issue().
 */
#define	VDQ_T_SHIFT 29

static int
vdev_queue_to_compare(const void *x1, const void *x2)
{
	const zio_t *z1 = (const zio_t *)x1;
	const zio_t *z2 = (const zio_t *)x2;

	int tcmp = TREE_CMP(z1->io_timestamp >> VDQ_T_SHIFT,
	    z2->io_timestamp >> VDQ_T_SHIFT);
	int ocmp = TREE_CMP(z1->io_offset, z2->io_offset);
	int cmp = tcmp ? tcmp : ocmp;

	if (likely(cmp | (z1->io_queue_state == ZIO_QS_NONE)))
		return (cmp);

	return (TREE_PCMP(z1, z2));
}

static inline boolean_t
vdev_queue_class_fifo(zio_priority_t p)
{
	return (p == ZIO_PRIORITY_SYNC_READ || p == ZIO_PRIORITY_SYNC_WRITE ||
	    p == ZIO_PRIORITY_TRIM);
}

static void
vdev_queue_class_add(vdev_queue_t *vq, zio_t *zio)
{
	zio_priority_t p = zio->io_priority;
	vq->vq_cqueued |= 1U << p;
	if (vdev_queue_class_fifo(p))
		list_insert_tail(&vq->vq_class[p].vqc_list, zio);
	else
		avl_add(&vq->vq_class[p].vqc_tree, zio);
}

static void
vdev_queue_class_remove(vdev_queue_t *vq, zio_t *zio)
{
	zio_priority_t p = zio->io_priority;
	uint32_t empty;
	if (vdev_queue_class_fifo(p)) {
		list_t *list = &vq->vq_class[p].vqc_list;
		list_remove(list, zio);
		empty = list_is_empty(list);
	} else {
		avl_tree_t *tree = &vq->vq_class[p].vqc_tree;
		avl_remove(tree, zio);
		empty = avl_is_empty(tree);
	}
	vq->vq_cqueued &= ~(empty << p);
}

static uint_t
vdev_queue_class_min_active(vdev_queue_t *vq, zio_priority_t p)
{
	switch (p) {
	case ZIO_PRIORITY_SYNC_READ:
		return (zfs_vdev_sync_read_min_active);
	case ZIO_PRIORITY_SYNC_WRITE:
		return (zfs_vdev_sync_write_min_active);
	case ZIO_PRIORITY_ASYNC_READ:
		return (zfs_vdev_async_read_min_active);
	case ZIO_PRIORITY_ASYNC_WRITE:
		return (zfs_vdev_async_write_min_active);
	case ZIO_PRIORITY_SCRUB:
		return (vq->vq_ia_active == 0 ? zfs_vdev_scrub_min_active :
		    MIN(vq->vq_nia_credit, zfs_vdev_scrub_min_active));
	case ZIO_PRIORITY_REMOVAL:
		return (vq->vq_ia_active == 0 ? zfs_vdev_removal_min_active :
		    MIN(vq->vq_nia_credit, zfs_vdev_removal_min_active));
	case ZIO_PRIORITY_INITIALIZING:
		return (vq->vq_ia_active == 0 ?
		    zfs_vdev_initializing_min_active :
		    MIN(vq->vq_nia_credit, zfs_vdev_initializing_min_active));
	case ZIO_PRIORITY_TRIM:
		return (zfs_vdev_trim_min_active);
	case ZIO_PRIORITY_REBUILD:
		return (vq->vq_ia_active == 0 ? zfs_vdev_rebuild_min_active :
		    MIN(vq->vq_nia_credit, zfs_vdev_rebuild_min_active));
	default:
		panic("invalid priority %u", p);
		return (0);
	}
}

static uint_t
vdev_queue_max_async_writes(spa_t *spa)
{
	uint_t writes;
	uint64_t dirty = 0;
	dsl_pool_t *dp = spa_get_dsl(spa);
	uint64_t min_bytes = zfs_dirty_data_max *
	    zfs_vdev_async_write_active_min_dirty_percent / 100;
	uint64_t max_bytes = zfs_dirty_data_max *
	    zfs_vdev_async_write_active_max_dirty_percent / 100;

	/*
	 * Async writes may occur before the assignment of the spa's
	 * dsl_pool_t if a self-healing zio is issued prior to the
	 * completion of dmu_objset_open_impl().
	 */
	if (dp == NULL)
		return (zfs_vdev_async_write_max_active);

	/*
	 * Sync tasks correspond to interactive user actions.  To reduce the
	 * execution time of those actions we push data out as fast as possible.
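	 * Hence the check below returns zfs_vdev_async_write_max_active both
	 * when a synctask is pending and when the amount of dirty data
	 * already exceeds max_bytes.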
	 */
	dirty = dp->dp_dirty_total;
	if (dirty > max_bytes || spa_has_pending_synctask(spa))
		return (zfs_vdev_async_write_max_active);

	if (dirty < min_bytes)
		return (zfs_vdev_async_write_min_active);

	/*
	 * linear interpolation:
	 * slope = (max_writes - min_writes) / (max_bytes - min_bytes)
	 * move right by min_bytes
	 * move up by min_writes
	 *
	 * E.g., with the defaults (min_active 2, max_active 10, thresholds at
	 * 30% and 60% of zfs_dirty_data_max), 45% dirty data yields
	 * (45 - 30) / (60 - 30) * (10 - 2) + 2 = 6 concurrent async writes.
	 */
	writes = (dirty - min_bytes) *
	    (zfs_vdev_async_write_max_active -
	    zfs_vdev_async_write_min_active) /
	    (max_bytes - min_bytes) +
	    zfs_vdev_async_write_min_active;
	ASSERT3U(writes, >=, zfs_vdev_async_write_min_active);
	ASSERT3U(writes, <=, zfs_vdev_async_write_max_active);
	return (writes);
}

static uint_t
vdev_queue_class_max_active(vdev_queue_t *vq, zio_priority_t p)
{
	switch (p) {
	case ZIO_PRIORITY_SYNC_READ:
		return (zfs_vdev_sync_read_max_active);
	case ZIO_PRIORITY_SYNC_WRITE:
		return (zfs_vdev_sync_write_max_active);
	case ZIO_PRIORITY_ASYNC_READ:
		return (zfs_vdev_async_read_max_active);
	case ZIO_PRIORITY_ASYNC_WRITE:
		return (vdev_queue_max_async_writes(vq->vq_vdev->vdev_spa));
	case ZIO_PRIORITY_SCRUB:
		if (vq->vq_ia_active > 0) {
			return (MIN(vq->vq_nia_credit,
			    zfs_vdev_scrub_min_active));
		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
			return (MAX(1, zfs_vdev_scrub_min_active));
		return (zfs_vdev_scrub_max_active);
	case ZIO_PRIORITY_REMOVAL:
		if (vq->vq_ia_active > 0) {
			return (MIN(vq->vq_nia_credit,
			    zfs_vdev_removal_min_active));
		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
			return (MAX(1, zfs_vdev_removal_min_active));
		return (zfs_vdev_removal_max_active);
	case ZIO_PRIORITY_INITIALIZING:
		if (vq->vq_ia_active > 0) {
			return (MIN(vq->vq_nia_credit,
			    zfs_vdev_initializing_min_active));
		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
			return (MAX(1, zfs_vdev_initializing_min_active));
		return (zfs_vdev_initializing_max_active);
	case ZIO_PRIORITY_TRIM:
		return (zfs_vdev_trim_max_active);
	case ZIO_PRIORITY_REBUILD:
		if (vq->vq_ia_active > 0) {
			return (MIN(vq->vq_nia_credit,
			    zfs_vdev_rebuild_min_active));
		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
			return (MAX(1, zfs_vdev_rebuild_min_active));
		return (zfs_vdev_rebuild_max_active);
	default:
		panic("invalid priority %u", p);
		return (0);
	}
}

/*
 * Return the i/o class to issue from, or ZIO_PRIORITY_NUM_QUEUEABLE if
 * there is no eligible class.
 */
static zio_priority_t
vdev_queue_class_to_issue(vdev_queue_t *vq)
{
	uint32_t cq = vq->vq_cqueued;
	zio_priority_t p, p1;

	if (cq == 0 || vq->vq_active >= zfs_vdev_max_active)
		return (ZIO_PRIORITY_NUM_QUEUEABLE);

	/*
	 * Find a queue that has not reached its minimum # outstanding i/os.
	 * Do round-robin to reduce starvation due to zfs_vdev_max_active
	 * and vq_nia_credit limits.
	 */
	p1 = vq->vq_last_prio + 1;
	if (p1 >= ZIO_PRIORITY_NUM_QUEUEABLE)
		p1 = 0;
	for (p = p1; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
		if ((cq & (1U << p)) != 0 && vq->vq_cactive[p] <
		    vdev_queue_class_min_active(vq, p))
			goto found;
	}
	for (p = 0; p < p1; p++) {
		if ((cq & (1U << p)) != 0 && vq->vq_cactive[p] <
		    vdev_queue_class_min_active(vq, p))
			goto found;
	}

	/*
	 * If we haven't found a queue, look for one that hasn't reached its
	 * maximum # outstanding i/os.
	 */
	for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
		if ((cq & (1U << p)) != 0 && vq->vq_cactive[p] <
		    vdev_queue_class_max_active(vq, p))
			break;
	}

found:
	vq->vq_last_prio = p;
	return (p);
}

void
vdev_queue_init(vdev_t *vd)
{
	vdev_queue_t *vq = &vd->vdev_queue;
	zio_priority_t p;

	vq->vq_vdev = vd;

	for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
		if (vdev_queue_class_fifo(p)) {
			list_create(&vq->vq_class[p].vqc_list,
			    sizeof (zio_t),
			    offsetof(struct zio, io_queue_node.l));
		} else {
			avl_create(&vq->vq_class[p].vqc_tree,
			    vdev_queue_to_compare, sizeof (zio_t),
			    offsetof(struct zio, io_queue_node.a));
		}
	}
	avl_create(&vq->vq_read_offset_tree,
	    vdev_queue_offset_compare, sizeof (zio_t),
	    offsetof(struct zio, io_offset_node));
	avl_create(&vq->vq_write_offset_tree,
	    vdev_queue_offset_compare, sizeof (zio_t),
	    offsetof(struct zio, io_offset_node));

	vq->vq_last_offset = 0;
	list_create(&vq->vq_active_list, sizeof (struct zio),
	    offsetof(struct zio, io_queue_node.l));
	mutex_init(&vq->vq_lock, NULL, MUTEX_DEFAULT, NULL);
}

void
vdev_queue_fini(vdev_t *vd)
{
	vdev_queue_t *vq = &vd->vdev_queue;

	for (zio_priority_t p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
		if (vdev_queue_class_fifo(p))
			list_destroy(&vq->vq_class[p].vqc_list);
		else
			avl_destroy(&vq->vq_class[p].vqc_tree);
	}
	avl_destroy(&vq->vq_read_offset_tree);
	avl_destroy(&vq->vq_write_offset_tree);

	list_destroy(&vq->vq_active_list);
	mutex_destroy(&vq->vq_lock);
}

static void
vdev_queue_io_add(vdev_queue_t *vq, zio_t *zio)
{
	zio->io_queue_state = ZIO_QS_QUEUED;
	vdev_queue_class_add(vq, zio);
	if (zio->io_type == ZIO_TYPE_READ)
		avl_add(&vq->vq_read_offset_tree, zio);
	else if (zio->io_type == ZIO_TYPE_WRITE)
		avl_add(&vq->vq_write_offset_tree, zio);
}

static void
vdev_queue_io_remove(vdev_queue_t *vq, zio_t *zio)
{
	vdev_queue_class_remove(vq, zio);
	if (zio->io_type == ZIO_TYPE_READ)
		avl_remove(&vq->vq_read_offset_tree, zio);
	else if (zio->io_type == ZIO_TYPE_WRITE)
		avl_remove(&vq->vq_write_offset_tree, zio);
	zio->io_queue_state = ZIO_QS_NONE;
}

static boolean_t
vdev_queue_is_interactive(zio_priority_t p)
{
	switch (p) {
	case ZIO_PRIORITY_SCRUB:
	case ZIO_PRIORITY_REMOVAL:
	case ZIO_PRIORITY_INITIALIZING:
	case ZIO_PRIORITY_REBUILD:
		return (B_FALSE);
	default:
		return (B_TRUE);
	}
}

static void
vdev_queue_pending_add(vdev_queue_t *vq, zio_t *zio)
{
	ASSERT(MUTEX_HELD(&vq->vq_lock));
	ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
	vq->vq_cactive[zio->io_priority]++;
	vq->vq_active++;
	if (vdev_queue_is_interactive(zio->io_priority)) {
		if (++vq->vq_ia_active == 1)
			vq->vq_nia_credit = 1;
	} else if (vq->vq_ia_active > 0) {
		vq->vq_nia_credit--;
	}
	zio->io_queue_state = ZIO_QS_ACTIVE;
	list_insert_tail(&vq->vq_active_list, zio);
}

static void
vdev_queue_pending_remove(vdev_queue_t *vq, zio_t *zio)
{
	ASSERT(MUTEX_HELD(&vq->vq_lock));
	ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
	vq->vq_cactive[zio->io_priority]--;
	vq->vq_active--;
	if (vdev_queue_is_interactive(zio->io_priority)) {
		if (--vq->vq_ia_active == 0)
			vq->vq_nia_credit = 0;
		else
			vq->vq_nia_credit = zfs_vdev_nia_credit;
	} else if (vq->vq_ia_active == 0)
		vq->vq_nia_credit++;
	list_remove(&vq->vq_active_list, zio);
	zio->io_queue_state = ZIO_QS_NONE;
}

static void
vdev_queue_agg_io_done(zio_t *aio)
{
	abd_free(aio->io_abd);
}

/*
 * Compute the range spanned by two i/os, which is the endpoint of the last
 * (lio->io_offset + lio->io_size) minus start of the first (fio->io_offset).
 * Conveniently, the gap between fio and lio is given by -IO_SPAN(lio, fio);
 * thus fio and lio are adjacent if and only if IO_SPAN(lio, fio) == 0.
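 *
 * For example, if fio is at offset 0 with size 4K and lio is at offset 8K
 * with size 4K, then IO_SPAN(fio, lio) = 8K + 4K - 0 = 12K, and
 * IO_GAP(fio, lio) = -(0 + 4K - 8K) = 4K.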
 */
#define	IO_SPAN(fio, lio) ((lio)->io_offset + (lio)->io_size - (fio)->io_offset)
#define	IO_GAP(fio, lio) (-IO_SPAN(lio, fio))

/*
 * Sufficiently adjacent io_offset's in ZIOs will be aggregated.  We do this
 * by creating a gang ABD from the adjacent ZIOs' io_abd's.  By using
 * a gang ABD we avoid doing memory copies to and from the parent and
 * child ZIOs.  The gang ABD also accounts for gaps between adjacent
 * io_offsets by simply getting the zero ABD for writes or allocating
 * a new ABD for reads and placing them in the gang ABD as well.
 */
static zio_t *
vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
{
	zio_t *first, *last, *aio, *dio, *mandatory, *nio;
	uint64_t maxgap = 0;
	uint64_t size;
	uint64_t limit;
	boolean_t stretch = B_FALSE;
	uint64_t next_offset;
	abd_t *abd;
	avl_tree_t *t;

	/*
	 * TRIM aggregation should not be needed since code in zfs_trim.c can
	 * submit TRIM I/O for extents up to zfs_trim_extent_bytes_max (128M).
	 */
	if (zio->io_type == ZIO_TYPE_TRIM)
		return (NULL);

	if (zio->io_flags & ZIO_FLAG_DONT_AGGREGATE)
		return (NULL);

	if (vq->vq_vdev->vdev_nonrot)
		limit = zfs_vdev_aggregation_limit_non_rotating;
	else
		limit = zfs_vdev_aggregation_limit;
	if (limit == 0)
		return (NULL);
	limit = MIN(limit, SPA_MAXBLOCKSIZE);

	/*
	 * I/Os to distributed spares are directly dispatched to the dRAID
	 * leaf vdevs for aggregation.  See the comment at the end of the
	 * zio_vdev_io_start() function.
	 */
	ASSERT(vq->vq_vdev->vdev_ops != &vdev_draid_spare_ops);

	first = last = zio;

	if (zio->io_type == ZIO_TYPE_READ) {
		maxgap = zfs_vdev_read_gap_limit;
		t = &vq->vq_read_offset_tree;
	} else {
		ASSERT3U(zio->io_type, ==, ZIO_TYPE_WRITE);
		t = &vq->vq_write_offset_tree;
	}

	/*
	 * We can aggregate I/Os that are sufficiently adjacent and of
	 * the same flavor, as expressed by the AGG_INHERIT flags.
	 * The latter requirement is necessary so that certain
	 * attributes of the I/O, such as whether it's a normal I/O
	 * or a scrub/resilver, can be preserved in the aggregate.
	 * We can include optional I/Os, but don't allow them
	 * to begin a range as they add no benefit in that situation.
	 */

	/*
	 * We keep track of the last non-optional I/O.
	 */
	mandatory = (first->io_flags & ZIO_FLAG_OPTIONAL) ? NULL : first;

	/*
	 * Walk backwards through sufficiently contiguous I/Os
	 * recording the last non-optional I/O.
	 */
	zio_flag_t flags = zio->io_flags & ZIO_FLAG_AGG_INHERIT;
	while ((dio = AVL_PREV(t, first)) != NULL &&
	    (dio->io_flags & ZIO_FLAG_AGG_INHERIT) == flags &&
	    IO_SPAN(dio, last) <= limit &&
	    IO_GAP(dio, first) <= maxgap &&
	    dio->io_type == zio->io_type) {
		first = dio;
		if (mandatory == NULL && !(first->io_flags & ZIO_FLAG_OPTIONAL))
			mandatory = first;
	}

	/*
	 * Skip any initial optional I/Os.
	 */
	while ((first->io_flags & ZIO_FLAG_OPTIONAL) && first != last) {
		first = AVL_NEXT(t, first);
		ASSERT(first != NULL);
	}

	/*
	 * Walk forward through sufficiently contiguous I/Os.
	 * The aggregation limit does not apply to optional i/os, so that
	 * we can issue contiguous writes even if they are larger than the
	 * aggregation limit.
	 */
	while ((dio = AVL_NEXT(t, last)) != NULL &&
	    (dio->io_flags & ZIO_FLAG_AGG_INHERIT) == flags &&
	    (IO_SPAN(first, dio) <= limit ||
	    (dio->io_flags & ZIO_FLAG_OPTIONAL)) &&
	    IO_SPAN(first, dio) <= SPA_MAXBLOCKSIZE &&
	    IO_GAP(last, dio) <= maxgap &&
	    dio->io_type == zio->io_type) {
		last = dio;
		if (!(last->io_flags & ZIO_FLAG_OPTIONAL))
			mandatory = last;
	}

	/*
	 * Now that we've established the range of the I/O aggregation
	 * we must decide what to do with trailing optional I/Os.
	 * For reads, there's nothing to do.  While we are unable to
	 * aggregate further, it's possible that a trailing optional
	 * I/O would allow the underlying device to aggregate with
	 * subsequent I/Os.  We must therefore determine if the next
	 * non-optional I/O is close enough to make aggregation
	 * worthwhile.
	 */
	if (zio->io_type == ZIO_TYPE_WRITE && mandatory != NULL) {
		zio_t *nio = last;
		while ((dio = AVL_NEXT(t, nio)) != NULL &&
		    IO_GAP(nio, dio) == 0 &&
		    IO_GAP(mandatory, dio) <= zfs_vdev_write_gap_limit) {
			nio = dio;
			if (!(nio->io_flags & ZIO_FLAG_OPTIONAL)) {
				stretch = B_TRUE;
				break;
			}
		}
	}

	if (stretch) {
		/*
		 * We are going to include an optional io in our aggregated
		 * span, thus closing the write gap.  Only mandatory i/os can
		 * start aggregated spans, so make sure that the next i/o
		 * after our span is mandatory.
		 */
		dio = AVL_NEXT(t, last);
		ASSERT3P(dio, !=, NULL);
		dio->io_flags &= ~ZIO_FLAG_OPTIONAL;
	} else {
		/* do not include the optional i/o */
		while (last != mandatory && last != first) {
			ASSERT(last->io_flags & ZIO_FLAG_OPTIONAL);
			last = AVL_PREV(t, last);
			ASSERT(last != NULL);
		}
	}

	if (first == last)
		return (NULL);

	size = IO_SPAN(first, last);
	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);

	abd = abd_alloc_gang();
	if (abd == NULL)
		return (NULL);

	aio = zio_vdev_delegated_io(first->io_vd, first->io_offset,
	    abd, size, first->io_type, zio->io_priority,
	    flags | ZIO_FLAG_DONT_QUEUE, vdev_queue_agg_io_done, NULL);
	aio->io_timestamp = first->io_timestamp;

	nio = first;
	next_offset = first->io_offset;
	do {
		dio = nio;
		nio = AVL_NEXT(t, dio);
		ASSERT3P(dio, !=, NULL);
		zio_add_child(dio, aio);
		vdev_queue_io_remove(vq, dio);

		if (dio->io_offset != next_offset) {
			/* allocate a buffer for a read gap */
			ASSERT3U(dio->io_type, ==, ZIO_TYPE_READ);
			ASSERT3U(dio->io_offset, >, next_offset);
			abd = abd_alloc_for_io(
			    dio->io_offset - next_offset, B_TRUE);
			abd_gang_add(aio->io_abd, abd, B_TRUE);
		}
		if (dio->io_abd &&
		    (dio->io_size != abd_get_size(dio->io_abd))) {
			/* abd size not the same as IO size */
			ASSERT3U(abd_get_size(dio->io_abd), >, dio->io_size);
			abd = abd_get_offset_size(dio->io_abd, 0, dio->io_size);
			abd_gang_add(aio->io_abd, abd, B_TRUE);
		} else {
			if (dio->io_flags & ZIO_FLAG_NODATA) {
				/* allocate a buffer for a write gap */
				ASSERT3U(dio->io_type, ==, ZIO_TYPE_WRITE);
				ASSERT3P(dio->io_abd, ==, NULL);
				abd_gang_add(aio->io_abd,
				    abd_get_zeros(dio->io_size), B_TRUE);
			} else {
				/*
				 * We pass B_FALSE to abd_gang_add()
				 * because we did not allocate a new
				 * ABD, so it is assumed the caller
				 * will free this ABD.
				 */
				abd_gang_add(aio->io_abd, dio->io_abd,
				    B_FALSE);
			}
		}
		next_offset = dio->io_offset + dio->io_size;
	} while (dio != last);
	ASSERT3U(abd_get_size(aio->io_abd), ==, aio->io_size);

	/*
	 * Callers must call zio_vdev_io_bypass() and zio_execute() for
	 * aggregated (parent) I/Os so that we can avoid dropping the
	 * queue's lock here, which would otherwise be needed to avoid a
	 * deadlock caused by the lock order reversal between vq_lock and
	 * io_lock in zio_change_priority().
	 */
	return (aio);
}

static zio_t *
vdev_queue_io_to_issue(vdev_queue_t *vq)
{
	zio_t *zio, *aio;
	zio_priority_t p;
	avl_index_t idx;
	avl_tree_t *tree;

again:
	ASSERT(MUTEX_HELD(&vq->vq_lock));

	p = vdev_queue_class_to_issue(vq);

	if (p == ZIO_PRIORITY_NUM_QUEUEABLE) {
		/* No eligible queued i/os */
		return (NULL);
	}

	if (vdev_queue_class_fifo(p)) {
		zio = list_head(&vq->vq_class[p].vqc_list);
	} else {
		/*
		 * For LBA-ordered queues (async / scrub / initializing),
		 * issue the I/O which follows the most recently issued I/O
		 * in LBA (offset) order, but to avoid starvation only within
		 * the same 0.5 second interval as the first I/O.
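		 * (The interval width comes from VDQ_T_SHIFT: timestamps are
		 * compared after shifting right by 29 bits, i.e. in 2^29 ns,
		 * roughly half-second, buckets.)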
		 */
		tree = &vq->vq_class[p].vqc_tree;
		zio = aio = avl_first(tree);
		if (zio->io_offset < vq->vq_last_offset) {
			vq->vq_io_search.io_timestamp = zio->io_timestamp;
			vq->vq_io_search.io_offset = vq->vq_last_offset;
			zio = avl_find(tree, &vq->vq_io_search, &idx);
			if (zio == NULL) {
				zio = avl_nearest(tree, idx, AVL_AFTER);
				if (zio == NULL ||
				    (zio->io_timestamp >> VDQ_T_SHIFT) !=
				    (aio->io_timestamp >> VDQ_T_SHIFT))
					zio = aio;
			}
		}
	}
	ASSERT3U(zio->io_priority, ==, p);

	aio = vdev_queue_aggregate(vq, zio);
	if (aio != NULL) {
		zio = aio;
	} else {
		vdev_queue_io_remove(vq, zio);

		/*
		 * If the I/O is or was optional and therefore has no data, we
		 * need to simply discard it.  We need to drop the vdev queue's
		 * lock to avoid a deadlock that we could encounter since this
		 * I/O will complete immediately.
		 */
		if (zio->io_flags & ZIO_FLAG_NODATA) {
			mutex_exit(&vq->vq_lock);
			zio_vdev_io_bypass(zio);
			zio_execute(zio);
			mutex_enter(&vq->vq_lock);
			goto again;
		}
	}

	vdev_queue_pending_add(vq, zio);
	vq->vq_last_offset = zio->io_offset + zio->io_size;

	return (zio);
}

zio_t *
vdev_queue_io(zio_t *zio)
{
	vdev_queue_t *vq = &zio->io_vd->vdev_queue;
	zio_t *dio, *nio;
	zio_link_t *zl = NULL;

	if (zio->io_flags & ZIO_FLAG_DONT_QUEUE)
		return (zio);

	/*
	 * Children i/os inherit their parent's priority, which might
	 * not match the child's i/o type.  Fix it up here.
	 */
	if (zio->io_type == ZIO_TYPE_READ) {
		ASSERT(zio->io_priority != ZIO_PRIORITY_TRIM);

		if (zio->io_priority != ZIO_PRIORITY_SYNC_READ &&
		    zio->io_priority != ZIO_PRIORITY_ASYNC_READ &&
		    zio->io_priority != ZIO_PRIORITY_SCRUB &&
		    zio->io_priority != ZIO_PRIORITY_REMOVAL &&
		    zio->io_priority != ZIO_PRIORITY_INITIALIZING &&
		    zio->io_priority != ZIO_PRIORITY_REBUILD) {
			zio->io_priority = ZIO_PRIORITY_ASYNC_READ;
		}
	} else if (zio->io_type == ZIO_TYPE_WRITE) {
		ASSERT(zio->io_priority != ZIO_PRIORITY_TRIM);

		if (zio->io_priority != ZIO_PRIORITY_SYNC_WRITE &&
		    zio->io_priority != ZIO_PRIORITY_ASYNC_WRITE &&
		    zio->io_priority != ZIO_PRIORITY_REMOVAL &&
		    zio->io_priority != ZIO_PRIORITY_INITIALIZING &&
		    zio->io_priority != ZIO_PRIORITY_REBUILD) {
			zio->io_priority = ZIO_PRIORITY_ASYNC_WRITE;
		}
	} else {
		ASSERT(zio->io_type == ZIO_TYPE_TRIM);
		ASSERT(zio->io_priority == ZIO_PRIORITY_TRIM);
	}

	zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
	zio->io_timestamp = gethrtime();

	mutex_enter(&vq->vq_lock);
	vdev_queue_io_add(vq, zio);
	nio = vdev_queue_io_to_issue(vq);
	mutex_exit(&vq->vq_lock);

	if (nio == NULL)
		return (NULL);

	if (nio->io_done == vdev_queue_agg_io_done) {
		while ((dio = zio_walk_parents(nio, &zl)) != NULL) {
			ASSERT3U(dio->io_type, ==, nio->io_type);
			zio_vdev_io_bypass(dio);
			zio_execute(dio);
		}
		zio_nowait(nio);
		return (NULL);
	}

	return (nio);
}

void
vdev_queue_io_done(zio_t *zio)
{
	vdev_queue_t *vq = &zio->io_vd->vdev_queue;
	zio_t *dio, *nio;
	zio_link_t *zl = NULL;

	hrtime_t now = gethrtime();
	vq->vq_io_complete_ts = now;
	vq->vq_io_delta_ts = zio->io_delta = now - zio->io_timestamp;

	mutex_enter(&vq->vq_lock);
	vdev_queue_pending_remove(vq, zio);

	while ((nio = vdev_queue_io_to_issue(vq)) != NULL) {
		mutex_exit(&vq->vq_lock);
		if (nio->io_done == vdev_queue_agg_io_done) {
			while ((dio = zio_walk_parents(nio, &zl)) != NULL) {
				ASSERT3U(dio->io_type, ==, nio->io_type);
				zio_vdev_io_bypass(dio);
				zio_execute(dio);
			}
			zio_nowait(nio);
		} else {
			zio_vdev_io_reissue(nio);
			zio_execute(nio);
		}
		mutex_enter(&vq->vq_lock);
	}

	mutex_exit(&vq->vq_lock);
}

void
vdev_queue_change_io_priority(zio_t *zio, zio_priority_t priority)
{
	vdev_queue_t *vq = &zio->io_vd->vdev_queue;

	/*
	 * ZIO_PRIORITY_NOW is used by the vdev cache code and the aggregate zio
	 * code to issue IOs without adding them to the vdev queue.  In this
	 * case, the zio is already going to be issued as quickly as possible
	 * and so it doesn't need any reprioritization to help.
	 */
	if (zio->io_priority == ZIO_PRIORITY_NOW)
		return;

	ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
	ASSERT3U(priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);

	if (zio->io_type == ZIO_TYPE_READ) {
		if (priority != ZIO_PRIORITY_SYNC_READ &&
		    priority != ZIO_PRIORITY_ASYNC_READ &&
		    priority != ZIO_PRIORITY_SCRUB)
			priority = ZIO_PRIORITY_ASYNC_READ;
	} else {
		ASSERT(zio->io_type == ZIO_TYPE_WRITE);
		if (priority != ZIO_PRIORITY_SYNC_WRITE &&
		    priority != ZIO_PRIORITY_ASYNC_WRITE)
			priority = ZIO_PRIORITY_ASYNC_WRITE;
	}

	mutex_enter(&vq->vq_lock);

	/*
	 * If the zio is in none of the queues we can simply change
	 * the priority.  If the zio is waiting to be submitted we must
	 * remove it from the queue and re-insert it with the new priority.
	 * Otherwise, the zio is currently active and we cannot change its
	 * priority.
	 */
	if (zio->io_queue_state == ZIO_QS_QUEUED) {
		vdev_queue_class_remove(vq, zio);
		zio->io_priority = priority;
		vdev_queue_class_add(vq, zio);
	} else if (zio->io_queue_state == ZIO_QS_NONE) {
		zio->io_priority = priority;
	}

	mutex_exit(&vq->vq_lock);
}

/*
 * As these two methods are only used for load calculations, we're not
 * concerned if we get an incorrect value on 32-bit platforms due to the lack
 * of vq_lock mutex use here; instead we prefer to keep it lock free for
 * performance.
 */
uint32_t
vdev_queue_length(vdev_t *vd)
{
	return (vd->vdev_queue.vq_active);
}

uint64_t
vdev_queue_last_offset(vdev_t *vd)
{
	return (vd->vdev_queue.vq_last_offset);
}

uint64_t
vdev_queue_class_length(vdev_t *vd, zio_priority_t p)
{
	vdev_queue_t *vq = &vd->vdev_queue;
	if (vdev_queue_class_fifo(p))
		return (list_is_empty(&vq->vq_class[p].vqc_list) == 0);
	else
		return (avl_numnodes(&vq->vq_class[p].vqc_tree));
}

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, aggregation_limit, UINT, ZMOD_RW,
	"Max vdev I/O aggregation size");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, aggregation_limit_non_rotating, UINT,
	ZMOD_RW, "Max vdev I/O aggregation size for non-rotating media");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, read_gap_limit, UINT, ZMOD_RW,
	"Aggregate read I/O over gap");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, write_gap_limit, UINT, ZMOD_RW,
	"Aggregate write I/O over gap");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, max_active, UINT, ZMOD_RW,
	"Maximum number of active I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_write_active_max_dirty_percent,
	UINT, ZMOD_RW, "Async write concurrency max threshold");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_write_active_min_dirty_percent,
	UINT, ZMOD_RW, "Async write concurrency min threshold");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_read_max_active, UINT, ZMOD_RW,
	"Max active async read I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_read_min_active, UINT, ZMOD_RW,
	"Min active async read I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_write_max_active, UINT, ZMOD_RW,
	"Max active async write I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, async_write_min_active, UINT, ZMOD_RW,
	"Min active async write I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, initializing_max_active, UINT, ZMOD_RW,
	"Max active initializing I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, initializing_min_active, UINT, ZMOD_RW,
	"Min active initializing I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, removal_max_active, UINT, ZMOD_RW,
	"Max active removal I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, removal_min_active, UINT, ZMOD_RW,
	"Min active removal I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, scrub_max_active, UINT, ZMOD_RW,
	"Max active scrub I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, scrub_min_active, UINT, ZMOD_RW,
	"Min active scrub I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, sync_read_max_active, UINT, ZMOD_RW,
	"Max active sync read I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, sync_read_min_active, UINT, ZMOD_RW,
	"Min active sync read I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, sync_write_max_active, UINT, ZMOD_RW,
	"Max active sync write I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, sync_write_min_active, UINT, ZMOD_RW,
	"Min active sync write I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, trim_max_active, UINT, ZMOD_RW,
	"Max active trim/discard I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, trim_min_active, UINT, ZMOD_RW,
	"Min active trim/discard I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_max_active, UINT, ZMOD_RW,
	"Max active rebuild I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_min_active, UINT, ZMOD_RW,
	"Min active rebuild I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_credit, UINT, ZMOD_RW,
	"Number of non-interactive I/Os to allow in sequence");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_delay, UINT, ZMOD_RW,
	"Number of non-interactive I/Os before _max_active");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_depth_pct, UINT, ZMOD_RW,
	"Queue depth percentage for each top-level vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, def_queue_depth, UINT, ZMOD_RW,
	"Default queue depth for each allocator");