.\" $NetBSD: bufferio.9,v 1.18 2019/09/12 21:08:35 sevan Exp $
.\"
.\" Copyright (c) 2015 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Taylor R. Campbell.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd September 12, 2019
.Dt BUFFERIO 9
.Os
.Sh NAME
.Nm BUFFERIO ,
.Nm biodone ,
.Nm biowait ,
.Nm getiobuf ,
.Nm putiobuf ,
.Nm nestiobuf_setup ,
.Nm nestiobuf_done
.Nd block I/O buffer transfers
.Sh SYNOPSIS
.In sys/buf.h
.Ft void
.Fn biodone "buf_t *bp"
.Ft int
.Fn biowait "buf_t *bp"
.Ft buf_t *
.Fn getiobuf "struct vnode *vp" "bool waitok"
.Ft void
.Fn putiobuf "buf_t *bp"
.Ft void
.Fn nestiobuf_setup "buf_t *mbp" "buf_t *bp" "int offset" \
    "size_t size"
.Ft void
.Fn nestiobuf_done "buf_t *mbp" "int donebytes" "int error"
.Sh DESCRIPTION
The
.Nm
subsystem manages block I/O buffer transfers, described by the
.Vt "struct buf"
structure, which serves multiple purposes between users in
.Nm ,
users in
.Xr buffercache 9 ,
and users in block device drivers to execute transfers to physical
disks.
.Sh BLOCK DEVICE USERS
Users of
.Nm
wishing to submit a buffer for block I/O transfer must obtain a
.Vt "struct buf" ,
e.g. via
.Fn getiobuf ,
fill its parameters, and submit it to a block device with
.Xr bdev_strategy 9 ,
usually via
.Xr VOP_STRATEGY 9 .
.Pp
The parameters to an I/O transfer described by
.Fa bp
are specified by the following
.Vt "struct buf"
fields:
.Bl -tag -width 6n -offset abcd
.It Fa bp Ns Li "->b_flags"
Flags specifying the type of transfer.
.Bl -tag -width 6n -compact
.It Dv B_READ
Transfer is read from device.
If not set, transfer is write to device.
.It Dv B_ASYNC
Asynchronous I/O.
Caller must not provide
.Fa bp Ns Li "->b_iodone"
and must not call
.Fn biowait bp .
.El
For legibility, callers should indicate writes by passing the
pseudo-flag
.Dv B_WRITE ,
which is zero.
.It Fa bp Ns Li "->b_data"
Pointer to kernel virtual address of source/target for transfer.
.It Fa bp Ns Li "->b_bcount"
Nonnegative number of bytes requested for transfer.
.It Fa bp Ns Li "->b_blkno"
Block number at which to do transfer.
.It Fa bp Ns Li "->b_iodone"
I/O completion callback.
.Dv B_ASYNC
must not be set in
.Fa bp Ns Li "->b_flags" .
.El
.Pp
Additionally, if the I/O transfer is a write associated with a
.Xr vnode 9
.Fa vp ,
then before the user submits it to a block device, the user must
increment
.Fa vp Ns Li "->v_numoutput" .
The user must not acquire
.Fa vp Ns Ap s
vnode lock between incrementing
.Fa vp Ns Li "->v_numoutput"
and submitting
.Fa bp
to a block device \(em doing so will likely cause deadlock with the
syncer.
.Pp
Block I/O transfer completion may be notified by the
.Fa bp Ns Li "->b_iodone"
callback, by signalling
.Fn biowait
waiters, or not at all in the
.Dv B_ASYNC
case.
.Bl -dash
.It
If the user sets the
.Fa bp Ns Li "->b_iodone"
callback to a
.Pf non- Dv NULL
function pointer, it will be called in soft interrupt context when the
I/O transfer is complete.
The user
.Em may not
call
.Fn biowait bp
in this case.
.It
If
.Dv B_ASYNC
is set, then the I/O transfer is asynchronous and the user will not be
notified when it is completed.
The user
.Em may not
call
.Fn biowait bp
in this case.
.It
Otherwise, if
.Fa bp Ns Li "->b_iodone"
is
.Dv NULL
and
.Dv B_ASYNC
is not specified, the user may wait for the I/O transfer to complete
with
.Fn biowait bp .
.El
.Pp
Once an I/O transfer has completed, its
.Vt "struct buf"
may be reused, but the user must first clear the
.Dv BO_DONE
flag of
.Fa bp Ns Li "->b_oflags"
before reusing it.
.Sh NESTED I/O TRANSFERS
Sometimes an I/O transfer from a single buffer in memory cannot go to a
single location on a block device: it must be split up into smaller
transfers for each segment of the memory buffer.
.Pp
After initializing the
.Li b_flags ,
.Li b_data ,
and
.Li b_bcount
parameters of an I/O transfer for the buffer, called the
.Em master
buffer, the user can issue smaller transfers for segments of the buffer
using
.Fn nestiobuf_setup .
When nested I/O transfers complete, in any order, they debit from the
amount of work left to be done in the master buffer.
If any segments of the buffer were skipped, the user can report this
with
.Fn nestiobuf_done
to debit the skipped part of the work.
.Pp
The master buffer's I/O transfer is completed when all nested buffers'
I/O transfers are completed, and if
.Fn nestiobuf_done
is called in the case of skipped segments.
.Pp
For writes associated with a vnode
.Fa vp ,
.Fn nestiobuf_setup
accounts for
.Fa vp Ns Li "->v_numoutput" ,
so the caller is not allowed to acquire
.Fa vp Ns Ap s
vnode lock before submitting the nested I/O transfer to a block
device.
However, the caller is responsible for accounting the master buffer in
.Fa vp Ns Li "->v_numoutput" .
This must be done very carefully because after incrementing
.Fa vp Ns Li "->v_numoutput" ,
the caller is not allowed to acquire
.Fa vp Ns Ap s
vnode lock before either calling
.Fn nestiobuf_done
or submitting the last nested I/O transfer to a block device.
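.Pp
The debit accounting of a master buffer can be illustrated with a small
userland model.
This is only a sketch under stated assumptions:
.Vt "struct model_master" ,
.Fn model_master_init ,
and
.Fn model_debit
are hypothetical names that do not exist in the kernel, the model
ignores locking, and keeping the first reported error is a modelling
choice rather than a statement about the implementation.
.Bd -literal -offset abcd
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a master buffer; not struct buf. */
struct model_master {
	size_t resid;	/* bytes of work not yet accounted for */
	int error;	/* first nonzero error reported (assumption) */
	bool done;	/* true once all work has been debited */
};

static void
model_master_init(struct model_master *m, size_t bcount)
{
	m->resid = bcount;
	m->error = 0;
	m->done = false;
}

/*
 * Debit donebytes from the master, as a completed nested transfer
 * or a skipped segment would.
 */
static void
model_debit(struct model_master *m, size_t donebytes, int error)
{
	assert(donebytes <= m->resid);
	if (m->error == 0)
		m->error = error;
	m->resid -= donebytes;
	if (m->resid == 0)
		m->done = true;	/* master would be completed here */
}
.Ed
.Pp
In this model, a master of three 512-byte segments becomes done only
after the third call to
.Fn model_debit ,
whatever the order of the calls, and a skipped segment is debited in
the same way as a completed one.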
.Pp
For example, to write
.Va nsegs
segments of
.Va segsz
bytes each:
.Bd -literal -offset abcd
struct buf *mbp, *bp = NULL;
struct vnode *devvp = NULL;	/* Used after the loop, too. */
size_t skipped = 0;
unsigned i;
int error = 0;

mbp = getiobuf(vp, true);
mbp->b_data = data;
mbp->b_resid = mbp->b_bcount = datalen;
mbp->b_flags = B_WRITE;

KASSERT(0 < nsegs);
KASSERT(datalen == nsegs*segsz);
for (i = 0; i < nsegs; i++) {
	daddr_t blkno;

	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
	error = VOP_BMAP(vp, i*segsz, &devvp, &blkno, NULL);
	VOP_UNLOCK(vp);
	if (error == 0 && blkno == -1)
		error = EIO;
	if (error) {
		/* Give up early, don't try to handle holes. */
		skipped += datalen - i*segsz;
		break;
	}

	bp = getiobuf(vp, true);
	/* Master buffer first, then the nested buffer. */
	nestiobuf_setup(mbp, bp, i*segsz, segsz);
	bp->b_blkno = blkno;
	if (i == nsegs - 1)	/* Last segment. */
		break;
	VOP_STRATEGY(devvp, bp);
}

/*
 * Account v_numoutput for master write.
 * (Must not vn_lock before last VOP_STRATEGY!)
 */
mutex_enter(vp->v_interlock);
vp->v_numoutput++;
mutex_exit(vp->v_interlock);

if (skipped)
	nestiobuf_done(mbp, skipped, error);
else
	VOP_STRATEGY(devvp, bp);
.Ed
.Sh BLOCK DEVICE DRIVERS
Block device drivers implement a
.Sq strategy
method, in the
.Li d_strategy
member of
.Li struct bdevsw
.Pq Xr driver 9 ,
to queue a buffer for disk I/O.
The inputs to the strategy method are:
.Bl -tag -width 6n -offset abcd
.It Fa bp Ns Li "->b_flags"
Flags specifying the type of transfer.
.Bl -tag -width 6n -compact
.It Dv B_READ
Transfer is read from device.
If not set, transfer is write to device.
.El
.It Fa bp Ns Li "->b_data"
Pointer to kernel virtual address of source/target for transfer.
.It Fa bp Ns Li "->b_bcount"
Nonnegative number of bytes requested for transfer.
.It Fa bp Ns Li "->b_blkno"
Block number at which to do transfer, relative to partition start.
.El
.Pp
If the strategy method uses
.Xr bufq 9 ,
it must additionally initialize the following fields before queueing
.Fa bp
with
.Xr bufq_put 9 :
.Bl -tag -width 6n -offset abcd
.It Fa bp Ns Li "->b_rawblkno"
Block number relative to volume start.
.El
.Pp
When the I/O transfer is complete, whether it succeeded or failed, the
strategy method must:
.Bl -dash
.It
Set
.Fa bp Ns Li "->b_error"
to zero on success, or to an
.Xr errno 2
error code on failure.
.It
Set
.Fa bp Ns Li "->b_resid"
to the number of bytes remaining to transfer, whether on success or
on failure.
If no bytes were transferred, this must be set to
.Fa bp Ns Li "->b_bcount" .
.It
Call
.Fn biodone bp .
.El
.Sh FUNCTIONS
.Bl -tag -width abcd
.It Fn biodone bp
Notify that the I/O transfer described by
.Fa bp
has completed.
.Pp
To be called by a block device driver.
Caller must first set
.Fa bp Ns Li "->b_error"
to an error code and
.Fa bp Ns Li "->b_resid"
to the number of bytes remaining to transfer.
.It Fn biowait bp
Wait for the synchronous I/O transfer described by
.Fa bp
to complete.
Returns the value of
.Fa bp Ns Li "->b_error" .
.Pp
To be called by a user requesting the I/O transfer.
.Pp
May not be called if
.Fa bp
has a callback or is asynchronous \(em that is, if
.Fa bp Ns Li "->b_iodone"
is set, or if
.Dv B_ASYNC
is set in
.Fa bp Ns Li "->b_flags" .
.It Fn getiobuf vp waitok
Allocate a
.Vt "struct buf"
for an I/O transfer.
If
.Fa vp
is
.Pf non- Dv NULL ,
the transfer is associated with it.
If
.Fa waitok
is false,
returns
.Dv NULL
if none can be allocated immediately.
.Pp
The resulting
.Vt "struct buf"
pointer must eventually be passed to
.Fn putiobuf
to release it.
Do
.Em not
use
.Xr brelse 9 .
.Pp
The buffer may not be used for an asynchronous I/O transfer, because
there is no way to know when it is completed and may be safely passed
to
.Fn putiobuf .
Asynchronous I/O transfers are allowed only for buffers in the
.Xr buffercache 9 .
.Pp
May sleep if
.Fa waitok
is true.
.It Fn putiobuf bp
Free
.Fa bp ,
which must have been allocated by
.Fn getiobuf .
Either
.Fa bp
must never have been submitted to a block device, or the I/O transfer
must have completed.
.El
.Sh CODE REFERENCES
The
.Nm
subsystem is implemented in
.Pa sys/kern/vfs_bio.c .
.Sh SEE ALSO
.Xr buffercache 9 ,
.Xr bufq 9
.Sh BUGS
The
.Nm
abstraction provides no way to cancel an I/O transfer once it has been
submitted to a block device.
.Pp
The
.Nm
abstraction provides no way to do I/O transfers with non-kernel pages,
e.g. directly to buffers in userland without copying into the kernel
first.
.Pp
The
.Vt "struct buf"
type is all mixed up with the
.Xr buffercache 9 .
.Pp
The
.Nm
abstraction is a totally idiotic API design.
.Pp
The
.Li v_numoutput
accounting required of
.Nm
callers is asinine.