.\" Copyright (c) 2001-2003 International Computer Science Institute
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" The names and trademarks of copyright holders may not be used in
.\" advertising or publicity pertaining to the software without specific
.\" prior permission. Title to copyright in this software and any associated
.\" documentation will at all times remain with the copyright holders.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
.\" AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.\" $FreeBSD: /repoman/r/ncvs/src/share/man/man4/multicast.4,v 1.1 2003/10/17 15:12:01 bmah Exp $
.\" $DragonFly: src/share/man/man4/multicast.4,v 1.7 2008/05/02 02:05:05 swildner Exp $
.\"
.Dd September 4, 2003
.Dt MULTICAST 4
.Os
.\"
.Sh NAME
.Nm multicast
.Nd Multicast Routing
.\"
.Sh SYNOPSIS
.Cd "options MROUTING"
.Pp
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.In net/ip_mroute/ip_mroute.h
.In netinet6/ip6_mroute.h
.Ft int
.Fn getsockopt "int s" IPPROTO_IP MRT_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IP MRT_INIT "const void *optval" "socklen_t optlen"
.Ft int
.Fn getsockopt "int s" IPPROTO_IPV6 MRT6_INIT "void *optval" "socklen_t *optlen"
.Ft int
.Fn setsockopt "int s" IPPROTO_IPV6 MRT6_INIT "const void *optval" "socklen_t optlen"
.Sh DESCRIPTION
.Tn "Multicast routing"
is used to efficiently propagate data
packets to a set of multicast listeners in multipoint networks.
If unicast is used to replicate the data to all listeners,
then some of the network links may carry multiple copies of the same
data packets.
With multicast routing, the overhead is reduced to one copy
(at most) per network link.
.Pp
All multicast-capable routers must run a common multicast routing
protocol.
The Distance Vector Multicast Routing Protocol (DVMRP)
was the first multicast routing protocol to be developed.
Later, other protocols such as Multicast Extensions to OSPF (MOSPF),
Core Based Trees (CBT),
Protocol Independent Multicast - Sparse Mode (PIM-SM),
and Protocol Independent Multicast - Dense Mode (PIM-DM)
were developed as well.
.Pp
To start multicast routing,
the user must enable multicast forwarding in the kernel
(see
.Sx SYNOPSIS
about the kernel configuration options),
and must run a multicast routing capable user-level process.
From a developer's point of view,
the programming guide described in the
.Sx "Programming Guide"
section should be used to control the multicast forwarding in the kernel.
.\"
.Ss Programming Guide
This section provides information about the basic multicast routing API.
The so-called
.Dq advanced multicast API
is described in the
.Sx "Advanced Multicast API Programming Guide"
section.
.Pp
First, a multicast routing socket must be opened.
That socket is then used
to control the multicast forwarding in the kernel.
Note that most operations below require certain privilege
(i.e., root privilege):
.Bd -literal
/* IPv4 */
int mrouter_s4;
mrouter_s4 = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
.Ed
.Bd -literal
/* IPv6 */
int mrouter_s6;
mrouter_s6 = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
.Ed
.Pp
Note that if the router needs to open an IGMP or ICMPv6 socket
(in case of IPv4 and IPv6 respectively)
for sending or receiving IGMP or MLD multicast group membership messages,
then the same mrouter_s4 or mrouter_s6 socket should be used
for sending and receiving the IGMP or MLD messages respectively.
In case of a BSD-derived kernel, it may be possible to open separate sockets
for IGMP or MLD messages only.
However, some other kernels (e.g., Linux) require that the multicast
routing socket be used for sending and receiving IGMP or MLD
messages.
Therefore, for portability reasons the multicast
routing socket should be reused for IGMP and MLD messages as well.
.Pp
After the multicast routing socket is open, it can be used to enable
or disable multicast forwarding in the kernel:
.Bd -literal
/* IPv4 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
.Ed
.Bd -literal
/* IPv6 */
int v = 1;        /* 1 to enable, or 0 to disable */
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v));
\&...
/* If necessary, filter all ICMPv6 messages */
struct icmp6_filter filter;
ICMP6_FILTER_SETBLOCKALL(&filter);
setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter,
    sizeof(filter));
.Ed
.Pp
After multicast forwarding is enabled, the multicast routing socket
can be used to enable PIM processing in the kernel if we are running PIM-SM or
PIM-DM
(see
.Xr pim 4 ) .
.Pp
For each network interface (e.g., physical or a virtual tunnel)
that would be used for multicast forwarding, a corresponding
multicast interface must be added to the kernel:
.Bd -literal
/* IPv4 */
struct vifctl vc;
memset(&vc, 0, sizeof(vc));
/* Assign all vifctl fields as appropriate */
vc.vifc_vifi = vif_index;
vc.vifc_flags = vif_flags;
vc.vifc_threshold = min_ttl_threshold;
vc.vifc_rate_limit = max_rate_limit;
memcpy(&vc.vifc_lcl_addr, &vif_local_address, sizeof(vc.vifc_lcl_addr));
if (vc.vifc_flags & VIFF_TUNNEL)
    memcpy(&vc.vifc_rmt_addr, &vif_remote_address,
        sizeof(vc.vifc_rmt_addr));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc,
    sizeof(vc));
.Ed
.Pp
The
.Dq vif_index
must be unique per vif.
The
.Dq vif_flags
contains the
.Dq VIFF_*
flags as defined in
.In net/ip_mroute/ip_mroute.h .
The
.Dq min_ttl_threshold
contains the minimum TTL a multicast data packet must have to be
forwarded on that vif.
Typically, it would have a value of 1.
The
.Dq max_rate_limit
contains the maximum rate (in bits/s) of the multicast data packets forwarded
on that vif.
A value of 0 means no limit.
The
.Dq vif_local_address
contains the local IP address of the corresponding local interface.
The
.Dq vif_remote_address
contains the remote IP address in case of DVMRP multicast tunnels.
.Bd -literal
/* IPv6 */
struct mif6ctl mc;
memset(&mc, 0, sizeof(mc));
/* Assign all mif6ctl fields as appropriate */
mc.mif6c_mifi = mif_index;
mc.mif6c_flags = mif_flags;
mc.mif6c_pifi = pif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc,
    sizeof(mc));
.Ed
.Pp
The
.Dq mif_index
must be unique per vif.
The
.Dq mif_flags
contains the
.Dq MIFF_*
flags as defined in
.In netinet6/ip6_mroute.h .
The
.Dq pif_index
is the physical interface index of the corresponding local interface.
.Pp
A multicast interface is deleted by:
.Bd -literal
/* IPv4 */
vifi_t vifi = vif_index;
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi,
    sizeof(vifi));
.Ed
.Bd -literal
/* IPv6 */
mifi_t mifi = mif_index;
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi,
    sizeof(mifi));
.Ed
.Pp
After multicast forwarding is enabled and the multicast virtual
interfaces are
added, the kernel may deliver upcall messages (also called signals
later in this text) on the multicast routing socket that was opened
earlier with
.Dq MRT_INIT
or
.Dq MRT6_INIT .
The IPv4 upcalls have a
.Dq struct igmpmsg
header (see
.In net/ip_mroute/ip_mroute.h )
with the field
.Dq im_mbz
set to zero.
Note that this header follows the structure of
.Dq struct ip
with the protocol field
.Dq ip_p
set to zero.
The IPv6 upcalls have a
.Dq struct mrt6msg
header (see
.In netinet6/ip6_mroute.h )
with the field
.Dq im6_mbz
set to zero.
Note that this header follows the structure of
.Dq struct ip6_hdr
with the next header field
.Dq ip6_nxt
set to zero.
.Pp
The upcall header contains the field
.Dq im_msgtype
(IPv4) or
.Dq im6_msgtype
(IPv6), which holds the upcall type
.Dq IGMPMSG_*
or
.Dq MRT6MSG_* ,
respectively.
The values of the rest of the upcall header fields
and the body of the upcall message depend on the particular upcall type.
.Pp
If the upcall message type is
.Dq IGMPMSG_NOCACHE
or
.Dq MRT6MSG_NOCACHE ,
this is an indication that a multicast packet has reached the multicast
router, but the router has no forwarding state for that packet.
Typically, the upcall would be a signal for the multicast routing
user-level process to install the appropriate Multicast Forwarding
Cache (MFC) entry in the kernel.
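.Pp
As a minimal sketch of the IPv4 rule described above, the helper below
decides whether a buffer read from the multicast routing socket is a kernel
upcall or an ordinary IP packet (such as IGMP) by testing the byte that
occupies the position of ip_p, which is byte offset 9 of an IPv4 header.
The function name is_mrt_upcall4 and the two constants are hypothetical and
are not part of any kernel header; a real program would more likely cast the
buffer to struct igmpmsg and test im_mbz directly.
```c
#include <stdbool.h>
#include <stddef.h>

/* Byte offset of the protocol field (ip_p) in an IPv4 header. */
#define IPV4_PROTO_OFFSET	9
/* Minimum IPv4 header length in bytes. */
#define IPV4_MIN_HDRLEN		20

/*
 * Hypothetical helper: returns true if the buffer looks like a kernel
 * upcall (the protocol byte is zero), and false for a missing or
 * too-short buffer or for a regular IP packet with a non-zero protocol.
 */
static bool
is_mrt_upcall4(const unsigned char *buf, size_t len)
{
	if (buf == NULL || len < IPV4_MIN_HDRLEN)
		return (false);
	return (buf[IPV4_PROTO_OFFSET] == 0);
}
```
Buffers for which the helper returns false would be handed to the regular
IGMP processing code instead of the upcall handler.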
.Pp
An MFC entry is added by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
mc.mfcc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    mc.mfcc_ttls[i] = oifs_ttl[i];
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
mc.mf6cc_parent = iif_index;
for (i = 0; i < maxvifs; i++)
    if (oifs_ttl[i] > 0)
        IF_SET(i, &mc.mf6cc_ifset);
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The
.Dq source_addr
and
.Dq group_addr
are the source and group address of the multicast packet (as set
in the upcall message).
The
.Dq iif_index
is the virtual interface index of the multicast interface the multicast
packets for this specific source and group address should be received on.
The
.Dq oifs_ttl[]
array contains the minimum TTL (per interface) a multicast packet
should have to be forwarded on an outgoing interface.
If the TTL value is zero, the corresponding interface is not included
in the set of outgoing interfaces.
Note that in case of IPv6 only the set of outgoing interfaces can
be specified.
.Pp
An MFC entry is deleted by:
.Bd -literal
/* IPv4 */
struct mfcctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin));
memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp));
setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Bd -literal
/* IPv6 */
struct mf6cctl mc;
memset(&mc, 0, sizeof(mc));
memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin));
memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mc.mf6cc_mcastgrp));
setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MFC,
    (void *)&mc, sizeof(mc));
.Ed
.Pp
The following method can be used to get various statistics per
installed MFC entry in the kernel (e.g., the number of forwarded
packets per source and group address):
.Bd -literal
/* IPv4 */
struct sioc_sg_req sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s4, SIOCGETSGCNT, &sgreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_sg_req6 sgreq;
memset(&sgreq, 0, sizeof(sgreq));
memcpy(&sgreq.src, &source_addr, sizeof(sgreq.src));
memcpy(&sgreq.grp, &group_addr, sizeof(sgreq.grp));
ioctl(mrouter_s6, SIOCGETSGCNT_IN6, &sgreq);
.Ed
.Pp
The following method can be used to get various statistics per
multicast virtual interface in the kernel (e.g., the number of forwarded
packets per interface):
.Bd -literal
/* IPv4 */
struct sioc_vif_req vreq;
memset(&vreq, 0, sizeof(vreq));
vreq.vifi = vif_index;
ioctl(mrouter_s4, SIOCGETVIFCNT, &vreq);
.Ed
.Bd -literal
/* IPv6 */
struct sioc_mif_req6 mreq;
memset(&mreq, 0, sizeof(mreq));
mreq.mifi = mif_index;
ioctl(mrouter_s6, SIOCGETMIFCNT_IN6, &mreq);
.Ed
.Ss Advanced Multicast API Programming Guide
Adding new features to the kernel makes it difficult
to preserve backward compatibility (binary and API)
while still allowing user-level processes to take advantage of
the new features (if the
kernel supports them).
.Pp
One of the mechanisms that allows us to preserve backward
compatibility is a form of negotiation
between the user-level process and the kernel:
.Bl -enum
.It
The user-level process tries to enable in the kernel the set of new
features (and the corresponding API) it would like to use.
.It
The kernel returns the (sub)set of features it knows about
and is willing to enable.
.It
The user-level process uses only that set of features
the kernel has agreed on.
.El
.\"
.Pp
To support backward compatibility, if the user-level process doesn't
ask for any new features, the kernel defaults to the basic
multicast API (see the
.Sx "Programming Guide"
section).
.\" XXX: edit as appropriate after the advanced multicast API is
.\" supported under IPv6
Currently, the advanced multicast API exists only for IPv4;
in the future there will be IPv6 support as well.
.Pp
Below is a summary of the expandable API solution.
Note that all new options and structures are defined
in
.In net/ip_mroute/ip_mroute.h
and
.In netinet6/ip6_mroute.h ,
unless stated otherwise.
.Pp
The user-level process uses new get/setsockopt() options to
perform the API features negotiation with the kernel.
This negotiation must be performed right after the multicast routing
socket is opened.
The set of desired/allowed features is stored in a bitset
(currently, in uint32_t; i.e., maximum of 32 new features).
The new get/setsockopt() options are
.Dq MRT_API_SUPPORT
and
.Dq MRT_API_CONFIG .
Example:
.Bd -literal
uint32_t v;
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, &len);
.Ed
.Pp
would set in
.Dq v
the pre-defined bits that the kernel API supports.
The eight least significant bits in uint32_t are the same as the
eight possible flags
.Dq MRT_MFC_FLAGS_*
that can be used in
.Dq mfcc_flags
as part of the new definition of
.Dq struct mfcctl
(see below about those flags), which leaves 24 flags for other new features.
The value returned by getsockopt(MRT_API_SUPPORT) is read-only; in other
words, setsockopt(MRT_API_SUPPORT) would fail.
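.Pp
The split of the 32-bit feature word described above can be illustrated
with a small sketch.
It assumes only what the text states: the eight least significant bits carry
the per-entry MRT_MFC_FLAGS_* flags and the upper 24 bits carry other
features.
The mask name and both function names are hypothetical and are not defined
by any kernel header.
```c
#include <stdint.h>

/* Hypothetical mask for the eight low bits shared with mfcc_flags. */
#define API_MFC_FLAGS_MASK	0x000000ffU

/* Returns the bits that correspond to the per-entry MRT_MFC_FLAGS_* flags. */
static uint32_t
api_mfc_flag_bits(uint32_t v)
{
	return (v & API_MFC_FLAGS_MASK);
}

/* Returns the bits that correspond to the remaining 24 feature flags. */
static uint32_t
api_other_feature_bits(uint32_t v)
{
	return (v & ~API_MFC_FLAGS_MASK);
}
```
A caller would pass the value obtained from getsockopt(MRT_API_SUPPORT) to
either function to inspect one half of the capability set at a time.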
.Pp
To modify the API and enable a specific feature in the kernel:
.Bd -literal
uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF;
if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v))
    != 0) {
    return (ERROR);
}
if (v & MRT_MFC_FLAGS_DISABLE_WRONGVIF)
    return (OK);        /* Success */
else
    return (ERROR);
.Ed
.Pp
In other words, when setsockopt(MRT_API_CONFIG) is called, the
argument to it specifies the desired set of features to
be enabled in the API and the kernel.
The return value in
.Dq v
is the actual (sub)set of features that were enabled in the kernel.
To obtain the same set of enabled features later:
.Bd -literal
socklen_t len = sizeof(v);
getsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, &len);
.Ed
.Pp
The set of enabled features is global.
Therefore, setsockopt(MRT_API_CONFIG)
should be called right after setsockopt(MRT_INIT).
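.Pp
The last step of the negotiation, in which the process uses only the
features the kernel agreed on, can be sketched as a pure bit operation.
The flag values below are copied from the feature list in this manual page
and are guarded so that they do not clash with the real header; the function
name mrt_api_missing is hypothetical.
```c
#include <stdint.h>

/* Feature bits as listed in this manual page; guarded for standalone use. */
#ifndef MRT_MFC_FLAGS_DISABLE_WRONGVIF
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF	(1 << 0)
#define MRT_MFC_FLAGS_BORDER_VIF	(1 << 1)
#define MRT_MFC_RP			(1 << 8)
#define MRT_MFC_BW_UPCALL		(1 << 9)
#endif

/*
 * Hypothetical helper: given the feature set the process asked for and
 * the set the kernel reported as enabled, return the features that were
 * requested but not granted.  A result of zero means full agreement.
 */
static uint32_t
mrt_api_missing(uint32_t wanted, uint32_t granted)
{
	return (wanted & ~granted);
}
```
A process could call this with the value it passed to
setsockopt(MRT_API_CONFIG) and the value it read back, and then fall back to
user-level handling for every bit that is still set in the result.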
.Pp
Currently, the following set of new features is defined:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
#define MRT_MFC_RP                     (1 << 8) /* enable RP address */
#define MRT_MFC_BW_UPCALL              (1 << 9) /* enable bw upcalls */
.Ed
.\" .Pp
.\" In the future there might be:
.\" .Bd -literal
.\" #define MRT_MFC_GROUP_SPECIFIC (1 << 10) /* allow (*,G) MFC entries */
.\" .Ed
.\" .Pp
.\" to allow (*,G) MFC entries (i.e., group-specific entries) in the kernel.
.\" For now this is left-out until it is clear whether
.\" (*,G) MFC support is the preferred solution instead of something more generic
.\" solution for example.
.\"
.\" 2. The newly defined struct mfcctl2.
.\"
.Pp
The advanced multicast API uses a newly defined
.Dq struct mfcctl2
instead of the traditional
.Dq struct mfcctl .
The original
.Dq struct mfcctl
is kept as is.
The new
.Dq struct mfcctl2
is:
.Bd -literal
/*
 * The new argument structure for MRT_ADD_MFC and MRT_DEL_MFC overlays
 * and extends the old struct mfcctl.
 */
struct mfcctl2 {
    /* the mfcctl fields */
    struct in_addr  mfcc_origin;        /* ip origin of mcasts       */
    struct in_addr  mfcc_mcastgrp;      /* multicast group associated*/
    vifi_t          mfcc_parent;        /* incoming vif              */
    u_char          mfcc_ttls[MAXVIFS]; /* forwarding ttls on vifs   */

    /* extension fields */
    uint8_t         mfcc_flags[MAXVIFS];/* the MRT_MFC_FLAGS_* flags */
    struct in_addr  mfcc_rp;            /* the RP address            */
};
.Ed
.Pp
The new fields are
.Dq mfcc_flags[MAXVIFS]
and
.Dq mfcc_rp .
Note that for compatibility reasons they are added at the end.
.Pp
The
.Dq mfcc_flags[MAXVIFS]
field is used to set various flags per
interface per (S,G) entry.
Currently, the defined flags are:
.Bd -literal
#define MRT_MFC_FLAGS_DISABLE_WRONGVIF (1 << 0) /* disable WRONGVIF signals */
#define MRT_MFC_FLAGS_BORDER_VIF       (1 << 1) /* border vif */
.Ed
.Pp
The
.Dq MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is used to explicitly disable the
.Dq IGMPMSG_WRONGVIF
kernel signal at the (S,G) granularity if a multicast data packet
arrives on the wrong interface.
Usually, this signal is used to
complete the shortest-path switch in case of PIM-SM multicast routing,
or to trigger a PIM assert message.
However, it should not be delivered for interfaces that are not in
the outgoing interface set, and that are not expecting to
become an incoming interface.
Hence, if the
.Dq MRT_MFC_FLAGS_DISABLE_WRONGVIF
flag is set for an
interface, then a data packet that arrives on that interface for
that MFC entry will NOT trigger a WRONGVIF signal.
If that flag is not set, then a signal is triggered (the default action).
.Pp
The
.Dq MRT_MFC_FLAGS_BORDER_VIF
flag is used to specify whether the Border-bit in PIM
Register messages should be set (in case the Register encapsulation
is performed inside the kernel).
If it is set for the special PIM Register kernel virtual interface
(see
.Xr pim 4 ) ,
the Border-bit in the Register messages sent to the RP will be set.
.Pp
The remaining six bits are reserved for future usage.
.Pp
The
.Dq mfcc_rp
field is used to specify the RP address (in case of PIM-SM multicast routing)
for a multicast
group G if we want to perform kernel-level PIM Register encapsulation.
The
.Dq mfcc_rp
field is used only if the
.Dq MRT_MFC_RP
advanced API flag/capability has been successfully set by
setsockopt(MRT_API_CONFIG).
.Pp
.\"
.\" 3. Kernel-level PIM Register encapsulation
.\"
If the
.Dq MRT_MFC_RP
flag was successfully set by
setsockopt(MRT_API_CONFIG), then the kernel will attempt to perform
the PIM Register encapsulation itself instead of sending the
multicast data packets to user level (inside IGMPMSG_WHOLEPKT
upcalls) for user-level encapsulation.
The RP address is taken from the
.Dq mfcc_rp
field
inside the new
.Dq struct mfcctl2 .
However, even if the
.Dq MRT_MFC_RP
flag was successfully set, if the
.Dq mfcc_rp
field is set to
.Dq INADDR_ANY ,
then the
kernel will still deliver an IGMPMSG_WHOLEPKT upcall with the
multicast data packet to the user-level process.
.Pp
In addition, if the multicast data packet is too large to fit within
a single IP packet after the PIM Register encapsulation (e.g., if
its size was on the order of 65500 bytes), the data packet will be
fragmented, and then each of the fragments will be encapsulated
separately.
Note that typically a multicast data packet can be that
large only if it was originated locally from the same host that
performs the encapsulation; otherwise the transmission of the
multicast data packet over Ethernet, for example, would have
fragmented it into much smaller pieces.
.\"
.\" Note that if this code is ported to IPv6, we may need the kernel to
.\" perform MTU discovery to the RP, and keep those discoveries inside
.\" the kernel so the encapsulating router may send back ICMP
.\" Fragmentation Required if the size of the multicast data packet is
.\" too large (see "Encapsulating data packets in the Register Tunnel"
.\" in Section 4.4.1 in the PIM-SM spec
.\" draft-ietf-pim-sm-v2-new-05.{txt,ps}).
.\" For IPv4 we may be able to get away without it, but for IPv6 we need
.\" that.
.\"
.\" 4. Mechanism for "multicast bandwidth monitoring and upcalls".
.\"
.Pp
Typically, a multicast routing user-level process would need to know the
forwarding bandwidth for some data flow.
For example, the multicast routing process may want to time out idle MFC
entries, or in case of PIM-SM it may initiate the (S,G) shortest-path
switch if the bandwidth rate is above a threshold.
.Pp
The original solution for measuring the bandwidth of a data flow was
that a user-level process would periodically
query the kernel about the number of forwarded packets/bytes per
(S,G), and then based on those numbers it would estimate whether a source
has been idle, or whether the source's transmission bandwidth is above a
threshold.
That solution does not scale well, hence the need for a new
mechanism for bandwidth monitoring.
.Pp
Below is a description of the bandwidth monitoring mechanism.
.Bl -bullet
.It
If the bandwidth of a data flow satisfies some pre-defined filter,
the kernel delivers an upcall on the multicast routing socket
to the multicast routing process that has installed that filter.
.It
The bandwidth-upcall filters are installed per (S,G).
There can be more than one filter per (S,G).
.It
Instead of supporting all possible comparison operations
(i.e., < <= == != > >= ), there is support only for the
<= and >= operations,
because this makes the kernel-level implementation simpler,
and because practically we need only those two.
Further, the missing operations can be simulated by secondary
user-level filtering of those <= and >= filters.
For example, to simulate !=, we need to install the filter
.Dq bw <= 0xffffffff ,
and after an
upcall is received, we need to check whether
.Dq measured_bw != expected_bw .
.It
The bandwidth-upcall mechanism is enabled by
setsockopt(MRT_API_CONFIG) for the MRT_MFC_BW_UPCALL flag.
.It
The bandwidth-upcall filters are added/deleted by the new
setsockopt(MRT_ADD_BW_UPCALL) and setsockopt(MRT_DEL_BW_UPCALL)
respectively (with the appropriate
.Dq struct bw_upcall
argument of course).
.El
.Pp
From an application point of view, a developer needs to know about
the following:
.Bd -literal
/*
 * Structure for installing or delivering an upcall if the
 * measured bandwidth is above or below a threshold.
 *
 * User programs (e.g. daemons) may have a need to know when the
 * bandwidth used by some data flow is above or below some threshold.
 * This interface allows the userland to specify the threshold (in
 * bytes and/or packets) and the measurement interval. A flow is
 * all packets with the same source and destination IP address.
 * At the moment the code is only used for multicast destinations
 * but there is nothing that prevents its use for unicast.
 *
 * The measurement interval cannot be shorter than some Tmin (currently, 3s).
 * The threshold is set in packets and/or bytes per interval.
 *
 * Measurement works as follows:
 *
 * For >= measurements:
 * The first packet marks the start of a measurement interval.
 * During an interval we count packets and bytes, and when we
 * pass the threshold we deliver an upcall and we are done.
 * The first packet after the end of the interval resets the
 * count and restarts the measurement.
 *
 * For <= measurements:
 * We start a timer to fire at the end of the interval, and
 * then for each incoming packet we count packets and bytes.
 * When the timer fires, we compare the value with the threshold,
 * schedule an upcall if we are below, and restart the measurement
 * (reschedule timer and zero counters).
 */

struct bw_data {
    struct timeval b_time;
    uint64_t       b_packets;
    uint64_t       b_bytes;
};

struct bw_upcall {
    struct in_addr bu_src;         /* source address            */
    struct in_addr bu_dst;         /* destination address       */
    uint32_t       bu_flags;       /* misc flags (see below)    */
#define BW_UPCALL_UNIT_PACKETS (1 << 0) /* threshold (in packets)    */
#define BW_UPCALL_UNIT_BYTES   (1 << 1) /* threshold (in bytes)      */
#define BW_UPCALL_GEQ          (1 << 2) /* upcall if bw >= threshold */
#define BW_UPCALL_LEQ          (1 << 3) /* upcall if bw <= threshold */
#define BW_UPCALL_DELETE_ALL   (1 << 4) /* delete all upcalls for s,d*/
    struct bw_data bu_threshold;   /* the bw threshold          */
    struct bw_data bu_measured;    /* the measured bw           */
};

/* max. number of upcalls to deliver together */
#define BW_UPCALLS_MAX                          128
/* min. threshold time interval for bandwidth measurement */
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_SEC    3
#define BW_UPCALL_THRESHOLD_INTERVAL_MIN_USEC   0
.Ed
.Pp
The
.Dq bw_upcall
structure is used as an argument to
setsockopt(MRT_ADD_BW_UPCALL) and setsockopt(MRT_DEL_BW_UPCALL).
Each setsockopt(MRT_ADD_BW_UPCALL) installs a filter in the kernel
for the source and destination address in the
.Dq bw_upcall
argument,
and that filter will trigger an upcall according to the following
pseudo-algorithm:
.Bd -literal
 if (bw_upcall_oper IS ">=") {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets >= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes >= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is >= threshold");
  }
  if (bw_upcall_oper IS "<=" && measured_interval >= threshold_interval) {
    if ((((bw_upcall_unit & PACKETS) == PACKETS) &&
         (measured_packets <= threshold_packets)) ||
        (((bw_upcall_unit & BYTES) == BYTES) &&
         (measured_bytes <= threshold_bytes)))
       SEND_UPCALL("measured bandwidth is <= threshold");
  }
.Ed
.Pp
In the same
.Dq bw_upcall
the unit can be specified in both BYTES and PACKETS.
However, the GEQ and LEQ flags are mutually exclusive.
.Pp
Basically, an upcall is delivered if the measured bandwidth is >= or
<= the threshold bandwidth (within the specified measurement
interval).
For practical reasons, the smallest value for the measurement
interval is 3 seconds.
If smaller values were allowed, then the bandwidth
estimation might be less accurate, or the potentially very high frequency
of the generated upcalls might introduce too much overhead.
For the >= operation, the answer may be known before the end of
.Dq threshold_interval ,
therefore the upcall may be delivered earlier.
For the <= operation however, we must wait
until the threshold interval has expired to know the answer.
.Pp
Example of usage:
.Bd -literal
struct bw_upcall bw_upcall;
/* Assign all bw_upcall fields as appropriate */
memset(&bw_upcall, 0, sizeof(bw_upcall));
memcpy(&bw_upcall.bu_src, &source, sizeof(bw_upcall.bu_src));
memcpy(&bw_upcall.bu_dst, &group, sizeof(bw_upcall.bu_dst));
/* the measurement interval is the b_time member of struct bw_data */
bw_upcall.bu_threshold.b_time = threshold_interval;
bw_upcall.bu_threshold.b_packets = threshold_packets;
bw_upcall.bu_threshold.b_bytes = threshold_bytes;
if (is_threshold_in_packets)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_PACKETS;
if (is_threshold_in_bytes)
    bw_upcall.bu_flags |= BW_UPCALL_UNIT_BYTES;
/* exactly one of GEQ and LEQ must be selected */
do {
    if (is_geq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_GEQ;
        break;
    }
    if (is_leq_upcall) {
        bw_upcall.bu_flags |= BW_UPCALL_LEQ;
        break;
    }
    return (ERROR);
} while (0);
setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_BW_UPCALL,
          (void *)&bw_upcall, sizeof(bw_upcall));
.Ed
.Pp
To delete a single filter, use MRT_DEL_BW_UPCALL
with the fields of bw_upcall set
exactly as they were when MRT_ADD_BW_UPCALL was called.
.Pp
To delete all bandwidth filters for a given (S,G),
only the
.Dq bu_src
and
.Dq bu_dst
fields in
.Dq struct bw_upcall
need to be set, together with only the
.Dq BW_UPCALL_DELETE_ALL
flag inside field
.Dq bw_upcall.bu_flags .
.Pp
The bandwidth upcalls are received by aggregating them in the new upcall
message:
.Bd -literal
#define IGMPMSG_BW_UPCALL  4  /* BW monitoring upcall */
.Ed
.Pp
This message is an array of
.Dq struct bw_upcall
elements (up to BW_UPCALLS_MAX = 128).
The upcalls are
delivered when there are 128 pending upcalls, or when 1 second has
expired since the previous upcall (whichever comes first).
In a
.Dq struct bw_upcall
element, the
.Dq bu_measured
field is filled in to
indicate the particular measured values.
However, because of the way
the particular intervals are measured, the user should be careful how
bu_measured.b_time is used.
For example, if the
filter is installed to trigger an upcall if the number of packets
is >= 1, then
.Dq bu_measured
may have a value of zero in the upcalls after the
first one, because the measured interval for >= filters is
.Dq clocked
by the forwarded packets.
Hence, this upcall mechanism should not be used for measuring
the exact value of the bandwidth of the forwarded data.
To measure the exact bandwidth, the user would need to
get the forwarded-packet statistics with the ioctl(SIOCGETSGCNT)
mechanism
(see the
.Sx Programming Guide
section).
.Pp
Note that the upcalls for a filter are delivered until the specific
filter is deleted, but no more frequently than once per
.Dq bu_threshold.b_time .
For example, if the filter is specified to
deliver a signal if bw >= 1 packet, the first packet will trigger a
signal, but the next upcall will be triggered no earlier than
.Dq bu_threshold.b_time
after the previous upcall.
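.Pp
The filter pseudo-algorithm given earlier can be restated as a small
user-level C predicate, which may be useful when testing a routing daemon
without a kernel.
This is only a sketch under stated assumptions: the
.Dq sketch_
structures and
.Dq SKETCH_
macros are hypothetical local mirrors of the definitions shown above
(with the address fields omitted), not the kernel's code, and rejecting a
filter that has both or neither of GEQ and LEQ set is a choice made for
this illustration.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical copies of the flag values listed in this page. */
#define SKETCH_BW_UPCALL_UNIT_PACKETS (1 << 0)
#define SKETCH_BW_UPCALL_UNIT_BYTES   (1 << 1)
#define SKETCH_BW_UPCALL_GEQ          (1 << 2)
#define SKETCH_BW_UPCALL_LEQ          (1 << 3)

/* Hypothetical mirror of struct bw_data (illustration only). */
struct sketch_bw_data {
    struct timeval b_time;
    uint64_t       b_packets;
    uint64_t       b_bytes;
};

/* Hypothetical, reduced mirror of struct bw_upcall (no addresses). */
struct sketch_bw_upcall {
    uint32_t              bu_flags;
    struct sketch_bw_data bu_threshold;
    struct sketch_bw_data bu_measured;
};

/* Return 1 if the measured interval is at least the threshold interval. */
static int
sketch_interval_elapsed(const struct timeval *measured,
    const struct timeval *threshold)
{
    if (measured->tv_sec != threshold->tv_sec)
        return (measured->tv_sec > threshold->tv_sec);
    return (measured->tv_usec >= threshold->tv_usec);
}

/* Return 1 if the filter in *bu would trigger an upcall, else 0. */
static int
sketch_filter_matches(const struct sketch_bw_upcall *bu)
{
    int geq, leq, hit = 0;

    assert(bu != NULL);
    geq = (bu->bu_flags & SKETCH_BW_UPCALL_GEQ) != 0;
    leq = (bu->bu_flags & SKETCH_BW_UPCALL_LEQ) != 0;

    /* GEQ and LEQ are mutually exclusive; this sketch wants exactly one. */
    if (geq == leq)
        return (0);

    /* A <= filter is evaluated only once the interval has expired. */
    if (leq && !sketch_interval_elapsed(&bu->bu_measured.b_time,
        &bu->bu_threshold.b_time))
        return (0);

    /* Either unit may satisfy the filter when both units are selected. */
    if (bu->bu_flags & SKETCH_BW_UPCALL_UNIT_PACKETS)
        hit |= geq ?
            (bu->bu_measured.b_packets >= bu->bu_threshold.b_packets) :
            (bu->bu_measured.b_packets <= bu->bu_threshold.b_packets);
    if (bu->bu_flags & SKETCH_BW_UPCALL_UNIT_BYTES)
        hit |= geq ?
            (bu->bu_measured.b_bytes >= bu->bu_threshold.b_bytes) :
            (bu->bu_measured.b_bytes <= bu->bu_threshold.b_bytes);
    return (hit);
}
```
.Pp
A caller would fill in
.Dq bu_flags ,
.Dq bu_threshold
and
.Dq bu_measured
and test the return value.
As in the pseudo-algorithm, a >= filter can match before the interval
ends, while a <= filter can match only after it has expired.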
.\"
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr recvfrom 2 ,
.Xr recvmsg 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr icmp6 4 ,
.Xr inet 4 ,
.Xr inet6 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ip6 4 ,
.Xr pim 4
.\"
.Sh AUTHORS
.An -nosplit
The original multicast code was written by
.An David Waitzman
(BBN Labs), and later modified by the following individuals:
.An Steve Deering
(Stanford),
.An Mark J. Steiglitz
(Stanford),
.An Van Jacobson
(LBL),
.An Ajit Thyagarajan
(PARC),
.An Bill Fenner
(PARC).
The IPv6 multicast support was implemented by the KAME project
.Pa ( http://www.kame.net ) ,
and was based on the IPv4 multicast code.
The advanced multicast API and the multicast bandwidth
monitoring were implemented by
.An Pavlin Radoslavov
(ICSI) in collaboration with
.An Chris Brown
(NextHop).
.Pp
This manual page was written by
.An Pavlin Radoslavov
(ICSI).