$NetBSD: TODO.smpnet,v 1.50 2024/08/12 10:46:40 nia Exp $

MP-safe components
==================

These components work without the big kernel lock (KERNEL_LOCK), i.e., with
the NET_MPSAFE kernel option. Some components scale up and some don't.

 - Device drivers
   - aq(4)
   - awge(4)
   - bcmgenet(4)
   - bge(4)
   - ena(4)
   - iavf(4)
   - ixg(4)
   - ixl(4)
   - ixv(4)
   - mcx(4)
   - rge(4)
   - se(4)
   - sunxi_emac(4)
   - vioif(4)
   - vmx(4)
   - wm(4)
   - xennet(4)
   - usbnet(4) based adapters:
     - axe(4)
     - axen(4)
     - cdce(4)
     - cue(4)
     - kue(4)
     - mos(4)
     - mue(4)
     - smsc(4)
     - udav(4)
     - upl(4)
     - ure(4)
     - url(4)
     - urndis(4)
 - Layer 2
   - Ethernet (if_ethersubr.c)
   - bridge(4)
     - STP
   - Fast forward (ipflow)
 - Layer 3
   - All except for the items listed in the section below
 - Interfaces
   - canloop(4)
   - gif(4)
   - ipsecif(4)
   - l2tp(4)
   - lagg(4)
   - pppoe(4)
     - if_spppsubr.c
   - tap(4)
   - tun(4)
   - vether(4)
   - vlan(4)
 - Packet filters
   - npf(7)
   - ipf(4)
 - Others
   - bpf(4)
   - ipsec(4)
   - opencrypto(9)
   - pfil(9)

Non MP-safe components and kernel options
=========================================

The following components and options aren't MP-safe yet, i.e., they require
the big kernel lock. Some of them can be used safely even if NET_MPSAFE is
enabled because they're still protected by the big kernel lock. The others
aren't protected and so are unsafe, e.g., they may crash the kernel.

Protected ones
--------------

 - Device drivers
   - Most drivers other than the ones listed in the section above
 - Layer 4
   - DCCP
   - SCTP
   - TCP
   - UDP

Unprotected ones
----------------

 - Layer 2
   - ARCNET (if_arcsubr.c)
   - IEEE 1394 (if_ieee1394subr.c)
   - IEEE 802.11 (ieee80211(4))
 - Layer 3
   - IPSELSRC
   - MROUTING
     - PIM
   - MPLS (mpls(4))
   - IPv6 address selection policy
 - Interfaces
   - agr(4)
   - carp(4)
   - faith(4)
   - gre(4)
   - ppp(4)
   - sl(4)
   - stf(4)
   - if_srt
 - Packet filters
   - pf(4)
 - Others
   - AppleTalk (sys/netatalk/)
   - Bluetooth (sys/netbt/)
   - altq(4)
   - kttcp(4)
   - NFS

Known issues
============

NOMPSAFE
--------

We use "NOMPSAFE" as a marker indicating that the code around it isn't MP-safe
yet. It appears in comments and also as part of function names, for example
m_get_rcvif_NOMPSAFE. Let's use "NOMPSAFE" to make it easy to find non-MP-safe
code with grep.

bpf
---

MP-ification of bpf requires that all bpf_mtap* calls are made in normal LWP
context or softint context, i.e., not in hardware interrupt context. For Tx,
all bpf_mtap calls satisfy the requirement. For Rx, most bpf_mtap calls are
made in softint. Unfortunately, some bpf_mtap calls on Rx are still made in
hardware interrupt context.

This is the list of the functions that have such bpf_mtap calls:

 - sca_frame_process() @ sys/dev/ic/hd64570.c

Ideally we should make the functions run in softint somehow, but we have
neither the actual devices nor the time (or interest/love) to work on the
task, so instead we provide a deferred bpf_mtap mechanism that forcibly runs
bpf_mtap in softint context. It's a workaround; once the functions run in
softint, we should use the original bpf_mtap again.

if_mcast_op() - SIOCADDMULTI/SIOCDELMULTI
-----------------------------------------

This helper function is called to add or remove multicast addresses for an
interface. When called via ioctl it takes IFNET_LOCK(); when called via
sosetopt() it doesn't.

Various network drivers can't assert IFNET_LOCKED() in their if_ioctl
because of this. Generally drivers still take care to call splnet() even
with NET_MPSAFE before calling ether_ioctl(), but they do not take
KERNEL_LOCK(), so this is actually unsafe.

Lingering obsolete variables
----------------------------

Some obsolete global variables and member variables of structures remain to
avoid breaking old userland programs which directly access such variables via
kvm(3).

The following programs still use kvm(3) to get some information related to
the network stack.

 - netstat(1)
 - vmstat(1)
 - fstat(1)

netstat(1) accesses ifnet_list, the head of a list of interface objects
(struct ifnet), and traverses each object through the ifnet#if_list member
variable. ifnet_list and ifnet#if_list are obsoleted by ifnet_pslist and
ifnet#if_pslist_entry respectively. netstat also accesses the IP address list
of an interface through ifnet#if_addrlist. struct ifaddr, struct in_ifaddr
and struct in6_ifaddr are accessed, so the following obsolete member variables
have to remain: ifaddr#ifa_list, in_ifaddr#ia_hash, in_ifaddr#ia_list,
in6_ifaddr#ia_next and in6_ifaddr#_ia6_multiaddrs. Note that netstat already
implements alternative methods to fetch the above information via sysctl(3).

vmstat(1) shows statistics of hash tables created by hashinit(9) in the kernel.
The statistics are retrieved via kvm(3). The global variables in_ifaddrhash
and in_ifaddrhashtbl, which are for a hash table of IPv4 addresses and are
obsoleted by in_ifaddrhash_pslist and in_ifaddrhashtbl_pslist, are kept for
this purpose. We should provide a means to fetch statistics of hash tables
via sysctl(3).

fstat(1) shows information about bpf instances. Each bpf instance (struct bpf)
is obtained via kvm(3). The bpf_d#_bd_next, bpf_d#_bd_filter and
bpf_d#_bd_list member variables are obsolete but remain. ifnet#if_xname is
also accessed via struct bpf_if, and the obsolete ifnet#if_list has to remain
so that the offset of ifnet#if_xname does not change. The statistics counters
(bpf#bd_rcount, bpf#bd_dcount and bpf#bd_ccount) are also victims of this
restriction: for scalability the counters should be per-CPU and we should stop
using atomic operations for them; however, we have to keep the counters and
the atomic operations as they are.
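
The per-CPU counter idea mentioned above can be sketched in userland C as
follows. This is only an illustration under stated assumptions: every name
prefixed with sketch_ or SKETCH_ is hypothetical, the CPU count and cache line
size are made up, and the code is not taken from the kernel (where percpu(9)
would be the natural facility). Note that such a change alters the structure
layout, which is what the kvm(3) compatibility constraint prevents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NCPU		4	/* hypothetical CPU count */
#define SKETCH_CACHELINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so that two CPUs never share a cache line. */
struct sketch_slot {
	uint64_t rcount;	/* packets received */
	uint64_t dcount;	/* packets dropped */
	char pad[SKETCH_CACHELINE - 2 * sizeof(uint64_t)];
};

struct sketch_stats {
	struct sketch_slot slot[SKETCH_NCPU];
};

/*
 * Writers: a CPU only touches its own slot, so no atomic operation is
 * used.  A real kernel would also have to keep the writer from migrating
 * to another CPU in the middle of the update.
 */
static void
sketch_count_recv(struct sketch_stats *st, unsigned int cpu)
{
	assert(cpu < SKETCH_NCPU);
	st->slot[cpu].rcount++;
}

static void
sketch_count_drop(struct sketch_stats *st, unsigned int cpu)
{
	assert(cpu < SKETCH_NCPU);
	st->slot[cpu].dcount++;
}

/* Readers: sum every slot; the total is a snapshot, not an exact value. */
static uint64_t
sketch_total_recv(const struct sketch_stats *st)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < SKETCH_NCPU; i++)
		sum += st->slot[i].rcount;
	return sum;
}

static uint64_t
sketch_total_drop(const struct sketch_stats *st)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < SKETCH_NCPU; i++)
		sum += st->slot[i].dcount;
	return sum;
}
```

The trade-off is that updates become cheap and contention-free while reads
have to walk every slot, which suits counters that are written on every
packet and read only occasionally.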

Scalability
-----------

 - Per-CPU rtcaches (used in, say, IP forwarding) aren't scalable on multiple
   flows per CPU
 - ipsec(4) isn't scalable on the number of SA/SP; the cost of a look-up
   is O(n)
 - opencrypto(9)'s crypto_newsession()/crypto_freesession() aren't scalable
   as they are serialized by one mutex

ALTQ
----

If ALTQ is enabled in the kernel, it forces the use of just one Tx queue
(if_snd) for packet transmissions, which serializes all Tx packet processing
on the queue. We should probably design and implement an alternative queuing
mechanism that deals with multi-core systems in the first place, rather than
making the existing ALTQ MP-safe, which would just be annoying.

Using kernel modules
--------------------

Please note that if you enable NET_MPSAFE in your kernel and you use any
loadable kernel modules (including compat_xx modules or individual network
interface if_xxx device driver modules), you will need to build custom
modules. For each module you will need to add the following line to its
Makefile:

	CPPFLAGS+=	-DNET_MPSAFE

Failure to do this may result in unpredictable behavior.
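
A likely reason for this requirement is that code can compile differently
depending on whether NET_MPSAFE is defined, so a module built without it may
disagree with the kernel about locking. The userland sketch below illustrates
that effect only; the lock counter, the macros and the function are
hypothetical stand-ins and are not the kernel's actual definitions.

```c
#include <assert.h>

static int sketch_lock_depth;		/* stand-in for the big kernel lock */
static int sketch_lock_acquisitions;	/* how often the lock was taken */

static void
sketch_big_lock_enter(void)
{
	sketch_lock_depth++;
	sketch_lock_acquisitions++;
}

static void
sketch_big_lock_exit(void)
{
	assert(sketch_lock_depth > 0);
	sketch_lock_depth--;
}

/*
 * Hypothetical macros: with NET_MPSAFE defined the big lock is skipped
 * because finer-grained locks are expected to protect the data instead.
 */
#ifdef NET_MPSAFE
#define SKETCH_LOCK_UNLESS_MPSAFE()	do { } while (0)
#define SKETCH_UNLOCK_UNLESS_MPSAFE()	do { } while (0)
#else
#define SKETCH_LOCK_UNLESS_MPSAFE()	sketch_big_lock_enter()
#define SKETCH_UNLOCK_UNLESS_MPSAFE()	sketch_big_lock_exit()
#endif

/* A module entry point whose locking depends on how it was compiled. */
static int
sketch_module_output(int *shared_counter)
{
	SKETCH_LOCK_UNLESS_MPSAFE();
	(*shared_counter)++;
	SKETCH_UNLOCK_UNLESS_MPSAFE();
	return *shared_counter;
}
```

Compiled as is, sketch_module_output() takes the stand-in big lock; compiled
with -DNET_MPSAFE it takes no lock at all. A module and a kernel built with
different settings would therefore protect the same data in different ways.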

IPv4 address initialization atomicity
-------------------------------------

An IPv4 address is referenced by several data structures: an associated
interface, its local route, a connected route (if necessary), the global list,
the global hash table, etc. These data structures are not updated atomically,
i.e., the kernel can observe an IPv4 address in an inconsistent state while
the address is being initialized.

One known failure caused by the issue is that incoming packets destined for an
initializing address can loop in the network stack for a short period of time.
The address initialization creates a local route first and then registers the
initializing address in the global hash table, which is used to decide whether
an incoming packet is destined for the host by checking whether the
destination of the packet is registered in the hash table. So, if the host
allows forwarding, an incoming packet can match the local route of an
initializing address at ip_output while it fails the to-self check described
above at ip_input. Because the matched local route points to a loopback
interface as its destination interface, the incoming packet is sent to the
network stack (ip_input) again, which results in looping. The loop stops once
the initializing address is registered in the hash table.

One solution to the issue is to reorder the address initialization steps:
first register an address in the hash table, then create its routes.
Another solution is to use the routing table for the to-self check instead of
the global hash table, like IPv6.

if_flags
--------

To avoid data races on if_flags, it should be protected by a lock (currently
IFNET_LOCK). Thus, if_flags should not be accessed during packet processing,
to avoid performance degradation from lock contention. Traditionally the
IFF_RUNNING, IFF_UP and IFF_OACTIVE flags of if_flags are checked during
packet processing. If you make a driver MP-safe, you must remove such checks.

Drivers should not touch IFF_ALLMULTI. They are tempted to do so when updating
hardware multicast filters on SIOCADDMULTI/SIOCDELMULTI. Instead, they should
use the ETHER_F_ALLMULTI bit in struct ethercom::ec_flags, under ETHER_LOCK.
ether_ioctl takes care of presenting IFF_ALLMULTI according to the current
state of ETHER_F_ALLMULTI when queried with SIOCGIFFLAGS.

Also, IFF_PROMISC is checked in ether_input and we should get rid of that
check somehow.
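
The ETHER_F_ALLMULTI convention described above can be sketched in userland C
as follows. This is a simplified model, not the kernel code: the structure,
the flag values and every sk_ function are hypothetical, and ETHER_LOCK is
modeled by a plain boolean so that the example stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_IFF_ALLMULTI		0x0200	/* hypothetical flag values */
#define SK_ETHER_F_ALLMULTI	0x0001

struct sk_ethercom {
	bool	ec_locked;	/* stand-in for ETHER_LOCK */
	int	ec_flags;	/* state owned by the Ethernet layer */
	int	if_flags;	/* never written by the driver below */
};

static void
sk_ether_lock(struct sk_ethercom *ec)
{
	assert(!ec->ec_locked);
	ec->ec_locked = true;
}

static void
sk_ether_unlock(struct sk_ethercom *ec)
{
	assert(ec->ec_locked);
	ec->ec_locked = false;
}

/*
 * Driver side: when the multicast list no longer fits the hardware
 * filter, record all-multicast mode in ec_flags instead of if_flags.
 */
static void
sk_driver_set_filter(struct sk_ethercom *ec, int naddrs, int hw_slots)
{
	sk_ether_lock(ec);
	if (naddrs > hw_slots)
		ec->ec_flags |= SK_ETHER_F_ALLMULTI;
	else
		ec->ec_flags &= ~SK_ETHER_F_ALLMULTI;
	sk_ether_unlock(ec);
}

/*
 * Generic side: derive the flags reported to userland from ec_flags at
 * query time, so the driver never has to modify if_flags itself.
 */
static int
sk_report_if_flags(struct sk_ethercom *ec)
{
	int flags;

	sk_ether_lock(ec);
	flags = ec->if_flags & ~SK_IFF_ALLMULTI;
	if (ec->ec_flags & SK_ETHER_F_ALLMULTI)
		flags |= SK_IFF_ALLMULTI;
	sk_ether_unlock(ec);
	return flags;
}
```

In this model the driver only writes the bit it owns, and the reported flag
is computed on demand, which keeps if_flags out of the driver's filter update
path.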