/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2010 Sun Microsystems, Inc. All rights reserved.
 * Use is subject to license terms.
 * Copyright 2018 Joyent, Inc.
 * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
 * Copyright 2024 Oxide Computer Company
 */

/*
 * MAC data path
 *
 * The MAC data path is concerned with the flow of traffic from mac clients --
 * DLS, IP, etc. -- to various GLDv3 device drivers -- e1000g, vnic, aggr,
 * ixgbe, etc. -- and from the GLDv3 device drivers back to clients.
 *
 * -----------
 * Terminology
 * -----------
 *
 * MAC uses many different, but related, terms that are associated with the
 * design and structure of the data path. Before we cover other aspects, first
 * let's review the terminology that MAC uses.
 *
 * MAC
 *
 * This driver. It interfaces with device drivers and provides abstractions
 * that the rest of the system consumes. All data links -- things managed
 * with dladm(8) -- are accessed through MAC.
 *
 * GLDv3 DEVICE DRIVER
 *
 * A GLDv3 device driver refers to a driver, both for pseudo-devices and
 * real devices, which implements the GLDv3 driver API. Common examples of
 * these are igb and ixgbe, which are drivers for various Intel networking
 * cards. These devices may or may not have various features, such as
 * hardware rings and checksum offloading. For MAC, a GLDv3 device is the
 * final point for the transmission of a packet and the starting point for
 * the receipt of a packet.
 *
 * FLOWS
 *
 * At a high level, a flow refers to a series of packets that are related.
 * Oftentimes the term is used in the context of TCP to indicate a unique
 * TCP connection and the traffic over it. However, a flow can exist at
 * other levels of the system as well. MAC has a notion of a default flow
 * which is used for all unicast traffic addressed to the address of a MAC
 * device. For example, when a VNIC is created, a default flow is created
 * for the VNIC's MAC address. In addition, flows are created for broadcast
 * groups and a user may create a flow with flowadm(8).
 *
 * CLASSIFICATION
 *
 * Classification refers to the notion of identifying an incoming frame
 * based on its destination address, and optionally its source addresses,
 * and doing different processing based on that information. Classification
 * can be done in both hardware and software. In general, we usually only
 * classify based on the layer two destination, e.g., for Ethernet, the
 * destination MAC address.
 *
 * The system will also do classification based on layer three and layer
 * four properties. This is used to support things like flowadm(8), which
 * allows setting QoS and other properties on a per-flow basis.
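 *
 * As a brief illustration, the simplest software classification step is a
 * comparison of a frame's layer two destination address against a client's
 * unicast address. The helper below is a minimal sketch for this comment,
 * not a function in MAC itself:
 *
 *     #include <sys/ethernet.h>
 *
 *     static boolean_t
 *     l2_dst_matches(const struct ether_header *ehp, const uint8_t *addr)
 *     {
 *         // Compare the 6-byte Ethernet destination with addr.
 *         return (bcmp(&ehp->ether_dhost, addr, ETHERADDRL) == 0);
 *     }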
 *
 * RING
 *
 * Conceptually, a ring represents a series of framed messages, often in a
 * contiguous chunk of memory that acts as a circular buffer. Rings come in
 * a couple of forms. Generally they are either a hardware construct (hw
 * ring) or they are a software construct (sw ring) maintained by MAC.
 *
 * HW RING
 *
 * A hardware ring is a set of resources provided by a GLDv3 device driver
 * (even if it is a pseudo-device). A hardware ring comes in two different
 * forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
 * something that has a unique DMA (direct memory access) region and
 * generally supports some form of classification (though it isn't always
 * used), as well as a means of generating an interrupt specific to that
 * ring. For example, the device may generate a specific MSI-X for a PCI
 * express device. A tx ring is similar, except that it is dedicated to
 * transmission. It may also be a vector for enabling features such as VLAN
 * tagging and large transmit offloading. It usually has its own dedicated
 * interrupts for transmit being completed.
 *
 * SW RING
 *
 * A software ring is a construction of MAC. It represents the same thing
 * that a hardware ring generally does, a collection of frames. However,
 * instead of being in a contiguous ring of memory, they're instead linked
 * by using the mblk_t's b_next pointer. Each frame may itself be multiple
 * mblk_t's linked together by the b_cont pointer. A software ring always
 * represents a collection of classified packets; however, it varies as to
 * whether it uses only layer two information, or a combination of that and
 * additional layer three and layer four data.
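 *
 * To make the chain structure concrete, here is a minimal sketch (a
 * hypothetical helper, not part of MAC) that walks a software ring's chain,
 * counting whole frames via b_next and the bytes of each frame via b_cont:
 *
 *     #include <sys/stream.h>
 *     #include <sys/strsun.h>
 *
 *     static void
 *     count_chain(mblk_t *chain, uint_t *nframes, size_t *nbytes)
 *     {
 *         mblk_t *mp, *bp;
 *
 *         *nframes = 0;
 *         *nbytes = 0;
 *         for (mp = chain; mp != NULL; mp = mp->b_next) {
 *             (*nframes)++;
 *             for (bp = mp; bp != NULL; bp = bp->b_cont)
 *                 *nbytes += MBLKL(bp);
 *         }
 *     }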
 *
 * FANOUT
 *
 * Fanout is the idea of spreading out the load of processing frames based
 * on the source and destination information contained in the layer two,
 * three, and four headers, such that the data can then be processed in
 * parallel using multiple hardware threads.
 *
 * A fanout algorithm hashes the headers and uses that to place different
 * flows into a bucket. The most important thing is that packets that are
 * in the same flow end up in the same bucket. If they do not, performance
 * can be adversely affected. Consider the case of TCP. TCP severely
 * penalizes a connection if the data arrives out of order. If a given flow
 * is processed on different CPUs, then the data will appear out of order,
 * hence the invariant that fanout always hashes a given flow to the same
 * bucket, so that it gets processed on the same CPU.
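 *
 * A minimal sketch of such a bucket selection follows; the function and its
 * mixing step are illustrative only, not MAC's actual hash. What matters is
 * that all of the inputs are constant for a given flow, so a flow always
 * lands in the same bucket:
 *
 *     static uint_t
 *     fanout_bucket(uint32_t saddr, uint32_t daddr, uint16_t sport,
 *         uint16_t dport, uint_t nbuckets)
 *     {
 *         uint32_t hash;
 *
 *         // All inputs are invariant for a given flow.
 *         hash = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
 *         hash ^= hash >> 16;
 *         return (hash % nbuckets);
 *     }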
 *
 * RECEIVE SIDE SCALING (RSS)
 *
 * Receive side scaling is a term that isn't common in illumos, but is used
 * by vendors and was popularized by Microsoft. It refers to the idea of
 * spreading the incoming receive load out across multiple interrupts which
 * can be directed to different CPUs. This allows a device to leverage
 * hardware rings even when it doesn't support hardware classification. The
 * hardware uses an algorithm to perform fanout that ensures the flow
 * invariant is maintained.
 *
 * SOFT RING SET
 *
 * A soft ring set, commonly abbreviated SRS, is a collection of rings and
 * is used for both transmitting and receiving. It is maintained in the
 * structure mac_soft_ring_set_t. A soft ring set is usually associated
 * with flows, and coordinates both the use of hardware and software rings.
 * Because the use of hardware rings can change as devices such as VNICs
 * come and go, we always ensure that the set has software classification
 * rules that correspond to the hardware classification rules from rings.
 *
 * Soft ring sets are also used for the enforcement of various QoS
 * properties. For example, if a bandwidth limit has been placed on a
 * specific flow or device, then that will be enforced by the soft ring
 * set.
 *
 * SERVICE ATTACHMENT POINT (SAP)
 *
 * The service attachment point is a DLPI (Data Link Provider Interface)
 * concept; however, it comes up quite often in MAC. Most MAC devices speak
 * a protocol that has some notion of different channels or message type
 * identifiers. For example, Ethernet defines an EtherType which is a part
 * of the Ethernet header and defines the particular protocol of the data
 * payload. If the EtherType is set to 0x0800, then it defines that the
 * contents of that Ethernet frame is IPv4 traffic. For Ethernet, the
 * EtherType is the SAP.
 *
 * In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
 * and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
 * libdlpi(3LIB), user software can attach to arbitrary SAPs. With the
 * exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
 * consume the SAP; however, it uses that information as part of hashing
 * and it may be used as part of the construction of flows.
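 *
 * As a small illustration, for an untagged Ethernet frame the SAP is simply
 * the EtherType field pulled out of the header. This sketch assumes the
 * frame has already been pulled up far enough to read the header:
 *
 *     #include <sys/ethernet.h>
 *     #include <sys/byteorder.h>
 *
 *     static uint16_t
 *     frame_sap(const struct ether_header *ehp)
 *     {
 *         // 0x0800 (ETHERTYPE_IP) means the payload is IPv4.
 *         return (ntohs(ehp->ether_type));
 *     }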
 *
 * PRIMARY MAC CLIENT
 *
 * The primary mac client refers to a mac client whose unicast address
 * matches the address of the device itself. For example, if the system has
 * instances of the e1000g driver such as e1000g0, e1000g1, etc., the
 * primary mac client is the one named after the device itself. VNICs that
 * are created on top of such devices are not the primary client.
 *
 * TRANSMIT DESCRIPTORS
 *
 * Transmit descriptors are a resource that most GLDv3 device drivers have.
 * Generally, a GLDv3 device driver takes a frame that's meant to be output
 * and puts a copy of it into a region of memory. Each region of memory
 * usually has an associated descriptor that the device uses to manage
 * properties of the frames. Devices have a limited number of such
 * descriptors. They get reclaimed once the device finishes putting the
 * frame on the wire.
 *
 * If the driver runs out of transmit descriptors, for example because the
 * OS is generating more frames than the device can put on the wire, then
 * the driver will return those frames back to the MAC layer.
 *
 * ---------------------------------
 * Rings, Classification, and Fanout
 * ---------------------------------
 *
 * The heart of MAC is made up of rings, and not those that Elven-kings wear.
 * When receiving a packet, MAC breaks the work into two different, though
 * interrelated phases. The first phase is generally classification and then the
 * second phase is generally fanout. When a frame comes in from a GLDv3 Device,
 * MAC needs to determine where that frame should be delivered. If it's a
 * unicast frame (say a normal TCP/IP packet), then it will be delivered to a
 * single MAC client; however, if it's a broadcast or multicast frame, then MAC
 * may need to deliver it to multiple MAC clients.
 *
 * On transmit, classification isn't quite as important, but may still be used.
 * Unlike with the receive path, the classification is not used to determine
 * devices that should transmit something, but rather is used for special
 * properties of a flow, e.g., bandwidth limits for a given IP address, device,
 * or connection.
 *
 * MAC employs a software classifier and leverages hardware classification as
 * well. The software classifier can leverage the full layer two information,
 * source, destination, VLAN, and SAP. If the SAP indicates that IP traffic is
 * being sent, it can classify based on the IP header, and finally, it also
 * knows how to classify based on the local and remote ports of TCP, UDP, and
 * SCTP.
 *
 * Hardware classifiers vary in capability. Generally all hardware classifiers
 * provide the capability to classify based on the destination MAC address. Some
 * hardware has additional filters built in for performing more in-depth
 * classification; however, it often has much more limited resources for these
 * activities as compared to the layer two destination address classification.
 *
 * The modus operandi in MAC is to always ensure that we have software-based
 * capabilities and rules in place and then to supplement that with hardware
 * resources when available. In general, simple layer two classification is
 * sufficient and nothing else is used, unless a specific flow is created with
 * tools such as flowadm(8) or bandwidth limits are set on a device with
 * dladm(8).
 *
 * RINGS AND GROUPS
 *
 * To get into how rings and classification play together, it's first important
 * to understand how hardware devices commonly associate rings and allow them to
 * be programmed. Recall that a hardware ring should be thought of as a DMA
 * buffer and an interrupt resource. Rings are then collected into groups. A
 * group itself has a series of classification rules. One or more MAC addresses
 * are assigned to a group.
 *
 * Hardware devices vary in terms of what capabilities they provide. Sometimes
 * they allow for a dynamic assignment of rings to a group and sometimes they
 * have a static assignment of rings to a group. For example, the ixgbe driver
 * has a static assignment of rings to groups such that every group has exactly
 * one ring and the number of groups is equal to the number of rings.
 *
 * Classification and receive side scaling both come into play with how a device
 * advertises itself to MAC and how MAC uses it. If a device supports layer two
 * classification of frames, then MAC will assign MAC addresses to a group as a
 * form of primary classification. If a single MAC address is assigned to a
 * group, a common case, then MAC will consider packets that come in from rings
 * on that group to be fully classified and will not need to do any software
 * classification unless a specific flow has been created.
 *
 * If a device supports receive side scaling, then it may advertise or support
 * groups with multiple rings. In those cases, receive side scaling will
 * come into play and MAC will use that as a means of fanning out received
 * frames across multiple CPUs. This can also be combined with groups that
 * support layer two classification.
 *
 * If a device supports dynamic assignments of rings to groups, then MAC will
 * change around the way that rings are assigned to various groups as devices
 * come and go from the system. For example, when a VNIC is created, a new flow
 * will be created for the VNIC's MAC address. If a hardware ring is available,
 * MAC may opt to reassign it from one group to another.
 *
 * ASSIGNMENT OF HARDWARE RINGS
 *
 * This is a bit of a complicated subject that varies depending on the device,
 * the use of aggregations, and the special nature of the primary mac client.
 * This section deserves to be fleshed out.
 *
 * FANOUT
 *
 * illumos uses fanout to help spread out the incoming processing load of chains
 * of frames away from a single CPU. If a device supports receive side scaling,
 * then that provides an initial form of fanout; however, what we're concerned
 * with here all happens after a given set of frames has been classified to a
 * soft ring set.
 *
 * After frames reach a soft ring set, and after any bandwidth-related
 * accounting has been performed, they may be fanned out based on one of the
 * following three modes:
 *
 * o No Fanout
 * o Protocol level fanout
 * o Full software ring protocol fanout
 *
 * MAC makes the determination as to which of these modes a given soft ring set
 * obtains based on parameters such as whether or not it's the primary mac
 * client, whether it's on a 10 GbE or faster device, user controlled dladm(8)
 * properties, and the nature of the hardware and the resources that it has.
 *
 * When there is no fanout, MAC does not create any soft rings for a device and
 * the device has frames delivered directly to the MAC client.
 *
 * Otherwise, all fanout is performed by software. MAC divides incoming frames
 * into one of three buckets -- IPv4 TCP traffic, IPv4 UDP traffic, and
 * everything else. Regardless of the type of fanout, these three categories
 * or buckets are always used, as sketched below.
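 *
 * This sketch assumes a pulled-up, untagged IPv4 frame and uses hypothetical
 * names; it only illustrates the three-way split, not MAC's actual parsing:
 *
 *     #include <sys/ethernet.h>
 *     #include <netinet/in.h>
 *
 *     typedef enum { BKT_V4_TCP, BKT_V4_UDP, BKT_OTHER } bucket_t;
 *
 *     static bucket_t
 *     proto_bucket(uint16_t ethertype, const uint8_t *l3hdr)
 *     {
 *         if (ethertype != ETHERTYPE_IP)
 *             return (BKT_OTHER);
 *         switch (l3hdr[9]) {        // IPv4 protocol field
 *         case IPPROTO_TCP:
 *             return (BKT_V4_TCP);
 *         case IPPROTO_UDP:
 *             return (BKT_V4_UDP);
 *         default:
 *             return (BKT_OTHER);
 *         }
 *     }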
 *
 * The difference between protocol level fanout and full software ring protocol
 * fanout is the number of software rings that end up getting created. The
 * system always uses the same number of software rings per protocol bucket. So
 * in the first case when we're just doing protocol level fanout, we just create
 * one software ring each for IPv4 TCP traffic, IPv4 UDP traffic, and everything
 * else.
 *
 * In the case where we do full software ring protocol fanout, we generally use
 * mac_compute_soft_ring_count() to determine the number of rings. There are
 * other combinations of properties and devices that may send us down other
 * paths, but this is a common starting point. If it's a non-bandwidth enforced
 * device and we're on at least a 10 GbE link, then we'll use eight soft rings
 * per protocol bucket as a starting point. See mac_compute_soft_ring_count()
 * for more information on the total number.
 *
 * For each of these rings, we create a mac_soft_ring_t and an associated worker
 * thread. Particularly when doing full software ring protocol fanout, we bind
 * each of the worker threads to individual CPUs.
 *
 * The other advantage of these software rings is that they allow upper layers
 * to optionally poll on them. For example, TCP can leverage an squeue to poll
 * on the software ring; see squeue.c for more information.
 *
 * DLS BYPASS
 *
 * DLS is the data link services module. It interfaces with DLPI, which is the
 * primary way that other parts of the system such as IP interface with the MAC
 * layer. While DLS is traditionally a STREAMS-based interface, it allows
 * certain modules such as IP to negotiate more modern interfaces, which are
 * useful for higher performance and allow IP to use direct function calls to
 * DLS instead of STREAMS.
 *
 * When we have IPv4 TCP or UDP software rings, then traffic on those rings is
 * eligible for what we call the dls bypass. In those cases, rather than being
 * delivered to DLS via mac_rx_deliver(), frames instead go directly to the
 * callback registered with DLS, generally ip_input().
 *
 * HARDWARE RING POLLING
 *
 * GLDv3 devices with hardware rings generally deliver chains of messages
 * (mblk_t chain) during the context of a single interrupt. However, interrupts
 * are not the only way that these devices may be used. As part of implementing
 * ring support, a GLDv3 device driver must have a way to disable the generation
 * of that interrupt and allow for the operating system to poll on that ring.
 *
 * To implement this, every soft ring set has a worker thread and a polling
 * thread. If a sufficient packet rate comes into the system, MAC will 'blank'
 * (disable) interrupts on that specific ring and the polling thread will start
 * consuming packets from the hardware device and deliver them to the soft ring
 * set, where the worker thread will take over.
 *
 * Once the rate of packet intake drops down below a certain threshold, then
 * polling on the hardware ring will be quiesced and interrupts will be
 * re-enabled for the given ring. This effectively allows the system to shift
 * how it handles a ring based on its load. At high packet rates, polling on the
 * device as opposed to relying on interrupts can actually reduce overall system
 * load due to the minimization of interrupt activity.
 *
 * Note the importance of each ring having its own interrupt source. The whole
 * idea here is that we do not disable interrupts on the device as a whole, but
 * rather each ring can be independently toggled, as sketched below.
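 *
 * Conceptually, the poll/interrupt hand-off looks like the loop below. All of
 * the names here are hypothetical stand-ins; the real mechanism is driven by
 * the SRS state machine and the driver's ring-level interrupt entry points:
 *
 *     // Sketch only: shift a busy ring from interrupts to polling,
 *     // then restore interrupts once the load drops off.
 *     ring_intr_disable(ring);               // 'blank' the ring
 *     while (srs_load_is_high(srs)) {
 *         chain = ring_poll(ring, max_bytes_per_poll);
 *         srs_deliver(srs, chain);
 *     }
 *     ring_intr_enable(ring);                // back to interrupts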
 *
 * USE OF WORKER THREADS
 *
 * Both the soft ring set and individual soft rings have a worker thread
 * associated with them that may be bound to a specific CPU in the system. Any
 * such assignment will get reassessed as part of dynamic reconfiguration events
 * in the system such as the onlining and offlining of CPUs and the creation of
 * CPU partitions.
 *
 * In many cases, while in an interrupt, we try to deliver a frame all the way
 * through the stack in the context of the interrupt itself. However, if the
 * amount of queued frames has exceeded a threshold, then we instead defer to
 * the worker thread to do this work and signal it. This is particularly useful
 * when you have the soft ring set delivering frames into multiple software
 * rings. If it was only delivering frames into a single software ring then
 * there'd be no need to have another thread take over. However, if it's
 * delivering chains of frames to multiple rings, then it's worthwhile to have
 * the worker for the software ring take over so that the different software
 * rings can be processed in parallel.
 *
 * In a similar fashion to the hardware polling thread, if we don't have a
 * backlog or there's nothing to do, then the worker thread will go back to
 * sleep and frames can be delivered all the way from an interrupt. This
 * behavior is useful as it's designed to minimize latency and the default
 * disposition of MAC is to optimize for latency.
 *
 * MAINTAINING CHAINS
 *
 * Another useful idea that MAC uses is to try and maintain frames in chains for
 * as long as possible. The idea is that all of MAC can handle chains of frames
 * structured as a series of mblk_t structures linked with the b_next pointer.
 * When performing software classification and software fanout, MAC does not
 * simply determine the destination and send the frame along. Instead, in the
 * case of classification, it tries to maintain a chain for as long as possible
 * before passing it along and performing additional processing.
 *
 * In the case of fanout, MAC first determines what the target software ring is
 * for every frame in the original chain and constructs a new chain for each
 * target. MAC then delivers the new chain to each software ring in succession,
 * as sketched below.
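 *
 * A minimal sketch of that splitting step (hypothetical helper; ring_of()
 * stands in for whatever hash picks the target ring):
 *
 *     static void
 *     split_chain(mblk_t *chain, mblk_t **heads, mblk_t **tails,
 *         uint_t nrings)
 *     {
 *         mblk_t *mp, *next;
 *         uint_t r;
 *
 *         for (mp = chain; mp != NULL; mp = next) {
 *             next = mp->b_next;
 *             mp->b_next = NULL;
 *             r = ring_of(mp, nrings);
 *             // Append to the per-ring chain; tails[] keeps each
 *             // append O(1).
 *             if (heads[r] == NULL)
 *                 heads[r] = mp;
 *             else
 *                 tails[r]->b_next = mp;
 *             tails[r] = mp;
 *         }
 *     }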
 *
 * The whole rationale for doing this is that we want to try and maintain the
 * pipe as much as possible and deliver as many frames through the stack at once
 * as we can, rather than just pushing a single frame through. This can often
 * help bring down latency and allows MAC to get a better sense of the overall
 * activity in the system and properly engage worker threads.
 *
 * --------------------
 * Bandwidth Management
 * --------------------
 *
 * Bandwidth management is something that's built into the soft ring set itself.
 * When bandwidth limits are placed on a flow, a corresponding soft ring set is
 * toggled into bandwidth mode. This changes how we transmit and receive the
 * frames in question.
 *
 * Bandwidth management is done on a per-tick basis. We translate the user's
 * requested bandwidth from a quantity per-second into a quantity per-tick. MAC
 * cannot process a frame across more than one tick, thus it sets a lower bound
 * for the bandwidth cap to be a single MTU. This also means that when
 * high-resolution ticks are enabled (hz is set to 1000 rather than the default
 * 100), the minimum enforceable bandwidth is higher: MAC must still accept at
 * least one MTU per tick, so the floor rises from 100 to 1000 MTU-sized
 * packets per second.
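 *
 * The arithmetic is simple enough to show inline. This is a sketch with
 * hypothetical names, including the single-MTU floor described above:
 *
 *     static uint64_t
 *     bytes_per_tick(uint64_t limit_bps, uint_t ticks_per_sec, uint32_t mtu)
 *     {
 *         uint64_t allowance = (limit_bps / 8) / ticks_per_sec;
 *
 *         // MAC cannot split a frame across ticks, so never allow
 *         // less than one MTU per tick.
 *         return (allowance < mtu ? (uint64_t)mtu : allowance);
 *     }
 *
 * For example, a 100 Mbit/s limit with hz=100 yields 12,500,000 bytes/sec /
 * 100 ticks/sec = 125,000 bytes per tick.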
 *
 * The bandwidth counter is reset by either the soft ring set's worker thread or
 * a thread that is doing an inline transmit or receive if they discover that
 * the current tick is in the future from the recorded tick.
 *
 * Whenever we're receiving or transmitting data, we end up leaving most of the
 * work to the soft ring set's worker thread. This forces data inserted into the
 * soft ring set to be effectively serialized and allows us to meter out
 * bandwidth at a reasonable rate. If there is nothing in the soft ring set at
 * the moment and the set has available bandwidth, then it may be processed
 * inline. Otherwise, the worker is responsible for taking care of the soft
 * ring set.
 *
 * ---------------------
 * The Receive Data Path
 * ---------------------
 *
 * The following series of ASCII art images breaks apart the way that a frame
 * comes in and is processed in MAC.
 *
 * Part 1 -- Initial frame receipt, SRS classification
 *
 * Here, a frame is received by a GLDv3 driver, generally in the context of an
 * interrupt, and it ends up in mac_rx_common(). A driver calls either mac_rx()
 * or mac_rx_ring(), depending on whether or not it supports rings and can
 * identify the interrupt as having come from a specific ring. Here we determine
 * whether or not it's fully classified and perform software classification as
 * appropriate. From here, everything always ends up going to either entry [A]
 * or entry [B] based on whether or not subflow processing is needed. We
 * leave via fanout or delivery.
 *
 * +===========+
 * v hardware v
 * v interrupt v
 * +===========+
 * |
 * * . . appropriate
 * | upcall made
 * | by GLDv3 driver . . always
 * | .
 * +--------+ | +----------+ . +---------------+
 * | GLDv3 | +---->| mac_rx |-----*--->| mac_rx_common |
 * | Driver |-->--+ +----------+ +---------------+
 * +--------+ | ^ |
 * | | ^ v
 * ^ | * . . always +----------------------+
 * | | | | mac_promisc_dispatch |
 * | | +-------------+ +----------------------+
 * | +--->| mac_rx_ring | |
 * | +-------------+ * . . hw classified
 * | v or single flow?
 * | |
 * | +--------++--------------+
 * | | | * hw class,
 * | | * hw classified | subflows
 * | no hw class and . * | or single | exist
 * | subflows | | flow |
 * | | v v
 * | | +-----------+ +-----------+
 * | | | goto | | goto |
 * | | | entry [A] | | entry [B] |
 * | | +-----------+ +-----------+
 * | v ^
 * | +-------------+ |
 * | | mac_rx_flow | * SRS and flow found,
 * | +-------------+ | call flow cb
 * | | +------+
 * | v |
 * v +==========+ +-----------------+
 * | v For each v--->| mac_rx_classify |
 * +----------+ v mblk_t v +-----------------+
 * | srs | +==========+
 * | polling |
 * | thread |->------------------------------------------+
 * +----------+ |
 * v . inline
 * +--------------------+ +----------+ +---------+ .
 * [A]---->| mac_rx_srs_process |-->| check bw |-->| enqueue |--*---------+
 * +--------------------+ | limits | | frames | |
 * ^ +----------+ | to SRS | |
 * | +---------+ |
 * | send chain +--------+ | |
 * * when classified | signal | * BW limits, |
 * | flow changes | srs |<---+ loopback, |
 * | | worker | stack too |
 * | +--------+ deep |
 * +-----------------+ +--------+ |
 * | mac_flow_lookup | | srs | +---------------------+ |
 * +-----------------+ | worker |---->| mac_rx_srs_drain |<---+
 * ^ | thread | | mac_rx_srs_drain_bw |
 * | +--------+ +---------------------+
 * | |
 * +----------------------------+ * software rings
 * [B]-->| mac_rx_srs_subflow_process | | for fanout?
 * +----------------------------+ |
 * +----------+-----------+
 * | |
 * v v
 * +--------+ +--------+
 * | goto | | goto |
 * | Part 2 | | Part 3 |
 * +--------+ +--------+
 *
 * Part 2 -- Fanout
 *
 * This part is concerned with using software fanout to assign frames to
 * software rings and then deliver them to MAC clients or allow those rings to
 * be polled upon. While there are two different primary fanout entry points,
 * mac_rx_srs_fanout and mac_rx_srs_proto_fanout, they behave in similar ways,
 * and aside from some of the individual hashing techniques used, most of the
 * general flow is the same.
 *
 * +--------+ +-------------------+
 * | From |---+--------->| mac_rx_srs_fanout |----+
 * | Part 1 | | +-------------------+ | +=================+
 * +--------+ | | v for each mblk_t v
 * * . . protocol only +--->v assign to new v
 * | fanout | v chain based on v
 * | | v hash % nrings v
 * | +-------------------------+ | +=================+
 * +--->| mac_rx_srs_proto_fanout |----+ |
 * +-------------------------+ |
 * v
 * +------------+ +--------------------------+ +================+
 * | enqueue in |<---| mac_rx_soft_ring_process |<------v for each chain v
 * | soft ring | +--------------------------+ +================+
 * +------------+
 * | +-----------+
 * * soft ring set | soft ring |
 * | empty and no | worker |
 * | worker? | thread |
 * | +-----------+
 * +------*----------------+ |
 * | . | v
 * No . * . Yes | +------------------------+
 * | +----<--| mac_rx_soft_ring_drain |
 * | | +------------------------+
 * v |
 * +-----------+ v
 * | signal | +---------------+
 * | soft ring | | Deliver chain |
 * | worker | | goto Part 3 |
 * +-----------+ +---------------+
 *
 *
 * Part 3 -- Packet Delivery
 *
 * Here, we go through and deliver the mblk_t chain directly to a given
 * processing function. In a lot of cases this is mac_rx_deliver(). When DLS
 * bypass is in use, we instead deliver the chain to the direct callback
 * registered with DLS, generally ip_input().
 *
 *
 * +---------+ +----------------+ +------------------+
 * | From |---+------->| mac_rx_deliver |--->| Off to DLS, or |
 * | Parts 1 | | +----------------+ | other MAC client |
 * | and 2 | * DLS bypass +------------------+
 * +---------+ | enabled +----------+ +-------------+
 * +---------->| ip_input |--->| To IP |
 * +----------+ | and beyond! |
 * +-------------+
 *
 * ----------------------
 * The Transmit Data Path
 * ----------------------
 *
 * Before we go into the images, it's worth talking about a problem that is a
 * bit different from the receive data path. GLDv3 device drivers have a finite
 * number of transmit descriptors. When they run out, they return unused frames
 * back to MAC. MAC, at this point, has several options about what it will do,
 * which vary based upon the settings that the client uses.
 *
 * When a device runs out of descriptors, the next thing that MAC does is
 * enqueue them off of the soft ring set or a software ring, depending on the
 * configuration of the soft ring set. MAC will enqueue up to a high watermark
 * of mblk_t chains, at which point it will indicate flow control back to the
 * client. Once this condition is reached, any mblk_t chains that were not
 * enqueued will be returned to the caller and they will have to decide what to
 * do with them. There are various flags that control this behavior that a
 * client may pass, which are discussed below.
 *
 * When this condition is hit, MAC also returns a cookie to the client in
 * addition to unconsumed frames. Clients can poll on that cookie and register a
 * callback with MAC to be notified when they are no longer subject to flow
 * control, at which point they may continue to call mac_tx(). This flow control
 * actually manages to work itself all the way up the stack, back through dls,
 * to ip, through the various protocols, and to sockfs.
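 *
 * Sketched in terms of the mac_client.h interface (hedged; consult
 * mac_client.h for the authoritative signatures), a client transmit under
 * the default mode looks roughly like:
 *
 *     mac_tx_cookie_t cookie;
 *     mblk_t *ret_mp = NULL;
 *
 *     cookie = mac_tx(mch, chain, 0, 0, &ret_mp);
 *     if (cookie != 0) {
 *         // Flow controlled: ret_mp holds any unconsumed frames.
 *         // Stash them and wait for the tx-notify callback before
 *         // calling mac_tx() again.
 *     }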
 *
 * While the behavior described above is the default, this behavior can be
 * modified. There are two alternate modes, described below, which are
 * controlled with flags.
 *
 * DROP MODE
 *
 * This mode is controlled by having the client pass the MAC_DROP_ON_NO_DESC
 * flag. When this is passed, if a device driver runs out of transmit
 * descriptors, then the MAC layer will drop any unsent traffic. The client in
 * this case will never have any frames returned to it.
 *
 * DON'T ENQUEUE
 *
 * This mode is controlled by having the client pass the MAC_TX_NO_ENQUEUE flag.
 * If the MAC_DROP_ON_NO_DESC flag is also passed, it takes precedence. In this
 * mode, when we hit a case where a driver runs out of transmit descriptors,
 * then instead of enqueuing packets in a soft ring set or software ring, we
 * instead return the mblk_t chain back to the caller and immediately put the
 * soft ring set into flow control mode.
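 *
 * Both alternate modes are selected per-call with flags to mac_tx(); a
 * hedged sketch, reusing the variables from the earlier example:
 *
 *     // Drop rather than queue when descriptors run out:
 *     cookie = mac_tx(mch, chain, 0, MAC_DROP_ON_NO_DESC, &ret_mp);
 *
 *     // Return unconsumed frames to the caller instead of enqueueing:
 *     cookie = mac_tx(mch, chain, 0, MAC_TX_NO_ENQUEUE, &ret_mp);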
 *
 * The following series of ASCII art images describe the transmit data path that
 * MAC clients enter into based on calling into mac_tx(). A soft ring set has a
 * transmission function associated with it. There are seven possible
 * transmission modes, some of which share function entry points. The one that a
 * soft ring set gets depends on properties such as whether there are
 * transmission rings for fanout, whether the device involves aggregations,
 * whether any bandwidth limits exist, etc.
 *
 *
 * Part 1 -- Initial checks
 *
 * * . called by
 * | MAC clients
 * v . . No
 * +--------+ +-----------+ . +-------------------+ +====================+
 * | mac_tx |->| device |-*-->| mac_protect_check |->v Is this the simple v
 * +--------+ | quiesced? | +-------------------+ v case? See [1] v
 * +-----------+ | +====================+
 * * . Yes * failed |
 * v | frames |
 * +--------------+ | +-------+---------+
 * | freemsgchain |<---------+ Yes . * No . *
 * +--------------+ v v
 * +-----------+ +--------+
 * | goto | | goto |
 * | Part 2 | | SRS TX |
 * | Entry [A] | | func |
 * +-----------+ +--------+
 * | |
 * | v
 * | +--------+
 * +---------->| return |
 * | cookie |
 * +--------+
 *
 * [1] The simple case refers to the SRS being configured with the
 * SRS_TX_DEFAULT transmission mode, having a single mblk_t (not a chain), there
 * being only a single active client, and not having a backlog in the srs.
 *
 *
 * Part 2 -- The SRS transmission functions
 *
 * This part is a bit more complicated. The different transmission paths often
 * leverage one another. In this case, we'll draw out the more common ones
 * before the parts that depend upon them. Here, we're going to start with the
 * workings of mac_tx_send(), a common function that most of the others end up
 * calling.
 *
 * +-------------+
 * | mac_tx_send |
 * +-------------+
 * |
 * v
 * +=============+ +==============+
 * v more than v--->v check v
 * v one client? v v VLAN and add v
 * +=============+ v VLAN tags v
 * | +==============+
 * | |
 * +------------------+
 * |
 * | [A]
 * v |
 * +============+ . No v
 * v more than v . +==========+ +--------------------------+
 * v one active v-*---->v for each v---->| mac_promisc_dispatch_one |---+
 * v client? v v mblk_t v +--------------------------+ |
 * +============+ +==========+ ^ |
 * | | +==========+ |
 * * . Yes | v hardware v<-------+
 * v +------------+ v rings? v
 * +==========+ | +==========+
 * v for each v No . . . * |
 * v mblk_t v specific | |
 * +==========+ flow | +-----+-----+
 * | | | |
 * v | v v
 * +-----------------+ | +-------+ +---------+
 * | mac_tx_classify |------------+ | GLDv3 | | GLDv3 |
 * +-----------------+ |TX func| | ring tx |
 * | +-------+ | func |
 * * Specific flow, generally | +---------+
 * | bcast, mcast, loopback | |
 * v +-----+-----+
 * +==========+ +---------+ |
 * v valid L2 v--*--->| freemsg | v
 * v header v . No +---------+ +-------------------+
 * +==========+ | return unconsumed |
 * * . Yes | frames to the |
 * v | caller |
 * +===========+ +-------------------+
 * v broadcast v +----------------+ ^
 * v flow? v--*-->| mac_bcast_send |------------------+
 * +===========+ . +----------------+ |
 * | . . Yes |
 * No . * v
 * | +---------------------+ +---------------+ +----------+
 * +->|mac_promisc_dispatch |->| mac_fix_cksum |->| flow |
 * +---------------------+ +---------------+ | callback |
 * +----------+
 *
 *
 * In addition, many, but not all, of these routines rely on
 * mac_tx_soft_ring_process() as an entry point.
 *
 * . No . No
 * +--------------------------+ +========+ . +===========+ . +-------------+
 * | mac_tx_soft_ring_process |-->v worker v-*->v out of tx v-*->| goto |
 * +--------------------------+ v only? v v descr.? v | mac_tx_send |
 * +========+ +===========+ +-------------+
 * Yes . * * . Yes |
 * . No v | v
 * +=========+ . +===========+ . Yes | Yes . +==========+
 * v append v<--*----------v out of tx v-*-------+---------*--v returned v
 * v mblk_t v v descr.? v | v frames? v
 * v chain v +===========+ | +==========+
 * +=========+ | *. No
 * | | v
 * v v +------------+
 * +===================+ +----------------------+ | done |
 * v worker scheduled? v | mac_tx_sring_enqueue | | processing |
 * v Out of tx descr? v +----------------------+ +------------+
 * +===================+ |
 * | | . Yes v
 * * Yes * No . +============+
 * | v +-*---------v drop on no v
 * | +========+ v v TX desc? v
 * | v wake v +----------+ +============+
 * | v worker v | mac_pkt_ | * . No
 * | +========+ | drop | | . Yes . No
 * | | +----------+ v . .
 * | | v ^ +===============+ . +========+ .
 * +--+--------+---------+ | v Don't enqueue v-*->v ring v-*----+
 * | | v Set? v v empty? v |
 * | +---------------+ +===============+ +========+ |
 * | | | | |
 * | | +-------------------+ | |
 * | *. Yes | +---------+ |
 * | | v v v
 * | | +===========+ +========+ +--------------+
 * | +<-v At hiwat? v v append v | return |
 * | +===========+ v mblk_t v | mblk_t chain |
 * | * No v chain v | and flow |
 * | v +========+ | control |
 * | +=========+ | | cookie |
 * | v append v v +--------------+
 * | v mblk_t v +========+
 * | v chain v v wake v +------------+
 * | +=========+ v worker v-->| done |
 * | | +========+ | processing |
 * | v .. Yes +------------+
 * | +=========+ . +========+
 * | v first v--*-->v wake v
 * | v append? v v worker v
 * | +=========+ +========+
 * | | |
 * | No . * |
 * | v |
 * | +--------------+ |
 * +------>| Return | |
 * | flow control |<------------+
 * | cookie |
 * +--------------+
 *
 *
 * The remaining images are all specific to each of the different transmission
 * modes.
 *
 * SRS TX DEFAULT
 *
 * [ From Part 1 ]
 * |
 * v
 * +-------------------------+
 * | mac_tx_single_ring_mode |
 * +-------------------------+
 * |
 * | . Yes
 * v .
 * +==========+ . +============+
 * v SRS v-*->v Try to v---->---------------------+
 * v backlog? v v enqueue in v |
 * +==========+ v SRS v-->------+ * . . Queue too
 * | +============+ * don't enqueue | deep or
 * * . No ^ | | flag or at | drop flag
 * | | v | hiwat, |
 * v | | | return +---------+
 * +-------------+ | | | cookie | freemsg |
 * | goto |-*-----+ | | +---------+
 * | mac_tx_send | . returned | | |
 * +-------------+ mblk_t | | |
 * | | | |
 * | | | |
 * * . . all mblk_t * queued, | |
 * v consumed | may return | |
 * +-------------+ | tx cookie | |
 * | SRS TX func |<------------+------------+----------------+
 * | completed |
 * +-------------+
 *
 * SRS_TX_SERIALIZE
 *
840bc44a933SRobert Mustacchi * +------------------------+
841bc44a933SRobert Mustacchi * | mac_tx_serializer_mode |
842bc44a933SRobert Mustacchi * +------------------------+
843bc44a933SRobert Mustacchi * |
844bc44a933SRobert Mustacchi * | . No
845bc44a933SRobert Mustacchi * v .
846bc44a933SRobert Mustacchi * +============+ . +============+ +-------------+ +============+
847bc44a933SRobert Mustacchi * v srs being v-*->v set SRS v--->| goto |-->v remove SRS v
848bc44a933SRobert Mustacchi * v processed? v v proc flags v | mac_tx_send | v proc flag v
849bc44a933SRobert Mustacchi * +============+ +============+ +-------------+ +============+
850bc44a933SRobert Mustacchi * | |
851bc44a933SRobert Mustacchi * * Yes |
852bc44a933SRobert Mustacchi * v . No v
853bc44a933SRobert Mustacchi * +--------------------+ . +==========+
854bc44a933SRobert Mustacchi * | mac_tx_srs_enqueue | +------------------------*-----<--v returned v
855bc44a933SRobert Mustacchi * +--------------------+ | v frames? v
856bc44a933SRobert Mustacchi * | | . Yes +==========+
857bc44a933SRobert Mustacchi * | | . |
858bc44a933SRobert Mustacchi * | | . +=========+ v
859bc44a933SRobert Mustacchi * v +-<-*-v queued v +--------------------+
860bc44a933SRobert Mustacchi * +-------------+ | v frames? v<----| mac_tx_srs_enqueue |
861bc44a933SRobert Mustacchi * | SRS TX func | | +=========+ +--------------------+
862bc44a933SRobert Mustacchi * | completed, |<------+ * . Yes
863bc44a933SRobert Mustacchi * | may return | | v
864bc44a933SRobert Mustacchi * | cookie | | +========+
865bc44a933SRobert Mustacchi * +-------------+ +-<---v wake v
866bc44a933SRobert Mustacchi * v worker v
867bc44a933SRobert Mustacchi * +========+
868bc44a933SRobert Mustacchi *
869bc44a933SRobert Mustacchi *
870bc44a933SRobert Mustacchi * SRS_TX_FANOUT
871bc44a933SRobert Mustacchi *
872bc44a933SRobert Mustacchi * . Yes
873bc44a933SRobert Mustacchi * +--------------------+ +=============+ . +--------------------------+
874bc44a933SRobert Mustacchi * | mac_tx_fanout_mode |--->v Have fanout v-*-->| goto |
875bc44a933SRobert Mustacchi * v hint? v | mac_tx_soft_ring_process |
876bc44a933SRobert Mustacchi * +=============+ +--------------------------+
877bc44a933SRobert Mustacchi * * . No |
878bc44a933SRobert Mustacchi * v ^
879bc44a933SRobert Mustacchi * +===========+ |
880bc44a933SRobert Mustacchi * +--->v for each v +===============+
881bc44a933SRobert Mustacchi * | v mblk_t v v pick softring v
882bc44a933SRobert Mustacchi * same * +===========+ v from hash v
883bc44a933SRobert Mustacchi * hash | | +===============+
884bc44a933SRobert Mustacchi * | v |
885bc44a933SRobert Mustacchi * | +--------------+ |
886bc44a933SRobert Mustacchi * +---| mac_pkt_hash |--->*------------+
887bc44a933SRobert Mustacchi * +--------------+ . different
888bc44a933SRobert Mustacchi * hash or
889bc44a933SRobert Mustacchi * done proc.
890bc44a933SRobert Mustacchi *                          chain
 *
 * SRS_TX_AGGR
891bc44a933SRobert Mustacchi *
892bc44a933SRobert Mustacchi * +------------------+ +================================+
893bc44a933SRobert Mustacchi * | mac_tx_aggr_mode |--->v Use aggr capab function to v
894bc44a933SRobert Mustacchi * +------------------+ v find appropriate tx ring. v
895bc44a933SRobert Mustacchi * v Applies hash based on aggr v
896bc44a933SRobert Mustacchi * v policy, see mac_tx_aggr_mode() v
897bc44a933SRobert Mustacchi * +================================+
898bc44a933SRobert Mustacchi * |
899bc44a933SRobert Mustacchi * v
900bc44a933SRobert Mustacchi * +-------------------------------+
901bc44a933SRobert Mustacchi * | goto |
902bc44a933SRobert Mustacchi * | mac_tx_soft_ring_process      |
903bc44a933SRobert Mustacchi * +-------------------------------+
904bc44a933SRobert Mustacchi *
905bc44a933SRobert Mustacchi *
906bc44a933SRobert Mustacchi * SRS_TX_BW, SRS_TX_BW_FANOUT, SRS_TX_BW_AGGR
907bc44a933SRobert Mustacchi *
908bc44a933SRobert Mustacchi * Note that all three of these tx functions start from the same place --
909bc44a933SRobert Mustacchi * mac_tx_bw_mode().
910bc44a933SRobert Mustacchi *
911bc44a933SRobert Mustacchi * +----------------+
912bc44a933SRobert Mustacchi * | mac_tx_bw_mode |
913bc44a933SRobert Mustacchi * +----------------+
914bc44a933SRobert Mustacchi * |
915bc44a933SRobert Mustacchi * v . No . No . Yes
916bc44a933SRobert Mustacchi * +==============+ . +============+ . +=============+ . +=========+
917bc44a933SRobert Mustacchi * v Out of BW? v--*->v SRS empty? v--*->v reset BW v-*->v Bump BW v
918bc44a933SRobert Mustacchi * +==============+ +============+ v tick count? v v Usage v
919bc44a933SRobert Mustacchi * | | +=============+ +=========+
920bc44a933SRobert Mustacchi * | +---------+ | |
921bc44a933SRobert Mustacchi * | | +--------------------+ |
922bc44a933SRobert Mustacchi * | | | +----------------------+
923bc44a933SRobert Mustacchi * v | v v
924bc44a933SRobert Mustacchi * +===============+ | +==========+ +==========+ +------------------+
925bc44a933SRobert Mustacchi * v Don't enqueue v | v set bw v v Is aggr? v--*-->| goto |
926bc44a933SRobert Mustacchi * v flag set? v | v enforced v +==========+ . | mac_tx_aggr_mode |-+
927bc44a933SRobert Mustacchi * +===============+ | +==========+ | . +------------------+ |
928bc44a933SRobert Mustacchi * | Yes .* | | No . * . |
929bc44a933SRobert Mustacchi * | | | | | . Yes |
930bc44a933SRobert Mustacchi * * . No | | v | |
931bc44a933SRobert Mustacchi * | +---------+ | +========+ v +======+ |
932bc44a933SRobert Mustacchi * | | freemsg | | v append v +============+ . Yes v pick v |
933bc44a933SRobert Mustacchi * | +---------+ | v mblk_t v v Is fanout? v--*---->v ring v |
934bc44a933SRobert Mustacchi * | | | v chain v +============+ +======+ |
935bc44a933SRobert Mustacchi * +------+ | +========+ | | |
936bc44a933SRobert Mustacchi * v | | v v |
937bc44a933SRobert Mustacchi * +---------+ | v +-------------+ +--------------------+ |
938bc44a933SRobert Mustacchi * | return | | +========+ | goto | | goto | |
939bc44a933SRobert Mustacchi * | flow | | v wakeup v | mac_tx_send | | mac_tx_fanout_mode | |
940bc44a933SRobert Mustacchi * | control | | v worker v +-------------+ +--------------------+ |
941bc44a933SRobert Mustacchi * | cookie | | +========+ | | |
942bc44a933SRobert Mustacchi * +---------+ | | | +------+------+
943bc44a933SRobert Mustacchi * | v | |
944bc44a933SRobert Mustacchi * | +---------+ | v
945bc44a933SRobert Mustacchi * | | return | +============+ +------------+
946bc44a933SRobert Mustacchi * | | flow | v unconsumed v-------+ | done |
947bc44a933SRobert Mustacchi * | | control | v frames? v | | processing |
948bc44a933SRobert Mustacchi * | | cookie | +============+ | +------------+
949bc44a933SRobert Mustacchi * | +---------+ | |
950bc44a933SRobert Mustacchi * | Yes * |
951bc44a933SRobert Mustacchi * | | |
952bc44a933SRobert Mustacchi * | +===========+ |
953bc44a933SRobert Mustacchi * | v subtract v |
954bc44a933SRobert Mustacchi * | v unused bw v |
955bc44a933SRobert Mustacchi * | +===========+ |
956bc44a933SRobert Mustacchi * | | |
957bc44a933SRobert Mustacchi * | v |
958bc44a933SRobert Mustacchi * | +--------------------+ |
959bc44a933SRobert Mustacchi * +------------->| mac_tx_srs_enqueue | |
960bc44a933SRobert Mustacchi * +--------------------+ |
961bc44a933SRobert Mustacchi * | |
962bc44a933SRobert Mustacchi * | |
963bc44a933SRobert Mustacchi * +------------+ |
964bc44a933SRobert Mustacchi * | return fc | |
965bc44a933SRobert Mustacchi * | cookie and |<------+
966bc44a933SRobert Mustacchi * | mblk_t |
967bc44a933SRobert Mustacchi * +------------+
968bc44a933SRobert Mustacchi */
969bc44a933SRobert Mustacchi
970da14cebeSEric Cheng #include <sys/types.h>
971da14cebeSEric Cheng #include <sys/callb.h>
972c61a1653SRyan Zezeski #include <sys/pattr.h>
973da14cebeSEric Cheng #include <sys/sdt.h>
974da14cebeSEric Cheng #include <sys/strsubr.h>
975da14cebeSEric Cheng #include <sys/strsun.h>
976da14cebeSEric Cheng #include <sys/vlan.h>
977e2ea9c96SRobert Mustacchi #include <sys/stack.h>
978e2ea9c96SRobert Mustacchi #include <sys/archsystm.h>
979da14cebeSEric Cheng #include <inet/ipsec_impl.h>
980da14cebeSEric Cheng #include <inet/ip_impl.h>
981da14cebeSEric Cheng #include <inet/sadb.h>
982da14cebeSEric Cheng #include <inet/ipsecesp.h>
983da14cebeSEric Cheng #include <inet/ipsecah.h>
984da14cebeSEric Cheng #include <inet/ip6.h>
985da14cebeSEric Cheng
986da14cebeSEric Cheng #include <sys/mac_impl.h>
987da14cebeSEric Cheng #include <sys/mac_client_impl.h>
988da14cebeSEric Cheng #include <sys/mac_client_priv.h>
989da14cebeSEric Cheng #include <sys/mac_soft_ring.h>
990da14cebeSEric Cheng #include <sys/mac_flow_impl.h>
991da14cebeSEric Cheng
992da14cebeSEric Cheng static mac_tx_cookie_t mac_tx_single_ring_mode(mac_soft_ring_set_t *, mblk_t *,
993da14cebeSEric Cheng uintptr_t, uint16_t, mblk_t **);
994da14cebeSEric Cheng static mac_tx_cookie_t mac_tx_serializer_mode(mac_soft_ring_set_t *, mblk_t *,
995da14cebeSEric Cheng uintptr_t, uint16_t, mblk_t **);
996da14cebeSEric Cheng static mac_tx_cookie_t mac_tx_fanout_mode(mac_soft_ring_set_t *, mblk_t *,
997da14cebeSEric Cheng uintptr_t, uint16_t, mblk_t **);
998da14cebeSEric Cheng static mac_tx_cookie_t mac_tx_bw_mode(mac_soft_ring_set_t *, mblk_t *,
999da14cebeSEric Cheng uintptr_t, uint16_t, mblk_t **);
10000dc2366fSVenugopal Iyer static mac_tx_cookie_t mac_tx_aggr_mode(mac_soft_ring_set_t *, mblk_t *,
10010dc2366fSVenugopal Iyer uintptr_t, uint16_t, mblk_t **);
1002da14cebeSEric Cheng
1003da14cebeSEric Cheng typedef struct mac_tx_mode_s {
1004da14cebeSEric Cheng mac_tx_srs_mode_t mac_tx_mode;
1005da14cebeSEric Cheng mac_tx_func_t mac_tx_func;
1006da14cebeSEric Cheng } mac_tx_mode_t;
1007da14cebeSEric Cheng
1008da14cebeSEric Cheng /*
10090dc2366fSVenugopal Iyer * There are seven modes of operation on the Tx side. These modes get set
1010da14cebeSEric Cheng * in mac_tx_srs_setup(). Except for the experimental TX_SERIALIZE mode,
1011da14cebeSEric Cheng * none of the other modes are user configurable. They get selected by
1012da14cebeSEric Cheng * the system depending upon whether the link (or flow) has multiple Tx
10130dc2366fSVenugopal Iyer * rings or a bandwidth configured, or if the link is an aggr, etc.
10140dc2366fSVenugopal Iyer *
10150dc2366fSVenugopal Iyer * When the Tx SRS is operating in aggr mode (st_mode) or if there are
10160dc2366fSVenugopal Iyer * multiple Tx rings owned by Tx SRS, then each Tx ring (pseudo or
10170dc2366fSVenugopal Iyer * otherwise) will have a soft ring associated with it. These soft rings
10180dc2366fSVenugopal Iyer * are stored in srs_tx_soft_rings[] array.
10190dc2366fSVenugopal Iyer *
10200dc2366fSVenugopal Iyer * Additionally in the case of aggr, there is the st_soft_rings[] array
10210dc2366fSVenugopal Iyer * in the mac_srs_tx_t structure. This array is used to store the same
10220dc2366fSVenugopal Iyer * set of soft rings that are present in srs_tx_soft_rings[] array but
10230dc2366fSVenugopal Iyer * in a different manner. The soft ring associated with the pseudo Tx
10240dc2366fSVenugopal Iyer * ring is saved at mr_index (of the pseudo ring) in st_soft_rings[]
10250dc2366fSVenugopal Iyer * array. This helps in quickly getting the soft ring associated with the
10260dc2366fSVenugopal Iyer * Tx ring when aggr_find_tx_ring() returns the pseudo Tx ring that is to
10270dc2366fSVenugopal Iyer * be used for transmit.
1028da14cebeSEric Cheng */
1029da14cebeSEric Cheng mac_tx_mode_t mac_tx_mode_list[] = {
1030da14cebeSEric Cheng {SRS_TX_DEFAULT, mac_tx_single_ring_mode},
1031da14cebeSEric Cheng {SRS_TX_SERIALIZE, mac_tx_serializer_mode},
1032da14cebeSEric Cheng {SRS_TX_FANOUT, mac_tx_fanout_mode},
1033da14cebeSEric Cheng {SRS_TX_BW, mac_tx_bw_mode},
10340dc2366fSVenugopal Iyer {SRS_TX_BW_FANOUT, mac_tx_bw_mode},
10350dc2366fSVenugopal Iyer {SRS_TX_AGGR, mac_tx_aggr_mode},
10360dc2366fSVenugopal Iyer {SRS_TX_BW_AGGR, mac_tx_bw_mode}
1037da14cebeSEric Cheng };
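
/*
 * As a minimal sketch (a hypothetical helper, not code from this file)
 * of the constant-time lookup that the st_soft_rings[] array described
 * above enables once aggr_find_tx_ring() has returned a pseudo Tx ring:
 *
 *	mac_soft_ring_t *
 *	aggr_tx_ring_to_soft_ring(mac_srs_tx_t *st, mac_ring_t *ring)
 *	{
 *		return (st->st_soft_rings[ring->mr_index]);
 *	}
 */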
1038da14cebeSEric Cheng
1039da14cebeSEric Cheng /*
1040da14cebeSEric Cheng * Soft Ring Set (SRS) - The Run time code that deals with
1041da14cebeSEric Cheng * dynamic polling from the hardware, bandwidth enforcement,
1042da14cebeSEric Cheng * fanout etc.
1043da14cebeSEric Cheng *
1044da14cebeSEric Cheng * We try to use H/W classification on the NIC and assign traffic for
1045da14cebeSEric Cheng * a MAC address to a particular Rx ring or ring group. There is a
1046da14cebeSEric Cheng * 1-1 mapping between a SRS and a Rx ring. The SRS dynamically
1047da14cebeSEric Cheng * switches the underlying Rx ring between interrupt and
1048da14cebeSEric Cheng * polling mode and enforces any specified B/W control.
1049da14cebeSEric Cheng *
1050da14cebeSEric Cheng * There is always a SRS created and tied to each H/W and S/W rule.
1051da14cebeSEric Cheng * Whenever we create a H/W rule, we always add the same rule to
1052da14cebeSEric Cheng * S/W classifier and tie a SRS to it.
1053da14cebeSEric Cheng *
1054da14cebeSEric Cheng * In case a B/W control is specified, it is broken into bytes
1055da14cebeSEric Cheng * per tick and, as soon as the quota for a tick is exhausted,
1056da14cebeSEric Cheng * the underlying Rx ring is forced into poll mode for the
1057da14cebeSEric Cheng * remainder of the tick. The SRS poll thread only polls for bytes
1058da14cebeSEric Cheng * that are allowed to come into the SRS. We typically let 4x the
1059da14cebeSEric Cheng * configured B/W worth of packets come into the SRS (to prevent
1060da14cebeSEric Cheng * unnecessary drops due to bursts) but only process the specified amount.
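 *
 * As a worked example (assuming a 100 Hz clock, i.e. 100 ticks per
 * second), a 400 Mbps B/W control breaks down into
 * 400000000 / 8 / 100 = 500000 bytes per tick; the SRS would then
 * admit up to roughly 2MB worth of packets within a tick but only
 * process the 500KB quota.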
1061da14cebeSEric Cheng *
1062da14cebeSEric Cheng * A MAC client (e.g. a VNIC or aggr) can have 1 or more
1063da14cebeSEric Cheng * Rx rings (and corresponding SRSs) assigned to it. The SRS
1064da14cebeSEric Cheng * in turn can have softrings to do protocol level fanout or
1065da14cebeSEric Cheng * softrings to do S/W based fanout or both. In case the NIC
1066da14cebeSEric Cheng * has no Rx rings, we do S/W classification to the respective SRS.
1067da14cebeSEric Cheng * The S/W classification rule is always set up and ready. This
1068da14cebeSEric Cheng * allows the MAC layer to reassign Rx rings whenever needed
1069da14cebeSEric Cheng * while packets continue to flow via the default path and
1070da14cebeSEric Cheng * get S/W classified to the correct SRS.
1071da14cebeSEric Cheng *
1072da14cebeSEric Cheng * The SRSes are used on both the Tx and Rx side. They use the same
1073da14cebeSEric Cheng * data structure but the processing routines have slightly different
1074da14cebeSEric Cheng * semantics due to the fact that Rx side needs to do dynamic
1075da14cebeSEric Cheng * polling etc.
1076da14cebeSEric Cheng *
1077da14cebeSEric Cheng * Dynamic Polling Notes
1078da14cebeSEric Cheng * =====================
1079da14cebeSEric Cheng *
1080da14cebeSEric Cheng * Each Soft ring set is capable of switching its Rx ring between
1081da14cebeSEric Cheng * interrupt and poll mode and actively 'polls' for packets in
1082da14cebeSEric Cheng * poll mode. If the SRS is implementing a B/W limit, it makes
1083da14cebeSEric Cheng * sure that no more than the max allowed packets are pulled in
1084da14cebeSEric Cheng * poll mode and switches to poll mode as soon as the B/W limit is
1085da14cebeSEric Cheng * exceeded. As such, there are no overheads to implement B/W limits.
1086da14cebeSEric Cheng *
1087da14cebeSEric Cheng * In poll mode, it's better to keep the pipeline going where the
1088da14cebeSEric Cheng * SRS worker thread keeps processing packets and the poll thread
1089da14cebeSEric Cheng * keeps bringing in more packets (especially if they get to run
1090da14cebeSEric Cheng * on different CPUs). This also avoids the overhead associated
1091da14cebeSEric Cheng * with excessive signalling (on NUMA machines, this can be
1092da14cebeSEric Cheng * pretty devastating). The exception is the latency optimized case
1093da14cebeSEric Cheng * where the worker thread does no work and the interrupt and poll
1094da14cebeSEric Cheng * threads are allowed to do their own drain.
1095da14cebeSEric Cheng *
1096da14cebeSEric Cheng * We use the following policy to control Dynamic Polling:
1097da14cebeSEric Cheng * 1) We switch to poll mode anytime the processing
1098da14cebeSEric Cheng * thread causes a backlog to build up in SRS and
1099da14cebeSEric Cheng * its associated Soft Rings (sr_poll_pkt_cnt > 0).
1100da14cebeSEric Cheng * 2) As long as the backlog stays under the low water
1101da14cebeSEric Cheng * mark (sr_lowat), we poll the H/W for more packets.
1102da14cebeSEric Cheng * 3) If the backlog (sr_poll_pkt_cnt) exceeds low
1103da14cebeSEric Cheng * water mark, we stay in poll mode but don't poll
1104da14cebeSEric Cheng * the H/W for more packets.
1105da14cebeSEric Cheng * 4) Anytime in polling mode, if we poll the H/W for
1106da14cebeSEric Cheng * packets and find nothing plus we have an existing
1107da14cebeSEric Cheng * backlog (sr_poll_pkt_cnt > 0), we stay in polling
1108da14cebeSEric Cheng * mode but don't poll the H/W for packets anymore
1109da14cebeSEric Cheng * (let the polling thread go to sleep).
1110da14cebeSEric Cheng * 5) Once the backlog is relieved (packets are processed)
1111da14cebeSEric Cheng * we reenable polling (by signalling the poll thread)
1112da14cebeSEric Cheng * only when the backlog dips below sr_poll_thres.
1113da14cebeSEric Cheng * 6) sr_hiwat is used exclusively when we are not
1114da14cebeSEric Cheng * polling capable and is used to decide when to
1115da14cebeSEric Cheng * drop packets so the SRS queue length doesn't grow
1116da14cebeSEric Cheng * infinitely.
1117da14cebeSEric Cheng *
1118da14cebeSEric Cheng * NOTE: Also see the block level comment on top of mac_soft_ring.c
1119da14cebeSEric Cheng */
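
/*
 * The following is a rough sketch of policy items 1-4 above as a
 * single predicate (a hypothetical helper, not code from this file;
 * the real logic is spread across the SRS drain and poll routines):
 *
 *	boolean_t
 *	srs_should_poll_hw(mac_soft_ring_set_t *srs)
 *	{
 *		mac_srs_rx_t *srs_rx = &srs->srs_rx;
 *
 *		return (srs_rx->sr_poll_pkt_cnt > 0 &&
 *		    srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat);
 *	}
 */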
1120da14cebeSEric Cheng
1121da14cebeSEric Cheng /*
1122da14cebeSEric Cheng * mac_latency_optimize
1123da14cebeSEric Cheng *
1124da14cebeSEric Cheng * Controls whether the poll thread can process the packets inline
1125da14cebeSEric Cheng * or let the SRS worker thread do the processing. This applies if
1126da14cebeSEric Cheng * the SRS was not being processed. For latency sensitive traffic,
1127da14cebeSEric Cheng * this needs to be true to allow inline processing. For throughput
1128da14cebeSEric Cheng * under load, this should be false.
1129da14cebeSEric Cheng *
1130da14cebeSEric Cheng * This (and other similar) tunable should be rolled into a link
1131da14cebeSEric Cheng * or flow specific workload hint that can be set using dladm
1132da14cebeSEric Cheng * linkprop (instead of multiple such tunables).
1133da14cebeSEric Cheng */
1134da14cebeSEric Cheng boolean_t mac_latency_optimize = B_TRUE;
1135da14cebeSEric Cheng
1136da14cebeSEric Cheng /*
1137da14cebeSEric Cheng * MAC_RX_SRS_ENQUEUE_CHAIN and MAC_TX_SRS_ENQUEUE_CHAIN
1138da14cebeSEric Cheng *
1139da14cebeSEric Cheng * Queue an mblk_t or chain in the soft ring set and increment the
1140da14cebeSEric Cheng * local count (srs_count) for the SRS and the shared counter
1141da14cebeSEric Cheng * (srs_poll_pkt_cnt - shared between SRS and its soft rings
1142da14cebeSEric Cheng * to track the total unprocessed packets for polling to work
1143da14cebeSEric Cheng * correctly).
1144da14cebeSEric Cheng *
1145da14cebeSEric Cheng * The size (total bytes queued) counters are incremented only
1146da14cebeSEric Cheng * if we are doing B/W control.
1147da14cebeSEric Cheng */
1148da14cebeSEric Cheng #define MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1149da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1150da14cebeSEric Cheng if ((mac_srs)->srs_last != NULL) \
1151da14cebeSEric Cheng (mac_srs)->srs_last->b_next = (head); \
1152da14cebeSEric Cheng else \
1153da14cebeSEric Cheng (mac_srs)->srs_first = (head); \
1154da14cebeSEric Cheng (mac_srs)->srs_last = (tail); \
1155da14cebeSEric Cheng (mac_srs)->srs_count += count; \
1156da14cebeSEric Cheng }
1157da14cebeSEric Cheng
1158da14cebeSEric Cheng #define MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1159da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1160da14cebeSEric Cheng \
1161da14cebeSEric Cheng MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1162da14cebeSEric Cheng srs_rx->sr_poll_pkt_cnt += count; \
1163da14cebeSEric Cheng ASSERT(srs_rx->sr_poll_pkt_cnt > 0); \
1164da14cebeSEric Cheng if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1165da14cebeSEric Cheng (mac_srs)->srs_size += (sz); \
1166da14cebeSEric Cheng mutex_enter(&(mac_srs)->srs_bw->mac_bw_lock); \
1167da14cebeSEric Cheng (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1168da14cebeSEric Cheng mutex_exit(&(mac_srs)->srs_bw->mac_bw_lock); \
1169da14cebeSEric Cheng } \
1170da14cebeSEric Cheng }
1171da14cebeSEric Cheng
1172da14cebeSEric Cheng #define MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz) { \
1173da14cebeSEric Cheng mac_srs->srs_state |= SRS_ENQUEUED; \
1174da14cebeSEric Cheng MAC_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, count, sz); \
1175da14cebeSEric Cheng if ((mac_srs)->srs_type & SRST_BW_CONTROL) { \
1176da14cebeSEric Cheng (mac_srs)->srs_size += (sz); \
1177da14cebeSEric Cheng (mac_srs)->srs_bw->mac_bw_sz += (sz); \
1178da14cebeSEric Cheng } \
1179da14cebeSEric Cheng }
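
/*
 * A minimal usage sketch for the Rx variant (hypothetical locals; the
 * caller is expected to have walked the chain to find the tail and to
 * have computed the packet count and byte size), paired with the
 * worker wakeup defined further below:
 *
 *	mutex_enter(&mac_srs->srs_lock);
 *	MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail, cnt, sz);
 *	MAC_SRS_WORKER_WAKEUP(mac_srs);
 *	mutex_exit(&mac_srs->srs_lock);
 */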
1180da14cebeSEric Cheng
1181da14cebeSEric Cheng /*
1182da14cebeSEric Cheng * Macros to turn polling on
1183da14cebeSEric Cheng */
1184da14cebeSEric Cheng #define MAC_SRS_POLLING_ON(mac_srs) { \
1185da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1186da14cebeSEric Cheng if (((mac_srs)->srs_state & \
1187da14cebeSEric Cheng (SRS_POLLING_CAPAB|SRS_POLLING)) == SRS_POLLING_CAPAB) { \
1188da14cebeSEric Cheng (mac_srs)->srs_state |= SRS_POLLING; \
1189da14cebeSEric Cheng (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1190da14cebeSEric Cheng (mac_srs)->srs_ring); \
1191da14cebeSEric Cheng (mac_srs)->srs_rx.sr_poll_on++; \
1192da14cebeSEric Cheng } \
1193da14cebeSEric Cheng }
1194da14cebeSEric Cheng
1195da14cebeSEric Cheng #define MAC_SRS_WORKER_POLLING_ON(mac_srs) { \
1196da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1197da14cebeSEric Cheng if (((mac_srs)->srs_state & \
1198da14cebeSEric Cheng (SRS_POLLING_CAPAB|SRS_WORKER|SRS_POLLING)) == \
1199da14cebeSEric Cheng (SRS_POLLING_CAPAB|SRS_WORKER)) { \
1200da14cebeSEric Cheng (mac_srs)->srs_state |= SRS_POLLING; \
1201da14cebeSEric Cheng (void) mac_hwring_disable_intr((mac_ring_handle_t) \
1202da14cebeSEric Cheng (mac_srs)->srs_ring); \
1203da14cebeSEric Cheng (mac_srs)->srs_rx.sr_worker_poll_on++; \
1204da14cebeSEric Cheng } \
1205da14cebeSEric Cheng }
1206da14cebeSEric Cheng
1207da14cebeSEric Cheng /*
1208da14cebeSEric Cheng * MAC_SRS_POLL_RING
1209da14cebeSEric Cheng *
1210da14cebeSEric Cheng * Signal the SRS poll thread to poll the underlying H/W ring
1211da14cebeSEric Cheng * provided it wasn't already polling (SRS_GET_PKTS was set).
1212da14cebeSEric Cheng *
1213da14cebeSEric Cheng * Poll thread gets to run only from mac_rx_srs_drain() and only
1214da14cebeSEric Cheng * if the drain was being done by the worker thread.
1215da14cebeSEric Cheng */
1216da14cebeSEric Cheng #define MAC_SRS_POLL_RING(mac_srs) { \
1217da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &(mac_srs)->srs_rx; \
1218da14cebeSEric Cheng \
1219da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1220da14cebeSEric Cheng srs_rx->sr_poll_thr_sig++; \
1221da14cebeSEric Cheng if (((mac_srs)->srs_state & \
1222da14cebeSEric Cheng (SRS_POLLING_CAPAB|SRS_WORKER|SRS_GET_PKTS)) == \
1223da14cebeSEric Cheng (SRS_WORKER|SRS_POLLING_CAPAB)) { \
1224da14cebeSEric Cheng (mac_srs)->srs_state |= SRS_GET_PKTS; \
1225da14cebeSEric Cheng cv_signal(&(mac_srs)->srs_cv); \
1226da14cebeSEric Cheng } else { \
1227da14cebeSEric Cheng srs_rx->sr_poll_thr_busy++; \
1228da14cebeSEric Cheng } \
1229da14cebeSEric Cheng }
1230da14cebeSEric Cheng
1231da14cebeSEric Cheng /*
1232da14cebeSEric Cheng * MAC_SRS_CHECK_BW_CONTROL
1233da14cebeSEric Cheng *
1234da14cebeSEric Cheng * Check to see if the next tick has started so we can reset the
1235da14cebeSEric Cheng * SRS_BW_ENFORCED flag and allow more packets to come into the
1236da14cebeSEric Cheng * system.
1237da14cebeSEric Cheng */
1238da14cebeSEric Cheng #define MAC_SRS_CHECK_BW_CONTROL(mac_srs) { \
1239da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1240da14cebeSEric Cheng ASSERT(((mac_srs)->srs_type & SRST_TX) || \
1241da14cebeSEric Cheng MUTEX_HELD(&(mac_srs)->srs_bw->mac_bw_lock)); \
1242d3d50737SRafael Vanoni clock_t now = ddi_get_lbolt(); \
1243d3d50737SRafael Vanoni if ((mac_srs)->srs_bw->mac_bw_curr_time != now) { \
1244d3d50737SRafael Vanoni (mac_srs)->srs_bw->mac_bw_curr_time = now; \
1245da14cebeSEric Cheng (mac_srs)->srs_bw->mac_bw_used = 0; \
1246da14cebeSEric Cheng if ((mac_srs)->srs_bw->mac_bw_state & SRS_BW_ENFORCED) \
1247da14cebeSEric Cheng (mac_srs)->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED; \
1248da14cebeSEric Cheng } \
1249da14cebeSEric Cheng }
1250da14cebeSEric Cheng
1251da14cebeSEric Cheng /*
1252da14cebeSEric Cheng * MAC_SRS_WORKER_WAKEUP
1253da14cebeSEric Cheng *
1254da14cebeSEric Cheng * Wake up the SRS worker thread to process the queue as long as
1255da14cebeSEric Cheng * no one else is processing the queue. If we are optimizing for
1256da14cebeSEric Cheng * latency, we wake up the worker thread immediately; otherwise we
1257da14cebeSEric Cheng * wait mac_srs_worker_wakeup_ticks before the worker thread gets
1258da14cebeSEric Cheng * woken up.
1259da14cebeSEric Cheng */
1260da14cebeSEric Cheng int mac_srs_worker_wakeup_ticks = 0;
1261da14cebeSEric Cheng #define MAC_SRS_WORKER_WAKEUP(mac_srs) { \
1262da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock)); \
1263da14cebeSEric Cheng if (!((mac_srs)->srs_state & SRS_PROC) && \
1264da14cebeSEric Cheng (mac_srs)->srs_tid == NULL) { \
12653631b19bSRajagopal Kunhappan if (((mac_srs)->srs_state & SRS_LATENCY_OPT) || \
1266da14cebeSEric Cheng (mac_srs_worker_wakeup_ticks == 0)) \
1267da14cebeSEric Cheng cv_signal(&(mac_srs)->srs_async); \
1268da14cebeSEric Cheng else \
1269da14cebeSEric Cheng (mac_srs)->srs_tid = \
1270da14cebeSEric Cheng timeout(mac_srs_fire, (mac_srs), \
1271da14cebeSEric Cheng mac_srs_worker_wakeup_ticks); \
1272da14cebeSEric Cheng } \
1273da14cebeSEric Cheng }
1274da14cebeSEric Cheng
1275da14cebeSEric Cheng #define TX_BANDWIDTH_MODE(mac_srs) \
1276da14cebeSEric Cheng ((mac_srs)->srs_tx.st_mode == SRS_TX_BW || \
12770dc2366fSVenugopal Iyer (mac_srs)->srs_tx.st_mode == SRS_TX_BW_FANOUT || \
12780dc2366fSVenugopal Iyer (mac_srs)->srs_tx.st_mode == SRS_TX_BW_AGGR)
1279da14cebeSEric Cheng
1280da14cebeSEric Cheng #define TX_SRS_TO_SOFT_RING(mac_srs, head, hint) { \
12810dc2366fSVenugopal Iyer if (tx_mode == SRS_TX_BW_FANOUT) \
12820dc2366fSVenugopal Iyer (void) mac_tx_fanout_mode(mac_srs, head, hint, 0, NULL);\
12830dc2366fSVenugopal Iyer else \
12840dc2366fSVenugopal Iyer (void) mac_tx_aggr_mode(mac_srs, head, hint, 0, NULL); \
1285da14cebeSEric Cheng }
1286da14cebeSEric Cheng
1287da14cebeSEric Cheng /*
1288da14cebeSEric Cheng * MAC_TX_SRS_BLOCK
1289da14cebeSEric Cheng *
1290da14cebeSEric Cheng * Always called from mac_tx_srs_drain() function. SRS_TX_BLOCKED
1291da14cebeSEric Cheng * will be set only if st_woken_up is FALSE. If
1292da14cebeSEric Cheng * st_woken_up is TRUE, it indicates that the wakeup arrived
1293da14cebeSEric Cheng * before we grabbed srs_lock to set SRS_TX_BLOCKED. We need to
1294da14cebeSEric Cheng * attempt to transmit again, and leaving SRS_TX_BLOCKED unset
1295da14cebeSEric Cheng * ensures that.
1296da14cebeSEric Cheng */
1297da14cebeSEric Cheng #define MAC_TX_SRS_BLOCK(srs, mp) { \
1298da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(srs)->srs_lock)); \
1299da14cebeSEric Cheng if ((srs)->srs_tx.st_woken_up) { \
1300da14cebeSEric Cheng (srs)->srs_tx.st_woken_up = B_FALSE; \
1301da14cebeSEric Cheng } else { \
1302da14cebeSEric Cheng ASSERT(!((srs)->srs_state & SRS_TX_BLOCKED)); \
1303da14cebeSEric Cheng (srs)->srs_state |= SRS_TX_BLOCKED; \
13040dc2366fSVenugopal Iyer (srs)->srs_tx.st_stat.mts_blockcnt++; \
1305da14cebeSEric Cheng } \
1306da14cebeSEric Cheng }
1307da14cebeSEric Cheng
1308da14cebeSEric Cheng /*
1309da14cebeSEric Cheng * MAC_TX_SRS_TEST_HIWAT
1310da14cebeSEric Cheng *
1311da14cebeSEric Cheng * Called before queueing a packet onto Tx SRS to test and set
1312da14cebeSEric Cheng * SRS_TX_HIWAT if srs_count exceeds srs_tx_hiwat.
1313da14cebeSEric Cheng */
1314da14cebeSEric Cheng #define MAC_TX_SRS_TEST_HIWAT(srs, mp, tail, cnt, sz, cookie) { \
1315da14cebeSEric Cheng boolean_t enqueue = 1; \
1316da14cebeSEric Cheng \
1317da14cebeSEric Cheng if ((srs)->srs_count > (srs)->srs_tx.st_hiwat) { \
1318da14cebeSEric Cheng /* \
1319da14cebeSEric Cheng * flow-controlled. Store srs in cookie so that it \
1320da14cebeSEric Cheng * can be returned as mac_tx_cookie_t to client \
1321da14cebeSEric Cheng */ \
1322da14cebeSEric Cheng (srs)->srs_state |= SRS_TX_HIWAT; \
1323da14cebeSEric Cheng cookie = (mac_tx_cookie_t)srs; \
1324da14cebeSEric Cheng (srs)->srs_tx.st_hiwat_cnt++; \
1325da14cebeSEric Cheng if ((srs)->srs_count > (srs)->srs_tx.st_max_q_cnt) { \
1326da14cebeSEric Cheng /* increment freed stats */ \
13270dc2366fSVenugopal Iyer (srs)->srs_tx.st_stat.mts_sdrops += cnt; \
1328da14cebeSEric Cheng /* \
1329da14cebeSEric Cheng * b_prev may be set to the fanout hint \
1330da14cebeSEric Cheng * hence can't use freemsg directly \
1331da14cebeSEric Cheng */ \
1332c61a1653SRyan Zezeski 			mac_drop_chain((mp), "SRS Tx max queue"); \
1333da14cebeSEric Cheng DTRACE_PROBE1(tx_queued_hiwat, \
1334da14cebeSEric Cheng mac_soft_ring_set_t *, srs); \
1335da14cebeSEric Cheng enqueue = 0; \
1336da14cebeSEric Cheng } \
1337da14cebeSEric Cheng } \
1338da14cebeSEric Cheng if (enqueue) \
1339da14cebeSEric Cheng MAC_TX_SRS_ENQUEUE_CHAIN(srs, mp, tail, cnt, sz); \
1340da14cebeSEric Cheng }
1341da14cebeSEric Cheng
1342da14cebeSEric Cheng /* Some utility macros */
1343da14cebeSEric Cheng #define MAC_SRS_BW_LOCK(srs) \
1344da14cebeSEric Cheng if (!(srs->srs_type & SRST_TX)) \
1345da14cebeSEric Cheng mutex_enter(&srs->srs_bw->mac_bw_lock);
1346da14cebeSEric Cheng
1347da14cebeSEric Cheng #define MAC_SRS_BW_UNLOCK(srs) \
1348da14cebeSEric Cheng if (!(srs->srs_type & SRST_TX)) \
1349da14cebeSEric Cheng mutex_exit(&srs->srs_bw->mac_bw_lock);
1350da14cebeSEric Cheng
1351c61a1653SRyan Zezeski #define MAC_TX_SRS_DROP_MESSAGE(srs, chain, cookie, s) { \
1352c61a1653SRyan Zezeski mac_drop_chain((chain), (s)); \
1353da14cebeSEric Cheng /* increment freed stats */ \
1354c61a1653SRyan Zezeski (srs)->srs_tx.st_stat.mts_sdrops++; \
1355c61a1653SRyan Zezeski (cookie) = (mac_tx_cookie_t)(srs); \
1356da14cebeSEric Cheng }
1357da14cebeSEric Cheng
1358da14cebeSEric Cheng #define MAC_TX_SET_NO_ENQUEUE(srs, mp_chain, ret_mp, cookie) { \
1359da14cebeSEric Cheng 	(srs)->srs_state |= SRS_TX_WAKEUP_CLIENT;	\
1360da14cebeSEric Cheng 	cookie = (mac_tx_cookie_t)(srs);		\
1361da14cebeSEric Cheng *ret_mp = mp_chain; \
1362da14cebeSEric Cheng }
1363da14cebeSEric Cheng
1364da14cebeSEric Cheng /*
1365cab61dfbSBryan Cantrill * Threshold used in receive-side processing to determine if handling
1366e2ea9c96SRobert Mustacchi * can occur in situ (in the interrupt thread) or if it should be left to a
1367e2ea9c96SRobert Mustacchi * worker thread. Note that the constant used to make this determination is
1368e2ea9c96SRobert Mustacchi * not entirely made-up, and is a result of some empirical validation. That
1369cab61dfbSBryan Cantrill * said, the constant is left as a global variable to allow it to be
1370e2ea9c96SRobert Mustacchi * dynamically tuned in the field if and as needed.
1371e2ea9c96SRobert Mustacchi */
1372cab61dfbSBryan Cantrill uintptr_t mac_rx_srs_stack_needed = 14336;
1373cab61dfbSBryan Cantrill uint_t mac_rx_srs_stack_toodeep;
1374e2ea9c96SRobert Mustacchi
1375e2ea9c96SRobert Mustacchi #ifndef STACK_GROWTH_DOWN
1376e2ea9c96SRobert Mustacchi #error Downward stack growth assumed.
1377e2ea9c96SRobert Mustacchi #endif
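
/*
 * A sketch of how the threshold above is consumed (stack_left() is a
 * hypothetical helper standing in for the arithmetic on the current
 * stack pointer that the Rx delivery path performs):
 *
 *	if (stack_left() < mac_rx_srs_stack_needed) {
 *		mac_rx_srs_stack_toodeep++;
 *		deliver_via_worker = B_TRUE;
 *	}
 */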
1378e2ea9c96SRobert Mustacchi
1379e2ea9c96SRobert Mustacchi /*
1380da14cebeSEric Cheng * Drop the rx packet and advance to the next one in the chain.
1381da14cebeSEric Cheng */
1382da14cebeSEric Cheng static void
1383da14cebeSEric Cheng mac_rx_drop_pkt(mac_soft_ring_set_t *srs, mblk_t *mp)
1384da14cebeSEric Cheng {
1385da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &srs->srs_rx;
1386da14cebeSEric Cheng
1387da14cebeSEric Cheng ASSERT(mp->b_next == NULL);
1388da14cebeSEric Cheng mutex_enter(&srs->srs_lock);
1389da14cebeSEric Cheng MAC_UPDATE_SRS_COUNT_LOCKED(srs, 1);
1390da14cebeSEric Cheng MAC_UPDATE_SRS_SIZE_LOCKED(srs, msgdsize(mp));
1391da14cebeSEric Cheng mutex_exit(&srs->srs_lock);
1392da14cebeSEric Cheng
13930dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_sdrops++;
1394da14cebeSEric Cheng freemsg(mp);
1395da14cebeSEric Cheng }
1396da14cebeSEric Cheng
1397da14cebeSEric Cheng /* DATAPATH RUNTIME ROUTINES */
1398da14cebeSEric Cheng
1399da14cebeSEric Cheng /*
1400da14cebeSEric Cheng * mac_srs_fire
1401da14cebeSEric Cheng *
1402da14cebeSEric Cheng * Timer callback routine for waking up the SRS worker thread.
1403da14cebeSEric Cheng */
1404da14cebeSEric Cheng static void
1405da14cebeSEric Cheng mac_srs_fire(void *arg)
1406da14cebeSEric Cheng {
1407da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)arg;
1408da14cebeSEric Cheng
1409da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
1410da906b4eSRyan Zezeski if (mac_srs->srs_tid == NULL) {
1411da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
1412da14cebeSEric Cheng return;
1413da14cebeSEric Cheng }
1414da14cebeSEric Cheng
1415da906b4eSRyan Zezeski mac_srs->srs_tid = NULL;
1416da14cebeSEric Cheng if (!(mac_srs->srs_state & SRS_PROC))
1417da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
1418da14cebeSEric Cheng
1419da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
1420da14cebeSEric Cheng }
1421da14cebeSEric Cheng
1422da14cebeSEric Cheng /*
1423da14cebeSEric Cheng * 'hint' is the fanout hint (a uint64_t) given by the TCP/IP stack;
1424da14cebeSEric Cheng * it is used on the Tx path.
1425da14cebeSEric Cheng */
14266fa43c05SRao Shoaib #define HASH_HINT(hint) \
14276fa43c05SRao Shoaib ((hint) ^ ((hint) >> 24) ^ ((hint) >> 16) ^ ((hint) >> 8))
14286fa43c05SRao Shoaib
1429da14cebeSEric Cheng
1430da14cebeSEric Cheng /*
1431dea290bfSSaso Kiselkov * hash based on the src address, dst address and the port information.
1432da14cebeSEric Cheng */
1433dea290bfSSaso Kiselkov #define HASH_ADDR(src, dst, ports) \
1434dea290bfSSaso Kiselkov (ntohl((src) + (dst)) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ \
1435da14cebeSEric Cheng ((ports) >> 8) ^ (ports))
1436da14cebeSEric Cheng
1437da14cebeSEric Cheng #define COMPUTE_INDEX(key, sz) ((key) % (sz))
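
/*
 * A minimal sketch of how the two macros above combine on the Rx
 * fanout path ('ports' packs the two 16-bit ports of the transport
 * header; names are illustrative):
 *
 *	uint32_t ports = *(uint32_t *)whereptr;
 *	uint_t hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst, ports);
 *	uint_t indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
 */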
1438da14cebeSEric Cheng
1439da14cebeSEric Cheng #define FANOUT_ENQUEUE_MP(head, tail, cnt, bw_ctl, sz, sz0, mp) { \
1440da14cebeSEric Cheng if ((tail) != NULL) { \
1441da14cebeSEric Cheng ASSERT((tail)->b_next == NULL); \
1442da14cebeSEric Cheng (tail)->b_next = (mp); \
1443da14cebeSEric Cheng } else { \
1444da14cebeSEric Cheng ASSERT((head) == NULL); \
1445da14cebeSEric Cheng (head) = (mp); \
1446da14cebeSEric Cheng } \
1447da14cebeSEric Cheng (tail) = (mp); \
1448da14cebeSEric Cheng (cnt)++; \
1449da14cebeSEric Cheng if ((bw_ctl)) \
1450da14cebeSEric Cheng (sz) += (sz0); \
1451da14cebeSEric Cheng }
1452da14cebeSEric Cheng
1453da14cebeSEric Cheng #define MAC_FANOUT_DEFAULT 0
1454da14cebeSEric Cheng #define MAC_FANOUT_RND_ROBIN 1
1455da14cebeSEric Cheng int mac_fanout_type = MAC_FANOUT_DEFAULT;
1456da14cebeSEric Cheng
1457da14cebeSEric Cheng #define MAX_SR_TYPES 3
1458da14cebeSEric Cheng /* fanout types for port based hashing */
1459da14cebeSEric Cheng enum pkt_type {
1460da14cebeSEric Cheng V4_TCP = 0,
1461da14cebeSEric Cheng V4_UDP,
1462da14cebeSEric Cheng OTH,
1463da14cebeSEric Cheng UNDEF
1464da14cebeSEric Cheng };
1465da14cebeSEric Cheng
1466da14cebeSEric Cheng /*
1467da14cebeSEric Cheng * Pair of local and remote ports in the transport header
1468da14cebeSEric Cheng */
1469da14cebeSEric Cheng #define PORTS_SIZE 4
1470da14cebeSEric Cheng
1471da14cebeSEric Cheng /*
147284de666eSRyan Zezeski * This routine delivers packets destined for an SRS into one of the
1473da14cebeSEric Cheng * protocol soft rings.
1474da14cebeSEric Cheng *
147584de666eSRyan Zezeski * Given a chain of packets we need to split it up into multiple sub
147684de666eSRyan Zezeski * chains, one per soft ring: TCP, UDP or OTH. Instead of entering
147784de666eSRyan Zezeski * the soft ring one packet at a time, we want to enter it in the form
147884de666eSRyan Zezeski * of a chain; otherwise we get a start/stop behaviour where the worker
147984de666eSRyan Zezeski * thread goes to sleep and then the next packet comes in, forcing it
148084de666eSRyan Zezeski * to wake up.
1481da14cebeSEric Cheng */
1482da14cebeSEric Cheng static void
1483da14cebeSEric Cheng mac_rx_srs_proto_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1484da14cebeSEric Cheng {
1485da14cebeSEric Cheng struct ether_header *ehp;
1486ae6aa22aSVenugopal Iyer struct ether_vlan_header *evhp;
1487ae6aa22aSVenugopal Iyer uint32_t sap;
1488da14cebeSEric Cheng ipha_t *ipha;
1489ae6aa22aSVenugopal Iyer uint8_t *dstaddr;
1490ae6aa22aSVenugopal Iyer size_t hdrsize;
1491da14cebeSEric Cheng mblk_t *mp;
1492da14cebeSEric Cheng mblk_t *headmp[MAX_SR_TYPES];
1493da14cebeSEric Cheng mblk_t *tailmp[MAX_SR_TYPES];
1494da14cebeSEric Cheng int cnt[MAX_SR_TYPES];
1495da14cebeSEric Cheng size_t sz[MAX_SR_TYPES];
1496da14cebeSEric Cheng size_t sz1;
1497ae6aa22aSVenugopal Iyer boolean_t bw_ctl;
1498da14cebeSEric Cheng boolean_t hw_classified;
1499ae6aa22aSVenugopal Iyer boolean_t dls_bypass;
1500ae6aa22aSVenugopal Iyer boolean_t is_ether;
1501ae6aa22aSVenugopal Iyer boolean_t is_unicast;
1502da14cebeSEric Cheng enum pkt_type type;
1503da14cebeSEric Cheng mac_client_impl_t *mcip = mac_srs->srs_mcip;
1504da14cebeSEric Cheng
1505ae6aa22aSVenugopal Iyer is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1506ae6aa22aSVenugopal Iyer bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1507da14cebeSEric Cheng
1508da14cebeSEric Cheng /*
1509da14cebeSEric Cheng * If we don't have a Rx ring, S/W classification would have done
1510da14cebeSEric Cheng * its job and it's a packet meant for us. If we were polling on
1511da14cebeSEric Cheng * the default ring (i.e. there was a ring assigned to this SRS),
1512da14cebeSEric Cheng * then we need to make sure that the mac address really belongs
1513da14cebeSEric Cheng * to us.
1514da14cebeSEric Cheng */
1515da14cebeSEric Cheng hw_classified = mac_srs->srs_ring != NULL &&
1516da14cebeSEric Cheng mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1517da14cebeSEric Cheng
1518da14cebeSEric Cheng /*
151984de666eSRyan Zezeski * Some clients, such as non-Ethernet ones, need DLS processing in
152084de666eSRyan Zezeski * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
152184de666eSRyan Zezeski * DLS bypass may also be disabled via the
15228d4cf8d8S * MCIS_RX_BYPASS_DISABLE flag.
1523da14cebeSEric Cheng */
15248d4cf8d8S dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
15258d4cf8d8S ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1526da14cebeSEric Cheng
1527da14cebeSEric Cheng bzero(headmp, MAX_SR_TYPES * sizeof (mblk_t *));
1528da14cebeSEric Cheng bzero(tailmp, MAX_SR_TYPES * sizeof (mblk_t *));
1529da14cebeSEric Cheng bzero(cnt, MAX_SR_TYPES * sizeof (int));
1530da14cebeSEric Cheng bzero(sz, MAX_SR_TYPES * sizeof (size_t));
1531da14cebeSEric Cheng
1532da14cebeSEric Cheng /*
153384de666eSRyan Zezeski * We have a chain from SRS that we need to split across the
153484de666eSRyan Zezeski * soft rings. The squeues for the TCP and IPv4 SAPs use their
153584de666eSRyan Zezeski * own soft rings to allow polling from the squeue. The rest of
153684de666eSRyan Zezeski * the packets are delivered on the OTH soft ring which cannot
153784de666eSRyan Zezeski * be polled.
1538da14cebeSEric Cheng */
1539da14cebeSEric Cheng while (head != NULL) {
1540da14cebeSEric Cheng mp = head;
1541da14cebeSEric Cheng head = head->b_next;
1542da14cebeSEric Cheng mp->b_next = NULL;
1543da14cebeSEric Cheng
1544da14cebeSEric Cheng type = OTH;
1545ae6aa22aSVenugopal Iyer sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1546da14cebeSEric Cheng
1547ae6aa22aSVenugopal Iyer if (is_ether) {
1548da14cebeSEric Cheng /*
1549da14cebeSEric Cheng * At this point we can be sure the packet at least
1550da14cebeSEric Cheng * has an ether header.
1551da14cebeSEric Cheng */
1552da14cebeSEric Cheng if (sz1 < sizeof (struct ether_header)) {
1553da14cebeSEric Cheng mac_rx_drop_pkt(mac_srs, mp);
1554da14cebeSEric Cheng continue;
1555da14cebeSEric Cheng }
1556da14cebeSEric Cheng ehp = (struct ether_header *)mp->b_rptr;
1557da14cebeSEric Cheng
1558da14cebeSEric Cheng /*
1559da14cebeSEric Cheng * Determine if this is a VLAN or non-VLAN packet.
1560da14cebeSEric Cheng */
1561ae6aa22aSVenugopal Iyer if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1562da14cebeSEric Cheng evhp = (struct ether_vlan_header *)mp->b_rptr;
1563ae6aa22aSVenugopal Iyer sap = ntohs(evhp->ether_type);
1564ae6aa22aSVenugopal Iyer hdrsize = sizeof (struct ether_vlan_header);
156584de666eSRyan Zezeski
1566da14cebeSEric Cheng /*
156784de666eSRyan Zezeski * Check if the VID of the packet, if
156884de666eSRyan Zezeski * any, belongs to this client.
156984de666eSRyan Zezeski * Technically, if this packet came up
157084de666eSRyan Zezeski * via a HW classified ring then we
157184de666eSRyan Zezeski * don't need to perform this check.
157284de666eSRyan Zezeski * Perhaps a future optimization.
1573da14cebeSEric Cheng */
1574da14cebeSEric Cheng if (!mac_client_check_flow_vid(mcip,
1575da14cebeSEric Cheng VLAN_ID(ntohs(evhp->ether_tci)))) {
1576da14cebeSEric Cheng mac_rx_drop_pkt(mac_srs, mp);
1577da14cebeSEric Cheng continue;
1578da14cebeSEric Cheng }
1579da14cebeSEric Cheng } else {
1580ae6aa22aSVenugopal Iyer hdrsize = sizeof (struct ether_header);
1581ae6aa22aSVenugopal Iyer }
1582ae6aa22aSVenugopal Iyer is_unicast =
1583ae6aa22aSVenugopal Iyer ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
1584ae6aa22aSVenugopal Iyer dstaddr = (uint8_t *)&ehp->ether_dhost;
1585ae6aa22aSVenugopal Iyer } else {
1586ae6aa22aSVenugopal Iyer mac_header_info_t mhi;
1587ae6aa22aSVenugopal Iyer
1588ae6aa22aSVenugopal Iyer if (mac_header_info((mac_handle_t)mcip->mci_mip,
1589ae6aa22aSVenugopal Iyer mp, &mhi) != 0) {
1590ae6aa22aSVenugopal Iyer mac_rx_drop_pkt(mac_srs, mp);
1591ae6aa22aSVenugopal Iyer continue;
1592ae6aa22aSVenugopal Iyer }
1593ae6aa22aSVenugopal Iyer hdrsize = mhi.mhi_hdrsize;
1594ae6aa22aSVenugopal Iyer sap = mhi.mhi_bindsap;
1595ae6aa22aSVenugopal Iyer is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
1596ae6aa22aSVenugopal Iyer dstaddr = (uint8_t *)mhi.mhi_daddr;
1597da14cebeSEric Cheng }
1598da14cebeSEric Cheng
1599ae6aa22aSVenugopal Iyer if (!dls_bypass) {
1600ae6aa22aSVenugopal Iyer FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1601ae6aa22aSVenugopal Iyer cnt[type], bw_ctl, sz[type], sz1, mp);
1602ae6aa22aSVenugopal Iyer continue;
1603ae6aa22aSVenugopal Iyer }
1604ae6aa22aSVenugopal Iyer
1605ae6aa22aSVenugopal Iyer if (sap == ETHERTYPE_IP) {
1606da14cebeSEric Cheng /*
1607da14cebeSEric Cheng * If we are H/W classified, but we have promisc
1608da14cebeSEric Cheng * on, then we need to check for the unicast address.
1609da14cebeSEric Cheng */
1610da14cebeSEric Cheng if (hw_classified && mcip->mci_promisc_list != NULL) {
1611da14cebeSEric Cheng mac_address_t *map;
1612da14cebeSEric Cheng
1613da14cebeSEric Cheng rw_enter(&mcip->mci_rw_lock, RW_READER);
1614da14cebeSEric Cheng map = mcip->mci_unicast;
1615ae6aa22aSVenugopal Iyer if (bcmp(dstaddr, map->ma_addr,
1616da14cebeSEric Cheng map->ma_len) == 0)
1617da14cebeSEric Cheng type = UNDEF;
1618da14cebeSEric Cheng rw_exit(&mcip->mci_rw_lock);
1619ae6aa22aSVenugopal Iyer } else if (is_unicast) {
1620da14cebeSEric Cheng type = UNDEF;
1621da14cebeSEric Cheng }
1622da14cebeSEric Cheng }
1623da14cebeSEric Cheng
1624da14cebeSEric Cheng /*
1625da14cebeSEric Cheng * This needs to become a contract with the driver for
1626da14cebeSEric Cheng * the fast path.
1627da14cebeSEric Cheng *
1628da14cebeSEric Cheng * In the normal case the packet will have at least the L2
1629da14cebeSEric Cheng * header and the IP + Transport header in the same mblk.
1630da14cebeSEric Cheng * This is usually the case when the NIC driver sends up
1631da14cebeSEric Cheng * the packet. This is also true when the stack generates
1632da14cebeSEric Cheng * a packet that is looped back and when the stack uses the
1633da14cebeSEric Cheng * fastpath mechanism. The normal case is optimized for
1634da14cebeSEric Cheng * performance and may bypass DLS. All other cases go through
1635da14cebeSEric Cheng * the 'OTH' type path without DLS bypass.
1636da14cebeSEric Cheng */
1637ae6aa22aSVenugopal Iyer ipha = (ipha_t *)(mp->b_rptr + hdrsize);
1638da14cebeSEric Cheng if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha))
1639da14cebeSEric Cheng type = OTH;
1640da14cebeSEric Cheng
1641da14cebeSEric Cheng if (type == OTH) {
1642da14cebeSEric Cheng FANOUT_ENQUEUE_MP(headmp[type], tailmp[type],
1643da14cebeSEric Cheng cnt[type], bw_ctl, sz[type], sz1, mp);
1644da14cebeSEric Cheng continue;
1645da14cebeSEric Cheng }
1646da14cebeSEric Cheng
1647da14cebeSEric Cheng ASSERT(type == UNDEF);
164884de666eSRyan Zezeski
1649da14cebeSEric Cheng /*
165084de666eSRyan Zezeski * Determine the type from the IP protocol value. If
165184de666eSRyan Zezeski * classified as TCP or UDP, then update the read
165284de666eSRyan Zezeski * pointer to the beginning of the IP header.
165384de666eSRyan Zezeski * Otherwise leave the message as is for further
165484de666eSRyan Zezeski * processing by DLS.
1655da14cebeSEric Cheng */
1656da14cebeSEric Cheng switch (ipha->ipha_protocol) {
1657da14cebeSEric Cheng case IPPROTO_TCP:
1658da14cebeSEric Cheng type = V4_TCP;
1659ae6aa22aSVenugopal Iyer mp->b_rptr += hdrsize;
1660da14cebeSEric Cheng break;
1661da14cebeSEric Cheng case IPPROTO_UDP:
1662da14cebeSEric Cheng type = V4_UDP;
1663ae6aa22aSVenugopal Iyer mp->b_rptr += hdrsize;
1664da14cebeSEric Cheng break;
1665da14cebeSEric Cheng default:
1666da14cebeSEric Cheng type = OTH;
1667da14cebeSEric Cheng break;
1668da14cebeSEric Cheng }
1669da14cebeSEric Cheng
1670da14cebeSEric Cheng FANOUT_ENQUEUE_MP(headmp[type], tailmp[type], cnt[type],
1671da14cebeSEric Cheng bw_ctl, sz[type], sz1, mp);
1672da14cebeSEric Cheng }
1673da14cebeSEric Cheng
1674da14cebeSEric Cheng for (type = V4_TCP; type < UNDEF; type++) {
1675da14cebeSEric Cheng if (headmp[type] != NULL) {
1676ae6aa22aSVenugopal Iyer mac_soft_ring_t *softring;
1677ae6aa22aSVenugopal Iyer
1678da14cebeSEric Cheng ASSERT(tailmp[type]->b_next == NULL);
1679da14cebeSEric Cheng switch (type) {
1680da14cebeSEric Cheng case V4_TCP:
1681da14cebeSEric Cheng softring = mac_srs->srs_tcp_soft_rings[0];
1682da14cebeSEric Cheng break;
1683da14cebeSEric Cheng case V4_UDP:
1684da14cebeSEric Cheng softring = mac_srs->srs_udp_soft_rings[0];
1685da14cebeSEric Cheng break;
1686da14cebeSEric Cheng case OTH:
1687da14cebeSEric Cheng softring = mac_srs->srs_oth_soft_rings[0];
1688da14cebeSEric Cheng }
1689ae6aa22aSVenugopal Iyer mac_rx_soft_ring_process(mcip, softring,
1690da14cebeSEric Cheng headmp[type], tailmp[type], cnt[type], sz[type]);
1691da14cebeSEric Cheng }
1692da14cebeSEric Cheng }
1693da14cebeSEric Cheng }
1694da14cebeSEric Cheng
16953cc3202eSDan McDonald int fanout_unaligned = 0;
1696da14cebeSEric Cheng
1697da14cebeSEric Cheng /*
169884de666eSRyan Zezeski * The fanout routine for any clients with DLS bypass disabled or for
169984de666eSRyan Zezeski * traffic classified as "other". Returns -1 on an error (drop the
170084de666eSRyan Zezeski * packet due to a malformed packet), 0 on success, with values
170184de666eSRyan Zezeski * written in *indx and *type.
1702da14cebeSEric Cheng */
1703da14cebeSEric Cheng static int
1704da14cebeSEric Cheng mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
1705ae6aa22aSVenugopal Iyer uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
1706da14cebeSEric Cheng {
1707da14cebeSEric Cheng ip6_t *ip6h;
17083cc3202eSDan McDonald ipha_t *ipha;
1709da14cebeSEric Cheng uint8_t *whereptr;
1710da14cebeSEric Cheng uint_t hash;
1711da14cebeSEric Cheng uint16_t remlen;
1712da14cebeSEric Cheng uint8_t nexthdr;
1713da14cebeSEric Cheng uint16_t hdr_len;
1714dea290bfSSaso Kiselkov uint32_t src_val, dst_val;
1715da14cebeSEric Cheng boolean_t modifiable = B_TRUE;
17163cc3202eSDan McDonald boolean_t v6;
1717da14cebeSEric Cheng
1718ae6aa22aSVenugopal Iyer ASSERT(MBLKL(mp) >= hdrsize);
1719da14cebeSEric Cheng
17203cc3202eSDan McDonald if (sap == ETHERTYPE_IPV6) {
17213cc3202eSDan McDonald v6 = B_TRUE;
17223cc3202eSDan McDonald hdr_len = IPV6_HDR_LEN;
17233cc3202eSDan McDonald } else if (sap == ETHERTYPE_IP) {
17243cc3202eSDan McDonald v6 = B_FALSE;
17253cc3202eSDan McDonald hdr_len = IP_SIMPLE_HDR_LENGTH;
17263cc3202eSDan McDonald } else {
17273cc3202eSDan McDonald *indx = 0;
17283cc3202eSDan McDonald *type = OTH;
17293cc3202eSDan McDonald return (0);
17303cc3202eSDan McDonald }
17313cc3202eSDan McDonald
1732ae6aa22aSVenugopal Iyer ip6h = (ip6_t *)(mp->b_rptr + hdrsize);
17333cc3202eSDan McDonald ipha = (ipha_t *)ip6h;
17343cc3202eSDan McDonald
17353cc3202eSDan McDonald if ((uint8_t *)ip6h == mp->b_wptr) {
1736da14cebeSEric Cheng /*
1737ae6aa22aSVenugopal Iyer * The first mblk_t only includes the mac header.
1738da14cebeSEric Cheng * Note that it is safe to change the mp pointer here,
1739da14cebeSEric Cheng * as the subsequent operation does not assume mp
1740ae6aa22aSVenugopal Iyer * points to the start of the mac header.
1741da14cebeSEric Cheng */
1742da14cebeSEric Cheng mp = mp->b_cont;
1743da14cebeSEric Cheng
1744da14cebeSEric Cheng /*
17453cc3202eSDan McDonald * Make sure the next mblk holds an entire IP header.
1746da14cebeSEric Cheng */
1747da14cebeSEric Cheng if (mp == NULL)
1748da14cebeSEric Cheng return (-1);
1749da14cebeSEric Cheng
17503cc3202eSDan McDonald if (MBLKL(mp) < hdr_len) {
1751da14cebeSEric Cheng modifiable = (DB_REF(mp) == 1);
1752da14cebeSEric Cheng
17533cc3202eSDan McDonald if (modifiable && !pullupmsg(mp, hdr_len))
1754da14cebeSEric Cheng return (-1);
1755da14cebeSEric Cheng }
1756da14cebeSEric Cheng
1757da14cebeSEric Cheng ip6h = (ip6_t *)mp->b_rptr;
17583cc3202eSDan McDonald ipha = (ipha_t *)ip6h;
1759da14cebeSEric Cheng }
1760da14cebeSEric Cheng
1761da14cebeSEric Cheng if (!modifiable || !(OK_32PTR((char *)ip6h)) ||
17623cc3202eSDan McDonald ((uint8_t *)ip6h + hdr_len > mp->b_wptr)) {
1763da14cebeSEric Cheng /*
17643cc3202eSDan McDonald * If either the IP header is not aligned, or it does not hold
17653cc3202eSDan McDonald * the complete simple structure (a pullupmsg() is not an
17663cc3202eSDan McDonald * option since it would result in an unaligned IP header),
17673cc3202eSDan McDonald * fanout to the default ring.
17683cc3202eSDan McDonald *
17693cc3202eSDan McDonald * Note that this may cause packet reordering.
1770da14cebeSEric Cheng */
1771da14cebeSEric Cheng *indx = 0;
1772da14cebeSEric Cheng *type = OTH;
17733cc3202eSDan McDonald fanout_unaligned++;
1774da14cebeSEric Cheng return (0);
1775da14cebeSEric Cheng }
1776da14cebeSEric Cheng
17773cc3202eSDan McDonald /*
17783cc3202eSDan McDonald * Extract the next header, the full header length, and the source
17793cc3202eSDan McDonald * and destination values used for hashing, using v4/v6 specific fields.
17803cc3202eSDan McDonald */
17813cc3202eSDan McDonald if (v6) {
1782da14cebeSEric Cheng remlen = ntohs(ip6h->ip6_plen);
1783da14cebeSEric Cheng nexthdr = ip6h->ip6_nxt;
17843cc3202eSDan McDonald src_val = V4_PART_OF_V6(ip6h->ip6_src);
1785dea290bfSSaso Kiselkov dst_val = V4_PART_OF_V6(ip6h->ip6_dst);
1786da14cebeSEric Cheng /*
1787da14cebeSEric Cheng * Do src/dst based fanout when mac_ip_hdr_length_v6() fails
1788da14cebeSEric Cheng * because of malformed packets or because mblks need to be
1789da14cebeSEric Cheng * concatenated using pullupmsg().
1791e5d18d1cSDan McDonald *
1792e5d18d1cSDan McDonald * Perform a version check to prevent parsing weirdness...
1793da14cebeSEric Cheng */
1794e5d18d1cSDan McDonald if (IPH_HDR_VERSION(ip6h) != IPV6_VERSION ||
1795e5d18d1cSDan McDonald !mac_ip_hdr_length_v6(ip6h, mp->b_wptr, &hdr_len, &nexthdr,
1796dea290bfSSaso Kiselkov NULL)) {
1797dea290bfSSaso Kiselkov goto src_dst_based_fanout;
1798da14cebeSEric Cheng }
17993cc3202eSDan McDonald } else {
18003cc3202eSDan McDonald hdr_len = IPH_HDR_LENGTH(ipha);
18013cc3202eSDan McDonald remlen = ntohs(ipha->ipha_length) - hdr_len;
18023cc3202eSDan McDonald nexthdr = ipha->ipha_protocol;
18033cc3202eSDan McDonald src_val = (uint32_t)ipha->ipha_src;
1804dea290bfSSaso Kiselkov dst_val = (uint32_t)ipha->ipha_dst;
18053cc3202eSDan McDonald /*
18063cc3202eSDan McDonald * Catch IPv4 fragment case here. IPv6 has nexthdr == FRAG
18073cc3202eSDan McDonald * for its equivalent case.
18083cc3202eSDan McDonald */
1809dea290bfSSaso Kiselkov if ((ntohs(ipha->ipha_fragment_offset_and_flags) &
18103cc3202eSDan McDonald (IPH_MF | IPH_OFFSET)) != 0) {
1811dea290bfSSaso Kiselkov goto src_dst_based_fanout;
18123cc3202eSDan McDonald }
18133cc3202eSDan McDonald }
18143cc3202eSDan McDonald if (remlen < MIN_EHDR_LEN)
18153cc3202eSDan McDonald return (-1);
1816da14cebeSEric Cheng whereptr = (uint8_t *)ip6h + hdr_len;
1817da14cebeSEric Cheng
18183cc3202eSDan McDonald /* For the transports below, we do port/SPI based fanout */
1819da14cebeSEric Cheng switch (nexthdr) {
1820da14cebeSEric Cheng case IPPROTO_TCP:
1821da14cebeSEric Cheng case IPPROTO_UDP:
1822da14cebeSEric Cheng case IPPROTO_SCTP:
1823da14cebeSEric Cheng case IPPROTO_ESP:
1824da14cebeSEric Cheng /*
18253cc3202eSDan McDonald * If the ports or SPI in the transport header are not part of
1826da14cebeSEric Cheng * the mblk, do src/dst based fanout instead of calling
1827da14cebeSEric Cheng * pullupmsg().
1828da14cebeSEric Cheng */
18293cc3202eSDan McDonald if (mp->b_cont == NULL || whereptr + PORTS_SIZE <= mp->b_wptr)
18303cc3202eSDan McDonald break; /* out of switch... */
18313cc3202eSDan McDonald /* FALLTHRU */
1832da14cebeSEric Cheng default:
1833dea290bfSSaso Kiselkov goto src_dst_based_fanout;
1834da14cebeSEric Cheng }
1835da14cebeSEric Cheng
1836da14cebeSEric Cheng switch (nexthdr) {
1837da14cebeSEric Cheng case IPPROTO_TCP:
1838dea290bfSSaso Kiselkov hash = HASH_ADDR(src_val, dst_val, *(uint32_t *)whereptr);
18393cc3202eSDan McDonald *indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
1840da14cebeSEric Cheng *type = OTH;
1841da14cebeSEric Cheng break;
1842da14cebeSEric Cheng case IPPROTO_UDP:
1843da14cebeSEric Cheng case IPPROTO_SCTP:
1844da14cebeSEric Cheng case IPPROTO_ESP:
1845da14cebeSEric Cheng if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
1846dea290bfSSaso Kiselkov hash = HASH_ADDR(src_val, dst_val,
1847dea290bfSSaso Kiselkov *(uint32_t *)whereptr);
1848da14cebeSEric Cheng *indx = COMPUTE_INDEX(hash,
1849da14cebeSEric Cheng mac_srs->srs_udp_ring_count);
1850da14cebeSEric Cheng } else {
18513cc3202eSDan McDonald *indx = mac_srs->srs_ind % mac_srs->srs_udp_ring_count;
1852da14cebeSEric Cheng mac_srs->srs_ind++;
1853da14cebeSEric Cheng }
1854da14cebeSEric Cheng *type = OTH;
1855da14cebeSEric Cheng break;
1856da14cebeSEric Cheng }
1857da14cebeSEric Cheng return (0);
1858da14cebeSEric Cheng
1859dea290bfSSaso Kiselkov src_dst_based_fanout:
1860dea290bfSSaso Kiselkov hash = HASH_ADDR(src_val, dst_val, (uint32_t)0);
1861da14cebeSEric Cheng *indx = COMPUTE_INDEX(hash, mac_srs->srs_oth_ring_count);
1862da14cebeSEric Cheng *type = OTH;
1863da14cebeSEric Cheng return (0);
1864da14cebeSEric Cheng }
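
/*
 * Illustrative sketch of the hash-based fanout above (the real
 * HASH_ADDR()/COMPUTE_INDEX() macros may fold their inputs differently;
 * this only shows the shape of the computation):
 *
 *	uint_t
 *	example_fanout_index(uint32_t src, uint32_t dst, uint32_t ports,
 *	    uint_t ring_cnt)
 *	{
 *		uint32_t hash = src ^ dst ^ ports;	(mix the flow tuple)
 *		return (hash % ring_cnt);		(pick a soft ring)
 *	}
 *
 * Since every packet of a given flow carries the same addresses and
 * ports/SPI, all of its packets map to the same soft ring, preserving
 * per-flow ordering (except for the unaligned-header fallback noted
 * earlier).
 */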
1865da14cebeSEric Cheng
1866da14cebeSEric Cheng /*
186784de666eSRyan Zezeski * This routine delivers packets destined for an SRS into a soft ring member
1868da14cebeSEric Cheng * of the set.
1869da14cebeSEric Cheng *
187084de666eSRyan Zezeski * Given a chain of packets, we need to split it up into multiple
187184de666eSRyan Zezeski * sub-chains: one each for the TCP, UDP and OTH soft rings. Instead
187284de666eSRyan Zezeski * of entering a soft ring one packet at a time, we want to enter it
187384de666eSRyan Zezeski * in the form of a chain; otherwise we get a start/stop behaviour
187484de666eSRyan Zezeski * where the worker thread goes to sleep and then the next packet
187584de666eSRyan Zezeski * comes in, forcing it to wake up.
1876da14cebeSEric Cheng *
1877da14cebeSEric Cheng * Note:
1878da14cebeSEric Cheng * Since we know what is the maximum fanout possible, we create a 2D array
1879da14cebeSEric Cheng * of 'softring types * MAX_SR_FANOUT' for the head, tail, cnt and sz
1880da14cebeSEric Cheng * variables so that we can enter the softrings with a chain. We need the
1881da14cebeSEric Cheng * MAX_SR_FANOUT so we can allocate the arrays on the stack (a kmem_alloc
1882da14cebeSEric Cheng * for each packet would be expensive). If we ever want to have the
1883da14cebeSEric Cheng * ability to have unlimited fanout, we should probably declare a head,
1884da14cebeSEric Cheng * tail, cnt, sz with each soft ring (a data struct which contains a softring
1885da14cebeSEric Cheng * along with these members) and create an array of this uber struct so we
1886da14cebeSEric Cheng * don't have to do kmem_alloc.
1887da14cebeSEric Cheng */
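
/*
 * A minimal sketch of the chaining pattern described above (pseudo-code;
 * the real code uses the FANOUT_ENQUEUE_MP() macro and the per-type
 * soft ring arrays):
 *
 *	for each mp in the incoming chain:
 *		classify mp into (type, indx)
 *		append mp to headmp[type][indx] via tailmp[type][indx]
 *		cnt[type][indx]++, sz[type][indx] += msgdsize(mp)
 *
 *	for each (type, indx) with headmp[type][indx] != NULL:
 *		mac_rx_soft_ring_process(mcip, softring, headmp[type][indx],
 *		    tailmp[type][indx], cnt[type][indx], sz[type][indx])
 *
 * Each soft ring is thus entered at most once per drain with a whole
 * sub-chain, instead of once per packet.
 */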
1888da14cebeSEric Cheng int fanout_oth1 = 0;
1889da14cebeSEric Cheng int fanout_oth2 = 0;
1890da14cebeSEric Cheng int fanout_oth3 = 0;
1891da14cebeSEric Cheng int fanout_oth4 = 0;
1892da14cebeSEric Cheng int fanout_oth5 = 0;
1893da14cebeSEric Cheng
1894da14cebeSEric Cheng static void
1895da14cebeSEric Cheng mac_rx_srs_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
1896da14cebeSEric Cheng {
1897da14cebeSEric Cheng struct ether_header *ehp;
1898ae6aa22aSVenugopal Iyer struct ether_vlan_header *evhp;
1899ae6aa22aSVenugopal Iyer uint32_t sap;
1900da14cebeSEric Cheng ipha_t *ipha;
1901ae6aa22aSVenugopal Iyer uint8_t *dstaddr;
1902da14cebeSEric Cheng uint_t indx;
1903ae6aa22aSVenugopal Iyer size_t ports_offset;
1904ae6aa22aSVenugopal Iyer size_t ipha_len;
1905ae6aa22aSVenugopal Iyer size_t hdrsize;
1906da14cebeSEric Cheng uint_t hash;
1907da14cebeSEric Cheng mblk_t *mp;
1908da14cebeSEric Cheng mblk_t *headmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1909da14cebeSEric Cheng mblk_t *tailmp[MAX_SR_TYPES][MAX_SR_FANOUT];
1910da14cebeSEric Cheng int cnt[MAX_SR_TYPES][MAX_SR_FANOUT];
1911da14cebeSEric Cheng size_t sz[MAX_SR_TYPES][MAX_SR_FANOUT];
1912da14cebeSEric Cheng size_t sz1;
1913ae6aa22aSVenugopal Iyer boolean_t bw_ctl;
1914da14cebeSEric Cheng boolean_t hw_classified;
1915ae6aa22aSVenugopal Iyer boolean_t dls_bypass;
1916ae6aa22aSVenugopal Iyer boolean_t is_ether;
1917ae6aa22aSVenugopal Iyer boolean_t is_unicast;
1918da14cebeSEric Cheng int fanout_cnt;
1919da14cebeSEric Cheng enum pkt_type type;
1920da14cebeSEric Cheng mac_client_impl_t *mcip = mac_srs->srs_mcip;
1921da14cebeSEric Cheng
1922ae6aa22aSVenugopal Iyer is_ether = (mcip->mci_mip->mi_info.mi_nativemedia == DL_ETHER);
1923ae6aa22aSVenugopal Iyer bw_ctl = ((mac_srs->srs_type & SRST_BW_CONTROL) != 0);
1924da14cebeSEric Cheng
1925da14cebeSEric Cheng /*
1926da14cebeSEric Cheng * If we don't have an Rx ring, S/W classification would have done
1927da14cebeSEric Cheng * its job and it's a packet meant for us. If we were polling on
1928da14cebeSEric Cheng * the default ring (i.e. there was a ring assigned to this SRS),
1929da14cebeSEric Cheng * then we need to make sure that the mac address really belongs
1930da14cebeSEric Cheng * to us.
1931da14cebeSEric Cheng */
1932da14cebeSEric Cheng hw_classified = mac_srs->srs_ring != NULL &&
1933da14cebeSEric Cheng mac_srs->srs_ring->mr_classify_type == MAC_HW_CLASSIFIER;
1934da14cebeSEric Cheng
1935da14cebeSEric Cheng /*
193684de666eSRyan Zezeski * Some clients, such as non Ethernet, need DLS processing in
193784de666eSRyan Zezeski * the Rx path. Such clients clear the SRST_DLS_BYPASS flag.
193884de666eSRyan Zezeski * DLS bypass may also be disabled via the
193984de666eSRyan Zezeski * MCIS_RX_BYPASS_DISABLE flag, but this is only consumed by
194084de666eSRyan Zezeski * sun4v vsw currently.
1941da14cebeSEric Cheng */
19428d4cf8d8S dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
19438d4cf8d8S ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
1944da14cebeSEric Cheng
1945da14cebeSEric Cheng /*
1946da14cebeSEric Cheng * Since the softrings are never destroyed and we always
1947da14cebeSEric Cheng * create an equal number of softrings for TCP, UDP and the rest,
1948da14cebeSEric Cheng * it's OK to check one of them for the count and use it without
1949da14cebeSEric Cheng * any lock. In future, if soft rings get destroyed because
1950da14cebeSEric Cheng * of reduction in fanout, we will need to ensure that happens
1951da14cebeSEric Cheng * behind the SRS_PROC.
1952da14cebeSEric Cheng */
1953da14cebeSEric Cheng fanout_cnt = mac_srs->srs_tcp_ring_count;
1954da14cebeSEric Cheng
1955da14cebeSEric Cheng bzero(headmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1956da14cebeSEric Cheng bzero(tailmp, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (mblk_t *));
1957da14cebeSEric Cheng bzero(cnt, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (int));
1958da14cebeSEric Cheng bzero(sz, MAX_SR_TYPES * MAX_SR_FANOUT * sizeof (size_t));
1959da14cebeSEric Cheng
1960da14cebeSEric Cheng /*
1961da14cebeSEric Cheng * We got a chain from SRS that we need to send to the soft rings.
196284de666eSRyan Zezeski * Since squeues for TCP & IPv4 SAP poll their soft rings (for
1963da14cebeSEric Cheng * performance reasons), we need to separate out v4_tcp and v4_udp;
1964da14cebeSEric Cheng * the rest goes into other.
1965da14cebeSEric Cheng */
1966da14cebeSEric Cheng while (head != NULL) {
1967da14cebeSEric Cheng mp = head;
1968da14cebeSEric Cheng head = head->b_next;
1969da14cebeSEric Cheng mp->b_next = NULL;
1970da14cebeSEric Cheng
1971da14cebeSEric Cheng type = OTH;
1972ae6aa22aSVenugopal Iyer sz1 = (mp->b_cont == NULL) ? MBLKL(mp) : msgdsize(mp);
1973da14cebeSEric Cheng
1974ae6aa22aSVenugopal Iyer if (is_ether) {
1975ae6aa22aSVenugopal Iyer /*
1976ae6aa22aSVenugopal Iyer * At this point we can be sure the packet at least
1977ae6aa22aSVenugopal Iyer * has an ether header.
1978ae6aa22aSVenugopal Iyer */
1979ae6aa22aSVenugopal Iyer if (sz1 < sizeof (struct ether_header)) {
1980ae6aa22aSVenugopal Iyer mac_rx_drop_pkt(mac_srs, mp);
1981ae6aa22aSVenugopal Iyer continue;
1982ae6aa22aSVenugopal Iyer }
1983da14cebeSEric Cheng ehp = (struct ether_header *)mp->b_rptr;
1984ae6aa22aSVenugopal Iyer
1985da14cebeSEric Cheng /*
1986ae6aa22aSVenugopal Iyer * Determine if this is a VLAN or non-VLAN packet.
1987da14cebeSEric Cheng */
1988ae6aa22aSVenugopal Iyer if ((sap = ntohs(ehp->ether_type)) == VLAN_TPID) {
1989ae6aa22aSVenugopal Iyer evhp = (struct ether_vlan_header *)mp->b_rptr;
1990ae6aa22aSVenugopal Iyer sap = ntohs(evhp->ether_type);
1991ae6aa22aSVenugopal Iyer hdrsize = sizeof (struct ether_vlan_header);
199284de666eSRyan Zezeski
1993da14cebeSEric Cheng /*
199484de666eSRyan Zezeski * Check if the VID of the packet, if
199584de666eSRyan Zezeski * any, belongs to this client.
199684de666eSRyan Zezeski * Technically, if this packet came up
199784de666eSRyan Zezeski * via a HW classified ring then we
199884de666eSRyan Zezeski * don't need to perform this check.
199984de666eSRyan Zezeski * Perhaps a future optimization.
2000da14cebeSEric Cheng */
2001da14cebeSEric Cheng if (!mac_client_check_flow_vid(mcip,
2002da14cebeSEric Cheng VLAN_ID(ntohs(evhp->ether_tci)))) {
2003da14cebeSEric Cheng mac_rx_drop_pkt(mac_srs, mp);
2004da14cebeSEric Cheng continue;
2005da14cebeSEric Cheng }
2006ae6aa22aSVenugopal Iyer } else {
2007ae6aa22aSVenugopal Iyer hdrsize = sizeof (struct ether_header);
2008da14cebeSEric Cheng }
2009ae6aa22aSVenugopal Iyer is_unicast =
2010ae6aa22aSVenugopal Iyer ((((uint8_t *)&ehp->ether_dhost)[0] & 0x01) == 0);
2011ae6aa22aSVenugopal Iyer dstaddr = (uint8_t *)&ehp->ether_dhost;
2012ae6aa22aSVenugopal Iyer } else {
2013ae6aa22aSVenugopal Iyer mac_header_info_t mhi;
2014ae6aa22aSVenugopal Iyer
2015ae6aa22aSVenugopal Iyer if (mac_header_info((mac_handle_t)mcip->mci_mip,
2016ae6aa22aSVenugopal Iyer mp, &mhi) != 0) {
2017da14cebeSEric Cheng mac_rx_drop_pkt(mac_srs, mp);
2018da14cebeSEric Cheng continue;
2019da14cebeSEric Cheng }
2020ae6aa22aSVenugopal Iyer hdrsize = mhi.mhi_hdrsize;
2021ae6aa22aSVenugopal Iyer sap = mhi.mhi_bindsap;
2022ae6aa22aSVenugopal Iyer is_unicast = (mhi.mhi_dsttype == MAC_ADDRTYPE_UNICAST);
2023ae6aa22aSVenugopal Iyer dstaddr = (uint8_t *)mhi.mhi_daddr;
2024ae6aa22aSVenugopal Iyer }
2025ae6aa22aSVenugopal Iyer
2026ae6aa22aSVenugopal Iyer if (!dls_bypass) {
2027ae6aa22aSVenugopal Iyer if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2028ae6aa22aSVenugopal Iyer hdrsize, &type, &indx) == -1) {
2029ae6aa22aSVenugopal Iyer mac_rx_drop_pkt(mac_srs, mp);
2030ae6aa22aSVenugopal Iyer continue;
2031da14cebeSEric Cheng }
2032da14cebeSEric Cheng
2033da14cebeSEric Cheng FANOUT_ENQUEUE_MP(headmp[type][indx],
2034da14cebeSEric Cheng tailmp[type][indx], cnt[type][indx], bw_ctl,
2035da14cebeSEric Cheng sz[type][indx], sz1, mp);
2036da14cebeSEric Cheng continue;
2037da14cebeSEric Cheng }
2038da14cebeSEric Cheng
2039da14cebeSEric Cheng /*
2040da14cebeSEric Cheng * If we are using the default Rx ring where H/W or S/W
2041da14cebeSEric Cheng * classification has not happened, we need to verify if
2042da14cebeSEric Cheng * this unicast packet really belongs to us.
2043da14cebeSEric Cheng */
2044ae6aa22aSVenugopal Iyer if (sap == ETHERTYPE_IP) {
2045da14cebeSEric Cheng /*
2046da14cebeSEric Cheng * If we are H/W classified, but we have promisc
2047da14cebeSEric Cheng * on, then we need to check for the unicast address.
2048da14cebeSEric Cheng */
2049da14cebeSEric Cheng if (hw_classified && mcip->mci_promisc_list != NULL) {
2050da14cebeSEric Cheng mac_address_t *map;
2051da14cebeSEric Cheng
2052da14cebeSEric Cheng rw_enter(&mcip->mci_rw_lock, RW_READER);
2053da14cebeSEric Cheng map = mcip->mci_unicast;
2054ae6aa22aSVenugopal Iyer if (bcmp(dstaddr, map->ma_addr,
2055da14cebeSEric Cheng map->ma_len) == 0)
2056da14cebeSEric Cheng type = UNDEF;
2057da14cebeSEric Cheng rw_exit(&mcip->mci_rw_lock);
2058ae6aa22aSVenugopal Iyer } else if (is_unicast) {
2059da14cebeSEric Cheng type = UNDEF;
2060da14cebeSEric Cheng }
2061da14cebeSEric Cheng }
2062da14cebeSEric Cheng
2063da14cebeSEric Cheng /*
2064da14cebeSEric Cheng * This needs to become a contract with the driver for
2065da14cebeSEric Cheng * the fast path.
2066da14cebeSEric Cheng */
2067da14cebeSEric Cheng
2068ae6aa22aSVenugopal Iyer ipha = (ipha_t *)(mp->b_rptr + hdrsize);
2069da14cebeSEric Cheng if ((type != OTH) && MBLK_RX_FANOUT_SLOWPATH(mp, ipha)) {
2070da14cebeSEric Cheng type = OTH;
2071da14cebeSEric Cheng fanout_oth1++;
2072da14cebeSEric Cheng }
2073da14cebeSEric Cheng
2074da14cebeSEric Cheng if (type != OTH) {
2075ae6aa22aSVenugopal Iyer uint16_t frag_offset_flags;
2076ae6aa22aSVenugopal Iyer
2077da14cebeSEric Cheng switch (ipha->ipha_protocol) {
2078da14cebeSEric Cheng case IPPROTO_TCP:
2079da14cebeSEric Cheng case IPPROTO_UDP:
2080da14cebeSEric Cheng case IPPROTO_SCTP:
2081da14cebeSEric Cheng case IPPROTO_ESP:
2082da14cebeSEric Cheng ipha_len = IPH_HDR_LENGTH(ipha);
2083da14cebeSEric Cheng if ((uchar_t *)ipha + ipha_len + PORTS_SIZE >
2084da14cebeSEric Cheng mp->b_wptr) {
2085da14cebeSEric Cheng type = OTH;
2086da14cebeSEric Cheng break;
2087da14cebeSEric Cheng }
2088da14cebeSEric Cheng frag_offset_flags =
2089da14cebeSEric Cheng ntohs(ipha->ipha_fragment_offset_and_flags);
2090da14cebeSEric Cheng if ((frag_offset_flags &
2091da14cebeSEric Cheng (IPH_MF | IPH_OFFSET)) != 0) {
2092da14cebeSEric Cheng type = OTH;
2093da14cebeSEric Cheng fanout_oth3++;
2094da14cebeSEric Cheng break;
2095da14cebeSEric Cheng }
2096ae6aa22aSVenugopal Iyer ports_offset = hdrsize + ipha_len;
2097da14cebeSEric Cheng break;
2098da14cebeSEric Cheng default:
2099da14cebeSEric Cheng type = OTH;
2100da14cebeSEric Cheng fanout_oth4++;
2101da14cebeSEric Cheng break;
2102da14cebeSEric Cheng }
2103da14cebeSEric Cheng }
2104da14cebeSEric Cheng
2105da14cebeSEric Cheng if (type == OTH) {
2106ae6aa22aSVenugopal Iyer if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
2107ae6aa22aSVenugopal Iyer hdrsize, &type, &indx) == -1) {
2108da14cebeSEric Cheng mac_rx_drop_pkt(mac_srs, mp);
2109da14cebeSEric Cheng continue;
2110da14cebeSEric Cheng }
2111da14cebeSEric Cheng
2112da14cebeSEric Cheng FANOUT_ENQUEUE_MP(headmp[type][indx],
2113da14cebeSEric Cheng tailmp[type][indx], cnt[type][indx], bw_ctl,
2114da14cebeSEric Cheng sz[type][indx], sz1, mp);
2115da14cebeSEric Cheng continue;
2116da14cebeSEric Cheng }
2117da14cebeSEric Cheng
2118da14cebeSEric Cheng ASSERT(type == UNDEF);
2119da14cebeSEric Cheng
2120da14cebeSEric Cheng /*
2121da14cebeSEric Cheng * XXX-Sunay: We should hold srs_lock since ring_count
2122da14cebeSEric Cheng * below can change. But if we are always called from
2123da14cebeSEric Cheng * mac_rx_srs_drain and SRS_PROC is set, then we can
2124da14cebeSEric Cheng * enforce that ring_count can't be changed i.e.
2125da14cebeSEric Cheng * to change fanout type or ring count, the calling
2126da14cebeSEric Cheng * thread needs to be behind SRS_PROC.
2127da14cebeSEric Cheng */
2128da14cebeSEric Cheng switch (ipha->ipha_protocol) {
2129da14cebeSEric Cheng case IPPROTO_TCP:
2130da14cebeSEric Cheng /*
2131da14cebeSEric Cheng * Note that for ESP, we fanout on SPI and it is at the
2132da14cebeSEric Cheng * same offset as the 2x16-bit ports. So it is clumped
2133da14cebeSEric Cheng * along with TCP, UDP and SCTP.
2134da14cebeSEric Cheng */
2135dea290bfSSaso Kiselkov hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2136da14cebeSEric Cheng *(uint32_t *)(mp->b_rptr + ports_offset));
2137da14cebeSEric Cheng indx = COMPUTE_INDEX(hash, mac_srs->srs_tcp_ring_count);
2138da14cebeSEric Cheng type = V4_TCP;
2139ae6aa22aSVenugopal Iyer mp->b_rptr += hdrsize;
2140da14cebeSEric Cheng break;
2141da14cebeSEric Cheng case IPPROTO_UDP:
2142da14cebeSEric Cheng case IPPROTO_SCTP:
2143da14cebeSEric Cheng case IPPROTO_ESP:
2144da14cebeSEric Cheng if (mac_fanout_type == MAC_FANOUT_DEFAULT) {
2145dea290bfSSaso Kiselkov hash = HASH_ADDR(ipha->ipha_src, ipha->ipha_dst,
2146da14cebeSEric Cheng *(uint32_t *)(mp->b_rptr + ports_offset));
2147da14cebeSEric Cheng indx = COMPUTE_INDEX(hash,
2148da14cebeSEric Cheng mac_srs->srs_udp_ring_count);
2149da14cebeSEric Cheng } else {
2150da14cebeSEric Cheng indx = mac_srs->srs_ind %
2151da14cebeSEric Cheng mac_srs->srs_udp_ring_count;
2152da14cebeSEric Cheng mac_srs->srs_ind++;
2153da14cebeSEric Cheng }
2154da14cebeSEric Cheng type = V4_UDP;
2155ae6aa22aSVenugopal Iyer mp->b_rptr += hdrsize;
2156da14cebeSEric Cheng break;
2157ae6aa22aSVenugopal Iyer default:
2158ae6aa22aSVenugopal Iyer indx = 0;
2159ae6aa22aSVenugopal Iyer type = OTH;
2160da14cebeSEric Cheng }
2161da14cebeSEric Cheng
2162da14cebeSEric Cheng FANOUT_ENQUEUE_MP(headmp[type][indx], tailmp[type][indx],
2163da14cebeSEric Cheng cnt[type][indx], bw_ctl, sz[type][indx], sz1, mp);
2164da14cebeSEric Cheng }
2165da14cebeSEric Cheng
2166da14cebeSEric Cheng for (type = V4_TCP; type < UNDEF; type++) {
2167ae6aa22aSVenugopal Iyer int i;
2168ae6aa22aSVenugopal Iyer
2169da14cebeSEric Cheng for (i = 0; i < fanout_cnt; i++) {
2170da14cebeSEric Cheng if (headmp[type][i] != NULL) {
2171ae6aa22aSVenugopal Iyer mac_soft_ring_t *softring;
2172ae6aa22aSVenugopal Iyer
2173da14cebeSEric Cheng ASSERT(tailmp[type][i]->b_next == NULL);
2174da14cebeSEric Cheng switch (type) {
2175da14cebeSEric Cheng case V4_TCP:
2176da14cebeSEric Cheng softring =
2177da14cebeSEric Cheng mac_srs->srs_tcp_soft_rings[i];
2178da14cebeSEric Cheng break;
2179da14cebeSEric Cheng case V4_UDP:
2180da14cebeSEric Cheng softring =
2181da14cebeSEric Cheng mac_srs->srs_udp_soft_rings[i];
2182da14cebeSEric Cheng break;
2183da14cebeSEric Cheng case OTH:
2184da14cebeSEric Cheng softring =
2185da14cebeSEric Cheng mac_srs->srs_oth_soft_rings[i];
2186da14cebeSEric Cheng break;
2187da14cebeSEric Cheng }
2188ae6aa22aSVenugopal Iyer mac_rx_soft_ring_process(mcip,
2189da14cebeSEric Cheng softring, headmp[type][i], tailmp[type][i],
2190da14cebeSEric Cheng cnt[type][i], sz[type][i]);
2191da14cebeSEric Cheng }
2192da14cebeSEric Cheng }
2193da14cebeSEric Cheng }
2194da14cebeSEric Cheng }
2195da14cebeSEric Cheng
2196da14cebeSEric Cheng #define SRS_BYTES_TO_PICKUP 150000
2197da14cebeSEric Cheng ssize_t max_bytes_to_pickup = SRS_BYTES_TO_PICKUP;
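
/*
 * A sketch of how the poll budget is chosen (a simplification; the
 * authoritative logic is in mac_rx_srs_poll_ring() below):
 *
 *	if (mac_srs->srs_type & SRST_BW_CONTROL)
 *		bytes_to_pickup = mac_bw_drop_threshold - mac_bw_sz;
 *		(clamped at 0 as a safety check)
 *	else
 *		bytes_to_pickup = max_bytes_to_pickup;	(150000 by default)
 *
 * The budget is expressed in bytes rather than packets, so the amount a
 * driver actually returns depends on its own accounting.
 */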
2198da14cebeSEric Cheng
2199da14cebeSEric Cheng /*
2200da14cebeSEric Cheng * mac_rx_srs_poll_ring
2201da14cebeSEric Cheng *
2202da14cebeSEric Cheng * This SRS Poll thread uses this routine to poll the underlying hardware
2203da14cebeSEric Cheng * Rx ring to get a chain of packets. It can inline process that chain
2204da14cebeSEric Cheng * if mac_latency_optimize is set (default) or signal the SRS worker thread
2205da14cebeSEric Cheng * to do the remaining processing.
2206da14cebeSEric Cheng *
2207da14cebeSEric Cheng * Since packets come into the system via the interrupt or poll path, we
2208da14cebeSEric Cheng * also update the stats and deal with promiscuous clients here.
2209da14cebeSEric Cheng */
2210da14cebeSEric Cheng void
2211da14cebeSEric Cheng mac_rx_srs_poll_ring(mac_soft_ring_set_t *mac_srs)
2212da14cebeSEric Cheng {
2213da14cebeSEric Cheng kmutex_t *lock = &mac_srs->srs_lock;
2214da14cebeSEric Cheng kcondvar_t *async = &mac_srs->srs_cv;
2215da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2216da14cebeSEric Cheng mblk_t *head, *tail, *mp;
2217da14cebeSEric Cheng callb_cpr_t cprinfo;
2218da14cebeSEric Cheng ssize_t bytes_to_pickup;
2219da14cebeSEric Cheng size_t sz;
2220da14cebeSEric Cheng int count;
2221da14cebeSEric Cheng mac_client_impl_t *smcip;
2222da14cebeSEric Cheng
2223da14cebeSEric Cheng CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "mac_srs_poll");
2224da14cebeSEric Cheng mutex_enter(lock);
2225da14cebeSEric Cheng
2226da14cebeSEric Cheng start:
2227da14cebeSEric Cheng for (;;) {
2228da14cebeSEric Cheng if (mac_srs->srs_state & SRS_PAUSE)
2229da14cebeSEric Cheng goto done;
2230da14cebeSEric Cheng
2231da14cebeSEric Cheng CALLB_CPR_SAFE_BEGIN(&cprinfo);
2232da14cebeSEric Cheng cv_wait(async, lock);
2233da14cebeSEric Cheng CALLB_CPR_SAFE_END(&cprinfo, lock);
2234da14cebeSEric Cheng
2235da14cebeSEric Cheng if (mac_srs->srs_state & SRS_PAUSE)
2236da14cebeSEric Cheng goto done;
2237da14cebeSEric Cheng
2238da14cebeSEric Cheng check_again:
2239da14cebeSEric Cheng if (mac_srs->srs_type & SRST_BW_CONTROL) {
2240da14cebeSEric Cheng /*
2241da14cebeSEric Cheng * We pick as many bytes as we are allowed to queue.
2242da14cebeSEric Cheng * It's possible that we will exceed the total
2243da14cebeSEric Cheng * packets queued in case this SRS is part of the
2244da14cebeSEric Cheng * Rx ring group, since more than one poll thread can be
2245da14cebeSEric Cheng * pulling up to the max allowed packets at the same time,
2246da14cebeSEric Cheng * but that should be OK.
2247da14cebeSEric Cheng */
2248da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2249da14cebeSEric Cheng bytes_to_pickup =
2250da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_drop_threshold -
2251da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_sz;
2252da14cebeSEric Cheng /*
2253da14cebeSEric Cheng * We shouldn't have been signalled if we
2254da14cebeSEric Cheng * have zero or fewer bytes to pick, but since
2255da14cebeSEric Cheng * some of the byte accounting is driver
2256da14cebeSEric Cheng * dependent, we do the safety check.
2257da14cebeSEric Cheng */
2258da14cebeSEric Cheng if (bytes_to_pickup < 0)
2259da14cebeSEric Cheng bytes_to_pickup = 0;
2260da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2261da14cebeSEric Cheng } else {
2262da14cebeSEric Cheng /*
2263da14cebeSEric Cheng * TODO: Need to change the polling API
2264da14cebeSEric Cheng * to add a packet count and a flag which
2265da14cebeSEric Cheng * tells the driver whether we want packets
2266da14cebeSEric Cheng * based on a count, or bytes, or all the
2267da14cebeSEric Cheng * packets queued in the driver/HW. This
2268da14cebeSEric Cheng * way, we never have to check the limits
2269da14cebeSEric Cheng * on poll path. We truly let only as many
2270da14cebeSEric Cheng * packets enter the system as we are willing
2271da14cebeSEric Cheng * to process or queue.
2272da14cebeSEric Cheng *
2273da14cebeSEric Cheng * Something along the lines of
2274da14cebeSEric Cheng * pkts_to_pickup = mac_soft_ring_max_q_cnt -
2275da14cebeSEric Cheng * mac_srs->srs_poll_pkt_cnt
2276da14cebeSEric Cheng */
2277da14cebeSEric Cheng
2278da14cebeSEric Cheng /*
2279da14cebeSEric Cheng * Since we are not doing B/W control, pick
2280da14cebeSEric Cheng * as many packets as allowed.
2281da14cebeSEric Cheng */
2282da14cebeSEric Cheng bytes_to_pickup = max_bytes_to_pickup;
2283da14cebeSEric Cheng }
2284da14cebeSEric Cheng
2285da14cebeSEric Cheng /* Poll the underlying Hardware */
2286da14cebeSEric Cheng mutex_exit(lock);
2287da14cebeSEric Cheng head = MAC_HWRING_POLL(mac_srs->srs_ring, (int)bytes_to_pickup);
2288da14cebeSEric Cheng mutex_enter(lock);
2289da14cebeSEric Cheng
2290da14cebeSEric Cheng ASSERT((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
2291da14cebeSEric Cheng SRS_POLL_THR_OWNER);
2292da14cebeSEric Cheng
2293da14cebeSEric Cheng mp = tail = head;
2294da14cebeSEric Cheng count = 0;
2295da14cebeSEric Cheng sz = 0;
2296da14cebeSEric Cheng while (mp != NULL) {
2297da14cebeSEric Cheng tail = mp;
2298da14cebeSEric Cheng sz += msgdsize(mp);
2299da14cebeSEric Cheng mp = mp->b_next;
2300da14cebeSEric Cheng count++;
2301da14cebeSEric Cheng }
2302da14cebeSEric Cheng
2303da14cebeSEric Cheng if (head != NULL) {
2304da14cebeSEric Cheng tail->b_next = NULL;
2305da14cebeSEric Cheng smcip = mac_srs->srs_mcip;
2306da14cebeSEric Cheng
23070dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, pollbytes, sz);
23080dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, pollcnt, count);
2309da14cebeSEric Cheng
2310da14cebeSEric Cheng /*
2311da14cebeSEric Cheng * If there are any promiscuous mode callbacks
2312da14cebeSEric Cheng * defined for this MAC client, pass them a copy
2313da14cebeSEric Cheng * if appropriate and also update the counters.
2314da14cebeSEric Cheng */
2315da14cebeSEric Cheng if (smcip != NULL) {
2316da14cebeSEric Cheng if (smcip->mci_mip->mi_promisc_list != NULL) {
2317da14cebeSEric Cheng mutex_exit(lock);
2318da14cebeSEric Cheng mac_promisc_dispatch(smcip->mci_mip,
2319c61a1653SRyan Zezeski head, NULL, B_FALSE);
2320da14cebeSEric Cheng mutex_enter(lock);
2321da14cebeSEric Cheng }
2322da14cebeSEric Cheng }
2323da14cebeSEric Cheng if (mac_srs->srs_type & SRST_BW_CONTROL) {
2324da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2325da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_polled += sz;
2326da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2327da14cebeSEric Cheng }
2328da14cebeSEric Cheng MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, head, tail,
2329da14cebeSEric Cheng count, sz);
2330da14cebeSEric Cheng if (count <= 10)
23310dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_chaincntundr10++;
2332da14cebeSEric Cheng else if (count > 10 && count <= 50)
23330dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_chaincnt10to50++;
2334da14cebeSEric Cheng else
23350dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_chaincntover50++;
2336da14cebeSEric Cheng }
2337da14cebeSEric Cheng
2338da14cebeSEric Cheng /*
2339da14cebeSEric Cheng * We are guaranteed that SRS_PROC will be set if we
2340da14cebeSEric Cheng * are here. Also, poll thread gets to run only if
2341da14cebeSEric Cheng * the drain was being done by a worker thread although
2342da14cebeSEric Cheng * it's possible that the worker thread is still running
2343da14cebeSEric Cheng * and the poll thread was sent down to keep the pipeline
2344da14cebeSEric Cheng * going instead of doing a complete drain and then
2345da14cebeSEric Cheng * trying to poll the NIC.
2346da14cebeSEric Cheng *
2347da14cebeSEric Cheng * So we need to check SRS_WORKER flag to make sure
2348da14cebeSEric Cheng * that the worker thread is not processing the queue
2349da14cebeSEric Cheng * in parallel to us. The flags and conditions are
2350da14cebeSEric Cheng * protected by the srs_lock to prevent any race. We
2351da14cebeSEric Cheng * ensure that we don't drop the srs_lock from now
2352da14cebeSEric Cheng * till the end and similarly we don't drop the srs_lock
2353da14cebeSEric Cheng * in mac_rx_srs_drain() till similar condition checks
2354da14cebeSEric Cheng * are complete. The mac_rx_srs_drain() needs to ensure
2355da14cebeSEric Cheng * that the SRS_WORKER flag remains set as long as it's
2356da14cebeSEric Cheng * processing the queue.
2357da14cebeSEric Cheng */
2358da14cebeSEric Cheng if (!(mac_srs->srs_state & SRS_WORKER) &&
2359da14cebeSEric Cheng (mac_srs->srs_first != NULL)) {
2360da14cebeSEric Cheng /*
2361da14cebeSEric Cheng * We have packets to process and worker thread
2362da14cebeSEric Cheng * is not running. Check to see if poll thread is
2363ae6aa22aSVenugopal Iyer * allowed to process.
2364da14cebeSEric Cheng */
2365ae6aa22aSVenugopal Iyer if (mac_srs->srs_state & SRS_LATENCY_OPT) {
2366da14cebeSEric Cheng mac_srs->srs_drain_func(mac_srs, SRS_POLL_PROC);
23677507fc2cSEric Cheng if (!(mac_srs->srs_state & SRS_PAUSE) &&
23687507fc2cSEric Cheng srs_rx->sr_poll_pkt_cnt <=
2369da14cebeSEric Cheng srs_rx->sr_lowat) {
2370da14cebeSEric Cheng srs_rx->sr_poll_again++;
2371da14cebeSEric Cheng goto check_again;
2372ae6aa22aSVenugopal Iyer }
2373da14cebeSEric Cheng /*
2374da14cebeSEric Cheng * We are already above low water mark
2375da14cebeSEric Cheng * so stay in the polling mode but no
2376da14cebeSEric Cheng * need to poll. Once we dip below
2377da14cebeSEric Cheng * the polling threshold, the processing
2378da14cebeSEric Cheng * thread (soft ring) will signal us
2379da14cebeSEric Cheng * to poll again (MAC_UPDATE_SRS_COUNT)
2380da14cebeSEric Cheng */
2381da14cebeSEric Cheng srs_rx->sr_poll_drain_no_poll++;
2382ae6aa22aSVenugopal Iyer mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2383da14cebeSEric Cheng /*
2384da14cebeSEric Cheng * In the B/W control case, it's possible
2385da14cebeSEric Cheng * that the backlog built up due to
2386da14cebeSEric Cheng * B/W limit being reached and packets
2387da14cebeSEric Cheng * are queued only in SRS. In this case,
2388da14cebeSEric Cheng * we should schedule worker thread
2389da14cebeSEric Cheng * since no one else will wake us up.
2390da14cebeSEric Cheng */
2391ae6aa22aSVenugopal Iyer if ((mac_srs->srs_type & SRST_BW_CONTROL) &&
2392da14cebeSEric Cheng (mac_srs->srs_tid == NULL)) {
2393da14cebeSEric Cheng mac_srs->srs_tid =
2394ae6aa22aSVenugopal Iyer timeout(mac_srs_fire, mac_srs, 1);
2395da14cebeSEric Cheng srs_rx->sr_poll_worker_wakeup++;
2396da14cebeSEric Cheng }
2397da14cebeSEric Cheng } else {
2398da14cebeSEric Cheng /*
2399da14cebeSEric Cheng * Wakeup the worker thread for more processing.
2400da14cebeSEric Cheng * We optimize for throughput in this case.
2401da14cebeSEric Cheng */
2402da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_PROC|SRS_GET_PKTS);
2403da14cebeSEric Cheng MAC_SRS_WORKER_WAKEUP(mac_srs);
2404da14cebeSEric Cheng srs_rx->sr_poll_sig_worker++;
2405da14cebeSEric Cheng }
2406da14cebeSEric Cheng } else if ((mac_srs->srs_first == NULL) &&
2407da14cebeSEric Cheng !(mac_srs->srs_state & SRS_WORKER)) {
2408da14cebeSEric Cheng /*
2409da14cebeSEric Cheng * There is nothing queued in SRS and
2410da14cebeSEric Cheng * no worker thread running. Plus we
2411da14cebeSEric Cheng * didn't get anything from the H/W
2412da14cebeSEric Cheng * either (head == NULL).
2413da14cebeSEric Cheng */
2414da14cebeSEric Cheng ASSERT(head == NULL);
2415da14cebeSEric Cheng mac_srs->srs_state &=
2416da14cebeSEric Cheng ~(SRS_PROC|SRS_GET_PKTS);
2417da14cebeSEric Cheng
2418da14cebeSEric Cheng /*
2419da14cebeSEric Cheng * If we have packets in the soft ring, don't allow
2420da14cebeSEric Cheng * more packets to come into this SRS by keeping the
2421da14cebeSEric Cheng * interrupts off but not polling the H/W. The
2422da14cebeSEric Cheng * poll thread will get signaled as soon as
2423da14cebeSEric Cheng * srs_poll_pkt_cnt dips below poll threshold.
2424da14cebeSEric Cheng */
2425da14cebeSEric Cheng if (srs_rx->sr_poll_pkt_cnt == 0) {
2426da14cebeSEric Cheng srs_rx->sr_poll_intr_enable++;
2427da14cebeSEric Cheng MAC_SRS_POLLING_OFF(mac_srs);
2428da14cebeSEric Cheng } else {
2429da14cebeSEric Cheng /*
2430da14cebeSEric Cheng * We know nothing is queued in SRS
2431da14cebeSEric Cheng * since we are here after checking
2432da14cebeSEric Cheng * srs_first is NULL. The backlog
2433da14cebeSEric Cheng * is entirely due to packets queued
2434da14cebeSEric Cheng * in Soft ring which will wake us up
2435da14cebeSEric Cheng * and get the interface out of polling
2436da14cebeSEric Cheng * mode once the backlog dips below
2437da14cebeSEric Cheng * sr_poll_thres.
2438da14cebeSEric Cheng */
2439da14cebeSEric Cheng srs_rx->sr_poll_no_poll++;
2440da14cebeSEric Cheng }
2441da14cebeSEric Cheng } else {
2442da14cebeSEric Cheng /*
2443da14cebeSEric Cheng * Worker thread is already running.
2444da14cebeSEric Cheng * Nothing much to do. If the polling
2445da14cebeSEric Cheng * was enabled, worker thread will deal
2446da14cebeSEric Cheng * with that.
2447da14cebeSEric Cheng */
2448da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_GET_PKTS;
2449da14cebeSEric Cheng srs_rx->sr_poll_goto_sleep++;
2450da14cebeSEric Cheng }
2451da14cebeSEric Cheng }
2452da14cebeSEric Cheng done:
2453da14cebeSEric Cheng mac_srs->srs_state |= SRS_POLL_THR_QUIESCED;
2454da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
2455da14cebeSEric Cheng /*
2456da14cebeSEric Cheng * If this is a temporary quiesce then wait for the restart signal
2457da14cebeSEric Cheng * from the srs worker. Then clear the flags and signal the srs worker
2458da14cebeSEric Cheng * to ensure a positive handshake and go back to start.
2459da14cebeSEric Cheng */
2460da14cebeSEric Cheng while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_POLL_THR_RESTART)))
2461da14cebeSEric Cheng cv_wait(async, lock);
2462da14cebeSEric Cheng if (mac_srs->srs_state & SRS_POLL_THR_RESTART) {
2463da14cebeSEric Cheng ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
2464da14cebeSEric Cheng mac_srs->srs_state &=
2465da14cebeSEric Cheng ~(SRS_POLL_THR_QUIESCED | SRS_POLL_THR_RESTART);
2466da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
2467da14cebeSEric Cheng goto start;
2468da14cebeSEric Cheng } else {
2469da14cebeSEric Cheng mac_srs->srs_state |= SRS_POLL_THR_EXITED;
2470da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
2471da14cebeSEric Cheng CALLB_CPR_EXIT(&cprinfo);
2472da14cebeSEric Cheng thread_exit();
2473da14cebeSEric Cheng }
2474da14cebeSEric Cheng }
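
/*
 * The poll thread above follows the standard callb_cpr(9F) pattern for a
 * CPR-safe service thread. A minimal sketch of that pattern, independent
 * of the SRS specifics:
 *
 *	CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "name");
 *	mutex_enter(lock);
 *	for (;;) {
 *		CALLB_CPR_SAFE_BEGIN(&cprinfo);
 *		cv_wait(cv, lock);
 *		CALLB_CPR_SAFE_END(&cprinfo, lock);
 *		... process work under the lock ...
 *	}
 *	CALLB_CPR_EXIT(&cprinfo);	(returns with the lock dropped)
 *	thread_exit();
 */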
2475da14cebeSEric Cheng
2476da14cebeSEric Cheng /*
2477da14cebeSEric Cheng * mac_srs_pick_chain
2478da14cebeSEric Cheng *
2479da14cebeSEric Cheng * In the bandwidth control case, check how many packets can be processed
2480da14cebeSEric Cheng * and return them in a sub-chain.
2481da14cebeSEric Cheng */
2482da14cebeSEric Cheng static mblk_t *
2483da14cebeSEric Cheng mac_srs_pick_chain(mac_soft_ring_set_t *mac_srs, mblk_t **chain_tail,
2484da14cebeSEric Cheng size_t *chain_sz, int *chain_cnt)
2485da14cebeSEric Cheng {
2486da14cebeSEric Cheng mblk_t *head = NULL;
2487da14cebeSEric Cheng mblk_t *tail = NULL;
2488da14cebeSEric Cheng size_t sz;
2489da14cebeSEric Cheng size_t tsz = 0;
2490da14cebeSEric Cheng int cnt = 0;
2491da14cebeSEric Cheng mblk_t *mp;
2492da14cebeSEric Cheng
2493da14cebeSEric Cheng ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2494da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2495da14cebeSEric Cheng if (((mac_srs->srs_bw->mac_bw_used + mac_srs->srs_size) <=
2496da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit) ||
2497da14cebeSEric Cheng (mac_srs->srs_bw->mac_bw_limit == 0)) {
2498da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2499da14cebeSEric Cheng head = mac_srs->srs_first;
2500da14cebeSEric Cheng mac_srs->srs_first = NULL;
2501da14cebeSEric Cheng *chain_tail = mac_srs->srs_last;
2502da14cebeSEric Cheng mac_srs->srs_last = NULL;
2503da14cebeSEric Cheng *chain_sz = mac_srs->srs_size;
2504da14cebeSEric Cheng *chain_cnt = mac_srs->srs_count;
2505da14cebeSEric Cheng mac_srs->srs_count = 0;
2506da14cebeSEric Cheng mac_srs->srs_size = 0;
2507da14cebeSEric Cheng return (head);
2508da14cebeSEric Cheng }
2509da14cebeSEric Cheng
2510da14cebeSEric Cheng /*
2511da14cebeSEric Cheng * Can't clear the entire backlog.
2512da14cebeSEric Cheng * Need to find how many packets to pick
2513da14cebeSEric Cheng */
2514da14cebeSEric Cheng ASSERT(MUTEX_HELD(&mac_srs->srs_bw->mac_bw_lock));
2515da14cebeSEric Cheng while ((mp = mac_srs->srs_first) != NULL) {
2516da14cebeSEric Cheng sz = msgdsize(mp);
2517da14cebeSEric Cheng if ((tsz + sz + mac_srs->srs_bw->mac_bw_used) >
2518da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit) {
2519da14cebeSEric Cheng if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED))
2520da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state |=
2521da14cebeSEric Cheng SRS_BW_ENFORCED;
2522da14cebeSEric Cheng break;
2523da14cebeSEric Cheng }
2524da14cebeSEric Cheng
2525da14cebeSEric Cheng /*
2526da14cebeSEric Cheng * The size and count are decremented by the soft rings
2527da14cebeSEric Cheng * when they send the packet up, so that polling works
2528da14cebeSEric Cheng * properly.
2529da14cebeSEric Cheng */
2530da14cebeSEric Cheng tsz += sz;
2531da14cebeSEric Cheng cnt++;
2532da14cebeSEric Cheng mac_srs->srs_count--;
2533da14cebeSEric Cheng mac_srs->srs_size -= sz;
2534da14cebeSEric Cheng if (tail != NULL)
2535da14cebeSEric Cheng tail->b_next = mp;
2536da14cebeSEric Cheng else
2537da14cebeSEric Cheng head = mp;
2538da14cebeSEric Cheng tail = mp;
2539da14cebeSEric Cheng mac_srs->srs_first = mac_srs->srs_first->b_next;
2540da14cebeSEric Cheng }
2541da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2542da14cebeSEric Cheng if (mac_srs->srs_first == NULL)
2543da14cebeSEric Cheng mac_srs->srs_last = NULL;
2544da14cebeSEric Cheng
2545da14cebeSEric Cheng if (tail != NULL)
2546da14cebeSEric Cheng tail->b_next = NULL;
2547da14cebeSEric Cheng *chain_tail = tail;
2548da14cebeSEric Cheng *chain_cnt = cnt;
2549da14cebeSEric Cheng *chain_sz = tsz;
2550da14cebeSEric Cheng
2551da14cebeSEric Cheng return (head);
2552da14cebeSEric Cheng }
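
/*
 * The byte accounting above amounts to a per-tick quota check; as a
 * sketch (field names abbreviated from the mac_bw structure used in
 * the code):
 *
 *	while ((mp = srs_first) != NULL) {
 *		if (picked_sz + msgdsize(mp) + bw_used > bw_limit) {
 *			bw_state |= SRS_BW_ENFORCED;	(wait for next tick)
 *			break;
 *		}
 *		unlink mp from the SRS queue and append it to the sub-chain
 *	}
 *
 * mac_bw_used is reset once per tick by the drain routine, which is what
 * reopens the quota.
 */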
2553da14cebeSEric Cheng
2554da14cebeSEric Cheng /*
2555da14cebeSEric Cheng * mac_rx_srs_drain
2556da14cebeSEric Cheng *
2557da14cebeSEric Cheng * The SRS drain routine. Gets to run to clear the queue. Any thread
2558da14cebeSEric Cheng * (worker, interrupt, poll) can call this based on processing model.
2559da14cebeSEric Cheng * The first thing we do is disable interrupts if possible and then
2560da14cebeSEric Cheng * drain the queue. We also try to poll the underlying hardware if
2561da14cebeSEric Cheng * there is a dedicated hardware Rx ring assigned to this SRS.
2562da14cebeSEric Cheng *
2563da14cebeSEric Cheng * There is an equivalent drain routine in bandwidth control mode,
2564da14cebeSEric Cheng * mac_rx_srs_drain_bw. There is some code duplication between the two
2565da14cebeSEric Cheng * routines but they are highly performance sensitive and are easier
2566da14cebeSEric Cheng * to read/debug if they stay separate. Any code changes here might
2567da14cebeSEric Cheng * also apply to mac_rx_srs_drain_bw as well.
2568da14cebeSEric Cheng */
2569da14cebeSEric Cheng void
2570da14cebeSEric Cheng mac_rx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2571da14cebeSEric Cheng {
2572da14cebeSEric Cheng mblk_t *head;
2573da14cebeSEric Cheng mblk_t *tail;
2574da14cebeSEric Cheng timeout_id_t tid;
2575da14cebeSEric Cheng int cnt = 0;
2576da14cebeSEric Cheng mac_client_impl_t *mcip = mac_srs->srs_mcip;
2577da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2578da14cebeSEric Cheng
2579da14cebeSEric Cheng ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2580da14cebeSEric Cheng ASSERT(!(mac_srs->srs_type & SRST_BW_CONTROL));
2581ae6aa22aSVenugopal Iyer
2582da14cebeSEric Cheng /* If we are blanked i.e. can't do upcalls, then we are done */
2583da14cebeSEric Cheng if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2584da14cebeSEric Cheng ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2585da14cebeSEric Cheng (mac_srs->srs_state & SRS_PAUSE));
2586da14cebeSEric Cheng goto out;
2587da14cebeSEric Cheng }
2588da14cebeSEric Cheng
2589da14cebeSEric Cheng if (mac_srs->srs_first == NULL)
2590da14cebeSEric Cheng goto out;
2591da14cebeSEric Cheng
2592ae6aa22aSVenugopal Iyer if (!(mac_srs->srs_state & SRS_LATENCY_OPT) &&
2593ae6aa22aSVenugopal Iyer (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)) {
2594ae6aa22aSVenugopal Iyer /*
2595ae6aa22aSVenugopal Iyer * In the normal case, the SRS worker thread does no
2596ae6aa22aSVenugopal Iyer * work and we wait for a backlog to build up before
2597ae6aa22aSVenugopal Iyer * we switch into polling mode. In case we are
2598ae6aa22aSVenugopal Iyer * optimizing for throughput, we use the worker thread
2599ae6aa22aSVenugopal Iyer * as well. The goal is to let worker thread process
2600ae6aa22aSVenugopal Iyer * the queue and poll thread to feed packets into
2601ae6aa22aSVenugopal Iyer * the queue. As such, we should signal the poll
2602ae6aa22aSVenugopal Iyer * thread to try and get more packets.
2603ae6aa22aSVenugopal Iyer *
2604ae6aa22aSVenugopal Iyer * We could have pulled this check into the POLL_RING
2605ae6aa22aSVenugopal Iyer * macro itself, but keeping it explicit here makes
2606ae6aa22aSVenugopal Iyer * the architecture easier to understand.
2607ae6aa22aSVenugopal Iyer */
2608ae6aa22aSVenugopal Iyer MAC_SRS_POLL_RING(mac_srs);
2609ae6aa22aSVenugopal Iyer }
2610ae6aa22aSVenugopal Iyer
2611ae6aa22aSVenugopal Iyer again:
2612da14cebeSEric Cheng head = mac_srs->srs_first;
2613da14cebeSEric Cheng mac_srs->srs_first = NULL;
2614da14cebeSEric Cheng tail = mac_srs->srs_last;
2615da14cebeSEric Cheng mac_srs->srs_last = NULL;
2616da14cebeSEric Cheng cnt = mac_srs->srs_count;
2617da14cebeSEric Cheng mac_srs->srs_count = 0;
2618da14cebeSEric Cheng
2619da14cebeSEric Cheng ASSERT(head != NULL);
2620da14cebeSEric Cheng ASSERT(tail != NULL);
2621da14cebeSEric Cheng
2622da906b4eSRyan Zezeski if ((tid = mac_srs->srs_tid) != NULL)
2623da906b4eSRyan Zezeski mac_srs->srs_tid = NULL;
2624da14cebeSEric Cheng
2625da14cebeSEric Cheng mac_srs->srs_state |= (SRS_PROC|proc_type);
2626da14cebeSEric Cheng
2627da14cebeSEric Cheng /*
2628da14cebeSEric Cheng * mcip is NULL for broadcast and multicast flows. The promisc
2629da14cebeSEric Cheng * callbacks for broadcast and multicast packets are delivered from
2630da14cebeSEric Cheng * mac_rx() and we don't need to worry about that case in this path
2631da14cebeSEric Cheng */
26320dc2366fSVenugopal Iyer if (mcip != NULL) {
26330dc2366fSVenugopal Iyer if (mcip->mci_promisc_list != NULL) {
2634da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2635da14cebeSEric Cheng mac_promisc_client_dispatch(mcip, head);
2636da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2637da14cebeSEric Cheng }
26380dc2366fSVenugopal Iyer if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
26390dc2366fSVenugopal Iyer mutex_exit(&mac_srs->srs_lock);
2640797f979dSCody Peter Mello mac_protect_intercept_dynamic(mcip, head);
26410dc2366fSVenugopal Iyer mutex_enter(&mac_srs->srs_lock);
26420dc2366fSVenugopal Iyer }
26430dc2366fSVenugopal Iyer }
2644da14cebeSEric Cheng
2645da14cebeSEric Cheng /*
264684de666eSRyan Zezeski * Check if SRS itself is doing the processing. This direct
264784de666eSRyan Zezeski * path applies only when subflows are present.
2648da14cebeSEric Cheng */
2649da14cebeSEric Cheng if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2650da14cebeSEric Cheng mac_direct_rx_t proc;
2651da14cebeSEric Cheng void *arg1;
2652da14cebeSEric Cheng mac_resource_handle_t arg2;
2653da14cebeSEric Cheng
2654da14cebeSEric Cheng /*
2655da14cebeSEric Cheng * This is the case when a Rx is directly
2656da14cebeSEric Cheng * assigned and we have a fully classified
2657da14cebeSEric Cheng * protocol chain. We can deal with it in
2658da14cebeSEric Cheng * one shot.
2659da14cebeSEric Cheng */
2660da14cebeSEric Cheng proc = srs_rx->sr_func;
2661da14cebeSEric Cheng arg1 = srs_rx->sr_arg1;
2662da14cebeSEric Cheng arg2 = srs_rx->sr_arg2;
2663da14cebeSEric Cheng
2664da14cebeSEric Cheng mac_srs->srs_state |= SRS_CLIENT_PROC;
2665da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2666da906b4eSRyan Zezeski if (tid != NULL) {
2667da14cebeSEric Cheng (void) untimeout(tid);
2668da906b4eSRyan Zezeski tid = NULL;
2669da14cebeSEric Cheng }
2670da14cebeSEric Cheng
2671da14cebeSEric Cheng proc(arg1, arg2, head, NULL);
2672da14cebeSEric Cheng /*
2673da14cebeSEric Cheng * Decrement the size and count here itself
2674da14cebeSEric Cheng * since the packet has been processed.
2675da14cebeSEric Cheng */
2676da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2677da14cebeSEric Cheng MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2678da14cebeSEric Cheng if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2679da14cebeSEric Cheng cv_signal(&mac_srs->srs_client_cv);
2680da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2681da14cebeSEric Cheng } else {
2682da14cebeSEric Cheng /* Some kind of softrings based fanout is required */
2683da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2684da906b4eSRyan Zezeski if (tid != NULL) {
2685da14cebeSEric Cheng (void) untimeout(tid);
2686da906b4eSRyan Zezeski tid = NULL;
2687da14cebeSEric Cheng }
2688da14cebeSEric Cheng
2689da14cebeSEric Cheng /*
2690da14cebeSEric Cheng * Since the fanout routines can deal with chains,
2691da14cebeSEric Cheng * shoot the entire chain up.
2692da14cebeSEric Cheng */
2693da14cebeSEric Cheng if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2694da14cebeSEric Cheng mac_rx_srs_fanout(mac_srs, head);
2695da14cebeSEric Cheng else
2696da14cebeSEric Cheng mac_rx_srs_proto_fanout(mac_srs, head);
2697da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2698da14cebeSEric Cheng }
2699da14cebeSEric Cheng
27008ac29891SEric Cheng if (!(mac_srs->srs_state & (SRS_BLANK|SRS_PAUSE)) &&
27018ac29891SEric Cheng (mac_srs->srs_first != NULL)) {
2702da14cebeSEric Cheng /*
27038ac29891SEric Cheng * More packets arrived while we were clearing the
27048ac29891SEric Cheng * SRS. This can be possible because of one of
27058ac29891SEric Cheng * three conditions below:
27068ac29891SEric Cheng * 1) The driver is using multiple worker threads
27078ac29891SEric Cheng * to send the packets to us.
27088ac29891SEric Cheng * 2) The driver has a race in switching
27098ac29891SEric Cheng * between interrupt and polling mode or
27108ac29891SEric Cheng * 3) Packets are arriving in this SRS via the
27118ac29891SEric Cheng * S/W classification as well.
27128ac29891SEric Cheng *
27138ac29891SEric Cheng * We should switch to polling mode and see if we
27148ac29891SEric Cheng * need to send the poll thread down. Also, signal
27158ac29891SEric Cheng * the worker thread to process what's just arrived.
2716da14cebeSEric Cheng */
27178ac29891SEric Cheng MAC_SRS_POLLING_ON(mac_srs);
2718ae6aa22aSVenugopal Iyer if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat) {
2719ae6aa22aSVenugopal Iyer srs_rx->sr_drain_poll_sig++;
2720da14cebeSEric Cheng MAC_SRS_POLL_RING(mac_srs);
2721ae6aa22aSVenugopal Iyer }
27228ac29891SEric Cheng
27238ac29891SEric Cheng /*
27248ac29891SEric Cheng * If we didn't signal the poll thread, we need
27258ac29891SEric Cheng * to deal with the pending packets ourselves.
27268ac29891SEric Cheng */
27278ac29891SEric Cheng if (proc_type == SRS_WORKER) {
2728da14cebeSEric Cheng srs_rx->sr_drain_again++;
2729da14cebeSEric Cheng goto again;
27308ac29891SEric Cheng } else {
27318ac29891SEric Cheng srs_rx->sr_drain_worker_sig++;
27328ac29891SEric Cheng cv_signal(&mac_srs->srs_async);
2733da14cebeSEric Cheng }
2734da14cebeSEric Cheng }
2735da14cebeSEric Cheng
2736da14cebeSEric Cheng out:
2737da14cebeSEric Cheng if (mac_srs->srs_state & SRS_GET_PKTS) {
2738da14cebeSEric Cheng /*
2739da14cebeSEric Cheng * Poll thread is already running. Leave the
2740da14cebeSEric Cheng * SRS_PROC set and hand over control to the
2741da14cebeSEric Cheng * poll thread.
2742da14cebeSEric Cheng */
2743da14cebeSEric Cheng mac_srs->srs_state &= ~proc_type;
2744da14cebeSEric Cheng srs_rx->sr_drain_poll_running++;
2745da14cebeSEric Cheng return;
2746da14cebeSEric Cheng }
2747da14cebeSEric Cheng
2748da14cebeSEric Cheng /*
2749da14cebeSEric Cheng * Even if there are no packets queued in SRS, we
2750da14cebeSEric Cheng * need to make sure that the shared counter is
2751da14cebeSEric Cheng * clear and any associated softrings have cleared
2752da14cebeSEric Cheng * all the backlog. Otherwise, leave the interface
2753da14cebeSEric Cheng * in polling mode and the poll thread will get
2754da14cebeSEric Cheng * signalled once the count goes down to zero.
2755da14cebeSEric Cheng *
2756da14cebeSEric Cheng * If someone is already draining the queue (SRS_PROC is
2757da14cebeSEric Cheng * set) when the srs_poll_pkt_cnt goes down to zero,
2758da14cebeSEric Cheng * then it means that drain is already running and we
2759da14cebeSEric Cheng * will turn off polling at that time if there is
2760da14cebeSEric Cheng * no backlog.
2761da14cebeSEric Cheng *
2762da14cebeSEric Cheng * As long as there are packets queued either
2763da14cebeSEric Cheng * in soft ring set or its soft rings, we will leave
2764da14cebeSEric Cheng * the interface in polling mode (even if the drain
2765da14cebeSEric Cheng * was being done by the interrupt thread). We signal
2766da14cebeSEric Cheng * the poll thread as well if we have dipped below
2767da14cebeSEric Cheng * low water mark.
2768da14cebeSEric Cheng *
2769da14cebeSEric Cheng * NOTE: We can't use the MAC_SRS_POLLING_ON macro
2770da14cebeSEric Cheng * since that turns polling on only for the worker thread.
2771da14cebeSEric Cheng * It's not worth turning polling on for the interrupt
2772da14cebeSEric Cheng * thread (since the NIC will not issue another interrupt)
2773da14cebeSEric Cheng * unless a backlog builds up.
2774da14cebeSEric Cheng */
2775da14cebeSEric Cheng if ((srs_rx->sr_poll_pkt_cnt > 0) &&
2776da14cebeSEric Cheng (mac_srs->srs_state & SRS_POLLING_CAPAB)) {
2777da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2778da14cebeSEric Cheng srs_rx->sr_drain_keep_polling++;
2779da14cebeSEric Cheng MAC_SRS_POLLING_ON(mac_srs);
2780da14cebeSEric Cheng if (srs_rx->sr_poll_pkt_cnt <= srs_rx->sr_lowat)
2781da14cebeSEric Cheng MAC_SRS_POLL_RING(mac_srs);
2782da14cebeSEric Cheng return;
2783da14cebeSEric Cheng }
2784da14cebeSEric Cheng
2785da14cebeSEric Cheng /* Nothing else to do. Get out of poll mode */
2786da14cebeSEric Cheng MAC_SRS_POLLING_OFF(mac_srs);
2787da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_PROC|proc_type);
2788da14cebeSEric Cheng srs_rx->sr_drain_finish_intr++;
2789da14cebeSEric Cheng }
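
/*
 * The tail of the drain above can be summarized with this control-flow
 * sketch (a simplification; the code above is authoritative):
 *
 *	if (more packets arrived while we were draining) {
 *		turn polling on (MAC_SRS_POLLING_ON);
 *		if (backlog is at or below sr_lowat)
 *			signal the poll thread (MAC_SRS_POLL_RING);
 *		if (we are the worker thread)
 *			goto again;		(drain once more ourselves)
 *		else
 *			cv_signal(&srs_async);	(hand off to the worker)
 *	}
 *
 * This keeps a single thread draining the SRS at a time while still
 * letting the interrupt and poll threads feed it.
 */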
2790da14cebeSEric Cheng
2791da14cebeSEric Cheng /*
2792da14cebeSEric Cheng * mac_rx_srs_drain_bw
2793da14cebeSEric Cheng *
2794da14cebeSEric Cheng * The SRS BW drain routine. Gets to run to clear the queue. Any thread
2795da14cebeSEric Cheng * (worker, interrupt, poll) can call this based on processing model.
2796da14cebeSEric Cheng * The first thing we do is disable interrupts if possible and then
2797da14cebeSEric Cheng * drain the queue. We also try to poll the underlying hardware if
2798da14cebeSEric Cheng * there is a dedicated hardware Rx ring assigned to this SRS.
2799da14cebeSEric Cheng *
2800da14cebeSEric Cheng * There is an equivalent drain routine in non-bandwidth control mode,
2801da14cebeSEric Cheng * mac_rx_srs_drain. There is some code duplication between the two
2802da14cebeSEric Cheng * routines but they are highly performance sensitive and are easier
2803da14cebeSEric Cheng * to read/debug if they stay separate. Any code changes here might
2804da14cebeSEric Cheng * also apply to mac_rx_srs_drain as well.
2805da14cebeSEric Cheng */
2806da14cebeSEric Cheng void
2807da14cebeSEric Cheng mac_rx_srs_drain_bw(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2808da14cebeSEric Cheng {
2809da14cebeSEric Cheng mblk_t *head;
2810da14cebeSEric Cheng mblk_t *tail;
2811da14cebeSEric Cheng timeout_id_t tid;
2812da14cebeSEric Cheng size_t sz = 0;
2813da14cebeSEric Cheng int cnt = 0;
2814da14cebeSEric Cheng mac_client_impl_t *mcip = mac_srs->srs_mcip;
2815da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
2816d3d50737SRafael Vanoni clock_t now;
2817da14cebeSEric Cheng
2818da14cebeSEric Cheng ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2819da14cebeSEric Cheng ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
2820da14cebeSEric Cheng again:
2821da14cebeSEric Cheng /* Check if we are doing B/W control */
2822da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2823d3d50737SRafael Vanoni now = ddi_get_lbolt();
2824d3d50737SRafael Vanoni if (mac_srs->srs_bw->mac_bw_curr_time != now) {
2825d3d50737SRafael Vanoni mac_srs->srs_bw->mac_bw_curr_time = now;
2826da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = 0;
2827da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
2828da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED;
2829da14cebeSEric Cheng } else if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) {
2830da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2831da14cebeSEric Cheng goto done;
2832da14cebeSEric Cheng } else if (mac_srs->srs_bw->mac_bw_used >
2833da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit) {
2834da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
2835da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2836da14cebeSEric Cheng goto done;
2837da14cebeSEric Cheng }
2838da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2839da14cebeSEric Cheng
2840da14cebeSEric Cheng /* If we are blanked i.e. can't do upcalls, then we are done */
2841da14cebeSEric Cheng if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2842da14cebeSEric Cheng ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
2843da14cebeSEric Cheng (mac_srs->srs_state & SRS_PAUSE));
2844da14cebeSEric Cheng goto done;
2845da14cebeSEric Cheng }
2846da14cebeSEric Cheng
2847da14cebeSEric Cheng sz = 0;
2848da14cebeSEric Cheng cnt = 0;
2849da14cebeSEric Cheng if ((head = mac_srs_pick_chain(mac_srs, &tail, &sz, &cnt)) == NULL) {
2850da14cebeSEric Cheng /*
2851da14cebeSEric Cheng * We couldn't pick up a single packet.
2852da14cebeSEric Cheng */
2853da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2854da14cebeSEric Cheng if ((mac_srs->srs_bw->mac_bw_used == 0) &&
2855da14cebeSEric Cheng (mac_srs->srs_size != 0) &&
2856da14cebeSEric Cheng !(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2857da14cebeSEric Cheng /*
2858da14cebeSEric Cheng * Seems like configured B/W doesn't
2859da14cebeSEric Cheng * even allow processing of 1 packet
2860da14cebeSEric Cheng * per tick.
2861da14cebeSEric Cheng *
2862da14cebeSEric Cheng * XXX: raise the limit to processing
2863da14cebeSEric Cheng * at least 1 packet per tick.
2864da14cebeSEric Cheng */
2865da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit +=
2866da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit;
2867da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_drop_threshold +=
2868da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_drop_threshold;
2869da14cebeSEric Cheng cmn_err(CE_NOTE, "mac_rx_srs_drain: srs(%p) "
2870da14cebeSEric Cheng "raised B/W limit to %d since not even a "
2871da14cebeSEric Cheng "single packet can be processed per "
2872da14cebeSEric Cheng "tick %d\n", (void *)mac_srs,
2873da14cebeSEric Cheng (int)mac_srs->srs_bw->mac_bw_limit,
2874da14cebeSEric Cheng (int)msgdsize(mac_srs->srs_first));
2875da14cebeSEric Cheng }
2876da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2877da14cebeSEric Cheng goto done;
2878da14cebeSEric Cheng }
2879da14cebeSEric Cheng
2880da14cebeSEric Cheng ASSERT(head != NULL);
2881da14cebeSEric Cheng ASSERT(tail != NULL);
2882da14cebeSEric Cheng
2883da14cebeSEric Cheng /* zero bandwidth: drop all and return to interrupt mode */
2884da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2885da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_limit == 0) {
28860dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_sdrops += cnt;
2887da14cebeSEric Cheng ASSERT(mac_srs->srs_bw->mac_bw_sz >= sz);
2888da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_sz -= sz;
2889da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_drop_bytes += sz;
2890da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2891c61a1653SRyan Zezeski mac_drop_chain(head, "Rx no bandwidth");
2892da14cebeSEric Cheng goto leave_poll;
2893da14cebeSEric Cheng } else {
2894da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2895da14cebeSEric Cheng }
2896da14cebeSEric Cheng
2897da906b4eSRyan Zezeski if ((tid = mac_srs->srs_tid) != NULL)
2898da906b4eSRyan Zezeski mac_srs->srs_tid = NULL;
2899da14cebeSEric Cheng
2900da14cebeSEric Cheng mac_srs->srs_state |= (SRS_PROC|proc_type);
2901da14cebeSEric Cheng MAC_SRS_WORKER_POLLING_ON(mac_srs);
2902da14cebeSEric Cheng
2903da14cebeSEric Cheng /*
2904da14cebeSEric Cheng * mcip is NULL for broadcast and multicast flows. The promisc
2905da14cebeSEric Cheng * callbacks for broadcast and multicast packets are delivered from
2906da14cebeSEric Cheng * mac_rx() and we don't need to worry about that case in this path
2907da14cebeSEric Cheng */
29080dc2366fSVenugopal Iyer if (mcip != NULL) {
29090dc2366fSVenugopal Iyer if (mcip->mci_promisc_list != NULL) {
2910da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2911da14cebeSEric Cheng mac_promisc_client_dispatch(mcip, head);
2912da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2913da14cebeSEric Cheng }
29140dc2366fSVenugopal Iyer if (MAC_PROTECT_ENABLED(mcip, MPT_IPNOSPOOF)) {
29150dc2366fSVenugopal Iyer mutex_exit(&mac_srs->srs_lock);
2916797f979dSCody Peter Mello mac_protect_intercept_dynamic(mcip, head);
29170dc2366fSVenugopal Iyer mutex_enter(&mac_srs->srs_lock);
29180dc2366fSVenugopal Iyer }
29190dc2366fSVenugopal Iyer }
2920da14cebeSEric Cheng
2921da14cebeSEric Cheng /*
2922da14cebeSEric Cheng 	 * Check if the SRS itself is doing the processing.
2923da14cebeSEric Cheng 	 * This direct path does not apply when subflows are present. In this
2924da14cebeSEric Cheng 	 * case, packets need to be dispatched to a soft ring according to the
2925da14cebeSEric Cheng 	 * flow's bandwidth and other resource constraints.
2926da14cebeSEric Cheng */
2927da14cebeSEric Cheng if (mac_srs->srs_type & SRST_NO_SOFT_RINGS) {
2928da14cebeSEric Cheng mac_direct_rx_t proc;
2929da14cebeSEric Cheng void *arg1;
2930da14cebeSEric Cheng mac_resource_handle_t arg2;
2931da14cebeSEric Cheng
2932da14cebeSEric Cheng /*
2933da14cebeSEric Cheng 		 * This is the case when an Rx ring is directly
2934da14cebeSEric Cheng 		 * assigned and we have a fully classified
2935da14cebeSEric Cheng * protocol chain. We can deal with it in
2936da14cebeSEric Cheng * one shot.
2937da14cebeSEric Cheng */
2938da14cebeSEric Cheng proc = srs_rx->sr_func;
2939da14cebeSEric Cheng arg1 = srs_rx->sr_arg1;
2940da14cebeSEric Cheng arg2 = srs_rx->sr_arg2;
2941da14cebeSEric Cheng
2942da14cebeSEric Cheng mac_srs->srs_state |= SRS_CLIENT_PROC;
2943da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2944da906b4eSRyan Zezeski if (tid != NULL) {
2945da14cebeSEric Cheng (void) untimeout(tid);
2946da906b4eSRyan Zezeski tid = NULL;
2947da14cebeSEric Cheng }
2948da14cebeSEric Cheng
2949da14cebeSEric Cheng proc(arg1, arg2, head, NULL);
2950da14cebeSEric Cheng /*
2951da14cebeSEric Cheng 		 * Decrement the size and count here, since
2952da14cebeSEric Cheng 		 * the packets have been processed.
2953da14cebeSEric Cheng */
2954da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2955da14cebeSEric Cheng MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
2956da14cebeSEric Cheng MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
2957da14cebeSEric Cheng
2958da14cebeSEric Cheng if (mac_srs->srs_state & SRS_CLIENT_WAIT)
2959da14cebeSEric Cheng cv_signal(&mac_srs->srs_client_cv);
2960da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_CLIENT_PROC;
2961da14cebeSEric Cheng } else {
2962da14cebeSEric Cheng 		/* Some kind of soft ring based fanout is required */
2963da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
2964da906b4eSRyan Zezeski if (tid != NULL) {
2965da14cebeSEric Cheng (void) untimeout(tid);
2966da906b4eSRyan Zezeski tid = NULL;
2967da14cebeSEric Cheng }
2968da14cebeSEric Cheng
2969da14cebeSEric Cheng /*
2970da14cebeSEric Cheng * Since the fanout routines can deal with chains,
2971da14cebeSEric Cheng * shoot the entire chain up.
2972da14cebeSEric Cheng */
2973da14cebeSEric Cheng if (mac_srs->srs_type & SRST_FANOUT_SRC_IP)
2974da14cebeSEric Cheng mac_rx_srs_fanout(mac_srs, head);
2975da14cebeSEric Cheng else
2976da14cebeSEric Cheng mac_rx_srs_proto_fanout(mac_srs, head);
2977da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
2978da14cebeSEric Cheng }
2979da14cebeSEric Cheng
2980da14cebeSEric Cheng /*
2981da14cebeSEric Cheng * Send the poll thread to pick up any packets arrived
2982da14cebeSEric Cheng * so far. This also serves as the last check in case
2983da14cebeSEric Cheng * nothing else is queued in the SRS. The poll thread
2984da14cebeSEric Cheng * is signalled only in the case the drain was done
2985da14cebeSEric Cheng * by the worker thread and SRS_WORKER is set. The
2986da14cebeSEric Cheng * worker thread can run in parallel as long as the
2987da14cebeSEric Cheng 	 * SRS_WORKER flag is set. If we have nothing else to
2988da14cebeSEric Cheng * process, we can exit while leaving SRS_PROC set
2989da14cebeSEric Cheng * which gives the poll thread control to process and
2990da14cebeSEric Cheng * cleanup once it returns from the NIC.
2991da14cebeSEric Cheng *
2992da14cebeSEric Cheng * If we have nothing else to process, we need to
2993da14cebeSEric Cheng * ensure that we keep holding the srs_lock till
2994da14cebeSEric Cheng * all the checks below are done and control is
2995da14cebeSEric Cheng * handed to the poll thread if it was running.
2996da14cebeSEric Cheng */
2997da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2998da14cebeSEric Cheng if (!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
2999da14cebeSEric Cheng if (mac_srs->srs_first != NULL) {
3000da14cebeSEric Cheng if (proc_type == SRS_WORKER) {
3001da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3002da14cebeSEric Cheng if (srs_rx->sr_poll_pkt_cnt <=
3003da14cebeSEric Cheng srs_rx->sr_lowat)
3004da14cebeSEric Cheng MAC_SRS_POLL_RING(mac_srs);
3005da14cebeSEric Cheng goto again;
3006da14cebeSEric Cheng } else {
3007da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3008da14cebeSEric Cheng }
3009da14cebeSEric Cheng }
3010da14cebeSEric Cheng }
3011da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3012da14cebeSEric Cheng
3013da14cebeSEric Cheng done:
3014da14cebeSEric Cheng
3015da14cebeSEric Cheng if (mac_srs->srs_state & SRS_GET_PKTS) {
3016da14cebeSEric Cheng /*
3017da14cebeSEric Cheng * Poll thread is already running. Leave the
3018da14cebeSEric Cheng 		 * SRS_PROC flag set and hand over control to
3019da14cebeSEric Cheng 		 * the poll thread.
3020da14cebeSEric Cheng */
3021da14cebeSEric Cheng mac_srs->srs_state &= ~proc_type;
3022da14cebeSEric Cheng return;
3023da14cebeSEric Cheng }
3024da14cebeSEric Cheng
3025da14cebeSEric Cheng /*
3026da14cebeSEric Cheng * If we can't process packets because we have exceeded
3027da14cebeSEric Cheng * B/W limit for this tick, just set the timeout
3028da14cebeSEric Cheng * and leave.
3029da14cebeSEric Cheng *
3030da14cebeSEric Cheng * Even if there are no packets queued in SRS, we
3031da14cebeSEric Cheng * need to make sure that the shared counter is
3032da14cebeSEric Cheng * clear and any associated softrings have cleared
3033da14cebeSEric Cheng * all the backlog. Otherwise, leave the interface
3034da14cebeSEric Cheng * in polling mode and the poll thread will get
3035da14cebeSEric Cheng * signalled once the count goes down to zero.
3036da14cebeSEric Cheng *
3037da14cebeSEric Cheng * If someone is already draining the queue (SRS_PROC is
3038da14cebeSEric Cheng * set) when the srs_poll_pkt_cnt goes down to zero,
3039da14cebeSEric Cheng * then it means that drain is already running and we
3040da14cebeSEric Cheng * will turn off polling at that time if there is
3041da14cebeSEric Cheng * no backlog. As long as there are packets queued either
3042da14cebeSEric Cheng 	 * in the soft ring set or its soft rings, we will leave
3043da14cebeSEric Cheng * the interface in polling mode.
3044da14cebeSEric Cheng */
3045da14cebeSEric Cheng mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
3046da14cebeSEric Cheng if ((mac_srs->srs_state & SRS_POLLING_CAPAB) &&
3047da14cebeSEric Cheng ((mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED) ||
3048da14cebeSEric Cheng (srs_rx->sr_poll_pkt_cnt > 0))) {
3049da14cebeSEric Cheng MAC_SRS_POLLING_ON(mac_srs);
3050da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3051da14cebeSEric Cheng if ((mac_srs->srs_first != NULL) &&
3052da14cebeSEric Cheng (mac_srs->srs_tid == NULL))
3053da14cebeSEric Cheng mac_srs->srs_tid = timeout(mac_srs_fire,
3054da14cebeSEric Cheng mac_srs, 1);
3055da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3056da14cebeSEric Cheng return;
3057da14cebeSEric Cheng }
3058da14cebeSEric Cheng mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
3059da14cebeSEric Cheng
3060da14cebeSEric Cheng leave_poll:
3061da14cebeSEric Cheng
3062da14cebeSEric Cheng /* Nothing else to do. Get out of poll mode */
3063da14cebeSEric Cheng MAC_SRS_POLLING_OFF(mac_srs);
3064da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_PROC|proc_type);
3065da14cebeSEric Cheng }
3066da14cebeSEric Cheng
3067da14cebeSEric Cheng /*
3068da14cebeSEric Cheng * mac_srs_worker
3069da14cebeSEric Cheng *
3070da14cebeSEric Cheng * The SRS worker routine. Drains the queue when no one else is
3071da14cebeSEric Cheng * processing it.
3072da14cebeSEric Cheng */
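/*
 * In outline: sleep until there is work (packets queued, no other
 * thread holding SRS_PROC, no bandwidth enforcement pending), then
 * call srs_drain_func() to drain; on SRS_PAUSE fall into the
 * quiesce/restart protocol at the bottom, driven by SRS_RESTART and
 * SRS_CONDEMNED.
 */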
3073da14cebeSEric Cheng void
3074da14cebeSEric Cheng mac_srs_worker(mac_soft_ring_set_t *mac_srs)
3075da14cebeSEric Cheng {
3076da14cebeSEric Cheng kmutex_t *lock = &mac_srs->srs_lock;
3077da14cebeSEric Cheng kcondvar_t *async = &mac_srs->srs_async;
3078da14cebeSEric Cheng callb_cpr_t cprinfo;
3079da14cebeSEric Cheng boolean_t bw_ctl_flag;
3080da14cebeSEric Cheng
3081da14cebeSEric Cheng CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "srs_worker");
3082da14cebeSEric Cheng mutex_enter(lock);
3083da14cebeSEric Cheng
3084da14cebeSEric Cheng start:
3085da14cebeSEric Cheng for (;;) {
3086da14cebeSEric Cheng bw_ctl_flag = B_FALSE;
3087da14cebeSEric Cheng if (mac_srs->srs_type & SRST_BW_CONTROL) {
3088da14cebeSEric Cheng MAC_SRS_BW_LOCK(mac_srs);
3089da14cebeSEric Cheng MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3090da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)
3091da14cebeSEric Cheng bw_ctl_flag = B_TRUE;
3092da14cebeSEric Cheng MAC_SRS_BW_UNLOCK(mac_srs);
3093da14cebeSEric Cheng }
3094da14cebeSEric Cheng /*
3095da14cebeSEric Cheng * The SRS_BW_ENFORCED flag may change since we have dropped
3096da14cebeSEric Cheng * the mac_bw_lock. However the drain function can handle both
3097da14cebeSEric Cheng * a drainable SRS or a bandwidth controlled SRS, and the
3098da14cebeSEric Cheng * effect of scheduling a timeout is to wakeup the worker
3099da14cebeSEric Cheng * thread which in turn will call the drain function. Since
3100da14cebeSEric Cheng 		 * we release the srs_lock atomically only in the cv_wait, there
3101da14cebeSEric Cheng 		 * isn't a fear of waiting forever.
3102da14cebeSEric Cheng */
3103da14cebeSEric Cheng while (((mac_srs->srs_state & SRS_PROC) ||
3104da14cebeSEric Cheng (mac_srs->srs_first == NULL) || bw_ctl_flag ||
3105da14cebeSEric Cheng (mac_srs->srs_state & SRS_TX_BLOCKED)) &&
3106da14cebeSEric Cheng !(mac_srs->srs_state & SRS_PAUSE)) {
3107da14cebeSEric Cheng /*
3108da14cebeSEric Cheng * If we have packets queued and we are here
3109da14cebeSEric Cheng * because B/W control is in place, we better
3110da14cebeSEric Cheng * schedule the worker wakeup after 1 tick
3111da14cebeSEric Cheng * to see if bandwidth control can be relaxed.
3112da14cebeSEric Cheng */
3113da14cebeSEric Cheng if (bw_ctl_flag && mac_srs->srs_tid == NULL) {
3114da14cebeSEric Cheng /*
3115da14cebeSEric Cheng * We need to ensure that a timer is already
3116da14cebeSEric Cheng * scheduled or we force schedule one for
3117da14cebeSEric Cheng * later so that we can continue processing
3118da14cebeSEric Cheng 				 * after this quantum is over.
3119da14cebeSEric Cheng */
3120da14cebeSEric Cheng mac_srs->srs_tid = timeout(mac_srs_fire,
3121da14cebeSEric Cheng mac_srs, 1);
3122da14cebeSEric Cheng }
3123da14cebeSEric Cheng wait:
3124da14cebeSEric Cheng CALLB_CPR_SAFE_BEGIN(&cprinfo);
3125da14cebeSEric Cheng cv_wait(async, lock);
3126da14cebeSEric Cheng CALLB_CPR_SAFE_END(&cprinfo, lock);
3127da14cebeSEric Cheng
3128da14cebeSEric Cheng if (mac_srs->srs_state & SRS_PAUSE)
3129da14cebeSEric Cheng goto done;
3130da14cebeSEric Cheng if (mac_srs->srs_state & SRS_PROC)
3131da14cebeSEric Cheng goto wait;
3132da14cebeSEric Cheng
3133da14cebeSEric Cheng if (mac_srs->srs_first != NULL &&
3134da14cebeSEric Cheng mac_srs->srs_type & SRST_BW_CONTROL) {
3135da14cebeSEric Cheng MAC_SRS_BW_LOCK(mac_srs);
3136da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_state &
3137da14cebeSEric Cheng SRS_BW_ENFORCED) {
3138da14cebeSEric Cheng MAC_SRS_CHECK_BW_CONTROL(mac_srs);
3139da14cebeSEric Cheng }
3140da14cebeSEric Cheng bw_ctl_flag = mac_srs->srs_bw->mac_bw_state &
3141da14cebeSEric Cheng SRS_BW_ENFORCED;
3142da14cebeSEric Cheng MAC_SRS_BW_UNLOCK(mac_srs);
3143da14cebeSEric Cheng }
3144da14cebeSEric Cheng }
3145da14cebeSEric Cheng
3146da14cebeSEric Cheng if (mac_srs->srs_state & SRS_PAUSE)
3147da14cebeSEric Cheng goto done;
3148da14cebeSEric Cheng mac_srs->srs_drain_func(mac_srs, SRS_WORKER);
3149da14cebeSEric Cheng }
3150da14cebeSEric Cheng done:
3151da14cebeSEric Cheng /*
3152da14cebeSEric Cheng * The Rx SRS quiesce logic first cuts off packet supply to the SRS
3153da14cebeSEric Cheng * from both hard and soft classifications and waits for such threads
3154da14cebeSEric Cheng * to finish before signaling the worker. So at this point the only
3155da14cebeSEric Cheng * thread left that could be competing with the worker is the poll
3156da14cebeSEric Cheng * thread. In the case of Tx, there shouldn't be any thread holding
3157da14cebeSEric Cheng * SRS_PROC at this point.
3158da14cebeSEric Cheng */
3159da14cebeSEric Cheng if (!(mac_srs->srs_state & SRS_PROC)) {
3160da14cebeSEric Cheng mac_srs->srs_state |= SRS_PROC;
3161da14cebeSEric Cheng } else {
3162da14cebeSEric Cheng ASSERT((mac_srs->srs_type & SRST_TX) == 0);
3163da14cebeSEric Cheng /*
3164da14cebeSEric Cheng * Poll thread still owns the SRS and is still running
3165da14cebeSEric Cheng */
3166da14cebeSEric Cheng ASSERT((mac_srs->srs_poll_thr == NULL) ||
3167da14cebeSEric Cheng ((mac_srs->srs_state & SRS_POLL_THR_OWNER) ==
3168da14cebeSEric Cheng SRS_POLL_THR_OWNER));
3169da14cebeSEric Cheng }
3170da14cebeSEric Cheng mac_srs_worker_quiesce(mac_srs);
3171da14cebeSEric Cheng /*
3172da14cebeSEric Cheng * Wait for the SRS_RESTART or SRS_CONDEMNED signal from the initiator
3173da14cebeSEric Cheng * of the quiesce operation
3174da14cebeSEric Cheng */
3175da14cebeSEric Cheng while (!(mac_srs->srs_state & (SRS_CONDEMNED | SRS_RESTART)))
3176da14cebeSEric Cheng cv_wait(&mac_srs->srs_async, &mac_srs->srs_lock);
3177da14cebeSEric Cheng
3178da14cebeSEric Cheng if (mac_srs->srs_state & SRS_RESTART) {
3179da14cebeSEric Cheng ASSERT(!(mac_srs->srs_state & SRS_CONDEMNED));
3180da14cebeSEric Cheng mac_srs_worker_restart(mac_srs);
3181da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_PROC;
3182da14cebeSEric Cheng goto start;
3183da14cebeSEric Cheng }
3184da14cebeSEric Cheng
3185da14cebeSEric Cheng if (!(mac_srs->srs_state & SRS_CONDEMNED_DONE))
3186da14cebeSEric Cheng mac_srs_worker_quiesce(mac_srs);
3187da14cebeSEric Cheng
3188da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_PROC;
3189da14cebeSEric Cheng /* The macro drops the srs_lock */
3190da14cebeSEric Cheng CALLB_CPR_EXIT(&cprinfo);
3191da14cebeSEric Cheng thread_exit();
3192da14cebeSEric Cheng }
3193da14cebeSEric Cheng
3194da14cebeSEric Cheng /*
3195da14cebeSEric Cheng * mac_rx_srs_subflow_process
3196da14cebeSEric Cheng *
3197da14cebeSEric Cheng * Receive side routine called from interrupt path when there are
3198da14cebeSEric Cheng * sub flows present on this SRS.
3199da14cebeSEric Cheng */
3200da14cebeSEric Cheng /* ARGSUSED */
3201da14cebeSEric Cheng void
3202da14cebeSEric Cheng mac_rx_srs_subflow_process(void *arg, mac_resource_handle_t srs,
3203da14cebeSEric Cheng mblk_t *mp_chain, boolean_t loopback)
3204da14cebeSEric Cheng {
3205da14cebeSEric Cheng flow_entry_t *flent = NULL;
3206da14cebeSEric Cheng flow_entry_t *prev_flent = NULL;
3207da14cebeSEric Cheng mblk_t *mp = NULL;
3208da14cebeSEric Cheng mblk_t *tail = NULL;
3209da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3210da14cebeSEric Cheng mac_client_impl_t *mcip;
3211da14cebeSEric Cheng
3212da14cebeSEric Cheng mcip = mac_srs->srs_mcip;
3213da14cebeSEric Cheng ASSERT(mcip != NULL);
3214da14cebeSEric Cheng
3215da14cebeSEric Cheng /*
3216da14cebeSEric Cheng * We need to determine the SRS for every packet
3217da14cebeSEric Cheng 	 * by walking the flow table. If we don't find a matching
3218da14cebeSEric Cheng 	 * flow, we proceed using the SRS we came with.
3219da14cebeSEric Cheng */
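	/*
	 * Consecutive packets that classify to the same flow are kept
	 * together as a sub-chain; the accumulated sub-chain is handed to
	 * that flow's callback (or to this SRS when there is no matching
	 * subflow) each time mac_flow_lookup() returns a different flent.
	 */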
3220da14cebeSEric Cheng mp = tail = mp_chain;
3221da14cebeSEric Cheng while (mp != NULL) {
3222da14cebeSEric Cheng
3223da14cebeSEric Cheng /*
3224*67cac09cSKyle Simpson 		 * We will increment the stats for the matching subflow
3225da14cebeSEric Cheng 		 * when we get the bytes/pkt count for the classified packets
3226da14cebeSEric Cheng 		 * later in mac_rx_srs_process().
3227da14cebeSEric Cheng */
3228da14cebeSEric Cheng (void) mac_flow_lookup(mcip->mci_subflow_tab, mp,
3229da14cebeSEric Cheng FLOW_INBOUND, &flent);
3230da14cebeSEric Cheng
3231da14cebeSEric Cheng if (mp == mp_chain || flent == prev_flent) {
3232da14cebeSEric Cheng if (prev_flent != NULL)
3233da14cebeSEric Cheng FLOW_REFRELE(prev_flent);
3234da14cebeSEric Cheng prev_flent = flent;
3235da14cebeSEric Cheng flent = NULL;
3236da14cebeSEric Cheng tail = mp;
3237da14cebeSEric Cheng mp = mp->b_next;
3238da14cebeSEric Cheng continue;
3239da14cebeSEric Cheng }
3240da14cebeSEric Cheng tail->b_next = NULL;
3241da14cebeSEric Cheng /*
3242da14cebeSEric Cheng 		 * A NULL flent indicates this is for the mac_srs itself.
3243da14cebeSEric Cheng * XXX-venu : probably assert for fe_rx_srs_cnt == 0.
3244da14cebeSEric Cheng */
3245da14cebeSEric Cheng if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3246da14cebeSEric Cheng mac_rx_srs_process(arg,
3247da14cebeSEric Cheng (mac_resource_handle_t)mac_srs, mp_chain,
3248da14cebeSEric Cheng loopback);
3249da14cebeSEric Cheng } else {
3250da14cebeSEric Cheng (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3251da14cebeSEric Cheng prev_flent->fe_cb_arg2, mp_chain, loopback);
3252da14cebeSEric Cheng FLOW_REFRELE(prev_flent);
3253da14cebeSEric Cheng }
3254da14cebeSEric Cheng prev_flent = flent;
3255da14cebeSEric Cheng flent = NULL;
3256da14cebeSEric Cheng mp_chain = mp;
3257da14cebeSEric Cheng tail = mp;
3258da14cebeSEric Cheng mp = mp->b_next;
3259da14cebeSEric Cheng }
3260da14cebeSEric Cheng /* Last chain */
3261da14cebeSEric Cheng ASSERT(mp_chain != NULL);
3262da14cebeSEric Cheng if (prev_flent == NULL || prev_flent->fe_rx_srs_cnt == 0) {
3263da14cebeSEric Cheng mac_rx_srs_process(arg,
3264da14cebeSEric Cheng (mac_resource_handle_t)mac_srs, mp_chain, loopback);
3265da14cebeSEric Cheng } else {
3266da14cebeSEric Cheng (prev_flent->fe_cb_fn)(prev_flent->fe_cb_arg1,
3267da14cebeSEric Cheng prev_flent->fe_cb_arg2, mp_chain, loopback);
3268da14cebeSEric Cheng FLOW_REFRELE(prev_flent);
3269da14cebeSEric Cheng }
3270da14cebeSEric Cheng }
3271da14cebeSEric Cheng
3272da14cebeSEric Cheng /*
3273c61a1653SRyan Zezeski * MAC SRS receive side routine. If the data is coming from the
3274c61a1653SRyan Zezeski * network (i.e. from a NIC) then this is called in interrupt context.
3275c61a1653SRyan Zezeski * If the data is coming from a local sender (e.g. mac_tx_send() or
3276c61a1653SRyan Zezeski * bridge_forward()) then this is not called in interrupt context.
3277da14cebeSEric Cheng *
3278da14cebeSEric Cheng * loopback is set to force a context switch on the loopback
3279da14cebeSEric Cheng * path between MAC clients.
3280da14cebeSEric Cheng */
3281da14cebeSEric Cheng /* ARGSUSED */
3282da14cebeSEric Cheng void
3283da14cebeSEric Cheng mac_rx_srs_process(void *arg, mac_resource_handle_t srs, mblk_t *mp_chain,
3284da14cebeSEric Cheng boolean_t loopback)
3285da14cebeSEric Cheng {
3286da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = (mac_soft_ring_set_t *)srs;
3287da14cebeSEric Cheng mblk_t *mp, *tail, *head;
3288da14cebeSEric Cheng int count = 0;
3289da14cebeSEric Cheng int count1;
3290da14cebeSEric Cheng size_t sz = 0;
3291da14cebeSEric Cheng size_t chain_sz, sz1;
3292da14cebeSEric Cheng mac_bw_ctl_t *mac_bw;
3293da14cebeSEric Cheng mac_srs_rx_t *srs_rx = &mac_srs->srs_rx;
3294da14cebeSEric Cheng
3295da14cebeSEric Cheng /*
3296da14cebeSEric Cheng * Set the tail, count and sz. We set the sz irrespective
3297da14cebeSEric Cheng * of whether we are doing B/W control or not for the
3298da14cebeSEric Cheng * purpose of updating the stats.
3299da14cebeSEric Cheng */
3300da14cebeSEric Cheng mp = tail = mp_chain;
3301da14cebeSEric Cheng while (mp != NULL) {
3302da14cebeSEric Cheng tail = mp;
3303da14cebeSEric Cheng count++;
3304da14cebeSEric Cheng sz += msgdsize(mp);
3305da14cebeSEric Cheng mp = mp->b_next;
3306da14cebeSEric Cheng }
3307da14cebeSEric Cheng
3308da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3309da14cebeSEric Cheng
33100dc2366fSVenugopal Iyer if (loopback) {
33110dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, lclbytes, sz);
33120dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, lclcnt, count);
33130dc2366fSVenugopal Iyer
33140dc2366fSVenugopal Iyer } else {
33150dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, intrbytes, sz);
33160dc2366fSVenugopal Iyer SRS_RX_STAT_UPDATE(mac_srs, intrcnt, count);
3317da14cebeSEric Cheng }
3318da14cebeSEric Cheng
3319da14cebeSEric Cheng /*
3320da14cebeSEric Cheng 	 * If the SRS is already being processed; has been blanked;
3321da14cebeSEric Cheng 	 * can be processed by worker thread only; or the B/W limit
3322da14cebeSEric Cheng 	 * has been reached, then queue the chain and check if
3323da14cebeSEric Cheng 	 * the worker thread needs to be awakened.
3324da14cebeSEric Cheng */
3325da14cebeSEric Cheng if (mac_srs->srs_type & SRST_BW_CONTROL) {
3326da14cebeSEric Cheng mac_bw = mac_srs->srs_bw;
3327da14cebeSEric Cheng ASSERT(mac_bw != NULL);
3328da14cebeSEric Cheng mutex_enter(&mac_bw->mac_bw_lock);
3329da14cebeSEric Cheng mac_bw->mac_bw_intr += sz;
3330da14cebeSEric Cheng if (mac_bw->mac_bw_limit == 0) {
3331da14cebeSEric Cheng /* zero bandwidth: drop all */
33320dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_sdrops += count;
3333da14cebeSEric Cheng mac_bw->mac_bw_drop_bytes += sz;
3334da14cebeSEric Cheng mutex_exit(&mac_bw->mac_bw_lock);
3335da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3336c61a1653SRyan Zezeski mac_drop_chain(mp_chain, "Rx no bandwidth");
3337da14cebeSEric Cheng return;
3338da14cebeSEric Cheng } else {
3339da14cebeSEric Cheng if ((mac_bw->mac_bw_sz + sz) <=
3340da14cebeSEric Cheng mac_bw->mac_bw_drop_threshold) {
3341da14cebeSEric Cheng mutex_exit(&mac_bw->mac_bw_lock);
3342da14cebeSEric Cheng MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain,
3343da14cebeSEric Cheng tail, count, sz);
3344da14cebeSEric Cheng } else {
3345da14cebeSEric Cheng mp = mp_chain;
3346da14cebeSEric Cheng chain_sz = 0;
3347da14cebeSEric Cheng count1 = 0;
3348da14cebeSEric Cheng tail = NULL;
3349da14cebeSEric Cheng head = NULL;
3350da14cebeSEric Cheng while (mp != NULL) {
3351da14cebeSEric Cheng sz1 = msgdsize(mp);
3352da14cebeSEric Cheng if (mac_bw->mac_bw_sz + chain_sz + sz1 >
3353da14cebeSEric Cheng mac_bw->mac_bw_drop_threshold)
3354da14cebeSEric Cheng break;
3355da14cebeSEric Cheng chain_sz += sz1;
3356da14cebeSEric Cheng count1++;
3357da14cebeSEric Cheng tail = mp;
3358da14cebeSEric Cheng mp = mp->b_next;
3359da14cebeSEric Cheng }
3360da14cebeSEric Cheng mutex_exit(&mac_bw->mac_bw_lock);
3361da14cebeSEric Cheng if (tail != NULL) {
3362da14cebeSEric Cheng head = tail->b_next;
3363da14cebeSEric Cheng tail->b_next = NULL;
3364da14cebeSEric Cheng MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs,
3365da14cebeSEric Cheng mp_chain, tail, count1, chain_sz);
3366da14cebeSEric Cheng sz -= chain_sz;
3367da14cebeSEric Cheng count -= count1;
3368da14cebeSEric Cheng } else {
3369da14cebeSEric Cheng /* Can't pick up any */
3370da14cebeSEric Cheng head = mp_chain;
3371da14cebeSEric Cheng }
3372da14cebeSEric Cheng if (head != NULL) {
3373da14cebeSEric Cheng /* Drop any packet over the threshold */
33740dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_sdrops += count;
3375da14cebeSEric Cheng mutex_enter(&mac_bw->mac_bw_lock);
3376da14cebeSEric Cheng mac_bw->mac_bw_drop_bytes += sz;
3377da14cebeSEric Cheng mutex_exit(&mac_bw->mac_bw_lock);
3378da14cebeSEric Cheng freemsgchain(head);
3379da14cebeSEric Cheng }
3380da14cebeSEric Cheng }
3381da14cebeSEric Cheng MAC_SRS_WORKER_WAKEUP(mac_srs);
3382da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3383da14cebeSEric Cheng return;
3384da14cebeSEric Cheng }
3385da14cebeSEric Cheng }
3386da14cebeSEric Cheng
3387da14cebeSEric Cheng /*
3388da14cebeSEric Cheng * If the total number of packets queued in the SRS and
3389da14cebeSEric Cheng * its associated soft rings exceeds the max allowed,
3390da14cebeSEric Cheng * then drop the chain. If we are polling capable, this
3391da14cebeSEric Cheng * shouldn't be happening.
3392da14cebeSEric Cheng */
3393da14cebeSEric Cheng if (!(mac_srs->srs_type & SRST_BW_CONTROL) &&
3394da14cebeSEric Cheng (srs_rx->sr_poll_pkt_cnt > srs_rx->sr_hiwat)) {
3395da14cebeSEric Cheng mac_bw = mac_srs->srs_bw;
33960dc2366fSVenugopal Iyer srs_rx->sr_stat.mrs_sdrops += count;
3397da14cebeSEric Cheng mutex_enter(&mac_bw->mac_bw_lock);
3398da14cebeSEric Cheng mac_bw->mac_bw_drop_bytes += sz;
3399da14cebeSEric Cheng mutex_exit(&mac_bw->mac_bw_lock);
3400da14cebeSEric Cheng freemsgchain(mp_chain);
3401da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3402da14cebeSEric Cheng return;
3403da14cebeSEric Cheng }
3404da14cebeSEric Cheng
3405da14cebeSEric Cheng MAC_RX_SRS_ENQUEUE_CHAIN(mac_srs, mp_chain, tail, count, sz);
3406da14cebeSEric Cheng
3407da14cebeSEric Cheng if (!(mac_srs->srs_state & SRS_PROC)) {
3408da14cebeSEric Cheng /*
3409e2ea9c96SRobert Mustacchi * If we are coming via loopback, if we are not optimizing for
3410e2ea9c96SRobert Mustacchi * latency, or if our stack is running deep, we should signal
3411e2ea9c96SRobert Mustacchi * the worker thread.
3412da14cebeSEric Cheng */
3413cab61dfbSBryan Cantrill if (loopback || !(mac_srs->srs_state & SRS_LATENCY_OPT)) {
3414da14cebeSEric Cheng /*
3415da14cebeSEric Cheng 			 * For loopback, we need to let the worker take
3416da14cebeSEric Cheng * over as we don't want to continue in the same
3417da14cebeSEric Cheng * thread even if we can. This could lead to stack
3418da14cebeSEric Cheng * overflows and may also end up using
3419da14cebeSEric Cheng * resources (cpu) incorrectly.
3420da14cebeSEric Cheng */
3421da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3422cab61dfbSBryan Cantrill } else if (STACK_BIAS + (uintptr_t)getfp() -
3423cab61dfbSBryan Cantrill (uintptr_t)curthread->t_stkbase < mac_rx_srs_stack_needed) {
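			/*
			 * Remaining kernel stack is too shallow to safely
			 * process the chain inline, so punt to the worker
			 * thread. The toodeep counter is kept non-zero
			 * across wraparound so this event stays visible.
			 */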
3424cab61dfbSBryan Cantrill if (++mac_rx_srs_stack_toodeep == 0)
3425cab61dfbSBryan Cantrill mac_rx_srs_stack_toodeep = 1;
3426cab61dfbSBryan Cantrill cv_signal(&mac_srs->srs_async);
3427da14cebeSEric Cheng } else {
3428da14cebeSEric Cheng /*
3429da14cebeSEric Cheng * Seems like no one is processing the SRS and
3430da14cebeSEric Cheng * there is no backlog. We also inline process
3431da14cebeSEric Cheng 			 * our packet if it's a single packet in the non
3432da14cebeSEric Cheng 			 * latency optimized case (in the latency optimized
3433da14cebeSEric Cheng 			 * case, we inline process chains of any size).
3434da14cebeSEric Cheng */
3435da14cebeSEric Cheng mac_srs->srs_drain_func(mac_srs, SRS_PROC_FAST);
3436da14cebeSEric Cheng }
3437da14cebeSEric Cheng }
3438da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3439da14cebeSEric Cheng }
3440da14cebeSEric Cheng
3441da14cebeSEric Cheng /* TX SIDE ROUTINES (RUNTIME) */
3442da14cebeSEric Cheng
3443da14cebeSEric Cheng /*
3444da14cebeSEric Cheng * mac_tx_srs_no_desc
3445da14cebeSEric Cheng *
3446da14cebeSEric Cheng  * This routine is called in Tx single ring default mode
3447da14cebeSEric Cheng  * when the Tx ring runs out of descs.
3448da14cebeSEric Cheng */
3449da14cebeSEric Cheng mac_tx_cookie_t
3450da14cebeSEric Cheng mac_tx_srs_no_desc(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3451da14cebeSEric Cheng uint16_t flag, mblk_t **ret_mp)
3452da14cebeSEric Cheng {
345336f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3454da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3455da14cebeSEric Cheng boolean_t wakeup_worker = B_TRUE;
3456da14cebeSEric Cheng uint32_t tx_mode = srs_tx->st_mode;
3457da14cebeSEric Cheng int cnt, sz;
3458da14cebeSEric Cheng mblk_t *tail;
3459da14cebeSEric Cheng
3460da14cebeSEric Cheng ASSERT(tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_BW);
3461da14cebeSEric Cheng if (flag & MAC_DROP_ON_NO_DESC) {
3462c61a1653SRyan Zezeski MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie,
3463c61a1653SRyan Zezeski "Tx no desc");
3464da14cebeSEric Cheng } else {
3465da14cebeSEric Cheng if (mac_srs->srs_first != NULL)
3466da14cebeSEric Cheng wakeup_worker = B_FALSE;
3467da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3468da14cebeSEric Cheng if (flag & MAC_TX_NO_ENQUEUE) {
3469da14cebeSEric Cheng /*
3470da14cebeSEric Cheng * If TX_QUEUED is not set, queue the
3471da14cebeSEric Cheng * packet and let mac_tx_srs_drain()
3472da14cebeSEric Cheng * set the TX_BLOCKED bit for the
3473da14cebeSEric Cheng * reasons explained above. Otherwise,
3474da14cebeSEric Cheng * return the mblks.
3475da14cebeSEric Cheng */
3476da14cebeSEric Cheng if (wakeup_worker) {
3477da14cebeSEric Cheng MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3478da14cebeSEric Cheng mp_chain, tail, cnt, sz);
3479da14cebeSEric Cheng } else {
3480da14cebeSEric Cheng MAC_TX_SET_NO_ENQUEUE(mac_srs,
3481da14cebeSEric Cheng mp_chain, ret_mp, cookie);
3482da14cebeSEric Cheng }
3483da14cebeSEric Cheng } else {
3484da14cebeSEric Cheng MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3485da14cebeSEric Cheng tail, cnt, sz, cookie);
3486da14cebeSEric Cheng }
3487da14cebeSEric Cheng if (wakeup_worker)
3488da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3489da14cebeSEric Cheng }
3490da14cebeSEric Cheng return (cookie);
3491da14cebeSEric Cheng }
3492da14cebeSEric Cheng
3493da14cebeSEric Cheng /*
3494da14cebeSEric Cheng * mac_tx_srs_enqueue
3495da14cebeSEric Cheng *
3496da14cebeSEric Cheng * This routine is called when Tx SRS is operating in either serializer
3497da14cebeSEric Cheng * or bandwidth mode. In serializer mode, a packet will get enqueued
3498da14cebeSEric Cheng * when a thread cannot enter SRS exclusively. In bandwidth mode,
3499da14cebeSEric Cheng  * packets get queued if the allowed byte-count limit for a tick is
3500da14cebeSEric Cheng  * exceeded. The action that gets taken when MAC_DROP_ON_NO_DESC and
3501da14cebeSEric Cheng  * MAC_TX_NO_ENQUEUE is set is different than when operating in either
3502da14cebeSEric Cheng * the default mode or fanout mode. Here packets get dropped or
3503da14cebeSEric Cheng * returned back to the caller only after hi-watermark worth of data
3504da14cebeSEric Cheng * is queued.
3505da14cebeSEric Cheng */
3506da14cebeSEric Cheng static mac_tx_cookie_t
3507da14cebeSEric Cheng mac_tx_srs_enqueue(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3508da14cebeSEric Cheng uint16_t flag, uintptr_t fanout_hint, mblk_t **ret_mp)
3509da14cebeSEric Cheng {
351036f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3511da14cebeSEric Cheng int cnt, sz;
3512da14cebeSEric Cheng mblk_t *tail;
3513da14cebeSEric Cheng boolean_t wakeup_worker = B_TRUE;
3514da14cebeSEric Cheng
3515ae6aa22aSVenugopal Iyer /*
3516ae6aa22aSVenugopal Iyer * Ignore fanout hint if we don't have multiple tx rings.
3517ae6aa22aSVenugopal Iyer */
35180dc2366fSVenugopal Iyer if (!MAC_TX_SOFT_RINGS(mac_srs))
3519ae6aa22aSVenugopal Iyer fanout_hint = 0;
3520ae6aa22aSVenugopal Iyer
3521da14cebeSEric Cheng if (mac_srs->srs_first != NULL)
3522da14cebeSEric Cheng wakeup_worker = B_FALSE;
3523da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3524da14cebeSEric Cheng if (flag & MAC_DROP_ON_NO_DESC) {
3525da14cebeSEric Cheng if (mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) {
3526c61a1653SRyan Zezeski MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie,
3527c61a1653SRyan Zezeski "Tx SRS hiwat");
3528da14cebeSEric Cheng } else {
3529da14cebeSEric Cheng MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3530da14cebeSEric Cheng mp_chain, tail, cnt, sz);
3531da14cebeSEric Cheng }
3532da14cebeSEric Cheng } else if (flag & MAC_TX_NO_ENQUEUE) {
3533da14cebeSEric Cheng if ((mac_srs->srs_count > mac_srs->srs_tx.st_hiwat) ||
3534da14cebeSEric Cheng (mac_srs->srs_state & SRS_TX_WAKEUP_CLIENT)) {
3535da14cebeSEric Cheng MAC_TX_SET_NO_ENQUEUE(mac_srs, mp_chain,
3536da14cebeSEric Cheng ret_mp, cookie);
3537da14cebeSEric Cheng } else {
3538da14cebeSEric Cheng mp_chain->b_prev = (mblk_t *)fanout_hint;
3539da14cebeSEric Cheng MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3540da14cebeSEric Cheng mp_chain, tail, cnt, sz);
3541da14cebeSEric Cheng }
3542da14cebeSEric Cheng } else {
3543da14cebeSEric Cheng /*
3544da14cebeSEric Cheng 		 * If BW_ENFORCED is set, just enqueue the
3545da14cebeSEric Cheng * packet. srs_worker will drain it at the
3546da14cebeSEric Cheng * prescribed rate. Before enqueueing, save
3547da14cebeSEric Cheng * the fanout hint.
3548da14cebeSEric Cheng */
3549da14cebeSEric Cheng mp_chain->b_prev = (mblk_t *)fanout_hint;
3550da14cebeSEric Cheng MAC_TX_SRS_TEST_HIWAT(mac_srs, mp_chain,
3551da14cebeSEric Cheng tail, cnt, sz, cookie);
3552da14cebeSEric Cheng }
3553da14cebeSEric Cheng if (wakeup_worker)
3554da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3555da14cebeSEric Cheng return (cookie);
3556da14cebeSEric Cheng }
3557da14cebeSEric Cheng
3558da14cebeSEric Cheng /*
35590dc2366fSVenugopal Iyer * There are seven tx modes:
3560da14cebeSEric Cheng *
3561da14cebeSEric Cheng * 1) Default mode (SRS_TX_DEFAULT)
3562da14cebeSEric Cheng * 2) Serialization mode (SRS_TX_SERIALIZE)
3563da14cebeSEric Cheng * 3) Fanout mode (SRS_TX_FANOUT)
3564da14cebeSEric Cheng  * 4) Bandwidth mode (SRS_TX_BW)
3565da14cebeSEric Cheng * 5) Fanout and Bandwidth mode (SRS_TX_BW_FANOUT)
35660dc2366fSVenugopal Iyer * 6) aggr Tx mode (SRS_TX_AGGR)
35670dc2366fSVenugopal Iyer * 7) aggr Tx bw mode (SRS_TX_BW_AGGR)
3568da14cebeSEric Cheng *
3569da14cebeSEric Cheng * The tx mode in which an SRS operates is decided in mac_tx_srs_setup()
3570da14cebeSEric Cheng * based on the number of Tx rings requested for an SRS and whether
3571da14cebeSEric Cheng * bandwidth control is requested or not.
3572da14cebeSEric Cheng *
35730dc2366fSVenugopal Iyer * The default mode (i.e., no fanout/no bandwidth) is used when the
35740dc2366fSVenugopal Iyer * underlying NIC does not have Tx rings or just one Tx ring. In this mode,
35750dc2366fSVenugopal Iyer * the SRS acts as a pass-thru. Packets will go directly to mac_tx_send().
35760dc2366fSVenugopal Iyer * When the underlying Tx ring runs out of Tx descs, it starts queueing up
35770dc2366fSVenugopal Iyer * packets in SRS. When flow-control is relieved, the srs_worker drains
35780dc2366fSVenugopal Iyer * the queued packets and informs blocked clients to restart sending
35790dc2366fSVenugopal Iyer * packets.
3580da14cebeSEric Cheng *
35810dc2366fSVenugopal Iyer * In the SRS_TX_SERIALIZE mode, all calls to mac_tx() are serialized. This
35820dc2366fSVenugopal Iyer * mode is used when the link has no Tx rings or only one Tx ring.
3583da14cebeSEric Cheng *
3584da14cebeSEric Cheng * In the SRS_TX_FANOUT mode, packets will be fanned out to multiple
3585da14cebeSEric Cheng * Tx rings. Each Tx ring will have a soft ring associated with it.
3586da14cebeSEric Cheng  * These soft rings will be hung off the Tx SRS. Queueing, if it happens
3587da14cebeSEric Cheng  * due to lack of Tx descs, will be in the individual soft ring (and not
3588da14cebeSEric Cheng  * the SRS) associated with the Tx ring.
3589da14cebeSEric Cheng *
3590da14cebeSEric Cheng * In the TX_BW mode, tx srs will allow packets to go down to Tx ring
3591da14cebeSEric Cheng * only if bw is available. Otherwise the packets will be queued in
3592da14cebeSEric Cheng * SRS. If fanout to multiple Tx rings is configured, the packets will
3593da14cebeSEric Cheng * be fanned out among the soft rings associated with the Tx rings.
3594da14cebeSEric Cheng *
35950dc2366fSVenugopal Iyer * In SRS_TX_AGGR mode, mac_tx_aggr_mode() routine is called. This routine
35960dc2366fSVenugopal Iyer * invokes an aggr function, aggr_find_tx_ring(), to find a pseudo Tx ring
35970dc2366fSVenugopal Iyer * belonging to a port on which the packet has to be sent. Aggr will
35980dc2366fSVenugopal Iyer * always have a pseudo Tx ring associated with it even when it is an
35990dc2366fSVenugopal Iyer * aggregation over a single NIC that has no Tx rings. Even in such a
36000dc2366fSVenugopal Iyer * case, the single pseudo Tx ring will have a soft ring associated with
36010dc2366fSVenugopal Iyer * it and the soft ring will hang off the SRS.
36020dc2366fSVenugopal Iyer *
36030dc2366fSVenugopal Iyer * If a bandwidth is specified for an aggr, SRS_TX_BW_AGGR mode is used.
36040dc2366fSVenugopal Iyer * In this mode, the bandwidth is first applied on the outgoing packets
36050dc2366fSVenugopal Iyer  * and later the mac_tx_aggr_mode() function is called to send the packet
36060dc2366fSVenugopal Iyer * of one of the pseudo Tx rings.
36070dc2366fSVenugopal Iyer *
3608da14cebeSEric Cheng  * Three flags are used in srs_state for indicating flow control
3609da14cebeSEric Cheng  * conditions: SRS_TX_BLOCKED, SRS_TX_HIWAT, SRS_TX_WAKEUP_CLIENT.
3610da14cebeSEric Cheng * SRS_TX_BLOCKED indicates out of Tx descs. SRS expects a wakeup from the
3611da14cebeSEric Cheng * driver below.
3612da14cebeSEric Cheng * SRS_TX_HIWAT indicates packet count enqueued in Tx SRS exceeded Tx hiwat
3613da14cebeSEric Cheng * and flow-control pressure is applied back to clients. The clients expect
3614da14cebeSEric Cheng * wakeup when flow-control is relieved.
3615da14cebeSEric Cheng  * SRS_TX_WAKEUP_CLIENT gets set when (flag == MAC_TX_NO_ENQUEUE) and mblks
3616da14cebeSEric Cheng  * get returned back to the client either due to lack of Tx descs or due to
3617da14cebeSEric Cheng  * bw control reasons. The clients expect a wakeup when the condition is relieved.
3618da14cebeSEric Cheng *
3619da14cebeSEric Cheng * The fourth argument to mac_tx() is the flag. Normally it will be 0 but
3620da14cebeSEric Cheng * some clients set the following values too: MAC_DROP_ON_NO_DESC,
3621da14cebeSEric Cheng * MAC_TX_NO_ENQUEUE
3622da14cebeSEric Cheng * Mac clients that do not want packets to be enqueued in the mac layer set
3623da14cebeSEric Cheng * MAC_DROP_ON_NO_DESC value. The packets won't be queued in the Tx SRS or
3624da14cebeSEric Cheng * Tx soft rings but instead get dropped when the NIC runs out of desc. The
3625da14cebeSEric Cheng * behaviour of this flag is different when the Tx is running in serializer
3626da14cebeSEric Cheng  * or bandwidth mode. Under these (serializer, bandwidth) modes, packets
3627da14cebeSEric Cheng  * get dropped when the Tx high watermark is reached.
3628da14cebeSEric Cheng * There are some mac clients like vsw, aggr that want the mblks to be
3629da14cebeSEric Cheng * returned back to clients instead of being queued in Tx SRS (or Tx soft
3630da14cebeSEric Cheng * rings) under flow-control (i.e., out of desc or exceeding bw limits)
3631da14cebeSEric Cheng * conditions. These clients call mac_tx() with MAC_TX_NO_ENQUEUE flag set.
3632da14cebeSEric Cheng * In the default and Tx fanout mode, the un-transmitted mblks will be
3633da14cebeSEric Cheng * returned back to the clients when the driver runs out of Tx descs.
3634da14cebeSEric Cheng * SRS_TX_WAKEUP_CLIENT (or S_RING_WAKEUP_CLIENT) will be set in SRS (or
3635da14cebeSEric Cheng * soft ring) so that the clients can be woken up when Tx desc become
3636da14cebeSEric Cheng  * available. When running in serializer or bandwidth mode,
3637da14cebeSEric Cheng * SRS_TX_WAKEUP_CLIENT will be set when tx hi-watermark is reached.
3638da14cebeSEric Cheng */
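/*
 * As a rough sketch of the dispatch implemented by mac_tx_mode_list[]
 * (see mac_tx_get_func() below): SRS_TX_DEFAULT maps to
 * mac_tx_single_ring_mode(), SRS_TX_SERIALIZE to mac_tx_serializer_mode(),
 * SRS_TX_FANOUT to mac_tx_fanout_mode(), SRS_TX_AGGR to mac_tx_aggr_mode(),
 * and the bandwidth modes (SRS_TX_BW, SRS_TX_BW_FANOUT, SRS_TX_BW_AGGR)
 * to mac_tx_bw_mode(), which in turn hands off to the fanout or aggr
 * paths once the bandwidth check has passed.
 */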
3639da14cebeSEric Cheng
3640da14cebeSEric Cheng mac_tx_func_t
3641da14cebeSEric Cheng mac_tx_get_func(uint32_t mode)
3642da14cebeSEric Cheng {
3643da14cebeSEric Cheng return (mac_tx_mode_list[mode].mac_tx_func);
3644da14cebeSEric Cheng }
3645da14cebeSEric Cheng
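/*
 * mac_tx_single_ring_mode
 *
 * The default Tx mode: pass the chain straight to mac_tx_send(). If the
 * driver is flow-controlled (SRS_ENQUEUED already set) or mac_tx_send()
 * returns an untransmitted remainder, mac_tx_srs_no_desc() queues, drops,
 * or hands back the mblks according to the caller's flags.
 */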
3646da14cebeSEric Cheng /* ARGSUSED */
3647da14cebeSEric Cheng static mac_tx_cookie_t
3648da14cebeSEric Cheng mac_tx_single_ring_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3649da14cebeSEric Cheng uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3650da14cebeSEric Cheng {
3651da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3652da14cebeSEric Cheng mac_tx_stats_t stats;
365336f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3654da14cebeSEric Cheng
3655da14cebeSEric Cheng ASSERT(srs_tx->st_mode == SRS_TX_DEFAULT);
3656da14cebeSEric Cheng
3657da14cebeSEric Cheng /* Regular case with a single Tx ring */
3658da14cebeSEric Cheng /*
3659da14cebeSEric Cheng * SRS_TX_BLOCKED is set when underlying NIC runs
3660da14cebeSEric Cheng * out of Tx descs and messages start getting
3661da14cebeSEric Cheng * queued. It won't get reset until
3662da14cebeSEric Cheng * tx_srs_drain() completely drains out the
3663da14cebeSEric Cheng * messages.
3664da14cebeSEric Cheng */
3665da14cebeSEric Cheng if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3666da14cebeSEric Cheng /* Tx descs/resources not available */
3667da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3668da14cebeSEric Cheng if ((mac_srs->srs_state & SRS_ENQUEUED) != 0) {
3669da14cebeSEric Cheng cookie = mac_tx_srs_no_desc(mac_srs, mp_chain,
3670da14cebeSEric Cheng flag, ret_mp);
3671da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3672da14cebeSEric Cheng return (cookie);
3673da14cebeSEric Cheng }
3674da14cebeSEric Cheng /*
3675da14cebeSEric Cheng * While we were computing mblk count, the
3676da14cebeSEric Cheng * flow control condition got relieved.
3677da14cebeSEric Cheng * Continue with the transmission.
3678da14cebeSEric Cheng */
3679da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3680da14cebeSEric Cheng }
3681da14cebeSEric Cheng
3682da14cebeSEric Cheng mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
36830dc2366fSVenugopal Iyer mp_chain, &stats);
3684da14cebeSEric Cheng
3685da14cebeSEric Cheng /*
3686da14cebeSEric Cheng * Multiple threads could be here sending packets.
3687da14cebeSEric Cheng * Under such conditions, it is not possible to
3688da14cebeSEric Cheng 	 * atomically set the SRS_TX_BLOCKED bit to indicate
3689da14cebeSEric Cheng * out of tx desc condition. To atomically set
3690da14cebeSEric Cheng * this, we queue the returned packet and do
3691da14cebeSEric Cheng * the setting of SRS_TX_BLOCKED in
3692da14cebeSEric Cheng * mac_tx_srs_drain().
3693da14cebeSEric Cheng */
3694da14cebeSEric Cheng if (mp_chain != NULL) {
3695da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3696da14cebeSEric Cheng cookie = mac_tx_srs_no_desc(mac_srs, mp_chain, flag, ret_mp);
3697da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3698da14cebeSEric Cheng return (cookie);
3699da14cebeSEric Cheng }
37000dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
3701da14cebeSEric Cheng
370236f99a58SToomas Soome return (0);
3703da14cebeSEric Cheng }
3704da14cebeSEric Cheng
3705da14cebeSEric Cheng /*
3706da14cebeSEric Cheng * mac_tx_serialize_mode
3707da14cebeSEric Cheng *
3708da14cebeSEric Cheng * This is an experimental mode implemented as per the request of PAE.
3709da14cebeSEric Cheng * In this mode, all callers attempting to send a packet to the NIC
3710da14cebeSEric Cheng * will get serialized. Only one thread at any time will access the
3711da14cebeSEric Cheng * NIC to send the packet out.
3712da14cebeSEric Cheng */
3713da14cebeSEric Cheng /* ARGSUSED */
3714da14cebeSEric Cheng static mac_tx_cookie_t
3715da14cebeSEric Cheng mac_tx_serializer_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3716da14cebeSEric Cheng uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3717da14cebeSEric Cheng {
3718da14cebeSEric Cheng mac_tx_stats_t stats;
371936f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3720da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3721da14cebeSEric Cheng
3722da14cebeSEric Cheng /* Single ring, serialize below */
3723da14cebeSEric Cheng ASSERT(srs_tx->st_mode == SRS_TX_SERIALIZE);
3724da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3725da14cebeSEric Cheng if ((mac_srs->srs_first != NULL) ||
3726da14cebeSEric Cheng (mac_srs->srs_state & SRS_PROC)) {
3727da14cebeSEric Cheng /*
3728da14cebeSEric Cheng * In serialization mode, queue all packets until
3729da14cebeSEric Cheng * TX_HIWAT is set.
3730da14cebeSEric Cheng * If drop bit is set, drop if TX_HIWAT is set.
3731da14cebeSEric Cheng * If no_enqueue is set, still enqueue until hiwat
3732da14cebeSEric Cheng * is set and return mblks after TX_HIWAT is set.
3733da14cebeSEric Cheng */
3734da14cebeSEric Cheng cookie = mac_tx_srs_enqueue(mac_srs, mp_chain,
373536f99a58SToomas Soome flag, 0, ret_mp);
3736da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3737da14cebeSEric Cheng return (cookie);
3738da14cebeSEric Cheng }
3739da14cebeSEric Cheng /*
3740da14cebeSEric Cheng * No packets queued, nothing on proc and no flow
3741da14cebeSEric Cheng * control condition. Fast-path, ok. Do inline
3742da14cebeSEric Cheng * processing.
3743da14cebeSEric Cheng */
3744da14cebeSEric Cheng mac_srs->srs_state |= SRS_PROC;
3745da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3746da14cebeSEric Cheng
3747da14cebeSEric Cheng mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
37480dc2366fSVenugopal Iyer mp_chain, &stats);
3749da14cebeSEric Cheng
3750da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3751da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_PROC;
3752da14cebeSEric Cheng if (mp_chain != NULL) {
3753da14cebeSEric Cheng cookie = mac_tx_srs_enqueue(mac_srs,
375436f99a58SToomas Soome mp_chain, flag, 0, ret_mp);
3755da14cebeSEric Cheng }
3756da14cebeSEric Cheng if (mac_srs->srs_first != NULL) {
3757da14cebeSEric Cheng /*
3758da14cebeSEric Cheng 		 * We processed our packet inline and new
3759da14cebeSEric Cheng 		 * packet(s) got queued while we were
3760da14cebeSEric Cheng 		 * processing. Wake up the srs worker.
3761da14cebeSEric Cheng */
3762da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3763da14cebeSEric Cheng }
3764da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3765da14cebeSEric Cheng
376636f99a58SToomas Soome if (cookie == 0)
37670dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
3768da14cebeSEric Cheng
3769da14cebeSEric Cheng return (cookie);
3770da14cebeSEric Cheng }
3771da14cebeSEric Cheng
3772da14cebeSEric Cheng /*
3773da14cebeSEric Cheng * mac_tx_fanout_mode
3774da14cebeSEric Cheng *
3775da14cebeSEric Cheng * In this mode, the SRS will have access to multiple Tx rings to send
3776da14cebeSEric Cheng * the packet out. The fanout hint that is passed as an argument is
3777da14cebeSEric Cheng * used to find an appropriate ring to fanout the traffic. Each Tx
3778da14cebeSEric Cheng * ring, in turn, will have a soft ring associated with it. If a Tx
3779da14cebeSEric Cheng * ring runs out of Tx desc's the returned packet will be queued in
3780da14cebeSEric Cheng * the soft ring associated with that Tx ring. The srs itself will not
3781da14cebeSEric Cheng * queue any packets.
3782da14cebeSEric Cheng */
3783ae6aa22aSVenugopal Iyer
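/*
 * Select a Tx soft ring from "hash" and hand "chain" to it; relies on
 * hash, index, softring, cookie, flag and ret_mp being in the caller's
 * scope.
 */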
3784ae6aa22aSVenugopal Iyer #define MAC_TX_SOFT_RING_PROCESS(chain) { \
37850dc2366fSVenugopal Iyer index = COMPUTE_INDEX(hash, mac_srs->srs_tx_ring_count), \
37860dc2366fSVenugopal Iyer softring = mac_srs->srs_tx_soft_rings[index]; \
3787ae6aa22aSVenugopal Iyer cookie = mac_tx_soft_ring_process(softring, chain, flag, ret_mp); \
3788ae6aa22aSVenugopal Iyer DTRACE_PROBE2(tx__fanout, uint64_t, hash, uint_t, index); \
3789ae6aa22aSVenugopal Iyer }
3790ae6aa22aSVenugopal Iyer
3791da14cebeSEric Cheng static mac_tx_cookie_t
3792da14cebeSEric Cheng mac_tx_fanout_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3793da14cebeSEric Cheng uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3794da14cebeSEric Cheng {
3795da14cebeSEric Cheng mac_soft_ring_t *softring;
3796ae6aa22aSVenugopal Iyer uint64_t hash;
3797ae6aa22aSVenugopal Iyer uint_t index;
379836f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3799da14cebeSEric Cheng
38000dc2366fSVenugopal Iyer ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
38010dc2366fSVenugopal Iyer mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT);
3802ae6aa22aSVenugopal Iyer if (fanout_hint != 0) {
3803ae6aa22aSVenugopal Iyer /*
3804ae6aa22aSVenugopal Iyer 		 * The hint is specified by the caller; simply pass the
3805ae6aa22aSVenugopal Iyer * whole chain to the soft ring.
3806ae6aa22aSVenugopal Iyer */
3807da14cebeSEric Cheng hash = HASH_HINT(fanout_hint);
3808ae6aa22aSVenugopal Iyer MAC_TX_SOFT_RING_PROCESS(mp_chain);
3809ae6aa22aSVenugopal Iyer } else {
3810ae6aa22aSVenugopal Iyer mblk_t *last_mp, *cur_mp, *sub_chain;
3811ae6aa22aSVenugopal Iyer uint64_t last_hash = 0;
3812ae6aa22aSVenugopal Iyer uint_t media = mac_srs->srs_mcip->mci_mip->mi_info.mi_media;
3813ae6aa22aSVenugopal Iyer
3814ae6aa22aSVenugopal Iyer /*
3815ae6aa22aSVenugopal Iyer * Compute the hash from the contents (headers) of the
3816ae6aa22aSVenugopal Iyer * packets of the mblk chain. Split the chains into
3817ae6aa22aSVenugopal Iyer * subchains of the same conversation.
3818ae6aa22aSVenugopal Iyer *
3819ae6aa22aSVenugopal Iyer 		 * Since there may be more than one ring used for
3820ae6aa22aSVenugopal Iyer 		 * sub-chains of the same call, and since the caller
3821ae6aa22aSVenugopal Iyer 		 * does not maintain per-conversation state (it
3822ae6aa22aSVenugopal Iyer 		 * passed a zero hint), unsent subchains will be
3823ae6aa22aSVenugopal Iyer 		 * dropped.
3824ae6aa22aSVenugopal Iyer */
3825ae6aa22aSVenugopal Iyer
3826ae6aa22aSVenugopal Iyer flag |= MAC_DROP_ON_NO_DESC;
3827ae6aa22aSVenugopal Iyer ret_mp = NULL;
3828ae6aa22aSVenugopal Iyer
3829ae6aa22aSVenugopal Iyer ASSERT(ret_mp == NULL);
3830ae6aa22aSVenugopal Iyer
3831ae6aa22aSVenugopal Iyer sub_chain = NULL;
3832ae6aa22aSVenugopal Iyer last_mp = NULL;
3833ae6aa22aSVenugopal Iyer
3834ae6aa22aSVenugopal Iyer for (cur_mp = mp_chain; cur_mp != NULL;
3835ae6aa22aSVenugopal Iyer cur_mp = cur_mp->b_next) {
3836ae6aa22aSVenugopal Iyer hash = mac_pkt_hash(media, cur_mp, MAC_PKT_HASH_L4,
3837ae6aa22aSVenugopal Iyer B_TRUE);
3838ae6aa22aSVenugopal Iyer if (last_hash != 0 && hash != last_hash) {
3839ae6aa22aSVenugopal Iyer /*
3840ae6aa22aSVenugopal Iyer * Starting a different subchain, send current
3841ae6aa22aSVenugopal Iyer * chain out.
3842ae6aa22aSVenugopal Iyer */
3843ae6aa22aSVenugopal Iyer ASSERT(last_mp != NULL);
3844ae6aa22aSVenugopal Iyer last_mp->b_next = NULL;
3845ae6aa22aSVenugopal Iyer MAC_TX_SOFT_RING_PROCESS(sub_chain);
3846ae6aa22aSVenugopal Iyer sub_chain = NULL;
3847ae6aa22aSVenugopal Iyer }
3848ae6aa22aSVenugopal Iyer
3849ae6aa22aSVenugopal Iyer /* add packet to subchain */
3850ae6aa22aSVenugopal Iyer if (sub_chain == NULL)
3851ae6aa22aSVenugopal Iyer sub_chain = cur_mp;
3852ae6aa22aSVenugopal Iyer last_mp = cur_mp;
3853ae6aa22aSVenugopal Iyer last_hash = hash;
3854ae6aa22aSVenugopal Iyer }
3855ae6aa22aSVenugopal Iyer
3856ae6aa22aSVenugopal Iyer if (sub_chain != NULL) {
3857ae6aa22aSVenugopal Iyer /* send last subchain */
3858ae6aa22aSVenugopal Iyer ASSERT(last_mp != NULL);
3859ae6aa22aSVenugopal Iyer last_mp->b_next = NULL;
3860ae6aa22aSVenugopal Iyer MAC_TX_SOFT_RING_PROCESS(sub_chain);
3861ae6aa22aSVenugopal Iyer }
3862ae6aa22aSVenugopal Iyer
386336f99a58SToomas Soome cookie = 0;
3864ae6aa22aSVenugopal Iyer }
3865ae6aa22aSVenugopal Iyer
3866ae6aa22aSVenugopal Iyer return (cookie);
3867da14cebeSEric Cheng }
3868da14cebeSEric Cheng
3869da14cebeSEric Cheng /*
3870da14cebeSEric Cheng * mac_tx_bw_mode
3871da14cebeSEric Cheng *
3872da14cebeSEric Cheng * In the bandwidth mode, Tx srs will allow packets to go down to Tx ring
3873da14cebeSEric Cheng * only if bw is available. Otherwise the packets will be queued in
3874da14cebeSEric Cheng * SRS. If the SRS has multiple Tx rings, then packets will get fanned
3875da14cebeSEric Cheng  * out among the Tx rings.
3876da14cebeSEric Cheng */
3877da14cebeSEric Cheng static mac_tx_cookie_t
3878da14cebeSEric Cheng mac_tx_bw_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3879da14cebeSEric Cheng uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3880da14cebeSEric Cheng {
3881da14cebeSEric Cheng int cnt, sz;
3882da14cebeSEric Cheng mblk_t *tail;
388336f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
3884da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
3885d3d50737SRafael Vanoni clock_t now;
3886da14cebeSEric Cheng
3887da14cebeSEric Cheng ASSERT(TX_BANDWIDTH_MODE(mac_srs));
3888da14cebeSEric Cheng ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
3889da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3890da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_limit == 0) {
3891ae6aa22aSVenugopal Iyer /*
3892ae6aa22aSVenugopal Iyer * zero bandwidth, no traffic is sent: drop the packets,
3893ae6aa22aSVenugopal Iyer * or return the whole chain if the caller requests all
3894ae6aa22aSVenugopal Iyer * unsent packets back.
3895ae6aa22aSVenugopal Iyer */
3896ae6aa22aSVenugopal Iyer if (flag & MAC_TX_NO_ENQUEUE) {
3897ae6aa22aSVenugopal Iyer cookie = (mac_tx_cookie_t)mac_srs;
3898ae6aa22aSVenugopal Iyer *ret_mp = mp_chain;
3899ae6aa22aSVenugopal Iyer } else {
3900c61a1653SRyan Zezeski MAC_TX_SRS_DROP_MESSAGE(mac_srs, mp_chain, cookie,
3901c61a1653SRyan Zezeski "Tx no bandwidth");
3902ae6aa22aSVenugopal Iyer }
3903da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3904da14cebeSEric Cheng return (cookie);
3905da14cebeSEric Cheng } else if ((mac_srs->srs_first != NULL) ||
3906da14cebeSEric Cheng (mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED)) {
3907da14cebeSEric Cheng cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3908da14cebeSEric Cheng fanout_hint, ret_mp);
3909da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3910da14cebeSEric Cheng return (cookie);
3911da14cebeSEric Cheng }
3912da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
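	/*
	 * Bandwidth usage is tracked per clock tick: on the first chain of
	 * a new tick the usage counter resets; once mac_bw_used exceeds
	 * mac_bw_limit within the tick, SRS_BW_ENFORCED is set and further
	 * chains are queued for the worker until the enforcement is lifted
	 * on a later tick.
	 */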
3913d3d50737SRafael Vanoni now = ddi_get_lbolt();
3914d3d50737SRafael Vanoni if (mac_srs->srs_bw->mac_bw_curr_time != now) {
3915d3d50737SRafael Vanoni mac_srs->srs_bw->mac_bw_curr_time = now;
3916da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = 0;
3917da14cebeSEric Cheng } else if (mac_srs->srs_bw->mac_bw_used >
3918da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit) {
3919da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
3920da14cebeSEric Cheng MAC_TX_SRS_ENQUEUE_CHAIN(mac_srs,
3921da14cebeSEric Cheng mp_chain, tail, cnt, sz);
3922da14cebeSEric Cheng /*
3923da14cebeSEric Cheng * Wakeup worker thread. Note that worker
3924da14cebeSEric Cheng * thread has to be woken up so that it
3925da14cebeSEric Cheng * can fire up the timer to be woken up
3926da14cebeSEric Cheng * on the next tick. Also once
3927da14cebeSEric Cheng * BW_ENFORCED is set, it can only be
3928da14cebeSEric Cheng * reset by srs_worker thread. Until then
3929da14cebeSEric Cheng * all packets will get queued up in SRS
3930da14cebeSEric Cheng 		 * and hence this code path won't be
3931da14cebeSEric Cheng * entered until BW_ENFORCED is reset.
3932da14cebeSEric Cheng */
3933da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
3934da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3935da14cebeSEric Cheng return (cookie);
3936da14cebeSEric Cheng }
3937da14cebeSEric Cheng
3938da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used += sz;
3939da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3940da14cebeSEric Cheng
3941da14cebeSEric Cheng if (srs_tx->st_mode == SRS_TX_BW_FANOUT) {
3942da14cebeSEric Cheng mac_soft_ring_t *softring;
3943da14cebeSEric Cheng uint_t indx, hash;
3944da14cebeSEric Cheng
3945da14cebeSEric Cheng hash = HASH_HINT(fanout_hint);
3946da14cebeSEric Cheng indx = COMPUTE_INDEX(hash,
39470dc2366fSVenugopal Iyer mac_srs->srs_tx_ring_count);
39480dc2366fSVenugopal Iyer softring = mac_srs->srs_tx_soft_rings[indx];
3949da14cebeSEric Cheng return (mac_tx_soft_ring_process(softring, mp_chain, flag,
3950da14cebeSEric Cheng ret_mp));
39510dc2366fSVenugopal Iyer } else if (srs_tx->st_mode == SRS_TX_BW_AGGR) {
39520dc2366fSVenugopal Iyer return (mac_tx_aggr_mode(mac_srs, mp_chain,
39530dc2366fSVenugopal Iyer fanout_hint, flag, ret_mp));
3954da14cebeSEric Cheng } else {
3955da14cebeSEric Cheng mac_tx_stats_t stats;
3956da14cebeSEric Cheng
3957da14cebeSEric Cheng mp_chain = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
39580dc2366fSVenugopal Iyer mp_chain, &stats);
3959da14cebeSEric Cheng
3960da14cebeSEric Cheng if (mp_chain != NULL) {
3961da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
3962da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
3963da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_used > sz)
3964da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used -= sz;
3965da14cebeSEric Cheng else
3966da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = 0;
3967da14cebeSEric Cheng cookie = mac_tx_srs_enqueue(mac_srs, mp_chain, flag,
3968da14cebeSEric Cheng fanout_hint, ret_mp);
3969da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
3970da14cebeSEric Cheng return (cookie);
3971da14cebeSEric Cheng }
39720dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
3973da14cebeSEric Cheng
397436f99a58SToomas Soome return (0);
3975da14cebeSEric Cheng }
3976da14cebeSEric Cheng }
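
/*
 * Illustrative sketch (not part of the driver): the bandwidth check
 * above amounts to a per-lbolt-tick byte budget. The types below are
 * simplified stand-ins; in the real code the budget lives in
 * mac_srs->srs_bw and mac_bw_limit is maintained by the bandwidth
 * resource controls.
 */
#if 0
typedef struct {
	long	bw_curr_time;	/* tick the current budget applies to */
	size_t	bw_used;	/* bytes charged against this tick */
	size_t	bw_limit;	/* byte budget per tick */
} bw_sketch_t;

/* Return 1 if sz bytes fit in the budget for tick `now', else 0. */
static int
bw_sketch_charge(bw_sketch_t *bw, long now, size_t sz)
{
	if (bw->bw_curr_time != now) {
		/* First charge in a new tick: reset the budget. */
		bw->bw_curr_time = now;
		bw->bw_used = 0;
	} else if (bw->bw_used > bw->bw_limit) {
		/* Budget exhausted; caller queues until the next tick. */
		return (0);
	}
	bw->bw_used += sz;
	return (1);
}
#endif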
3977da14cebeSEric Cheng
39780dc2366fSVenugopal Iyer /*
39790dc2366fSVenugopal Iyer * mac_tx_aggr_mode
39800dc2366fSVenugopal Iyer *
39810dc2366fSVenugopal Iyer * This routine invokes an aggr function, aggr_find_tx_ring(), to find
39820dc2366fSVenugopal Iyer * a (pseudo) Tx ring belonging to a port on which the packet has to
39830dc2366fSVenugopal Iyer * be sent. aggr_find_tx_ring() first finds the outgoing port based on
39840dc2366fSVenugopal Iyer * L2/L3/L4 policy and then uses the fanout_hint passed to it to pick
39850dc2366fSVenugopal Iyer * a Tx ring from the selected port.
39860dc2366fSVenugopal Iyer *
39870dc2366fSVenugopal Iyer * Note that a port can be deleted from the aggregation. In such a case,
39880dc2366fSVenugopal Iyer * the aggregation layer first separates the port from the rest of the
39890dc2366fSVenugopal Iyer * ports making sure that port (and thus any Tx rings associated with
39900dc2366fSVenugopal Iyer * it) won't get selected in the call to aggr_find_tx_ring() function.
39910dc2366fSVenugopal Iyer * Later calls are made to mac_group_rem_ring() passing pseudo Tx ring
39920dc2366fSVenugopal Iyer * handles one by one which in turn will quiesce the Tx SRS and remove
39930dc2366fSVenugopal Iyer * the soft ring associated with the pseudo Tx ring. Unlike Rx side
39940dc2366fSVenugopal Iyer * where a cookie is used to protect against mac_rx_ring() calls on
39950dc2366fSVenugopal Iyer * rings that have been removed, no such cookie is needed on the Tx
39960dc2366fSVenugopal Iyer * side as the pseudo Tx ring won't be available anymore to
39970dc2366fSVenugopal Iyer * aggr_find_tx_ring() once the port has been removed.
39980dc2366fSVenugopal Iyer */
39990dc2366fSVenugopal Iyer static mac_tx_cookie_t
40000dc2366fSVenugopal Iyer mac_tx_aggr_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
40010dc2366fSVenugopal Iyer uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
40020dc2366fSVenugopal Iyer {
40030dc2366fSVenugopal Iyer mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
40040dc2366fSVenugopal Iyer mac_tx_ring_fn_t find_tx_ring_fn;
40050dc2366fSVenugopal Iyer mac_ring_handle_t ring = NULL;
40060dc2366fSVenugopal Iyer void *arg;
40070dc2366fSVenugopal Iyer mac_soft_ring_t *sringp;
40080dc2366fSVenugopal Iyer
40090dc2366fSVenugopal Iyer find_tx_ring_fn = srs_tx->st_capab_aggr.mca_find_tx_ring_fn;
40100dc2366fSVenugopal Iyer arg = srs_tx->st_capab_aggr.mca_arg;
40110dc2366fSVenugopal Iyer if (find_tx_ring_fn(arg, mp_chain, fanout_hint, &ring) == NULL)
401236f99a58SToomas Soome return (0);
40130dc2366fSVenugopal Iyer sringp = srs_tx->st_soft_rings[((mac_ring_t *)ring)->mr_index];
40140dc2366fSVenugopal Iyer return (mac_tx_soft_ring_process(sringp, mp_chain, flag, ret_mp));
40150dc2366fSVenugopal Iyer }
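
/*
 * Sketch: both the BW_FANOUT path above and the aggr path reduce a
 * fanout hint to an index into a soft ring array. The exact
 * HASH_HINT()/COMPUTE_INDEX() mixing is not reproduced here; this is
 * an assumed-equivalent reduction, for illustration only.
 */
#if 0
static uint_t
fanout_pick_index(uintptr_t fanout_hint, uint_t ring_count)
{
	/* Fold the hint down to 32 bits, then map onto [0, ring_count). */
	uint32_t hash = (uint32_t)(fanout_hint ^ (fanout_hint >> 16));

	return (hash % ring_count);
}
#endif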
40160dc2366fSVenugopal Iyer
40170dc2366fSVenugopal Iyer void
40180dc2366fSVenugopal Iyer mac_tx_invoke_callbacks(mac_client_impl_t *mcip, mac_tx_cookie_t cookie)
40190dc2366fSVenugopal Iyer {
40200dc2366fSVenugopal Iyer mac_cb_t *mcb;
40210dc2366fSVenugopal Iyer mac_tx_notify_cb_t *mtnfp;
40220dc2366fSVenugopal Iyer
40230dc2366fSVenugopal Iyer /* Wakeup callback registered clients */
40240dc2366fSVenugopal Iyer MAC_CALLBACK_WALKER_INC(&mcip->mci_tx_notify_cb_info);
40250dc2366fSVenugopal Iyer for (mcb = mcip->mci_tx_notify_cb_list; mcb != NULL;
40260dc2366fSVenugopal Iyer mcb = mcb->mcb_nextp) {
40270dc2366fSVenugopal Iyer mtnfp = (mac_tx_notify_cb_t *)mcb->mcb_objp;
40280dc2366fSVenugopal Iyer mtnfp->mtnf_fn(mtnfp->mtnf_arg, cookie);
40290dc2366fSVenugopal Iyer }
40300dc2366fSVenugopal Iyer MAC_CALLBACK_WALKER_DCR(&mcip->mci_tx_notify_cb_info,
40310dc2366fSVenugopal Iyer &mcip->mci_tx_notify_cb_list);
40320dc2366fSVenugopal Iyer }
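
/*
 * Usage sketch (hypothetical client code, not a MAC API excerpt): a
 * flow-controlled client stores the cookie returned by its blocked
 * transmit and resumes when a notification carrying that cookie
 * arrives. my_client_t and my_client_resume_tx() are illustrative
 * names, not existing interfaces.
 */
#if 0
static void
my_tx_notify(void *arg, mac_tx_cookie_t cookie)
{
	my_client_t *cl = arg;

	/* Resume only if this wakeup matches what blocked us. */
	if (cookie == cl->cl_blocked_cookie) {
		cl->cl_blocked_cookie = 0;
		my_client_resume_tx(cl);
	}
}
#endif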
40330dc2366fSVenugopal Iyer
4034da14cebeSEric Cheng /* ARGSUSED */
4035da14cebeSEric Cheng void
4036da14cebeSEric Cheng mac_tx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
4037da14cebeSEric Cheng {
4038da14cebeSEric Cheng mblk_t *head, *tail;
4039da14cebeSEric Cheng size_t sz;
4040da14cebeSEric Cheng uint32_t tx_mode;
4041da14cebeSEric Cheng uint_t saved_pkt_count;
4042da14cebeSEric Cheng mac_tx_stats_t stats;
4043da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4044d3d50737SRafael Vanoni clock_t now;
4045da14cebeSEric Cheng
4046da14cebeSEric Cheng saved_pkt_count = 0;
4047da14cebeSEric Cheng ASSERT(mutex_owned(&mac_srs->srs_lock));
4048da14cebeSEric Cheng ASSERT(!(mac_srs->srs_state & SRS_PROC));
4049da14cebeSEric Cheng
4050da14cebeSEric Cheng mac_srs->srs_state |= SRS_PROC;
4051da14cebeSEric Cheng
4052da14cebeSEric Cheng tx_mode = srs_tx->st_mode;
4053da14cebeSEric Cheng if (tx_mode == SRS_TX_DEFAULT || tx_mode == SRS_TX_SERIALIZE) {
4054da14cebeSEric Cheng if (mac_srs->srs_first != NULL) {
4055da14cebeSEric Cheng head = mac_srs->srs_first;
4056da14cebeSEric Cheng tail = mac_srs->srs_last;
4057da14cebeSEric Cheng saved_pkt_count = mac_srs->srs_count;
4058da14cebeSEric Cheng mac_srs->srs_first = NULL;
4059da14cebeSEric Cheng mac_srs->srs_last = NULL;
4060da14cebeSEric Cheng mac_srs->srs_count = 0;
4061da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4062da14cebeSEric Cheng
4063da14cebeSEric Cheng head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4064da14cebeSEric Cheng head, &stats);
4065da14cebeSEric Cheng
4066da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4067da14cebeSEric Cheng if (head != NULL) {
4068da14cebeSEric Cheng /* Device out of tx desc, set block */
4069da14cebeSEric Cheng if (head->b_next == NULL)
4070da14cebeSEric Cheng VERIFY(head == tail);
4071da14cebeSEric Cheng tail->b_next = mac_srs->srs_first;
4072da14cebeSEric Cheng mac_srs->srs_first = head;
4073da14cebeSEric Cheng mac_srs->srs_count +=
40740dc2366fSVenugopal Iyer (saved_pkt_count - stats.mts_opackets);
4075da14cebeSEric Cheng if (mac_srs->srs_last == NULL)
4076da14cebeSEric Cheng mac_srs->srs_last = tail;
4077da14cebeSEric Cheng MAC_TX_SRS_BLOCK(mac_srs, head);
4078da14cebeSEric Cheng } else {
4079da14cebeSEric Cheng srs_tx->st_woken_up = B_FALSE;
40800dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
4081da14cebeSEric Cheng }
4082da14cebeSEric Cheng }
4083da14cebeSEric Cheng } else if (tx_mode == SRS_TX_BW) {
4084da14cebeSEric Cheng /*
4085da14cebeSEric Cheng * We are here because the timer fired and we have some data
4086da14cebeSEric Cheng * to transmit. Also, mac_tx_srs_worker should have reset the
4087da14cebeSEric Cheng * SRS_BW_ENFORCED flag.
4088da14cebeSEric Cheng */
4089da14cebeSEric Cheng ASSERT(!(mac_srs->srs_bw->mac_bw_state & SRS_BW_ENFORCED));
4090da14cebeSEric Cheng head = tail = mac_srs->srs_first;
4091da14cebeSEric Cheng while (mac_srs->srs_first != NULL) {
4092da14cebeSEric Cheng tail = mac_srs->srs_first;
4093da14cebeSEric Cheng tail->b_prev = NULL;
4094da14cebeSEric Cheng mac_srs->srs_first = tail->b_next;
4095da14cebeSEric Cheng if (mac_srs->srs_first == NULL)
4096da14cebeSEric Cheng mac_srs->srs_last = NULL;
4097da14cebeSEric Cheng mac_srs->srs_count--;
4098da14cebeSEric Cheng sz = msgdsize(tail);
4099da14cebeSEric Cheng mac_srs->srs_size -= sz;
4100da14cebeSEric Cheng saved_pkt_count++;
4101da14cebeSEric Cheng MAC_TX_UPDATE_BW_INFO(mac_srs, sz);
4102da14cebeSEric Cheng
4103da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_used <
4104da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit)
4105da14cebeSEric Cheng continue;
4106da14cebeSEric Cheng
4107d3d50737SRafael Vanoni now = ddi_get_lbolt();
4108d3d50737SRafael Vanoni if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4109d3d50737SRafael Vanoni mac_srs->srs_bw->mac_bw_curr_time = now;
4110da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = sz;
4111da14cebeSEric Cheng continue;
4112da14cebeSEric Cheng }
4113da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4114da14cebeSEric Cheng break;
4115da14cebeSEric Cheng }
4116da14cebeSEric Cheng
4117da14cebeSEric Cheng ASSERT((head == NULL && tail == NULL) ||
4118da14cebeSEric Cheng (head != NULL && tail != NULL));
4119da14cebeSEric Cheng if (tail != NULL) {
4120da14cebeSEric Cheng tail->b_next = NULL;
4121da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4122da14cebeSEric Cheng
4123da14cebeSEric Cheng head = mac_tx_send(srs_tx->st_arg1, srs_tx->st_arg2,
4124da14cebeSEric Cheng head, &stats);
4125da14cebeSEric Cheng
4126da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4127da14cebeSEric Cheng if (head != NULL) {
4128da14cebeSEric Cheng uint_t size_sent;
4129da14cebeSEric Cheng
4130da14cebeSEric Cheng /* Device out of tx desc, set block */
4131da14cebeSEric Cheng if (head->b_next == NULL)
4132da14cebeSEric Cheng VERIFY(head == tail);
4133da14cebeSEric Cheng tail->b_next = mac_srs->srs_first;
4134da14cebeSEric Cheng mac_srs->srs_first = head;
4135da14cebeSEric Cheng mac_srs->srs_count +=
41360dc2366fSVenugopal Iyer (saved_pkt_count - stats.mts_opackets);
4137da14cebeSEric Cheng if (mac_srs->srs_last == NULL)
4138da14cebeSEric Cheng mac_srs->srs_last = tail;
41390dc2366fSVenugopal Iyer size_sent = sz - stats.mts_obytes;
4140da14cebeSEric Cheng mac_srs->srs_size += size_sent;
4141da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_sz += size_sent;
4142da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_used > size_sent) {
4143da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used -=
4144da14cebeSEric Cheng size_sent;
4145da14cebeSEric Cheng } else {
4146da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = 0;
4147da14cebeSEric Cheng }
4148da14cebeSEric Cheng MAC_TX_SRS_BLOCK(mac_srs, head);
4149da14cebeSEric Cheng } else {
4150da14cebeSEric Cheng srs_tx->st_woken_up = B_FALSE;
41510dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
4152da14cebeSEric Cheng }
4153da14cebeSEric Cheng }
41540dc2366fSVenugopal Iyer } else if (tx_mode == SRS_TX_BW_FANOUT || tx_mode == SRS_TX_BW_AGGR) {
4155da14cebeSEric Cheng mblk_t *prev;
4156da14cebeSEric Cheng uint64_t hint;
4157da14cebeSEric Cheng
4158da14cebeSEric Cheng /*
4159da14cebeSEric Cheng * We are here because the timer fired and we
4160da14cebeSEric Cheng * have some quota to transmit.
4161da14cebeSEric Cheng */
4162da14cebeSEric Cheng prev = NULL;
4163da14cebeSEric Cheng head = tail = mac_srs->srs_first;
4164da14cebeSEric Cheng while (mac_srs->srs_first != NULL) {
4165da14cebeSEric Cheng tail = mac_srs->srs_first;
4166da14cebeSEric Cheng mac_srs->srs_first = tail->b_next;
4167da14cebeSEric Cheng if (mac_srs->srs_first == NULL)
4168da14cebeSEric Cheng mac_srs->srs_last = NULL;
4169da14cebeSEric Cheng mac_srs->srs_count--;
4170da14cebeSEric Cheng sz = msgdsize(tail);
4171da14cebeSEric Cheng mac_srs->srs_size -= sz;
4172da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used += sz;
4173da14cebeSEric Cheng if (prev == NULL)
4174da14cebeSEric Cheng hint = (ulong_t)tail->b_prev;
4175da14cebeSEric Cheng if (hint != (ulong_t)tail->b_prev) {
4176da14cebeSEric Cheng prev->b_next = NULL;
4177da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4178da14cebeSEric Cheng TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4179da14cebeSEric Cheng head = tail;
4180da14cebeSEric Cheng hint = (ulong_t)tail->b_prev;
4181da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4182da14cebeSEric Cheng }
4183da14cebeSEric Cheng
4184da14cebeSEric Cheng prev = tail;
4185da14cebeSEric Cheng tail->b_prev = NULL;
4186da14cebeSEric Cheng if (mac_srs->srs_bw->mac_bw_used <
4187da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_limit)
4188da14cebeSEric Cheng continue;
4189da14cebeSEric Cheng
4190d3d50737SRafael Vanoni now = ddi_get_lbolt();
4191d3d50737SRafael Vanoni if (mac_srs->srs_bw->mac_bw_curr_time != now) {
4192d3d50737SRafael Vanoni mac_srs->srs_bw->mac_bw_curr_time = now;
4193da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_used = 0;
4194da14cebeSEric Cheng continue;
4195da14cebeSEric Cheng }
4196da14cebeSEric Cheng mac_srs->srs_bw->mac_bw_state |= SRS_BW_ENFORCED;
4197da14cebeSEric Cheng break;
4198da14cebeSEric Cheng }
4199da14cebeSEric Cheng ASSERT((head == NULL && tail == NULL) ||
4200da14cebeSEric Cheng (head != NULL && tail != NULL));
4201da14cebeSEric Cheng if (tail != NULL) {
4202da14cebeSEric Cheng tail->b_next = NULL;
4203da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4204da14cebeSEric Cheng TX_SRS_TO_SOFT_RING(mac_srs, head, hint);
4205da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4206da14cebeSEric Cheng }
4207da14cebeSEric Cheng }
4208da14cebeSEric Cheng /*
4209da14cebeSEric Cheng * SRS_TX_FANOUT case not considered here because packets
4210da14cebeSEric Cheng * won't be queued in the SRS for this case. Packets will
4211da14cebeSEric Cheng * be sent directly to soft rings underneath and if there
4212da14cebeSEric Cheng * is any queueing at all, it would be in Tx side soft
4213da14cebeSEric Cheng * rings.
4214da14cebeSEric Cheng */
4215da14cebeSEric Cheng
4216da14cebeSEric Cheng /*
4217da14cebeSEric Cheng * When srs_count becomes 0, reset SRS_TX_HIWAT and
4218da14cebeSEric Cheng * SRS_TX_WAKEUP_CLIENT and wakeup registered clients.
4219da14cebeSEric Cheng */
4220da14cebeSEric Cheng if (mac_srs->srs_count == 0 && (mac_srs->srs_state &
4221da14cebeSEric Cheng (SRS_TX_HIWAT | SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED))) {
4222da14cebeSEric Cheng mac_client_impl_t *mcip = mac_srs->srs_mcip;
4223da14cebeSEric Cheng boolean_t wakeup_required = B_FALSE;
4224da14cebeSEric Cheng
4225da14cebeSEric Cheng if (mac_srs->srs_state &
4226da14cebeSEric Cheng (SRS_TX_HIWAT|SRS_TX_WAKEUP_CLIENT)) {
4227da14cebeSEric Cheng wakeup_required = B_TRUE;
4228da14cebeSEric Cheng }
4229da14cebeSEric Cheng mac_srs->srs_state &= ~(SRS_TX_HIWAT |
4230da14cebeSEric Cheng SRS_TX_WAKEUP_CLIENT | SRS_ENQUEUED);
4231da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4232da14cebeSEric Cheng if (wakeup_required) {
42330dc2366fSVenugopal Iyer mac_tx_invoke_callbacks(mcip, (mac_tx_cookie_t)mac_srs);
4234da14cebeSEric Cheng /*
4235da14cebeSEric Cheng * If the client is not the primary MAC client, then we
4236da14cebeSEric Cheng * need to send the notification to the client's upper
4237da14cebeSEric Cheng * MAC, i.e. mci_upper_mip.
4238da14cebeSEric Cheng */
4239da14cebeSEric Cheng mac_tx_notify(mcip->mci_upper_mip != NULL ?
4240da14cebeSEric Cheng mcip->mci_upper_mip : mcip->mci_mip);
4241da14cebeSEric Cheng }
4242da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4243da14cebeSEric Cheng }
4244da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_PROC;
4245da14cebeSEric Cheng }
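
/*
 * Sketch (stand-in types): the BW_FANOUT/BW_AGGR drain above batches
 * consecutive packets that share a fanout hint (stashed in b_prev by
 * the enqueue path) and flushes each batch to a soft ring via
 * TX_SRS_TO_SOFT_RING(). A standalone version of that batching:
 */
#if 0
typedef struct msk {
	struct msk	*b_next;
	uintptr_t	hint;			/* stands in for b_prev */
} msk_t;

static void flush_batch(msk_t *, uintptr_t);	/* stand-in for fanout */

static void
drain_by_hint(msk_t *list)
{
	msk_t *head = list, *prev = NULL, *mp;
	uintptr_t hint = (list != NULL) ? list->hint : 0;

	for (mp = list; mp != NULL; mp = mp->b_next) {
		if (mp->hint != hint) {
			/* Hint changed: cut the chain and flush it. */
			prev->b_next = NULL;
			flush_batch(head, hint);
			head = mp;
			hint = mp->hint;
		}
		prev = mp;
	}
	if (head != NULL)
		flush_batch(head, hint);
}
#endif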
4246da14cebeSEric Cheng
4247da14cebeSEric Cheng /*
4248da14cebeSEric Cheng * Given a packet, get the flow_entry that identifies the flow
4249da14cebeSEric Cheng * to which that packet belongs. The flow_entry will contain
4250da14cebeSEric Cheng * the transmit function to be used to send the packet. If the
4251da14cebeSEric Cheng * function returns NULL, the packet should be sent using the
4252da14cebeSEric Cheng * underlying NIC.
4253da14cebeSEric Cheng */
4254da14cebeSEric Cheng static flow_entry_t *
4255da14cebeSEric Cheng mac_tx_classify(mac_impl_t *mip, mblk_t *mp)
4256da14cebeSEric Cheng {
4257da14cebeSEric Cheng flow_entry_t *flent = NULL;
4258da14cebeSEric Cheng mac_client_impl_t *mcip;
4259da14cebeSEric Cheng int err;
4260da14cebeSEric Cheng
4261da14cebeSEric Cheng /*
4262da14cebeSEric Cheng * Do classification on the packet.
4263da14cebeSEric Cheng */
4264da14cebeSEric Cheng err = mac_flow_lookup(mip->mi_flow_tab, mp, FLOW_OUTBOUND, &flent);
4265da14cebeSEric Cheng if (err != 0)
4266da14cebeSEric Cheng return (NULL);
4267da14cebeSEric Cheng
4268da14cebeSEric Cheng /*
4269da14cebeSEric Cheng * This flent might just be an additional one on the MAC client,
4270da14cebeSEric Cheng * i.e. for classification purposes (different fdesc); however,
4271da14cebeSEric Cheng * the resources, SRS et al., are in the mci_flent, so if
4272da14cebeSEric Cheng * this isn't the mci_flent, we need to get it.
4273da14cebeSEric Cheng */
4274da14cebeSEric Cheng if ((mcip = flent->fe_mcip) != NULL && mcip->mci_flent != flent) {
4275da14cebeSEric Cheng FLOW_REFRELE(flent);
4276da14cebeSEric Cheng flent = mcip->mci_flent;
4277da14cebeSEric Cheng FLOW_TRY_REFHOLD(flent, err);
4278da14cebeSEric Cheng if (err != 0)
4279da14cebeSEric Cheng return (NULL);
4280da14cebeSEric Cheng }
4281da14cebeSEric Cheng
4282da14cebeSEric Cheng return (flent);
4283da14cebeSEric Cheng }
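
/*
 * Sketch of the caller contract (mirrors the dst_flow_ent handling in
 * mac_tx_send() below): a non-NULL return carries a flow reference
 * that must be released once the packet has been handed off.
 *
 *	dst_flow_ent = mac_tx_classify(mip, mp);
 *	if (dst_flow_ent != NULL) {
 *		... deliver mp via the flow's callback or bcast path ...
 *		FLOW_REFRELE(dst_flow_ent);
 *	} else {
 *		... transmit mp on the underlying NIC ...
 *	}
 */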
4284da14cebeSEric Cheng
4285da14cebeSEric Cheng /*
4286da14cebeSEric Cheng * This macro is only meant to be used by mac_tx_send().
4287da14cebeSEric Cheng */
4288da14cebeSEric Cheng #define CHECK_VID_AND_ADD_TAG(mp) { \
4289da14cebeSEric Cheng if (vid_check) { \
4290da14cebeSEric Cheng int err = 0; \
4291da14cebeSEric Cheng \
4292da14cebeSEric Cheng MAC_VID_CHECK(src_mcip, (mp), err); \
4293da14cebeSEric Cheng if (err != 0) { \
4294da14cebeSEric Cheng freemsg((mp)); \
4295da14cebeSEric Cheng (mp) = next; \
4296da14cebeSEric Cheng oerrors++; \
4297da14cebeSEric Cheng continue; \
4298da14cebeSEric Cheng } \
4299da14cebeSEric Cheng } \
4300da14cebeSEric Cheng if (add_tag) { \
4301da14cebeSEric Cheng (mp) = mac_add_vlan_tag((mp), 0, vid); \
4302da14cebeSEric Cheng if ((mp) == NULL) { \
4303da14cebeSEric Cheng (mp) = next; \
4304da14cebeSEric Cheng oerrors++; \
4305da14cebeSEric Cheng continue; \
4306da14cebeSEric Cheng } \
4307da14cebeSEric Cheng } \
4308da14cebeSEric Cheng }
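
/*
 * Usage sketch: CHECK_VID_AND_ADD_TAG() expands inside the transmit
 * loops of mac_tx_send() below and relies on their locals -- `next',
 * `oerrors', `vid_check', `add_tag' and `vid' -- as well as on
 * `continue' targeting the per-packet loop:
 *
 *	while (mp != NULL) {
 *		next = mp->b_next;
 *		mp->b_next = NULL;
 *		CHECK_VID_AND_ADD_TAG(mp);
 *		... transmit or classify mp ...
 *		mp = next;
 *	}
 */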
4309da14cebeSEric Cheng
4310da14cebeSEric Cheng mblk_t *
4311da14cebeSEric Cheng mac_tx_send(mac_client_handle_t mch, mac_ring_handle_t ring, mblk_t *mp_chain,
4312da14cebeSEric Cheng mac_tx_stats_t *stats)
4313da14cebeSEric Cheng {
4314da14cebeSEric Cheng mac_client_impl_t *src_mcip = (mac_client_impl_t *)mch;
4315da14cebeSEric Cheng mac_impl_t *mip = src_mcip->mci_mip;
4316da14cebeSEric Cheng uint_t obytes = 0, opackets = 0, oerrors = 0;
4317da14cebeSEric Cheng mblk_t *mp = NULL, *next;
4318da14cebeSEric Cheng boolean_t vid_check, add_tag;
4319da14cebeSEric Cheng uint16_t vid = 0;
4320da14cebeSEric Cheng
4321da14cebeSEric Cheng if (mip->mi_nclients > 1) {
4322da14cebeSEric Cheng vid_check = MAC_VID_CHECK_NEEDED(src_mcip);
4323da14cebeSEric Cheng add_tag = MAC_TAG_NEEDED(src_mcip);
4324da14cebeSEric Cheng if (add_tag)
4325da14cebeSEric Cheng vid = mac_client_vid(mch);
4326da14cebeSEric Cheng } else {
4327da14cebeSEric Cheng ASSERT(mip->mi_nclients == 1);
4328da14cebeSEric Cheng vid_check = add_tag = B_FALSE;
4329da14cebeSEric Cheng }
4330da14cebeSEric Cheng
4331da14cebeSEric Cheng /*
43320dc2366fSVenugopal Iyer * Fastpath: if there's only one client, we simply send
43330dc2366fSVenugopal Iyer * the packet down to the underlying NIC.
4334da14cebeSEric Cheng */
43350dc2366fSVenugopal Iyer if (mip->mi_nactiveclients == 1) {
4336da14cebeSEric Cheng DTRACE_PROBE2(fastpath,
4337da14cebeSEric Cheng mac_client_impl_t *, src_mcip, mblk_t *, mp_chain);
4338da14cebeSEric Cheng
4339da14cebeSEric Cheng mp = mp_chain;
4340da14cebeSEric Cheng while (mp != NULL) {
4341da14cebeSEric Cheng next = mp->b_next;
4342da14cebeSEric Cheng mp->b_next = NULL;
4343da14cebeSEric Cheng opackets++;
4344da14cebeSEric Cheng obytes += (mp->b_cont == NULL ? MBLKL(mp) :
4345da14cebeSEric Cheng msgdsize(mp));
4346da14cebeSEric Cheng
4347da14cebeSEric Cheng CHECK_VID_AND_ADD_TAG(mp);
4348c61a1653SRyan Zezeski mp = mac_provider_tx(mip, ring, mp, src_mcip);
4349da14cebeSEric Cheng
4350da14cebeSEric Cheng /*
4351da14cebeSEric Cheng * If the driver is out of descriptors and does a
4352da14cebeSEric Cheng * partial send it will return a chain of unsent
4353da14cebeSEric Cheng * mblks. Adjust the accounting stats.
4354da14cebeSEric Cheng */
4355da14cebeSEric Cheng if (mp != NULL) {
4356da14cebeSEric Cheng opackets--;
4357da14cebeSEric Cheng obytes -= msgdsize(mp);
4358da14cebeSEric Cheng mp->b_next = next;
4359da14cebeSEric Cheng break;
4360da14cebeSEric Cheng }
4361da14cebeSEric Cheng mp = next;
4362da14cebeSEric Cheng }
4363da14cebeSEric Cheng goto done;
4364da14cebeSEric Cheng }
4365da14cebeSEric Cheng
4366da14cebeSEric Cheng /*
4367da14cebeSEric Cheng * No fastpath, we either have more than one MAC client
4368da14cebeSEric Cheng * defined on top of the same MAC, or one or more MAC
4369da14cebeSEric Cheng * client promiscuous callbacks.
4370da14cebeSEric Cheng */
4371da14cebeSEric Cheng DTRACE_PROBE3(slowpath, mac_client_impl_t *,
4372da14cebeSEric Cheng src_mcip, int, mip->mi_nclients, mblk_t *, mp_chain);
4373da14cebeSEric Cheng
4374da14cebeSEric Cheng mp = mp_chain;
4375da14cebeSEric Cheng while (mp != NULL) {
4376da14cebeSEric Cheng flow_entry_t *dst_flow_ent;
4377da14cebeSEric Cheng void *flow_cookie;
4378da14cebeSEric Cheng size_t pkt_size;
4379da14cebeSEric Cheng
4380da14cebeSEric Cheng next = mp->b_next;
4381da14cebeSEric Cheng mp->b_next = NULL;
4382da14cebeSEric Cheng opackets++;
4383da14cebeSEric Cheng pkt_size = (mp->b_cont == NULL ? MBLKL(mp) : msgdsize(mp));
4384da14cebeSEric Cheng obytes += pkt_size;
4385da14cebeSEric Cheng CHECK_VID_AND_ADD_TAG(mp);
4386da14cebeSEric Cheng
4387da14cebeSEric Cheng /*
4388da14cebeSEric Cheng * Find the destination.
4389da14cebeSEric Cheng */
4390da14cebeSEric Cheng dst_flow_ent = mac_tx_classify(mip, mp);
4391da14cebeSEric Cheng
4392da14cebeSEric Cheng if (dst_flow_ent != NULL) {
4393da14cebeSEric Cheng /*
4394da14cebeSEric Cheng * Got a matching flow. It's either another
4395da14cebeSEric Cheng * MAC client, or a broadcast/multicast flow.
4396da14cebeSEric Cheng */
4397da14cebeSEric Cheng flow_cookie = mac_flow_get_client_cookie(dst_flow_ent);
4398c61a1653SRyan Zezeski
4399da14cebeSEric Cheng if (flow_cookie != NULL) {
4400da14cebeSEric Cheng /*
4401da14cebeSEric Cheng * The vnic_bcast_send function expects
4402da14cebeSEric Cheng * to receive the sender MAC client
4403da14cebeSEric Cheng * as value for arg2.
4404da14cebeSEric Cheng */
4405da14cebeSEric Cheng mac_bcast_send(flow_cookie, src_mcip, mp,
4406da14cebeSEric Cheng B_TRUE);
4407da14cebeSEric Cheng } else {
4408da14cebeSEric Cheng /*
44090dc2366fSVenugopal Iyer * Loop back the packet to a local MAC
44100dc2366fSVenugopal Iyer * client. We force a context switch
44110dc2366fSVenugopal Iyer * if both source and destination MAC
44120dc2366fSVenugopal Iyer * clients are used by IP, i.e.
44130dc2366fSVenugopal Iyer * bypass is set.
4414da14cebeSEric Cheng */
4415da14cebeSEric Cheng boolean_t do_switch;
4416c61a1653SRyan Zezeski
4417da14cebeSEric Cheng mac_client_impl_t *dst_mcip =
4418da14cebeSEric Cheng dst_flow_ent->fe_mcip;
4419da14cebeSEric Cheng
44200dc2366fSVenugopal Iyer /*
44210dc2366fSVenugopal Iyer * Check if there are promiscuous mode
44220dc2366fSVenugopal Iyer * callbacks defined. This check is
44230dc2366fSVenugopal Iyer * done here in the 'else' case and
44240dc2366fSVenugopal Iyer * not in other cases because this
44250dc2366fSVenugopal Iyer * path is for local loopback
44260dc2366fSVenugopal Iyer * communication which does not go
44270dc2366fSVenugopal Iyer * through MAC_TX(). For paths that go
44280dc2366fSVenugopal Iyer * through MAC_TX(), the promisc_list
44290dc2366fSVenugopal Iyer * check is done inside the MAC_TX()
44300dc2366fSVenugopal Iyer * macro.
44310dc2366fSVenugopal Iyer */
4432c61a1653SRyan Zezeski if (mip->mi_promisc_list != NULL) {
4433c61a1653SRyan Zezeski mac_promisc_dispatch(mip, mp, src_mcip,
4434c61a1653SRyan Zezeski B_TRUE);
4435c61a1653SRyan Zezeski }
44360dc2366fSVenugopal Iyer
4437da14cebeSEric Cheng do_switch = ((src_mcip->mci_state_flags &
4438da14cebeSEric Cheng dst_mcip->mci_state_flags &
4439da14cebeSEric Cheng MCIS_CLIENT_POLL_CAPABLE) != 0);
4440da14cebeSEric Cheng
4441c61a1653SRyan Zezeski mac_hw_emul(&mp, NULL, NULL, MAC_ALL_EMULS);
4442c61a1653SRyan Zezeski if (mp != NULL) {
4443da14cebeSEric Cheng (dst_flow_ent->fe_cb_fn)(
4444da14cebeSEric Cheng dst_flow_ent->fe_cb_arg1,
4445da14cebeSEric Cheng dst_flow_ent->fe_cb_arg2,
4446c61a1653SRyan Zezeski mp, do_switch);
4447da14cebeSEric Cheng }
4448c61a1653SRyan Zezeski
4449da14cebeSEric Cheng }
4450da14cebeSEric Cheng FLOW_REFRELE(dst_flow_ent);
4451da14cebeSEric Cheng } else {
4452da14cebeSEric Cheng /*
4453da14cebeSEric Cheng * Unknown destination, send via the underlying
4454da14cebeSEric Cheng * NIC.
4455da14cebeSEric Cheng */
4456c61a1653SRyan Zezeski mp = mac_provider_tx(mip, ring, mp, src_mcip);
4457da14cebeSEric Cheng if (mp != NULL) {
4458da14cebeSEric Cheng /*
4459da14cebeSEric Cheng * Adjust for the last packet that
4460da14cebeSEric Cheng * could not be transmitted
4461da14cebeSEric Cheng */
4462da14cebeSEric Cheng opackets--;
4463da14cebeSEric Cheng obytes -= pkt_size;
4464da14cebeSEric Cheng mp->b_next = next;
4465da14cebeSEric Cheng break;
4466da14cebeSEric Cheng }
4467da14cebeSEric Cheng }
4468da14cebeSEric Cheng mp = next;
4469da14cebeSEric Cheng }
4470da14cebeSEric Cheng
4471da14cebeSEric Cheng done:
44720dc2366fSVenugopal Iyer stats->mts_obytes = obytes;
44730dc2366fSVenugopal Iyer stats->mts_opackets = opackets;
44740dc2366fSVenugopal Iyer stats->mts_oerrors = oerrors;
4475da14cebeSEric Cheng return (mp);
4476da14cebeSEric Cheng }
4477da14cebeSEric Cheng
4478da14cebeSEric Cheng /*
4479da14cebeSEric Cheng * mac_tx_srs_ring_present
4480da14cebeSEric Cheng *
4481da14cebeSEric Cheng * Returns whether the specified ring is part of the specified SRS.
4482da14cebeSEric Cheng */
4483da14cebeSEric Cheng boolean_t
4484da14cebeSEric Cheng mac_tx_srs_ring_present(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
4485da14cebeSEric Cheng {
4486da14cebeSEric Cheng int i;
4487da14cebeSEric Cheng mac_soft_ring_t *soft_ring;
4488da14cebeSEric Cheng
4489da14cebeSEric Cheng if (srs->srs_tx.st_arg2 == tx_ring)
4490da14cebeSEric Cheng return (B_TRUE);
4491da14cebeSEric Cheng
44920dc2366fSVenugopal Iyer for (i = 0; i < srs->srs_tx_ring_count; i++) {
44930dc2366fSVenugopal Iyer soft_ring = srs->srs_tx_soft_rings[i];
4494da14cebeSEric Cheng if (soft_ring->s_ring_tx_arg2 == tx_ring)
4495da14cebeSEric Cheng return (B_TRUE);
4496da14cebeSEric Cheng }
4497da14cebeSEric Cheng
4498da14cebeSEric Cheng return (B_FALSE);
4499da14cebeSEric Cheng }
4500da14cebeSEric Cheng
4501da14cebeSEric Cheng /*
45020dc2366fSVenugopal Iyer * mac_tx_srs_get_soft_ring
45030dc2366fSVenugopal Iyer *
45040dc2366fSVenugopal Iyer * Returns the TX soft ring associated with the given ring, if present.
45050dc2366fSVenugopal Iyer */
45060dc2366fSVenugopal Iyer mac_soft_ring_t *
45070dc2366fSVenugopal Iyer mac_tx_srs_get_soft_ring(mac_soft_ring_set_t *srs, mac_ring_t *tx_ring)
45080dc2366fSVenugopal Iyer {
45090dc2366fSVenugopal Iyer int i;
45100dc2366fSVenugopal Iyer mac_soft_ring_t *soft_ring;
45110dc2366fSVenugopal Iyer
45120dc2366fSVenugopal Iyer if (srs->srs_tx.st_arg2 == tx_ring)
45130dc2366fSVenugopal Iyer return (NULL);
45140dc2366fSVenugopal Iyer
45150dc2366fSVenugopal Iyer for (i = 0; i < srs->srs_tx_ring_count; i++) {
45160dc2366fSVenugopal Iyer soft_ring = srs->srs_tx_soft_rings[i];
45170dc2366fSVenugopal Iyer if (soft_ring->s_ring_tx_arg2 == tx_ring)
45180dc2366fSVenugopal Iyer return (soft_ring);
45190dc2366fSVenugopal Iyer }
45200dc2366fSVenugopal Iyer
45210dc2366fSVenugopal Iyer return (NULL);
45220dc2366fSVenugopal Iyer }
45230dc2366fSVenugopal Iyer
45240dc2366fSVenugopal Iyer /*
4525da14cebeSEric Cheng * mac_tx_srs_wakeup
4526da14cebeSEric Cheng *
4527da14cebeSEric Cheng * Called when Tx descriptors become available. Wake up the appropriate worker
4528da14cebeSEric Cheng * thread after resetting the SRS_TX_BLOCKED/S_RING_BLOCK bit in the
4529da14cebeSEric Cheng * state field.
4530da14cebeSEric Cheng */
4531da14cebeSEric Cheng void
4532da14cebeSEric Cheng mac_tx_srs_wakeup(mac_soft_ring_set_t *mac_srs, mac_ring_handle_t ring)
4533da14cebeSEric Cheng {
4534da14cebeSEric Cheng int i;
4535da14cebeSEric Cheng mac_soft_ring_t *sringp;
4536da14cebeSEric Cheng mac_srs_tx_t *srs_tx = &mac_srs->srs_tx;
4537da14cebeSEric Cheng
4538da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
45390dc2366fSVenugopal Iyer /*
45400dc2366fSVenugopal Iyer * srs_tx_ring_count == 0 is the single ring mode case. In
45410dc2366fSVenugopal Iyer * this mode, there will not be Tx soft rings associated
45420dc2366fSVenugopal Iyer * with the SRS.
45430dc2366fSVenugopal Iyer */
45440dc2366fSVenugopal Iyer if (!MAC_TX_SOFT_RINGS(mac_srs)) {
4545da14cebeSEric Cheng if (srs_tx->st_arg2 == ring &&
4546da14cebeSEric Cheng mac_srs->srs_state & SRS_TX_BLOCKED) {
4547da14cebeSEric Cheng mac_srs->srs_state &= ~SRS_TX_BLOCKED;
45480dc2366fSVenugopal Iyer srs_tx->st_stat.mts_unblockcnt++;
4549da14cebeSEric Cheng cv_signal(&mac_srs->srs_async);
4550da14cebeSEric Cheng }
4551da14cebeSEric Cheng /*
4552da14cebeSEric Cheng * A wakeup can come before tx_srs_drain() could
4553da14cebeSEric Cheng * grab srs lock and set SRS_TX_BLOCKED. So
4554da14cebeSEric Cheng * always set woken_up flag when we come here.
4555da14cebeSEric Cheng */
4556da14cebeSEric Cheng srs_tx->st_woken_up = B_TRUE;
4557da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4558da14cebeSEric Cheng return;
4559da14cebeSEric Cheng }
4560da14cebeSEric Cheng
45610dc2366fSVenugopal Iyer /*
45620dc2366fSVenugopal Iyer * We get here for the FANOUT, BW_FANOUT,
45630dc2366fSVenugopal Iyer * AGGR_MODE, or AGGR_BW_MODE cases.
45640dc2366fSVenugopal Iyer */
45650dc2366fSVenugopal Iyer for (i = 0; i < mac_srs->srs_tx_ring_count; i++) {
45660dc2366fSVenugopal Iyer sringp = mac_srs->srs_tx_soft_rings[i];
4567da14cebeSEric Cheng mutex_enter(&sringp->s_ring_lock);
4568da14cebeSEric Cheng if (sringp->s_ring_tx_arg2 == ring) {
4569da14cebeSEric Cheng if (sringp->s_ring_state & S_RING_BLOCK) {
4570da14cebeSEric Cheng sringp->s_ring_state &= ~S_RING_BLOCK;
45710dc2366fSVenugopal Iyer sringp->s_st_stat.mts_unblockcnt++;
4572da14cebeSEric Cheng cv_signal(&sringp->s_ring_async);
4573da14cebeSEric Cheng }
4574da14cebeSEric Cheng sringp->s_ring_tx_woken_up = B_TRUE;
4575da14cebeSEric Cheng }
4576da14cebeSEric Cheng mutex_exit(&sringp->s_ring_lock);
4577da14cebeSEric Cheng }
4578da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4579da14cebeSEric Cheng }
4580da14cebeSEric Cheng
4581da14cebeSEric Cheng /*
4582da14cebeSEric Cheng * Once the driver is done draining, send a MAC_NOTE_TX notification to unleash
4583da14cebeSEric Cheng * the blocked clients again.
4584da14cebeSEric Cheng */
4585da14cebeSEric Cheng void
4586da14cebeSEric Cheng mac_tx_notify(mac_impl_t *mip)
4587da14cebeSEric Cheng {
4588da14cebeSEric Cheng i_mac_notify(mip, MAC_NOTE_TX);
4589da14cebeSEric Cheng }
4590da14cebeSEric Cheng
4591da14cebeSEric Cheng /*
4592da14cebeSEric Cheng * RX SOFTRING RELATED FUNCTIONS
4593da14cebeSEric Cheng *
4594da14cebeSEric Cheng * These functions really belong in mac_soft_ring.c and are here
4595da14cebeSEric Cheng * only for a short period.
4596da14cebeSEric Cheng */
4597da14cebeSEric Cheng
4598da14cebeSEric Cheng #define SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4599da14cebeSEric Cheng /* \
4600da14cebeSEric Cheng * Enqueue our mblk chain. \
4601da14cebeSEric Cheng */ \
4602da14cebeSEric Cheng ASSERT(MUTEX_HELD(&(ringp)->s_ring_lock)); \
4603da14cebeSEric Cheng \
4604da14cebeSEric Cheng if ((ringp)->s_ring_last != NULL) \
4605da14cebeSEric Cheng (ringp)->s_ring_last->b_next = (mp); \
4606da14cebeSEric Cheng else \
4607da14cebeSEric Cheng (ringp)->s_ring_first = (mp); \
4608da14cebeSEric Cheng (ringp)->s_ring_last = (tail); \
4609da14cebeSEric Cheng (ringp)->s_ring_count += (cnt); \
4610da14cebeSEric Cheng ASSERT((ringp)->s_ring_count > 0); \
4611da14cebeSEric Cheng if ((ringp)->s_ring_type & ST_RING_BW_CTL) { \
4612da14cebeSEric Cheng (ringp)->s_ring_size += sz; \
4613da14cebeSEric Cheng } \
4614da14cebeSEric Cheng }
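
/*
 * Sketch (stand-in types): SOFT_RING_ENQUEUE_CHAIN() is the usual
 * first/last/count tail-append of an already-counted mblk chain. A
 * standalone version, minus the bandwidth-control byte accounting:
 */
#if 0
typedef struct blk {
	struct blk	*b_next;
} blk_t;

typedef struct {
	blk_t	*q_first;
	blk_t	*q_last;
	int	q_count;
} q_sketch_t;

static void
q_sketch_append(q_sketch_t *q, blk_t *head, blk_t *tail, int cnt)
{
	if (q->q_last != NULL)
		q->q_last->b_next = head;	/* append to the tail */
	else
		q->q_first = head;		/* queue was empty */
	q->q_last = tail;
	q->q_count += cnt;
}
#endif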
4615da14cebeSEric Cheng
4616da14cebeSEric Cheng /*
4617da14cebeSEric Cheng * Default entry point to deliver a packet chain to a MAC client.
4618da14cebeSEric Cheng * If the MAC client has flows, do the classification with these
4619da14cebeSEric Cheng * flows as well.
4620da14cebeSEric Cheng */
4621da14cebeSEric Cheng /* ARGSUSED */
4622da14cebeSEric Cheng void
4623da14cebeSEric Cheng mac_rx_deliver(void *arg1, mac_resource_handle_t mrh, mblk_t *mp_chain,
4624da14cebeSEric Cheng mac_header_info_t *arg3)
4625da14cebeSEric Cheng {
4626da14cebeSEric Cheng mac_client_impl_t *mcip = arg1;
4627da14cebeSEric Cheng
4628da14cebeSEric Cheng if (mcip->mci_nvids == 1 &&
4629dbe999caSVenugopal Iyer !(mcip->mci_state_flags & MCIS_STRIP_DISABLE)) {
4630da14cebeSEric Cheng /*
4631da14cebeSEric Cheng * If the client has exactly one VID associated with it
4632da14cebeSEric Cheng * and stripping of the VLAN header is not disabled,
4633da14cebeSEric Cheng * remove the VLAN tag from the packet before
4634da14cebeSEric Cheng * passing it on to the client's receive callback.
4635da14cebeSEric Cheng * Note that this needs to be done after we dispatch
4636da14cebeSEric Cheng * the packet to the promiscuous listeners of the
4637da14cebeSEric Cheng * client, since they expect to see the whole
4638da14cebeSEric Cheng * frame including the VLAN headers.
463984de666eSRyan Zezeski *
464084de666eSRyan Zezeski * MCIS_STRIP_DISABLE is only set when sun4v
464184de666eSRyan Zezeski * vsw is in play.
4642da14cebeSEric Cheng */
4643da14cebeSEric Cheng mp_chain = mac_strip_vlan_tag_chain(mp_chain);
4644da14cebeSEric Cheng }
4645da14cebeSEric Cheng
4646da14cebeSEric Cheng mcip->mci_rx_fn(mcip->mci_rx_arg, mrh, mp_chain, B_FALSE);
4647da14cebeSEric Cheng }
4648da14cebeSEric Cheng
4649da14cebeSEric Cheng /*
465084de666eSRyan Zezeski * Process a chain for a given soft ring. If the number of packets
465184de666eSRyan Zezeski * queued in the SRS and its associated soft rings (including this
465284de666eSRyan Zezeski * one) is very small (tracked by srs_poll_pkt_cnt) then allow the
465384de666eSRyan Zezeski * entering thread (interrupt or poll thread) to process the chain
465484de666eSRyan Zezeski * inline. This is meant to reduce latency under low load.
4655da14cebeSEric Cheng *
4656da14cebeSEric Cheng * The proc and arg for each mblk is already stored in the mblk in
4657da14cebeSEric Cheng * appropriate places.
4658da14cebeSEric Cheng */
4659da14cebeSEric Cheng /* ARGSUSED */
4660da14cebeSEric Cheng void
4661da14cebeSEric Cheng mac_rx_soft_ring_process(mac_client_impl_t *mcip, mac_soft_ring_t *ringp,
4662da14cebeSEric Cheng mblk_t *mp_chain, mblk_t *tail, int cnt, size_t sz)
4663da14cebeSEric Cheng {
4664da14cebeSEric Cheng mac_direct_rx_t proc;
4665da14cebeSEric Cheng void *arg1;
4666da14cebeSEric Cheng mac_resource_handle_t arg2;
4667da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4668da14cebeSEric Cheng
4669da14cebeSEric Cheng ASSERT(ringp != NULL);
4670da14cebeSEric Cheng ASSERT(mp_chain != NULL);
4671da14cebeSEric Cheng ASSERT(tail != NULL);
4672da14cebeSEric Cheng ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4673da14cebeSEric Cheng
4674da14cebeSEric Cheng mutex_enter(&ringp->s_ring_lock);
4675da14cebeSEric Cheng ringp->s_ring_total_inpkt += cnt;
46760dc2366fSVenugopal Iyer ringp->s_ring_total_rbytes += sz;
4677ae6aa22aSVenugopal Iyer if ((mac_srs->srs_rx.sr_poll_pkt_cnt <= 1) &&
4678ae6aa22aSVenugopal Iyer !(ringp->s_ring_type & ST_RING_WORKER_ONLY)) {
4679da14cebeSEric Cheng /* If on processor or blanking on, then enqueue and return */
4680da14cebeSEric Cheng if (ringp->s_ring_state & S_RING_BLANK ||
4681da14cebeSEric Cheng ringp->s_ring_state & S_RING_PROC) {
4682da14cebeSEric Cheng SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4683da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4684da14cebeSEric Cheng return;
4685da14cebeSEric Cheng }
4686da14cebeSEric Cheng proc = ringp->s_ring_rx_func;
4687da14cebeSEric Cheng arg1 = ringp->s_ring_rx_arg1;
4688da14cebeSEric Cheng arg2 = ringp->s_ring_rx_arg2;
4689da14cebeSEric Cheng /*
4690da14cebeSEric Cheng * See if anything is already queued. If we are the
4691da14cebeSEric Cheng * first packet, do inline processing; else queue the
4692da14cebeSEric Cheng * packet and do the drain.
4693da14cebeSEric Cheng */
4694da14cebeSEric Cheng if (ringp->s_ring_first == NULL) {
4695da14cebeSEric Cheng /*
4696da14cebeSEric Cheng * Fast-path, ok to process and nothing queued.
4697da14cebeSEric Cheng */
4698da14cebeSEric Cheng ringp->s_ring_run = curthread;
4699da14cebeSEric Cheng ringp->s_ring_state |= (S_RING_PROC);
4700da14cebeSEric Cheng
4701da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4702da14cebeSEric Cheng
4703da14cebeSEric Cheng /*
4704da14cebeSEric Cheng * We have a chain of a single packet, so
4705da14cebeSEric Cheng * take this fast path.
4706da14cebeSEric Cheng */
4707da14cebeSEric Cheng ASSERT(mp_chain->b_next == NULL);
4708da14cebeSEric Cheng
4709da14cebeSEric Cheng (*proc)(arg1, arg2, mp_chain, NULL);
4710da14cebeSEric Cheng
4711da14cebeSEric Cheng ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4712da14cebeSEric Cheng /*
471384de666eSRyan Zezeski * If we have an SRS performing bandwidth
471484de666eSRyan Zezeski * control then we need to decrement the size
471584de666eSRyan Zezeski * and count so the SRS has an accurate count
471684de666eSRyan Zezeski * of the data queued between the SRS and its
471784de666eSRyan Zezeski * soft rings. We decrement the counters only
471884de666eSRyan Zezeski * when the packet is processed by both the
471984de666eSRyan Zezeski * SRS and the soft ring.
4720da14cebeSEric Cheng */
4721da14cebeSEric Cheng mutex_enter(&mac_srs->srs_lock);
4722da14cebeSEric Cheng MAC_UPDATE_SRS_COUNT_LOCKED(mac_srs, cnt);
4723da14cebeSEric Cheng MAC_UPDATE_SRS_SIZE_LOCKED(mac_srs, sz);
4724da14cebeSEric Cheng mutex_exit(&mac_srs->srs_lock);
4725da14cebeSEric Cheng
4726da14cebeSEric Cheng mutex_enter(&ringp->s_ring_lock);
4727da14cebeSEric Cheng ringp->s_ring_run = NULL;
4728da14cebeSEric Cheng ringp->s_ring_state &= ~S_RING_PROC;
4729da14cebeSEric Cheng if (ringp->s_ring_state & S_RING_CLIENT_WAIT)
4730da14cebeSEric Cheng cv_signal(&ringp->s_ring_client_cv);
4731da14cebeSEric Cheng
4732da14cebeSEric Cheng if ((ringp->s_ring_first == NULL) ||
4733da14cebeSEric Cheng (ringp->s_ring_state & S_RING_BLANK)) {
4734da14cebeSEric Cheng /*
473584de666eSRyan Zezeski * We processed a single packet inline
473684de666eSRyan Zezeski * and nothing new has arrived or our
4737da14cebeSEric Cheng * receiver doesn't want to receive
4738da14cebeSEric Cheng * any packets. We are done.
4739da14cebeSEric Cheng */
4740da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4741da14cebeSEric Cheng return;
4742da14cebeSEric Cheng }
4743da14cebeSEric Cheng } else {
4744da14cebeSEric Cheng SOFT_RING_ENQUEUE_CHAIN(ringp,
4745da14cebeSEric Cheng mp_chain, tail, cnt, sz);
4746da14cebeSEric Cheng }
4747da14cebeSEric Cheng
4748da14cebeSEric Cheng /*
4749da14cebeSEric Cheng * We are here because either we couldn't do inline
4750da14cebeSEric Cheng * processing (because something was already
4751da14cebeSEric Cheng * queued), or we had a chain of more than one
4752da14cebeSEric Cheng * packet, or something else arrived after we were
4753da14cebeSEric Cheng * done with inline processing.
4754da14cebeSEric Cheng */
4755da14cebeSEric Cheng ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4756da14cebeSEric Cheng ASSERT(ringp->s_ring_first != NULL);
4757da14cebeSEric Cheng
4758da14cebeSEric Cheng ringp->s_ring_drain_func(ringp);
4759da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4760da14cebeSEric Cheng return;
4761da14cebeSEric Cheng } else {
4762da14cebeSEric Cheng /* ST_RING_WORKER_ONLY case */
4763da14cebeSEric Cheng SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4764da14cebeSEric Cheng mac_soft_ring_worker_wakeup(ringp);
4765da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4766da14cebeSEric Cheng }
4767da14cebeSEric Cheng }
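
/*
 * Summary sketch of the dispatch decision above (pseudo-code):
 *
 *	if (srs_poll_pkt_cnt <= 1 && !ST_RING_WORKER_ONLY) {
 *		if (ring is blanked or already being processed)
 *			enqueue and return;
 *		else if (nothing is queued)
 *			process the (single-packet) chain inline,
 *			then drain anything that arrived meanwhile;
 *		else
 *			enqueue, then drain;
 *	} else {
 *		enqueue and wake the worker;
 *	}
 */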
4768da14cebeSEric Cheng
4769da14cebeSEric Cheng /*
4770da14cebeSEric Cheng * TX SOFTRING RELATED FUNCTIONS
4771da14cebeSEric Cheng *
4772da14cebeSEric Cheng * These functions really belong in mac_soft_ring.c and here for
4773da14cebeSEric Cheng * a short period.
4774da14cebeSEric Cheng */
4775da14cebeSEric Cheng
4776da14cebeSEric Cheng #define TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) { \
4777da14cebeSEric Cheng ASSERT(MUTEX_HELD(&ringp->s_ring_lock)); \
4778da14cebeSEric Cheng ringp->s_ring_state |= S_RING_ENQUEUED; \
4779da14cebeSEric Cheng SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz); \
4780da14cebeSEric Cheng }
4781da14cebeSEric Cheng
4782da14cebeSEric Cheng /*
4783da14cebeSEric Cheng * mac_tx_sring_queued
4784da14cebeSEric Cheng *
4785da14cebeSEric Cheng * When we are out of transmit descriptors and we already have a
4786da14cebeSEric Cheng * queue that exceeds hiwat (or the client called us with
4787da14cebeSEric Cheng * MAC_TX_NO_ENQUEUE or MAC_DROP_ON_NO_DESC flag), return the
4788da14cebeSEric Cheng * soft ring pointer as the opaque cookie so that the client can
4789da14cebeSEric Cheng * enable flow control.
4790da14cebeSEric Cheng */
4791da14cebeSEric Cheng static mac_tx_cookie_t
4792da14cebeSEric Cheng mac_tx_sring_enqueue(mac_soft_ring_t *ringp, mblk_t *mp_chain, uint16_t flag,
4793da14cebeSEric Cheng mblk_t **ret_mp)
4794da14cebeSEric Cheng {
4795da14cebeSEric Cheng int cnt;
4796da14cebeSEric Cheng size_t sz;
4797da14cebeSEric Cheng mblk_t *tail;
4798da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
479936f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
4800da14cebeSEric Cheng boolean_t wakeup_worker = B_TRUE;
4801da14cebeSEric Cheng
4802da14cebeSEric Cheng ASSERT(MUTEX_HELD(&ringp->s_ring_lock));
4803da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4804da14cebeSEric Cheng if (flag & MAC_DROP_ON_NO_DESC) {
4805c61a1653SRyan Zezeski mac_drop_chain(mp_chain, "Tx softring no desc");
4806da14cebeSEric Cheng /* increment freed stats */
4807da14cebeSEric Cheng ringp->s_ring_drops += cnt;
4808da14cebeSEric Cheng cookie = (mac_tx_cookie_t)ringp;
4809da14cebeSEric Cheng } else {
4810da14cebeSEric Cheng if (ringp->s_ring_first != NULL)
4811da14cebeSEric Cheng wakeup_worker = B_FALSE;
4812da14cebeSEric Cheng
4813da14cebeSEric Cheng if (flag & MAC_TX_NO_ENQUEUE) {
4814da14cebeSEric Cheng /*
4815da14cebeSEric Cheng * If nothing is queued yet, queue the packet
4816da14cebeSEric Cheng * and let mac_tx_soft_ring_drain() set
4817da14cebeSEric Cheng * the TX_BLOCKED bit for the reasons
4818da14cebeSEric Cheng * explained above. Otherwise, return the
4819da14cebeSEric Cheng * mblks.
4820da14cebeSEric Cheng */
4821da14cebeSEric Cheng if (wakeup_worker) {
4822da14cebeSEric Cheng TX_SOFT_RING_ENQUEUE_CHAIN(ringp,
4823da14cebeSEric Cheng mp_chain, tail, cnt, sz);
4824da14cebeSEric Cheng } else {
4825da14cebeSEric Cheng ringp->s_ring_state |= S_RING_WAKEUP_CLIENT;
4826da14cebeSEric Cheng cookie = (mac_tx_cookie_t)ringp;
4827da14cebeSEric Cheng *ret_mp = mp_chain;
4828da14cebeSEric Cheng }
4829da14cebeSEric Cheng } else {
4830da14cebeSEric Cheng boolean_t enqueue = B_TRUE;
4831da14cebeSEric Cheng
4832da14cebeSEric Cheng if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4833da14cebeSEric Cheng /*
4834da14cebeSEric Cheng * flow-controlled. Store ringp in cookie
4835da14cebeSEric Cheng * so that it can be returned as
4836da14cebeSEric Cheng * mac_tx_cookie_t to client
4837da14cebeSEric Cheng */
4838da14cebeSEric Cheng ringp->s_ring_state |= S_RING_TX_HIWAT;
4839da14cebeSEric Cheng cookie = (mac_tx_cookie_t)ringp;
4840da14cebeSEric Cheng ringp->s_ring_hiwat_cnt++;
4841da14cebeSEric Cheng if (ringp->s_ring_count >
4842da14cebeSEric Cheng ringp->s_ring_tx_max_q_cnt) {
4843da14cebeSEric Cheng /* increment freed stats */
4844da14cebeSEric Cheng ringp->s_ring_drops += cnt;
4845da14cebeSEric Cheng /*
4846da14cebeSEric Cheng * b_prev may be set to the fanout hint
4847da14cebeSEric Cheng * hence we can't use freemsg() directly.
4848da14cebeSEric Cheng */
4849c61a1653SRyan Zezeski mac_drop_chain(mp_chain,
4850c61a1653SRyan Zezeski "Tx softring max queue");
4851da14cebeSEric Cheng DTRACE_PROBE1(tx_queued_hiwat,
4852da14cebeSEric Cheng mac_soft_ring_t *, ringp);
4853da14cebeSEric Cheng enqueue = B_FALSE;
4854da14cebeSEric Cheng }
4855da14cebeSEric Cheng }
4856da14cebeSEric Cheng if (enqueue) {
4857da14cebeSEric Cheng TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain,
4858da14cebeSEric Cheng tail, cnt, sz);
4859da14cebeSEric Cheng }
4860da14cebeSEric Cheng }
4861da14cebeSEric Cheng if (wakeup_worker)
4862da14cebeSEric Cheng cv_signal(&ringp->s_ring_async);
4863da14cebeSEric Cheng }
4864da14cebeSEric Cheng return (cookie);
4865da14cebeSEric Cheng }
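
/*
 * Sketch of the caller's view (hypothetical internal caller): a
 * nonzero cookie from this enqueue path means the soft ring is now
 * flow-controlled. With MAC_TX_NO_ENQUEUE the unsent chain also comes
 * back via ret_mp, and the caller must hold it until the notification
 * carrying this cookie fires (see mac_tx_invoke_callbacks() above).
 *
 *	mblk_t *unsent = NULL;
 *	cookie = mac_tx_soft_ring_process(ringp, mp_chain,
 *	    MAC_TX_NO_ENQUEUE, &unsent);
 *	if (cookie != 0) {
 *		... stash `unsent' and resume when the Tx notify
 *		    callback delivers `cookie' ...
 *	}
 */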
4866da14cebeSEric Cheng
4867da14cebeSEric Cheng
4868da14cebeSEric Cheng /*
4869da14cebeSEric Cheng * mac_tx_soft_ring_process
4870da14cebeSEric Cheng *
4871da14cebeSEric Cheng * This routine is called when fanning out outgoing traffic among
4872da14cebeSEric Cheng * multiple Tx rings.
4873da14cebeSEric Cheng * Note that a soft ring is associated with a h/w Tx ring.
4874da14cebeSEric Cheng */
4875da14cebeSEric Cheng mac_tx_cookie_t
4876da14cebeSEric Cheng mac_tx_soft_ring_process(mac_soft_ring_t *ringp, mblk_t *mp_chain,
4877da14cebeSEric Cheng uint16_t flag, mblk_t **ret_mp)
4878da14cebeSEric Cheng {
4879da14cebeSEric Cheng mac_soft_ring_set_t *mac_srs = ringp->s_ring_set;
4880da14cebeSEric Cheng int cnt;
4881da14cebeSEric Cheng size_t sz;
4882da14cebeSEric Cheng mblk_t *tail;
488336f99a58SToomas Soome mac_tx_cookie_t cookie = 0;
4884da14cebeSEric Cheng
4885da14cebeSEric Cheng ASSERT(ringp != NULL);
4886da14cebeSEric Cheng ASSERT(mp_chain != NULL);
4887da14cebeSEric Cheng ASSERT(MUTEX_NOT_HELD(&ringp->s_ring_lock));
4888da14cebeSEric Cheng /*
48890dc2366fSVenugopal Iyer * The following modes can come here: SRS_TX_BW_FANOUT,
48900dc2366fSVenugopal Iyer * SRS_TX_FANOUT, SRS_TX_AGGR, SRS_TX_BW_AGGR.
4891da14cebeSEric Cheng */
48920dc2366fSVenugopal Iyer ASSERT(MAC_TX_SOFT_RINGS(mac_srs));
4893da14cebeSEric Cheng ASSERT(mac_srs->srs_tx.st_mode == SRS_TX_FANOUT ||
48940dc2366fSVenugopal Iyer mac_srs->srs_tx.st_mode == SRS_TX_BW_FANOUT ||
48950dc2366fSVenugopal Iyer mac_srs->srs_tx.st_mode == SRS_TX_AGGR ||
48960dc2366fSVenugopal Iyer mac_srs->srs_tx.st_mode == SRS_TX_BW_AGGR);
4897da14cebeSEric Cheng
4898da14cebeSEric Cheng if (ringp->s_ring_type & ST_RING_WORKER_ONLY) {
4899da14cebeSEric Cheng /* Serialization mode */
4900da14cebeSEric Cheng
4901da14cebeSEric Cheng mutex_enter(&ringp->s_ring_lock);
4902da14cebeSEric Cheng if (ringp->s_ring_count > ringp->s_ring_tx_hiwat) {
4903da14cebeSEric Cheng cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4904da14cebeSEric Cheng flag, ret_mp);
4905da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4906da14cebeSEric Cheng return (cookie);
4907da14cebeSEric Cheng }
4908da14cebeSEric Cheng MAC_COUNT_CHAIN(mac_srs, mp_chain, tail, cnt, sz);
4909da14cebeSEric Cheng TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);
4910da14cebeSEric Cheng if (ringp->s_ring_state & (S_RING_BLOCK | S_RING_PROC)) {
4911da14cebeSEric Cheng /*
4912da14cebeSEric Cheng * If ring is blocked due to lack of Tx
4913da14cebeSEric Cheng * descs, just return. Worker thread
4914da14cebeSEric Cheng * will get scheduled when Tx desc's
4915da14cebeSEric Cheng * become available.
4916da14cebeSEric Cheng */
4917da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4918da14cebeSEric Cheng return (cookie);
4919da14cebeSEric Cheng }
4920da14cebeSEric Cheng mac_soft_ring_worker_wakeup(ringp);
4921da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4922da14cebeSEric Cheng return (cookie);
4923da14cebeSEric Cheng } else {
4924da14cebeSEric Cheng /* Default fanout mode */
4925da14cebeSEric Cheng /*
4926da14cebeSEric Cheng * S_RING_BLOCK is set when the underlying NIC runs
4927da14cebeSEric Cheng * out of Tx descs and messages start getting
4928da14cebeSEric Cheng * queued. It won't get reset until
4929da14cebeSEric Cheng * tx_srs_drain() completely drains out the
4930da14cebeSEric Cheng * messages.
4931da14cebeSEric Cheng */
4932da14cebeSEric Cheng mac_tx_stats_t stats;
4933da14cebeSEric Cheng
4934da14cebeSEric Cheng if (ringp->s_ring_state & S_RING_ENQUEUED) {
4935da14cebeSEric Cheng /* Tx descs/resources not available */
4936da14cebeSEric Cheng mutex_enter(&ringp->s_ring_lock);
4937da14cebeSEric Cheng if (ringp->s_ring_state & S_RING_ENQUEUED) {
4938da14cebeSEric Cheng cookie = mac_tx_sring_enqueue(ringp, mp_chain,
4939da14cebeSEric Cheng flag, ret_mp);
4940da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4941da14cebeSEric Cheng return (cookie);
4942da14cebeSEric Cheng }
4943da14cebeSEric Cheng /*
4944da14cebeSEric Cheng * While we were computing mblk count, the
4945da14cebeSEric Cheng * flow control condition got relieved.
4946da14cebeSEric Cheng * Continue with the transmission.
4947da14cebeSEric Cheng */
4948da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4949da14cebeSEric Cheng }
4950da14cebeSEric Cheng
4951da14cebeSEric Cheng mp_chain = mac_tx_send(ringp->s_ring_tx_arg1,
49520dc2366fSVenugopal Iyer ringp->s_ring_tx_arg2, mp_chain, &stats);
4953da14cebeSEric Cheng
4954da14cebeSEric Cheng /*
4955da14cebeSEric Cheng * Multiple threads could be here sending packets.
4956da14cebeSEric Cheng * Under such conditions, it is not possible to
4957da14cebeSEric Cheng * atomically set the S_RING_BLOCK bit to indicate
4958da14cebeSEric Cheng * an out-of-Tx-desc condition. To set it atomically,
4959da14cebeSEric Cheng * we queue the returned packet and do
4960da14cebeSEric Cheng * the setting of S_RING_BLOCK in
4961da14cebeSEric Cheng * mac_tx_soft_ring_drain().
4962da14cebeSEric Cheng */
4963da14cebeSEric Cheng if (mp_chain != NULL) {
4964da14cebeSEric Cheng mutex_enter(&ringp->s_ring_lock);
4965da14cebeSEric Cheng cookie =
4966da14cebeSEric Cheng mac_tx_sring_enqueue(ringp, mp_chain, flag, ret_mp);
4967da14cebeSEric Cheng mutex_exit(&ringp->s_ring_lock);
4968da14cebeSEric Cheng return (cookie);
4969da14cebeSEric Cheng }
49700dc2366fSVenugopal Iyer SRS_TX_STATS_UPDATE(mac_srs, &stats);
49710dc2366fSVenugopal Iyer SOFTRING_TX_STATS_UPDATE(ringp, &stats);
49720dc2366fSVenugopal Iyer
497336f99a58SToomas Soome return (0);
4974da14cebeSEric Cheng }
4975da14cebeSEric Cheng }