TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How does the
      proxying work get distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: a lock inversion seems to be looming in the
      design: when working with an op, we might need a lock on both client and
      upstream, but depending on where we started, we might want to start by
      locking one, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
      the limit, then lose an upstream connection and accept() a client, we
      wouldn't be able to initiate a new one. A bit of a DoS... But probably not
      a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
      libevent/libuv-based load balancer since they take care of that, except
      edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
      binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? Would still need some updates
      in the core and the module/subsystem would be a very invasive one. On the
      other hand, it allows exposing live configuration and monitoring over
      LDAP over the current slapd listeners without re-inventing the wheel.


Expecting to handle only LDAPv3

terms:
  server - configured target
  upstream - a single connection to a server
  client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )`, use
queues and put timeouts in place

Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
  the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work
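
A minimal sketch of that layout, assuming libevent 2.x (one of the options in
the TODO list): a main-thread event base for signals plus one event base per
worker thread. The listener/accept part is omitted and all names here
(worker, WORKER_THREADS, worker_loop) are illustrative, not existing code.

```c
/* Sketch only: one event base on the main thread for signals, one event base
 * per worker thread for connection I/O. */
#include <signal.h>
#include <pthread.h>
#include <event2/event.h>

#define WORKER_THREADS 4

struct worker {
    pthread_t tid;
    struct event_base *base;        /* client/upstream I/O registered here */
} workers[WORKER_THREADS];

static void *
worker_loop( void *arg )
{
    struct worker *w = arg;
    /* keep looping even while no events are registered yet */
    event_base_loop( w->base, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

static void
sigint_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

int
main( void )
{
    struct event_base *main_base = event_base_new();
    struct event *sigint_ev;

    for ( int i = 0; i < WORKER_THREADS; i++ ) {
        workers[i].base = event_base_new();
        pthread_create( &workers[i].tid, NULL, worker_loop, &workers[i] );
    }

    /* the main thread only deals with signals (and coordination) */
    sigint_ev = evsignal_new( main_base, SIGINT, sigint_cb, main_base );
    event_add( sigint_ev, NULL );

    event_base_dispatch( main_base );
    return 0;
}
```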

Operational behaviour
------

- client read -> upstream write:
  - client read (see the read-loop sketch after this list):
    - if TLS_SETUP, keep processing, set state back when finished and note that
      we're under TLS
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're waiting
      for)
    - peek at op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned, clear
          client link (would it be fast enough if we removed them from the
          upstream map instead?)
        - locked per op:
          - remove op from upstream map
          - check upstream is not write-suspended, if it is ...
          - try to write the abandon op to upstream, suspend upstream if not
            fully sent
          - remove op from client map (how if we're in avl_apply?, another pass?)
        - would be nice if we could wipe the complete client map then, otherwise
          we need to queue it to have it freed when all abandons get passed onto
          the upstream (just dropping them might put extra strain on upstreams,
          we will probably have a queue on each client/upstream anyway, not just
          a single Ber)
      - bind:
        - check mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put DN into authzid
        - pick upstream, create PDU and send
      - abandon:
        - find op, mark for abandon, send to appropriate upstream
      - Exop:
        - check not BINDING (unless it's a cancel?)
        - check OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check not BINDING
        - pick an upstream
        - create a PDU, send (marking upstream suspended if not written in full)
      - check if we should read again (keep a counter of the number of times to
        read off a connection in a single pass so that we maintain fairness)
      - if we read enough requests and can still read, re-queue ourselves (if we
        don't have true edge-triggered I/O, we can just register the fd again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're waiting
      for)
    - when we get it, peek at msgid, resolve client connection, lock, check:
      - if unsolicited, handle as close (and mark connection closing)
      - if op is abandoned or does not exist, drop PDU and op, update counters
      - if client backlogged, suspend upstream, register callback to unsuspend
        (on progress when writing to client or abandon from client (connection
        death, abandon proper, ...))
    - reconstruct final PDU, write BER to client, if we did not write it fully,
      suspend client
    - if a final response, decrement operation counts on upstream and client
    - check if we should read again (keep a counter of the number of responses
      to read off a connection in a single pass so that we don't starve any?)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER if any
    - on successful write, pick all pending ops that need a failure response,
      push to client (are there any controls that need to be present in the
      response even in the case of failure?, what to do with them?)
    - on successfully flushing them, walk through suspended upstreams, picking
      the pending PDU (unsuspending the upstream) and writing, if the PDU
      flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend client
      (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark clients suspended that
    have ops that need responses (another queue associated with the client to
    speed this up?)
  - schedule a new connection open
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send abandon
  - send timeLimitExceeded/adminLimitExceeded to client
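
A hedged sketch of the client-read path above: pull complete PDUs off the
connection with ber_get_next(), recover the msgid and the protocol op tag, and
bound the number of PDUs handled per wakeup for fairness. Only the
liblber/ldap.h calls and tags are real; lload_conn, MAX_PDUS_PER_CYCLE and the
dispatch comments are assumptions.

```c
#include <ldap.h>
#include <lber.h>

#define MAX_PDUS_PER_CYCLE 10   /* fairness: bounded work per wakeup */

typedef struct lload_conn {
    Sockbuf *c_sb;              /* non-blocking socket wrapped in a Sockbuf */
} lload_conn;

/* returns 1 if the connection should be re-queued for another read pass */
static int
client_read_cycle( lload_conn *c )
{
    for ( int i = 0; i < MAX_PDUS_PER_CYCLE; i++ ) {
        BerElement *ber = ber_alloc_t( LBER_USE_DER );
        ber_len_t len;
        ber_int_t msgid;
        ber_tag_t tag = ber_get_next( c->c_sb, &len, ber );

        if ( tag != LDAP_TAG_MESSAGE ) {
            /* no complete PDU buffered yet (or a framing error) - re-arm the
             * read event and come back when more data arrives */
            ber_free( ber, 1 );
            return 0;
        }

        if ( ber_scanf( ber, "{i", &msgid ) == LBER_ERROR ) {
            ber_free( ber, 1 );
            return 0;           /* protocol error: close the client instead */
        }

        switch ( ber_peek_tag( ber, &len ) ) {  /* the protocol op tag */
        case LDAP_REQ_UNBIND:   /* abandon everything, tear the client down */
        case LDAP_REQ_BIND:     /* abandon pending ops, switch to BINDING */
        case LDAP_REQ_ABANDON:  /* forward to the upstream owning the op */
        case LDAP_REQ_EXTENDED: /* StartTLS/VC/cancel need inspection */
        default:                /* pick an upstream and forward the PDU */
            break;
        }
        ber_free( ber, 1 );
    }
    return 1;   /* budget used up, there may be more PDUs waiting */
}
```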

Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights
  - while there is an upstream in the level:
    - check number of ops in-flight (this is where we lock the upstream map)
    - find the least busy connection (and check if a new connection should be
      opened)
    - try to lock for socket write, if available (no BER queued) we have our
      upstream
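
Roughly, the per-level selection could look like the sketch below; the data
structures (server/connection lists, op counts, write mutex) are made-up names
illustrating the algorithm, and the weighted shuffle is assumed to have
produced order[] already.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct upstream_conn {
    pthread_mutex_t uc_write_mutex; /* held while a BER is queued/being sent */
    int uc_ops_in_flight;
    struct upstream_conn *uc_next;
} upstream_conn;

typedef struct server {
    upstream_conn *s_conns;         /* established connections to this server */
} server;

/* order[] is one level's servers in the weighted random order picked above */
static upstream_conn *
pick_upstream( server **order, int nservers )
{
    for ( int i = 0; i < nservers; i++ ) {
        upstream_conn *best = NULL;

        for ( upstream_conn *c = order[i]->s_conns; c; c = c->uc_next ) {
            if ( best == NULL || c->uc_ops_in_flight < best->uc_ops_in_flight )
                best = c;
        }

        /* only take it if nothing is queued for write right now */
        if ( best && pthread_mutex_trylock( &best->uc_write_mutex ) == 0 )
            return best;    /* caller writes the PDU, then unlocks */
    }
    return NULL;            /* level exhausted, caller moves to the next one */
}
```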

PDU processing:
- request (have an upstream selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, with the need for a freelist lock, we can
    make it a cache for freed operation structures, avoiding some malloc
    traffic; to reset, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from remote IP, own name, authzid
  - send it over
  - insert the Op into the client and upstream maps
- response/intermediate/entry:
  - look up the Op in the upstream's map
  - write back the client's original msgid, the rest of the response can go
    unchanged
  - if a final response, remove the Op from all maps (client and upstream)
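
As a sketch of the request-path bookkeeping, an Op might record both msgids
and be linked into both maps; every name and field below is illustrative.

```c
#include <lber.h>

struct lload_client;            /* client connection, opaque here */
struct lload_upstream;

typedef struct lload_op {
    ber_int_t o_client_msgid;   /* msgid as the client sent it */
    ber_int_t o_upstream_msgid; /* msgid we allocated on the upstream */
    struct lload_client *o_client;
    struct lload_upstream *o_upstream;
} lload_op;

typedef struct lload_upstream {
    ber_int_t u_next_msgid;     /* protected by the upstream map/write lock */
} lload_upstream;

/* called with the upstream's write lock and map lock already held */
static void
op_assign_msgid( lload_op *op, lload_upstream *u )
{
    op->o_upstream_msgid = u->u_next_msgid++;
    /* insert into the upstream map keyed by o_upstream_msgid, then into the
     * client map keyed by o_client_msgid; on the response path the PDU is
     * re-emitted with o_client_msgid before being forwarded to the client */
}
```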

Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75 % of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)
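
The 75 % rule in the last bullet might amount to something like this
(hypothetical struct and fields, integer arithmetic to avoid floats):

```c
/* Returns nonzero if another connection to this server should be scheduled. */
typedef struct lload_server {
    int s_active_conns;     /* established upstream connections */
    int s_suspended_conns;  /* connections currently write-suspended */
    int s_max_conns;        /* configured connection limit */
    int s_ops_in_flight;    /* ops outstanding across all connections */
    int s_op_limit;         /* per-connection op limit */
} lload_server;

static int
upstream_should_grow( const lload_server *s )
{
    if ( s->s_active_conns >= s->s_max_conns )
        return 0;           /* connection limit hit */
    if ( s->s_suspended_conns > 0 )
        return 1;           /* a connection is suspended */
    /* in-flight ops over 75 % of aggregate capacity? (4*ops > 3*capacity) */
    return 4 * s->s_ops_in_flight > 3 * s->s_active_conns * s->s_op_limit;
}
```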

Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a perfect ordering of when
    each operation is to time out; we would need to maintain yet another place
    where operations can be found.
    - the locking protocol for disposing of the operation would need to be
      adjusted and might become even more complicated, might do the alternative
      initially and then attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often. With many in-flight operations this might be a lot of wasted work.
    - we still need to sweep over all clients to check if they should be killed
      anyway
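
The sweep option could be as simple as the following, run periodically; the
client/op lists and the actions in the comments are assumptions.

```c
#include <time.h>
#include <stddef.h>

typedef struct swept_op {
    time_t op_started;
    int op_abandoned;
    struct swept_op *op_next;
} swept_op;

typedef struct swept_client {
    time_t c_last_activity;
    swept_op *c_ops;
    struct swept_client *c_next;
} swept_client;

static void
timeout_sweep( swept_client *clients, time_t op_timeout, time_t idle_timeout )
{
    time_t now = time( NULL );

    for ( swept_client *c = clients; c; c = c->c_next ) {
        for ( swept_op *op = c->c_ops; op; op = op->op_next ) {
            if ( !op->op_abandoned && now - op->op_started > op_timeout ) {
                op->op_abandoned = 1;
                /* send abandon upstream, timeLimitExceeded to the client */
            }
        }
        if ( c->c_ops == NULL && now - c->c_last_activity > idle_timeout ) {
            /* send notice of disconnection and schedule the close */
        }
    }
}
```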

Dispatcher thread (2^n of them, fd x is handled by thread no. x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- work threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux?
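
Since the dispatcher count is a power of two, the fd-to-thread mapping is just
a bit mask:

```c
/* With 2^n dispatcher threads, fd x always belongs to dispatcher x % 2^n. */
#define DISPATCHER_SHIFT 2                  /* n = 2, i.e. 4 dispatchers */
#define DISPATCHER_COUNT ( 1 << DISPATCHER_SHIFT )

static inline int
dispatcher_for_fd( int fd )
{
    return fd & ( DISPATCHER_COUNT - 1 );   /* same as fd % DISPATCHER_COUNT */
}
```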

Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O
  threads

Threading:
- if using slap_sl_malloc, how much perf do we gain? To allocate a context per
  op, we should have a dedicated parent context so that when we free it, we can
  use that exclusively. The parent context's parent would be the main thread's
  context. This implies a lot of slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 )
  and making sure an op does not allocate/free things from two threads at the
  same time (might need an Op mutex after all? Not such a huge cost if we
  routinely reuse Op structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from the
  connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from the
  client and upstream maps, each counter is consistent when a thread has a lock
  on the corresponding map); when decreasing a counter to zero, start the
  freeing procedure
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock client, insert (client refcount 0), unlock, lock upstream,
  decrement refcount (triggers a test whether we need to drop it now), unlock
  upstream, done
- when the upstream side processes a PDU, it locks its map, increments the
  counter (potentially removes the op if it's a response), unlocks, locks the
  client's map, write mutex (this order?) and the full client mutex (if a bind
  response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex, initiates the
  freeing procedure from the upstream side (or, if we have to remember we've
  already increased the client-side refcount, mark for deletion, release the
  upstream lock, lock client, decref, either triggering deletion from the
  client side or marking for it)
- if we have an operation lock, we can simplify a bit (no need for the
  three-stage locking above)
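
To make the two-counter disposal concrete, a hedged sketch of the release
step; the struct layout, the freelist mutex and free() standing in for the op
freelist/cache are all assumptions.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct LloadOp {
    int o_client_refcnt;    /* consistent while holding the client map lock */
    int o_upstream_refcnt;  /* consistent while holding the upstream map lock */
    int o_client_done;      /* disposal markers, protected by the */
    int o_upstream_done;    /* freelist mutex below */
} LloadOp;

static pthread_mutex_t freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Called after one side's refcount has dropped to zero and the op has been
 * unlinked from that side's map (that map lock still held by the caller). */
static void
op_release_side( LloadOp *op, int client_side )
{
    int free_it;

    pthread_mutex_lock( &freelist_mutex );
    if ( client_side )
        op->o_client_done = 1;
    else
        op->o_upstream_done = 1;
    free_it = op->o_client_done && op->o_upstream_done;
    pthread_mutex_unlock( &freelist_mutex );

    if ( free_it )
        free( op );     /* or return it to the op freelist/cache */
}
```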

Shutdown:
- stop the accept() thread(s) - potentially add a channel to hand these
  listening sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client - return LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client, if no
    ops are pending, send an unsolicited response and close (RFC4511 suggests
    the unsolicited response is the last PDU coming from the upstream and
    libldap agrees, so we can't send it for a socket we want to shut down more
    gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send unbind to all upstreams
  - send unsolicited to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()
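
The phases above could be tracked with a simple state checked wherever PDUs
are accepted or forwarded (names illustrative):

```c
/* Illustrative shutdown states matching the steps above. */
enum lload_shutdown_phase {
    LLOAD_RUNNING = 0,
    LLOAD_SHUTDOWN_VERY_GENTLE, /* refuse new PDUs, drain pending responses */
    LLOAD_SHUTDOWN_GENTLE,      /* abandon ops, unbind upstreams, notify clients */
    LLOAD_SHUTDOWN_IMMINENT     /* close everything and exit() */
};
```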

RootDSE:
- the default option is not to care, and if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load-balancer's
  config
- another is not to care about the search request but to check each search
  entry being passed back, check the DN and if it's a rootDSE, filter the list
  of controls/exops/sasl mechs (external!) that are supported
- the last one is to check all search requests for the DN/scope and synthesise
  the response locally - probably not (we would need to configure the complete
  list of controls, exops, sasl mechs, naming contexts in the balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients we need to be sure we can't
  create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start sending
    new requests from live clients (those that are suspended are so due to
    their own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but that means there are no outstanding ops to receive, so
      it holds that !suspended(client) or !suspended(upstream), and they
      cannot participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - should only be able to have one per connection (that's a problem
      for later, maybe even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - or if we have a queue for pending BERs on the server, we do not need to
    suspend clients, an upstream is only chosen if the queue is free or there
    is a reason to send it to that particular upstream (multi-stage bind/VC,
    PR, ...), but that still makes it possible for a client to exhaust all our
    memory by sending requests (VC or other ones bound to a slow upstream, or
    by not reading the responses at all)