TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How is the proxy
  work distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: it seems like a lock inversion is looming in the
  design: when working with an op, we might need a lock on both the client
  and the upstream, but depending on where we started, we might want to lock
  one first, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
  the limit, then lose an upstream connection and accept() a client, we
  wouldn't be able to initiate a new one. A bit of a DoS... But probably not
  a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
  libevent/libuv-based load balancer since they take care of that, except
  edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
  binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? It would still need some
  updates in the core, and the module/subsystem would be a very invasive one.
  On the other hand, it allows exposing live configuration and monitoring over
  LDAP over the current slapd listeners without re-inventing the wheel.


Expecting to handle only LDAPv3

terms:
  server - configured target
  upstream - a single connection to a server
  client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )`, use
queues and put timeouts in place

Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
  the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work
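
A minimal sketch of the thread layout above, assuming a libevent-based
implementation (one of the options raised in the TODO list); the worker count,
names and handoff details are illustrative only, not existing code:

    /* cc sketch.c -levent -levent_pthreads -lpthread */
    #include <event2/event.h>
    #include <event2/thread.h>
    #include <pthread.h>
    #include <signal.h>

    #define N_WORKERS 4

    struct worker {
        pthread_t tid;
        struct event_base *base;   /* each I/O worker owns its own event base */
    };

    static struct worker workers[N_WORKERS];

    static void
    sig_cb( evutil_socket_t sig, short what, void *arg )
    {
        (void)sig; (void)what;
        /* main thread owns signal handling; tell every loop to wind down */
        for ( int i = 0; i < N_WORKERS; i++ )
            event_base_loopexit( workers[i].base, NULL );
        event_base_loopexit( arg, NULL );
    }

    static void *
    worker_loop( void *arg )
    {
        struct worker *w = arg;
        /* keep running even while no client/upstream events are registered
         * yet (libevent 2.1 flag) */
        event_base_loop( w->base, EVLOOP_NO_EXIT_ON_EMPTY );
        return NULL;
    }

    int
    main( void )
    {
        evthread_use_pthreads();    /* events may be added from other threads */

        struct event_base *main_base = event_base_new();
        struct event *sigint = evsignal_new( main_base, SIGINT, sig_cb,
                main_base );
        event_add( sigint, NULL );

        for ( int i = 0; i < N_WORKERS; i++ ) {
            workers[i].base = event_base_new();
            pthread_create( &workers[i].tid, NULL, worker_loop, &workers[i] );
        }

        /* the listener thread(s) would accept() and register each new socket
         * with one of the worker bases; omitted here */

        event_base_dispatch( main_base );   /* signals only */

        for ( int i = 0; i < N_WORKERS; i++ )
            pthread_join( workers[i].tid, NULL );
        return 0;
    }

With evthread_use_pthreads() in effect, the listener thread(s) could register
accepted sockets directly on another thread's event base (round-robin or
fd % n), which is the handoff the list above describes.
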
Operational behaviour
------

- client read -> upstream write:
  - client read:
    - if TLS_SETUP, keep processing, set state back when finished and note that
      we're under TLS
    - ber_get_next(); if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - peek at op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned,
          clear the client link (would it be fast enough if we removed them
          from the upstream map instead?)
        - locked per op:
          - remove op from upstream map
          - check upstream is not write-suspended, if it is ...
          - try to write the abandon op to the upstream, suspend the upstream
            if not fully sent
          - remove op from client map (how, if we're in avl_apply? another
            pass?)
        - it would be nice if we could wipe the complete client map then,
          otherwise we need to queue it to have it freed when all abandons get
          passed onto the upstream (just dropping them might put extra strain
          on upstreams, we will probably have a queue on each client/upstream
          anyway, not just a single Ber)
      - bind:
        - check mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put DN into authzid
        - pick an upstream, create a PDU and send it
      - abandon:
        - find op, mark for abandon, send to appropriate upstream
      - Exop:
        - check not BINDING (unless it's a cancel?)
        - check OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check not BINDING
        - pick an upstream
        - create a PDU, send it (marking the upstream suspended if not written
          in full)
    - check whether we should read again (keep a counter of the number of times
      to read off a connection in a single pass so that we maintain fairness)
    - if we have read enough requests and can still read, re-queue ourselves
      (if we don't have true edge-triggered I/O, we can just register the fd
      again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(); if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - when we get it, peek at the msgid, resolve the client connection, lock,
      check:
      - if unsolicited, handle as close (and mark connection closing)
      - if the op is abandoned or does not exist, drop the PDU and op, update
        counters
      - if the client is backlogged, suspend the upstream and register a
        callback to unsuspend it (on progress when writing to the client, or
        on an abandon from the client (connection death, abandon proper, ...))
    - reconstruct the final PDU, write the BER to the client; if it did not
      write fully, suspend the client
    - if a final response, decrement operation counts on upstream and client
    - check whether we should read again (keep a counter of the number of
      responses to read off a connection in a single pass so that we don't
      starve any?)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER if any
    - on a successful write, pick all pending ops that need a failure response
      and push them to the client (are there any controls that need to be
      present in the response even in the case of failure? what to do with
      them?)
    - on successfully flushing them, walk through the suspended upstreams,
      picking the pending PDU (unsuspending the upstream) and writing it; if
      the PDU flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the client
      (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark as suspended those
    clients that have ops which still need responses (another queue associated
    with the client to speed this up?)
  - schedule opening a new connection
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send abandon
  - send timeLimitExceeded/adminLimitExceeded to the client

Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream
      map)
    - find the least busy connection (and check whether a new connection
      should be opened)
    - try to lock the socket for writing; if available (no BER queued), we
      have our upstream
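
A rough sketch of the selection loop above; Upstream, Connection, Level and the
weighted shuffle are made up to illustrate the intended control flow, not
existing code:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct Connection {
        struct Connection *next;
        pthread_mutex_t write_mutex;
        int pending_ber;        /* a BER is already queued: write-suspended */
        int ops_in_flight;
    } Connection;

    typedef struct Upstream {
        pthread_mutex_t map_mutex;  /* guards the op map / in-flight counts */
        Connection *conns;
        int weight;
    } Upstream;

    typedef struct Level {
        Upstream **ups;
        int n;
    } Level;

    /* order ups[0..n) by repeatedly drawing an index with probability
     * proportional to its weight (a simple weighted shuffle) */
    static void
    weighted_shuffle( Upstream **ups, int n )
    {
        for ( int i = 0; i < n; i++ ) {
            int total = 0, pick, j;

            for ( j = i; j < n; j++ ) total += ups[j]->weight;
            pick = total ? rand() % total : 0;
            for ( j = i; j < n - 1; j++ ) {
                if ( pick < ups[j]->weight ) break;
                pick -= ups[j]->weight;
            }
            Upstream *tmp = ups[i]; ups[i] = ups[j]; ups[j] = tmp;
        }
    }

    /* returns a connection with its write_mutex held, or NULL if nothing is
     * available and the caller has to queue or fail the op */
    Connection *
    pick_upstream( Level *levels, int nlevels )
    {
        for ( int l = 0; l < nlevels; l++ ) {
            weighted_shuffle( levels[l].ups, levels[l].n );

            for ( int i = 0; i < levels[l].n; i++ ) {
                Upstream *u = levels[l].ups[i];
                Connection *best = NULL;

                /* this is where we lock the upstream map to read op counts */
                pthread_mutex_lock( &u->map_mutex );
                for ( Connection *c = u->conns; c; c = c->next ) {
                    if ( !best || c->ops_in_flight < best->ops_in_flight )
                        best = c;
                }
                /* (this is also where we would check whether a new connection
                 * should be opened) */
                pthread_mutex_unlock( &u->map_mutex );

                if ( best &&
                        pthread_mutex_trylock( &best->write_mutex ) == 0 ) {
                    if ( !best->pending_ber )
                        return best;
                    pthread_mutex_unlock( &best->write_mutex );
                }
            }
        }
        return NULL;
    }

The trylock keeps selection from blocking behind a write-suspended connection;
on success the caller already holds the write mutex and can start sending the
PDU immediately.
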
PDU processing:
- request (have an upstream selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, given the need for a freelist lock, we
    can make it a cache for freed operation structures, avoiding some malloc
    traffic; to reset, we need slap_sl_mem_create( ,,, 1 ))
  - check that proxyauthz is not already present? Or just let the upstream
    reject it if there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from remote IP, own name, authzid
  - send it over
  - insert the Op into the client and upstream maps
- response/intermediate/entry:
  - look up the Op in the upstream's map
  - write the old msgid, the rest of the response can go unchanged
  - if a response, remove the Op from all maps (client and upstream)

Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75% of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)

Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a perfect ordering of when
    each operation is to time out; this would mean maintaining yet another
    place where operations can be found.
    - the locking protocol for disposing of an operation would need to be
      adjusted and might become even more complicated; we might do the
      alternative initially and then attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often. With many in-flight operations this might be a lot of wasted work.
    - we still need to sweep over all clients to check whether they should be
      killed anyway
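
A sketch of the sweep-based option above, walking all clients every so often;
Client, Op, timeout_sweep and the timeout values are hypothetical placeholders
for whatever structures the balancer ends up using:

    #include <pthread.h>
    #include <time.h>

    typedef struct Op {
        struct Op *next;
        time_t started;
        int abandoned;
    } Op;

    typedef struct Client {
        struct Client *next;
        pthread_mutex_t mutex;      /* less contended than the upstream maps */
        Op *ops;                    /* pending operations */
        time_t last_activity;
    } Client;

    void
    timeout_sweep( Client *clients, time_t op_timeout, time_t idle_timeout )
    {
        time_t now = time( NULL );

        for ( Client *c = clients; c; c = c->next ) {
            pthread_mutex_lock( &c->mutex );

            for ( Op *op = c->ops; op; op = op->next ) {
                if ( !op->abandoned && now - op->started > op_timeout ) {
                    /* mark for abandon; the real code would queue an Abandon
                     * to the op's upstream and a timeLimitExceeded/
                     * adminLimitExceeded result to the client */
                    op->abandoned = 1;
                }
            }

            if ( !c->ops && now - c->last_activity > idle_timeout ) {
                /* client inactive: the real code would send a notice of
                 * disconnection and close the connection */
            }

            pthread_mutex_unlock( &c->mutex );
        }
    }

The same pass covers both halves of the problem noted above: operation time
limits and killing idle clients.
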
Dispatcher threads (2^n of them, fd x is handled by thread number x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- worker threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux?

Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O
  threads

Threading:
- if using slap_sl_malloc, how much performance do we gain? To allocate a
  context per op, we should have a dedicated parent context so that when we
  free it, we can use that exclusively. The parent context's parent would be
  the main thread's context. This implies a lot of
  slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 ) and making sure an op does not
  allocate/free things from two threads at the same time (might need an Op
  mutex after all? Not such a huge cost if we routinely reuse Op structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from a
  connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from both
  the client and the upstream map; each counter is consistent only while the
  thread holds a lock on the corresponding map); when a counter drops to zero,
  start the freeing procedure
- a place to mark disposal as finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating it
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock the client, insert (client refcount 0), unlock, lock the
  upstream, decrement the refcount (which triggers a test whether we need to
  drop it now), unlock the upstream, done
- when the upstream processes a PDU, it locks its map, increments the counter
  (potentially removing the op if it's a response), unlocks, then locks the
  client's map, write mutex (in this order?) and the full client mutex (if a
  bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex and initiates the
  freeing procedure from the upstream side (or, if we have to remember that
  we've already increased the client-side refcount, mark it for deletion, drop
  the upstream lock, lock the client, decref, either triggering deletion from
  the client side or marking it for later)
- if we have an operation lock, we can simplify things a bit (no need for the
  three-stage locking above)
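
A minimal sketch of the two-refcount scheme above; the structure layout and the
freelist are illustrative only, and only the client-side release is shown (the
upstream side is symmetric):

    #include <pthread.h>

    typedef struct Op {
        int client_refcnt;     /* consistent only under the client map mutex */
        int upstream_refcnt;   /* consistent only under the upstream map mutex */
        int client_done;       /* disposal finished on the client side... */
        int upstream_done;     /* ...and on the upstream side */
        struct Op *next;       /* freelist link */
    } Op;

    static pthread_mutex_t freelist_mutex = PTHREAD_MUTEX_INITIALIZER;
    static Op *op_freelist;

    /* called with the client map mutex held */
    void
    op_release_client_side( Op *op )
    {
        if ( --op->client_refcnt > 0 )
            return;

        /* the "place to mark disposal finished for each side", kept
         * consistent by the freelist lock; whichever side finishes last
         * recycles the op */
        pthread_mutex_lock( &freelist_mutex );
        op->client_done = 1;
        if ( op->upstream_done ) {
            /* return the structure to the cache instead of free()ing it,
             * avoiding some malloc traffic */
            op->next = op_freelist;
            op_freelist = op;
        }
        pthread_mutex_unlock( &freelist_mutex );
    }

Exactly one side ever observes both flags set, so the op is recycled once; the
per-operation lock mentioned above would simplify this bookkeeping at the cost
of another mutex per op.
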
Shutdown (a rough sketch of the phases follows at the end of these notes):
- stop the accept() thread(s) - potentially add a channel to hand these
  listening sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client, return LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511 suggests
    the unsolicited response is the last PDU coming from the upstream and
    libldap agrees, so we can't send it on a socket we want to shut down more
    gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send an unbind to all upstreams
  - send an unsolicited response to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()

RootDSE:
- the default option is not to care: if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load balancer's
  config
- another is not to care about the search request but to check each search
  entry being passed back: check the DN and, if it's a rootDSE, filter the list
  of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last one is to check all search requests for the DN/scope and synthesise
  the response locally - probably not (we would need to configure the complete
  list of controls, exops, SASL mechs and naming contexts in the balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients, we need to be sure we can't
  create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start sending
    new requests from live clients (those that are suspended are due to their
    own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but that means there are no outstanding ops to receive; it
      holds that !suspended(client) | !suspended(upstream), so they cannot
      participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - should only be able to have one per connection (that's a problem
      for later, maybe even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - or, if we have a queue for pending BERs on the server, we do not need to
    suspend clients; an upstream is only chosen if the queue is free or there
    is a reason to send the request to that particular upstream (multi-stage
    bind/VC, PR, ...), but that still makes it possible for a client to exhaust
    all our memory by sending requests (VC or other ones bound to a slow
    upstream, or by not reading the responses at all)
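
Returning to the Shutdown section, a rough sketch of its phases as a state the
request path can consult when a new client PDU arrives; the enum, the global
and the helper are hypothetical (LDAP_SUCCESS/LDAP_UNAVAILABLE come from
ldap.h):

    #include <ldap.h>       /* LDAP_SUCCESS, LDAP_UNAVAILABLE */

    enum lb_shutdown {
        LB_RUNNING = 0,
        LB_SHUTDOWN_VERY_GENTLE,   /* refuse new work, let pending ops finish */
        LB_SHUTDOWN_GENTLE,        /* abandon everything, notify everyone */
        LB_SHUTDOWN_IMMINENT       /* close all connections and exit() */
    };

    static enum lb_shutdown shutdown_phase = LB_RUNNING;

    /* result code to send for a new, non-Abandon request PDU from a client;
     * LDAP_SUCCESS means "process it normally" */
    int
    client_request_result( void )
    {
        switch ( shutdown_phase ) {
        case LB_RUNNING:
            return LDAP_SUCCESS;
        case LB_SHUTDOWN_VERY_GENTLE:
            /* connections are marked closing but pending operations are
             * still passed through */
            return LDAP_UNAVAILABLE;
        default:
            /* gentle/imminent: ops are being abandoned and unsolicited
             * responses sent, the connection is about to go away anyway */
            return LDAP_UNAVAILABLE;
        }
    }
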