From c0d4f5d1ccf5cd4c2a4a4c3c4e19179aada1e7c6 Mon Sep 17 00:00:00 2001 From: Yuefeng Zhou Date: Fri, 6 Mar 2020 15:59:34 -0800 Subject: [PATCH 01/10] Initial RFC for single-client parameter server training. --- ...20200306-single-client-parameter-server.md | 517 ++++++++++++++++++ .../rebuild_arbitrary_future.png | Bin 0 -> 14706 bytes 2 files changed, 517 insertions(+) create mode 100644 rfcs/20200306-single-client-parameter-server.md create mode 100644 rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md new file mode 100644 index 000000000..bcad965da --- /dev/null +++ b/rfcs/20200306-single-client-parameter-server.md @@ -0,0 +1,517 @@ +# Single-client Parameter Server Training + +| Status | Proposed | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) | +| **Sponsor** | Priya Gupta (priyag@google.com) | +| **Updated** | 2020-03-06 | + +## Background + +Parameter server training is a commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training inputs, pull variable values from parameter servers, compute gradients and send them back to the parameter servers. + + +### Distribution Strategy + +Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. 
We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach traditionally used in TensorFlow distributed training, such as with `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`. + +Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want more control in writing their training loops. The user community of this API is large. We would like to focus on supporting the CTL API first and later abstract out a commonly used pattern for the Keras `compile`/`fit` API. + + +### Single-Client Distributed Training + +We believe that a single-client architecture can provide a simpler programming model than a multi-client setup. A single source of truth can avoid bugs due to inconsistencies in a multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and effectively turn into a multi-client setup. + + +## Goal + +The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and the CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas than on the APIs. + +The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be deferred to other documents. + + +## Overview + + +### Programming Model + +With a single-client architecture, the programming model will be different from that of the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. 
Generally, no matter what high-level APIs users will use, the workflow for running a step function in a distributed fashion with the single-client approach includes the following steps: + + +1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them. +3. Create datasets and iterators on workers. +4. Create the replica function that takes an iterator as input, trace it and register it on all workers. Note: a function may create variables as well. If not specified, they will be created on parameter servers as well. +5. Dispatch the step function on one available worker. +6. Repeat 5 until the end of epoch. +7. Repeat 5 - 6 until the stop criterion is reached. + + +### Interfaces + +One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’s APIs but occasionally we will create APIs that still make sense in other strategies as well. + +**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.** + + +#### Constraints + +Functions are first-class citizens. Apart from creating variables, users should only schedule functions rather than run individual ops. + +Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value. + + +#### Schedule/Join Primitives + +The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to + + +* hide the details of load-balancing, fault tolerance and dynamic scheduling +* expose the non-blocking semantics to users. 
+ +To enable scheduling a function on any worker, we recommend users create the same dataset, possibly shuffled differently, on all workers via the `strategy.experimental_distribute_datasets_from_function` API. + + +```python +class ParameterServerStrategyV2: + + def schedule(self, replica_fn, args=(), kwargs=None): + """Schedule the `replica_fn` on all replicas in a sync group (a worker). + + Schedule the `replica_fn` on all replicas in a sync group that is available, + and return a future of PerReplica immediately if `replica_fn` has return values. + + It implements at-least-once semantics for function execution. If a worker + fails, it will try to reschedule the function on another replica group or throw + an exception to users. So this method assumes that function execution can be + out of order and function inputs are shared between sync groups. + + We don't support the cases where `args` or `kwargs` are bound to a specific + sync group. We will consider supporting them in the future. + + If there are barriers in `replica_fn`, it is users' responsibility to make + sure they won't cause deadlock. + """ + pass + + def join(self): + """Wait until all scheduled functions are finished. + + Raises an error if any of the functions fails to execute. In this case, + there is no guarantee that non-failing functions will complete. + + While `join()` is in progress, calling `schedule` is not allowed. + """ + pass + + def done(self): + """Returns True if there are no pending functions to be executed.""" + pass + + def local_results(self, future_list): + """Get concrete values of the future list. + + Poisoned future objects will give `None`. + """ + pass +``` + + +#### Custom Training Loop + +To construct a custom training loop, users need to + + +* use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`. 
+* create models under `strategy.scope` so variables will be assigned to parameter servers. +* likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within its `replica_fn`, updates the metric state. +* use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. The `strategy.schedule` call only schedules the `replica_fn` and returns one or several `Future` objects immediately. +* use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. Any failure that cannot be handled will be ignored, and as a result some of the results may be `None`. +* call `strategy.join` to wait until all scheduled functions are executed. + +```python +# Connect to remote servers with a user-provided `ClusterResolver` object. +strategy = ParameterServerStrategyV2(cluster_resolver) + +dataset_fn = ... # a function that returns a dataset + +# Clone the dataset on all workers, shuffled with different seeds. +distributed_dataset = strategy.experimental_distribute_datasets_from_function( + dataset_fn) + +with strategy.scope(): + # Create variables on parameter servers in a round-robin fashion. 
+ model = create_model() + optimizer = tf.keras.optimizers.Adam() + accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy") + + @tf.function + def replica_fn(iterator): + x, y = next(iterator) + with tf.GradientTape() as tape: + predictions = model(x, training=True) + loss = compute_loss(y, predictions) + gradients = tape.gradient(loss, model.trainable_variables) + optimizer.apply_gradients(zip(gradients, model.trainable_variables)) + accuracy.update_state(y, predictions) + return loss + + for _ in range(num_epochs): + distributed_iter = iter(distributed_dataset) + for i in range(steps_per_epoch): + # strategy.schedule pushes a closure onto the scheduling queue and + # returns a list of future objects immediately. + loss = strategy.schedule(replica_fn, + args=(distributed_iter,)) + strategy.join() + model.save() # save checkpoint/summary... + print("Loss = %f, accuracy = %f" % ( + strategy.local_results(loss), accuracy.result())) +``` + + +##### Alternative training loop: fully async + +Instead of calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization at epoch boundaries. + + +```python +# … omitted +with strategy.scope(): + # … omitted + for _ in range(total_steps): + strategy.schedule(step_fn, args=(iterators,)) + + # Print accuracy value every minute. + while not strategy.done(): + print("Current accuracy: %f" % accuracy.result()) + time.sleep(60) +# … omitted +``` + +#### Error Reporting From `replica_fn` + +Because `schedule` is non-blocking, an exception raised in `replica_fn` is not returned to users immediately. In fact, an exception may pollute an arbitrary number of in-flight functions that follow the culprit function. We will set the error in the returned `Future` objects for the culprit function and the polluted functions, and we will raise the exception when `join` is called. 
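The error semantics described here can be illustrated with a minimal sketch in plain Python. `ToyScheduler` is a hypothetical stand-in built on the standard `concurrent.futures` module; it is not part of TensorFlow or of the proposed API, and it assumes a single worker thread so that execution order is easy to follow. It shows a culprit function failing, the functions scheduled after it being polluted, and the error reaching the caller only at `join`.

```python
import concurrent.futures


class ToyScheduler:
    """Hypothetical stand-in for schedule/join; not a TensorFlow API."""

    def __init__(self):
        # One worker thread, so closures run in the order they were scheduled.
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self._futures = []
        self._failed = False

    def schedule(self, fn, args=()):
        """Queue `fn` and return a future immediately, without blocking."""
        future = self._pool.submit(self._run, fn, args)
        self._futures.append(future)
        return future

    def _run(self, fn, args):
        # Functions that follow a failed function are polluted by its error.
        if self._failed:
            raise RuntimeError("polluted by an earlier failure")
        try:
            return fn(*args)
        except Exception:
            self._failed = True
            raise

    def join(self):
        """Wait for everything scheduled so far, then raise the first error."""
        futures, self._futures = self._futures, []
        concurrent.futures.wait(futures)
        self._failed = False
        for future in futures:
            if future.exception() is not None:
                raise future.exception()


def step(i):
    if i == 2:
        raise ValueError("bad step")
    return i * 10


scheduler = ToyScheduler()
# Every call returns at once, including the one that will fail later.
futures = [scheduler.schedule(step, args=(i,)) for i in range(5)]

try:
    scheduler.join()
    join_error = None
except ValueError as error:
    join_error = error

# Steps 0 and 1 succeeded, step 2 is the culprit, steps 3 and 4 were polluted.
statuses = [
    "ok" if f.exception() is None else type(f.exception()).__name__
    for f in futures
]
```

In this sketch the caller learns about the failure of step 2 only when `join` runs, by which time steps 3 and 4 have already been poisoned.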
+ +Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`: + +* use a repeated dataset so `OutOfRangeError` will be avoided; +* avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`. + + +#### Dataset Interface + +The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`: + + +```python +for x, y in distributed_iter: + loss = strategy.schedule(replica_fn, args=(x, y)) +``` + + +If we do the same thing with the `strategy.schedule` API, there are several challenges. + +The first challenge is that we don’t know which worker the `get_next` result should be returned to, since the worker that executes the `replica_fn` is decided later. Some late-binding mechanism can be explored. + +The second challenge is that calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. + + +##### Alternative: passing iterators to `strategy.schedule` + +The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps. + + +```python +# … omitted +with strategy.scope(): + # … omitted + distributed_iter = iter(distributed_dataset) + for i in range(total_steps): + strategy.schedule(replica_fn, args=(distributed_iter,)) +# … omitted +``` + + +**We will start with this kind of training loop in our first version.** + + +### Fault Tolerance + + +#### Task Failure + + +##### Worker failure + + +###### When scheduling + +When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. + +For functions that are bound to a specific worker, e.g. 
a resource creation function, they will be queued until the worker is back. + +When the failed worker is back, we will update the cluster configuration with `context.update_server_def`, which also resets all state. After resources on the restarted worker are built, we can resume scheduling functions on the worker. + + +###### When materializing a `Future` object + +It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object. + +We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. + +We can explore mechanisms to recover these objects in the future. In the short term, users can choose to write the results to variables on parameter servers, just like a Keras metric. + + +##### Parameter server failure + +When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up the variables on the other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well. + + +##### Client failure + +When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, and schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts. 
+ +#### Resource Management for Workers + +When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources. + +How resources are tracked and rebuilt depends on how users create them: + + +* we will record iterators created via `tf.distribute`’s API. The state of a rebuilt iterator will be lost; recovering iterator state is future work. +* we will capture the creation of worker-local variables via variable creator scopes. +* in the future we will provide users an API to create worker-local resources. We will capture these resources in the API. + +If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them. + + +#### The Unknown of Scheduled Functions + +For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to informing users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty: + + +* keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is no easy way for the client to know whether a function has been executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from the runtime about how many functions have been executed. +* eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_results`. 
In this way, we can know the status of function execution earlier. +* recommend users schedule only small functions. Large functions are more expensive to retry. + + +### Evaluation + +Historically, `tf.estimator.Estimator` has used a dedicated evaluator that periodically loads a checkpoint and runs evaluation on evaluation data. However, `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. + +With `ParameterServerStrategyV2`, we will start with a **dedicated evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, the training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. This brings the evaluation functionality of Estimator to the Keras API, which is important for encouraging Estimator users to upgrade to TF 2.0. + +We recommend that users create a separate evaluation client that runs the same Python binary as the training client. This binary contains an if-else clause that branches into two paths: + + +```python +if cluster_resolver.task_type == "chief": + run_training_loop() +elif cluster_resolver.task_type == "evaluator": + run_evaluation_loop() +``` + + +Evaluation code: + + +```python +def run_evaluation_loop(...): + """Run the example custom evaluation loop.""" + + eval_dataset, model, eval_accuracy = ... + checkpoint = tf.train.Checkpoint(model=model) + + @tf.function + def eval_fn(eval_dataset): + for _ in range(eval_steps): + ... # evaluation pass + return eval_accuracy.result() + + while True: + latest_checkpoint = get_new_checkpoint() + checkpoint.restore(latest_checkpoint) + eval_result = eval_fn(eval_dataset) # Users can print, early stop, mark ckpt.. 
+``` + + +In the evaluation client, the user loads into the model the checkpoints that the training client saved periodically, runs evaluation over a full pass of the eval dataset, and does whatever they want with the eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.). + + +## Implementation + + +### Low-level Primitives + +We can potentially expose these low-level primitives in the future when they are more stable and when we want to allow more advanced use cases. + +We will have `Cluster` and `Worker` classes to encapsulate logic related to remote function scheduling. + + +```python +class Cluster(object): + + def __init__(self, cluster_resolver, failure_handler=None): + """Create the cluster instance and connect to the remote cluster.""" + pass + + @property + def workers(self): + """Return all available workers.""" + return self._workers + + def schedule(self, function, args=None, kwargs=None): + """Schedule the function on one worker. + + It adds the function to the global scheduling queue and returns future + objects immediately. + """ + pass + + def join(self): + """Block until all scheduled functions are complete.""" + pass +``` + + +We will probably merge this `Worker` with executors. + + +```python +class Worker(object): + + def __init__(self, + worker_job_name, + cluster, + max_scheduled_functions=100): + """Create a scheduling queue and a thread that processes the queue.""" + pass + + def schedule(self, function, args=None, kwargs=None): + """Schedule the function on the worker. + + It adds the function to the scheduling queue. It returns a `Future` object + immediately unless the scheduling queue is full. 
+ """ + pass + + def healthy(self): + """Return a boolean indicating whether the worker is healthy or not.""" + pass + + def _set_dead(self): + """Declare the worker dead and poison all future objects.""" + pass + + def _rebuild_resources(self): + """Rebuild worker-local resources when the worker recovers from failure.""" + pass +``` + + +As mentioned, the return value of `schedule` will be `Future` objects if `function` has return values. A `Future` works as a container and will later be bound to a state of either success or complete failure. Overall, this `Future` class has the following benefits: + + + +* It allows the `schedule` method to return immediately after pushing functions to its scheduling queue, without waiting for acknowledgement from workers. +* It serves as the container for values or errors. It will be bound to a value or an error later. When it is rebuilt, we can replace its underlying value silently. +* When it is passed to `local_results`, we flag it to indicate that its value needs to be fetched eagerly. +* (Future work) It captures the lineage between functions and return values so that we can rebuild any poisoned objects. + +```python +class Future(object): + + def __init__(self, closure): + pass + + def _set_value(self, value): + pass + + def _set_error(self, error): + pass + + def _set_eagerly_fetch(self): + pass + + def _fetch(self): + pass +``` + + + +We can potentially merge this `Future` class with our `Tensor` class. + + +## Future Work + +The following are features we have been considering supporting in the future, although this is not an exhaustive list. We don’t promise to support all of them. We’ll prioritize according to the feedback from the community and users. + + +### Dynamic Membership + +Workers can come and go. To support this, we’ll probably need a mechanism to discover and remove workers and make our implementation of `tf.distribute` reactive. 
+ + +### Integration with tf.data Service + +In our design, we assume that `replica_fn` can be scheduled on any worker with some constraints. For example, datasets cannot be sharded across workers, and a rebuilt iterator loses its state. With the help of the `tf.data` service, we can get rid of these constraints. + + +### Advanced Evaluations + + +#### Inline evaluation + +The client drives the same worker pool for evaluation. We can alternate between training and evaluation. + + +#### Sidecar evaluation cluster + +We can have a sidecar evaluation cluster as well. The evaluators in this cluster can either evaluate synchronously on a common dataset or each run their own evaluation. + + +### Keras Integration + +Integrating with Keras `model.fit()` will largely reuse previous work done when synchronous distribution strategies were integrated with Keras. We hope that end users will notice minimal changes when they switch from other strategies. + +The most important implication of integrating with Keras `model.fit()` is that we will need support for `strategy.join()` and/or `strategy.local_results()` for callbacks. This would have performance implications, but that would be the trade-off for fitting the synchronous `model.fit()` semantics. + + +### Versioning + +The client and standard server binaries may be at different versions. There is no backward or forward compatibility guarantee. For now, we recommend users run the same binary, which runs a standard TensorFlow server if it is not the client. + + +### Advanced Fault Tolerance + + +#### Reschedule Functions with Input Affinity + +Our proposed `schedule` method supports at-least-once semantics only when functions don't have input affinity. Functions that depend on inputs that only exist on one worker cannot be rescheduled. We can think of ways to rebuild these inputs to achieve at-least-once semantics in more cases. 
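As a rough illustration of at-least-once rescheduling for a function without input affinity, consider the sketch below. `ToyWorker`, `WorkerUnavailableError` and `run_at_least_once` are hypothetical names invented for this example; they are not the `Worker` class or any API proposed in this RFC. The sketch assumes the worst case for the client: the first worker executes the function and then dies before reporting the result.

```python
class WorkerUnavailableError(Exception):
    """Hypothetical error for a worker that died or became unreachable."""


class ToyWorker:
    """Hypothetical worker; not the `Worker` class proposed in this RFC."""

    def __init__(self, name, dies_after_running=False):
        self.name = name
        self.dies_after_running = dies_after_running

    def run(self, fn, args):
        result = fn(*args)
        if self.dies_after_running:
            # The function ran, but the worker died before reporting back.
            raise WorkerUnavailableError(self.name)
        return result


def run_at_least_once(workers, fn, args=()):
    """Run `fn` on the first worker that reports success.

    Assumes `fn` has no input affinity, so any worker can execute it.
    """
    for worker in workers:
        try:
            return worker.name, worker.run(fn, args)
        except WorkerUnavailableError:
            continue  # Reschedule on the next worker.
    raise RuntimeError("no worker completed the function")


applied_updates = []


def step(x):
    applied_updates.append(x)  # Side effect, e.g. a gradient update.
    return x * 2


workers = [ToyWorker("worker0", dies_after_running=True), ToyWorker("worker1")]
finished_on, result = run_at_least_once(workers, step, args=(3,))
# The side effect happened twice: once on worker0 and once on worker1.
```

The duplicated side effect is what at-least-once means in practice: a rescheduled function may already have taken effect once on the failed worker.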
+ +With input affinity, a future may be bound to a worker that dies and does not come back within a reasonable timeout. We should poison such futures in this case. + + +#### Rebuild Arbitrary Resources and Future Objects + +Any poisoned future can be rebuilt according to the lineage relationship between functions and futures. For example, in the following diagram, to rebuild `future3`, we can rerun function `A` and function `B`, likely on a different worker if the original worker is dead. + + +![Rebuild Arbitrary Futures](20200306-single-client-parameter-server/rebuild_arbitrary_future.png) + + +### Multi-GPU Support + +We can expand the `replica_fn` into a multi-GPU function before we schedule it. + + +### Wrap `schedule`/`join` into a `tf.function` + +It is possible to implement ops for `schedule` and `join` and make the training loop wrappable by a `tf.function`. + +When a `Cluster` object is created, we use an op to create a scheduling queue and launch a background thread in the runtime to periodically check the scheduling queue and schedule items on one available worker. + +The `schedule` op could just push a closure into the scheduling queue. Note that any control dependency added between `schedule` ops won’t make the execution deterministic. + +The `join` op could wait until the scheduling queue is empty. 
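The queue-based design described in this section can be sketched in plain Python. `ToyCluster` is a hypothetical illustration built on the standard `queue` and `threading` modules; it is not the proposed `Cluster` class and involves no TensorFlow ops. As simplifying assumptions, it uses one background thread per simulated worker that blocks on the queue instead of polling it periodically, and it omits error handling.

```python
import queue
import threading


class ToyCluster:
    """Hypothetical scheduling queue; not the `Cluster` class proposed here."""

    def __init__(self, num_workers=2):
        self._queue = queue.Queue()
        for index in range(num_workers):
            # One background thread per simulated worker drains the shared queue.
            threading.Thread(
                target=self._worker_loop, args=(index,), daemon=True).start()

    def _worker_loop(self, worker_index):
        while True:
            closure = self._queue.get()  # Blocks until a closure is available.
            try:
                closure(worker_index)
            finally:
                self._queue.task_done()

    def schedule(self, fn, args=()):
        """Push a closure onto the queue and return immediately."""
        self._queue.put(lambda worker_index: fn(worker_index, *args))

    def join(self):
        """Block until every scheduled closure has finished."""
        self._queue.join()


results = []
results_lock = threading.Lock()


def step(worker_index, step_id):
    with results_lock:
        results.append((step_id, worker_index))


cluster = ToyCluster(num_workers=2)
for step_id in range(6):
    cluster.schedule(step, args=(step_id,))
cluster.join()

# Which worker ran which step, and in what order, is not deterministic.
completed_steps = sorted(step_id for step_id, _ in results)
```

After `join` returns, every step has run exactly once in this sketch, but the assignment of steps to workers varies between runs, which mirrors the note that control dependencies between `schedule` ops do not make execution deterministic.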
diff --git a/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png b/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png new file mode 100644 index 0000000000000000000000000000000000000000..7a89161d309f6e85bb94cbcb66e84b8463ab9143 GIT binary patch literal 14706 zcmeHtcU05aw{K7pl%}JC0)cUykwKb>G=YFH$_N7}NN)mShTc0#umBE7&=F9nqDG{5 zLP82fh*Fgf0YVW%2_*EA_P)XU-n#eQ-@WU;Kkm9~y?gTyi|;vSpMCb(^|Q}kH_eSj zj!GT{fj}ZRe*eu91UlFV0v-4wco_KN8gZru1k%>N@!Qqgk)yQnBWX^auk<;Z;{!fd zn1%=V41W-OdhEcFD$}1#Pd-(m{pk1!bolDySC8H74xK;u=F@9YyVKN|4bov1w2z2PqFQ9`UkYJ+; z2++P`ap@po5a>sc0^ey6$N*$>;35bV2P!#q2M`9I-H}G%&w+{`;`SH*2d4acw8g(u z`S+&$qwTZwK!|N)VmD>|VO~Im49xpR7e6sEFOon`U5?R@sS6YiQp_3Q1P~YlELbYU zeuoo4-0Nb2xAiaZ4jO6=$|Y=bP{#(`V7qx76!~C)=N;BJXu#r@heA%nIi!x~Te1s$ zSz^sG6^w>dkX8e#ZF6P_y4t_F~}8{DV%MosT!H)(Ub$@_7ROlFnD3LB$LjqR~~| zEJka*N779xR~5oWrYKs^vm3`6Ue&3QqAi7B3}5cc$@^E3@sPTq`v(lHF4G z%Qn_+IlGjjd>0drAXr5`HDvh|wj|KTH%=kS^ez;ieIvR$WSk?QZ0u?qOrE521w~QccQ)>o=cJTepv& z?2sr+YLs-lbS1xmMlgm`K2Dlz4 z%$#mdwcsi52X_$CV8kUo!j9qvLf)~RuufU=U#M-THq~v>xf#nQElH;99%P}0^wid& zov6Gs0z^-@LW~5)K~rq$pbZcrd&2>RIo)#2Py!LzI|5FI6@PS+y*m~NeXBf z{XShKI}y26`{$}iJQco778B%Ig!Az@D><7ucmZF1@NhurRjZ(;(w*J923ZIF8ccBB zak%MKj2fR77PlI*pkXw7%9*-$Op<;UOH3y(*O4ZcAuihQw43Rj_KC!0-L{!@qbd(X zB?Y-9Vf1mP{;iebzTRF1A%6r8*E0eGw!!r|PRJ|qPxgkZqIYtXA6W0&JsLyKP31aef8YgpBIcQO@hh(g@C*?A8A3sWNX>et5c6931Ef5Dp)k01wpS_nO*<77U;qO2rr?J%W1Hxk}|3V9JsI__3oW7g@4+(j3#ijT**()Ho?k?Jbp6YdA5TvyuVAFkYHS@RD#VZOVAe7WR+*8wlW7Wx z2lj21uL>VC-A0(+eB-W<8v4~0h3(%=@tfXpw+o6vUrS=6hl0kxU>@fWEaP1~Tj?GR zFIA-VNxOgDlo?;zs!Mt5f(fqKx8itENs;UEwzmz=dtwhPTPtYu?~D^2zXyYu*lPuv z34x?o%33SN9kSZ6z&W}f0s_0Ng{4xJ1BL%mR)VVs^`@c2;`#3YrodNEvh6fh&XdDe z?tc82!Q=lqzWmF`@SXhs6@vcdlz%&R8Gyt^|2YBtRU@OBD))<)cwDA~gKKarpxGOG z*nHpw`xgBpPu1gi@&WM<@sB_L+av#w#((Z~80k^Tpm<9_|C0S|wm86Y^jpX3=(nfi zxaEC-*%<_!a-OuUa!4OAEZ`k&g2t&SJBWK>wLhkzT72p=%N1C)7k46?m;FV*6LSr6 zB3~_737LZ=(zjYm2smh6tF zZ%HXms`nCanMSvilt@VkcY|F?28-7fg1p8z;Zw2Mm|ospWBf{l45v 
zOtVX(8GBKg>Cf#GtTR(e*;OT{w$oL8Ua%mtV@im5dx3AMTJ%O%VKVYl4`y$e#jly0 z;TqX7#!8Qfy$kE&_S(mjEQquZ&Ec!r_677j%K8`Q^*+(ew4MXOgi6d**DZiax9OTn zF1P>835k@^d;LEnfU|>&?h)S>IO949IxJi9VB5===Z;QYey|uMf%uU)Iaj@&2b=tx zmBL-M&BB-A1FmvdQs!N5Lt*Wrj+zC34 zx)W<C7oD636MFOvIXe6FkMoB+owxik^HU;+@I zZ+&x65`8QwzbPmtpKvTzK_MYZj1v4Q%jmiY%O%`o)aev33vcS*AKpTjQN1@mY8s$Z zZgH5hmK4=Z-)&6O6cF`c8io6}VReCt(TTSn8jG^V)7WwxOaq8A_JqwKOr(^RF5i+>SBB~QVDMbCTtiOT^DsY*0j=+r)~db z{CDSYCXV%3)o4hV9LzxfZT$vNsA%Ne;9Pa0uPURqp8sha8Iv?MMK+u8?s`3SkU%>M z*w*}+EpK1{$Z>aDI!>ITrk}?@IV2$LoilR@82sSpk-4WyxI4W5eyG`Tt}T0VU<8xB z7|a_o^HcIOBj&=IN;LULB5Vs>!*3J94AzrJw81M?ya}pZ>?}^$-r482LKRsBqPkvm zR%5B zIb{fhg1Fg1ck%+nHNN!e?Ayai$8Q%qw$w@gfWDSF6$}uYKaWWm6YJqD&A3Qd7CXuz zSPT{*98TXU0YFLg+B7cDWe9y3DDQy;-hd3<4 zK~3QgwMImJn>Ik(UDrvIkkbKv5C212Jq3V&#&N@ zw;KE$(`V|Z@~_34hd!cY)Xn(|HLl9?b##ipiVEhr6Tfh2a7)tq&;QeVq~yB`Rh_42oG}Q^SXA|%~ByDVzOTcRfH5? z30sH~Nx>L?$_d%5++A96eJQIOJh=yhP5V$!9@uQb%2Y>e5_ir#{xL*O0T;8P)~Z@uL~vXF?mtPyQu`*%`vbd3{$~MAKL`dwG9=Gx$f%TV0QdGQGQV>+AAJdA431b8IC}p|U8?nK<5VdO z{A#Jgk;VoZx+RI0oi|;@Q}bx!dv2>#K!YlO2%VLlnIF79oKoh`f)ugBuc=yStdOPj zNu??7(Wd>BXxtb+$uqXq5Co%B3I^XZ^uSFY9iPi z|0E?DQd}JZ8Q$kzWN&k6yIKq}_YNB0zc^@bE89>9ggjMSX+Wsk+OXjdO%_KAtz|g6 z_vWi;PO=O8RtUks*?a6gASiiYUOVI@MST`Y5wU3K=czFTS6WAzKH^q;Mcr*t!?Zd? zm*fe=wJ+>UCa_gOyz46_wh9|ar{gtPo-s&e2CH~d*9=~q01oM=qj+FUJ;M3t zFN7BC3MYOy|Lx&x)CfX{Ag~&0m&T8w0-n~jV)^bkg|1GB(AeW_OT z_w#`VO(}&89j%ut+ev)@gw0q2#1;`yH4nrXgVa#NG18V%GV-Ea{X^_GA8Lso_>1)R z9{xgcwkdG%*VIfWiXE59M&=R}6Lyc{xsFiRkg>%bu?IBioasw|2b8>SY3tPyI)dF2y=tCb55MwY4bzQ{k#qXHr#u1CZ#LI z!e86}qp8GLq(iv~f#+JdgcI9N4td3n>3C|#bL(8*%#stC4wBr8nV7cH$5T!CUv}VI z_f6MGuno3?q@$!uNpUfgbhlm&*!z`}_z(Jj$G3U{j@DU~`M^G)?deqR=~(@}R<*T0 zv60#MgtoDECyx~Z4R5wd<)wIfYMG9b#ECCyjZI+*)1jkml3r zJaFapEb_jQz}PtA=F*JcLngNhQ|oo(&@63x*V0p;*x%pFFy!(namTMz%o?v|10mb4 zQrqA5^a#Lc04NW*TmxRw1>#l70MPeuef+ma{vnP3wCT;m?hK^? 
zTT11X^t4fOgjc`R$6%C?AOzC&RW-Bmi2tBtUzuZ4pC>y>+EL@?=qn;S4`<(5C zr8~^O=ruiP8GVnECd0jJ!ezJ6M(wOxMd@OedwtDmaGc{-w358u3p`vnjHp&jo4VYV z2ROA434*y_!JJ?l{beB0F9IGE$ja(_X35etLJ9;Z5z{WIGQew}7O1l}Q;zBs)7zuM zEm+UP>~Wd%No#>IUCt7-DnvGm3Sqr^o}03a%#8SgAgyo3a?^|>hGAsV;gr*iVnrGu zcd%=U)iAwu1g8#H>PKy~&!@pP)KDbz8w*p(R?!<)^6dGoH29i|JeyOfcv1gFZ{}Jc zGt{Iw_xFX72CL}BLB-!y%on&WPX}(u!w^JY{T8d}OF|9BNkfLhe%;|Y=e0J0>xy`M zwo2e$n7P9c7zZhK?8PDhd<(#R6}R8T+UwcIvYMvD(I@jFqJs3*l9;G%SD~P+qJktM z*F1=6A(1CLb!F}+sX?G_`=8%`QwcP3eZ)io!~%Hs!A2|}Irra341q$P=9h+bm}XSo PnQvS(|E Date: Fri, 6 Mar 2020 16:06:08 -0800 Subject: [PATCH 02/10] Remove rebuild_arbitrary_future.png --- .../rebuild_arbitrary_future.png | Bin 14706 -> 0 bytes 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png diff --git a/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png b/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png deleted file mode 100644 index 7a89161d309f6e85bb94cbcb66e84b8463ab9143..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 14706 zcmeHtcU05aw{K7pl%}JC0)cUykwKb>G=YFH$_N7}NN)mShTc0#umBE7&=F9nqDG{5 zLP82fh*Fgf0YVW%2_*EA_P)XU-n#eQ-@WU;Kkm9~y?gTyi|;vSpMCb(^|Q}kH_eSj zj!GT{fj}ZRe*eu91UlFV0v-4wco_KN8gZru1k%>N@!Qqgk)yQnBWX^auk<;Z;{!fd zn1%=V41W-OdhEcFD$}1#Pd-(m{pk1!bolDySC8H74xK;u=F@9YyVKN|4bov1w2z2PqFQ9`UkYJ+; z2++P`ap@po5a>sc0^ey6$N*$>;35bV2P!#q2M`9I-H}G%&w+{`;`SH*2d4acw8g(u z`S+&$qwTZwK!|N)VmD>|VO~Im49xpR7e6sEFOon`U5?R@sS6YiQp_3Q1P~YlELbYU zeuoo4-0Nb2xAiaZ4jO6=$|Y=bP{#(`V7qx76!~C)=N;BJXu#r@heA%nIi!x~Te1s$ zSz^sG6^w>dkX8e#ZF6P_y4t_F~}8{DV%MosT!H)(Ub$@_7ROlFnD3LB$LjqR~~| zEJka*N779xR~5oWrYKs^vm3`6Ue&3QqAi7B3}5cc$@^E3@sPTq`v(lHF4G z%Qn_+IlGjjd>0drAXr5`HDvh|wj|KTH%=kS^ez;ieIvR$WSk?QZ0u?qOrE521w~QccQ)>o=cJTepv& z?2sr+YLs-lbS1xmMlgm`K2Dlz4 z%$#mdwcsi52X_$CV8kUo!j9qvLf)~RuufU=U#M-THq~v>xf#nQElH;99%P}0^wid& 
zov6Gs0z^-@LW~5)K~rq$pbZcrd&2>RIo)#2Py!LzI|5FI6@PS+y*m~NeXBf z{XShKI}y26`{$}iJQco778B%Ig!Az@D><7ucmZF1@NhurRjZ(;(w*J923ZIF8ccBB zak%MKj2fR77PlI*pkXw7%9*-$Op<;UOH3y(*O4ZcAuihQw43Rj_KC!0-L{!@qbd(X zB?Y-9Vf1mP{;iebzTRF1A%6r8*E0eGw!!r|PRJ|qPxgkZqIYtXA6W0&JsLyKP31aef8YgpBIcQO@hh(g@C*?A8A3sWNX>et5c6931Ef5Dp)k01wpS_nO*<77U;qO2rr?J%W1Hxk}|3V9JsI__3oW7g@4+(j3#ijT**()Ho?k?Jbp6YdA5TvyuVAFkYHS@RD#VZOVAe7WR+*8wlW7Wx z2lj21uL>VC-A0(+eB-W<8v4~0h3(%=@tfXpw+o6vUrS=6hl0kxU>@fWEaP1~Tj?GR zFIA-VNxOgDlo?;zs!Mt5f(fqKx8itENs;UEwzmz=dtwhPTPtYu?~D^2zXyYu*lPuv z34x?o%33SN9kSZ6z&W}f0s_0Ng{4xJ1BL%mR)VVs^`@c2;`#3YrodNEvh6fh&XdDe z?tc82!Q=lqzWmF`@SXhs6@vcdlz%&R8Gyt^|2YBtRU@OBD))<)cwDA~gKKarpxGOG z*nHpw`xgBpPu1gi@&WM<@sB_L+av#w#((Z~80k^Tpm<9_|C0S|wm86Y^jpX3=(nfi zxaEC-*%<_!a-OuUa!4OAEZ`k&g2t&SJBWK>wLhkzT72p=%N1C)7k46?m;FV*6LSr6 zB3~_737LZ=(zjYm2smh6tF zZ%HXms`nCanMSvilt@VkcY|F?28-7fg1p8z;Zw2Mm|ospWBf{l45v zOtVX(8GBKg>Cf#GtTR(e*;OT{w$oL8Ua%mtV@im5dx3AMTJ%O%VKVYl4`y$e#jly0 z;TqX7#!8Qfy$kE&_S(mjEQquZ&Ec!r_677j%K8`Q^*+(ew4MXOgi6d**DZiax9OTn zF1P>835k@^d;LEnfU|>&?h)S>IO949IxJi9VB5===Z;QYey|uMf%uU)Iaj@&2b=tx zmBL-M&BB-A1FmvdQs!N5Lt*Wrj+zC34 zx)W<C7oD636MFOvIXe6FkMoB+owxik^HU;+@I zZ+&x65`8QwzbPmtpKvTzK_MYZj1v4Q%jmiY%O%`o)aev33vcS*AKpTjQN1@mY8s$Z zZgH5hmK4=Z-)&6O6cF`c8io6}VReCt(TTSn8jG^V)7WwxOaq8A_JqwKOr(^RF5i+>SBB~QVDMbCTtiOT^DsY*0j=+r)~db z{CDSYCXV%3)o4hV9LzxfZT$vNsA%Ne;9Pa0uPURqp8sha8Iv?MMK+u8?s`3SkU%>M z*w*}+EpK1{$Z>aDI!>ITrk}?@IV2$LoilR@82sSpk-4WyxI4W5eyG`Tt}T0VU<8xB z7|a_o^HcIOBj&=IN;LULB5Vs>!*3J94AzrJw81M?ya}pZ>?}^$-r482LKRsBqPkvm zR%5B zIb{fhg1Fg1ck%+nHNN!e?Ayai$8Q%qw$w@gfWDSF6$}uYKaWWm6YJqD&A3Qd7CXuz zSPT{*98TXU0YFLg+B7cDWe9y3DDQy;-hd3<4 zK~3QgwMImJn>Ik(UDrvIkkbKv5C212Jq3V&#&N@ zw;KE$(`V|Z@~_34hd!cY)Xn(|HLl9?b##ipiVEhr6Tfh2a7)tq&;QeVq~yB`Rh_42oG}Q^SXA|%~ByDVzOTcRfH5? 
z30sH~Nx>L?$_d%5++A96eJQIOJh=yhP5V$!9@uQb%2Y>e5_ir#{xL*O0T;8P)~Z@uL~vXF?mtPyQu`*%`vbd3{$~MAKL`dwG9=Gx$f%TV0QdGQGQV>+AAJdA431b8IC}p|U8?nK<5VdO z{A#Jgk;VoZx+RI0oi|;@Q}bx!dv2>#K!YlO2%VLlnIF79oKoh`f)ugBuc=yStdOPj zNu??7(Wd>BXxtb+$uqXq5Co%B3I^XZ^uSFY9iPi z|0E?DQd}JZ8Q$kzWN&k6yIKq}_YNB0zc^@bE89>9ggjMSX+Wsk+OXjdO%_KAtz|g6 z_vWi;PO=O8RtUks*?a6gASiiYUOVI@MST`Y5wU3K=czFTS6WAzKH^q;Mcr*t!?Zd? zm*fe=wJ+>UCa_gOyz46_wh9|ar{gtPo-s&e2CH~d*9=~q01oM=qj+FUJ;M3t zFN7BC3MYOy|Lx&x)CfX{Ag~&0m&T8w0-n~jV)^bkg|1GB(AeW_OT z_w#`VO(}&89j%ut+ev)@gw0q2#1;`yH4nrXgVa#NG18V%GV-Ea{X^_GA8Lso_>1)R z9{xgcwkdG%*VIfWiXE59M&=R}6Lyc{xsFiRkg>%bu?IBioasw|2b8>SY3tPyI)dF2y=tCb55MwY4bzQ{k#qXHr#u1CZ#LI z!e86}qp8GLq(iv~f#+JdgcI9N4td3n>3C|#bL(8*%#stC4wBr8nV7cH$5T!CUv}VI z_f6MGuno3?q@$!uNpUfgbhlm&*!z`}_z(Jj$G3U{j@DU~`M^G)?deqR=~(@}R<*T0 zv60#MgtoDECyx~Z4R5wd<)wIfYMG9b#ECCyjZI+*)1jkml3r zJaFapEb_jQz}PtA=F*JcLngNhQ|oo(&@63x*V0p;*x%pFFy!(namTMz%o?v|10mb4 zQrqA5^a#Lc04NW*TmxRw1>#l70MPeuef+ma{vnP3wCT;m?hK^? zlY&FM`eaeglW#}+d#$pV29!1j8T#5lFad4YU`kxN7Hi#414oN}};GBX%4tt%Hd!^u5H#ufyb_jl!xH%lOWW)^9 z9M6>1a8G7E)aT^vLn79EJ7dX^uWn&=4fb6O$1~wX#-@%CiQ}*HAxkoDrdFY+Pq~49 zYiGRV9MTWn2e~u(nx(Ee#>yZ?kUi+EuW~v2coKlemd!?$jSno3-vl8W$rE(bv-MTQ zG40t=@hbtdTL|Vrz#M>snu)IQ3g{m}wr4@c$D9LO3r0oaj|&pFtD!>r#xE7v(mA82 z#bZAP6?{{v|Mq+dEngm~xFtpYP01m)WzP-3L0?sisz`gV%1-KK%Y@|knu{EP|skJSK#wX^zzC=ZT)~I&t(WS z`KzN}u*AL^9GbjvXmU8bLzRqMUKM;;nQD1jDa=<405A9vY(LJkd!HsXN*SHpCFsvFz~5)Q6&a zMHQ4CNWxiTgA_jxmzDR2CnHlL%UAbilS5j^D3RvI&J%)*s)$G{@r`|4o^Jse0Qmb3ANylOY_IwUxED+WFF-M603lhcB4*Fv?(wO-qs;RPPK0~Mx zut;ms#=;W{2E;y)llC`F!+RH&a@(HHY}F|&t33#W9t;c(ouRHZx)ce%|GKOs*0_vM zoj+u}9d2cNADP6q34ba&jR`(Yp>KZ^fG0N)RsJZ=VB6 z!Nzp(VE5t4-m#%A8*ij`6Dq1H=SZV!r}msQI%4d&qSZ+0Uqxr)O!I31fK_~F)DEPa zSMc4~JU|oBwx;;_H|@5!m5JNSIwFlIf3HLH?L$dZov5>2D0aGvvh|s`84bh2ob(J2 z!HC_(uLGiu$LD6YRpl_TqQSR~nXGnwuH%%pEOAk7{}f7xChMX8zEZB|>>#X<>+$rr zEf@glJXtiei{dZ;$>~Lc;)?@r@}U`&Vjx9*VH1S>BHe@y<&O7G+XV$ob)AXxFb|kb z8maUS8|(gc!s5~BnrNu*;8#E7H0-^|*!degr#3$gB>?7 
zf$ayTwUd7iyBxa@Cl3DsK;TFz$45oDH%3x?Dp!gsdxJvN&wXrzn1H)`d(zQACs9Z< zsMk)~!dfe0($;YLq~$eDO&68syP>(J<;PQB1V|8*m#25$6pu~#J$sWp=7O$OqyFTg zQdd|q42>zbh6w8QW;>v-5Sq4?G5gfnY%`pJ2 z+yYRs)6w9o^|&mVZIv?VH%dII_GMH{Yvq360|3rJpsj>~ zpAHp{Da&rzzwiIkJ>v&-_h0qN{%gVt-s~ShoFbSKNX|e&^Ee{p!sn(2Qq}? z$;aEJFjOI0)P_?%>F_ygYEVH)M~#Q4+@Qzz-2@=0Te?q^Bc73Xtx^DXZ!hGn3VES;(e+LPZy_ z_^+Gz3)TUI0BDNY96xZ+H~Ce2H0PKy_C}<;EniGxgUxP!9C6MLGz^I}kbDrZ8fI{K&L&*q8y zr17^pS7ARF(@#E><#z0^adE${I;j+V;mSVu24b4!r)l@JPf4k3yyD%$mzlIVG{i+X z99O-5k0%!cv5RLH?h+jLh1OqxdJ*??aY%MjMAtqw$*?HZisjkcz8W6S{jUJCyG~~h z;O+HZ4vw!p?)tu?KIDw@Ox|TLM%fE_f&GA1}+kQ#*yp><+%*jr{)M zYNu)n=DjL4^g-XP>VKj%5%wVo^WM+gxJ!+v9(PWBya{dbj`_aL6|Y;W2Fujj8+Q9! z)Og%p^%2fbAxr65Zk^T}8s6B7YVvmm%9)J?hEz^7yE^(pu~c=twi5pOE^hLSy|*I--KWB8{7vegb5V=EstlbD*sjKQ$DJTop+R;D^z60jtd$18Lxd%af! zoA%&WickBqM0yr0YW(Br%qoZ6f!@27K!x^NAW1vNs`|<=rCA*@Q8u;%0|RW2j<}45 zOh+kMKYyR-DjtRq;?wu{Tt=HGo16(B0nj=`t>~YaY7An1xs|GT0m6MW&6)pY$k5aT zEn>bdtHtU*7^?khk_uN-E59LeZ#WO9jK?WwoKYw4S zUABx~eP~~yr;19qPlC-ZrmeYlEpC;_bS_qwz#VN;Ps$Q&>&H*o!xT}cR$1Fn=ExPY zD=p%V2{+wQ6A>BipjiV5_)c{)a$*|5B`D`;TeAIF-va!3-I$GC0sP)qmE31U9;_h6 zohjFdX@z@mt5veo^QyQX(+?)r;+^t}W0Ydpq~H{c1(a0tLf>rLNyx}@uSXDd&MJE0 zgW_wKyxuHI&x57*k~QlOv-U8}VjnAqkU$gDsv*}`cuvt(dA+BDVNdwo0Y09$vK=Hm z?{61!4mxJYtmVU8wYUjZJY?|o1vhl{sd8D;aqyhmTLI$O*yx6nctZIPaYYEF8q~&y zxO{2-CDahnQ5d!BU-FCnc+Nnp{V}9LHki^=J%hVgniT@rcg5ApqwCpwdwcX*$wk*( zIOAIZ6|qp_FM^XQjz@QP_2|0wXgUaJlqW^CwSw*jp}InqgnwW7I+*72DvQ6<7^}_j zFZn4=_C;^9nLK_}AI0p$?zVJs(g01y908yl^x(iyskDCB5+#p4M8d-X$l3eKPMz)7 zGZP+z$|{uHK6_@!2-_ z^70U$s6G_?oXZPI(}PAsL)1l=E8jL?+%MxTS(>&m0IfJGkXl$Hm;s^)?zk=WVYVJ8c~Xj z807$8FS|O1>>S~zjXL7*gSoOVt7-u29U8%8j%BGlQk-&C(?!|JN<`}55(C8_xcdF> zRhB9OEOm=2a3#ViydGG=+dg@kf4#Xw`GJBCO49XU^WOzUP#L`Z(KEqx7L7s_TtXs)onke{L?;`Zjs9DiV z1-*+X>yg6jMGqioc=tTQVzCm;MLoDC6i+>9mAe4I;jlt#_raW}{FAmaC+||M$1|-R zO;qkuL+Vo{fMyKX&whv9dTU)Y`fA*`Zn06SjOO$L4d+5RQ67I*UXLBc7hc_eO&+g$ z8D;$ZS|y-1_ne696DK%7;`|BNEdu}U9LKDuK&I~3K@EA>^1HM_!SbvE_tgrRRkWX- 
ziR`v2kp7GZ1dcV6aFi}rIb|(hdm^08!OJ-d9jvOjH$k5(HLhwuz^HmB1bB>be~x*K zS)|RK((aq{89?&Ej&6>xFfvKDxUtY*VGr;B(zEl4H#cN&@PzRa$EI=`-i}Gwldwvmh(cpc zBsxtN%-m|D({pBqnA;WFQ-D_be3J~e$mL?JhK?2uly2wkPSmR~G`)8nPN(rQh&JF@b3M)jeA9_Qw)FuDWzDtYTpZ61x5~h zw40G>?ep@{kiS11R~y-9Jz9$^`;YnXxvUnrLt>= zM_^^X2(*o^_-0oL=RJ*DkW^laO+FYXJya&oF0l9!@Cb%MKDm}3#>T=XZ^TdqMEyEa z^Qyu)kJm*N5B1HjWdxjn87z-o02 z=*T%ojz`AwdVjig^M`pob7iqc2}3T64V2J1=)(;pRfZV{bUuGut!r_?QjBq$8L9Z%vty| zMzlGocIj2g^>;O+`}&`+C|Kqc+ryGM6W2i%vdAVlPwfJ2&&ItZ(IDYMWeN$v-c!Vt~B3F!)doXWyU>6d)*Bp+cwe z;amrH&_UJEp2?h!BiRTmnR`y5aTyVTzA@G`qQV7iFa5>>6mlPMeL}qaRZaZw#UhZj zkaId*B8Mu%KMC}8m(VHsU0*TFz>_oJDT!e^Fvhd;^sF*f02bIC?m#NWk2O1S_Tn`8zd;;}2J=r2N6?n$*Ts|{jai~KSGZDWy(CzxyZ!9=)QCj);u=WTzkSeN z#k0jGmHND3Xqf^HA?s3 zpR;>?!NylP%+sSG5*Lb?*nzR6$;QJ<6kFZ$wJkH%G#x|#M=Lv$`C2lmXM2vnR#4db z3Y0UG6?OV|sj=QlE`r9W(A`Swqp_%gY*T#|`GlS`tt8o!dc9kP=U8^DaJ;MvC)QoH zA70{1l=XpzQF}^-UdB^narLRabRc#8*q3~D13g|LXoK5QmIr^E;kom{jVVV_`-z;E z2TYUJiz`?xJm4-BhY?qK#I7>XsFiBI<63f6o~^Aj+X7<&dFW8Px`t*6@GijVY;Aj3 z`!B}0uV0?V)Mu2(yUF8M>*E{-_QGw*1vQ<ckp%jwwX-|VOs!JNI8I*oOt_pNx7rJ*k|RB- z+d_~aK$T<6{sf+eY*vdo{Z!#UmqgZ5*O<+{q4#b{Qvo%eh}xhI99X(nT95CWOdGsg z5LuBb1$euAU4CZV@D1EU%1H;!06%ocOFPDVc5bK=Eu&YZ{Dgz1Pfp%|i~RV~bD&7| zHVVD(DnG7oi)UW(?dB`jtkvNByjh{07xVtI=kF#yvL~E0O@X@Mahi>ViC0?c8D2J0 zEgANaPM2dwN{5$9=t_g;m#0B32t zd|5r#pIJyQ!2vHO9?;GYqR=@K#gOuDbE=EpuEWdBUU0SL-n!(~@aNv1HddC=YC^=k zjnBN~Onzv%myuU(PC?XOWOgrMnG$n}{;Gdk!(i;Oi9n@4eqYeE6|By>4ZWM5LllU8 z2i&-yRJzXx#44^Q8)x%jIriF6>~zhtjfT1=8y0{SPk2|e2f4om9xz*4*<}{cShc%a zKwuq3+ZtJ|psLFtHi9d6sS(o~%Qot0b?%Rw71zo&?!}s@CBEO-)8r?c4^XSyyN8Ns z{bmea`lv_h2eNw4#2i>-hoL_yjMvNr9=TKj?Yh;B330gTVbo>6JK~F+>y675?qwUe zw297UddP^p)Y1%~MbWvpFl}Q!1>pL1OCb+DUb%j?CKhS-E<7d$IER zHK|c^38;x@$G0*Y18|=dK)tS3xhah=xBnF6}k&kMQKJe$0i5touBb@GHC2MOKk8s z7Z4l>*7OZ{w*W)?{l3&kbw#vSO<4*5s zEKZ){Wa{r;QQ~l}w*4$1!M&0Z`(vCuzle!7s4SCN?XsfO6p<&kQl|*KHVWS2%*o{t z%w2^>{)&#aG3n7CG<`rnC|{bB*(+PUy_wvP7S{zBxpIsz2rm98&lfz*+BjRVyj=~v 
zZ*lGG6Gva+cqEoeW4bEz2dRr0_vp|dn-;d0=CZ?|e=H0+-o2`&^OtGHMx!=Cw&MY- zF)c&m-A3dVeC}rj9@*t#m14)Li>%&^nd_XLCkjw*Q3mx$9)c}jPOl-=Q=s_~*5RF$ zKbFMRe+bu@sePK$nI41M=to;QRK8E3ZgV_LI?j7G;UOrA>KTG+aUtgsaCOj+7e0U0 zX2M}*oL%Rwan3WrmZ6D;$=VfRTxIm;SSzV~nEH4eF#s5yJ5V+vJV{zA3tKj9hffb? zTT11X^t4fOgjc`R$6%C?AOzC&RW-Bmi2tBtUzuZ4pC>y>+EL@?=qn;S4`<(5C zr8~^O=ruiP8GVnECd0jJ!ezJ6M(wOxMd@OedwtDmaGc{-w358u3p`vnjHp&jo4VYV z2ROA434*y_!JJ?l{beB0F9IGE$ja(_X35etLJ9;Z5z{WIGQew}7O1l}Q;zBs)7zuM zEm+UP>~Wd%No#>IUCt7-DnvGm3Sqr^o}03a%#8SgAgyo3a?^|>hGAsV;gr*iVnrGu zcd%=U)iAwu1g8#H>PKy~&!@pP)KDbz8w*p(R?!<)^6dGoH29i|JeyOfcv1gFZ{}Jc zGt{Iw_xFX72CL}BLB-!y%on&WPX}(u!w^JY{T8d}OF|9BNkfLhe%;|Y=e0J0>xy`M zwo2e$n7P9c7zZhK?8PDhd<(#R6}R8T+UwcIvYMvD(I@jFqJs3*l9;G%SD~P+qJktM z*F1=6A(1CLb!F}+sX?G_`=8%`QwcP3eZ)io!~%Hs!A2|}Irra341q$P=9h+bm}XSo PnQvS(|E Date: Fri, 6 Mar 2020 16:08:57 -0800 Subject: [PATCH 03/10] Use a smaller png. --- .../rebuild_arbitrary_future.png | Bin 0 -> 18465 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png diff --git a/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png b/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png new file mode 100644 index 0000000000000000000000000000000000000000..8ac18991e93ab9d573b0d57fc95d8710863c602c GIT binary patch literal 18465 zcmeFZRd8HEkS*9^vY1(x#mvl17Be$5Gcz+=U@=P;Gg-{c%*;%$Ju`21_irQKZtQ+M zMTh$K?dnsR)m2$}^PJle^0MOausE;)003T6LPQY&00AC?G(tlHpD&4!U-1%~ zJ3HHR(bK!RxzV{X)7d$i(KB*#a?&#}(K9j80$b2JdDuD|xYOD?{rFFi|A`}F;$-A# zVef2VXG`=C*TB%u#hI6w_+Ll={r8`CI$N0jA3fPR{g+!n2kHO4L(fRZK>y#^z>o6$ zJIf{KXkh~E`5!(XBhP=d{J)(0uYP#w|8@RL1z_+VxF$Ta`} z0f3~4pt3v2#V@FIw1LG?D<=dRI2w-;P8bP!WD!vXWOR{%V62@;{?H2DYK>m#rC>uL zWwgKk5@-tX(G<{$LDI7AwPS}2?O$##3tB1tITnlVZJ`t8s%(z={??7h_1m z@2iCf3a^qm{tYJHUkAJtaNZ|6MWRmwstg#Uk~<6qKPMJ|(8r|nBm*%*&I4so^X#0<_!DFvDO)0nq*HMw42n6dmPP$xd+4<9Wk3i2 
z!BpNF+e-A%0kK_oew(2%U;tr^QplZzL0y5^wQ!icN&a;}7_17O3|y4jf7m)L#~I+! z|6tS)M%MpdF_CDQZtGqD-Td);9SETYk}V!GCTOjvhoa%58Vq+~XkeJ<65 zOE@B`MnOqbOjKpecqVSEd~(CHoXh2`;-H&G(dCR?c|xnP>XFN{G#lUrNU5(o8AtuX z$*(@$sFrFzncA{}9s8A2*mu5J*T@n{3TJ9%@o2hA&~;==3tz+YAz{%*wU*Ai$NX?< zHC2HF!^%p3LawwR{3Db~!5D1_rWR_Oqn$}@lmC(5_nB`8k-9o`shoXgMVH)erz*ngQ z>L$&Fqzin_;y(iO2Xk-*@kLZBM)-t&cTH;yI$jAs4rukJAy~=I} z>DE+wu_*81$k767!Ig4-o!?rg5tYp3{76zVMJ346y8PigFXE>b?`IA7Li~#Z_w3V%k^9$gf7+L{ zs)Q3k0aUSdJn@fNiDody%XP0$IJxda=Q~wZi9R3wS7HuiOqRLP&6SyNv7+l0MO9wa zs!PMY=1x;tDfG#&vcEOTr_E$MRb6l8h7j?-Og|@Hca2=8(*4rlq-V)^(=+Dy1Q2%* z9FIKmJV6^kl&6MG_=;t7N?lB9?82u1KZ9*#}-~$#~Lqh z$K-f_+jc*qkfLwWC>lULoK)(*PYJJL`JENRpSJqNsWu~C*2KhpskaM#w3$~Q`b<`y zoPI>?HQG4geVuPwAV~SVj2N@-{~EC~ajaAO7*!nviGT@W_4j_^_pwjnMNX zG=)$atY4VSuIZkpy6WleZ(v@^RspZ)`&u3?QLIR%Q@vq$CuO(NxXi{ow%>e_@`U;Q zO!0`ijnw07imv;e)>V`^lzoiORn(V?o6URQyei9@kKcQr;sL&`KMe7F3BhLLzC?9K zJleA*n*T&qI>dalJmzKA@~b=JGve^*L~WCfnUwu>reuECqrOqMcLR}66(@Fo#8I#9 z?t;rz;f>L|OseqD=j)ty)j?_`Epey-qr|9#LV9XP6A|H0s53-&#*Zt;4YM+J5}t@| zCdLwRG`Im@V6c5;uDFhjq%sI0&~d+BPUzP(&(AlI4WrsSA1$Uf$&U>mF?Qt2xiUCM zeYvhuV>QFiySEqPVNk^-PM?B+fXHB{{zw!3iS`Bq4DeIq2WlLicbQ($t*x8~vm|1y z*hy<#ycpQ%G5G^sjyuC38OTU5a89x_!||dRKj5)gM6%lEKpUTtt-7!X1|-~PNWrA0 zlpB|atg~6D%Vwim3{ZjNk;TwDn4~;Bk2=&ySL>WM@dpCEAK79qKgx9_)TLrn8H``wcE;_f!1!KU#!afPM$#p={#$2D1t%j~6a#H$9*LWHKshn=ZE9DL) z^kDWkz_`Ug6<|mhj~y{8Qp_qkE*voWe&ap>4ob~8UT{e?jhrIWwo2Ifky#9G>@`{l zgP>Jwf=8?BrGZTD^lqAPT|^v}qFZSDA-38r!#?`%CrVro@jH2t*IBNNP^08)ya3VM z$d8eQ>qnwm#?N>HNzoL&J`7TlQE{5DLn(UI&tzV1s$Yl9^mev$`Sr{rcWYK<03P*w z#gTnTKfhV`g%V+~h4!C1q}02%HO3Ixw31#Y6Nrpp(}FQwz-eKzbXcx;WNb6>uClbH zygGp)`W?e9V~cRUB$1Fil;XxGdazaSl5m7e(h*YH@CRC4)5dJy%!W}ZwC%Jl ztNh#HYu45DWE8D{$l;3#bghN(z?4#@&S%BqE4PvL0`cu!E`Wc+&1HpdrIx>*i+|Ol zE=OA0q5@f7v+qE9k|+H~mBH3njcTiUIW?SPVHlAdeB^=FgSW+(>iWI_Fh-Cy)Ax#;K~aI9s(Y`0lDDCsR<2Wq?I( zmv5ZUQi#1j+wfkmMXNcPEms1{VZusu1O!4Q|KPcQmUoZIghb9k692sRt<4ymTisMzk6HElhiHSjM7eMqI|N22B zoctB*r{r^c`FHl{%gpx?Bot<*HkZeJ@<;PIS!v8FLY4ucg7%!PI$ld*|G*8VNS$w)rQ 
z9(y3nH?_iXB$#DcV1neTdM^nY`U42lLTnN$(c=olZmRz+hr)mdgy~18WEKH!LIvho zztDXW{dfN5S=AC5-%)CR0kJPCkCVW^i34H&@0qx*6+lff8y`R2(AO6j3@dvR-wPcq zDq^~SbodvSMr#}ahC?LpZ+?Y)f-Q_abl%!;qdv(V51>Vp%*N0pd0N1uHd-#SZ18NL z-`hOGp#q?7qyiBCyW#)h@N|Yv<6%<-Qgdb{c)vsCZX6N<^Lj$0XHX9se^k=iNyOVR znnVDMI#Rr@0Gl?(`X0tmt0XANHaR%PkK9?X;}h~16E;bIa2n`_K37P*p)U;i7D>=g z5&d+2lT;5CDu1NtpZfFl!Ub-Zd@L;FKbpC@p1-%QcuYu3Dck$X`X-JjqokGtw5 zV`XOjA}mj!prnilj~IrBMWBg}hLNLZLq8gv50|2)U@-Z+5YCnGE5nQ9!}zk#C+n$- zjg9kb7$4m~V6^q>mrsAtEo{xiCSyId7P=pqGf#EDO**MkWi3{ca=Ii z9!8}h>ThLrkC}rds11cu2j^w{5Nh{^Gc@ies991&KhAy`fp^v$-AH;b49Z$hh z+}`CJ`L^Dg+CuY(;pw${v6mC+KF(h=v_%2-!kEBi<*Xo+LVnwK*2b6Zz;m~#G$hB) z-9*pIhMdx=aMou3pzOHzT7=$xi-gCWl$BEQc*XG>;exC9bH!1<$eOdzeq6a-VK@J} zwALkq+%dOE$UeG?yyiFPHHi)8z=D;KmLMYs;CV6!y0 z3+3>b*X8u)UK#`>WUyOzN6@LOhZpIFmtG3K-#eISt;R>AwU@(0VGHkfFs{WSbeh12MxOa zIvzYp(y%;PI-8GNIfJtz8Ia@dcv}t<=pzV-c?zlMQ~+toEKS;JJXwA37$2{scalA7 z!sEEz^(Rnsbz~CR>UiHzaAtifC_24LX}`5xjPpC(jx9hA)yh5A?sn@3nF!G_N)-JDUljg_4j`vnh4?>9?@x)q-7len8wWzww;fCA!nK>}z zGM;o12WFt#>|oVo&)bD$5%oVq2GrJ{Mr-r9Vo)n?Y|rOPKP+Fz_%ZDCnxBqO8Yb5vpn|?@=UtPX&(TD{9y+QQB~!-9kkx4T zHMe;kPmh_6o18jr3SS-6afdAoKo@KwEzfE>#G9Iy#P9ppX$h+NWM({7`h9UKB8Ys! 
zE6!I%SzG(vXm#fv7bLJAoURLZ`Mw3qT%|WeJah{m1>^p;fN&k1%Ibkq1vNY}MxM9k z6tCr^`GQs|QYGj;JCLxVE70r$7{K-E8HYZezI}sDXLY^(Wzf~(y>xTldL)HS;L=j6 zvK_hN*MJBQ!>1Up*V4<&_+nQ2s;5~T=V*n5f!=m~nHC!Q+FrDkup2q}W9+HhRmnV| z>=|EsYbAlf{pSGqG8k~frH&KZ^LS%bi#b<=vU;&_a)`IKo-kJ_aJ9a!=2r^cSqY3o zmWd4#ykUX^P7Mr(^|}bNJwIzd!6<|vaUGo9*y|6woSMM^P_kp((NV4$mAFzH)b}4AP znWDNeP-6iz=wYG!Rd`fGAz|Sfa?~T>cFZwUGh1;3I_VHZaD3HXb5_Q3@XFz^VuKkR zA_Cr79ykALO9ix0iET-YTx7v=tpMAG=5nWkOLtx>L_B0k@Y-hsEkqA|40j%W01pp^ z=OdN8+BSKfBf)#4X9>?(T||OmlYAPI8l0braIf13IGnX{X|!EQV4%v(JPWRe!njS^ zg8T@`$Dx=7re^!uVNnJ6u>P2oRDh)|@C!?}(uBiT{i%=^&r8UM1!|{-YWTNF{7PeK z-Qg441ZTYUgO?|-(J)!d`G}+^n9-#eN(7`lP|I%1uJv~)RZvRmua6rTmDXohG7Nj0BM`J7fk@hja%+T3+>b`8v_ zS$q~s)A=F_!Fa!(yRBN>Sgoj}!P0z+?ti6WTg_X!`4#4LpXj`dr_JUJ`Rz-y+D?-^ z>|Q@5f8!lb^nGu5y;yrV?NYOw?UFF&yFB?S9As57H&$1Z8?s7%UIw#wkc0#WZ>U6Y9Ug=%0 zc2SUPAwJjW(|S3T+gzHDsF&ZFtgO1;$$IFt@ODafdI`f`uH2Em_c3H!~fv`OTArS;U)jzys!5cVg>eNy}%u0?*GbL4?5XLnyaD_}zkWs6B zOyBF`F6kbKl_FO~(EDg~8^-k%Lx>yg<{SHG_Z*@5*=t@38?TvAg{_4V$DgMbRlys& zk6DbZcMnGFA?X8~>EE%c0VbAmXremGmM4?a&ANSo8$#06CM}Kvv4ksLLd-mw{k4;6 z_dFul5qAw8_GA0^dv9}e!X0P4_|3?>1)%mmluZaQh`L-iN8|^RtdY1d!`Fv1W4a?j zC>2rWljCM>jC?;Jf8zR|NxOI{bWysjt!{*A6S4?5kX$d_;*)h)UfAIo9W;X*4Y8s7 z`biJS| zki(Ue5wq=QuY~4{OoQ(471;{i=`UGNMtl9v{O58cZYqT9Yp=*61R`YWMZ3>Am52)P zguXl!=0+&T(Q-#YsN}?Gl<^?3;wW*saYBc%dKCzCOE2LlAM^UujQ-tz+DGBpYoaRknTOZS0aQ;C!4LDgQUbA@rk_FR%?3H^Tm5;^Vy$A&*h z-hZGV@>pHIqq1S~2q^!tzwS{7@lE5UKR4hEsY`yE{4( z`^BdX71PGeVbC0Gg(v9LPlq!;br1WK2`GMP>T&u+v!t|RK4HFOkW~PW8diI)GpqSt z&D!cN^Uh5E&i9-sjUmPn$1TdQ91l=tswjr+#48X*Z3$qW7{Xu!ZON%SlG5Ti8R}}| z$pSaPezgKt0h-AZ&T{ivHiU4_aZ8TYi+?RupoiXp$0-W24b6H~tq~Zernoryb@nHi ztB&+Ri=>WP9%QQDD=tg7^Q=EQNt}6;o|3K=QJ3RD4TRhbQ2~TQdb;kc84WXSeC?0x zlM8m42AQ`o11Mr62Sy@D&8yYQW5)=h+|=Y0tvAw&7wYA@*3WP0dUmD}Zok-1sL=`0 z(C%F}OdH$zu3vw`TzY1dy49QOo!67oOMbWn-%u%Fe8U4D`?5ZM<a(zp4R+~O|Rn_cufV)R~b5_rr<;+InPDul>C zHQQ)^iD1SdS@D`)tWCC2E$}-M(S0PBD9a32&%E+BlDc?V0f3K=S73!$U&HIA?cz%KoFU@&Ofaw{V5OrQyDyw7Kv^$^TsvD$2`}C?yLawk-DM 
z_wM8ye6O0363+{%j-$sG@54&-zkuek#5IZiK|;gvgMlHBh=b$VT+*LPWXh0!@+dE@ z4%658XIDp*AFVbOUI@5E4-vS2Udb~m2w&`}T+u{FqCYzDl7gCOofO0mNS1B??{ojF zW1!=po8-{4Ux`YXaVz{C1m&6#)bD$RDXINmzdsY%z$yrIz>dAB{cmv7J_QMX@b_>5 zhTBZAG$`^>e<5H=#0U)f<);`_Lu`Hk7r%1(bb(Pn>8W?l0`rcwA6VE z7gy0Cv(dUBv)QsjwA;AX`X4_>EmvYQm)l+86$)R8BxtOhjN`uX)AGC}E!kRY>>N9{ zOpdHOze}S-l{+OPA^(&4(AI{ZFTc3a?Q%^3IDn86P42CJ=YCWVsx=0CQd1tg!OEZH zVq0Y@u1?;pRtq!9iZa&OFY)4EqTgN5G*l_8wVTv^?goaa?wmtCfjpSW><`|m2>MTP z-kUz$qRz5Uuo{&)eqXiDhf(HJAl*c{m&@Jsj}DJfYV5kieKT(K?Klu@+gO}^+m%k? zu!~UkwKDzV_qHuP|C@8Glsv~zN2BRfiCk{>5lWZ6QYKM6+E*l|^No!4{LzaOjY4>^ z2_INR=4k!96t^X_q}i->iK9ZsgDv!4XByIGvQI%udmBreH9wnsXz`ZC{$A~?!=ffh zDr>tb1&{5Ub~wZYuwW#$Zj~+U_c*E$6tFE7= zUZe6lqI~hnm!?}r*JkYR+WB%cCP!A0YQt4ZHlbAxWKZRXV)2C1cs1eic)mHfv3ofk z>!J0VenUkdTvsaxkoeV4Gq;$HcaMcoT1bo7nI4)E)?&)X%057Gz{&IeNP9_Vwrb1# zoM}JkOOkr|EWF8dj8Ecjwpo1C^LSNYHT^5!Xmm!_xbw7dS_LClu~ySrIG_8R0_&yI zU42XtCoIYn9a*o}`9A-{av2@d{F8*HRn_=lHg_RgL(hEekOP5mx#VJODp?YB{->I5 z-k}e>Fr0YuP&#*5Y1mb(Mn+_$)Q2QN;VxopW!T}9z^LxShT2@iqSwefq14Ap`*(** zy?X7D>%Z0&@eCv~3jT|c!+4*fA2rjS@aPZDoXAN?ypnP%CpC(HS0f?J{)S$a!B#flS0uB!D{kA zj9*I`Ku1MgrjH(>zJjj;5VpnF(0n01Ka)QZ1(6^{PN@E)+%Xf!tE2gH@a%78D^J<( zW;kpw<}h4$YP@(vXr((C^QD(DN{;T$=Lrr}52%WP4A=H*zZ?iDDEzu>#4QfB8gP&yxkE1Pcj zJUz`&BAj;9OjBJZv1`@h7psF7*Rr(vrVW2OCpnSMR-_Cfk!hV#JSK-#95*CyN zR25`Msl@pGEId6$*49?GbjtZr2I7&n;RWDm9!Otv~RmQel$JQb{ zIVVJ8MFrhev@@;z{VrP(Sl#S^D}Bp5EiJ&tn(^{q3*0TN(x8awC zVn#VIKsL6)w8rE^-UNYdrxl5XW%4MH9COgoQADU_OaEn(fT0Z(Q=~wyfeR;RlHX9Pf?N9 zVdoqyD7&^hz6D>>2K)gD2cc!l?$?ZFgcHmD5F zI%lS_o4?20$qmnc$ogmu_nx6%%B(Z(jus~})jY)f4GDo)Gv|8;$#xn5l9cQIX5EEs za5CC6odN>7jb`@1qb+wis}qy5D``2tLYZP>(nyo%;TcOH$!pm=lHKMUHxN&L2`n}p z{6+@Llj)AZWmd!ulD5i(?bnugQDbT}OZ5e-?tKk8 zvD|xQJ$;9J%|uN5tkJotFzqfKl~+H_z%_vePhQ!J%@7y~{m3}Hp!zBrqoFl1xjS3L zl!og(Z6?)pYln=s!h9gi)A>RyzTok7lUS#=a*GD~D2 zNl(l|Q<-hDiGh)baApu1j2-H|WVKwXt?ej#%UkQ_wQw{`JzDjcbpLpkU3uAJ7mf39 zthBp9ck+)NefuwVH0^(~qap&Sgx&lO9hxn!;u+)+CZfM91GBqNuV7QeU1r|eLG$4A z+PH&{d;f8W17Lg2y*;sG#N@dE7oCxCTk`^BxW@WHyN 
zoqerc_-zUj0Z>R{1{4oY5H}T84^QwTQOWzNK|;AC0L-z!zw=3vRX28ZYs|?7qF)2t zGIZ3YM2{K_$q$3sqAyt4w&zC)CZ0JueMH7rQV>F7ecOQ|>8sLdx@`(hU2#xdl537# z*8`3mR}gv(vH~&?Ef|R?UN~FU>}315_lxeRpntmnRhil0M7bscu{a^``2Zu2^5&szfOAh|+am^{ z28~!wBkkNZUc(h?_h&Cae5UH+C=7=~30Z7YCI^iFfFVeyVX{ z!c)_gZTF|PV4^`O_4jYMC+xEcMUCpFhrIypQm2whx`Y|d{=BlU-!5!?Fv-{HNs|t< zn8Od~lDGlk(AX%!loS)Pz?_!0VEn}=EzmZ9tDMtN?uohzaj$A9Q!|AWKl(%NTC#a- zLE4k|m*gr0&(G|elS~1pLfvmZhS$uaXRYR#(+{*Gf4w4mRKkzZ)+OXbk9*w{kufho z3BV^y?}Y`5pLPAY0Y@jKojHXf?f&Ez%yc?P@$7eW{z;^rw}ZNVNUIR<`NY_0oc&97pj)r1)in+#QzF%+PI9 zC4L(!fsPilH16QPQE8Yj! z)t|yN+6<>=zy;{!&EA48F_F#(s{NA(rSA_>>307FsX4^k+i$lNdmlNz@oI7=jM&I^ z><5&q(L>)s4@o<_cP;%iX3vl)py+?=^a$E_jDLWy{)p7EPuhOBk*s+;NsS3`4fw;; za%2M9kKFD2a*(@=4Yc1ql-Z~(5a{dduJjl^FCfwkIDNT*&W>f*{URkY^2RzhiXrtHht$S4R4g%WRrGOAB7eaHA^6~)^I#u?`B+>5wIV!_WnVRM# z8?&VWmyTxo1nV``992y0SNzoId>A~?!9uZo?PLbWe;F{VHIHkjM z*=56MglJf3g;G-2xbfwIp!L#`3thXriSHffJTLm<$k`1jvxevXYtZ5|m;ta?A<#vc zhS8WomGi=wL^u$?!eNp9FNmYz6r)l`d)2S?s})x$ycwajHsQ2tznZ%klI5xr(EODj zW$ibB{Qg^(%mA$+Y60^ZdzBIz1~QIDy{#)dA+iMV1g8XPFW;u6fBT+(1=e{Qnxq0y4j9y7{3lLtMVo?cR3w z`;m@~wXfHXDc|b#oA^omzy_At4*>x~j1qn28yi8gCrK=&#*z*c9gb`pxM>2Nzxlhc zz2ceD_Mc>~m{cMpoj$zzPo(Dv5e_Y+vGzyTcN2M7kZ{YI~L=xsbgZ$4ivR4q_-K}Q3~ zdWbvL-_H9|1SfYRqSxfKf9K}@!QNgve2<*m%ThR8Vh6ZFa+0iE?evfIA=G?mQ7HAs z?D@TgE;!Uq;ywf_IbdS7i)+cJB5cCn?OYQPidACVla(4%-c67A=MF_finb;G9E5Q?OO}|Tq zf1~GoMJVv=$@FPcCB0~{JM=qfvLHs-Je6^r|rewwGwZ zF-N-6H?o0Z!^ZP0mykO?%~U;WstlmQh9a)qvl0DT@b9dSFVf}7n!#T$n8imCLYcyP zNhv`uBW!4gpt5$OY|z;uk}SX|`-lO?j_o0qig!OWnmP2KD+hm$4&rTi1IecuE> zWQqwJlxQ#ho`y9XjJf5mqOcD$VVa<)d~tt`pshS;+d(?E4eCv0z^|FpsaAb^HJ819 zH5O+(HXr``WZJO#_Ba{-80?@3i3_K$T*I{DT4v1;Q6q*R_RsNtPC9sPxd6#8!u}lV zj^JG+%P+uraSQXu=Bect;Kx?wjvNAP=X(=e;(CT)XE44Vx9S7lGNtufTPMCF)7@(d zTAD3ywo-4eoSd(dy-c+~7|G38 z;>rLL9I`alM^k)oF&bf|$@K~P9=z$h+$~aK0tpvJjiy!EynKUW!l(YRQ{i>yE-_Y! 
zP?yqaqBqZ!|4y$rULUhmHLBbrz_jsrz4WMiv!g0t6XUK0q1osA#hUN({b8v^hPp5V zxSHoUleHez#gh)|-UnPt8y1{VU*JXe`c*ZO^fu`+%(>zzstoGKSPH;+cn1bs4XC}i zf+9mOJ7^Aa-CKMJy!5*$W}uLnM>p~gCeba!JT17L(m!Z^k?oO}%>0D@mUbB!6f^Cz z3j@JGM8e~K-hUNWN*=ePc>aS(sX&+j$3G1noUSLuk^Wd!{~9~oVXHv>86^}q!w@P! zjjAyQ|Jeaw%RuB*q_0aU7i3nM&@ZqyY4{72L&@> zr!>&;w2~KF-}7?MNwWB*+#f`8q6h>o7SN%fa=*Aw@3d}87_vgLRBMDWbHQkc^!V>a zUx#W>(A3;0{S#>0StoW|wjQ2jYbv!)?|gc-zxINrO%7VN8ch~ZdwWZX z&nQDVQ2%gS19pJoqOCah%U3S-@m-l!cN++B<9vC2`|Yxe?yH}JCmHDOH$}obqEbFj zAZUOkul6o=h9-o*ae*o8A~H^#Hf~Qq|0!$5ucbVQsP!44vdcC;c4s>*Jd|KJ-8NgU z(l!o!&aH1XmqOC$mwz~^#Ze5lt!ib|H!XB=5fwSDQ5`)H0d^<9JkI{#s2TUTn71P*%S#K2r0ZF@HmR=?JhbCrR8T04;qN1m5iu70paf(8!IV zrZ)SyqmbXINKEBkAb?fTHlBc{nx_C z{V5NJgKpm?UAu;|`{#5BzWgeNcRp)u zu#79FLKo)cpA;v-Z<^CFsN?BaO{$n9H{CO?&Fo#i5vW{a{Yu(eM^ai4T(tSkcs)V@ zXbZ5}NU7}(F!{dDZ}09To01q<->iDlf0IOQf3jgu-)|F|G}{6>IY*qWaMDo&2<#8! zTCi1i8E{66?O>v2hm#W(M!4dw-q-QG1woKrX*3{w-?!)lyYaHtTV1)}AzjjnJeS5I zKD0S8IA3uV;3lJeZEEyF8(6G$*g+lI zDsnhe@M%6R)Q!$bq=A}Ya5f#D z$6AlM{MIitVGpDH`eWzut^E{hz207}cLeZ`-11=WVV{${82)W6aE%M4MA{4B9`rLw z6TbFz4L&fyQ}f{cvMP7<@BLy>uEo~HTKkJYnchmO-OEp($95)P z6FsVyaO>`e?W5BYdeyR``bA4Z=#<(+-{%UhKTpA#FR$JYCo5+~M>l|Y;Ya$FoO)wZiYq7Q8vCl!n1~a7! zL0@{5uKh?;_7zpgFB7BbYCA(nz;OF|>@Oecku4ue`PGox`p4Z=^e%0qT=yEvFZVxL4D|arFSc! 
zM0c8xF!kqpNxOID&Qeu?gxcBvYONnKPA3}rD>;CAg6IY+8wx!e91eQ&?TC^yGQ5Xd zZC1ag#Tgn2hrp5HfylXgQgrZZ_Et-3>GNV#`4%VyE2G!VEU4Yr*a<31)#Yu1_YyI6 zP^eqa`l}GbnR>VlBX?Yp0Pp%cVvt zia}n1gVz+wUq%GBJ^mum+q8^SILi+pH_+&+R!a^oa_p&I6GgKrRCG7V``A^4h)>FB z{~YEon}u_@r}e8t@m_PSLnm%=kOU9INSXfYt$cszG->LBxmAJA*J({SMKWJm3N?qp># zYaBl9cKlNGmb1)JdnfUIcjYpoF3!qm)Qob; zw@}TNL?)PamFD8pDZS=$_-^ZkVhlTvcN`T&C0{5v0+m}40)oMr*s+EJ4h4#;DZ5kk zLA(yjOQ`O{{UeNyps|ClCrXR3m58KL+)H;bc>+h+voIWQwjtgd`iU)TjxztL9?>32 z(xJUSRN%IB+0RLfl8|N0z&gQCe>nekJlVbp{`MIDpVd8fFzZA1Q9GW%)rJ^i1-=9) zi#B>>kCL_wa@y;mNP>tx3~Q?UxQbh|zmYNvj|f8RdwAwI(9R?us*>YRkzMyOh>3xj zHta+GR4oTXwN+$pfN6JKri{(ZSZYlw4`@tiSN3vJUz|0$S~&Qy`@5Sda0bD3(0LPH z557pdt{t9|W@!J|I!f=M6=2~clY(yR%vR9RB?ZdAO=H>RY`Q_H9Mg1&0IS%@B!L{`|zudX~ux} z%n`z$r3fknQtr)VaFsZ1KQSQr2g6lCk%c?KPMIUQ_c>sEwyZXm-tGDFzV7{5({WQE zuol#W>JzWp>QWv+MNfV8u+{a}DNm4@-;(M-ZdgdbLn%yy{rS?-qxpFq#-R98nBRAF zqLL$UUv13kJX0eTQ?ai0V7C4nfOKKo(c(C8Yqx}{mJ9w`8`cv?`0j|JltsUbqhqe- zEcf--d4(<`tfVEjhZ?&4A;TYLo4_{P-UX3{2m z&?85eN{;N8*~P@nUm~DdeER!=9dkmnh|T4%Jb?Np50e%W<{?lyk^Kk>p`CAIQGw}YuVQGm?-{Yu`zIWoXcZT;Zx>lkQ9TXH8)v0^Kj11pz)-C_zR%ye<}n})|o8r>#zNh z(lZ5BKdm6!WM}23Z{Plsl9D=1k2|{|oD0}0fN7xxldE}LFoNvJK72#^BA}@ig0nBF zNZf@H)mDzz;&2~Ht8i?t-Y{}h>HGMOr*ZkVgQ{m{Vz%#N!IYsdv($-pG3e+i19#ZV zAyF69gZ8Ww&h8(yMZ`^4u~@O@1xQ6IUjhIiE&u%%fSFQ!2zuaHt-$f%Y7DCIDcRC( z@NOB~UDvsw<|j}}m(Xc9lrZta`63*zS}WG2fL%Y_D?(n6!-&1k>d-y- z3T;Ju^Xo)e%*INmx4ET-2H(5b3g*)0YvcSWD7o?R-r5s|jx#v|VfH^HqdQ3Aap$e(2a&dkcG<-Q(&`NLJ*p{{6YWe}4mhu3>UY zk@`eh^nJRw#_%j(*8Bm*z6yi#W=ioyhSjHW-LQPSNd=9+O^K!*g!l7ah{L7l>Y~3; z9(OTAjdloP|KyRFfwExZ2p^+tSkeYwYvwvVYB^4>N-T!893;ND`Oo1-Ys*Za{; zC`P;t5%^K2M^W{ueB^+Rne@qg{+ZzM(A z;>Oo~-orS_n&gabM%;XM$NN<$^E#>h6{mi%%u{Uci2mNo>t2wSU;`%ad0Usr!=aR~ zXqQ*R4-&RtEB;pv&7ZUW;%4OROLRcIq2uo4?6;CcVz{4Y1rf(<92R*9KsX4V@+RgS#&|P`=c(>o@+6m_FJr6>ixBT;a z3ZrIl-4)8XHk`AOon~*KYfcPkPFG^x%(fx1LiVI5j(9$X-IeLCS|xjE0>m)s;l0Oc;PATP&A=VUqV#kyXQfg z#!uV8w&4=){cQN~ta;-JN9%~7kM!|ZdoF0?_rLt)=3s%Jw88MzV!9ZE?P2<;g(*vr 
zhg@2)QKUY&<)(`~pKW%5(`!UMfA+EKO{7;N+-Ut?WLj5d{C)9yOiRo7p+XIA4W%@; zxzP)xst4&FGL|pE&9KiA=Par!(D|xM^>n|&-GW^64ZJcatvS^1Tl0+C^O-0}x`%;! zX=v|8gbO(B>ZAFq+|n5LqP3UroY*r%*pe#7s2hm>Ja;Eh`)PEBX(vlF%3h?89EZk2 zux4t72ziNgq1(fsBb7c}{H12*;Ok%d8x{ zcbYyW;GaFB5v=>rKVlAqb$0JiGZ zr^j)7{*Q44%{EWa>7P~0(r-C^!Bqc48z%_uVHaF}?SN-@y}-G2*YjKlq_V1IUNA1# zWZ@B?<#NpJjaSy38Pd_lTJrH+T|GUW8msOqC|ao~nOLkWJbPophX*%y%re~gtl^38 zF;>e~cYhf#jOXk|l6D+XO)i$@bb)RcBPq5<6fs+<*%-vG&cV#bWxPJIyt@xS` z2@kx3p2Z!Ql62Y1>o2qJ#AyE0*&Ahb))~rh+3#bG>*rrOaY{>Vu^Qh3*7I!>^Z3_! zwjZv$CpgEZv7bHBT4t)P8m~MD(8$)qi+((?3l9o|EB&=*XQKCZgpVj>$Kco9Q*36wx<0jN=zQ3~B pZ`;!uKjQ@K>3%v(LCb;v%snM4OLKa2)`AZ6@pScbS?83{1OSyN-&p_v literal 0 HcmV?d00001 From 6d3fc138549f60403310eefa4e7c4846f0f7c6fa Mon Sep 17 00:00:00 2001 From: Yuefeng Zhou Date: Fri, 6 Mar 2020 16:18:08 -0800 Subject: [PATCH 04/10] Fix small typos. --- rfcs/20200306-single-client-parameter-server.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md index bcad965da..1aeb4d16f 100644 --- a/rfcs/20200306-single-client-parameter-server.md +++ b/rfcs/20200306-single-client-parameter-server.md @@ -69,7 +69,7 @@ The `strategy.run` API was initially developed for synchronous training. We prop * hide the details of load-balancing, fault tolerance and dynamic scheduling * expose the non-blocking semantics to users. -To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffled differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API. +To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API. 
 ```python

From 55c142a14589678bc8df5811d0e70130badd2f70 Mon Sep 17 00:00:00 2001
From: Rick Chao <6505863+rchao@users.noreply.github.com>
Date: Fri, 6 Mar 2020 17:19:36 -0800
Subject: [PATCH 05/10] Fix extra period.

Fix extra period.
---
 rfcs/20200306-single-client-parameter-server.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md
index 1aeb4d16f..e825b56b1 100644
--- a/rfcs/20200306-single-client-parameter-server.md
+++ b/rfcs/20200306-single-client-parameter-server.md
@@ -40,7 +40,7 @@ With a single-client architecture, the programming model will be different than
 
 1. Connect to all remote workers and parameter servers.
 2. Create variables on parameter servers and hold references to them.
-3. Create datasets and iterators on workers..
+3. Create datasets and iterators on workers.
 4. Create the replica function that takes an iterator as input, trace it and register it on all workers. Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.
 5. Dispatch the step function on one available worker.
 6. Repeat 5 until the end of epoch.

From 6e75f040b91712720c5cc8184fa6ca671e45cae3 Mon Sep 17 00:00:00 2001
From: Yuefeng Zhou
Date: Thu, 12 Mar 2020 00:53:07 -0700
Subject: [PATCH 06/10] Add one more paragraph for single client

---
 rfcs/20200306-single-client-parameter-server.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md
index e825b56b1..fb16554f7 100644
--- a/rfcs/20200306-single-client-parameter-server.md
+++ b/rfcs/20200306-single-client-parameter-server.md
@@ -20,9 +20,12 @@ Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tuto
 
 ### Single-Client Distributed Training
 
+We recommend a single client architecture for parameter server training in TensorFlow 2. This means there is only one client in a training cluster that coordinates the training of all workers in contrast to the multi-client setup in TensorFlow 1.x where each worker has its own coordinator.
+
 We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.
 
 
+
 ## Goal
 
 The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.

From 4f31a44d4e9a011ce4b34867f8181b2458809870 Mon Sep 17 00:00:00 2001
From: Yuefeng Zhou
Date: Wed, 18 Mar 2020 00:27:40 -0700
Subject: [PATCH 07/10] Update based PR comments.

---
 rfcs/20200306-single-client-parameter-server.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md
index fb16554f7..163752d40 100644
--- a/rfcs/20200306-single-client-parameter-server.md
+++ b/rfcs/20200306-single-client-parameter-server.md
@@ -59,7 +59,7 @@ One of our goals is to make `ParameterServerStrategy`’s API consistent with ot
 
 #### Constraints
 
-Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.
+Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables. We will only support `tf.function`s. Scheduling arbitrary Python functions will not be supported in the first cut.
 
 Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.
 
@@ -94,6 +94,9 @@ class ParameterServerStrategyV2:
 
     If there are barriers in `replica_fn`, it is users' responsibility to make
     sure they won't cause deadlock.
+
+    It will throw an exception if any previously scheduled functions have
+    non-retryable errors.
     """
     pass
 
@@ -169,7 +172,7 @@ with strategy.scope():
     strategy.join()
     model.save() # save checkpoint/summary...
     print ("Loss = %f, accuracy = %f" % (
-        strategy.local_results(loss), accuracy.result()))
+        strategy.local_results(loss) or float('nan'), accuracy.result()))
 ```
 
 
@@ -183,7 +186,7 @@ Another option from calling `join` every epoch, users can choose to schedule all
 with strategy.scope():
   # … omitted
   for _ in range(total_steps)):
-    strategy.schedule(step_fn, args=(iterators,))
+    strategy.schedule(step_fn, args=(distributed_iter,))
 
   # Print accuracy value every one minute.
   while not strategy.done():
@@ -257,7 +260,7 @@ For functions that bound to a specific worker, e.g. resource creation function,
 When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker.
 
 
-###### When materialing a `Future` object
+###### When materializing a `Future` object
 
 It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.
 
From bdc6dad343832d7292a7e7459ed638954f09f47e Mon Sep 17 00:00:00 2001 From: Yuefeng Zhou Date: Mon, 30 Mar 2020 23:15:52 -0700 Subject: [PATCH 08/10] Update 20200306-single-client-parameter-server.md --- rfcs/20200306-single-client-parameter-server.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md index 163752d40..2816894eb 100644 --- a/rfcs/20200306-single-client-parameter-server.md +++ b/rfcs/20200306-single-client-parameter-server.md @@ -289,7 +289,7 @@ Keeping track of resources and rebuilding them will be achieved depending how us * we will capture the creation of worker-local variables via variable creator scopes. * in the future we will provide users an API to create worker-local resources. We will capture these resources in the API. -If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them. +If users create iterators or other resources inside a function but don’t expose them as outputs, we don't need to rebuild them. #### The Unknown of Scheduled Functions From ab0f4334813801a39bad466d21fa401fb2b09bc0 Mon Sep 17 00:00:00 2001 From: Yuefeng Zhou Date: Fri, 10 Apr 2020 00:39:27 -0700 Subject: [PATCH 09/10] Update 20200306-single-client-parameter-server.md --- ...20200306-single-client-parameter-server.md | 422 +++++++++++------- 1 file changed, 273 insertions(+), 149 deletions(-) diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md index 2816894eb..208e55e37 100644 --- a/rfcs/20200306-single-client-parameter-server.md +++ b/rfcs/20200306-single-client-parameter-server.md @@ -6,9 +6,10 @@ | **Sponsor** | Priya Gupta (priyag@google.com) | | **Updated** | 2018-03-06 | + ## Background -Parameter server training is a very commonly used distributed training architecture. 
It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers. +Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training inputs, pull variable values from parameter servers, compute gradients and send them to parameter servers. ### Distribution Strategy @@ -25,7 +26,6 @@ We recommend a single client architecture for parameter server training in Tenso We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively. - ## Goal The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs. @@ -44,7 +44,7 @@ With a single-client architecture, the programming model will be different than 1. Connect to all remote workers and parameter servers. 2. Create variables on parameter servers and hold references to them. 3. Create datasets and iterators on workers. -4. Create the replica function that takes an iterator as input, trace it and register it on all workers. 
Note: a function may create variables as well. If not specified, they will be created on parameter servers as well. +4. Create the replica function that takes an iterator as input, trace it and register it on all workers. Note: a function may create variables as well. If not specified, they will be created on parameter servers at the time the function is traced. 5. Dispatch the step function on one available worker. 6. Repeat 5 until the end of epoch. 7. Repeat 5 - 6 until the stop criteria is reached. @@ -68,40 +68,44 @@ Users can occasionally run individual ops on the client, only for reporting purp The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to - * hide the details of load-balancing, fault tolerance and dynamic scheduling * expose the non-blocking semantics to users. -To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API. - - ```python class ParameterServerStrategyV2: - def schedule(self, replica_fn, args=(), kwargs=()): - """Schedule the `replica_fn` on all replicas in a sync group (a worker). + def schedule(self, replica_fn, args=(), kwargs=(), schedule_options=None): + """Schedule the `replica_fn` on a worker. - Schedule the `replica_fn` on all replicas in a sync group that is available, - returns a future of PerReplica immediately if `function` has return values. + Schedule the `replica_fn` on a worker that is available, returns a future + object immediately. + + By default, it implements at-least-once semantics for function execution. If + client gets a retryable error, e.g. worker preemption, it will reschedule the + function on another worker. So this method assumes that function execution can + be out of order. - It implements at-least-once semantics for function execution. 
If a worker - fails, it will try to reschedule the function on another replica group or throw - an exception to users. So this method assumes that function execution can be - out of order and function inputs are shared between sync groups. + If `args` or `kwargs` contains distributed values such as a distributed dataset + returned from `strategy.distribute_dataset` or + `strategy.distribute_dataset_from_function`, the slice of the dataset + corresponding to the scheduled worker will be substituted for the original + distributed value. - We don't support the cases where `args` or `kwargs` are bound to a specific - sync group. We will consider supporting them in the future. + If some element in `args` or `kwargs` is bound to a specific worker, the + execution of the function may fail if the worker fails. We will consider + rebuilding the inputs to achieve at-least-once in all cases. + + The `schedule_options` will give users flexibility to specify which worker to + schedule on. We will support more options in the future. If there are barriers in `replica_fn`, it is users' responsibility to make - sure they won't cause deadlock. - - It will throw an exception if any previously scheduled functions have - non-retryable errors. + sure they won't cause deadlock. If `replica_fn` has collective ops that are + bound to specific devices, we recommend users use the run method instead. """ pass - def join(self): - """Wait until all scheduled functions are finished. + def join(self, futures=None): + """Wait until all given futures are ready. Raises an error if any of the functions fails to execute. In this case, there is no guarantee that non-failing functions will complete. @@ -111,31 +115,86 @@ class ParameterServerStrategyV2: pass def done(self): - """Returns True if there is no pending functions to be executed.""" + """Returns True if there are no pending functions to be executed.""" pass - def local_results(self, future_list): - """Get concrete values of the future list. 
+ def local_results(self, futures): + """Get concrete values of the futures. Poisoned future objects will give `None`. """ pass + + +class Future(object): + + def wait(self): + """Block until the corresponding function is executed.""" + pass + + def result(self): + """Materialize the future. + + This is a blocking call. An exception will be thrown if the corresponding + function fails to execute or schedule. + """ + pass + + +class ScheduleOption(object): + + def __init__(self, assigned_worker=None): # More options to be added. + pass +``` + + +#### Dataset Interface + +The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`: + +``` +for x, y in distributed_iter: + loss = strategy.schedule(replica_fn, args=(x, y)) +``` + +If we do the same thing with the `strategy.schedule` API, there are several challenges. + +The first challenge is that we don’t know which worker the `get_next` result should be placed on, since the worker that will execute `replica_fn` is decided later. Some late-binding mechanism can be explored. + +The second challenge is that calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. + + +##### Alternative: passing iterators to `strategy.schedule` + +The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps. + +```python +# … omitted +with strategy.scope(): + # … omitted + distributed_iter = iter(distributed_dataset) + for i in range(total_steps): + strategy.schedule(replica_fn, args=(distributed_iter,)) +# … omitted ``` +**We will start with this kind of training loop in our first version.
We hope to get rid of this restriction in the future.** + -#### Custom Training Loop +#### Example: Estimator-style Training with Custom Training Loop -To construct a custom training loop, users need to +In Estimator, workers independently run training steps. Datasets created on each worker are usually identical but shuffled differently. The termination of training is decided based on the global step. Since workers are independent and stateless, workers can come and go freely. We can achieve similar behavior with our proposed interfaces. +To construct a custom training loop for Estimator-style training, users need to -* use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`. +* use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. The dataset should be the same but shuffled differently across workers. * create models under `strategy.scope` so variables will be assigned to parameter servers. -* likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states. -* use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately. +* likewise, create a Keras metric object under `strategy.scope`. Each worker, within their `replica_fn`, updates the metric states. +* use `strategy.schedule` to schedule the `replica_fn` into the cluster, which will end up scheduled on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` returns one or several `Future` objects immediately. 
* use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`. * call `strategy.join` to wait until all scheduled functions are executed. -```python +```Python # Connect to remote servers with a user-provided `ClusterResolver` object. strategy = ParameterServerStrategyV2(cluster_resolver) @@ -150,6 +209,8 @@ with strategy.scope(): model = create_model() optimizer = tf.keras.optimizers.Adam() accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy") + checkpoint_manager = tf.train.CheckpointManager( + tf.train.Checkpoint(model=model), checkpoint_dir, max_to_keep=2) @tf.function def replica_fn(iterator): @@ -170,80 +231,16 @@ with strategy.scope(): loss = strategy.schedule(replica_fn, args=(distributed_iter,)) strategy.join() - model.save() # save checkpoint/summary... + checkpoint_manager.save() # save checkpoint/summary... print ("Loss = %f, accuracy = %f" % ( strategy.local_results(loss) or float('nan'), accuracy.result())) ``` -##### Alternative training loop: fully async - -Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries. - - -```python -# … omitted -with strategy.scope(): - # … omitted - for _ in range(total_steps)): - strategy.schedule(step_fn, args=(distributed_iter,)) - - # Print accuracy value every one minute. - while not strategy.done(): - print("Current accuracy: %f" % accuracy.result()) - time.sleep(60) -# … omitted -``` - -#### Error Reporting From `replica_fn` - -Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. 
We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called. - -Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`: - -* use repeated dataset so `OutOfRangeError` will be avoided; -* avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`. - - -#### Dataset Interface - -The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`: - - -```python -for x in enumerate(distributed_iter): - loss = strategy.schedule(replica_fn, x, y) -``` - - -If we do the same thing with the `strategy.schedule` API, there are several challenges. - -The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored. - -The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. - - -##### Alternative: passing iterators to `strategy.schedule` - -The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps. - - -```python -# … omitted -with strategy.scope(): - # … omitted - distributed_iter = iter(distributed_dataset) - for i in range(total_steps): - strategy.schedule(replica_fn, args=(distributed_iter,)) -# … omitted -``` - - -**We will start with this kind of training loop in our first version.** - - ### Fault Tolerance +This section talks about the failure model and how we will support it. 
It has limitations and we will consider exposing APIs for users to define custom failure recovery policies in the future. + #### Task Failure @@ -271,78 +268,155 @@ We can explore mechanisms to recover these objects in the future. In the short-t ##### Parameter server failure -When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well. +When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well. This would require the cluster management to start the program again, once it receives an error from the client program due to parameter server failures. ##### Client failure When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts. 
+ #### Resource Management for Workers When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources. Keeping track of resources and rebuilding them will be achieved depending how users create their resources: - * we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work. -* we will capture the creation of worker-local variables via variable creator scopes. -* in the future we will provide users an API to create worker-local resources. We will capture these resources in the API. +* In the future we will provide users an API to create worker-local resources. We will capture these resources in the API. -If users create iterators or other resources inside a function but don’t expose them as outputs, we don't need to rebuild them. +If users create iterators or other resources inside a function but don’t expose them as outputs, we don’t need to rebuild them. #### The Unknown of Scheduled Functions For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty: - * keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. 
In the long run, we should get acknowledgement from runtime about how many functions have been executed. * eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier. * recommend users schedule only small functions. Large functions are more expensive to retry. +#### Schedule Affinity + +When there is schedule affinity, specified by `ScheduleOptions` or inferred from input affinity, the aforementioned failure handling mechanism of rescheduling a function on other workers will not work. In this case, the default behavior is the client waits for the failing worker to come back until timeout and returns a schedule error to users. + + ### Evaluation -Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. +Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. On the other hand, `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. -With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0. 
+With `ParameterServerStrategyV2`, we will start with two schemes: 1) evaluation done by a dedicated evaluator that runs alongside the training cluster, aka “sidecar evaluation”, with a supporting utility function, and 2) evaluation done by a function executed on a single worker or functions executed on multiple workers, aka “inline evaluation”, where evaluation takes place in an alternating manner with training. -With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths: +Sidecar evaluation is especially useful for users who prefer a setting where evaluation does not interrupt training progress, provided that saving and loading checkpoints is not considered expensive. +Inline evaluation is especially useful for users who would like to avoid checkpoint saving/loading, and for those who consider evaluation cheap enough that it is fine for training to stop for a short period of time. -```python + +#### Sidecar evaluation + +In this scheme, the training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With our recommendation[^1], users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths: + +```Python if cluster_resolver.task_type == "chief": run_training_loop() elif cluster_resolver.task_type == "evaluator": run_evaluation_loop() ``` +For users’ convenience, we will provide an `EvaluationLoop` API where the user provides key components for evaluation: -Evaluation code: - - -```python +```Python def run_evaluation_loop(...): """Run the example custom evaluation loop.""" + + model, eval_dataset, checkpoint_dir, eval_metrics = ...
+ + utils.EvaluationLoop( + model, + eval_dataset, + checkpoint_dir, + eval_metrics).start() + +class EvaluationLoop(object): - eval_dataset, model, eval_accuracy = ... - checkpoint = tf.train.Checkpoint(model=model) + def __init__(self, model, eval_dataset, checkpoint_dir, eval_metrics, + eval_steps=None): + """Initializes an EvaluationLoop object.""" + + @tf.function + def eval_fn(dataset): + """Evaluation function to compute metrics given a dataset. + + This creates a tf.function'ed evaluation function, where the dataset is + iterated over until exhaustion, or until eval_steps is met, whichever comes + earlier. If `eval_steps` is None, it exhausts the dataset. If dataset is + repeated, `eval_steps` must be provided or evaluation will be performed + indefinitely. + """ + pass + + self._eval_fn = eval_fn + # Other self attributes. + + def start(self): + """Starts an evaluation loop. + + This will start an evaluation loop which attempts to read the latest + checkpoint file. If a checkpoint file exists, and it has not been + evaluated, it loads it into the model, and executes the `eval_fn` locally. + After each evaluation run, it logs the metrics requested by the user, + writes to summary file for TensorBoard visualization, and possibly outputs + files for chief to read for further actions such as early stopping or + adjusting learning rate. + """ + pass +``` + +As illustrated above, evaluation loads into the model the checkpoints that were periodically saved (by the training client), does evaluation over a full pass of the eval dataset, and outputs the eval results. It may also export results to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.) + +At evaluator’s failures or preemptions, we expect the evaluator job to be restarted, pick up the latest checkpoint, and continue with the next round of evaluation. 
+ + +#### Inline evaluation +In this scheme, there’s no checkpoint needed (although the training/evaluation can still involve one at the user’s choice), and the same set of workers is used for evaluation after some amount of training (usually an epoch of training) has completed. No dedicated evaluator job is needed. As illustrated below, this would require users to write their `eval_fn` and schedule it to workers. + +```Python +strategy = ParameterServerStrategyV2(cluster_resolver=...) + +with strategy.scope(): + model, train_metric, train_dataset = ... @tf.function - def eval_fn(eval_dataset): - for _ in range(eval_steps): - # evaluation pass - return eval_accuracy.result() - - while True: - latest_checkpoint = get_new_checkpoint() - checkpoint.restore(latest_checkpoint) - eval_result = eval_fn(iterator) # Users can print, early stop, mark ckpt.. + def train_fn(): + ... + + eval_metric = tf.keras.metrics.CategoricalAccuracy(name="eval_accuracy") + @tf.function + def eval_fn(shard_id, num_shards): + eval_dataset = ... + for x, y in eval_dataset.shard(num_shards, shard_id): + eval_metric.update_state(y, model(x, training=False)) + + for _ in range(num_epochs): + for _ in range(num_steps): + strategy.schedule(train_fn, args=...) # Training for num_steps steps. + strategy.join() # Make sure training ends and nobody is updating PS. + + # NUM_SHARDS is some sensible number; it needs to be at least the number of workers, + # preferably much larger than that. + for shard_id in range(NUM_SHARDS): + strategy.schedule(eval_fn, args=(shard_id, NUM_SHARDS)) + strategy.join() + print("Eval result is %f." % eval_metric.result()) + + # Optionally save checkpoint/summary, adjust learning rate or early stop, + # based on the evaluation result.
+ checkpoint_manager.save() ``` -In the evaluation client, the user loads the checkpoints that were periodically saved into the model (by the training client), does evaluation over a full pass of eval dataset, and does whatever they want to do with eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.) +If the worker that’s actively performing the evaluation encounters failures or preemptions, it is expected that `eval_fn` with a specific `shard_id` will be taken over by another available worker. This may result in duplicated evaluation on some input examples. This can be solved by having metrics as worker-local resources, and returning the metric results as the return value of `eval_fn`. The user would then aggregate the results of those `eval_fn`s. ## Implementation @@ -354,8 +428,7 @@ We can potentially expose them in the future when they are more stable and when We will have `Cluster` and `Worker` classes to encapsulate logic related to remote function scheduling. - -```python +```Python class Cluster(object): def __init__(self, cluster_resolver, failure_handler=None): @@ -380,11 +453,9 @@ class Cluster(object): pass ``` - We will probably merge this `Worker` with executors. - -```python +```Python class Worker(object): def __init__(self, @@ -398,7 +469,7 @@ class Worker(object): """Schedule the function on the worker. It adds the function to the scheduling queue. It returns Future object - immediately unless the scheduling queue is full. + immediately. """ pass @@ -415,22 +486,32 @@ class Worker(object): pass ``` +As mentioned, the return value of `schedule` will be `Future` objects. A `Future` works as a container and will be bound later to a state of either success or complete failure.
Overall, this `Future` class has the following benefits: -As we mentioned the return value of `schedule` will be `Future` objects if `function` has return values. The `Future` works as a container and will be later-binded with states of either success or complete failure. Overall, this `Future` class has the following benefits: - - - -* It allows `schedule` method return immediately after pushing functions to its scheduling queue. It allows these methods to return without needing to wait for acknowledgement from workers. +* It allows the `schedule` method to return immediately after pushing functions to its scheduling queue. It allows these methods to return without needing to wait for acknowledgement from workers. * It serves as the container for values or errors. It would be binded with a value or an error later. When it is rebuilt, we can replace its underlying value silently. * When being passed to `local_result`, we flag it to indicate that this value needs to be fetched eagerly. +* It provides a handle for user to wait for and get the error of a particular function. * (Future work) It captures the lineage between functions and return values so that we can rebuild any poisoned objects. -```python +```Python class Future(object): def __init__(self, closure): pass + def wait(self): + """Block until the corresponding function is executed.""" + pass + + def result(self): + """Materialize the future. + + An exception will be thrown if the corresponding function fails to + schedule/execute. + """ + pass + def _set_value(self, value): pass @@ -439,13 +520,8 @@ class Future(object): def _set_eagerly_fetch(self): pass - - def _fetch(self): - pass ``` - - We can potentially merge this `Future` class with our `Tensor` class. @@ -459,22 +535,60 @@ The following are features we have been considering to support in the future alt Workers can come and go. 
 To support this, we’ll probably need a mechanism to discover and remove workers and make our implementation of `tf.distribute` reactive. -### Integration with tf.data Service +### Automated Worker Pool Resizing -In our design, we assume that `replica_fn` can be scheduled on any worker with some constraints. For example, datasets can not be sharded across workers; rebuilding iterators will lose their states. With the help of `tf.data` service, we can get rid of these constraints. +Once dynamic membership is supported, it would be useful to build automation on top of it, where the number of workers increases or decreases automatically based on the usage. -### Advanced Evaluations +### Caching Variables/Resources +Some variables or resources can be cached on workers to achieve faster read and update. They can have a global copy on parameter servers and local copies on all workers. We should allow users to define policies to use cached local copies to update the global copy whenever the latest value is needed. -#### Inline evaluation +These variables include loss scales in mixed precision training and batchnorm statistics. These are similar to sync-on-read variables in other distribution strategies. A possible way to update the global copy using a local copy is: `global_value += (local_value - global_value) / num_workers`. + +Hash tables for embedding lookup can also be cached on workers. + + +### Worker-local Resources + +Lookup tables, replay buffers and any other worker-local resources need to be elastic to work with the `schedule` API. The `distribute_dataset` method can also call `create_worker_resource` to create elastic datasets for training. + +```Python +class ParameterServerStrategyV2(BaseStrategy): -The client drives the same worker pool for evaluation. We can alternative training and evaluation. + def create_worker_resource(self, resource_creation_fn): + """Create one resource per worker.
+    If workers are added, the `resource_creation_fn` will be called to create
+    resources on new workers.
+    """
+    pass
+
+class ElasticResource(object):
+
+  def __init__(self, resource_dict):
+    pass

-#### Sidecar evaluation cluster
+  def add_resource(self, worker_resource_pair):
+    pass
+
+  def remove_resource(self, worker):
+    pass

-We can have a sidecar evaluation cluster as well. They can either do evaluation synchronously on a common dataset or each does its own evaluation.
+  def get(self, worker):
+    """Return the concrete resource on the given `worker`.
+
+    If a scheduled function takes `ElasticResource` as input, the scheduler, after
+    deciding which worker to schedule the function on, will call this method to
+    get the underlying resource on the corresponding worker.
+    """
+    pass
+```
+
+
+### Integration with tf.data Service
+
+In our design, we assume that `replica_fn` can be scheduled on any worker with some constraints. For example, datasets can not be sharded across workers; rebuilding iterators will lose their states. With the help of `tf.data` service, we can get rid of these constraints.

 ### Keras Integration

@@ -484,11 +598,21 @@ Integrating with Keras `model.fit()` will largely be reusing previous work done

 Most important implication of integrating with Keras `model.fit()` is that we will need support for `strategy.join()` and/or `strategy.local_results()` for callbacks. This would have performance implications but that would be the trade off for fitting the synchronous `model.fit()` semantics.

+### More ScheduleOptions
+
+More schedule options can be added, such as how many times a function is rescheduled before an error is returned to the user when it is interrupted by worker preemption.
+
+
 ### Versioning

 The client and standard server binaries may be in different versions. There is no backward or forward compatibility guarantee. For now, we recommend users run the same binary which will run standard TensorFlow servers if it is not the client.
+### Better Preemption Handling
+
+We can leverage features of container orchestration frameworks to improve preemption handling. For example, if we are notified that a worker or a parameter server is about to be preempted, we can save some of its state and recover much faster with this state.
+
+
 ### Advanced Fault Tolerance

From 6fa632ac9113ea9bc7cc2ee3f3e291860af708af Mon Sep 17 00:00:00 2001
From: Yuefeng Zhou
Date: Fri, 10 Apr 2020 00:39:56 -0700
Subject: [PATCH 10/10] Change the status to accepted

---
 rfcs/20200306-single-client-parameter-server.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md
index 208e55e37..a57ae4231 100644
--- a/rfcs/20200306-single-client-parameter-server.md
+++ b/rfcs/20200306-single-client-parameter-server.md
@@ -1,6 +1,6 @@
 # Single-client Parameter Server Training

-| Status | Proposed |
+| Status | Accepted |
 :-------------- |:---------------------------------------------------- |
 | **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |
 | **Sponsor** | Priya Gupta (priyag@google.com) |