SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Authors (Georgia Tech & UC Berkeley): Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov
Link: [2312.16733] SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads (https://arxiv.org/abs/2312.16733)
Abstract: The increasing deployment of ML models on the critical path of production applications in both the datacenter and the edge requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overall efficiency of utilization of scarce resources. State-of-the-art systems resolve this tension by either choosing a static point in the latency-accuracy tradeoff space to serve all requests or loading specific models on the critical path of request serving. In this work, we instead resolve this tension by simultaneously serving the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized operators in weight-shared SuperNetworks. These operators enable SubNetAct to dynamically route requests through the network to meet a latency and accuracy target. SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than prior state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of models unlocks the design space of fine-grained, reactive scheduling policies. We explore the design of one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x higher SLO attainment for the same accuracy on a trace derived from the real-world Microsoft Azure Functions workload, and yields the best trade-offs on a wide range of extremely bursty synthetic traces automatically.
A new scheduling problem built on top of Once-for-All (OFA) from Song Han's group at MIT. An NSDI paper from 2023…
The tension between SLO and accuracy:
In early serving systems the model is fixed statically; when the load becomes dynamic, they can only meet one of the SLO or the accuracy target, not both.
The tension between SLO and resource efficiency:
More recent serving systems keep multiple models and select among them based on the request rate. However, this requires either holding all of the models in memory or using model-switching techniques, which violates resource efficiency. It has also steered existing work toward avoiding model switches or reducing their overhead, overlooking the three-way trade-off among SLO, accuracy, and resource efficiency.
- Background: NAS and SuperNets.
- The traditional approach loads each candidate model in full into GPU memory, or relies on model-switching policies.
- Recent work (weight-shared SuperNets such as OFA) decouples NAS from repeatedly training every candidate: the SuperNet is trained once, and many subnetworks can be extracted from it without retraining.
Compared with prior work, this paper considers SLO, accuracy, and resource efficiency simultaneously.
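The scheduling half, SlackFit, builds on the fact that any subnet can be actuated near-instantly. The abstract only describes it as a fine-grained, reactive policy, so the sketch below is one plausible reading of the name: assume each subnet has a profiled (latency, accuracy) point, and greedily serve each request with the most accurate subnet that still fits in its remaining slack. All names and numbers here (`SUBNET_PROFILES`, `pick_subnet`, the profile values) are illustrative assumptions, not from the paper.

```python
# Hypothetical offline profile of the subnets exposed by the SuperNet.
# A real system would measure these on the target GPU; the numbers here
# are illustrative only.
SUBNET_PROFILES = [
    # (name, estimated latency in ms, top-1 accuracy)
    ("subnet-xs", 4.0, 0.71),
    ("subnet-s",  7.5, 0.74),
    ("subnet-m", 12.0, 0.77),
    ("subnet-l", 20.0, 0.80),
]

def pick_subnet(slack_ms: float):
    """Greedy slack-fitting: among subnets whose estimated latency fits in
    the request's remaining slack, serve the most accurate one. If nothing
    fits, fall back to the fastest subnet to minimize the SLO violation."""
    feasible = [p for p in SUBNET_PROFILES if p[1] <= slack_ms]
    if not feasible:
        return min(SUBNET_PROFILES, key=lambda p: p[1])   # fastest
    return max(feasible, key=lambda p: p[2])              # most accurate

print(pick_subnet(10.0))   # -> ('subnet-s', 7.5, 0.74)
print(pick_subnet(25.0))   # -> ('subnet-l', 20.0, 0.80)
```

Because switching subnets inside the shared SuperNet is just a routing decision (see the operator list below), this per-request choice involves no model loading or swapping.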
On the mechanism side, SubNetAct inserts three specialized operators into the weight-shared SuperNet (a minimal sketch of all three follows this list):
- LayerSelect: dynamically selects which layers a request is routed through.
- SubnetNorm: normalization statistics vary with the selected layers, so each subnet must normalize with the statistics computed for it; sharing one set of statistics would skew the results.
- WeightSlice: dynamically selects channels of the shared weight matrix in fully-connected layers.
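To make the list concrete, here is a minimal PyTorch sketch of how the three operators might compose in a toy MLP-shaped SuperNet. This is my own illustrative reconstruction under stated assumptions, not SubNetAct's actual code: the class names (other than the operator names), the `cfg` format, and all sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubnetNorm(nn.Module):
    """Per-subnet normalization: the affine parameters are shared, but each
    subnet id gets the mean/var computed for it, since sliced subnets see
    different activation distributions."""
    def __init__(self, num_features, num_subnets):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # Buffers would be filled by per-subnet calibration; trivial init here.
        self.register_buffer("means", torch.zeros(num_subnets, num_features))
        self.register_buffer("vars", torch.ones(num_subnets, num_features))

    def forward(self, x, subnet_id, width):
        mean, var = self.means[subnet_id, :width], self.vars[subnet_id, :width]
        x = (x - mean) / torch.sqrt(var + 1e-5)
        return x * self.weight[:width] + self.bias[:width]

class SlicedLinear(nn.Module):
    """WeightSlice: one full-size weight matrix; a subnet uses only the
    first in_w input and out_w output channels."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, in_w, out_w):
        return F.linear(x, self.weight[:out_w, :in_w], self.bias[:out_w])

class ToySuperNet(nn.Module):
    """LayerSelect: a request is routed only through the layers named in
    its subnet config, so 'loading' a model is just a routing change."""
    def __init__(self, num_layers, dim, num_subnets):
        super().__init__()
        self.layers = nn.ModuleList(SlicedLinear(dim, dim) for _ in range(num_layers))
        self.norms = nn.ModuleList(SubnetNorm(dim, num_subnets) for _ in range(num_layers))

    def forward(self, x, cfg):
        # cfg example: {"id": 1, "layers": [0, 1, 3], "width": 256}
        w = cfg["width"]
        x = x[:, :w]
        for i in cfg["layers"]:                              # LayerSelect
            x = self.layers[i](x, in_w=w, out_w=w)           # WeightSlice
            x = torch.relu(self.norms[i](x, cfg["id"], w))   # SubnetNorm
        return x

# Two "models" served by the same set of weights:
net = ToySuperNet(num_layers=4, dim=512, num_subnets=2)
small = {"id": 0, "layers": [0, 3], "width": 128}
large = {"id": 1, "layers": [0, 1, 2, 3], "width": 512}
x = torch.randn(8, 512)
y_small, y_large = net(x, small), net(x, large)
```

The key property: switching between `small` and `large` changes only `cfg`, with no weights moving in or out of GPU memory, which is what makes near-instantaneous actuation and reactive policies like SlackFit practical.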
Chapter 3: to be continued.