Horovod Source Code Analysis (Part 1)

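Horovod is a distributed training framework open-sourced by Uber that supports the mainstream machine learning frameworks (TensorFlow, PyTorch, and MXNet). Based on v0.21.1, this article walks through Horovod's core implementation and its integration with each framework.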

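Horovod's workflow is fairly simple. A message queue receives requests for three ops: AllReduce, AllGather, and Broadcast. A background thread polls the queue at a fixed interval; once it has a batch of ops, it fuses their tensors and then performs the corresponding operation. If the tensors live in GPU memory, the operation runs through NCCL; if they live in host memory, it runs through MPI or Gloo.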
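The queue-and-background-thread pattern described above can be pictured with a small sketch. It assumes a mutex-protected queue that framework threads push into and that the background thread drains once per cycle; SketchRequest, SketchQueue, and DrainOnce are hypothetical names for illustration, not Horovod's own classes:

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a collective request; not Horovod's Request class.
struct SketchRequest {
  enum class Op { kAllReduce, kAllGather, kBroadcast };
  Op op;
  std::string tensor_name;
};

// Thread-safe queue: framework threads push, the background thread drains.
class SketchQueue {
 public:
  void Push(SketchRequest request) {
    std::lock_guard<std::mutex> guard(mutex_);
    queue_.push_back(std::move(request));
  }

  // Copies every pending request into `out` and empties the queue.
  void PopAll(std::vector<SketchRequest>& out) {
    std::lock_guard<std::mutex> guard(mutex_);
    out.assign(queue_.begin(), queue_.end());
    queue_.clear();
  }

 private:
  std::mutex mutex_;
  std::deque<SketchRequest> queue_;
};

// One consumer cycle: take the whole batch and return the tensor names that
// would be handed on to fusion and execution.
std::vector<std::string> DrainOnce(SketchQueue& queue) {
  std::vector<SketchRequest> batch;
  queue.PopAll(batch);
  std::vector<std::string> names;
  for (const auto& request : batch) {
    names.push_back(request.tensor_name);
  }
  return names;
}
```

Draining the whole queue in one call is what gives the consumer a batch to work with: everything submitted during the previous interval arrives together and becomes a candidate for fusion.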

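The core code lives in the horovod/common directory. operations.cc is effectively Horovod's entry point and contains important functions such as BackgroundThreadLoop and RunLoopOnce. Following these functions gives a good first look at how things work.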

Let's start with RunLoopOnce. Some optimization code, such as the response cache and autotuning, is omitted here:

    bool RunLoopOnce(HorovodGlobalState& state) {
      // Check whether a full cycle (CycleTimeMs) has elapsed since the previous
      // cycle started; if not, sleep for the remainder.
      auto start_time = std::chrono::steady_clock::now();
      auto sleep_duration = state.last_cycle_start +
                            std::chrono::microseconds(long(
                                state.parameter_manager.CycleTimeMs() * 1000.)) -
                            start_time;
      if (sleep_duration > std::chrono::steady_clock::duration::zero()) {
        std::this_thread::sleep_for(sleep_duration);
      }
      state.last_cycle_start = std::chrono::steady_clock::now();

      // Record the cycle in the Timeline; the resulting Timeline output can be
      // viewed in Chrome.
      if (state.mark_cycles_in_timeline) {
        // Mark start of the new cycle.
        state.timeline.MarkCycleStart();
      }

      auto response_list =
          state.controller->ComputeResponseList(horovod_global.shut_down, state);

      state.mark_cycles_in_timeline =
          state.controller->MarkCyclesInTimelinePending();

      // Perform the collective operation for each response.
      for (auto& response : response_list.responses()) {
        PerformOperation(response, horovod_global);
      }

      return !response_list.shutdown();
    }
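The pacing logic at the top of RunLoopOnce can be isolated into a small pure function, which makes the arithmetic easier to check. This is a sketch under the assumption that the clock reading is passed in rather than taken inside the function; RemainingSleep and the Clock alias are hypothetical names, not part of Horovod:

```cpp
#include <algorithm>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Returns how long the loop should sleep so that consecutive cycles start at
// least `cycle_time_ms` apart. Returns zero when the cycle is already overdue.
Clock::duration RemainingSleep(Clock::time_point last_cycle_start,
                               Clock::time_point now, double cycle_time_ms) {
  // Same conversion as in RunLoopOnce: milliseconds (possibly fractional)
  // to whole microseconds.
  const auto cycle = std::chrono::microseconds(
      static_cast<long long>(cycle_time_ms * 1000.0));
  const auto remaining = std::chrono::duration_cast<Clock::duration>(
      last_cycle_start + cycle - now);
  return std::max(remaining, Clock::duration::zero());
}
```

With a hypothetical cycle time of 5 ms, a call made 2 ms after the previous cycle started yields 3 ms of sleep, while a call made 7 ms after it yields zero, so the loop proceeds immediately.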

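As RunLoopOnce shows, Horovod's workflow is roughly what was described above: a producer-consumer pattern. The controller's job here is coordination: the ranks exchange which requests are ready, and the collective operation is then executed for the ready requests.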

Next, let's look at ComputeResponseList. It is a very long function of about 380 lines; to make it easier to follow, the cache and stall-check code has been removed here:

    ResponseList Controller::ComputeResponseList(std::atomic_bool& shut_down,
                                                 HorovodGlobalState& state) {
      CacheCoordinator cache_coordinator(response_cache_.num_active_bits());

      // message queue used only in this cycle
      std::deque<Request> message_queue_tmp;
      tensor_queue_.PopMessagesFromQueue(message_queue_tmp);
      for (auto& message : message_queue_tmp) {
        if (message.request_type() == Request::JOIN) {
          state.joined = true;
          cache_coordinator.set_uncached_in_queue(true);
          continue;
        }
      }

      // Flag indicating that the background thread should shut down.
      bool should_shut_down = shut_down;
      cache_coordinator.set_should_shut_down(should_shut_down);

      ResponseList response_list;
      response_list.set_shutdown(cache_coordinator.should_shut_down());

      {
        // Collect all tensors that are ready to be reduced. Record them in the
        // tensor count table (rank zero) or send them to rank zero to be
        // recorded (everyone else).
        std::vector<std::string> ready_to_reduce;

        if (is_coordinator_) {
          // On the master process, record the tensors that are ready. Note that
          // at this point the requests in message_queue_tmp come from the
          // master process itself.
          while (!message_queue_tmp.empty()) {
            // Pop the first available message
            Request message = message_queue_tmp.front();
            message_queue_tmp.pop_front();

            if (message.request_type() == Request::JOIN) {
              state.joined_size++;
              continue;
            }

            bool reduce = IncrementTensorCount(message, state.joined_size);
            if (reduce) {
              ready_to_reduce.push_back(message.tensor_name());
            }
          }

          // Receive the ready tensors from the other ranks.
          std::vector<RequestList> ready_list;
          RecvReadyTensors(ready_to_reduce, ready_list);

          // Process the requests from the other ranks. size_ is the number of
          // ranks.
          for (int i = 1; i < size_; ++i) {
            auto received_message_list = ready_list[i];
            for (auto& received_message : received_message_list.requests()) {
              auto& received_name = received_message.tensor_name();

              // A JOIN message means the sending rank has called Join(), i.e.
              // it has no more data of its own to contribute; it does not mean
              // that a new rank has been added.
              if (received_message.request_type() == Request::JOIN) {
                state.joined_size++;
                continue;
              }

              // Increment the number of ranks on which this tensor is ready;
              // once every rank is ready, it is announced to the other ranks.
              bool reduce =
                  IncrementTensorCount(received_message, state.joined_size);
              if (reduce) {
                ready_to_reduce.push_back(received_name);
              }
            }
            if (received_message_list.shutdown()) {
              // Received SHUTDOWN request from one of the workers.
              should_shut_down = true;
            }
          }

          // Check if tensors from previous ticks are ready to reduce after Joins.
          if (state.joined_size > 0) {
            for (auto& table_iter : message_table_) {
              int count = (int)table_iter.second.size();
              if (count == (size_ - state.joined_size) &&
                  std::find(ready_to_reduce.begin(), ready_to_reduce.end(),
                            table_iter.first) == ready_to_reduce.end()) {
                state.timeline.NegotiateEnd(table_iter.first);
                ready_to_reduce.push_back(table_iter.first);
              }
            }
          }

          // This condition is a little puzzling: read literally, it says to
          // fuse when group fusion is disabled and group_table_ is non-empty.
          // Judging by the comments below, each group is fused on its own here
          // rather than together with other tensors.
          if (state.disable_group_fusion && !group_table_.empty()) {
            // Extract set of common groups from coordinator tensor list and
            // cache hits.
            std::vector<int> common_ready_groups;
            std::unordered_set<int> processed;

            for (const auto& tensor_name : ready_to_reduce) {
              int group_id = group_table_.GetGroupIDFromTensorName(tensor_name);
              if (group_id != NULL_GROUP_ID &&
                  processed.find(group_id) == processed.end()) {
                common_ready_groups.push_back(group_id);
                processed.insert(group_id);
                // Leaving name in list, to be skipped later.
              }
            }

            // For each ready group, form and fuse response lists independently
            for (auto id : common_ready_groups) {
              std::deque<Response> responses;
              for (const auto& tensor_name :
                   group_table_.GetGroupTensorNames(id)) {
                if (message_table_.find(tensor_name) != message_table_.end()) {
                  // Uncached message
                  Response response =
                      ConstructResponse(tensor_name, state.joined_size);
                  responses.push_back(std::move(response));
                }
              }
              FuseResponses(responses, state, response_list);
            }
          }

          // At this point, rank zero should have a fully updated tensor count
          // table and should know all the tensors that need to be reduced or
          // gathered, and everyone else should have sent all their information
          // to rank zero. We can now do reductions and gathers; rank zero will
          // choose which ones and in what order, and will notify the other ranks
          // before doing each reduction.
          std::deque<Response> responses;
          for (auto& tensor_name : ready_to_reduce) {
            // Skip tensors in group that were handled earlier.
            if (state.disable_group_fusion && !group_table_.empty() &&
                group_table_.GetGroupIDFromTensorName(tensor_name) !=
                    NULL_GROUP_ID) {
              continue;
            }
            Response response =
                ConstructResponse(tensor_name, state.joined_size);
            responses.push_back(std::move(response));
          }

          if (state.joined_size == size_) {
            // All ranks did Join(). Send the response, reset joined size.
            Response join_response;
            join_response.set_response_type(Response::JOIN);
            join_response.add_tensor_name(JOIN_TENSOR_NAME);
            responses.push_back(std::move(join_response));
            state.joined_size = 0;
          }

          FuseResponses(responses, state, response_list);
          response_list.set_shutdown(should_shut_down);

          // Broadcast final results to other ranks.
          SendFinalTensors(response_list);
        } else {
          // Not the master: send the tensors that are ready on this rank to the
          // master, then receive the final list of ready tensors.
          RequestList message_list;
          message_list.set_shutdown(should_shut_down);
          while (!message_queue_tmp.empty()) {
            message_list.add_request(message_queue_tmp.front());
            message_queue_tmp.pop_front();
          }

          // Send ready tensors to rank zero
          SendReadyTensors(message_list);

          // Receive final tensors to be processed from rank zero
          RecvFinalTensors(response_list);
        }
      }

      if (!response_list.responses().empty()) {
        std::string tensors_ready;
        for (const auto& r : response_list.responses()) {
          tensors_ready += r.tensor_names_string() + "; ";
        }
      }

      // Reassign cache bits based on current cache order.
      response_cache_.update_cache_bits();

      return response_list;
    }

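In Horovod, each GPU corresponds to one training process, called a rank. With 4 GPUs, for example, the processes have ranks [0, 1, 2, 3]. The process with rank 0 acts as the master and the rest are workers. In ComputeResponseList, each worker sends the master the tensors it has ready. Once a tensor is ready on every rank, the master notifies the other ranks that the collective operation can be executed on it.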
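The master's bookkeeping can be sketched as a table from tensor name to the set of ranks that have reported it. The sketch below is a simplification that ignores joined ranks, caching, and request validation; ReadinessTable and MarkReady are hypothetical names, not Horovod's message_table_ or IncrementTensorCount:

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the coordinator's bookkeeping.
class ReadinessTable {
 public:
  explicit ReadinessTable(int world_size) : world_size_(world_size) {}

  // Records that `rank` has requested `tensor_name`. Returns true exactly when
  // the last missing rank reports in, i.e. the tensor is ready on every rank.
  bool MarkReady(const std::string& tensor_name, int rank) {
    auto& ranks = ready_ranks_[tensor_name];
    const bool inserted = ranks.insert(rank).second;
    if (inserted && static_cast<int>(ranks.size()) == world_size_) {
      // Forget the entry so the same name can be negotiated again next step.
      ready_ranks_.erase(tensor_name);
      return true;
    }
    return false;
  }

 private:
  int world_size_;
  std::unordered_map<std::string, std::unordered_set<int>> ready_ranks_;
};
```

With three ranks, MarkReady("grad_a", r) returns false for the first two distinct ranks and true for the third, at which point the master would add "grad_a" to its ready list and announce it to the workers.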

Next, let's look at the other important function called from RunLoopOnce: PerformOperation. This function is fairly clear and mainly does three things:

Fuse tensors: merge several tensors into one large tensor before performing the collective operation

Wait for the data to be in place

Perform the collective operation

    void PerformOperation(Response response, HorovodGlobalState& state) {
      std::vector<TensorTableEntry> entries;
      auto& timeline = horovod_global.timeline;
      if (response.response_type() != Response::JOIN) {
        // Oddly, this uses the horovod_global variable directly, while joined
        // is read from state.
        horovod_global.tensor_queue.GetTensorEntriesFromResponse(response, entries,
                                                                 state.joined);

        for (auto& e : entries) {
          timeline.Start(e.tensor_name, response.response_type());
        }

        if (entries.size() > 1) {
          // With more than one entry, they can be fused to improve throughput.
          auto first_entry = entries[0];
          // Note: it is OK for different entries to come from different frameworks
          // since buffer allocated here is guaranteed to survive at least till the
          // end of this operation.
          Status status = horovod_global.fusion_buffer.InitializeBuffer(
              horovod_global.controller->TensorFusionThresholdBytes(),
              first_entry.device, first_entry.context,
              horovod_global.current_nccl_stream,
              [&]() { timeline.ActivityStartAll(entries, INIT_FUSION_BUFFER); },
              [&]() { timeline.ActivityEndAll(entries); });
          if (!status.ok()) {
            LOG(DEBUG, horovod_global.controller->GetRank())
                << "InitializeBuffer Failed";
            for (auto& e : entries) {
              timeline.End(e.tensor_name, nullptr);
              // Callback can be null if the rank sent Join request.
              if (e.callback != nullptr) {
                e.callback(status);
              }
            }
            return;
          }
        }

        // On GPU data readiness is signalled by ready_event.
        // Even once a tensor has been cleared for the operation, we still have
        // to wait for its data to be synchronized into GPU memory.
        std::vector<TensorTableEntry> waiting_tensors;
        for (auto& e : entries) {
          if (e.ready_event != nullptr) {
            timeline.ActivityStart(e.tensor_name, WAIT_FOR_DATA);
            waiting_tensors.push_back(e);
          }
        }
        while (!waiting_tensors.empty()) {
          for (auto it = waiting_tensors.begin(); it != waiting_tensors.end();) {
            if (it->ready_event->Ready()) {
              timeline.ActivityEnd(it->tensor_name);
              timeline.ActivityStart(it->tensor_name, WAIT_FOR_OTHER_TENSOR_DATA);
              it = waiting_tensors.erase(it);
            } else {
              ++it;
            }
          }
          std::this_thread::sleep_for(std::chrono::nanoseconds(100));
        }
        for (auto& e : entries) {
          if (e.ready_event != nullptr) {
            timeline.ActivityEnd(e.tensor_name);
          }
        }
      }

      // Finally, the collective operation can run.
      Status status;
      try {
        status = op_manager->ExecuteOperation(entries, response);
      } catch (const std::exception& ex) {
        LOG(DEBUG, horovod_global.controller->GetRank())
            << "ExecuteOperation Failed";
        status = Status::UnknownError(ex.what());
      }

      if (!status.in_progress()) {
        for (auto& e : entries) {
          timeline.End(e.tensor_name, status.ok() ? e.output : nullptr);
          // Callback can be null if the rank sent Join request.
          if (e.callback != nullptr) {
            e.callback(status);
          }
        }
      }
    }
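The fusion step can be pictured as packing several small tensors into one contiguous buffer, running a single collective on that buffer, and copying the results back out. Below is a minimal sketch, assuming flat float tensors in host memory; SketchTensor, Pack, and Unpack are hypothetical names for illustration and are not Horovod's fusion buffer API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor entry: a flat float tensor on the host.
struct SketchTensor {
  std::vector<float> data;
};

// Copies every tensor back to back into one contiguous buffer.
std::vector<float> Pack(const std::vector<SketchTensor>& tensors) {
  std::vector<float> fused;
  for (const auto& t : tensors) {
    fused.insert(fused.end(), t.data.begin(), t.data.end());
  }
  return fused;
}

// Copies slices of the fused buffer back into the tensors, in the same order
// and with the same sizes that Pack used.
void Unpack(const std::vector<float>& fused,
            std::vector<SketchTensor>& tensors) {
  std::ptrdiff_t offset = 0;
  for (auto& t : tensors) {
    const auto length = static_cast<std::ptrdiff_t>(t.data.size());
    std::copy(fused.begin() + offset, fused.begin() + offset + length,
              t.data.begin());
    offset += length;
  }
}
```

The point of the extra copies is that one collective call on a large buffer typically amortizes per-call latency better than many calls on small tensors, which is the throughput gain the comment in PerformOperation refers to.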

That concludes the walkthrough of Horovod's main workflow.
