MMDetection English Documentation Translation --- 1_exist_data_model (Existing Data and Models)

1: Inference and train with existing models and standard datasets

MMDetection provides hundreds of existing detection models in Model Zoo, and supports multiple standard datasets, including Pascal VOC, COCO, CityScapes, LVIS, etc. This note will show how to perform common tasks on these existing models and standard datasets, including:

  • Use existing models to run inference on given images.

  • Test existing models on standard datasets.

  • Train predefined models on standard datasets.

Inference with existing models

By inference, we mean using trained models to detect objects on images. In MMDetection, a model is defined by a configuration file and existing model parameters are saved in a checkpoint file.

To start with, we recommend Faster RCNN with this configuration file and this checkpoint file. It is recommended to download the checkpoint file to the checkpoints directory.

High-level APIs for inference

MMDetection provides high-level Python APIs for inference on images. Here is an example of building the model and running inference on given images or videos.

from mmdet.apis import init_detector, inference_detector
import mmcv

# Specify the path to model config and checkpoint file
config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'

# build the model from a config file and a checkpoint file
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# test a single image and show the results
img = 'test.jpg'  # or img = mmcv.imread(img), which will only load it once
result = inference_detector(model, img)
# visualize the results in a new window
model.show_result(img, result)
# or save the visualization results to image files
model.show_result(img, result, out_file='result.jpg')

# test a video and show the results
video = mmcv.VideoReader('video.mp4')
for frame in video:
    result = inference_detector(model, frame)
    model.show_result(frame, result, wait_time=1)

A notebook demo can be found in demo/inference_demo.ipynb.
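
The result returned by inference_detector can also be consumed directly in Python. Below is a minimal sketch of post-processing it, assuming the MMDetection 2.x box format, where a plain detector returns one array of shape (n, 5) per class and each row is [x1, y1, x2, y2, score]; filter_detections is a hypothetical helper, not part of the MMDetection API.

def filter_detections(result, score_thr=0.3):
    """Collect (class_index, bbox, score) tuples above a score threshold."""
    detections = []
    for class_idx, bboxes in enumerate(result):  # one (n, 5) array per class
        for x1, y1, x2, y2, score in bboxes:
            if score >= score_thr:
                detections.append((class_idx, (x1, y1, x2, y2), float(score)))
    return detections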

Note: inference_detector only supports single-image inference for now.

Asynchronous interface - supported for Python 3.7+

For Python 3.7+, MMDetection also supports async interfaces.

By utilizing CUDA streams, it keeps the CPU from blocking on GPU-bound inference code and enables better CPU/GPU utilization for single-threaded applications. Inference can be done concurrently either between different input data samples or between different models of some inference pipeline.

See tests/async_benchmark.py to compare the speed of synchronous and asynchronous interfaces.

import asyncio
import torch
from mmdet.apis import init_detector, async_inference_detector
from mmdet.utils.contextmanagers import concurrent

async def main():
    config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
    checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
    device = 'cuda:0'
    model = init_detector(config_file, checkpoint=checkpoint_file, device=device)

    # queue is used for concurrent inference of multiple images
    streamqueue = asyncio.Queue()
    # queue size defines concurrency level
    streamqueue_size = 3

    for _ in range(streamqueue_size):
        streamqueue.put_nowait(torch.cuda.Stream(device=device))

    # test a single image and show the results
    img = 'test.jpg'  # or img = mmcv.imread(img), which will only load it once

    async with concurrent(streamqueue):
        result = await async_inference_detector(model, img)

    # visualize the results in a new window
    model.show_result(img, result)
    # or save the visualization results to image files
    model.show_result(img, result, out_file='result.jpg')


asyncio.run(main())
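
If several images need to be processed, one way to exercise the concurrency described above is to gather multiple async calls. The sketch below assumes the same model and streamqueue setup as in the example; run_batch is a hypothetical helper.

async def run_batch(model, streamqueue, imgs):
    async def infer(img):
        # each task borrows a CUDA stream from the queue for its duration
        async with concurrent(streamqueue):
            return await async_inference_detector(model, img)
    # results come back in the same order as imgs
    return await asyncio.gather(*(infer(img) for img in imgs))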

Demos

We also provide three demo scripts, implemented with high-level APIs and supporting functionality code. Source code is available here.

Image demo

This script performs inference on a single image.

>python demo/image_demo.py \
 ${IMAGE_FILE} \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 [--device ${GPU_ID}] \
 [--score-thr ${SCORE_THR}]

Examples:

>python demo/image_demo.py demo/demo.jpg \
 configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
 checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
 --device cpu

Webcam demo

This is a live demo from a webcam.

>python demo/webcam_demo.py \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 [--device ${GPU_ID}] \
 [--camera-id ${CAMERA-ID}] \
 [--score-thr ${SCORE_THR}]

Examples:

>python demo/webcam_demo.py \
 configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
 checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

Video demo

This script performs inference on a video.

>python demo/video_demo.py \
 ${VIDEO_FILE} \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 [--device ${GPU_ID}] \
 [--score-thr ${SCORE_THR}] \
 [--out ${OUT_FILE}] \
 [--show] \
 [--wait-time ${WAIT_TIME}]

Examples:


>python demo/video_demo.py demo/demo.mp4 \
 configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
 checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
 --out result.mp4

Test existing models on standard datasets

To evaluate a model's accuracy, one usually tests the model on some standard datasets.

MMDetection supports multiple public datasets including COCO, Pascal VOC, CityScapes, and more. This section will show how to test existing models on supported datasets.

Prepare datasets

Public datasets like Pascal VOC (or its mirror) and COCO are available from official websites or mirrors. Note: In the detection task, Pascal VOC 2012 is an extension of Pascal VOC 2007 without overlap, and we usually use them together.

It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to $MMDETECTION/data as below.

If your folder structure is different, you may need to change the corresponding paths in config files.

mmdetection
├── mmdet
├── tools
├── configs
├── data
│ ├── coco
│ │ ├── annotations
│ │ ├── train2017
│ │ ├── val2017
│ │ ├── test2017
│ ├── cityscapes
│ │ ├── annotations
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── VOCdevkit
│ │ ├── VOC2007
│ │ ├── VOC2012
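
As an illustration, the symlink can be created from Python as below (the source path /data/datasets/coco is only an assumption; ln -s from a shell achieves the same).

import os

os.makedirs('data', exist_ok=True)
# point data/coco at wherever the dataset actually lives
os.symlink('/data/datasets/coco', 'data/coco')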

Some models require additional COCO-stuff datasets, such as HTC, DetectoRS and SCNet; you can download and unzip them, then move them to the coco folder. The directory should look like this.

mmdetection
├── data
│ ├── coco
│ │ ├── annotations
│ │ ├── train2017
│ │ ├── val2017
│ │ ├── test2017
│ │ ├── stuffthingmaps

The cityscapes annotations need to be converted into the coco format using tools/dataset_converters/cityscapes.py:

>pip install cityscapesscripts

python tools/dataset_converters/cityscapes.py \
 ./data/cityscapes \
 --nproc 8 \
 --out-dir ./data/cityscapes/annotations

Test existing models

We provide testing scripts for evaluating an existing model on the whole dataset (COCO, PASCAL VOC, Cityscapes, etc.).

The following testing environments are supported:

  • single GPU

  • single node multiple GPUs

  • multiple nodes

Choose the proper script to perform testing depending on the testing environment.

># single-gpu testing
python tools/test.py \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 [--out ${RESULT_FILE}] \
 [--eval ${EVAL_METRICS}] \
 [--show]

# multi-gpu testing
bash tools/dist_test.sh \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 ${GPU_NUM} \
 [--out ${RESULT_FILE}] \
 [--eval ${EVAL_METRICS}]

tools/dist_test.sh also supports multi-node testing, but relies on PyTorch's launch utility.

Optional arguments:

  • RESULT_FILE: Filename of the output results in pickle format. If not specified, the results will not be saved to a file.

  • EVAL_METRICS: Items to be evaluated on the results. Allowed values depend on the dataset, e.g., proposal_fast, proposal, bbox, segm are available for COCO, and mAP, recall for PASCAL VOC.

  • Cityscapes could be evaluated by cityscapes as well as all COCO metrics.

  • --show: If specified, detection results will be plotted on the images and shown in a new window. It is only applicable to single GPU testing and used for debugging and visualization. Please make sure that GUI is available in your environment. Otherwise, you may encounter an error like cannot connect to X server.

  • --show-dir: If specified, detection results will be plotted on the images and saved to the specified directory. It is only applicable to single GPU testing and used for debugging and visualization. You do NOT need a GUI available in your environment for using this option.

  • --show-score-thr: If specified, detections with scores below this threshold will be removed.

  • --cfg-options: if specified, the key-value pair optional cfg will be merged into the config file.

  • --eval-options: if specified, the key-value pair optional eval cfg will be passed as kwargs to the dataset.evaluate() function; it is only for evaluation.

Examples

Assume that you have already downloaded the checkpoints to the directory checkpoints/.

  1. Test Faster R-CNN and visualize the results. Press any key for the next image.

    Config and checkpoint files are available here.

  >python tools/test.py \
     configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
     checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
     --show
  2. Test Faster R-CNN and save the painted images for future visualization.

    Config and checkpoint files are available here.

    >python tools/test.py \
     configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
     checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
     --show-dir faster_rcnn_r50_fpn_1x_results
  3. Test Faster R-CNN on PASCAL VOC (without saving the test results) and evaluate the mAP. Config and checkpoint files are available here.

   >python tools/test.py \
     configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc.py \
     checkpoints/faster_rcnn_r50_fpn_1x_voc0712_20200624-c9895d40.pth \
     --eval mAP
  4. Test Mask R-CNN with 8 GPUs, and evaluate the bbox and mask AP.

    Config and checkpoint files are available here.

   >./tools/dist_test.sh \
     configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
     checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
     8 \
     --out results.pkl \
     --eval bbox segm
  5. Test Mask R-CNN with 8 GPUs, and evaluate the classwise bbox and mask AP.

    Config and checkpoint files are available here.

    >./tools/dist_test.sh \
     configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
     checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
     8 \
     --out results.pkl \
     --eval bbox segm \
     --options "classwise=True"
  6. Test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files for submitting to the official evaluation server.

    Config and checkpoint files are available here.

   >./tools/dist_test.sh \
     configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
     checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
     8 \
     --format-only \
     --options "jsonfile_prefix=./mask_rcnn_test-dev_results"
This command generates two JSON files: `mask_rcnn_test-dev_results.bbox.json` and `mask_rcnn_test-dev_results.segm.json`.
  7. Test Mask R-CNN on Cityscapes test with 8 GPUs, and generate txt and png files for submitting to the official evaluation server.

    Config and checkpoint files are available here.

   >./tools/dist_test.sh \
     configs/cityscapes/mask_rcnn_r50_fpn_1x_cityscapes.py \
     checkpoints/mask_rcnn_r50_fpn_1x_cityscapes_20200227-afe51d5a.pth \
     8 \
     --format-only \
     --options "txtfile_prefix=./mask_rcnn_cityscapes_test_results"
The generated png and txt would be under `./mask_rcnn_cityscapes_test_results` directory.

Test without Ground Truth Annotations

MMDetection supports testing models without ground-truth annotations using CocoDataset. If your dataset is not in COCO format, please convert it to COCO format first. For example, if your dataset is in VOC format, you can directly convert it to COCO format with the script in tools.
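
For reference, a minimal sketch of the structure a COCO-format annotation file must have; the keys below are the standard COCO fields, while the values are illustrative only.

coco_skeleton = {
    'images': [{'id': 1, 'file_name': '000001.jpg', 'width': 640, 'height': 480}],
    'annotations': [{
        'id': 1, 'image_id': 1, 'category_id': 1,
        'bbox': [100.0, 120.0, 50.0, 80.0],  # [x, y, width, height]
        'area': 4000.0, 'iscrowd': 0,
    }],
    'categories': [{'id': 1, 'name': 'person'}],
}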

># single-gpu testing
python tools/test.py \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 --format-only \
 --options ${JSONFILE_PREFIX} \
 [--show]

# multi-gpu testing
bash tools/dist_test.sh \
 ${CONFIG_FILE} \
 ${CHECKPOINT_FILE} \
 ${GPU_NUM} \
 --format-only \
 --options ${JSONFILE_PREFIX} \
 [--show]

Assuming that the checkpoints in the model zoo have been downloaded to the directory checkpoints/, we can test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files using the following command.

>./tools/dist_test.sh \
 configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
 checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
 8 \
 --format-only \
 --options "jsonfile_prefix=./mask_rcnn_test-dev_results"

This command generates two JSON files mask_rcnn_test-dev_results.bbox.json and mask_rcnn_test-dev_results.segm.json.

Batch Inference

MMDetection supports inference with a single image or batched images in test mode. By default, we use single-image inference; you can use batch inference by modifying samples_per_gpu in the config of test data. You can do that by modifying the config as below.

>data = dict(train=dict(...), val=dict(...), test=dict(samples_per_gpu=2, ...))

Or you can set it through --cfg-options as

--cfg-options data.test.samples_per_gpu=2

Deprecated ImageToTensor

In test mode, the ImageToTensor pipeline is deprecated and replaced by DefaultFormatBundle; it is recommended to manually replace it in the test data pipeline in your config file. Examples:

# use ImageToTensor (deprecated)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

# manually replace ImageToTensor to DefaultFormatBundle (recommended)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img']),
        ])
]

Train predefined models on standard datasets

MMDetection also provides out-of-the-box tools for training detection models.

This section will show how to train predefined models (under configs) on standard datasets, i.e., COCO.

Important: The default learning rate in config files is for 8 GPUs and 2 img/gpu (batch size = 8*2 = 16). According to the linear scaling rule, you need to set the learning rate proportional to the batch size if you use different GPUs or images per GPU, e.g., lr=0.01 for 4 GPUs * 2 imgs/gpu and lr=0.08 for 16 GPUs * 4 imgs/gpu.
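
As a quick sanity check of this rule, a small sketch assuming the default base learning rate of 0.02 for a batch size of 16:

def scaled_lr(gpus, imgs_per_gpu, base_lr=0.02, base_batch_size=16):
    # linear scaling rule: lr is proportional to the total batch size
    return base_lr * (gpus * imgs_per_gpu) / base_batch_size

print(scaled_lr(4, 2))    # 0.01
print(scaled_lr(16, 4))   # 0.08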

Prepare datasets

Training requires preparing datasets too. See section Prepare datasets above for details.

Note: Currently, the config files under configs/cityscapes use COCO pretrained weights to initialize. You could download the existing models in advance if the network connection is unavailable or slow. Otherwise, it would cause errors at the beginning of training.

Training on a single GPU

We provide tools/train.py to launch training jobs on a single GPU.

The basic usage is as follows.

>python tools/train.py \
 ${CONFIG_FILE} \
 [optional arguments]

During training, log files and checkpoints will be saved to the working directory, which is specified by work_dir in the config file or via the CLI argument --work-dir.

By default, the model is evaluated on the validation set every epoch; the evaluation interval can be specified in the config file as shown below.

# evaluate the model every 12 epochs.
evaluation = dict(interval=12)

This tool accepts several optional arguments, including:

  • --no-validate (not suggested): Disable evaluation during training.

  • --work-dir ${WORK_DIR}: Override the working directory.

  • --resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file.

  • --options 'Key=value': Overrides other settings in the used config.

Note:

Difference between resume-from and load-from:

resume-from loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming the training process that is interrupted accidentally. load-from only loads the model weights and the training epoch starts from 0. It is usually used for finetuning.
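
The same behavior can also be set in the config file; a short sketch, assuming the MMDetection 2.x default runtime config (configs/_base_/default_runtime.py), where both fields default to None:

# resume weights + optimizer state + epoch from a checkpoint
resume_from = 'work_dirs/faster_rcnn_r50_fpn_1x_coco/latest.pth'
# or load weights only and train from epoch 0 (fine-tuning)
# load_from = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'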

Training on multiple GPUs

We provide tools/dist_train.sh to launch training on multiple GPUs. The basic usage is as follows.

>bash ./tools/dist_train.sh \
 ${CONFIG_FILE} \
 ${GPU_NUM} \
 [optional arguments]

Optional arguments remain the same as stated above.

Launch multiple jobs simultaneously

If you would like to launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict.

If you use dist_train.sh to launch training jobs, you can set the port in commands.

>CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4

Training on multiple nodes

MMDetection relies on torch.distributed package for distributed training.

Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.

Manage jobs with Slurm

Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. It supports both single-node and multi-node training.

The basic usage is as follows.

>[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}

Below is an example of using 16 GPUs to train Mask R-CNN on a Slurm partition named dev, and set the work-dir to a shared file system.

>GPUS=16 ./tools/slurm_train.sh dev mask_r50_1x configs/mask_rcnn_r50_fpn_1x_coco.py /nfs/xxxx/mask_rcnn_r50_fpn_1x

You can check the source code to review full arguments and environment variables.

When using Slurm, the port option needs to be set in one of the following ways:

  1. Set the port through --options. This is recommended since it does not change the original configs.

>CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --options 'dist_params.port=29500'
    CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --options 'dist_params.port=29501'
  2. Modify the config files to set different communication ports.

    #In `config1.py`, set (note: nccl is NVIDIA's communication framework)
>dist_params = dict(backend='nccl', port=29500)
    #In `config2.py`, set
>dist_params = dict(backend='nccl', port=29501)
    #Then you can launch two jobs with `config1.py` and `config2.py`.
>CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
  CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}