Tutorial 4: Train and test with predefined models¶

This section describes how to train, test, and evaluate models by following the steps below.

Train predefined models¶

MMGeneration supports distributed training, which substantially improves training speed. We strongly recommend using distributed training with our scripts. The basic usage is as follows:

sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS_NUMBER} \
    --work-dir ./work_dirs/experiments/experiments_name \
    [optional arguments]

If you are using a Slurm cluster, the following command can help you start training:

sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${WORK_DIR} \
    [optional arguments]

These two scripts wrap tools/train.py with a distributed training entry point. The optional arguments are defined in tools/train.py; users can also set amp and resume through these arguments.
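As a minimal sketch of how the positional and optional arguments of tools/dist_train.sh fit together, the helper below assembles the command as an argument list without running it. The helper name `build_dist_train_command`, the GPU count of 8, and the `--amp` flag spelling are assumptions made for illustration; check tools/train.py for the exact option names.

```python
import shlex


def build_dist_train_command(config_file, num_gpus, work_dir, extra_args=()):
    """Return the argument list for tools/dist_train.sh (nothing is executed)."""
    command = [
        "sh", "tools/dist_train.sh",
        config_file,          # ${CONFIG_FILE}
        str(num_gpus),        # ${GPUS_NUMBER}
        "--work-dir", work_dir,
    ]
    # Anything else is passed through to tools/train.py unchanged.
    command.extend(extra_args)
    return command


command = build_dist_train_command(
    "configs/styleganv2/stylegan2_c2_ffhq_1024_b4x8.py",
    8,  # illustrative GPU count, not taken from the tutorial
    "./work_dirs/experiments/stylegan2_c2_ffhq_1024_b4x8",
    extra_args=["--amp"],  # assumed flag spelling; verify in tools/train.py
)

# Print a copy-pasteable shell line.
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can also be handed to `subprocess.run` from inside an MMGeneration checkout.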

Note that work_dirs is already listed in our .gitignore file, so users can put any files there without affecting files tracked by git. Here is an example command that we use to train our 1024x1024 StyleGAN2 model:

sh tools/slurm_train.sh openmmlab-platform stylegan2-1024 \
    configs/styleganv2/stylegan2_c2_ffhq_1024_b4x8.py \
    work_dirs/experiments/stylegan2_c2_ffhq_1024_b4x8

During training, log files and checkpoints are saved to the working directory. More details can be found in our guide on runtime configuration.

Training with multiple machines¶

If you launch with multiple machines connected only by Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

Training is usually slow without high-speed networking such as InfiniBand.
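The per-machine commands above differ only in NODE_RANK. The sketch below generates one launch line per node so the shared values stay consistent; the function name `node_launch_commands`, the address 10.0.0.1, the port 29500, and the GPU count are hypothetical values chosen for illustration.

```python
def node_launch_commands(config, gpus_per_node, master_addr, master_port,
                         num_nodes=2):
    """Return one shell command per machine, differing only in NODE_RANK."""
    commands = []
    for node_rank in range(num_nodes):
        commands.append(
            f"NNODES={num_nodes} NODE_RANK={node_rank} "
            f"PORT={master_port} MASTER_ADDR={master_addr} "
            f"sh tools/dist_train.sh {config} {gpus_per_node}"
        )
    return commands


commands = node_launch_commands(
    config="configs/styleganv2/stylegan2_c2_ffhq_1024_b4x8.py",
    gpus_per_node=8,          # hypothetical GPU count per machine
    master_addr="10.0.0.1",   # hypothetical address of the first machine
    master_port=29500,        # hypothetical free port on the first machine
)

for node_rank, line in enumerate(commands):
    print(f"# run on machine {node_rank}")
    print(line)
```

Every machine must use the same MASTER_ADDR and PORT, which is why they are passed once and reused for each rank.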

If you launch with Slurm, the command is the same as the single-machine command described above, but you need to refer to slurm_train.sh to set the appropriate parameters and environment variables.

Training on CPU¶

Training on the CPU follows the same process as single-GPU training. We just need to disable GPUs before training starts.

export CUDA_VISIBLE_DEVICES=-1

Then run this script:

python tools/train.py config --work-dir WORK_DIR

We do not recommend training on CPU because it is too slow. This feature is supported so that users can conveniently debug on machines without a GPU.
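If the training run is launched from a Python script rather than a shell, the same effect can be sketched by passing a modified environment to the child process. The helper name `cpu_only_env` is hypothetical, and CONFIG_FILE and WORK_DIR remain placeholders to be filled in.

```python
import os


def cpu_only_env(base_env=None):
    """Copy an environment mapping and hide all CUDA devices in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


# A tiny base environment is used here so the example is reproducible.
env = cpu_only_env({"PATH": "/usr/bin"})
command = ["python", "tools/train.py", "CONFIG_FILE", "--work-dir", "WORK_DIR"]

print(env["CUDA_VISIBLE_DEVICES"])
# Inside an MMGeneration checkout, the run could then be started with:
# import subprocess; subprocess.run(command, env=env, check=True)
```

Copying the environment instead of mutating `os.environ` keeps the GPU setting of the parent process untouched.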

Test predefined models¶

Currently, we support 9 evaluation metrics: MS-SSIM, SWD, IS, FID, Precision&Recall, PPL, Equivariance, TransFID, and TransIS. We provide a unified evaluation script, tools/test.py, for all models. To evaluate a model with some of these metrics, add them to your config file like this:

# at the end of the configs/styleganv2/stylegan2_c2_ffhq_256_b4x8_800k.py
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN',
        sample_model='ema'),
    dict(type='PrecisionAndRecall', fake_nums=50000, prefix='PR-50K'),
    dict(type='PerceptualPathLength', fake_nums=50000, prefix='ppl-w')
]

As shown above, metrics consists of multiple metric dictionaries. Each metric contains type to indicate the category of the metric, and fake_nums denotes the number of images generated by the model. Some metrics output a dictionary of results; for these, you can also set prefix to specify the prefix of the result keys. If you set the prefix of FID to FID-Full-50k, an example output may be:

FID-Full-50k/fid: 3.6561  FID-Full-50k/mean: 0.4263  FID-Full-50k/cov: 3.2298
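The naming pattern in the output above can be illustrated with a few lines of Python. This is only a sketch of how a prefix namespaces result keys, not MMGeneration's actual implementation; the function `add_prefix` is hypothetical, and the numbers are copied from the example output.

```python
def add_prefix(results, prefix):
    """Return a new dict whose keys are namespaced as '<prefix>/<name>'."""
    return {f"{prefix}/{name}": value for name, value in results.items()}


# Values copied from the example output above.
raw_results = {"fid": 3.6561, "mean": 0.4263, "cov": 3.2298}
prefixed = add_prefix(raw_results, "FID-Full-50k")

print("  ".join(f"{key}: {value:.4f}" for key, value in prefixed.items()))
```

Distinct prefixes keep results apart when the same metric type appears more than once in metrics, for example with different fake_nums.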

Then users can test models with the command below:

bash tools/dist_test.sh ${CONFIG_FILE} ${CKPT_FILE}

If you are in a Slurm environment, please switch to tools/slurm_test.sh by using the following command:

sh tools/slurm_test.sh ${PLATFORM} ${JOBNAME} ${CONFIG_FILE} ${CKPT_FILE}

Evaluation during training¶

Thanks to MMEngine's Runner, we can evaluate a model during training in a simple way, as shown below.

# define metrics
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN')
]

# define dataloader
val_dataloader = dict(
    batch_size=128,
    num_workers=8,
    dataset=dict(
        type='UnconditionalImageDataset',
        data_root='data/celeba-cropped/',
        pipeline=[
            dict(type='LoadImageFromFile', key='img'),
            dict(type='Resize', scale=(64, 64)),
            dict(type='PackGenInputs', meta_keys=[])
        ]),
    sampler=dict(type='DefaultSampler', shuffle=False),
    persistent_workers=True)

# define val interval
train_cfg = dict(by_epoch=False, val_begin=1, val_interval=10000)

# define val loop and evaluator
val_cfg = dict(type='GenValLoop')
val_evaluator = dict(type='GenEvaluator', metrics=metrics)

You can set val_begin and val_interval to adjust when validation begins and how often it runs.
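To see how these two settings interact, the sketch below lists the iterations at which validation would run. It assumes that validation is triggered at every iteration that is at least val_begin and a multiple of val_interval; that rule and the max_iters value of 30000 are assumptions for illustration, so consult the MMEngine training loop for the exact behavior.

```python
def validation_iterations(val_begin, val_interval, max_iters):
    """List iterations where validation runs under the assumed rule:
    iteration >= val_begin and iteration % val_interval == 0."""
    return [
        iteration
        for iteration in range(1, max_iters + 1)
        if iteration >= val_begin and iteration % val_interval == 0
    ]


# Settings from the train_cfg above; max_iters is a hypothetical value.
schedule = validation_iterations(val_begin=1, val_interval=10000, max_iters=30000)
print(schedule)

# Delaying val_begin skips the early validation rounds.
delayed = validation_iterations(val_begin=15000, val_interval=10000, max_iters=30000)
print(delayed)
```

Under this assumed rule, raising val_begin is a cheap way to avoid evaluating a 50k-image metric while the generator is still producing poor samples.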

For details of the metrics, refer to the metrics guide.