Training in FP16
There are many things to consider when training in fp16. Loss scaling, normalization, and batch size all become important for avoiding gradient overflow and underflow.
- FP32 Training
- FP16
- FP16 with FP32 BatchNorm
- FP16 with loss in fp32
- FP16 with loss in fp32, with loss scale, without fp32 accumulate
- FP16 with loss_scale
- Batch Size
- Limiting Maximum Loss and Differential Loss Functions
- Normalization
- Contact Me
from nbdev import export2html
from fastai2.basics import *
from fastai2.vision.all import *
path=untar_data(URLs.IMAGENETTE)
db=DataBlock((ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(valid_name='val'),
get_y=parent_label,item_tfms=Resize(420),batch_tfms=aug_transforms(size=240))
dls=db.dataloaders(path)
We start with a normal fp32 trained model as a baseline.
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.lr_find()
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20,with_valid=False)
del learner
Next is a model trained purely in fp16. This isn't really done in practice, but it gives us a useful comparison point.
class MixedPrecision(Callback):
    "Run training in mixed precision"
    run_before = Recorder
    def __init__(self):
        assert torch.backends.cudnn.enabled, "Mixed precision training requires cudnn."
    def begin_batch(self): self.learn.xb = to_half(self.xb)
    def after_batch(self): self.learn.loss = to_float(self.learn.loss)
class ModelToHalf(Callback):
    "Use with MixedPrecision callback (but it needs to run at the very beginning)"
    run_before = TrainEvalCallback
    def begin_fit(self): self.learn.model = self.model.half()
    def after_fit(self): self.learn.model = self.model.float() #convert back to float, for saving and such
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.add_cbs((ModelToHalf(),MixedPrecision()))
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20)
del learner
We now use ModelToHalf from fastai2. Internally it calls the convert_network function, which casts the model to a given data type while leaving the batchnorm layers in fp32.
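Roughly, the idea behind keeping batchnorm in fp32 looks like the sketch below (a simplified illustration, not fastai2's actual convert_network; the helper name half_but_keep_bn is made up here):

import torch.nn as nn

bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def half_but_keep_bn(model):
    "Cast a model to fp16, but keep batchnorm layers in fp32 for stable statistics."
    model.half()                        # everything to fp16 first
    for module in model.modules():
        if isinstance(module, bn_types):
            module.float()              # batchnorm weights and running stats back to fp32
    return model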
del ModelToHalf
#reimporting ModelToHalf from fastai2
from fastai2.vision.all import *
class MixedPrecision(Callback):
    "Run training in mixed precision"
    run_before = Recorder
    def __init__(self):
        assert torch.backends.cudnn.enabled, "Mixed precision training requires cudnn."
    def begin_batch(self): self.learn.xb = to_half(self.xb)
    def after_batch(self): self.learn.loss = to_float(self.learn.loss)
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.add_cbs((ModelToHalf(),MixedPrecision()))
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20)
del learner
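Next we keep the loss in fp32: the forward pass still runs in fp16, but the predictions are converted back to fp32 before the loss is computed, so the loss calculation happens at full precision.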
class MixedPrecision(Callback):
    "Run training in mixed precision"
    toward_end = True
    def __init__(self):
        assert torch.backends.cudnn.enabled, "Mixed precision training requires cudnn."
    def begin_batch(self): self.learn.xb = to_half(self.xb)
    def after_pred(self): self.learn.pred = to_float(self.pred)
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.add_cbs((ModelToHalf(),MixedPrecision()))
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20)
del learner
del MixedPrecision
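Now we add loss scaling on top of the fp32 loss, but without keeping an fp32 copy of the weights (no fp32 accumulate): after the backward pass the gradients are unscaled directly on the fp16 parameters.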
class MixedPrecision(Callback):
    "Run training in mixed precision"
    toward_end = True
    def __init__(self, loss_scale=512, flat_master=False, dynamic=True, max_loss_scale=2.**24,
                 div_factor=2., scale_wait=500, clip=None):
        assert torch.backends.cudnn.enabled, "Mixed precision training requires cudnn."
        self.flat_master,self.dynamic,self.max_loss_scale = flat_master,dynamic,max_loss_scale
        self.div_factor,self.scale_wait,self.clip = div_factor,scale_wait,clip
        self.loss_scale = max_loss_scale if dynamic else loss_scale
    def begin_fit(self):
        if self.learn.opt is None: self.learn.create_opt()
        self.model_pgs,_ = get_master(self.opt, self.flat_master)
        self.old_pgs = self.opt.param_groups
        #Unlike fastai2's version, we do not swap in an fp32 master copy, so the optimizer steps directly on the fp16 params.
        if self.dynamic: self.count = 0
    def begin_batch(self): self.learn.xb = to_half(self.xb)
    def after_pred(self): self.learn.pred = to_float(self.pred)
    def after_loss(self):
        if self.training: self.learn.loss *= self.loss_scale
    def after_backward(self):
        self.learn.loss /= self.loss_scale #To record the real loss
        #First, check for an overflow
        if self.dynamic and grad_overflow(self.model_pgs):
            self.loss_scale /= self.div_factor
            self.model.zero_grad()
            raise CancelBatchException() #skip step and zero_grad
        #Unscale the gradients on the fp16 params (no fp32 master copy here)
        for params in self.model_pgs:
            for param in params:
                if param.grad is not None: param.grad.div_(self.loss_scale)
        if self.clip is not None:
            for group in self.model_pgs: nn.utils.clip_grad_norm_(group, self.clip)
        #Check if it's been long enough without an overflow to raise the loss scale
        if self.dynamic:
            self.count += 1
            if self.count == self.scale_wait:
                self.count = 0
                self.loss_scale *= self.div_factor
    def after_step(self):
        self.model.zero_grad() #Zero the gradients of the model manually
    def after_fit(self):
        self.learn.opt.param_groups = self.old_pgs
        delattr(self, "model_pgs")
        delattr(self, "old_pgs")
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.add_cbs((ModelToHalf(),MixedPrecision()))
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20)
del learner
del MixedPrecision
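Finally, we use fastai2's built-in mixed precision via to_fp16, which combines loss scaling with an fp32 master copy of the weights for the optimizer step.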
from fastai2.vision.all import *
learner=cnn_learner(dls,resnet50,pretrained=False)
learner.to_fp16()
learner.fit_one_cycle(30,lr_max=0.00275)
learner.recorder.plot_loss(skip_start=20)
[Figure: the reduced representable range of fp16⁴]
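To make the range limits concrete, PyTorch can report them directly (the values in the comments come from torch.finfo):

import torch

# fp16 represents a much narrower range of values than fp32
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.float16).tiny)  # ~6.1e-05, the smallest positive normal number
print(torch.finfo(torch.float32).max)   # ~3.4e+38
print(torch.finfo(torch.float32).tiny)  # ~1.2e-38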
As seen above, fp16 has a limited representable range. Overflow is when your gradients are too big to fit in fp16; underflow is when fp16 gradients are so close to 0 that they are simply rounded to 0. Overflow and underflow can happen in fp32 as well, but they are not nearly as big a problem because fp32 can "hold" a much wider range of values.

Loss scaling means multiplying the loss by as large a value as possible, which effectively multiplies the gradients by the same factor and keeps them within fp16's representable range. This avoids underflow by increasing the gradients, but we need to make sure the scale is not so big that it causes an overflow. One technicality of loss scaling is that the loss itself should be calculated in fp32, because calculating a loss involves division, and low-precision division is very imprecise, leading to unstable gradients and training. Another technicality is that the gradients have to be divided by the loss scale, to keep their magnitudes consistent with an fp32 model; otherwise loss scaling would be very similar to simply multiplying the learning rate by the loss scale! But wait a minute… if we divide by the loss scale, won't we just run into the underflow issue from before? You're right: we also have to keep an fp32 copy of the weights to avoid this problem. We copy the scaled gradients over to the fp32 copy, divide by the loss scale there, and then do the optimization step in fp32.
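Below is a rough sketch of a single training step with a static loss scale, written in plain PyTorch rather than as a fastai callback; the names (fp16_step, master_params, loss_scale) are only illustrative, and fastai2's actual callback additionally adjusts the scale dynamically.

import torch

loss_scale = 512.                       # static loss scale; too small risks underflow, too large risks overflow

def fp16_step(model, master_params, opt, xb, yb, loss_func):
    "One mixed-precision step: fp16 forward/backward, fp32 loss and optimizer step."
    pred = model(xb.half())             # forward pass in fp16
    loss = loss_func(pred.float(), yb)  # compute the loss in fp32
    (loss * loss_scale).backward()      # scale the loss so small fp16 gradients don't underflow

    # Copy the scaled fp16 gradients onto the fp32 master weights and unscale them there
    for master_p, model_p in zip(master_params, model.parameters()):
        master_p.grad = model_p.grad.detach().float() / loss_scale

    opt.step()                          # optimizer (built over master_params) steps in fp32
    opt.zero_grad()
    model.zero_grad()

    # Copy the updated fp32 weights back into the fp16 model
    with torch.no_grad():
        for master_p, model_p in zip(master_params, model.parameters()):
            model_p.copy_(master_p)
    return loss.detach()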
Batch size is important in fp16 training. I found that the very first convolution layer was very unstable in my training. Increasing the batch size tends to smooth out this first layer's gradients, since more inputs contribute to each update.
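For example, with the DataBlock defined above you can raise the batch size when building the DataLoaders (fastai2's default is bs=64; 128 is just an illustrative value, memory permitting):

# Larger batches average the first layer's gradients over more samples
dls = db.dataloaders(path, bs=128)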
I found it was very important to effectively cap the maximum and minimum loss. This becomes a problem if your model is expected to have very large loss values, such as values greater than a magnitude of 100. Generally the loss is positive and relatively small, though in the case of Wasserstein loss (https://arxiv.org/abs/1506.05439) the loss can take a negative value. For large maximal values I found that differential losses can be very large, especially if your loss function looks something like (1000 x loss1 + loss2). In the case of differential loss functions, I found that I was able to get satisfactory results simply by changing the function to 1000 x tanh(loss1) + loss2. I expect gradient clipping would also be very helpful, but I have not yet experimented with that myself.
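A minimal sketch of that capped combined loss (loss1_func, loss2_func and the 1000x weight are placeholders for whatever your actual terms are):

import torch

def capped_combined_loss(pred, target, loss1_func, loss2_func, weight=1000.):
    "Squash the large term through tanh so the combined loss stays in a small, fp16-friendly range."
    return weight * torch.tanh(loss1_func(pred, target)) + loss2_func(pred, target)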
In my experience, normalization is very important for making an architecture perform well in fp16 training. With architectures that had very little normalization, I had constant issues with overflowing or underflowing gradients. Introducing normalization layers similar to those already present in the model gave me results comparable to the fp32 models.
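For instance, if part of a model has no normalization, adding a batchnorm layer after a convolution is one way to keep activations in a reasonable range (a hypothetical block, not taken from the models above):

import torch.nn as nn

def conv_block(in_channels, out_channels):
    "Conv -> BatchNorm -> ReLU; the batchnorm helps keep fp16 activations and gradients stable."
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )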
I am looking for a job.
LinkedIn Profile: https://www.linkedin.com/in/molly-beavers-651025118/ Source Code (WIP): https://github.com/marii-moe/selfie2anime
export2html.notebook2html(fname='2020-05-11_FP16.ipynb', dest='html/', template_file='fastpages.tpl',n_workers=1)