Unsupervised Data Augmentation

Overview

Unsupervised Data Augmentation or UDA is a semi-supervised learning method which
achieves state-of-the-art results on a wide variety of language and vision
tasks.

With only 20 labeled examples, UDA outperforms the previous state-of-the-art on
IMDb trained on 25,000 labeled examples.

Model	Number of labeled examples	Error rate
Mixed VAT (Prev. SOTA)	25,000	4.32
BERT	25,000	4.51
UDA	20	4.20

It reduces more than 30% of the error rate of state-of-the-art methods on
CIFAR-10 with 4,000 labeled examples and SVHN with 1,000 labeled examples:

Model	CIFAR-10	SVHN
ICT (Prev. SOTA)	7.66±.17	3.53±.07
UDA	4.31±.08	2.28±.10

It leads to significant improvements on ImageNet with 10% labeled data.

Model	top-1 accuracy	top-5 accuracy
ResNet-50	55.09	77.26
UDA	68.78	88.80

How it works

UDA is a method of semi-supervised learning, that reduces the need for labeled
examples and better utilizes unlabeled ones.

What we are releasing

We are releasing the following:

Code for text classifications based on BERT.
Code for image classifications on CIFAR-10 and SVHN.
Code and checkpoints for our back translation augmentation system.

All of the code in this repository works out-of-the-box with GPU and Google
Cloud TPU.

Requirements

The code is tested on Python 2.7 and Tensorflow 1.13. After installing
Tensorflow, run the following command to install dependencies:

shell
pip install --user absl-py

Image classification

Preprocessing

We generate 100 augmented examples for every original example. To download all
the augmented data, go to the image directory and run

shell
AUG_COPY=100
bash scripts/download_cifar10.sh ${AUG_COPY}

Note that you need 120G disk space for all the augmented data. To save space,
you can set AUG_COPY to a smaller number such as 30.

Alternatively, you can generate the augmented examples yourself by running

shell
AUG_COPY=100
bash scripts/preprocess.sh --aug_copy=${AUG_COPY}

CIFAR-10 with 250, 500, 1000, 2000, 4000 examples on GPUs

GPU command:

shell
# UDA accuracy: 
# 4000: 95.68 +- 0.08
# 2000: 95.27 +- 0.14
# 1000: 95.25 +- 0.10
# 500: 95.20 +- 0.09
# 250: 94.57 +- 0.96
bash scripts/run_cifar10_gpu.sh --aug_copy=${AUG_COPY}

SVHN with 250, 500, 1000, 2000, 4000 examples on GPUs

shell
# UDA accuracy:
# 4000: 97.72 +- 0.10
# 2000: 97.80 +- 0.06
# 1000: 97.77 +- 0.07
# 500: 97.73 +- 0.09
# 250: 97.28 +- 0.40

bash scripts/run_svhn_gpu.sh --aug_copy=${AUG_COPY}

Text classifiation

Run on GPUs

Memory issues

The movie review texts in IMDb are longer than many classification tasks so
using a longer sequence length leads to better performances. The sequence
lengths are limited by the TPU/GPU memory when using BERT (See the
Out-of-memory issues of BERT).
As such, we provide scripts to run with shorter sequence lengths and smaller
batch sizes.

Instructions

If you want to run UDA with BERT base on a GPU with 11 GB memory, go to the
text directory and run the following commands:

shell
# Set a larger max_seq_length if your GPU has a memory larger than 11GB
MAX_SEQ_LENGTH=128

# Download data and pretrained BERT checkpoints
bash scripts/download.sh

# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}

# Baseline accuracy: around 68%
bash scripts/run_base.sh --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: around 90%
# Set a larger train_batch_size to achieve better performance if your GPU has a larger memory.
bash scripts/run_base_uda.sh --train_batch_size=8 --max_seq_length=${MAX_SEQ_LENGTH}

Run on Cloud TPU v3-32 Pod to achieve SOTA performance

The best performance in the paper is achieved by using a max_seq_length of 512
and initializing with BERT large finetuned on in-domain unsupervised data. If
you have access to Google Cloud TPU v3-32 Pod, try:

shell
MAX_SEQ_LENGTH=512

# Download data and pretrained BERT checkpoints
bash scripts/download.sh

# Preprocessing
bash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: 95.3% - 95.9%
bash train_large_ft_uda_tpu.sh

Run back translation data augmentation for your dataset

First of all, install the following dependencies:

shell
pip install --user nltk
python -c "import nltk; nltk.download('punkt')"
pip install --user tensor2tensor==1.13.4

The following command translates the provided example file. It automatically
splits paragraphs into sentences, translates English sentences to French and
then translates them back into English. Finally, it composes the paraphrased
sentences into paragraphs. Go to the back_translate directory and run:

shell
bash download.sh
bash run.sh

Guidelines for hyperparameters:

There is a variable sampling_temp in the bash file. It is used to control the
diversity and quality of the paraphrases. Increasing sampling_temp will lead to
increased diversity but worse quality. Surprisingly, diversity is more important
than quality for many tasks we tried.

We suggest trying to set sampling_temp to 0.7, 0.8 and 0.9. If your task is very
robust to noise, sampling_temp=0.9 or 0.8 should lead to improved performance.
If your task is not robust to noise, setting sampling temp to 0.7 or 0.6 should
be better.

If you want to do back translation to a large file, you can change the replicas
and worker_id arguments in run.sh. For example, when replicas=3, we divide the
data into three parts, and each run.sh will only process one part according to
the worker_id.

General guidelines for setting hyperparameters:

UDA works out-of-box and does not require extensive hyperparameter tuning, but
to really push the performance, here are suggestions about hyperparamters:

It works well to set the weight on unsupervised objective 'unsup_coeff'
to 1.
Use a lower learning rate than pure supervised learning because there are
two loss terms computed on labeled data and unlabeled data respecitively.
If your have an extremely small amount of data, try to tweak
'uda_softmax_temp' and 'uda_confidence_thresh' a bit. For more details about
these two hyperparameters, search the "Confidence-based masking" and
"Softmax temperature control" in the paper.
Effective augmentation for supervised learning usually works well for UDA.
For some tasks, we observed that increasing the batch size for the
unsupervised objective leads to better performance. For other tasks, small
batch sizes also work well. For example, when we run UDA with GPU on
CIFAR-10, the best batch size for the unsupervised objective is 160.

Acknowledgement

A large portion of the code is taken from
BERT and
RandAugment.
Thanks!

Citation

Please cite this paper if you use UDA.

@article{xie2019unsupervised,
  title={Unsupervised Data Augmentation for Consistency Training},
  author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
  journal={arXiv preprint arXiv:1904.12848},
  year={2019}
}

Please also cite this paper if you use UDA for images.

@article{cubuk2019randaugment,
  title={RandAugment: Practical data augmentation with no separate search},
  author={Cubuk, Ekin D and Zoph, Barret and Shlens, Jonathon and Le, Quoc V},
  journal={arXiv preprint arXiv:1909.13719},
  year={2019}
}