Have you ever heard of augmenting training data? It's quite a simple idea, often used in computer vision. Given an image, you alter it in as many informative ways as possible to create the various scenarios your model must learn. You are essentially creating new data points from existing ones, i.e. synthesizing data. For instance, you can flip an image, crop or scale it, or even rotate it. This has been shown to increase the accuracy of a classifier significantly, as the model learns to focus on the relevant features.
But how do you do such augmentation on text? For text classification, you need to come up with different techniques. In this blog post, we'll show you some great examples for doing so.
OCR Augmentation

If you process text extracted from scanned documents, you'll most likely have some OCR errors. Depending on the font used, an "m" can be read as "rn", an "h" as a "b", or an "I" (upper-case i) as an "l" (lower-case L).
To improve your model, you can apply some OCR-based augmentations: create copies of the individual texts and mutate them based on the confusion statistics for each font. You can easily double or triple the size of your dataset, helping your model become more robust to erroneous OCR text, especially if you use bag-of-character embeddings.
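As a minimal sketch, an OCR augmenter could look like this. The confusion table and the probability `p` here are made-up placeholders; in practice you would derive them from per-font OCR error statistics:

```python
import random

# Hypothetical confusion table: in practice, derive these substitutions
# from OCR error statistics for each font.
OCR_CONFUSIONS = {"m": "rn", "h": "b", "I": "l"}

def ocr_augment(text, p=0.3, seed=None):
    """Return a noisy copy of `text`, applying each confusion with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < p:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

print(ocr_augment("I am home", p=1.0))  # -> "l arn borne"
```

Running this over each text several times with a moderate `p` gives you the doubled or tripled dataset mentioned above.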
Keyboard Augmentation

For use cases in which you need to process user input entered via keyboard, it's a good idea to include keyboard augmentations. For instance, in a chat message, it is likely that users will make minor typing mistakes, such as typing a "J" instead of a "K".
You can regulate how many such typos occur. In professional emails, users tend to correct typing errors they notice, whereas in chatbots, users tend to ignore them.
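A simple way to sketch this is to map each key to its physical neighbours and swap characters with some probability. The adjacency map below is a made-up excerpt of a QWERTY layout; a real augmenter would cover the full keyboard:

```python
import random

# Hypothetical QWERTY adjacency map (only a handful of keys for brevity).
NEIGHBOURS = {"a": "qwsz", "s": "adwxze", "j": "hkuinm", "k": "jliom"}

def keyboard_augment(text, p=0.1, seed=None):
    """Swap each character for an adjacent key with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        low = ch.lower()
        if low in NEIGHBOURS and rng.random() < p:
            repl = rng.choice(NEIGHBOURS[low])
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)
```

The rate `p` is where you encode the difference described above: pick a low value when simulating carefully written emails and a higher one for chat messages.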
Spelling Augmentation

There are different types of spelling errors. For instance, you could produce a typographical error ("three" → "there"), a cognitive error ("too" → "two", "piece" → "peace"), or just a random spelling error ("onetask" → "onetsk"). Spelling augmentations help your model handle such issues much better.
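The random-error case is the easiest to sketch: drop a single character from a word. (Typographical and cognitive errors would instead need a lookup table of known confusions; the function name here is just for illustration.)

```python
import random

def random_spelling_error(word, seed=None):
    """Drop one random character, e.g. 'onetask' -> 'onetsk'."""
    rng = random.Random(seed)
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]
```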
Word Replacement Augmentation
Imagine the following input text: “The quick brown fox jumps over the lazy dog”. What do you think about the adjectives, such as “quick”? You could easily replace them, such as in “The fast brown fox jumps over the lazy dog”.
With techniques such as WordNet lookups or Word2Vec similarity, this can be done automatically. If you use learned embeddings such as Word2Vec or GloVe, context matters a lot: in GloVe, "nbc" is similar to "fox", which in the sentence above would lead to a misleading augmentation. So be careful when applying such augmentations, even though they can be powerful if used correctly.
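Here is a minimal sketch of the replacement step. The tiny synonym table below is a stand-in for what WordNet synsets or nearest neighbours in an embedding space would give you; the words in it are invented for this example:

```python
import random

# Toy synonym table standing in for WordNet synsets or nearest
# neighbours in a Word2Vec/GloVe embedding space.
SYNONYMS = {"quick": ["fast", "speedy"], "lazy": ["idle", "sluggish"]}

def replace_words(sentence, synonyms, seed=None):
    """Replace each known word with a randomly chosen synonym."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(synonyms[tok]) if tok in synonyms else tok
        for tok in sentence.split()
    )

print(replace_words("The quick brown fox jumps over the lazy dog", SYNONYMS, seed=0))
```

Swapping the toy table for embedding neighbours is exactly where the "nbc"/"fox" pitfall appears, so filtering candidates by part of speech or similarity threshold is worth the effort.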
Back Translation Augmentation
This approach differs from the ones mentioned above: it relies on neural machine translation. Have you ever wondered what the lyrics of your favorite song look like if you translate them through a chain of arbitrary languages, only to convert them back to the original language? Depending on the number of languages in the chain, as well as the quality of the translator, both structure and content will change.
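Structurally, back translation is just two translation hops. The lookup tables below are toy stand-ins for real NMT models or translation APIs, and the example sentences are invented; a real pipeline would call a translation system for each hop:

```python
# Toy lookup tables standing in for real translation models or APIs.
EN_TO_DE = {"the movie was great": "der film war großartig"}
DE_TO_EN = {"der film war großartig": "the film was great"}

def back_translate(text, forward, backward):
    """Translate to a pivot language and back; the paraphrase is a new sample."""
    pivot = forward.get(text, text)
    return backward.get(pivot, pivot)

print(back_translate("the movie was great", EN_TO_DE, DE_TO_EN))
# -> "the film was great"
```

Chaining several pivot languages simply means composing more such hops, which is where the stronger paraphrasing (and the drift in meaning) comes from.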
Augmenting data helps you scale your labeled training data and further stabilize your classifier, increasing accuracy and yielding better models. Try it out, it's fun!