# Tips on training seq2seq type models

## Preprocessing

* Investigate your word distributions:
  * How many distinct words are there in your dataset?
  * A nice vocab size is 10,000-30,000
  * Make sure your cutoff doesn't UNK too much of your data; around 5% UNK tokens is a good target
* Use [TorchText](http://torchtext.readthedocs.io/) for reading in text and padding sequences (see the preprocessing sketch at the end of these notes)
* Split your data into train/dev/test sets (usually 80/10/10, or 70/15/15)

## Hyperparameters

* Use a 1-layer GRU or LSTM
* Nice hidden sizes are usually 128 or 256 (powers of 2 are easier for GPUs) -- beyond 256 the chance of overfitting increases, especially with the datasets we have
* Dropout:
  * Only apply dropout to the encoder
  * A nice dropout probability is 0.2 (i.e. a 0.8 "keep" probability)
* Make your encoder bidirectional if you want, but make sure the dimension of the encoder state matches the decoder dimension (e.g. a 128-unit BiGRU encoder -> 256-unit decoder); see the encoder/decoder sketch at the end of these notes

## Training

* Use the Adam optimizer with its default learning rate of 0.001
* Early stopping (see the training-loop sketch at the end of these notes):
  * After one full epoch of training (i.e. looping through your entire training set), compute the loss on the dev set
  * Keep track of the last N dev losses; if the dev loss starts increasing, stop training (it means you're overfitting)
* Use [TorchText](http://torchtext.readthedocs.io/) for automatic batching
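
The preprocessing advice above (vocab cutoff, UNK rate check, padding/batching with TorchText) can be wired together roughly as in the sketch below. It assumes the legacy `Field`/`BucketIterator` TorchText API (moved to `torchtext.legacy` in 0.9 and removed in later releases) and a hypothetical set of tab-separated files with `src` and `trg` columns; file names, field names, and the `min_freq` value are placeholders, not part of the original notes.

```python
import torch
from torchtext.data import Field, TabularDataset, BucketIterator  # torchtext.legacy.data in torchtext >= 0.9

# Fields handle tokenization, lowercasing, and the <sos>/<eos>/<pad>/<unk> bookkeeping.
SRC = Field(tokenize=str.split, lower=True, init_token="<sos>", eos_token="<eos>")
TRG = Field(tokenize=str.split, lower=True, init_token="<sos>", eos_token="<eos>")

# Hypothetical 80/10/10 split stored as TSV files with "src<TAB>trg" lines.
train_data, dev_data, test_data = TabularDataset.splits(
    path="data", train="train.tsv", validation="dev.tsv", test="test.tsv",
    format="tsv", fields=[("src", SRC), ("trg", TRG)],
)

# Cap the vocabulary in the 10k-30k range; words below min_freq get mapped to <unk>.
SRC.build_vocab(train_data, max_size=30000, min_freq=2)
TRG.build_vocab(train_data, max_size=30000, min_freq=2)
print(f"distinct source types kept: {len(SRC.vocab)}")

# Sanity-check the UNK rate on the training side (aim for roughly 5%).
unk_idx = SRC.vocab.stoi[SRC.unk_token]
tokens = [tok for ex in train_data.examples for tok in ex.src]
unk_rate = sum(SRC.vocab.stoi[tok] == unk_idx for tok in tokens) / len(tokens)
print(f"UNK rate: {unk_rate:.1%}")

# BucketIterator batches similar-length sequences together and pads them automatically.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iter, dev_iter, test_iter = BucketIterator.splits(
    (train_data, dev_data, test_data), batch_size=64,
    sort_key=lambda ex: len(ex.src), sort_within_batch=True, device=device,
)
```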
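
To make the bidirectional-encoder dimension bookkeeping concrete, here is a minimal PyTorch sketch: a 1-layer BiGRU encoder with hidden size 128, dropout applied only on the encoder side, and a decoder with hidden size 256 so the concatenated forward/backward encoder states can initialize it. The module and variable names are illustrative, not from the original notes.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-layer bidirectional GRU encoder; dropout is applied here, not in the decoder."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(dropout)           # 0.2 drop probability (0.8 "keep")
        self.rnn = nn.GRU(emb_dim, hid_dim, num_layers=1, bidirectional=True)

    def forward(self, src):                          # src: (src_len, batch)
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)         # hidden: (2, batch, 128)
        # Concatenate the forward and backward final states: (batch, 256).
        return torch.cat((hidden[0], hidden[1]), dim=1)

class Decoder(nn.Module):
    """1-layer unidirectional GRU decoder; hidden size 256 matches the concatenated encoder state."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, num_layers=1)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, trg_token, hidden):            # trg_token: (1, batch); hidden: (1, batch, 256)
        embedded = self.embedding(trg_token)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output.squeeze(0)), hidden   # logits: (batch, vocab_size)

# The encoder's (batch, 256) summary becomes the decoder's (1, batch, 256) initial hidden state.
enc, dec = Encoder(vocab_size=10000), Decoder(vocab_size=10000)
src = torch.randint(0, 10000, (7, 32))               # toy batch: length 7, batch size 32
dec_hidden = enc(src).unsqueeze(0)
logits, dec_hidden = dec(torch.zeros(1, 32, dtype=torch.long), dec_hidden)
```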
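
Finally, the training recipe (Adam at its default learning rate, dev loss after every epoch, stop once the dev loss turns upward) might look like the sketch below. It assumes a hypothetical `run_epoch` helper that does one pass over an iterator and returns the average loss, plus the `model`, `train_iter`, and `dev_iter` from the sketches above; the patience value is illustrative.

```python
import copy
import torch

def train_with_early_stopping(model, train_iter, dev_iter, run_epoch,
                              max_epochs=50, patience=3):
    """Stop when the dev loss has not improved for `patience` consecutive epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam's default learning rate
    best_dev_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        model.train()
        train_loss = run_epoch(model, train_iter, optimizer)    # one full pass over the training set
        model.eval()
        with torch.no_grad():
            dev_loss = run_epoch(model, dev_iter, optimizer=None)

        print(f"epoch {epoch}: train {train_loss:.3f}, dev {dev_loss:.3f}")
        if dev_loss < best_dev_loss:
            best_dev_loss = dev_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:           # dev loss is trending up: overfitting
                print("early stopping")
                break

    model.load_state_dict(best_state)                            # keep the best dev-loss checkpoint
    return model
```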