# Tips on training seq2seq type models
## Preprocessing
* Investigate your word distributions:
    * How many distinct words are there in your dataset?
    * A nice vocab size is 10,000-30,000
    * Make sure your cutoff doesn't UNK away too much of your data; roughly 5% UNK tokens is a good target
* Use [TorchText](http://torchtext.readthedocs.io/) for reading in text and padding sequences (see the sketch after this list)
* Split your data into train/dev/test sets (usually 80/10/10 or 70/15/15)
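
A minimal sketch of these preprocessing steps, assuming the legacy `torchtext.data` API (`Field`/`TabularDataset`): the file names, field names, whitespace tokenizer, and the `max_size`/`min_freq` cutoffs are illustrative assumptions, not prescriptions from these notes. It builds the vocab from the training split only, then checks how many training tokens would be UNKed under that cutoff:

```python
from collections import Counter

from torchtext.data import Field, TabularDataset

# Whitespace tokenization; <sos>/<eos> mark sequence boundaries for the decoder.
# Padding is handled later by the Field when batches are built.
SRC = Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)
TRG = Field(tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True)

# Assumes the data is already split (e.g. 80/10/10) into tab-separated src/trg files.
train_data, dev_data, test_data = TabularDataset.splits(
    path="data", train="train.tsv", validation="dev.tsv", test="test.tsv",
    format="tsv", fields=[("src", SRC), ("trg", TRG)])

# Cap the vocab in the 10k-30k range; words below the frequency cutoff become <unk>.
SRC.build_vocab(train_data, max_size=30000, min_freq=2)
TRG.build_vocab(train_data, max_size=30000, min_freq=2)

# Inspect the word distribution: distinct words, final vocab size, and the
# fraction of training tokens that get UNKed (aim for roughly 5%).
counts = Counter(tok for ex in train_data for tok in ex.src)
in_vocab = set(SRC.vocab.stoi.keys())
total = sum(counts.values())
unked = sum(c for w, c in counts.items() if w not in in_vocab)
print("distinct words:", len(counts))
print("vocab size:", len(SRC.vocab))
print("UNK rate: %.1f%%" % (100.0 * unked / total))
```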
## Hyperparameters
* Use a 1-layer GRU or LSTM
* Nice hidden sizes are usually 128 or 256 (powers of 2 are easier for GPUs) -- beyond 256 the chance of overfitting increases, especially with the datasets we have
* Dropout:
    * Only apply dropout to the encoder
    * A nice dropout probability is 0.2 (or 0.8 "keep" probability)
* Make your encoder bidirectional if you want. But make sure the dimension of the encoder state matches up to the decoder dimension (e.g. 128 BiGRU encoder -> 256 decoder); see the sketch below
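
A rough sketch of these hyperparameter choices in PyTorch: a 1-layer GRU, dropout of 0.2 on the encoder only, and a 128-per-direction BiGRU whose concatenated final state feeds a 256-dim decoder. The class names and the embedding size of 256 are assumptions for illustration, not from these notes:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size=256, hidden_size=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # Dropout goes on the encoder only (p=0.2). With a single layer,
        # nn.GRU's own dropout argument does nothing, so apply it to the embeddings.
        self.dropout = nn.Dropout(p=0.2)
        # 1-layer bidirectional GRU, hidden size 128 per direction.
        self.rnn = nn.GRU(emb_size, hidden_size, num_layers=1, bidirectional=True)

    def forward(self, src):                       # src: (src_len, batch)
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)      # hidden: (2, batch, 128)
        # Concatenate forward and backward final states so the 128-dim BiGRU
        # matches the 256-dim decoder (128 BiGRU encoder -> 256 decoder).
        hidden = torch.cat([hidden[0], hidden[1]], dim=1).unsqueeze(0)
        return outputs, hidden                    # hidden: (1, batch, 256)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size=256, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, num_layers=1)   # no dropout here
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, trg_tok, hidden):           # trg_tok: (1, batch)
        output, hidden = self.rnn(self.embedding(trg_tok), hidden)
        return self.out(output.squeeze(0)), hidden
```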
## Training
* Use the Adam optimizer with its default learning rate of 0.001
* Early stopping:
    * After one full epoch of training (i.e. looping through your entire training set), compute loss on the dev set
    * Keep track of the last N dev losses; if the dev loss starts increasing, stop training (it means you're overfitting). See the sketch after this list
* Use [TorchText](http://torchtext.readthedocs.io/) for automatic batching
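
A minimal sketch of this training setup: `BucketIterator` for automatic batching and padding, Adam at its default learning rate, and patience-based early stopping on dev loss. It assumes a seq2seq `model` whose forward pass returns per-position logits aligned with `batch.trg`, plus the datasets and fields from the preprocessing sketch above; the batch size and a patience of 3 are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.data import BucketIterator

# Automatic batching + padding; bucketing by source length keeps padding small.
train_iter, dev_iter = BucketIterator.splits(
    (train_data, dev_data), batch_size=64, sort_key=lambda ex: len(ex.src))

# `model` is an assumed seq2seq wrapper (e.g. the Encoder/Decoder sketched above).
optimizer = optim.Adam(model.parameters(), lr=0.001)           # Adam's default lr
pad_idx = TRG.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)          # don't score padding

def evaluate(model, iterator):
    model.eval()
    total = 0.0
    with torch.no_grad():
        for batch in iterator:
            logits = model(batch.src, batch.trg)               # (trg_len, batch, vocab)
            total += criterion(logits.reshape(-1, logits.size(-1)),
                               batch.trg.reshape(-1)).item()
    return total / len(iterator)

best_dev, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(50):
    model.train()
    for batch in train_iter:                                   # one full pass = one epoch
        optimizer.zero_grad()
        logits = model(batch.src, batch.trg)
        loss = criterion(logits.reshape(-1, logits.size(-1)), batch.trg.reshape(-1))
        loss.backward()
        optimizer.step()

    dev_loss = evaluate(model, dev_iter)                       # check dev loss every epoch
    print("epoch %d, dev loss %.3f" % (epoch, dev_loss))
    if dev_loss < best_dev:
        best_dev, bad_epochs = dev_loss, 0
        torch.save(model.state_dict(), "best_model.pt")        # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                             # dev loss keeps rising: overfitting
            break
```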