Alammar, Jay. 2018. “The Illustrated Transformer [Blog Post].” Http://

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. “Layer Normalization.” arXiv Preprint arXiv:1607.06450.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv Preprint arXiv:1409.0473.

Bakarov, Amir. 2018. “A Survey of Word Embeddings Evaluation Methods.” arXiv Preprint arXiv:1801.09536.

Beltagy, Iz, Kyle Lo, and Arman Cohan. 2019. “SciBERT: A Pretrained Language Model for Scientific Text.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–11.

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3: 1137–55.

Bobrow, Daniel G. 1964. “Natural Language Input for a Computer Problem Solving System.”

Boden, Mikael. 2002. “A Guide to Recurrent Neural Networks and Backpropagation.” The Dallas Project.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5. MIT Press: 135–46.

Boughorbel, Sabri, Fethi Jarray, and Mohammed El-Anbari. 2017. “Optimal Classifier for Imbalanced Data Using Matthews Correlation Coefficient Metric.” PloS One 12 (6). Public Library of Science San Francisco, CA USA: e0177678.

Boureau, Y-Lan, Jean Ponce, and Yann LeCun. 2010. A Theoretical Analysis of Feature Pooling in Visual Recognition.

Bowman, Samuel R, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. “A Large Annotated Corpus for Learning Natural Language Inference.” arXiv Preprint arXiv:1508.05326.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.”

Bruni, Elia, Nam-Khanh Tran, and Marco Baroni. 2014. “Multimodal Distributional Semantics.” Journal of Artificial Intelligence Research 49: 1–47.

Chen, Gang. 2016. “A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation.” arXiv Preprint arXiv:1610.02583.

Cheng, Jianpeng, Li Dong, and Mirella Lapata. 2016. “Long Short-Term Memory-Networks for Machine Reading.” arXiv Preprint arXiv:1601.06733.

Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. “Generating Long Sequences with Sparse Transformers.”

Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.” arXiv Preprint arXiv:1406.1078.

Chollet, Francois. 2018. Deep Learning Mit Python Und Keras: Das Praxis-Handbuch Vom Entwickler Der Keras-Bibliothek. MITP-Verlags GmbH & Co. KG.

Chomsky, Noam. 1957. Syntactic Structures. The Hague: Mouton.

Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. 2020. “Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers.”

Chung, Junyoung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” CoRR abs/1412.3555.

Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. “ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators.” In International Conference on Learning Representations.

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. “Natural Language Processing (Almost) from Scratch.” Journal of Machine Learning Research 12.

Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” arXiv Preprint arXiv:1901.02860.

Deng, Li, and Yang Liu. 2018. Deep Learning in Natural Language Processing. Springer Nature.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805.

French, Robert M. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences 3 (4). Elsevier: 128–35.

Gao, Bin, Jiang Bian, and Tie-Yan Liu. 2014. “WordRep: A Benchmark for Research on Learning Word Representations.” arXiv Preprint arXiv:1407.1640.

Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. “Convolutional Sequence to Sequence Learning.”

Gerz, Daniela, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. “SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity.” arXiv Preprint arXiv:1608.00869.

Gladkova, Anna, and Aleksandr Drozd. 2016. “Intrinsic Evaluations of Word Embeddings: What Can We Do Better?” In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 36–42.

Goldberg, Yoav. 2016. “A Primer on Neural Network Models for Natural Language Processing.” Journal of Artificial Intelligence Research 57: 345–420.

Goldberg, Yoav, and Omer Levy. 2014. “Word2vec Explained: Deriving Mikolov et Al.’s Negative-Sampling Word-Embedding Method.” arXiv Preprint arXiv:1402.3722.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT press.

Graves, Alex. 2013. “Generating Sequences with Recurrent Neural Networks.” arXiv Preprint arXiv:1308.0850.

Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. “Speech Recognition with Deep Recurrent Neural Networks.” In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–9. IEEE.

Graves, Alex, Greg Wayne, and Ivo Danihelka. 2014. “Neural Turing Machines.” CoRR abs/1410.5401.

Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. “DRAW: A Recurrent Neural Network for Image Generation.” arXiv Preprint arXiv:1502.04623.

Hancox, Peter. 1996. Natural Language Processing.

Harris, Zellig S. 1954. “Distributional Structure.” WORD 10 (2-3): 146–62.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78.

———. 2016b. “Identity Mappings in Deep Residual Networks.” In European Conference on Computer Vision, 630–45. Springer.

Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. “SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation.” Computational Linguistics 41 (4). MIT Press: 665–95.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8). MIT Press: 1735–80.

Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-tuning for Text Classification.” arXiv E-Prints, January, arXiv:1801.06146.

Hutchins, John. 2005. “The First Public Demonstration of Machine Translation: The Georgetown-IBM System, 7th January 1954.”

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ArXiv abs/1502.03167.

Johnson, Mark. 2009. “How the Statistical Revolution Changes (Computational) Linguistics.” In Proceedings of the EACL 2009 Workshop on the Interaction Between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous? (ILCL ’09), March.

Johnson, Rie, and Tong Zhang. 2016. “Convolutional Neural Networks for Text Categorization: Shallow Word-Level Vs. Deep Character-Level.” ArXiv abs/1609.00718.

———. 2017. “Deep Pyramid Convolutional Neural Networks for Text Categorization.” In ACL.

Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” arXiv Preprint arXiv:1607.01759.

———. 2017. “Bag of Tricks for Efficient Text Classification.” ArXiv abs/1607.01759.

Jurgens, David, Saif Mohammad, Peter Turney, and Keith Holyoak. 2012. “SemEval-2012 Task 2: Measuring Degrees of Relational Similarity.” In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 356–64.

Kaiser, Lukasz, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. “One Model to Learn Them All.” arXiv Preprint arXiv:1706.05137.

Kalchbrenner, Nal, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016a. “Neural Machine Translation in Linear Time.”

Kalchbrenner, Nal, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016b. “Neural Machine Translation in Linear Time.” ArXiv abs/1610.10099.

Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. 2014. “A Convolutional Neural Network for Modelling Sentences.” arXiv Preprint arXiv:1404.2188.

Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.”

Kim, Yoon. 2014. Convolutional Neural Networks for Sentence Classification.

Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. 2020. “Reformer: The Efficient Transformer.”

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks.

Kudo, Taku, and John Richardson. 2018. “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” arXiv Preprint arXiv:1808.06226.

Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. “ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.” arXiv Preprint arXiv:1909.11942.

“Language and Machines.” 1966. National Academy of Sciences National Research Council.

Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In International Conference on Machine Learning, 1188–96.

Levy, Omer, Yoav Goldberg, and Ido Dagan. 2015. “Improving Distributional Similarity with Lessons Learned from Word Embeddings.” Transactions of the Association for Computational Linguistics 3. MIT Press: 211–25.

Ling, Wang, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. “Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems.” arXiv Preprint arXiv:1705.04146.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv Preprint arXiv:1907.11692.

Luong, Minh-Thang, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. “Multi-Task Sequence to Sequence Learning.” arXiv Preprint arXiv:1511.06114.

Luong, Minh-Thang, Hieu Pham, and Christopher D Manning. 2015. “Effective Approaches to Attention-Based Neural Machine Translation.” arXiv Preprint arXiv:1508.04025.

Luong, Minh-Thang, Richard Socher, and Christopher D Manning. 2013. “Better Word Representations with Recursive Neural Networks for Morphology.” In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 104–13.

Malte, Aditya, and Pratik Ratadiya. 2019. “Evolution of Transfer Learning in Natural Language Processing.”

Manning, Christopher D., and Hinrich Schütze. 2008. Foundations of Statistical Natural Language Processing. MIT Press.

Martinc, Matej, Senja Pollak, and Marko Robnik-Šikonja. 2019. “Supervised and Unsupervised Neural Approaches to Text Readability.” arXiv Preprint arXiv:1907.11779.

McCloskey, Michael, and Neal J Cohen. 1989. “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem.” In Psychology of Learning and Motivation, 24:109–65. Elsevier.

McCorduck, Pamela. 2004. Machines Who Think. A K Peters/CRC Press.

Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2017. “Regularizing and Optimizing LSTM Language Models.” arXiv E-Prints, August, arXiv:1708.02182.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation.” arXiv Preprint arXiv:1309.4168.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems, 3111–9.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. “Recurrent Neural Network Based Language Model.” In Eleventh Annual Conference of the International Speech Communication Association.

Mikolov, Tomáš, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–51.

Morin, Frederic, and Yoshua Bengio. 2005. “Hierarchical Probabilistic Neural Network Language Model.” In AISTATS, 5:246–52. Citeseer.

Oord, Aäron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. “WaveNet: A Generative Model for Raw Audio.” arXiv Preprint arXiv:1609.03499.

Pan, S. J., and Q. Yang. 2010. “A Survey on Transfer Learning.” IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–59.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–18. Association for Computational Linguistics.

Pascanu, Razvan, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. “How to Construct Deep Recurrent Neural Networks.” arXiv Preprint arXiv:1312.6026.

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In International Conference on Machine Learning, 1310–8.

Patel, Kevin, and Pushpak Bhattacharyya. 2017. “Towards Lower Bounds on Number of Dimensions for Word Embeddings.” In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 31–36.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” arXiv E-Prints, February, arXiv:1802.05365.

Peters, Matthew E., Sebastian Ruder, and Noah A. Smith. 2019. “To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks.” arXiv E-Prints, March, arXiv:1903.05987.

Prabhavalkar, Rohit, Kanishka Rao, Tara N Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly. 2017. “A Comparison of Sequence-to-Sequence Models for Speech Recognition.” In Interspeech, 939–43.

Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. “Improving Language Understanding by Generative Pre-Training.”

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.”

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” arXiv Preprint arXiv:1910.10683.

Rajpurkar, Pranav, Robin Jia, and Percy Liang. 2018. “Know What You Don’t Know: Unanswerable Questions for SQuAD.”

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.”

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Reddy, Siva, Danqi Chen, and Christopher D. Manning. 2018. “CoQA: A Conversational Question Answering Challenge.” CoRR abs/1808.07042.

Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–12. Online: Association for Computational Linguistics.

Ruder, Sebastian. 2019. “Neural Transfer Learning for Natural Language Processing.” PhD thesis, National University of Ireland, Galway.

Ruder, Sebastian, Matthew E Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. “Transfer Learning in Natural Language Processing.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 15–18.

Scherer, Dominik, Andreas C. Müller, and Sven Behnke. 2010. “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition.” In ICANN.

Schuster, Mike, and Kuldip K Paliwal. 1997. “Bidirectional Recurrent Neural Networks.” IEEE Transactions on Signal Processing 45 (11). IEEE: 2673–81.

Schwenk, Holger, Loïc Barrault, Alexis Conneau, and Yann LeCun. 2017. Very Deep Convolutional Networks for Text Classification.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2015. “Neural Machine Translation of Rare Words with Subword Units.” arXiv Preprint arXiv:1508.07909.

Shen, Zhuoran, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. 2018. “Efficient Attention: Attention with Linear Complexities.”

Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556.

Sutskever, Ilya, James Martens, George E. Dahl, and Geoffrey E. Hinton. 2013. “On the Importance of Initialization and Momentum in Deep Learning.” In ICML.

Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, 3104–12.

Tenney, Ian, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, et al. 2020. “The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models.” arXiv Preprint arXiv:2008.05122.

Tsai, Yao-Hung Hubert, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. “Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel.”

Turing, Alan M. 1950. “Computing Machinery and Intelligence.” Mind LIX (236): 433–60.

Turing, Alan Mathison. 1948. “Intelligent Machinery.” NPL. Mathematics Division.

Turing, A. M. 1937. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society s2-42 (1): 230–65.

“Understanding LSTM Networks.” 2015. Colah’s Blog.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, 5998–6008.

Venugopalan, Subhashini, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. “Sequence to Sequence-Video to Text.” In Proceedings of the IEEE International Conference on Computer Vision, 4534–42.

Visin, Francesco, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron C. Courville, and Yoshua Bengio. 2015. “ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks.” arXiv Preprint arXiv:1505.00393.

Wan, Li, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. “Regularization of Neural Networks Using DropConnect.” In Proceedings of the 30th International Conference on Machine Learning, edited by Sanjoy Dasgupta and David McAllester, 28:1058–66. Proceedings of Machine Learning Research 3. Atlanta, Georgia, USA: PMLR.

Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.” In Advances in Neural Information Processing Systems, 3261–75.

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv Preprint arXiv:1804.07461.

Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. “Linformer: Self-Attention with Linear Complexity.”

Weaver, Warren. 1949. “The Mathematics of Communication.” Scientific American 181 (1): 11–15.

Weizenbaum, Joseph. 1966. “ELIZA-a Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM 9 (1). ACM New York, NY, USA: 36–45.

Weng, Lilian. 2018. “Attention? Attention!”

Winograd, Terry. 1972. “SHRDLU: A System for Dialog.” CUMINCAD.

Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation.” arXiv Preprint arXiv:1609.08144.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” In International Conference on Machine Learning, 2048–57.

Yamaguchi, Kouichi, Kenji Sakamoto, Toshio Akabane, and Yoshiji Fujimoto. 1990. A Neural Network for Speaker-Independent Isolated Word Recognition.

Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” In Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, 5753–63. Curran Associates, Inc.

Yeung, Joshua. 2020. “Three Major Fields of Artificial Intelligence and Their Industrial Applications.” Medium. Towards Data Science.

Zhang, Xiang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification.

Zhang, Zhengyan, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. “ERNIE: Enhanced Language Representation with Informative Entities.” arXiv Preprint arXiv:1905.07129.

Zhu, Yukun, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books.” CoRR abs/1506.06724.