Open Pre-Trained Transformer Language Models (OPT): What does it take to train GPT-3?
Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella i Sapé discuss the recent "Open Pre-trained Transformer (OPT) Language Models" paper from Meta AI (formerly Facebook). In this replication work, Meta developed and trained a 175-billion-parameter Transformer very similar to OpenAI's GPT-3, documenting the process in detail to share their findings with the community. The code, pretrained weights, and logbook are available in their GitHub repository (links below).
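If you want to poke at the released weights yourself, here is a minimal sketch of loading one of the smaller OPT checkpoints. It assumes the Hugging Face transformers library and the publicly hosted "facebook/opt-125m" checkpoint; the episode itself points to Meta's metaseq repository, so treat this as one convenient way to try the models, not the setup discussed in the episode.

```python
# Sketch: load a small OPT checkpoint and generate a continuation.
# Assumes `pip install transformers torch` and access to the
# "facebook/opt-125m" checkpoint on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # smallest of the released OPT sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Open Pre-trained Transformers are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```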
Links
❓Feedback Form: https://scastella.typeform.com/to/rg7a5GfJ
📄 OPT paper: https://arxiv.org/abs/2205.01068
👾 Code: https://github.com/facebookresearch/metaseq
📒 Logbook: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
✍️ OPT Official Blog Post: https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/
OpenAI Embeddings API: https://openai.com/blog/introducing-text-and-code-embeddings/
Nils Reimers' critique of OpenAI Embeddings API: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
Timestamps:
00:00 Introduction and housekeeping: new feedback form, ACL conference highlights
02:42 The convergence between NLP and Neural IR techniques
06:43 Open Pretrained Transformer motivation and scope, reproducing GPT-3 and open-sourcing
08:16 Basics of OPT: architecture, pre-training objective, teacher forcing, tokenizer, training data (the objective is sketched in code after the timestamps)
13:40 Findings from the preliminary experiments: hyperparameters, training stability, loss spikiness
20:08 Problems that appear at scale when training with 992 GPUs
23:01 Using temperature to check whether GPUs are working
25:00 Training the largest model: what to do when the loss explodes? (which happens quite often)
29:15 When they switched from AdamW to SGD
32:00 Results: successful but not quite GPT-3 level. Toxicity?
35:45 Replicability of Large Language Models research. Was GPT-3 replicable? What difference does it make?
37:25 What makes a paper replicable?
40:33 Directions in which Large Language Models are applied to Information Retrieval
45:15 Final thoughts and takeaways
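For listeners who want the 08:16 discussion in code form, here is a minimal, hypothetical sketch of the next-token prediction objective with teacher forcing that decoder-only models like OPT and GPT-3 are trained on. The variable names and the `model` interface are illustrative assumptions, not taken from metaseq.

```python
# Sketch: causal language-modeling loss with teacher forcing.
# "Teacher forcing" means the model always conditions on the ground-truth
# prefix from the training data, never on its own generated tokens.
import torch
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of ground-truth token ids.
    inputs = token_ids[:, :-1]   # model sees tokens t_0 .. t_{n-2}
    targets = token_ids[:, 1:]   # and must predict t_1 .. t_{n-1}
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size), assumed interface
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```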