Content provided by BlueDot Impact. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by BlueDot Impact or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://fa.player.fm/legal

Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

35:05
 

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively fine-tune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive fine-tuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.

We find that simple methods can often significantly improve weak-to-strong generalization: for example, when fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
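One way to make "recovering" a strong model's capabilities concrete is the performance gap recovered (PGR) metric used in the source paper: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The sketch below is a minimal illustration; the function name, argument names, and accuracy values are hypothetical and are not results from the paper.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap closed by weak supervision.

    weak:            performance of the weak supervisor
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only (not taken from the paper).
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")
```

A PGR of 1 would mean the weakly supervised strong model matches its ceiling, and a PGR of 0 would mean it does no better than its weak supervisor.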

Source:
https://arxiv.org/pdf/2312.09390.pdf
Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.


Chapters

1. Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (00:00:00)

2. ABSTRACT (00:00:18)

3. 1 INTRODUCTION (00:01:48)

4. 3 METHODOLOGY (00:09:29)

5. 4 MAIN RESULTS (00:14:44)

6. 4.1 TASKS (00:14:53)

7. 4.2 NAIVELY FINETUNING ON WEAK LABELS (00:17:01)

8. 4.3 IMPROVING WEAK-TO-STRONG GENERALIZATION IS TRACTABLE (00:20:10)

9. 4.3.1 BOOTSTRAPPING WITH INTERMEDIATE MODEL SIZES (00:20:30)

10. 4.3.2 AN AUXILIARY CONFIDENCE LOSS CAN DRAMATICALLY IMPROVE GENERALIZATION ON NLP TASKS (00:23:00)

11. 6 DISCUSSION (00:25:41)

12. 6.1 REMAINING DISANALOGIES (00:26:01)

13. 6.2 FUTURE WORK (00:29:00)

14. 6.2.1 CONCRETE PROBLEMS: ANALOGOUS SETUPS (00:29:26)

15. 6.2.2 CONCRETE PROBLEMS: SCALABLE METHODS (00:30:56)

16. 6.2.3 CONCRETE PROBLEMS: SCIENTIFIC UNDERSTANDING (00:32:33)

17. 6.3 CONCLUSION (00:33:58)

83 episodes

