Artwork

محتوای ارائه شده توسط Kyle Polich. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط Kyle Polich یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal
Player FM - برنامه پادکست
با برنامه Player FM !

Very Large Corpora and Zipf's Law

24:11
 
اشتراک گذاری
 

Manage episode 225317218 series 49487
محتوای ارائه شده توسط Kyle Polich. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط Kyle Polich یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to face a problem that was highly dimensional. Naive Bayes algorithm was celebrated in NLP applications because of its ability to efficiently process highly dimensional data.

Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses to different NLP problems, an early paper titled Scaling to Very Very Large Corpora for Natural Language Disambiguation popularized one somewhat surprising idea. For many NLP tasks, simply providing a large corpus of examples not only improved accuracy, but it also showed that asymptotically, some algorithms yielded more improvement from working on very, very large corpora.

Although not explicitly in about NLP, the noteworthy paper The Unreasonable Effectiveness of Data emphasizes this point further while paying homage to the classic treatise The Unreasonable Effectiveness of Mathematics in the Natural Sciences.

In this episode, Kyle shares a few thoughts along these lines with Linh Da.

The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be proportional to its rank in the frequency table.

  continue reading

596 قسمت

Artwork

Very Large Corpora and Zipf's Law

Data Skeptic

3,168 subscribers

published

iconاشتراک گذاری
 
Manage episode 225317218 series 49487
محتوای ارائه شده توسط Kyle Polich. تمام محتوای پادکست شامل قسمت‌ها، گرافیک‌ها و توضیحات پادکست مستقیماً توسط Kyle Polich یا شریک پلتفرم پادکست آن‌ها آپلود و ارائه می‌شوند. اگر فکر می‌کنید شخصی بدون اجازه شما از اثر دارای حق نسخه‌برداری شما استفاده می‌کند، می‌توانید روندی که در اینجا شرح داده شده است را دنبال کنید.https://fa.player.fm/legal

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to face a problem that was highly dimensional. Naive Bayes algorithm was celebrated in NLP applications because of its ability to efficiently process highly dimensional data.

Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses to different NLP problems, an early paper titled Scaling to Very Very Large Corpora for Natural Language Disambiguation popularized one somewhat surprising idea. For many NLP tasks, simply providing a large corpus of examples not only improved accuracy, but it also showed that asymptotically, some algorithms yielded more improvement from working on very, very large corpora.

Although not explicitly in about NLP, the noteworthy paper The Unreasonable Effectiveness of Data emphasizes this point further while paying homage to the classic treatise The Unreasonable Effectiveness of Mathematics in the Natural Sciences.

In this episode, Kyle shares a few thoughts along these lines with Linh Da.

The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be proportional to its rank in the frequency table.

  continue reading

596 قسمت

همه قسمت ها

×
 
Loading …

به Player FM خوش آمدید!

Player FM در سراسر وب را برای یافتن پادکست های با کیفیت اسکن می کند تا همین الان لذت ببرید. این بهترین برنامه ی پادکست است که در اندروید، آیفون و وب کار می کند. ثبت نام کنید تا اشتراک های شما در بین دستگاه های مختلف همگام سازی شود.

 

راهنمای مرجع سریع

در حین کاوش به این نمایش گوش دهید
پخش