Player FM - Internet Radio Done Right
23 subscribers
Checked 1+ y ago
Added six years ago
Content provided by Linear Digressions, Ben Jaffe, and Katie Malone. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Linear Digressions, Ben Jaffe, and Katie Malone or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://fa.player.fm/legal
Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng
As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institutions are coming to terms with what’s required to educate the next cohorts of data scientists. The heterogeneity and speed of the field makes it challenging for even the most talented and dedicated educators to know what a data science education “should” look like. This doesn’t faze Xiao-Li Meng, Professor of Statistics at Harvard University and founding Editor-in-Chief of the Harvard Data Science Review. He’s our interview guest in this episode, talking about the pedagogically distinct classes of data science and how he thinks about designing curricula for making anyone more data literate. From new initiatives in data science to dealing with data science FOMO, this wide-ranging conversation with a leading scholar gives us a lot to think about. Relevant links: https://hdsr.mitpress.mit.edu/
291 episodes
All episodes
All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn’t cover data science: it’s mostly reminiscing, thanking our wonderful audience (that’s you!), and marveling at how this thing that started out as a side project grew into a huge part of our lives for over 5 years. It’s been a ride, and a real pleasure and privilege to talk to you each week. Thanks, best wishes, and good night! - Katie and Ben…
The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two computer vision algorithms, one that diagnoses diabetic retinopathy and another that classifies liver cancer, and asks the question—are patients now getting better care, and achieving better outcomes, with these algorithms in the mix? The answer isn’t no, exactly, but it’s not a resounding yes, because these algorithms interact with a very complex system (the healthcare system) and other shortcomings of that system are proving hard to automate away. Getting a faster diagnosis from an image might not be an improvement if the image is now harder to capture (because of strict data quality requirements associated with the algorithm that wouldn’t stop a human doing the same job). Likewise, an algorithm getting a prediction mostly correct might not be an overall benefit if it introduces more dramatic failures when the prediction happens to be wrong. For every data scientist whose work is deployed into some kind of product, and is being used to solve real-world problems, these papers underscore how important and difficult it is to consider all the context around those problems.…
A few weeks ago, we put out a call for data scientists interested in issues of race and racism, or people studying those topics with data science methods, to get in touch and come talk to our audience about their work. This week we’re excited to bring on Todd Hendricks, Bay Area data scientist and a volunteer who reached out to tell us about his studies with the Stanford Open Policing dataset.…
This is a re-release of an episode that originally ran in October 2019. If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case that are optimized for the needs and constraints of that use case. You also wouldn’t be YouTube, which found themselves with this problem (gigantic data needs and several very different use cases of what they needed to do with that data) and went a different way: they built one analytics data system to serve them all. Procella, the system they built, is the topic of our episode today: by deconstructing the system, we dig into the four motivating uses of this system, the complexity they had to introduce to service all four uses simultaneously, and the impressive engineering that has to go into building something that “just works.”…
Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintained by a small number of individuals, some of whom are institutionally supported, but many of whom do this maintenance on a purely volunteer basis. The health of the data science ecosystem depends on the support of open source projects, on an individual and institutional level. https://hdsr.mitpress.mit.edu/pub/xsrt4zs2/release/2…
This is a re-release of an episode that first ran on January 29, 2017. This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.
This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science methods to studies of criminal behavior, and got in touch after our last episode (about racially complicated recidivism algorithms). Our conversation covers a wide range of topics—common misconceptions around race and crime statistics, how methodologically-driven criminology scholars think about building crime prediction models, and how to think about policy changes when we don’t have a complete understanding of cause and effect in criminology. For the many of us currently re-thinking race and criminal justice, but wanting to be data-driven about it, this conversation with Zach is a must-listen.…
As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and amplifies racism in the American criminal justice system. COMPAS is an algorithm that claims to give a prediction about the likelihood of an offender to re-offend if released, based on the attributes of the individual, and guess what: it shows disparities in the predictions for black and white offenders that would nudge judges toward giving harsher sentences to black individuals. We dig into this algorithm a little more deeply, unpacking how different metrics give different pictures into the “fairness” of the predictions and what is causing its racially disparate output (to wit: race is explicitly not an input to the algorithm, and yet the algorithm gives outputs that correlate with race—what gives?) Unfortunately it’s not an open-and-shut case of a tuning parameter being off, or the wrong metric being used: instead the biases in the justice system itself are being captured in the algorithm outputs, in such a way that a self-fulfilling prophecy of harsher treatment for black defendants is all but guaranteed. Like many other things this week, this episode left us thinking about bigger, systemic issues, and why it’s proven so hard for years to fix what’s broken.…
A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.
This is a re-release of an episode that originally aired on April 1, 2018. If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.…
This is a re-release of an episode that was originally released on February 26, 2017. When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.…
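To make the idea of combining individual and group information concrete, here is a minimal sketch of one common form of the estimator: the positive-part James-Stein estimator that shrinks each observation toward the group mean. It is an illustration, not something taken from the episode. The function name `james_stein` and the assumption that every observation shares one known noise variance `sigma2` are ours.

```python
from statistics import mean


def james_stein(observations, sigma2):
    """Shrink each observation toward the group mean.

    Assumes (hypothetically) that every observation is a noisy
    measurement with the same known noise variance sigma2.
    """
    k = len(observations)
    if k < 4:
        # Shrinking toward an estimated group mean needs at least 4 values.
        raise ValueError("need at least 4 observations")
    grand_mean = mean(observations)
    spread = sum((x - grand_mean) ** 2 for x in observations)
    if spread == 0:
        # All observations are identical, so there is nothing to shrink.
        return list(observations)
    # Positive-part shrinkage factor: 0 pulls everything to the group
    # mean, 1 leaves the individual observations untouched.
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / spread)
    return [grand_mean + shrink * (x - grand_mean) for x in observations]


# Hypothetical early-season batting averages and noise variance.
averages = [0.400, 0.300, 0.250, 0.200, 0.350]
estimates = james_stein(averages, sigma2=0.004)
```

Each estimate lands between the player's own average and the team average; the noisier the individual measurements are relative to the spread of the group, the harder the pull toward the group mean.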
The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be ways to re-identify the records or otherwise figure out sensitive personal information. That problem has motivated the study of differential privacy, a set of techniques and definitions for keeping personal information private when datasets are released or used for study. Differential privacy is getting a big boost this year, as it’s being implemented across the 2020 US Census as a way of protecting the privacy of census respondents while still opening up the dataset for research and policy use. When two important topics come together like this, we can’t help but sit up and pay attention.…
What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusions from naive data methods leads to biased answers) but economists Susan Athey and Guido Imbens are on the case. This episode explores their algorithm for recursively partitioning a dataset to find heterogeneous treatment effects, or for you ML nerds, applying decision trees to causal inference problems. It’s not a free lunch, but for those (like us!) who love crossover topics, causal trees are a smart approach from one field hopping the fence to another. Relevant links: https://www.pnas.org/content/113/27/7353…
You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and annotated so you can quickly extract the information in the underlying data using just visual cues. It’s a bit abstract but very profound, and these principles underlie the ggplot2 package in R that makes famously beautiful plots with minimal code. This episode covers a paper by Hadley Wickham (author of ggplot2, among other R packages) that unpacks the layered approach to graphics taken in ggplot2, and makes clear the assumptions and structure of many familiar data visualizations.…
It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate: linear? quadratic? sinusoidal? some combination of these, and perhaps others? Gaussian processes introduce a nonparametric option where you can fit over all the possible types of functions, using the data points in your datasets as constraints on the results that you get (the idea being that, no matter what the “true” underlying function is, it produced the data points you’re trying to fit). What this means is a very flexible, but depending on your parameters not-too-flexible, way to fit complex datasets. The math underlying GPs gets complex, and the links below contain some excellent visualizations that help make the underlying concepts clearer. Check them out! Relevant links: http://katbailey.github.io/post/gaussian-processes-for-dummies/ https://thegradient.pub/gaussian-process-not-quite-for-dummies/ https://distill.pub/2019/visual-exploration-gaussian-processes/…
The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges is how, exactly, to do that structuring and analysis—data scientists working with this data have hundreds or thousands of small, and sometimes large, decisions to make in their day-to-day analysis work. What data should they include in their studies? What method should they use to analyze it? What hyperparameter settings should they explore, and how should they pick a value for their hyperparameters? The thing that’s really difficult here is that, depending on which path they choose among many reasonable options, a data scientist can get really different answers to the underlying question, which makes you wonder how to conclude anything with certainty at all. The paper for this week’s episode performs a systematic study of many, many different permutations of the questions above on a set of benchmark datasets where the “right” answers are known. Which strategies are most likely to yield the “right” answers? That’s the whole topic of discussion. Relevant links: https://hdsr.mitpress.mit.edu/pub/fxz7kr65…
AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC Berkeley, has a formulation for modifications to AI that we should study and try implementing now to keep it much safer in the long run. Prof. Russell’s new book, “Human Compatible: Artificial Intelligence and the Problem of Control” gives an accessible but deeply thoughtful exploration of why he thinks runaway AI is something we need to be considering seriously now, and what changes in formulation might be a solution. This episode features Prof. Russell as a special guest, exploring the topics in his book and giving more perspective on the long-term possible futures of AI: both good and bad. Relevant links: https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/…
Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or python. But if we think ahead a few years, a few visionary researchers are starting to see a world in which the ML pipelines can actually be deployed inside the database. Why? One strong advantage for databases is they have built-in features for data governance, including things like permissioning access and tracking the provenance of data. Adding machine learning as another thing you can do in a database means that, potentially, these enterprise-grade features will be available for ML models too, which will make them much more widely accepted across enterprises with tight IT policies. The papers this week articulate the gap between enterprise needs and current ML infrastructure, how ML in a database could be a way to knit the two closer together, and a proof-of-concept that ML in a database can actually work. Relevant links: https://blog.acolyer.org/2020/02/19/ten-year-egml-predictions/ https://blog.acolyer.org/2020/02/21/extending-relational-query-processing/…
Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for many of us, and can hold some challenges compared to coming in to the office every day. This episode explores this a little bit, informally, as we compare our new work-from-home setups and reflect on what’s working well and what we’re finding challenging.…
Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads and especially how much of the spread of the disease comes from carriers who are experiencing no or mild symptoms but are contagious anyway. This episode digs into the epidemiological model that was published in Science this week. The model finds that the data suggests that the majority of carriers of the coronavirus, 80-90%, do not have a detected disease. This has big implications for the importance of social distancing as a way to get the pandemic under control and explains why a more comprehensive testing program is critical for the United States. Also, in lighter news, Katie (a native of Dayton, Ohio) lays a data-driven claim for just declaring the University of Dayton Flyers to be the 2020 NCAA College Basketball champions. Relevant links: https://science.sciencemag.org/content/early/2020/03/13/science.abb3221…
Network effects re-release: when the power of a public health measure lies in widespread adoption (26:40)
This week’s episode is a re-release of a recent episode, which we don’t usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public health measures for infectious diseases get most of their effectiveness from their widespread adoption: most of the protection you get from a vaccine, for example, comes from all the other people who also got the vaccine. That’s why measures like social distancing are so important right now: even if you’re not in a high-risk group for covid-19, you should still stay home and avoid in-person socializing because your good behavior lowers the risk for those who are in high-risk groups. If we all take these kinds of measures, the risk lowers dramatically. So stay home, work remotely if you can, avoid physical contact with others, and do your part to manage this crisis. We’re all in this together.…
Causal inference when you can't experiment: difference-in-differences and synthetic controls (20:48)
When you need to untangle cause and effect, but you can’t run an experiment, it’s time to get creative. This episode covers difference in differences and synthetic controls, two observational causal inference techniques that researchers have used to understand causality in complex real-world situations.…
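As a rough illustration of the first technique, the basic two-group, two-period difference-in-differences estimate can be sketched as below. The function name `diff_in_diff` and the sample numbers are hypothetical, and the sketch relies on the usual parallel-trends assumption (the control group's change stands in for what would have happened to the treated group without treatment).

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate.

    Subtracts the control group's before/after change from the treated
    group's before/after change, which removes any trend shared by both.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcome measurements before and after an intervention.
effect = diff_in_diff(
    treated_pre=[10, 12],
    treated_post=[15, 17],
    control_pre=[8, 10],
    control_post=[9, 11],
)
```

Here the treated group rises by 5 and the control group by 1, so the estimated effect of the intervention is 4.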
This is a re-release of an episode that originally ran on October 21, 2018. The Poisson distribution is a probability distribution used to model events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things: there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite everyday applications: using the Poisson distribution to identify supernovas and study army deaths from horse kicks.…
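For readers who want to see how simple the distribution is to use, here is a minimal sketch of the Poisson probability mass function, P(k) = e^(-rate) * rate^k / k!. The function name `poisson_pmf` and the rate of 0.5 events per interval are illustrative assumptions, not figures from the episode.

```python
import math


def poisson_pmf(k, rate):
    """Probability of seeing exactly k events in an interval,
    given an average of `rate` events per interval."""
    if k < 0 or rate < 0:
        raise ValueError("k and rate must be non-negative")
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical rate: on average 0.5 events per interval.
rate = 0.5
probabilities = [poisson_pmf(k, rate) for k in range(5)]
```

The resulting list gives the probability of 0, 1, 2, 3, and 4 events in one interval; comparing such predicted probabilities against observed counts is the basic move behind both applications mentioned above.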
Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neural nets, at least) there are smaller subnetworks present where most of the predictive power resides. The fascinating thing is that, for some of these subnetworks (so-called “winning lottery tickets”), it’s not the training process that makes them good at their classification or regression tasks: they just happened to be initialized in a way that was very effective. This changes the way we think about what training might be doing, in a pretty fundamental way. Sometimes, instead of crafting a good fit from whole cloth, training might be finding the parts of the network that always had predictive power to begin with, and isolating and strengthening them. This research is pretty recent, having only come to prominence in the last year, but nonetheless challenges our notions about what it means to train a machine learning model.…
Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR are imposing more stringent rules on who can use what data for what purposes, with an end goal of giving consumers more control and privacy around their data. This episode digs into this topic, but not from a security or legal perspective—this week, we talk about some of the interesting technical challenges introduced by a simple idea: a company should remove a user’s data from their database when that user asks to be removed. We talk about two topics, namely using Bloom filters to efficiently find records in a database (and what Bloom filters are, for that matter) and types of machine learning algorithms that can un-learn their training data when it contains records that need to be deleted.…
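Since the blurb only names Bloom filters, a minimal sketch may help show what one is: a fixed-size bit array plus several hash functions, which can say "definitely not present" or "possibly present" without storing the items themselves. The class below is an illustration under our own assumptions (the name `BloomFilter`, SHA-256 with a seed prefix as the hash family, and the default sizes); it is not the implementation discussed in the episode.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means the item was never added; True means it may have been.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical user ID that appears in some data store.
users = BloomFilter()
users.add("user-123")
found = users.might_contain("user-123")
```

For a deletion request, a check like this can cheaply rule out stores that definitely do not hold a user's records, so only the "possibly present" ones need a full scan.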
Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you’re constantly hearing that data science is revolutionizing the world and the firms that survive and thrive in the coming years are those that execute on a data strategy. What does this mean for your company? How can you best guide your established firm through a successful transition to becoming data-driven? How do you balance the momentum your firm has right now, and the need to support all your current products, customers and operations, against a new and relatively unknown future? If you’re working as a data scientist at a mature and well-established company, these are the worries on the mind of your boss’s boss’s boss. The worries on your mind may be similar: you’re trying to understand where your work fits into the bigger picture, you need to break down silos, you’re often running into cultural headwinds created by colleagues who don’t understand or trust your work. Congratulations, you’re in the midst of a classic set of challenges encountered by innovation initiatives everywhere. Harvard Business School professor Clayton Christensen wrote a classic business book (The Innovator’s Dilemma) explaining the paradox of trying to innovate in established companies, and why the structure and incentives of those companies almost guarantee an uphill climb to innovate. This week’s episode breaks down the innovator’s dilemma argument, and what it means for data scientists working in mature companies trying to become more data-centric.…
As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institutions are coming to terms with what’s required to educate the next cohorts of data scientists. The heterogeneity and speed of the field makes it challenging for even the most talented and dedicated educators to know what a data science education “should” look like. This doesn’t faze Xiao-Li Meng, Professor of Statistics at Harvard University and founding Editor-in-Chief of the Harvard Data Science Review. He’s our interview guest in this episode, talking about the pedagogically distinct classes of data science and how he thinks about designing curricula for making anyone more data literate. From new initiatives in data science to dealing with data science FOMO, this wide-ranging conversation with a leading scholar gives us a lot to think about. Relevant links: https://hdsr.mitpress.mit.edu/…
Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that’s not a safe assumption, especially when there are network effects (like in almost any social context, for instance!) SUTVA, or the stable unit treatment value assumption, is a big phrase for this assumption and violations of SUTVA make for some pretty interesting experiment designs. From news feeds in LinkedIn to disentangling herd immunity from individual immunity in vaccine studies, indirect (i.e. network) effects in experiments can be just as big as, or even bigger than, direct (i.e. individual) effects. And this is what we talk about this week on the podcast. Relevant links: http://hanj.cs.illinois.edu/pdf/www15_hgui.pdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2600548/pdf/nihms-73860.pdf…
Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one of an infinite number of mistakes in labeling data that humans would never make but computers make with joyous abandon. What gives? A compelling new argument makes the case that it’s not the algorithms so much as the features in the datasets that hold the clue. This week’s episode goes through several papers pushing our collective understanding of adversarial examples, and giving us clues to what makes these counterintuitive cases possible. Relevant links: https://arxiv.org/pdf/1905.02175.pdf https://arxiv.org/pdf/1805.12152.pdf https://distill.pub/2019/advex-bugs-discussion/ https://arxiv.org/pdf/1911.02508.pdf…
Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some advantages. This episode gives a quick recap of t-SNE, especially the connection it shares with information theory, then gets into how UMAP is different (many say better). Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP’s advantages largely stem from good initialization, not from advantages inherent in the algorithm. We don’t cover that argument here obviously, because it wasn’t out there when we were recording, but you can find a link to the paper below. Relevant links: https://pair-code.github.io/understanding-umap/ https://www.biorxiv.org/content/10.1101/2019.12.19.877522v1…