Statistics in the age of Data Science

Mine Cetinkaya-Rundel
September 23, 2024

In the age of data science, traditional statistical methods are crucial, but they are increasingly combined with computational tools and predictive modeling techniques. This talk highlights Duke University's large introductory data science course, which provides students with a strong foundation in exploratory data analysis encompassing data importing, visualization, transformation, and summarization, as well as statistical inference and descriptive and predictive modeling techniques using the R programming language. The course emphasizes real-world applications, ethical considerations, and the importance of reproducibility in data analysis. By integrating classical statistical theory with modern computational approaches, the course equips students to succeed in a data-driven world. We will share the pedagogical strategies, challenges, and successes in preparing students for careers in data science.

Transcript

  1. Indeed, even Haidt, who has also emphasized broader changes to the culture of childhood, estimated that social media use is responsible for only about 10 percent to 15 percent of the variation in teenage well-being — which would be a significant correlation, given the complexities of adolescent life and of social science, but is also a much more measured estimate than you tend to see in headlines trumpeting the connection. … In Britain, the share of young people who reported “feeling down” or experiencing depression grew from 31 percent in 2012 to 38 percent on the eve of the pandemic and to 41 percent in 2021. That is significant, though by other measures British teenagers appear, if more depressed than they were in the 2000s, not much more depressed than they were in the 1990s. …
  2. Doing data science (diagram): Program; Import; Tidy; Understand (Transform, Visualize, Model); Communicate. Source: Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.
  3. focus on: data visualisation; data wrangling, tidying, acquisition; exploratory data analysis; predictive modeling + uncertainty quantification; effective communication of results

    foray into: interactive visualizations; text analysis; machine learning; Bayesian inference; …

    emphasize: consistent syntax | R + tidyverse; reproducibility | Quarto; version control and collaboration | Git + GitHub
  4. population

    # A tibble: 217 × 3
       country              year population
       <chr>               <dbl>      <dbl>
     1 Afghanistan          2022    41129.
     2 Albania              2022     2778.
     3 Algeria              2022    44903.
     4 American Samoa       2022       44.3
     5 Andorra              2022       79.8
     6 Angola               2022    35589.
     7 Antigua and Barbuda  2022       93.8
     8 Argentina            2022    46235.
     9 Armenia              2022     2780.
    10 Aruba                2022      106.
    # ℹ 207 more rows

    continents

    # A tibble: 285 × 4
       entity                code      year continent
       <chr>                 <chr>    <dbl> <chr>
     1 Abkhazia              OWID_ABK  2015 Asia
     2 Afghanistan           AFG       2015 Asia
     3 Akrotiri and Dhekelia OWID_AKD  2015 Asia
     4 Aland Islands         ALA       2015 Europe
     5 Albania               ALB       2015 Europe
     6 Algeria               DZA       2015 Africa
     7 American Samoa        ASM       2015 Oceania
     8 Andorra               AND       2015 Europe
     9 Angola                AGO       2015 Africa
    10 Anguilla              AIA       2015 North America
    # ℹ 275 more rows

    population_continents <- left_join(population, continents, join_by(country == entity))

    ✓ data joins
  5. population_continents |> filter(is.na(continent))

    # A tibble: 6 × 6
      country                   year.x population code  year.y continent
      <chr>                      <dbl>      <dbl> <chr>  <dbl> <chr>
    1 Congo, Dem. Rep.            2022     99010. NA        NA NA
    2 Congo, Rep.                 2022      5970. NA        NA NA
    3 Hong Kong SAR, China        2022      7346. NA        NA NA
    4 Korea, Dem. People's Rep.   2022     26069. NA        NA NA
    5 Korea, Rep.                 2022     51628. NA        NA NA
    6 Kyrgyz Republic             2022      6975. NA        NA NA

    ✓ data joins ✓ data wrangling
  6. population_continent <- population |>
      mutate(country = case_when(
        country == "Congo, Dem. Rep."          ~ "Democratic Republic of Congo",
        country == "Congo, Rep."               ~ "Congo",
        country == "Hong Kong SAR, China"      ~ "Hong Kong",
        country == "Korea, Dem. People's Rep." ~ "North Korea",
        country == "Korea, Rep."               ~ "South Korea",
        country == "Kyrgyz Republic"           ~ "Kyrgyzstan",
        .default = country
      )) |>
      left_join(continents, by = join_by(country == entity))

    ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics
  7. ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations
  8. ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations ✓ mapping ✓ iteration
  9. David Spiegelhalter, Professor, University of Cambridge: “There is no substitute for simply looking at data properly.” from “The Art of Statistics”
  10. ✓ web scraping

    chronicle

    # A tibble: 500 × 6
       title                                            author date       abstract column url
       <chr>                                            <chr>  <date>     <chr>    <chr>  <chr>
     1 All the world’s a stage                          Anna … 2024-02-22 If we a… STUDE… http…
     2 Words that matter: For Alexei Navalny            Carol… 2024-02-22 In some… STUDE… http…
     3 Which would you save: Friend or romantic partn… Jess … 2024-02-22 Love sh… STUDE… http…
     4 Happiness is not what you’re looking for         Paul … 2024-02-21 We hing… STUDE… http…
     5 Closing Duke's Herbarium: A fear of long-term … Matth… 2024-02-21 Without… LETTE… http…
     6 CS Majors launch 'ambiguous and labelless rela… Monda… 2024-02-20 Unlike … STUDE… http…
     7 The fear of being single                         Heidi… 2024-02-20 But it … STUDE… http…
     8 Save the Duke Herbarium                          Henry… 2024-02-17 The Duk… LETTE… http…
     9 What Duke can learn from retiring ex-president… Rober… 2024-02-17 In Duke… GUEST… http…
    10 Love, love                                       Gabri… 2024-02-16 Somehow… STUDE… http…
    # ℹ 490 more rows
  11. ✓ web scraping ✓ terms of use ✓ ethics

    robotstxt::paths_allowed("https://www.dukechronicle.com")

     www.dukechronicle.com
    [1] TRUE
  12. ✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization
  13. ✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization ✓ sentiment analysis
  14. Sarah Jarvis, Director of Applied Machine Learning and Data Science at Secondmind: “Data science is all about asking interesting questions based on the data you have—or often the data you don’t have.”
  15. ✓ logistic regression ✓ classification ✓ decision errors ✓ sensitivity / specificity ✓ intuition around loss functions
  16. George Box, Statistician: “All models are wrong but some are useful” from “Robustness in the strategy of scientific model building”
  17. 300 students: Lecture x 2 per week

    10 x 30 students: Lab x 1 per week
  18. teams: weekly labs in teams + periodic team evaluations + term project in teams

    peer feedback: at various stages of the project

    live coding: in every “lecture”, along with time for students to attempt exercises on their own

    “minute paper”: weekly online quizzes ending with a brief reflection on the week’s material

    creativity: assignments that make room for creativity

    nudges: periodically throughout the semester
  19. Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.
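
Slides 4 to 6 of the transcript work through a join whose keys do not fully match. The following is a minimal sketch of that workflow on toy data, not the code used in the course. It assumes dplyr 1.1.0 or later (for `join_by()` and `.default`) and R 4.1 or later (for `|>`). The `continents` rows for South Korea and Kyrgyzstan are assumed for illustration, since the slides imply them but do not print them. `anti_join()` is swapped in for the slides' `filter(is.na(continent))` as another way to list unmatched keys before joining.

```r
library(dplyr)
library(tibble)

# Toy subset of `population`; values rounded as printed on slides 4 and 5.
population <- tribble(
  ~country,          ~year, ~population,
  "Afghanistan",      2022,       41129,
  "Albania",          2022,        2778,
  "Korea, Rep.",      2022,       51628,
  "Kyrgyz Republic",  2022,        6975
)

# Toy subset of `continents`. The last two rows are assumptions:
# they do not appear in the rows printed on the slides.
continents <- tribble(
  ~entity,       ~code, ~year, ~continent,
  "Afghanistan", "AFG",  2015, "Asia",
  "Albania",     "ALB",  2015, "Europe",
  "South Korea", "KOR",  2015, "Asia",
  "Kyrgyzstan",  "KGZ",  2015, "Asia"
)

# Rows of `population` whose `country` has no matching `entity`.
unmatched <- population |>
  anti_join(continents, by = join_by(country == entity))

# Recode the mismatched names (as on slide 6), then join.
population_continent <- population |>
  mutate(country = case_when(
    country == "Korea, Rep."     ~ "South Korea",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    .default = country
  )) |>
  left_join(continents, by = join_by(country == entity))

# After recoding, no row should be missing a continent.
n_missing <- sum(is.na(population_continent$continent))
```

Inspecting `unmatched` before the join shows which names need recoding, and checking `n_missing` afterwards confirms that the recoding covered every case.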