My toolbox is full of shiny tools, do I also need super powers?

Over the past decade, the number of different computational tools our students encounter throughout their undergraduate education has increased greatly. But having a toolbox full of shiny tools is not sufficient for the modern student to be a productive statistician or data scientist. The modern student needs to learn to use these tools in harmony with each other. And unlike superheroes, who tend to be good at using one super power well, the modern student needs to have practical familiarity with many "super powers". In this talk I'll discuss how to integrate various super powers into statistics and data science curricula, e.g., shapeshifting (data manipulation), clairvoyance (predictive modeling), time travel (version control), and perhaps most importantly empathy, as "with great power comes great responsibility".

Mine Cetinkaya-Rundel

May 25, 2022

Transcript

  1. My toolbox is full of shiny tools, do I also need super powers?
    Mine Çetinkaya-Rundel, Duke University / RStudio
    🔗 bit.ly/superpowers-ecots22
  2. Data visualization: Graphic vision
    ‣ Start, literally, on day one and continue improving throughout the curriculum
    ‣ Teach it to
      ‣ motivate inquiry and exploration
      ‣ support multivariate thinking
      ‣ effectively communicate results and findings
      ‣ advance programming skills
      ‣ aid inferential decisions
  3. Data visualization on day one
    ‣ Ready-to-go computing environment
    ‣ Reproducible document with code to produce the visualization
    ‣ Code that’s obviously straightforward to modify for customizing the plot

      unvotes |>
        filter(country %in% c("United Kingdom", "United States", "France")) |>
        ggplot(…)
  4. Data visualization later in curriculum
    ‣ “Recreate” to advance programming skills
    ‣ “Recreate, then improve” to advance programming and communication skills
  5. Data visualization later in curriculum
    ‣ “Recreate” to advance programming skills
    ‣ “Recreate, then improve” to advance programming and communication skills
    ‣ “Go beyond the basics” exercises to introduce commonly used visuals in scientific communication
  6. Data visualization for inference
    ‣ Take visualizations beyond EDA
    ‣ Use them to assess significance, as an alternative method for inference
  7. Data wrangling: Shapeshifting
    ‣ Start with data summarizing, then move on to data reshaping and tidying
    ‣ Teach it to
      ‣ motivate inquiry and exploration
      ‣ join data from multiple sources
      ‣ preprocess data for statistical analysis
  8. Data wrangling for summarization
    ‣ Start with the basics as early as possible

      penguins |>
        count(island, species)

      # A tibble: 5 × 3
        island    species       n
        <fct>     <fct>     <int>
      1 Biscoe    Adelie       44
      2 Biscoe    Gentoo      124
      3 Dream     Adelie       56
      4 Dream     Chinstrap    68
      5 Torgersen Adelie       52
  9. Data wrangling for summarization
    ‣ Start with the basics as early as possible
    ‣ Wrangle further for better presentation

      penguins |>
        count(island, species) |>
        pivot_wider(names_from = species, values_from = n, values_fill = 0)

      # A tibble: 3 × 4
        island    Adelie Gentoo Chinstrap
        <fct>      <int>  <int>     <int>
      1 Biscoe        44    124         0
      2 Dream         56      0        68
      3 Torgersen     52      0         0
  10. Data wrangling for data tidying
    ‣ Introduce more advanced data wrangling tools for joining multiple datasets into a single tidy dataset
  11. Data wrangling for data tidying
    ‣ Introduce more advanced data wrangling tools for joining multiple datasets into a single tidy dataset
    ‣ Reshape data that comes in non-tidy format into a tidy format

      ## [
      ##   {
      ##     "gender": ["Female"],
      ##     "first_name": ["Kimberly"],
      ##     "last_name": ["Beckstead"],
      ##     "age": [24],
      ##     "phone_number": ["216-555-2549"],
      ##     "purchases": [
      ##       {
      ##         "SetID": [24701],
      ##         "Number": ["76062"],
      ##         "Theme": ["DC Comics Super Heroes"],
      ##         "Subtheme": ["Mighty Micros"],
      ##         "Year": [2016],
      ##         "Name": ["Robin vs. Bane"],
      ##         "Pieces": [77],
      ##         "USPrice": [9.99],
      ##         "ImageURL": ["http://images.brickset.com/sets/images/76062-1.jpg"],
      ##         "Quantity": [1]
      ##       }
      ##     ]
      ##   }
      ## ]
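One way the reshaping step on this slide might look in practice is sketched below, assuming the tidyr and tibble packages are available. The `sales` list is a hypothetical, simplified stand-in for the parsed JSON (the second customer and their purchases are made up, and the single-element array wrapping seen on the slide is omitted); the slide itself does not say which functions were used.

```r
library(tibble)
library(tidyr)

# Hypothetical nested list mirroring the shape of the JSON on the slide:
# one element per customer, each carrying a list of purchases.
sales <- list(
  list(
    first_name = "Kimberly",
    last_name = "Beckstead",
    purchases = list(
      list(Name = "Robin vs. Bane", Pieces = 77, Quantity = 1)
    )
  ),
  list(
    first_name = "Example",
    last_name = "Customer",
    purchases = list(
      list(Name = "Made-up set A", Pieces = 100, Quantity = 2),
      list(Name = "Made-up set B", Pieces = 250, Quantity = 1)
    )
  )
)

sales_tidy <- tibble(customer = sales) |>
  unnest_wider(customer) |>    # one column per customer field
  unnest_longer(purchases) |>  # one row per purchase
  unnest_wider(purchases)      # one column per purchase field

sales_tidy
```

The result has one row per purchase, with the customer fields repeated on each row, which is the tidy shape most downstream summaries expect.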
  12. Data import: Shapeshifting
    ‣ Think beyond the CSV!
    ‣ Teach it to
      ‣ motivate discussion on data types
      ‣ create an opportunity to harvest web data
  13. Data types
    ‣ Discussion of data types and classes can feel dry without the right motivation
    ‣ Having to deal with unexpected data types after importing data is a very common task, hence a good motivation for this topic

      fav_food <- read_excel("data/favourite-food.xlsx")
      fav_food

      ## # A tibble: 5 x 6
      ##   `Student ID` `Full Name`  favourite.food  mealPlan  AGE   SES
      ##          <dbl> <chr>        <chr>           <chr>     <chr> <chr>
      ## 1            1 Sunil Huffm… Strawberry yog… Lunch on… 4     High
      ## 2            2 Barclay Lynn French fries    Lunch on… 5     Midd…
      ## 3            3 Jayendra Ly… N/A             Breakfas… 7     Low
      ## 4            4 Leon Rossini Anchovies       Lunch on… 99999 Midd…
      ## 5            5 Chidiegwu D… Pizza           Breakfas… five  High
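A minimal sketch of the cleanup such an import tends to require, assuming dplyr is installed. The tibble below is a hypothetical stand-in with simplified column names and made-up food values, not the spreadsheet itself, and the recoding rules (treating "N/A" and 99999 as missing, reading "five" as 5) are assumptions for illustration rather than anything stated on the slide.

```r
library(dplyr)

# Hypothetical stand-in for the imported data: age arrives as character
# because one cell contains the word "five".
fav_food <- tibble(
  student_id = 1:5,
  favourite_food = c("Ice cream", "French fries", "N/A", "Anchovies", "Pizza"),
  age = c("4", "5", "7", "99999", "five")
)

fav_food_clean <- fav_food |>
  mutate(
    # Treat the literal string "N/A" as a missing value.
    favourite_food = na_if(favourite_food, "N/A"),
    # Recode the spelled-out number, then convert the column to numeric.
    age = if_else(age == "five", "5", age),
    age = as.numeric(age),
    # Assumption: 99999 is a sentinel for "unknown", not a real age.
    age = na_if(age, 99999)
  )

fav_food_clean
```

Each step maps to a question worth asking students: why is `age` a character column, which values are really missing, and who decides what a sentinel value means?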
  14. Web data
    ‣ The web is an incredible source for data, but turning it into a structured format (without copy-paste or manual entry) requires learning web scraping skills
    ‣ Beyond screen scraping, it’s useful to introduce the idea of getting data from an API at some point in the curriculum
    ‣ Both of these offer an opportunity for discussion on ethics and data privacy

    Dogucu, M. & Çetinkaya-Rundel, M. “Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities.” Journal of Statistics Education (2021): 1-11. https://doi.org/10.1080/10691898.2020.1787116.
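As a rough illustration of turning web content into a structured format, the sketch below uses the rvest package (an assumption; the slide does not name a tool). The HTML is a hypothetical snippet inlined with `minimal_html()` so the example runs offline; with a real site, `read_html()` on a URL would take its place, after checking that scraping is permitted.

```r
library(rvest)

# Hypothetical page content, inlined so the example needs no network access.
page <- minimal_html('
  <table>
    <tr><th>species</th><th>n</th></tr>
    <tr><td>Adelie</td><td>3</td></tr>
    <tr><td>Gentoo</td><td>5</td></tr>
  </table>
')

species_counts <- page |>
  html_element("table") |>  # select the first <table> via a CSS selector
  html_table()              # convert it to a tibble, guessing column types

species_counts
```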
  15. Predictive modeling: Clairvoyance
    ‣ Don’t just leave it to the machine learning course, introduce it along with explanatory / inferential models
    ‣ Teach it to
      ‣ introduce the idea of overfitting and mitigating it by splitting the data into testing and training sets
      ‣ allow for creativity with feature engineering
      ‣ discuss bias-variance tradeoff early on
      ‣ enable those open-ended projects for classifying binary outcome variables
  16. Predictive (tidy) models
    ‣ The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles
    ‣ Tidymodels pipelines start with an initial_split() into training and testing data and the tooling provides guard rails to prevent prediction on the testing data at the model and feature development phase
    ‣ Functions designed specifically for feature engineering motivate creative thinking during model development
    ‣ eCOTS 2022 breakout session, Modernizing the undergraduate regression analysis course: bit.ly/modern-regression
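The split-first workflow described above can be sketched as follows, assuming the rsample package (part of tidymodels) is installed. The built-in `mtcars` data and the base R `lm()` fit are stand-ins chosen to keep the example small; a full tidymodels pipeline would typically use parsnip and recipes instead.

```r
library(rsample)

set.seed(2022)

# Built-in mtcars stands in for course data (an assumption for illustration).
cars_split <- initial_split(mtcars, prop = 0.75)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Develop the model on the training set only.
fit <- lm(mpg ~ wt + hp, data = cars_train)

# Touch the testing set once, at the end, to estimate out-of-sample error.
test_pred <- predict(fit, newdata = cars_test)
rmse <- sqrt(mean((cars_test$mpg - test_pred)^2))
rmse
```

Keeping `cars_test` out of every step before the final evaluation is the habit the guard rails are meant to reinforce.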
  17. Version control: Time travel
    ‣ Teach it as early as possible and as needed, whenever you can make time for it, and integrate it throughout the curriculum
    ‣ Teach it to
      ‣ build good habits when the stakes are low
      ‣ motivate not just reproducibility but also collaboration
      ‣ instill practice of open sharing and start curating an online portfolio

    Beckman, Matthew D., et al. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29, no. sup1 (2021): S132-S144. https://doi.org/10.1080/10691898.2020.1848485.
  18. Empathy: Empathy
    ‣ Strive to introduce the story with the dataset
    ‣ Couple each dataset with a datasheet:
      ‣ For what purpose was the dataset created?
      ‣ Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)?
      ‣ Is it possible to identify individuals (that is, one or more natural persons), either directly or indirectly (that is, in combination with other data) from the dataset?
      ‣ Were the individuals in question notified about the data collection?
      ‣ …
    ‣ Use this practice to motivate discussion around wider data science ethics issues like algorithmic bias, privacy and re-identification, etc.

    Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92. DOI: http://dx.doi.org/10.1145/3458723.
  19. Accessibility
    ‣ You could teach a whole course or even a whole curriculum on accessibility…
    ‣ At a minimum, your students shouldn’t graduate without ever thinking / learning about it!
    ‣ Tooling exists to accomplish the bare minimum and that can go a long way in raising the next generation of data scientists who consider accessibility in their work
  20.
    ```{r}
    #| fig-cap: Body mass vs. bill length of penguins.

    ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                         color = species)) +
      geom_point()
    ```
  21.
    ```{r}
    #| fig-cap: Body mass vs. bill length of penguins.
    #| fig-alt: >
    #|   A scatterplot showing positive, relatively strong
    #|   relationship between body mass and bill length. The
    #|   points representing each of the three species are
    #|   clustered with Adelies with lowest typical bill length
    #|   and body mass, Chinstraps with higher typical bill
    #|   length and similar body mass, and Gentoos with typical
    #|   bill length between the other two but higher typical
    #|   body mass.

    ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                         color = species, shape = species)) +
      geom_point() +
      colorblindr::scale_color_OkabeIto()
    ```
  22. Learning on one’s own: Self sufficiency
    ‣ Share with students
      ‣ how you learn, and be specific: books, blog posts, Twitter accounts you follow, etc.
      ‣ how you choose what to learn
    ‣ Demonstrate how you solve problems, e.g., via live coding
    ‣ Encourage them to take an active part in the community
  23. Call to action
    In the chat, share an open educational resource you’ve created or reused. Please don’t be shy!

    Image by DONT SELL MY ARTWORK AS IS from Pixabay.
  24. Sharing with others: Knowledge projection
    ‣ Open-source your course materials
    ‣ Write about your experiences
      ‣ Blog posts
      ‣ Journal articles: not just for empirical studies but also reflective essays, datasets and stories, brief communications, etc.
  25. Making time to keep current: Temporal stasis
    ‣ Probably impossible, but you can try 😜
    ‣ A few things I’m learning / playing with nowadays to keep current:
      ‣ Transitioning to the native R pipe |>
        ‣ Recommended reading: Blog post by Isabella Velásquez
      ‣ Quarto: Open-source scientific and technical multi-lingual publishing system, aka next generation R Markdown that supports multiple programming languages
        ‣ Recommended reading: Get Started tutorials at quarto.org
      ‣ Databases / SQL 😬
      ‣ The wealth of resources from eCOTS 2022, particularly those on Diversity, Inclusion and Social Justice in data science!
  26. ‣ You don’t have to learn everything / you don’t have to teach everything
    ‣ Incremental changes over time are more than fine!
    ‣ New “things” (features, packages, tools) being discussed / hyped in the community can be a good indication of their importance, but that doesn’t mean you have to adopt them right away

    NORMALIZE BEING HUMAN ❤
  27. References
    ‣ Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92. DOI: http://dx.doi.org/10.1145/3458723.
    ‣ Çetinkaya-Rundel et al. “An educator’s perspective of the tidyverse.” Technology Innovations in Statistics Education (2022): 14(1). http://dx.doi.org/10.5070/T514154352.
    ‣ Dogucu, M. & Çetinkaya-Rundel, M. “Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities.” Journal of Statistics Education (2021): 1-11. https://doi.org/10.1080/10691898.2020.1787116.
    ‣ Beckman, Matthew D., et al. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29, no. sup1 (2021): S132-S144. https://doi.org/10.1080/10691898.2020.1848485.