My toolbox is full of shiny tools, do I also need super powers?

Mine Çetinkaya-Rundel
Duke University / RStudio
🔗 bit.ly/superpowers-ecots22

Super power > super > power superhero data science

graphic vision > data > visualization

Data visualization
Graphic vision
‣ Start, literally, on day one and continue improving throughout the curriculum
‣ Teach it to
  ‣ motivate inquiry and exploration
  ‣ support multivariate thinking
  ‣ effectively communicate results and findings
  ‣ advance programming skills
  ‣ aid inferential decisions

Data visualization on day one
‣ Ready-to-go computing environment
‣ Reproducible document with code to produce the visualization
‣ Code that’s obviously straightforward to modify for customizing the plot

unvotes |>
  filter(country %in% c("United Kingdom", "United States", "France")) |>
  ggplot(…)

Data visualization later in curriculum
‣ “Recreate” to advance programming skills
‣ “Recreate, then improve” to advance programming and communication skills
‣ “Go beyond the basics” exercises to introduce commonly used visuals in scientific communication

Data visualization for inference
‣ Take visualizations beyond EDA
‣ Use them to assess significance, as an alternative method for inference
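
One way to read the bullet above is a randomization (permutation) test whose result is judged from a plot. The sketch below is an illustration under assumptions, not what the slide showed: the two groups are simulated, the names `group`, `value`, and `null_diffs` are hypothetical, and base R graphics stand in for whatever plotting tool the course uses.

```r
# Simulated two-group data (hypothetical; stands in for a real dataset)
set.seed(2022)
group <- rep(c("a", "b"), each = 20)
value <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))

# Observed difference in group means
observed <- mean(value[group == "b"]) - mean(value[group == "a"])

# Null distribution: shuffle the group labels and recompute the difference
null_diffs <- replicate(1000, {
  shuffled <- sample(group)
  mean(value[shuffled == "b"]) - mean(value[shuffled == "a"])
})

# Visual assessment: where does the observed difference fall?
hist(null_diffs,
     main = "Null distribution of the difference in means",
     xlab = "Difference in means (b - a)")
abline(v = observed, col = "red", lwd = 2)

# Two-sided p-value as the share of shuffles at least as extreme
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```

Students can reason about significance from the histogram first, then connect it to the computed proportion.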

shapeshifting > data > wrangling

Data wrangling
Shapeshifting
‣ Start with data summarizing, then move on to data reshaping and tidying
‣ Teach it to
  ‣ motivate inquiry and exploration
  ‣ join data from multiple sources
  ‣ preprocess data for statistical analysis
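
As a small illustration of joining data from multiple sources, here is a sketch with dplyr. The two tibbles, their column names, and most of their values are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical customer table
customers <- tibble(
  customer_id = c(1, 2, 3),
  first_name  = c("Kimberly", "Avery", "Jordan")
)

# Hypothetical purchase table; customer 4 has no match in `customers`
purchases <- tibble(
  customer_id = c(1, 1, 2, 4),
  theme       = c("DC Comics Super Heroes", "City", "City", "Friends"),
  quantity    = c(1, 2, 1, 3)
)

# Keep every purchase and attach customer details where they exist
joined <- purchases |>
  left_join(customers, by = "customer_id")

# Summarize after joining: total quantity per customer
totals <- joined |>
  group_by(customer_id, first_name) |>
  summarize(total_quantity = sum(quantity), .groups = "drop")

totals
```

The unmatched purchase shows up with a missing `first_name`, which is a natural prompt for discussing the different join types.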

Data wrangling for summarization
‣ Start with the basics as early as possible

penguins |>
  count(island, species)

# A tibble: 5 × 3
  island    species       n
  <fct>     <fct>     <int>
1 Biscoe    Adelie       44
2 Biscoe    Gentoo      124
3 Dream     Adelie       56
4 Dream     Chinstrap    68
5 Torgersen Adelie       52

Data wrangling for summarization
‣ Start with the basics as early as possible
‣ Wrangle further for better presentation

penguins |>
  count(island, species) |>
  pivot_wider(names_from = species, values_from = n, values_fill = 0)

# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <int>  <int>     <int>
1 Biscoe        44    124         0
2 Dream         56      0        68
3 Torgersen     52      0         0

Data wrangling for data tidying
‣ Introduce more advanced data wrangling tools for joining multiple datasets into a single tidy dataset
‣ Reshape data that comes in non-tidy format into a tidy format

## [
##   {
##     "gender": ["Female"],
##     "first_name": ["Kimberly"],
##     "last_name": ["Beckstead"],
##     "age": [24],
##     "phone_number": ["216-555-2549"],
##     "purchases": [
##       {
##         "SetID": [24701],
##         "Number": ["76062"],
##         "Theme": ["DC Comics Super Heroes"],
##         "Subtheme": ["Mighty Micros"],
##         "Year": [2016],
##         "Name": ["Robin vs. Bane"],
##         "Pieces": [77],
##         "USPrice": [9.99],
##         "ImageURL": ["http://images.brickset.com/sets/images/76062-1.jpg"],
##         "Quantity": [1]
##       }
##     ]
##   }
## ]

telekinesis > data > import

Data import
Telekinesis
‣ Think beyond the CSV!
‣ Teach it to
  ‣ motivate discussion on data types
  ‣ create an opportunity to harvest web data

Data types
‣ Discussion of data types and classes can feel dry without the right motivation
‣ Having to deal with unexpected data types after importing data is a very common task, hence a good motivation for this topic

fav_food <- read_excel("data/favourite-food.xlsx")
fav_food

## # A tibble: 5 x 6
##   `Student ID` `Full Name`  favourite.food  mealPlan  AGE   SES
##          <dbl> <chr>        <chr>           <chr>     <chr> <chr>
## 1            1 Sunil Huffm… Strawberry yog… Lunch on… 4     High
## 2            2 Barclay Lynn French fries    Lunch on… 5     Midd…
## 3            3 Jayendra Ly… N/A             Breakfas… 7     Low
## 4            4 Leon Rossini Anchovies       Lunch on… 99999 Midd…
## 5            5 Chidiegwu D… Pizza           Breakfas… five  High
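
A possible follow-up exercise is to repair the imported types. The sketch below rebuilds two of the columns by hand rather than reading the Excel file, so it runs on its own. The truncated value is copied as printed, and treating 99999 as a missing-value code is an assumption made for illustration, not something the slide states.

```r
library(dplyr)

# Hand-built stand-in for two columns of the imported data
fav_food <- tibble(
  favourite.food = c("Strawberry yog…", "French fries", "N/A", "Anchovies", "Pizza"),
  AGE            = c("4", "5", "7", "99999", "five")
)

fav_food_clean <- fav_food |>
  mutate(
    # Turn the text "N/A" into a real missing value
    favourite.food = na_if(favourite.food, "N/A"),
    # Spelled-out number to digit, then character to numeric
    AGE = if_else(AGE == "five", "5", AGE),
    AGE = as.numeric(AGE),
    # Assumption: 99999 is a placeholder for "unknown"
    AGE = na_if(AGE, 99999)
  )

fav_food_clean
```

Each step maps to a question students can answer by looking at the printed tibble: why is `AGE` a character column, and which values caused that?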

Web data
‣ The web is an incredible source for data, but turning it into a structured format (without copy-paste or manual entry) requires learning web scraping skills
‣ Beyond screen scraping, it’s useful to introduce the idea of getting data from an API at some point in the curriculum
‣ Both of these offer an opportunity for discussion on ethics and data privacy

Dogucu, M. & Çetinkaya-Rundel, M. “Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities.” Journal of Statistics Education (2021): 1-11. https://doi.org/10.1080/10691898.2020.1787116.
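
To show the mechanics of turning a page into structured data without touching a live site, the sketch below parses a small inline HTML snippet with rvest. The HTML is made up for this example; with a real page, `read_html()` on a URL would replace `minimal_html()`, and the site's terms of use would need checking first.

```r
library(rvest)

# Made-up HTML standing in for a downloaded web page
page <- minimal_html('
  <h1>Penguin counts</h1>
  <table>
    <tr><th>island</th><th>n</th></tr>
    <tr><td>Biscoe</td><td>168</td></tr>
    <tr><td>Dream</td><td>124</td></tr>
    <tr><td>Torgersen</td><td>52</td></tr>
  </table>
')

# Pull out a single element by CSS selector
title <- page |>
  html_element("h1") |>
  html_text2()

# Convert the HTML table into a tibble
counts <- page |>
  html_element("table") |>
  html_table()

title
counts
```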

clairvoyance > predictive > modeling

Predictive modeling
Clairvoyance
‣ Don’t just leave it to the machine learning course, introduce it along with explanatory / inferential models
‣ Teach it to
  ‣ introduce the idea of overfitting and mitigating it with splitting the data into testing and training sets
  ‣ allow for creativity with feature engineering
  ‣ discuss the bias-variance tradeoff early on
  ‣ enable those open-ended projects for classifying binary outcome variables

Predictive (tidy) models
‣ The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles
‣ Tidymodels pipelines start with an initial_split() into training and testing data, and the tooling provides guard rails to prevent prediction on the testing data at the model and feature development phase
‣ Functions designed specifically for feature engineering motivate creative thinking during model development
‣ eCOTS 2022 breakout session: Modernizing the undergraduate regression analysis course, bit.ly/modern-regression
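
The workflow described above can be sketched as follows. This is a minimal example under assumptions: the data are simulated, the names `sim`, `x1`, `x2`, and `y` are hypothetical, and logistic regression with a single normalization step is just one simple choice of model and feature engineering.

```r
library(tidymodels)

# Simulated binary-outcome data (hypothetical)
set.seed(1234)
n <- 200
sim <- tibble(x1 = rnorm(n), x2 = rnorm(n)) |>
  mutate(y = factor(if_else(x1 + x2 + rnorm(n) > 0, "yes", "no")))

# Split first; everything up to the final evaluation uses training data only
sim_split <- initial_split(sim, prop = 0.75)
train <- training(sim_split)
test  <- testing(sim_split)

# Feature engineering lives in a recipe estimated from the training data
rec <- recipe(y ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Bundle the recipe and the model, then fit on the training data
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg())

model_fit <- fit(wf, data = train)

# Evaluate once on the held-out testing data
preds <- augment(model_fit, new_data = test)
acc <- accuracy(preds, truth = y, estimate = .pred_class)
acc
```

Swapping in other `step_*()` functions in the recipe is where students get to be creative without changing the rest of the pipeline.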

time travel > version > control

Version control
Time travel
‣ Teach it as early as possible and as needed, once you can make time for it in your curriculum, and integrate it throughout the curriculum
‣ Teach it to
  ‣ build good habits when the stakes are low
  ‣ motivate not just reproducibility but also collaboration
  ‣ instill the practice of open sharing and start curating an online portfolio

Beckman, Matthew D., et al. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29, no. sup1 (2021): S132-S144. https://doi.org/10.1080/10691898.2020.1848485.

Reproducibility and collaboration

Web hosting to online portfolio

empathy > empathy

Empathy
‣ Strive to introduce the story with the dataset
‣ Couple each dataset with a datasheet:
  ‣ For what purpose was the dataset created?
  ‣ Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)?
  ‣ Is it possible to identify individuals (that is, one or more natural persons), either directly or indirectly (that is, in combination with other data) from the dataset?
  ‣ Were the individuals in question notified about the data collection?
  ‣ …
‣ Use this practice to motivate discussion around wider data science ethics issues like algorithmic bias, privacy and re-identification, etc.

Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92. DOI: http://dx.doi.org/10.1145/3458723.

Accessibility
‣ You could teach a whole course or even a whole curriculum on accessibility…
‣ At a minimum, your students shouldn’t graduate without ever thinking / learning about it!
‣ Tooling exists to accomplish the bare minimum and that can go a long way in raising the next generation of data scientists who consider accessibility in their work

```{r}
#| fig-cap: Body mass vs. bill length of penguins.
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point()
```

```{r}
#| fig-cap: Body mass vs. bill length of penguins.
#| fig-alt: >
#|   A scatterplot showing positive, relatively strong
#|   relationship between body mass and bill length. The
#|   points representing each of the three species are
#|   clustered with Adelies with lowest typical bill length
#|   and body mass, Chinstraps with higher typical bill
#|   length and similar body mass, and Gentoos with typical
#|   bill length between the other two but higher typical
#|   body mass.
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                     color = species, shape = species)) +
  geom_point() +
  colorblindr::scale_color_OkabeIto()
```

self-sufficiency > learning > on one’s own

Learning on one’s own
Self-sufficiency
‣ Share with students
  ‣ how you learn, and be specific: books, blog posts, Twitter accounts you follow, etc.
  ‣ how you choose what to learn
‣ Demonstrate how you solve problems, e.g., via live coding
‣ Encourage them to take active part in the community

And a few superpowers for the educators…

power mimicry > leveraging > open resources

Leveraging open resources
Power mimicry
‣ sta210-s22.github.io/website: Stat 2 / Regression
‣ vizdata.org: Data visualization
‣ datasciencebox.org: Introductory data science

Call to action
In the chat, share an open educational resource you’ve created or reused. Please don’t be shy!
Image by DONT SELL MY ARTWORK AS IS from Pixabay.

knowledge projection > sharing knowledge > with others

Sharing with others
Knowledge projection
‣ Open-source your course materials
‣ Write about your experiences
  ‣ Blog posts
  ‣ Journal articles: not just for empirical studies but also reflective essays, datasets and stories, brief communications, etc.

Temporal stasis > making time > to keep current

Making time to keep current
Temporal stasis
‣ Probably impossible, but you can try 😜
‣ A few things I’m learning / playing with nowadays to keep current:
  ‣ Transitioning to the native R pipe |>
    ‣ Recommended reading: Blog post by Isabella Velásquez
  ‣ Quarto: Open-source scientific and technical multi-lingual publishing system, aka next generation R Markdown that supports multiple programming languages
    ‣ Recommended reading: Get Started tutorials at quarto.org
  ‣ Databases / SQL 😬
  ‣ The wealth of resources from eCOTS 2022, particularly those on Diversity, Inclusion and Social Justice in data science!
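
For anyone making the same transition, here is a short sketch of the native pipe in base R (available from R 4.1). The values are arbitrary and chosen only so the result is easy to check by hand.

```r
# The left-hand side becomes the first argument of the call on the right
total <- c(1, 4, 9) |> sqrt() |> sum()
total
# → 6

# Equivalent nested call, read inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Anonymous functions need parentheses and a trailing call
shares <- c(1, 4, 9) |> (\(x) x / sum(x))()
shares
```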

NORMALIZE BEING HUMAN ❤
‣ You don’t have to learn everything / you don’t have to teach everything
‣ Incremental changes over time are more than fine!
‣ New “things” (features, packages, tools) being discussed / hyped in the community can be a good indication of their importance, but that doesn’t mean you have to adopt them right away

thank you! 🔗 bit.ly/superpowers-ecots22

References
‣ Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92. DOI: http://dx.doi.org/10.1145/3458723.
‣ Çetinkaya-Rundel et al. “An educator’s perspective of the tidyverse.” Technology Innovations in Statistics Education (2022): 14(1). http://dx.doi.org/10.5070/T514154352.
‣ Dogucu, M. & Çetinkaya-Rundel, M. “Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities.” Journal of Statistics Education (2021): 1-11. https://doi.org/10.1080/10691898.2020.1787116.
‣ Beckman, Matthew D., et al. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29, no. sup1 (2021): S132-S144. https://doi.org/10.1080/10691898.2020.1848485.
