class: center, middle, inverse, title-slide # Online Data and Open Source Tools ## Analyzing Educational Internet Data Using R ### Bret Staudt Willet, Spencer Greenhalgh, & Joshua Rosenberg ### October 23, 2019 --- class: inverse, center, middle **Access our slide deck here:** https://bretsw.github.io/aect19-workshop/ **Chat with Josh here:** https://tennessee.zoom.us/my/jmrosenberg **Follow us on Twitter:** [@bretsw](https://twitter.com/bretsw) [@spgreenhalgh](https://twitter.com/spgreenhalgh) [@jrosenberg6432](https://twitter.com/jrosenberg6432) --- # Who We Are * **Bret Staudt Willet** * PhD Candidate, Ed Psych & Ed Tech, Michigan State University * *Primary area of interest:* social media and K-20 educators' professional learning * **Dr. Spencer Greenhalgh** * Assistant Professor, Information Communication Technology, University of Kentucky * *Primary area of interest:* implications of digital contexts for teaching, learning, and other meaningful practices * **Dr. Joshua Rosenberg** * Assistant Professor, STEM Education, University of Tennessee, Knoxville * *Primary area of interest:* data science in education + data science education <img src="img/koehler-diaspora-circa-2017.jpg" width="360px" style="display: block; margin: auto;" /> --- # How We Teach <img src="img/tech_support_cheat_sheet.png" width="480px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Agenda --- # Morning Agenda 1. Instructor introductions 1. Pre-survey 1. Attendee introductions 1. Why learn R? 1. Download the repository 1. Getting started with R Studio 1. Demo Doc Part 1: Processing data 1. Loading data 1. Demo Doc Part 2: Visualizing data 1. Demo Doc Part 3: Modeling data 1. Questions 1. Lunch break --- # Afternoon Agenda 1. Considerations for using social media data in LD&T research 1. Conducting ethical research 1. Framing the research 1. Organizing the research process 1. Collecting data 1. Analyzing data 1. Disseminating research 1. Learning and doing more with R 1. Self-guided exploration (Option A: Text analysis; Option B: Social network analysis) 1. Questions 1. Post-survey + (optional) upload your .Rmd file for the day --- class: inverse, center, top # Pre-survey http://bit.ly/r-at-aect-pre <img src="img/survey.jpg" width="720px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Why learn R? --- # Why learn R? * It is capable of carrying out basic and complex statistical analyses * It is able to work with data small (*n* = 30!) 
and large (*n* = 100,000+) efficiently * It is a programming language and so is quite flexible * There is a great, inclusive community of users and developers (and teachers) * It is cross-platform, open-source, and freely available --- class: inverse, center, middle # Download the repository --- # Download the repository You can access all of the workshop materials at: https://github.com/bretsw/aect19-workshop/ --- # Download the repository <img src="img/download_repo_screenshot.png" width="720px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Getting started with R Studio --- # Installations To download R: - Visit this page to download R: https://cran.r-project.org/ - Find your operating system (Mac, Windows, or Linux) - Find the 'latest release' on the page for your operating system, then download and install the application To download R Studio: - Visit this page to download R Studio: https://www.rstudio.com/products/rstudio/download/ - Find your operating system (Mac, Windows, or Linux) - Find the 'latest release' on the page for your operating system, then download and install the application ## Check that it worked Open R Studio. Find the console window and type in `2 + 2`. If the answer you can guess is returned (hint: it's 4!), then R Studio *and* R both work. --- # Working With Projects Before proceeding, we're going to take a few steps to set ourselves up to make the analysis easier; namely, through the use of Projects, an R Studio-specific organizational tool. We already have a project set up. Navigate to the downloaded repo files and open `aect-workshop-2019.Rproj`. (To create a project in R Studio, you can navigate to "File" and then "New Project". Then, click "New Directory".) Names should be short and easy to remember but also informative. --- # Looking at RStudio <img src="img/RStudio.png" width="720px" style="display: block; margin: auto;" /> --- # Packages "Packages" are shareable collections of R code that provide functions (i.e., commands to perform specific tasks), data, and documentation. Packages increase the functionality of R by improving and expanding on base R (basic R functions). ### Installing and Loading Packages To download a package, you must call `install.packages()`: ```r install.packages("dplyr") ``` You can also navigate to the Packages pane and then click "Install", which works the same as the line of code above. In other words, you can install a package either with code or through the R Studio interface. Usually, writing code is a bit quicker, but using the interface can be very useful and complementary to using code. --- # Using packages *After* the package is installed, it must be loaded into your R Studio session using `library()`: ```r library(dplyr) ``` We only have to install a package once, but to use it, we have to load it each time we start a new R session. > A package is like a book, a library is like a library; you use library() to check a package out of the library > (Hadley Wickham, Chief Scientist, R Studio) --- # Configuring R Studio By default, R Studio saves all of the objects in your environment. In general, this is not ideal, because it means that you may have taken steps interactively that are not documented in your code. <img src="img/save-workspace-reminder.jpg" width="425px" style="display: block; margin: auto;" /> --- # The tidyverse The tidyverse is a set of packages for data manipulation, exploration, and visualization using the design philosophy of 'tidy' data.
Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. The packages contained in the Tidyverse provide useful functions that augment base R functionality. You can install and load the complete tidyverse with: ```r install.packages("tidyverse") ``` ```r library(tidyverse) ``` --- class: inverse, center, middle # Demo doc (part 0): Try it out! *Note. Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!* *You can find the demo doc in RStudio or in your downloaded repository files—it is titled `demo-doc.Rmd`* *The doc is also [here](https://github.com/bretsw/aect19-workshop/blob/master/demo-doc.Rmd) in the repository* --- # Looking at Demo Doc <img src="img/RStudio.png" width="720px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Demo doc (part 1): Try it out! *Note. Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!* *You can find the demo doc in RStudio or in your downloaded repository files—it is titled `demo-doc.Rmd`* *The doc is also [here](https://github.com/bretsw/aect19-workshop/blob/master/demo-doc.Rmd) in the repository* --- class: inverse, center, middle # Processing Data --- # The pipe: %>% First, let's load the data we downloaded in the last step. ```r student_responses <- read_csv("data/student-responses-data.csv") ``` We're also going to introduce a powerful, unusual *operator* in R, the pipe. The pipe is this symbol: `%>%`. It lets you *compose* functions. It does this by passing the output of one function to the next. `select()` reduces the number of *columns* of a dataset. ```r student_mot_vars <- # save object student_mot_vars by... student_responses %>% # using dataframe student_responses select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables student_mot_vars ``` Also, check out the helper functions: `contains()`, `starts_with()`, and `ends_with()`, i.e.: ```r student_responses %>% select(contains("sci")) ``` --- # Saving the results Note that we saved the output from the `select()` function to `student_mot_vars` but we could also save it back to `student_responses`, which would simply overwrite the original data frame (the following code is not run here): ```r student_responses <- # save object student_responses by... student_responses %>% # using dataframe student_responses select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables ``` --- # Renaming We can also rename the variables at the same time we select them. Let's save the results to `student_mot_vars`. ```r student_mot_vars <- student_responses %>% select(student_efficacy = SCIEEFF, student_joy = JOYSCIE, student_broad_interest = INTBRSCI, student_epistemic_beliefs = EPIST, student_instrumental_motivation = INSTSCIE ) student_mot_vars %>% head(3) ``` ``` ## # A tibble: 3 x 5 ## student_efficacy student_joy student_broad_i… student_epistem… ## <dbl> <dbl> <dbl> <dbl> ## 1 3.28 2.16 1.73 2.16 ## 2 -0.668 0.509 0.491 -0.501 ## 3 -1.62 0.0992 -0.638 -0.193 ## # … with 1 more variable: student_instrumental_motivation <dbl> ``` --- # Creating a new variable `mutate()` takes instructions to create a new variable. What goes *before* the equals sign is the name of the new variable. What goes after are the instructions.
```r student_responses %>% select(student_efficacy = SCIEEFF, student_joy = JOYSCIE, student_broad_interest = INTBRSCI, student_epistemic_beliefs = EPIST, student_instrumental_motivation = INSTSCIE) %>% mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>% head(3) ``` ``` ## # A tibble: 3 x 6 ## student_efficacy student_joy student_broad_i… student_epistem… ## <dbl> <dbl> <dbl> <dbl> ## 1 3.28 2.16 1.73 2.16 ## 2 -0.668 0.509 0.491 -0.501 ## 3 -1.62 0.0992 -0.638 -0.193 ## # … with 2 more variables: student_instrumental_motivation <dbl>, ## # student_joy_interest <dbl> ``` --- # Filtering the data set `filter()` takes *logical statements* (statements that can evaluate to true or false) to select a number of *rows* from a dataset. ```r student_responses %>% select(student_efficacy = SCIEEFF, student_joy = JOYSCIE, student_broad_interest = INTBRSCI, student_epistemic_beliefs = EPIST, student_instrumental_motivation = INSTSCIE) %>% mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>% filter(student_joy_interest > mean(student_joy_interest, na.rm = TRUE)) %>% head(3) ``` ``` ## # A tibble: 3 x 6 ## student_efficacy student_joy student_broad_i… student_epistem… ## <dbl> <dbl> <dbl> <dbl> ## 1 3.28 2.16 1.73 2.16 ## 2 -0.668 0.509 0.491 -0.501 ## 3 -0.0378 1.67 0.332 2.16 ## # … with 2 more variables: student_instrumental_motivation <dbl>, ## # student_joy_interest <dbl> ``` --- # Grouping and summarizing It is often useful to aggregate a data set. The `group_by()` and `summarize()` functions are especially helpful for this purpose. Let's find the mean and standard deviation (and counts) of student broad interest by teacher. ```r student_responses %>% select(student_broad_interest = INTBRSCI, teacher_id = SCHID) %>% group_by(teacher_id) %>% summarize(student_broad_interest_mean = mean(student_broad_interest, na.rm = TRUE), student_broad_interest_sd = sd(student_broad_interest, na.rm = TRUE), n = n()) %>% head(3) ``` ``` ## # A tibble: 3 x 4 ## teacher_id student_broad_interest_mean student_broad_interest_sd n ## <chr> <dbl> <dbl> <int> ## 1 00001 -0.157 0.846 36 ## 2 00003 -0.136 1.08 31 ## 3 00004 -0.0430 1.26 39 ``` --- # Arranging Finally, let's arrange the table by the number of students in each class. `arrange()` sorts in order from the smallest to largest values. `desc()` tells `arrange()` to sort in descending order. ```r student_responses %>% select(student_broad_interest = INTBRSCI, teacher_id = SCHID) %>% group_by(teacher_id) %>% summarize(student_broad_interest_mean = mean(student_broad_interest, na.rm = TRUE), student_broad_interest_sd = sd(student_broad_interest, na.rm = TRUE), n = n()) %>% arrange(desc(n)) %>% head(3) ``` ``` ## # A tibble: 3 x 4 ## teacher_id student_broad_interest_mean student_broad_interest_sd n ## <chr> <dbl> <dbl> <int> ## 1 00107 0.204 0.786 42 ## 2 00117 0.204 0.893 42 ## 3 00131 0.0349 0.870 41 ``` --- # Counting Sometimes, we simply want to count the number of observations associated with each group. ```r student_responses %>% select(teacher_id = SCHID) %>% count(teacher_id) ``` We could then use this as the basis of *other* summary statistics. ```r student_responses %>% select(teacher_id = SCHID) %>% count(teacher_id) %>% summarize(n_mean = mean(n), n_sd = sd(n)) ``` ``` ## # A tibble: 1 x 2 ## n_mean n_sd ## <dbl> <dbl> ## 1 32.3 8.61 ``` On average, there are around 32 (*SD* = 8.61) students in each teacher's class.
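--- # Putting it all together These verbs can be chained with the pipe into a single, readable pipeline. Below is a minimal recap sketch (not run here) that reuses the same PISA variables introduced above; each step hands its result to the next, which is exactly what `%>%` is for.

```r
library(tidyverse)

student_responses <- read_csv("data/student-responses-data.csv")

student_responses %>% 
  select(student_joy = JOYSCIE,             # keep and rename just the columns we need
         student_broad_interest = INTBRSCI,
         teacher_id = SCHID) %>% 
  mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>% 
  filter(!is.na(student_joy_interest)) %>%  # drop rows missing the new variable
  group_by(teacher_id) %>%                  # aggregate by teacher
  summarize(joy_interest_mean = mean(student_joy_interest),
            n = n()) %>% 
  arrange(desc(n))                          # largest classes first
```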
--- class: inverse, center, middle # Loading data --- # Loading a CSV File Okay, we're ready to go. The easiest way to read a CSV file is with the function `read_csv()` from the package `readr`, which is contained within the Tidyverse. Again, we'll load (or have loaded, already) the tidyverse library. ```r library(tidyverse) # so tidyverse packages can be used for analysis ``` Just as a recap, here is a line that we ran earlier. ```r student_responses <- read_csv("data/student-responses-data.csv") ``` Since we loaded the data, we now want to look at it. We can pass its name to the function `glimpse()` to print some information on the dataset (this code is not run here). ```r glimpse(student_responses) ``` You can also print the name of the object. --- # Loading Excel and SAV files ## Loading Excel files ```r install.packages("readxl") ``` ```r library(readxl) ``` ```r my_data <- read_excel("path/to/file.xlsx") ``` ## Loading SAV files ```r install.packages("haven") ``` ```r library(haven) my_data <- read_sav("path/to/file.sav") ``` --- # Saving (Writing) Files Using our data frame `student_responses`, we can save it as a CSV (for example) with the following function. The first argument, `student_responses`, is the name of the object that you want to save. The second argument, `student-responses.csv`, is what you want to call the saved dataset. ```r write_csv(student_responses, "student-responses.csv") ``` That will save a CSV file entitled `student-responses.csv` in the working directory. If you want to save it to another directory, simply add the file path to the file, i.e., `path/to/student-responses.csv`. To save a file for SPSS, load the haven package and use `write_sav()`. There is not a function to save an Excel file, but you can save as a CSV and directly load it in Excel. --- class: inverse, center, middle # Demo doc (part 2): Try it out! *Note. Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!* *You can find the demo doc in RStudio or in your downloaded repository files—it is titled `demo-doc.Rmd`* *The doc is also [here](https://github.com/bretsw/aect19-workshop/blob/master/demo-doc.Rmd) in the repository* --- class: inverse, center, middle # Visualizing data --- # The grammar of graphics *ggplot2* is a tidyverse package for creating visualizations (or figures). Let's work with a smaller data set, for now. ```r student_mot_vars <- # save object student_mot_vars by...
student_responses %>% # using dataframe student_responses select(student_efficacy = SCIEEFF, # selecting variable SCIEEFF and renaming to student_efficiency student_joy = JOYSCIE, # selecting variable JOYSCIE and renaming to student_joy student_broad_interest = INTBRSCI, # selecting variable INTBRSCI and renaming to student_broad_interest student_epistemic_beliefs = EPIST, # selecting variable EPIST and renaming to student_epistemic_beliefs student_instrumental_motivation = INSTSCIE, # selecting variable INSTSCIE and renaming to student_instrumental_motivation teacher_id = SCHID ) ``` --- # Scatter plot Notice the three parts of all `ggplot2` graphs: a) the data (`student_mot_vars`), b) an `aes()`thetic mapping, and c) a `geom`: <img src="index_files/figure-html/unnamed-chunk-33-1.png" width="450px" style="display: block; margin: auto;" /> --- # Scatter plot with a regression line ```r ggplot(student_mot_vars, aes(x = student_efficacy, y = student_broad_interest)) + geom_point() + geom_smooth(method = "lm") ``` <img src="index_files/figure-html/unnamed-chunk-34-1.png" width="350px" style="display: block; margin: auto;" /> --- # Cleaning up ```r ggplot(student_mot_vars, aes(x = student_efficacy, y = student_broad_interest)) + geom_point() + geom_smooth(method = "lm") + theme_bw() + xlab("Student Efficacy") + ylab("Student Broad Interest") ``` <img src="index_files/figure-html/unnamed-chunk-35-1.png" width="350px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Demo doc (part 3): Try it out! * Note. Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!* *You can find the demo doc in RStudio or in your downloaded repository files—it is titled `demo-doc.Rmd`* *The doc is also [here](https://github.com/bretsw/aect19-workshop/blob/master/demo-doc.Rmd) in the repository* --- class: inverse, center, middle # Modeling data --- # Linear models The `lm()` function is a very helpful general purpose function for linear models. We'll use the `student_mot_vars` data frame we created earlier. ```r lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars) ``` ``` ## ## Call: ## lm(formula = student_epistemic_beliefs ~ student_broad_interest, ## data = student_mot_vars) ## ## Coefficients: ## (Intercept) student_broad_interest ## 0.2407 0.2401 ``` --- # Linear models Let's save the results back to an object, `m1`. ```r m1 <- lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars) ``` --- # Linear models We can then run a summary on the output. ```r summary(m1) ``` ``` ## ## Call: ## lm(formula = student_epistemic_beliefs ~ student_broad_interest, ## data = student_mot_vars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.6197 -0.5614 -0.1755 0.5917 2.5144 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.24066 0.01390 17.31 <2e-16 *** ## student_broad_interest 0.24015 0.01423 16.88 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.013 on 5323 degrees of freedom ## (387 observations deleted due to missingness) ## Multiple R-squared: 0.0508, Adjusted R-squared: 0.05062 ## F-statistic: 284.9 on 1 and 5323 DF, p-value: < 2.2e-16 ``` --- # Linear models You can then build up a more complex model, i.e.: ```r m2 <- lm(student_epistemic_beliefs ~ student_broad_interest + student_joy + student_broad_interest:student_joy, data = student_mot_vars) ``` You can then run `summary()` to view the results. 
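--- # Linear models Once a model object like `m2` is saved, a few base R helpers (beyond `summary()`) pull out specific pieces of the results. A minimal sketch, assuming `m2` has been fit as on the previous slide:

```r
summary(m2)   # full results: coefficients, standard errors, R-squared
coef(m2)      # just the estimated coefficients, as a named vector
confint(m2)   # 95% confidence intervals for those coefficients
```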
--- # Multi-level models ```r library(lme4) ``` ``` ## Warning: package 'lme4' was built under R version 3.5.2 ``` ``` ## Loading required package: Matrix ``` ``` ## ## Attaching package: 'Matrix' ``` ``` ## The following object is masked from 'package:tidyr': ## ## expand ``` ```r m2_mlm <- lmer(student_epistemic_beliefs ~ student_broad_interest + student_joy + student_broad_interest:student_joy + (1|teacher_id), data = student_mot_vars) m2_mlm ``` ``` ## Linear mixed model fit by REML ['lmerMod'] ## Formula: ## student_epistemic_beliefs ~ student_broad_interest + student_joy + ## student_broad_interest:student_joy + (1 | teacher_id) ## Data: student_mot_vars ## REML criterion at convergence: 14853.02 ## Random effects: ## Groups Name Std.Dev. ## teacher_id (Intercept) 0.1839 ## Residual 0.9648 ## Number of obs: 5315, groups: teacher_id, 176 ## Fixed Effects: ## (Intercept) student_broad_interest ## 0.16608 0.07491 ## student_joy student_broad_interest:student_joy ## 0.27409 0.01991 ``` --- class: inverse, center, middle # Afternoon --- # Afternoon Agenda 1. Considerations for using social media data in LD&T research 1. Conducting ethical research 1. Framing the research 1. Organizing the research process 1. Collecting data 1. Analyzing data 1. Disseminating research 1. Learning and doing more with R 1. Self-guided exploration (Option A: Text analysis; Option B: Social network analysis) 1. Questions 1. Post-survey + (optional) upload your .Rmd file for the day --- class: inverse, center, middle # Considerations for Using Social Media Data in Learning Design & Technology Research <img src="img/book.jpg" width="600px" style="display: block; margin: auto;" /> --- # Chapter overview Our purpose in this chapter is to suggest considerations that LDT professionals should make as they use social media data in their research. --- # Important notes -- * Our purpose is *not* to dismiss traditional research methods -- * There is considerable room for debate and disagreement around the proper way to conduct social media research -- * Our focus is not providing answers but raising questions -- * The issues and processes we raise are not linear; this is not a simple recipe to follow -- * We avoid the traditional distinction between *quantitative* and *qualitative* research, and instead suggest various other distinctions as we go --- class: inverse, center, middle # What's your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Conducting ethical research --- # Conducting ethical research * Often, IRBs do not flag social media work as human subjects research; however, we would suggest that it is a human phenomenon -- * Not a checklist, but a thorough consideration of research context and ethical principles is needed -- * Public vs. private -- * Harms and benefits -- * Vulnerability -- * Anonymity -- * Consent -- * Legal considerations --- class: inverse, center, middle # What ethical considerations should you examine in your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Framing the research --- # Framing the research -- * Paradigms and assumptions -- * Research design, methods, and modes of inquiry -- * Conceptual frameworks -- * Community, network, or space? -- * Phenomena and units of analysis --- class: inverse, center, middle # How will you frame your project? 
<img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Organizing the research process --- # Organizing the research process -- * Software -- * Storing data -- * Workflows -- * Documentation --- class: inverse, center, middle # What do you need to do to organize your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Collecting data --- # Collecting data -- * Obtrusive or unobtrusive -- * Unobtrusive: *digital traces* -- * Process -- * Application programming interfaces (APIs) * TAGS + rtweet * Web scraping * Accessing archives -- * Quantity: big vs. small --- class: inverse, center, middle # How will you collect data for your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Analyzing data --- # Analyzing data -- * Spam -- * Machine vs. human analysis -- * Networks --- class: inverse, center, middle # How will you analyze data for your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Disseminating research --- # Disseminating research -- * Sharing data -- * Sharing code -- * TAGS + rtweet = rTAGS? -- * Github (http://github.com) * Open Science Framework (http://osf.io) -- * Publishing and publicizing research --- class: inverse, center, middle # How will you disseminate your project? <img src="img/project-legos.jpg" width="600px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Conclusion --- # Conclusion -- * The ready availability of social media data provides many opportunities... -- * ...but also considerable challenges for LDT researchers. -- * So, get familiar with the broad considerations, -- * and don't skip asking the important questions. -- * Start and end with research ethics, and revisit throughout. --- class: inverse, center, middle # Self-guided exploration --- # How We Teach <img src="img/tech_support_cheat_sheet.png" width="480px" style="display: block; margin: auto;" /> --- # Introduction to text analysis -- There are all kinds of things you might wonder about the text of a document. For instance: -- * How many words are in a document? -- * Which words are most used most often in a text? -- * How many times is a specific word used? --- # Introduction to social network analysis -- <img src="img/csed-graph-forced-atlas8.png" width="480px" style="display: block; margin: auto;" /> *A network is a set of objects, and network analysis is interested in the relationships between them.* --- # Introduction to social network analysis <img src="img/csed-graph-forced-atlas8.png" width="480px" style="display: block; margin: auto;" /> *So, what are the objects? 
How are they related?* --- # Self-guided exploration -- ### Option A: Text analysis (Difficulty: Moderate) *The doc is [here](https://github.com/bretsw/aect19-workshop/blob/master/explore-on-your-own-1.Rmd) in the repository* *You can also find it in the `explore-on-your-own-1.Rmd` file* -- ### Option B: Social network analysis (Difficulty: Hard) *The doc is [here](https://github.com/bretsw/aect19-workshop/blob/master/explore-on-your-own-2.Rmd) in the repository* *You can also find it in the `explore-on-your-own-2.Rmd` file* --- # Acknowledgments Thank you to [Emily Bovee](https://github.com/emilybovee) for co-developing the workshops this workshop is adapted from: https://github.com/jrosen48/MTSU-workshop and https://github.com/jrosen48/MSU-workshop-2019 Parts of the content for this workshop are also adapted from: - The [*Data Science in Education Using R* book](https://github.com/data-edu/data-science-in-education) by Emily A. Bovee, Ryan A. Estrellado, Jesse Mostipak, Joshua M. Rosenberg, and Isabella C. Velásquez, to be published by Routledge in 2020 - An American Educational Research Association 2019 annual meeting [workshop on *Reproducible and transparent research with R*](https://github.com/ResearchTransparency/rr_aera19) by [Daniel Anderson](https://github.com/datalorax) and [Joshua Rosenberg](https://github.com/jrosen48/) Finally, parts of the content for this workshop are inspired by content associated with the [Data Science Specialization for UO COE](https://github.com/uo-datasci-specialization) (led by [Daniel Anderson](https://github.com/datalorax)) --- class: inverse, center, middle # Questions? Bret Staudt Willet: [staudtwi@msu.edu](mailto:staudtwi@msu.edu) Spencer Greenhalgh: [spencer.greenhalgh@uky.edu](mailto:spencer.greenhalgh@uky.edu) Joshua Rosenberg: [jmrosenberg@utk.edu](mailto:jmrosenberg@utk.edu) The GitHub repository for this workshop is: https://github.com/bretsw/aect19-workshop --- class: inverse, center, top # Post-survey http://bit.ly/r-at-eact-post (optional: upload your .Rmd file of work for the day) <img src="img/survey.jpg" width="720px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Questions? --- class: inverse, center, middle # Learning and doing more with R --- # Resources - [*R for Data Science*](https://r4ds.had.co.nz/) by Wickham and Grolemund (2017) - [Big magic with R: Creative learning beyond fear](https://speakerdeck.com/apreshill/big-magic-with-r-creative-learning-beyond-fear) by Hill (2017) - [Data science in education](https://github.com/data-edu/data-science-in-education) - [R Studio Community](https://community.rstudio.com/) - [Data Science in Education Using R](https://github.com/data-edu/data-science-in-education) --- # Courses - [\#r4ds](https://medium.com/@kierisi/r4ds-the-next-iteration-d51e0a1b0b82) (see a talk at rstudio::conf() [here](https://resources.rstudio.com/rstudio-conf-2019/r4ds-online-learning-community-improvements-to-self-taught-data-science-and-the-critical-need-for-diversity-equity-and-inclusion-in-data-science-education) by Maegan (2019)) - [Data science for social scientists](http://datascience.tntlab.org/) by Landers (2019) - [University of Oregon Data Science Specialization for the College of Education](https://github.com/uo-datasci-specialization) by Anderson (2019) --- # Useful packages With one exception, you can install these via `install.packages("<name-of-package>")`, e.g., `install.packages("tidylog")`.
- tidyverse: tools for data manipulation and processing - tidylog: companion to the tidyverse; gives you feedback - haven: import and export files for other statistical software - apaTables: create APA-style tables in MS Word format - devtools: to download packages only available on GitHub - psych: general psychological science-related analysis tools - skimr: quickly calculate summary statistics for a data frame --- # Useful packages With one exception, you can install these via `install.packages("<name-of-package>")`, e.g., `install.packages("tidylog")`. - lavaan: structural equation modeling - tidyLPA: latent profile analysis - lme4: multi-level modeling (HLM) - sjstats: convenient statistical functions - poLCA: latent class analysis - rmarkdown: reports (PDF, Word, HTML, etc.) - papaja: APA-style paper generation (the one exception: must install from [GitHub](https://github.com/crsh/papaja); see the sketch in the Reference Material slides) - blogdown: create a website/blog - xaringan: create presentations - MplusAutomation: run Mplus via R --- # People (and a few hashtags) to follow (on Twitter) - [Jesse Mostipak](https://twitter.com/kierisi) - [Alison Hill](https://twitter.com/apreshill) - [Hadley Wickham](https://twitter.com/hadleywickham) - [Daniel Anderson](https://twitter.com/datalorax_) - [Mara Averick](https://twitter.com/dataandme) - [Thomas Mock](https://twitter.com/thomas_mock) - [Angela Li](https://twitter.com/CivicAngela) - [R4DS community](https://twitter.com/R4DScommunity) - [#tidytuesday](https://twitter.com/hashtag/TidyTuesday?src=hash) - [#rstats](https://twitter.com/hashtag/rstats?src=hash) - [David Robinson](https://twitter.com/drob) - [Julia Silge](https://twitter.com/juliasilge) - [Gabriela de Queiroz](https://twitter.com/gdequeiroz) - [Emily Bovee](https://twitter.com/ebovee09) - [Joshua Rosenberg](https://twitter.com/jrosenberg6432) - [Spencer Greenhalgh](https://twitter.com/spgreenhalgh) - [Bret Staudt Willet](https://twitter.com/bretsw) --- class: inverse, center, top # Reference Material The following slides include material that we cut for time or simplicity, but which may be helpful for you later.
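--- # Installing a package from GitHub As noted on the Useful packages slides, papaja is the one exception: it is not on CRAN, so `install.packages()` cannot find it. Here is a minimal sketch using the devtools package (the `crsh/papaja` repository address comes from the GitHub link on that slide):

```r
install.packages("devtools")    # devtools itself is on CRAN

library(devtools)
install_github("crsh/papaja")   # "user/repository", as it appears on GitHub

library(papaja)                 # then load it like any other package
```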
--- class: inverse, center, middle # Computer science foundations --- # Data structures - Data Frame ```r data.frame(var1 = c("a", "b", "c"), var2 = c(4, 2, 5)) ``` ``` ## var1 var2 ## 1 a 4 ## 2 b 2 ## 3 c 5 ``` - Tibble ```r library(tidyverse) tibble(var1 = c("a", "b", "c"), var2 = c(4, 2, 5)) ``` ``` ## # A tibble: 3 x 2 ## var1 var2 ## <chr> <dbl> ## 1 a 4 ## 2 b 2 ## 3 c 5 ``` --- # Data structures ```r vec1 = c("a", "b", "c") vec1 ``` ``` ## [1] "a" "b" "c" ``` ```r vec2 = c(4, 2, 5) vec2 ``` ``` ## [1] 4 2 5 ``` ```r tibble(var1 = vec1, var2 = vec2) ``` ``` ## # A tibble: 3 x 2 ## var1 var2 ## <chr> <dbl> ## 1 a 4 ## 2 b 2 ## 3 c 5 ``` --- # Conditional logic - `ifelse()` ```r vec1 <- runif(10) # generates random number between 0 and 1 d1 <- tibble(var1 = vec1) d1 %>% mutate(var2 = ifelse(var1 >= .5, "greater", "lesser")) ``` ``` ## # A tibble: 10 x 2 ## var1 var2 ## <dbl> <chr> ## 1 0.102 lesser ## 2 0.657 greater ## 3 0.848 greater ## 4 0.944 greater ## 5 0.225 lesser ## 6 0.0332 lesser ## 7 0.392 lesser ## 8 0.813 greater ## 9 0.732 greater ## 10 0.884 greater ``` --- # Conditional logic - `dplyr::case_when()` ```r d1 %>% mutate(var2 = case_when( var1 >= .75 ~ "a lot", var1 >= .5 & var1 < .75 ~ "greater", var1 >= .25 & var1 < .5 ~ "lesser", var1 < .25 ~ "very little" )) ``` ``` ## # A tibble: 10 x 2 ## var1 var2 ## <dbl> <chr> ## 1 0.102 very little ## 2 0.657 greater ## 3 0.848 a lot ## 4 0.944 a lot ## 5 0.225 very little ## 6 0.0332 very little ## 7 0.392 lesser ## 8 0.813 a lot ## 9 0.732 greater ## 10 0.884 a lot ``` --- # Objects, functions, directories, and data structures Here's a short URL: `https://goo.gl/bUeMhV` This is what it resolves to (it's a CSV file): `https://raw.githubusercontent.com/data-edu/data-science-in-education/master/data/pisaUSA15/stu-quest.csv` This chunk of code will save that file to a data directory in our working environment. ```r student_responses_url <- "https://goo.gl/bUeMhV" student_responses_file_name <- paste0(getwd(), "/data/student-responses-data.csv") download.file( url = student_responses_url, destfile = student_responses_file_name) ``` --- # Step 1 1. The *character string* `"https://goo.gl/wPmujv"` is being saved to an *object* called `student_responses_url`. ```r student_responses_url <- "https://goo.gl/bUeMhV" ``` --- # Step 2 2. We concatenate the working directory file path to the desired file name for the CSV using a *function* called `str_c`. This is stored in another *object* called `student_reponses_file_name`. This creates a file name with a *file path* in your working directory and it saves the file in the folder that you are working in. ```r student_responses_file_name <- str_c(getwd(), "./data/student-responses-data.csv") ``` --- # Step 3 3. The `student_responses_url` *object* is passed to the `url` argument of the *function* called `download.file()` along with `student_responses_file_name`, which is passed to the `destfile` argument. In short, the `download.file()` function needs to know - where the file is coming from (which you tell it through the `url`) argument and - where the file will be saved (which you tell it through the `destfile` argument). ```r download.file( url = student_responses_url, destfile = student_responses_file_name) ``` --- # Vignettes Vignettes are long-form documentation (and tutorials) that can provide a helpful introduction to a package. Run `vignette()` in order to view *all* of the vignettes available for a package: ```r vignette(package = "dplyr") ``` Then, you can load a specific vignette. 
```r vignette("dplyr", package = "dplyr") ``` These are also available through CRAN (i.e., https://cran.r-project.org/web/packages/dplyr/index.html) --- # Using a specific function If you know the specific function that you want to look up, you can run this in the R Studio console: ```r ?dplyr::filter ``` Once you know what you want to do with the function, you can run it in your code: ```r dat <- # example data frame tibble(letter = c("A", "A", "A", "B", "B"), number = c(1L, 2L, 3L, 4L, 5L)) filter(dat, letter == "A") ``` ``` ## # A tibble: 3 x 2 ## letter number ## <chr> <int> ## 1 A 1 ## 2 A 2 ## 3 A 3 ``` ---