```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE) ``` ## Working in R These code downloads have been constructed as supplements to the full [Doing Economics projects](https://core-econ.org/doing-economics/). You’ll need to download the data before running the code that follows. ## Getting started in R If you need to install either of these packages, run the following code: ```{r} #install.packages(c("readxl", "tidyverse")) ``` You can import the libraries now, or when they are used in the R walk-throughs. ```{r} library(readxl) library(tidyverse) ``` ### **R walk-through 1** Importing a `.csv` datafile into R and doing basic data cleaning We start by setting our working directory using the `setwd` command. This command tells R where your codes and data files are stored. In the code below, replace 'YOURFILEPATH' with the full filepath that indicates the folder in which you have saved the code chunks file. Note that you have to use forward slashes ('/') rather than backslashes ('\\'). If you don’t know how to find the path to your working folder, see the [‘Technical Reference’ section](50-00-technical-reference.html#finding-your-path-on-mac). ```{r} setwd(‘YOURFILEPATH’) ``` Since our data is in Excel format, we use the `read_excel` function to import the data into R. We run this command twice; once to import the data dictionary (which we will call `var_info`) and once to import the survey data (which we will call `dat`). We use the `sheet` option to tell R which tab in the Excel file to import. ```{r} var_info <- read_excel("dataset_hope-et-al_simplified.xlsx", sheet = "Data dictionary") dat <- read_excel("dataset_hope-et-al_simplified.xlsx", sheet = "Data") ``` To check that the data has been imported correctly, you can use the head function to view the first six rows of the dataset, and confirm that they correspond to the columns in the Excel file. ``` {r} head(dat) ``` Before working with the data, we use the `str` function to check that the data is formatted correctly. ``` {r} str(dat) ``` R correctly recognises variables that are numbers (`num`), such as `respondent_id` and `age`, and variables that are words (`chr`), such as `partisanship`. ### **R walk-through 2** Making a frequency table We use the `count` function (part of the `tidyverse` package) to count the number of respondents in the control and treatment group. This information is stored in the variable `treatment`. The punctuation `%>%` can be used to link multiple commands together. ```{r} dat %>% count(treatment) ``` There are 1,516 respondents in the control group (`treatment` = 0) and 1,481 respondents in the treatment group (`treatment` = 1). ### **R walk-through 3** Creating dummy variables Here we use the `mutate` function (within the `tidyverse` package) to create new columns in our `dat` dataset, one for each dummy variable. We use the command `case_when` to specify the condition(s) that R should use to assign values, and the `is.na` function to code any missing data (`NA`) as `NA_real` (numerical missing data). ```{r} # Create carbon tax dummy (1 if support is 1 or 2) dat <- dat %>% mutate( carbon_tax_support_dummy = case_when( carbon_tax_support == 1 ~ 1, carbon_tax_support == 2 ~ 1, is.na(carbon_tax_support) ~ NA_real_, TRUE ~ 0 ) ) ``` In the code above, we set the values of our carbon tax support dummy variable (called `carbon_tax_support_dummy`) to equal 1 if the variable `carbon_tax_support` equals 1 or 2. Otherwise, this dummy variable should equal 0 (`TRUE ~ 0`). We repeat the same steps to create dummy variables for age (`age_dummy`), commuting by car (`car_dummy`), rural neighbourhood (`utp_dummy`), and unequal treatment perceptions (`unequal_treatment_dummy`). ```{r} # Create age dummy (1 if age is greater than 40) dat <- dat %>% mutate( age_dummy = case_when( age > 40 ~ 1, is.na(age) ~ NA_real_, TRUE ~ 0 ) ) #Create car dummy (1 if people commute by car) dat <- dat %>% mutate( car_dummy = case_when( commute == "Car" ~ 1, is.na(commute) ~ NA_real_, TRUE ~ 0 ) ) #Create rural dummy (1 if people live in rural neighborhood) dat <- dat %>% mutate( rural = case_when( neighbourhood == "Rural" ~ 1, is.na(neighbourhood) ~ NA_real_, TRUE ~ 0 ) ) #Create unequal treatment perceptions dummy (1 if people 8 and above) dat <- dat %>% mutate( unequal_treatment_dummy = case_when( unequal_treatment >= 8 ~ 1, is.na(unequal_treatment) ~ NA_real_, TRUE ~ 0 ) ) ``` ### **R walk-through 4** Making a frequency and column chart on a subset of data We apply the `filter` function to select observations in the control group (`treatment == 0`) and save these in a new dataframe called `Control`. ```{r} Control <- dat %>% filter(treatment == 0) ``` We will create a frequency table (called `freq_table`) using three tidyverse functions. First, we filter (`filter`) out the missing data in the `carbon_tax_support` variable. Second, we count (`count`) the number of respondents in this variable. Third, we use `mutate` to create a new column called `Percentage` that contains the percentages. ```{r} #Frequency table for carbon tax support freq_table <- dat %>% filter(!is.na(carbon_tax_support)) %>% count(carbon_tax_support) %>% mutate(Percentage = (n / sum(n)) * 100) ``` We now use `ggplot` to make a column chart (`geom_col()`) with `carbon_tax_support` as the horizontal (`x`) variable and `Percentage` as the vertical (`y`) variable. ```{r} #Create column chart ggplot(freq_table, aes(x = carbon_tax_support, y = Percentage)) + geom_col() + labs( title = "Share of respondents by carbon tax support", x = "Carbon tax support", y = "Percentage of respondents" ) ``` ### **R walk-through 5** Calculating the average of a dummy variable The average of a dummy variable is a proportion (between 0 and 1), which can be multiplied by 100 to represent the percentage of respondents for which the variable equals 1. We use the `count` and `mutate` functions, just like in R walk-through 5, and store the results in `freq_table_dummy`. ```{r} freq_table_dummy <- dat %>% count(carbon_tax_support_dummy) %>% mutate(Percentage = (n / sum(n)) * 100) freq_table_dummy ``` R creates a separate row for the missing data (56 observations). To remove this missing data, we use the `filter` option before creating the frequency table. ```{r} freq_table_dummy <- dat %>% filter(!is.na(carbon_tax_support)) %>% count(carbon_tax_support_dummy) %>% mutate(Percentage = (n / sum(n)) * 100) freq_table_dummy ``` The counts ('n') remain the same, but the percentages are now calculated over the non-missing data only. ### **R walk-through 6** Creating summary tables for different groups Using the `Control` dataframe, we first use `group_by` to specify the groups that the calculations will be done separately on (`age_dummy` in this case), then use `filter` to remove rows with missing data for age and carbon tax support. Finally, we use the `summarise` function to calculate the mean (`mean()`) and store it in a new variable called ` mean_carbon_tax_support_dummy`. ```{r} # Carbon tax support by age group (1 = age above 40) Control %>% group_by(age_dummy) %>% filter(!is.na(age_dummy))%>% filter(!is.na(carbon_tax_support_dummy)) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy)) ``` Now we repeat the same steps for the other dummy variables. ```{r} #Carbon tax support by commuting by car(1 = car) Control %>% group_by(car_dummy) %>% filter(!is.na(car_dummy))%>% filter(!is.na(carbon_tax_support_dummy)) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy)) #Carbon tax support by rural status (1 = rural) Control %>% group_by(rural) %>% filter(!is.na(rural))%>% filter(!is.na(carbon_tax_support_dummy)) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy)) #Carbon tax support by partisanship Control %>% group_by(partisanship) %>% filter(!is.na(partisanship))%>% filter(!is.na(carbon_tax_support_dummy)) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy)) ``` ### **R walk-through 7** Creating summary tables on a subset of data First, we use `filter` to select only the rows in the `Control` dataframe that have rural respondents, and call this dataframe `rural_control`. ```{r} rural_control <- Control %>% filter(rural == 1) ``` To create the summary tables, we use the `group_by`, `filter`, `is.na`, and `summarise` functions just like in R walk-through 6 from Part 1. Within the `mean` function, the `na.rm = T` option tells R to exclude any missing data from the calculation. ```{r} #Perception of carbon tax unfairness among rural respondents by unequal treatment perceptions (1 if people 8 and above) rural_control %>% group_by(unequal_treatment_dummy) %>% filter(!is.na(unequal_treatment_dummy))%>% filter(!is.na(carbon_tax_unfairness)) %>% summarise(mean_unfairness = mean(carbon_tax_unfairness, na.rm = T)) #Support for carbon taxation among rural respondents by unequal treatment perceptions (1 if people 8 and above) rural_control %>% group_by(unequal_treatment_dummy) %>% filter(!is.na(unequal_treatment_dummy))%>% filter(!is.na(carbon_tax_support_dummy)) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy, na.rm = T)) ``` ### **R walk-through 8** Creating column charts We use `filter` to create a new dataframe called `rural` that contains all rows that satisfy the condition `rural == 1`. ```{r} rural <- dat %>% filter(rural == 1) ``` To create the column charts, we first use the `group_by` and `summarise` functions to calculate the mean separately for each group (treatment and control). Then we ‘pipe’ this output into the `ggplot` function, using the `factor` option to specify that each category in the `treatment` variable should have a separate column. ```{r} rural %>% group_by(treatment) %>% summarise(mean_unequal_treatment = mean(unequal_treatment, na.rm = T)) %>% ggplot(., aes(x = factor(treatment), y = mean_unequal_treatment, fill = treatment)) + geom_col(width = 0.3) + labs( title = "Average unequal treatment perceptions", x = "Treatment group", y = "Average perceived unequal treatment" ) + theme(legend.position = "none") ``` We repeat the same steps for unfairness perceptions (`carbon_tax_unfairness`) and carbon tax support (`carbon_tax_support_dummy`). ```{r} # Unfairness perception by treatment group rural %>% group_by(treatment) %>% summarise(mean_carbon_tax_unfairness = mean(carbon_tax_unfairness, na.rm = T)) %>% ggplot(., aes(x = factor(treatment), y = mean_carbon_tax_unfairness, fill = treatment)) + geom_col(width = 0.3) + labs( title = "Average carbon tax unfairness perception", x = "Treatment group", y = "Average carbon tax unfairness perception" ) + theme(legend.position = "none") # Carbon tax support by treatment group rural %>% group_by(treatment) %>% summarise(mean_carbon_tax_support_dummy = mean(carbon_tax_support_dummy, na.rm = T)) %>% ggplot(., aes(x = factor(treatment), y = mean_carbon_tax_support_dummy, fill = treatment)) + geom_col(width = 0.3) + labs( title = "Average carbon tax support", x = "Treatment group", y = "Average carbon tax support" ) + theme(legend.position = "none") ``` ### **R walk-through 9** Calculating p-values for differences in means First, we use the `arrange` function to sort the data according to `treatment`. ```{r} rural <- rural %>% arrange(treatment, .by_group=TRUE) head(rural$treatment) tail(rural$treatment) ``` To calculate the p-value for `unequal_treatment`, we use `pull` to extract the data for that variable only, and create separate vectors to store the data for control and treatment group (`p1_c` and `p1_t` respectively). We then use these vectors as inputs in the `t.test` function. ```{r} # Create vectors for control and treatment group and run t test, unequal treatment perceptions p1_c <- rural %>% filter(treatment == 0) %>% pull(unequal_treatment) p1_t <- rural %>% filter(treatment == 1) %>% pull(unequal_treatment) t1 <- t.test(x = p1_c, y = p1_t) t1$p.value ``` We repeat these steps for ` carbon_tax_unfairness` and ` carbon_tax_support_dummy`. ```{r} # t-test for unequal treatment perceptions p2_c <- rural %>% filter(treatment == 0) %>% pull(carbon_tax_unfairness) p2_t <- rural %>% filter(treatment == 1) %>% pull(carbon_tax_unfairness) t2 <- t.test(x = p2_c, y = p2_t) t2$p.value # t-test for carbon tax support p3_c <- rural %>% filter(treatment == 0) %>% pull(carbon_tax_support_dummy) p3_t <- rural %>% filter(treatment == 1) %>% pull(carbon_tax_support_dummy) t3 <- t.test(x = p1_c, y = p1_t) t3$p.value ``` ### **R walk-through 10** Calculating confidence intervals and adding them to a chart We will use `unequal_treatment` as an example; the steps for the other variables are exactly the same (just change the variable name in the code below). The `t.test` function calculates confidence intervals automatically, along with other information (including the p-value, which we used in R walk-through 9). The confidence interval is stored as `conf.int[1:2]`. ```{r} # Confidence interval for unequal_treatment t1$conf.int[1:2] ``` For R to plot the confidence intervals, instead of the actual values, we need to store the interval values as the amount to add/subtract from the mean value. The easiest way is to calculate the standard error⁠ for the sample mean and multiply this by 1.96 (`m_err` = $1.96 \times sqrt{\text{standard deviation}/(\text{number of observations}-1)$), where 1.96 is the factor required to get a 95% confidence interval (assuming a normal distribution). The confidence interval is then [mean_m − m_err, mean_m + m_err]. We use `group_by` and `summarize` to create a summary table showing the number of observations, the mean, standard deviation, and standard error for the treatment and control group. We arrange the data according to the mean values of `unequal_treatment` (`rev`sorts from highest to lowest), and save the final result in `table_stats`. ```{r} table_stats <- rural %>% group_by(treatment) %>% summarize(obs = length(unequal_treatment), mean_m = mean(unequal_treatment), sd_m = sd(unequal_treatment, na.rm = TRUE), m_err = 1.96 * sqrt(sd_m^2 / (obs - 1))) %>% arrange(rev(mean_m)) table_stats ``` Now we can use this information to make a bar chart. ```{r} ggplot(table_stats, aes(y = mean_m, x = treatment)) + geom_bar(position = position_dodge(), # Use black outlines and add thinner bar outlines stat = "identity", colour = "black", width = .5) + ylab("Mean of unequal treatment perceptions") + xlab("") + geom_errorbar(aes(ymin = mean_m - m_err, ymax = mean_m + m_err), # Thinner lines for confidence intervals size = .6, width = .2, position = position_dodge(.2)) ```