Import Spreadsheet – Statsbook

Data are usually collected in a spreadsheet program such as Microsoft Excel or Open Office. Once data collection is complete it is preferable to import the data directly from the spreadsheet into R, rather than entering it manually.

The comma separated values is the most versatile file format, but the Excel format is often used.

Comma Separated Values (CSV)

Heights of 10 people (csv file)

Microsoft Excel (XLS)

Heights of 10 people (xls file)

There should be no spaces in variable names. In a spreadsheet, the variable names are stored in the first row. If the variable name contains a space, this space will be converted to a full stop “.”. However, it is generally preferable not to use spaces in variable names and use one of the conventions as described.

To import the spreadsheet:

library(readxl)
df_csv <- read_csv('/path/to_file/height10.csv')
Rows: 10 Columns: 1                                            
── Column specification ─────────────────────────────────────────
Delimiter: ","
dbl (1): height

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_csv
# A tibble: 10 × 1
   height
    <dbl>
 1    184
 2    146
 3    169
 4    185
 5    160
 6    173
 7    179
 8    171
 9    160
10    150
# or for excel files:
df_xls <- read_excel('/path/to_file/height10.xls')                                                            
df_xls
# A tibble: 10 × 1
   height
    <dbl>
 1    184
 2    146
 3    169
 4    185
 5    160
 6    173
 7    179
 8    171
 9    160
10    150

To see the structure:

str(df_xls)
tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
 $ height: num [1:10] 184 146 169 185 160 173 179 171 160 150

Before proceeding with statistical analysis, make sure the variables are set of the appropriate type (numerical, factor, date etc) as also discussed in the manual section.

To save the data frame as an rda (R data) file:

save(df_xls, file = '/path/to_file/df_xls.rda')

To import a csv file with base R (without importing a package):

height10 <- read.table("/path/to_file/height10.csv", header=T, sep=",", quote="\"")

This will create a data frame ‘height10’ that is available in the Data Viewer (please edit the /path/to_file/ to the appropriate location of the file on your computer).

To import a xlsx file can also be done with the readxl package:

For example the SpecGroup.xlsx file:

library(readxl)
library(lubridate)
df <- read_excel('/path/to_file/SpecGroup.xlsx')                                                          
df
# A tibble: 60 × 4
   Month   Group     Speciality Number
   <chr>   <chr>     <chr>       <dbl>
 1 01/2016 New       Oncology        5
 2 01/2016 New       Ortho          16
 3 01/2016 New       Oncology        7
 4 01/2016 Other     Oncology        2
 5 01/2016 Other     Ortho           2
 6 01/2016 Other     Ortho           4
 7 01/2016 Other     Oncology        7
 8 01/2016 Resection Ortho           4
 9 01/2016 Resection Ortho           3
10 02/2016 New       Oncology        8
# ℹ 50 more rows
# ℹ Use `print(n = ...)` to see more rows
str(df)
tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
 $ Month     : chr [1:60] "01/2016" "01/2016" "01/2016" "01/2016" ...
 $ Group     : chr [1:60] "New" "New" "New" "Other" ...
 $ Speciality: chr [1:60] "Oncology" "Ortho" "Oncology" "Oncology" ...
 $ Number    : num [1:60] 5 16 7 2 2 4 7 4 3 8 ...
df$Month = my(df$Month)
df$Group = as.factor(df$Group)
df$Speciality = as.factor(df$Speciality)
df
# A tibble: 60 × 4
   Month      Group     Speciality Number
   <date>     <fct>     <fct>       <dbl>
 1 2016-01-01 New       Oncology        5
 2 2016-01-01 New       Ortho          16
 3 2016-01-01 New       Oncology        7
 4 2016-01-01 Other     Oncology        2
 5 2016-01-01 Other     Ortho           2
 6 2016-01-01 Other     Ortho           4
 7 2016-01-01 Other     Oncology        7
 8 2016-01-01 Resection Ortho           4
 9 2016-01-01 Resection Ortho           3
10 2016-02-01 New       Oncology        8
# ℹ 50 more rows
# ℹ Use `print(n = ...)` to see more rows
str(df)
tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
 $ Month     : Date[1:60], format: "2016-01-01" ...
 $ Group     : Factor w/ 3 levels "New","Other",..: 1 1 1 2 2 2 2 3 3 1 ...
 $ Speciality: Factor w/ 2 levels "Oncology","Ortho": 1 2 1 1 2 2 1 2 2 1 ...
 $ Number    : num [1:60] 5 16 7 2 2 4 7 4 3 8 ...

The package readxl¹ should be installed as described. (please edit the /path/to_file/ to the appropriate location of the file)

The lubridate² package has several functions to convert dates to ISO 8601 standard. It is recommended to use the ISO standard (year-month-day) in statistical analysis. Similar to numbers provide the most significant number (year) first, followed by month and day. This will also result in the dates being ordered correctly. The functions in lubridate are logically named; ie ymd() expects the string to be in ‘year-month-day’ order. Here the my() function was used as the string was in month/year order. The last str(df) call shows that the date has been changed to ISO 8601 format. Other lubridate functions are described on the website.

References

↑ Wickham et al., « readxl: Read Excel Files », 2016-03-28

↑ Grolemund et al., « lubridate: Make Dealing with Dates a Little Easier », 2016-04-06