Data are usually collected in a spreadsheet program such as Microsoft Excel or Open Office. Once data collection is complete it is preferable to import the data directly from the spreadsheet into R, rather than entering it manually.
The comma separated values is the most versatile file format, but the Excel format is often used.
Comma Separated Values (CSV)
Heights of 10 people (csv file)
Microsoft Excel (XLS)
Heights of 10 people (xls file)
There should be no spaces in variable names. In a spreadsheet, the variable names are stored in the first row. If the variable name contains a space, this space will be converted to a full stop “.”. However, it is generally preferable not to use spaces in variable names and use one of the conventions as described.
To import the spreadsheet:
library(readxl)
df_csv <- read_csv('/path/to_file/height10.csv')
Rows: 10 Columns: 1
── Column specification ─────────────────────────────────────────
Delimiter: ","
dbl (1): height
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_csv
# A tibble: 10 × 1
height
<dbl>
1 184
2 146
3 169
4 185
5 160
6 173
7 179
8 171
9 160
10 150
# or for excel files:
df_xls <- read_excel('/path/to_file/height10.xls')
df_xls
# A tibble: 10 × 1
height
<dbl>
1 184
2 146
3 169
4 185
5 160
6 173
7 179
8 171
9 160
10 150
To see the structure:
str(df_xls)
tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
$ height: num [1:10] 184 146 169 185 160 173 179 171 160 150
Before proceeding with statistical analysis, make sure the variables are set of the appropriate type (numerical, factor, date etc) as also discussed in the manual section.
To save the data frame as an rda (R data) file:
save(df_xls, file = '/path/to_file/df_xls.rda')
To import a csv file with base R (without importing a package):
height10 <- read.table("/path/to_file/height10.csv", header=T, sep=",", quote="\"")
This will create a data frame ‘height10’ that is available in the Data Viewer (please edit the /path/to_file/ to the appropriate location of the file on your computer).
To import a xlsx file can also be done with the readxl package:
For example the SpecGroup.xlsx file:
library(readxl)
library(lubridate)
df <- read_excel('/path/to_file/SpecGroup.xlsx')
df
# A tibble: 60 × 4
Month Group Speciality Number
<chr> <chr> <chr> <dbl>
1 01/2016 New Oncology 5
2 01/2016 New Ortho 16
3 01/2016 New Oncology 7
4 01/2016 Other Oncology 2
5 01/2016 Other Ortho 2
6 01/2016 Other Ortho 4
7 01/2016 Other Oncology 7
8 01/2016 Resection Ortho 4
9 01/2016 Resection Ortho 3
10 02/2016 New Oncology 8
# ℹ 50 more rows
# ℹ Use `print(n = ...)` to see more rows
str(df)
tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
$ Month : chr [1:60] "01/2016" "01/2016" "01/2016" "01/2016" ...
$ Group : chr [1:60] "New" "New" "New" "Other" ...
$ Speciality: chr [1:60] "Oncology" "Ortho" "Oncology" "Oncology" ...
$ Number : num [1:60] 5 16 7 2 2 4 7 4 3 8 ...
df$Month = my(df$Month)
df$Group = as.factor(df$Group)
df$Speciality = as.factor(df$Speciality)
df
# A tibble: 60 × 4
Month Group Speciality Number
<date> <fct> <fct> <dbl>
1 2016-01-01 New Oncology 5
2 2016-01-01 New Ortho 16
3 2016-01-01 New Oncology 7
4 2016-01-01 Other Oncology 2
5 2016-01-01 Other Ortho 2
6 2016-01-01 Other Ortho 4
7 2016-01-01 Other Oncology 7
8 2016-01-01 Resection Ortho 4
9 2016-01-01 Resection Ortho 3
10 2016-02-01 New Oncology 8
# ℹ 50 more rows
# ℹ Use `print(n = ...)` to see more rows
str(df)
tibble [60 × 4] (S3: tbl_df/tbl/data.frame)
$ Month : Date[1:60], format: "2016-01-01" ...
$ Group : Factor w/ 3 levels "New","Other",..: 1 1 1 2 2 2 2 3 3 1 ...
$ Speciality: Factor w/ 2 levels "Oncology","Ortho": 1 2 1 1 2 2 1 2 2 1 ...
$ Number : num [1:60] 5 16 7 2 2 4 7 4 3 8 ...
The package readxl1 should be installed as described. (please edit the /path/to_file/ to the appropriate location of the file)
The lubridate2 package has several functions to convert dates to ISO 8601 standard. It is recommended to use the ISO standard (year-month-day) in statistical analysis. Similar to numbers provide the most significant number (year) first, followed by month and day. This will also result in the dates being ordered correctly. The functions in lubridate are logically named; ie ymd() expects the string to be in ‘year-month-day’ order. Here the my() function was used as the string was in month/year order. The last str(df) call shows that the date has been changed to ISO 8601 format. Other lubridate functions are described on the website.