DNA Analysis | Stats Book

Bioinformatics

Bioinformatics is a challenging field at the intersection of computer science and data analysis. The human genome contains an enormous amount of information and is the subject of much research. A draft of the human genome was published in 2001, and it is estimated to contain about 20,000 protein-coding genes. These protein-coding genes make up only a small proportion of the total amount of DNA (approx. 2%). The human genome comprises about 3,235 Mb (megabase pairs) per haploid genome, or roughly 6,470 Mb in total (diploid).
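
To get a feel for these figures, the following is a rough back-of-envelope sketch in R. It only uses the approximate numbers quoted above (haploid size, 2% coding share, 20,000 genes); the variable names are illustrative, and the results are order-of-magnitude estimates rather than measured values.

```r
# Approximate figures taken from the paragraph above
haploid_mb      <- 3235    # haploid genome size in Mb
coding_fraction <- 0.02    # approximate protein-coding share of the genome
n_genes         <- 20000   # estimated number of protein-coding genes

# Amount of protein-coding DNA in Mb
coding_mb <- haploid_mb * coding_fraction

# Average amount of protein-coding DNA per gene, in base pairs
coding_bp_per_gene <- coding_mb * 1e6 / n_genes

cat("Protein-coding DNA (Mb):", coding_mb, "\n")
cat("Average per gene (bp):", coding_bp_per_gene, "\n")
```

Under these assumptions, protein-coding DNA amounts to about 65 Mb, or on the order of a few thousand base pairs per gene.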

Information on specific genes and entire genomes can be downloaded from the National Center for Biotechnology Information (NCBI) website. There is also comprehensive guidance on how to use the database and website.
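
Besides the website, NCBI records can also be fetched programmatically. The sketch below assumes the NCBI E-utilities `efetch` endpoint and its `db`, `id`, `rettype` and `retmode` parameters; this route is not described in this book. The helper name `efetch_fasta_url` is hypothetical, and the accession is only an illustrative placeholder that should be checked against the NCBI website before use.

```r
# Hypothetical helper: build an E-utilities "efetch" URL that returns
# a nucleotide record in FASTA format.
efetch_fasta_url <- function(accession) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
         "?db=nuccore",
         "&id=", utils::URLencode(accession, reserved = TRUE),
         "&rettype=fasta&retmode=text")
}

# Placeholder accession (an assumption): replace with the record you need
url <- efetch_fasta_url("NM_002168")
print(url)

# Uncomment to download the record (requires network access):
# download.file(url, destfile = "sequence.fasta")
```

The download call is left commented out so that the example runs without network access.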

Sequence information can be downloaded in several formats. These include:

- Fasta file – simple text-based file containing nucleotide or peptide sequences (example – open with a text editor)
- GenBank file – standard format produced by the NCBI that carries more information, including references (example – open with a text editor)

Fasta files are simple, easy to parse and a standard in bioinformatics. This file format is also used in this book. The first line of each sequence starts with a “>” sign followed by specific information about the species and sequence. The fasta files in this book open in a new window of your browser. To save the contents to a file, select all (Ctrl-A), copy (Ctrl-C) and paste (Ctrl-V) in a text editor. Subsequently save the file with an appropriate name and the “fasta” extension (make sure there is no “txt” extension in the file name).
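
To illustrate how simple the format is, here is a minimal sketch of a fasta parser in base R. The two records are made-up sequences written to a temporary file, and the variable names are illustrative. The sketch assumes every header line is followed by at least one sequence line; for real work, the packages introduced below are a better choice.

```r
# Write a tiny two-record fasta file to a temporary location (made-up data)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 hypothetical example record",
             "ATGCGTAC",
             "GGCTA",
             ">seq2 another hypothetical record",
             "TTGACC"),
           fasta_path)

lines <- readLines(fasta_path)

# Header lines start with ">"; all other lines hold sequence data
is_header <- startsWith(lines, ">")

# Each header opens a new record, so cumsum() gives every line a record index
record_id <- cumsum(is_header)

# Strip the leading ">" from the headers
headers <- sub("^>", "", lines[is_header])

# Join the sequence lines that belong to the same record
sequences <- tapply(lines[!is_header], record_id[!is_header],
                    paste, collapse = "")
sequences <- setNames(as.vector(sequences), headers)

print(sequences)
print(nchar(sequences))  # sequence lengths
```

A sequence may span several lines, which is why the lines are grouped by record before being pasted together.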

Installation of packages

Packages that need to be installed to go through the examples are:

seqinr [1]
- Installation is the same as for any other R package from CRAN. In the R console, enter:
```
install.packages("seqinr")
```
- To load the package in R:
```
library(seqinr)
```
Biostrings [2]
- Biostrings is distributed through Bioconductor rather than CRAN, so installation differs from that of other packages, as explained on the website: https://www.bioconductor.org/ . For installation, please follow the instructions on https://www.bioconductor.org/install/
```
# biocLite() has been retired; current Bioconductor releases install through BiocManager
install.packages("BiocManager")
BiocManager::install()
```
- The Biostrings package may not be installed automatically. To install it specifically:
```
BiocManager::install("Biostrings")
```
- To load the package in R:
```
library(Biostrings)
```
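
Once both packages are installed, a quick check is to read the same small fasta file with each of them. The following is a minimal sketch on a made-up sequence, assuming both packages load without error; it is not one of the worked examples of this book. Both packages define a function called `translate`, so when both are loaded it is safer to write `seqinr::translate` or `Biostrings::translate` explicitly.

```r
library(seqinr)
library(Biostrings)

# Write a one-record fasta file to a temporary location (made-up sequence)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">example1 made-up sequence", "ATGCGCGTTA"), fasta_path)

# seqinr: returns a list with one vector of single characters per sequence
seqs <- read.fasta(file = fasta_path, seqtype = "DNA")
print(names(seqs))        # sequence names taken from the header lines
print(length(seqs[[1]]))  # number of nucleotides in the first sequence
print(GC(seqs[[1]]))      # GC fraction of the first sequence

# Biostrings: returns a DNAStringSet container
dna <- readDNAStringSet(fasta_path)
print(width(dna))         # sequence lengths
rc <- reverseComplement(dna)
print(as.character(rc))   # reverse complement as plain text
```

The two packages represent sequences differently (character vectors in seqinr, dedicated string containers in Biostrings), which determines which functions can be applied to the result.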

Examples of gene analysis

Lady’s slipper orchid (Cypripedium calceolus)

Isocitrate dehydrogenase (IDH2) gene on chromosome 15 in humans (Homo sapiens) and orangutans (Pongo abelii), coding for the mitochondrial enzyme.

[1] Charif D, Clerc O, Frank C, Lobry JR, Necsulea A, Palmeira L, et al. seqinr: Biological Sequences Retrieval and Analysis [Internet]. 2017. Available from: https://CRAN.R-project.org/package=seqinr

[2] Bioconductor: open source software for bioinformatics [Internet]. Available from: http://www.bioconductor.org/