Statsbook

DNA Analysis

Bioinformatics

Bioinformatics is a challenging field in computer science and data analysis. The human genome contains an enormous amount of information and is the subject of much research. A draft of the human genome was published in 2001, and it contains an estimated 20,000 protein-coding genes. These genes make up only a small proportion (approximately 2%) of the total DNA. The human genome comprises 3,235 Mb (megabase pairs) per haploid genome and roughly twice that, about 6,470 Mb, in the diploid genome.

Information on specific genes and entire genomes can be downloaded from the National Center for Biotechnology Information (NCBI) website. There is also comprehensive guidance on how to use the database and website.
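Besides the website, NCBI offers a programmatic interface (E-utilities) for retrieving records. The following is a minimal sketch, not the method used in this book: it assumes the public `efetch` endpoint with the parameters `db`, `id`, `rettype` and `retmode`, and uses only the Python standard library. The function names `build_efetch_url` and `fetch_record` are hypothetical helpers introduced for illustration, and the accession identifier must be supplied by the reader.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta", db="nuccore"):
    """Build the request URL for one record (hypothetical helper).

    accession: identifier of the record, supplied by the reader.
    rettype:   "fasta" for a Fasta file, "gb" for a GenBank file.
    db:        NCBI database to query ("nuccore" holds nucleotide records).
    """
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"


def fetch_record(accession, rettype="fasta"):
    """Download the record as text (requires an internet connection)."""
    with urlopen(build_efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")
```

Building the URL separately from the download makes it possible to inspect the request before sending it. NCBI asks users to limit the rate of automated requests, so a helper like this is best suited to fetching a handful of records.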

Genetic sequence information can be downloaded in several formats. These include:

  • Fasta file – simple files that contain text based information with nucleotide or peptide sequences (example – open with a text editor)
  • GenBank file – standard produced by the NCBI with more information including references (example – open with a text editor)

Fasta files are simple, easy to parse and a standard in bioinformatics. This file format is also used in this book. Each sequence starts with a header line that begins with a “>” sign followed by specific information about the species and sequence; the lines that follow contain the sequence itself. The fasta files in this book open in a new window of your browser. To save the contents to a file, select all (Ctrl-A), copy (Ctrl-C) and paste (Ctrl-V) into a text editor. Then save the file with an appropriate name and the “fasta” extension (make sure there is no “txt” extension in the file name).
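To illustrate how little is needed to read this format, here is a minimal sketch of a Fasta parser in plain Python. It is not the approach used by any particular package: the function name `parse_fasta` is a hypothetical helper, and the two records in `example` are made-up toy data rather than real sequences.

```python
def parse_fasta(text):
    """Split Fasta-formatted text into (header, sequence) pairs.

    A line starting with ">" opens a new record; all following lines
    up to the next ">" are joined into one sequence string.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()  # drop the ">" sign
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up toy data, for illustration only.
example = """>seq1 toy record
ACGTAC
GGT
>seq2 another toy record
TTGACA
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

To read a saved file instead, the same function can be applied to its contents, for example `parse_fasta(open("myfile.fasta").read())`, where `myfile.fasta` stands for whatever name the file was saved under.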

Installation of packages

Packages that need to be installed to go through the examples are:

Examples of gene analysis

Lady’s slipper orchid (Cypripedium calceolus)

Isocitrate dehydrogenase (IDH2) gene on chromosome 15 in humans (Homo sapiens) and orangutans (Pongo abelii), coding for the mitochondrial enzyme.