{"id":2232,"date":"2018-04-02T15:04:08","date_gmt":"2018-04-02T14:04:08","guid":{"rendered":"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=2232"},"modified":"2025-06-29T17:30:29","modified_gmt":"2025-06-29T16:30:29","slug":"idh2-gene","status":"publish","type":"page","link":"https:\/\/pcool.dyndns.org\/index.php\/idh2-gene\/","title":{"rendered":"IDH2 gene"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Please make sure the necessary packages (seqinr<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r1-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r1\">1<\/a><\/sup> and Biostrings<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r2-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r2\">2<\/a><\/sup>) are<a href=\"https:\/\/pcool.dyndns.org\/index.php\/packages\/\" data-type=\"page\" data-id=\"22\">&nbsp;installed<\/a>&nbsp;as described to allow analysis. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/bioconductor.org\/packages\/release\/bioc\/html\/Biostrings.html\" target=\"_blank\" rel=\"noreferrer noopener\">Biostrings<\/a> and <a href=\"https:\/\/bioconductor.org\/packages\/release\/bioc\/html\/pwalign.html\" target=\"_blank\" rel=\"noreferrer noopener\">pwalign<\/a> packages are part of <a href=\"https:\/\/www.bioconductor.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Bioconductor<\/a> and installation is a little different:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f60606\" class=\"has-inline-color\">install.packages('BiocManager')\nBiocManager::install('Biostrings')<\/mark><\/em>\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f70b0b\" class=\"has-inline-color\">BiocManager::install('pwalign') # this package is also needed<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f60606\" class=\"has-inline-color\">\nlibrary(Biostrings)<\/mark><\/em><\/code><\/pre>\n\n\n\n<p 
class=\"is-style-text-annotation is-style-text-annotation--1 wp-block-paragraph\">If you used reshape2 or dplyr, you may need to restart R if you get a can&#8217;t unload package error message.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Furthermore, the ggplot2<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r3-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r3\">3<\/a><\/sup> and reshape2<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r4-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r4\">4<\/a><\/sup> libraries should be loaded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this section, the genetic sequence of the isocitrate dehydrogenase (IDH) 2 gene in humans&nbsp;is compared to that of orangutans.&nbsp;Mutations in the gene have been associated with Ollier&#8217;s disease and Maffucci syndrome in humans.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"668\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orangutan-baby-1024x668.jpg\" alt=\"\" class=\"wp-image-3295\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orangutan-baby-1024x668.jpg 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orangutan-baby-300x196.jpg 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orangutan-baby-768x501.jpg 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orangutan-baby.jpg 1500w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">IDH is an&nbsp;enzyme that catalyses the oxidative&nbsp;decarboxylation of isocitrate. In humans, IDH has three forms: IDH1, IDH2 and IDH3. IDH3 catalyses the reaction with nicotinamide adenine dinucleotide (NAD+) as cofactor within the citric acid cycle, whilst IDH1 and IDH2&nbsp;catalyse the reaction outside the citric acid cycle (with&nbsp;nicotinamide adenine dinucleotide phosphate (NADP+) as cofactor). 
Consequently, IDH1 and IDH2 are NADP+ dependent and IDH3 is NAD+ dependent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">IDH2 is a mitochondrial enzyme,<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"928\" height=\"643\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/Protein_IDH2.png\" alt=\"\" class=\"wp-image-3442\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/Protein_IDH2.png 928w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/Protein_IDH2-300x208.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/Protein_IDH2-768x532.png 768w\" sizes=\"auto, (max-width: 928px) 100vw, 928px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">encoded on chromosome 15 (in humans and orangutans):<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"474\" height=\"189\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/human_chromosome_15.png\" alt=\"\" class=\"wp-image-3215\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/human_chromosome_15.png 474w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/human_chromosome_15-300x120.png 300w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The DNA sequence of&nbsp;human (<em>Homo sapiens<\/em>) IDH2 can be&nbsp;downloaded&nbsp;<a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/data_files\/idh2_human.fasta\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> and the DNA sequence of orangutan (<em>Pongo abelii<\/em>)&nbsp;IDH2&nbsp;<a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/data_files\/idh2_orangutan.fasta\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. Select all the text (Ctrl-A), copy (Ctrl-C) and paste (Ctrl-V) it into a text editor. 
Save the files as&nbsp;<strong>idh2_human.fasta&nbsp;<\/strong>and<strong> idh2_orangutan.fasta<\/strong> respectively (<em><strong>without<\/strong><\/em>&nbsp;a txt extension) in the working directory (folder) of R.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Load the files into R and examine their structure with str():<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f40404\" class=\"has-inline-color\">library(seqinr)<\/mark><\/em>\n<span style=\"color: #ff0000;\"><em>human &lt;- read.fasta(file = \"idh2_human.fasta\")<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan &lt;- read.fasta(file = \"idh2_orangutan.fasta\")<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>str(human)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>List of 1<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ NG_023302.1:4923-23499:Class 'SeqFastadna' atomic &#091;1:18577] t c c c ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> .. ..- attr(*, \"name\")= chr \"NG_023302.1:4923-23499\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> .. ..- attr(*, \"Annot\")= chr \"&gt;NG_023302.1:4923-23499 Homo sapiens isocitrate dehydrogenase (NADP(+)) 2, mitochondrial (IDH2), RefSeqGene (LR\"| __truncated__<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>str(orangutan)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>List of 1<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ NC_036918.1:c70426964-70408539:Class 'SeqFastadna' atomic &#091;1:18426] g c a a ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> .. ..- attr(*, \"name\")= chr \"NC_036918.1:c70426964-70408539\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> .. 
..- attr(*, \"Annot\")= chr \"&gt;NC_036918.1:c70426964-70408539 Pongo abelii isolate Susie chromosome 15, Susie_PABv2, whole genome shotgun sequence\"<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The sequences can be viewed by:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f00717\" class=\"has-inline-color\">human&#091;1]<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3d08f0\" class=\"has-inline-color\">\n$`NG_023302.1:4923-23499`\n    &#091;1] \"t\" \"c\" \"c\" \"c\" \"c\" \"g\" \"g\" \"c\" \"a\" \"a\" \"g\" \"g\" \"c\" \"c\" \"c\" \"a\" \"a\" \"t\" \"g\" \"g\" \"g\" \"g\" \"c\" \"g\" \"g\" \"c\" \"a\" \"g\" \"g\" \"c\" \"c\" \"c\"\n.....\n.....\n&#091;18561] \"a\" \"a\" \"a\" \"a\" \"g\" \"c\" \"t\" \"c\" \"t\" \"t\" \"c\" \"a\" \"c\" \"a\" \"a\" \"a\" \"a\"\nattr(,\"name\")\n&#091;1] \"NG_023302.1:4923-23499\"\nattr(,\"Annot\")\n&#091;1] \"&gt;NG_023302.1:4923-23499 Homo sapiens isocitrate dehydrogenase (NADP(+)) 2, mitochondrial (IDH2), RefSeqGene (LRG_611) on chromosome 15\"\nattr(,\"class\")\n&#091;1] \"SeqFastadna\"\n\n<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f0072f\" class=\"has-inline-color\">orangutan&#091;1]<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3d08f0\" class=\"has-inline-color\">\n$`NC_036918.1:c70426964-70408539`\n    &#091;1] \"g\" \"c\" \"a\" \"a\" \"g\" \"g\" \"c\" \"c\" \"c\" \"a\" \"a\" \"t\" \"g\" \"g\" \"g\" \"g\" \"c\" \"g\" \"g\" \"c\" \"g\" \"g\" \"g\" \"c\" \"c\" \"c\" \"g\" \"g\" \"c\" \"a\" \"g\" \"c\"\n.....\n.....\n&#091;18401] \"t\" \"a\" \"g\" \"c\" \"t\" \"a\" \"c\" \"t\" \"a\" \"a\" \"a\" \"a\" \"a\" \"g\" \"c\" \"t\" \"c\" \"t\" \"t\" \"c\" \"a\" \"c\" \"a\" \"a\" \"a\" \"a\"\nattr(,\"name\")\n&#091;1] \"NC_036918.1:c70426964-70408539\"\nattr(,\"Annot\")\n&#091;1] \"&gt;NC_036918.1:c70426964-70408539 Pongo abelii isolate Susie chromosome 15, Susie_PABv2, whole genome 
shotgun sequence\"\nattr(,\"class\")\n&#091;1] \"SeqFastadna\"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The sequences are not of equal length:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>length(human&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 18577<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>length(orangutan&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 18426<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">G-C base pairs are more stable than A-T base pairs, as they are held together by three hydrogen bonds whilst A-T pairs have only two.&nbsp;It is straightforward to create a table and calculate the G-C content:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>human_table &lt;- table(human&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_table<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>a c g t <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>4244 4645 5144 4544 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_table&#091;1]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> a <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>4244 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_table&#091;2]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> c <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>4645 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_table&#091;3]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>5144 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_table&#091;4]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> t <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>4544 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em># calculate the g-c content:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gc_content_human &lt;- 
(human_table&#091;3] + human_table&#091;2]) \/ sum(human_table)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gc_content_human<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>0.5269419 <\/em><\/span>\n<span style=\"color: #ff0000;\"><em># or calculate GC content with the built-in function:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>GC(human&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 0.5269419<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>orangutan_table &lt;- table(orangutan&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_table<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>a c g t <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">4181 4591 5152 4502<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_table&#091;1]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> a <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">4181<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_table&#091;2]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> c <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">4591<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_table&#091;3]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">5152<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_table&#091;4]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> t <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">4502<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em># calculate the g-c content:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gc_content_orangutan &lt;- (orangutan_table&#091;3] + orangutan_table&#091;2]) \/ 
sum(orangutan_table)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gc_content_orangutan<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #ff0000;\"><em><span style=\"color: #0000ff;\">0.5287637<\/span> <\/em><\/span>\n<span style=\"color: #ff0000;\"><em># or calculate GC content with the built-in function:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>GC(orangutan&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 0.5287637<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">There is also a function to count individual nucleotides, or combinations of nucleotides:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f40202\" class=\"has-inline-color\">count(human&#091;&#091;1]], 1) # for single nucleotides\n<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#4202f5\" class=\"has-inline-color\">a c g t \n4244 4645 5144 4544 <\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f40202\" class=\"has-inline-color\">\ncount(human&#091;&#091;1]], 2) # for pairs of nucleotides<\/mark><\/em>\n<span style=\"color: #0000ff;\"><em>aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt <\/em><\/span>\n<em><span style=\"color: #0000ff;\">1100 834 1479 830 1317 1486 393 1449 1152 1286 1690 1016 675 1039 1582 1248<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em>count(human&#091;&#091;1]], 3) # for triplets of nucleotides (overlapping windows, not in-frame codons)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>438 162 295 204 272 249 70 243 341 348 512 278 157 177 257 239 219 291 538 269 437 464 123 462 <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt <\/em><\/span>\n<span 
style=\"color: #0000ff;\"><em> 68 117 127 81 161 367 585 336 265 231 452 204 341 440 126 379 403 488 489 310 178 208 393 237 <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt <\/em><\/span>\n<em><span style=\"color: #0000ff;\">178 150 194 153 267 333 74 365 340 333 562 347 179 286 347 436<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em># the output is a table object. So the extract is easy:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(human&#091;&#091;1]], 1)&#091;3] # using reference by place<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<em><span style=\"color: #0000ff;\">5144<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em>count(human&#091;&#091;1]], 1)&#091;\"g\"] # using reference by name<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<em><span style=\"color: #0000ff;\">5144<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em># similarly:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(human&#091;&#091;1]], 3)&#091;\"att\"]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>att <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>239<\/em> <\/span>\n\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f80606\" class=\"has-inline-color\">count(orangutan&#091;&#091;1]], 1)\n<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#0a05f7\" class=\"has-inline-color\">a c g t\n 4181 4591 5152 4502<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f80606\" class=\"has-inline-color\">\ncount(orangutan&#091;&#091;1]], 2)<\/mark><\/em>\n<em><span style=\"color: #0000ff;\">aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 1076 834 1463 807 1284 1444 413 1450 1157 1277 1701 1017 664 1036 1574 1228<\/span><\/em>\n<span style=\"color: #ff0000;\"><em>count(orangutan&#091;&#091;1]], 3)<\/em><\/span>\n<em><span style=\"color: #0000ff;\">aaa aac aag aat aca acc 
acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 422 176 280 197 267 244 70 253 344 348 497 274 144 179 246 238 216 278 533 257 413 443 134 454<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 67 127 140 79 169 376 591 314 263 230 453 211 339 430 129 379 402 477 509 313 175 206 408 228<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 175 150 197 142 265 327 80 364 344 324 555 351 176 275 329 448<\/span><\/em>\n<span style=\"color: #ff0000;\"><em># the output is a table object. So the extract is easy:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(orangutan&#091;&#091;1]], 1)&#091;3] # reference by place<\/em><\/span>\n<em><span style=\"color: #0000ff;\"> g<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 5152<\/span><\/em>\n<span style=\"color: #ff0000;\"><em>count(orangutan&#091;&#091;1]], 1)&#091;\"g\"]&nbsp; # reference by name<\/em><\/span>\n<em><span style=\"color: #0000ff;\"> g<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 5152<\/span><\/em>\n<span style=\"color: #ff0000;\"><em># similarly:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(orangutan&#091;&#091;1]], 3)&#091;\"att\"]<\/em><\/span>\n<em><span style=\"color: #0000ff;\"> att<\/span><\/em>\n<em><span style=\"color: #0000ff;\"> 238<\/span><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">So, the length of the sequences is similar (<em>18577<\/em>&nbsp;in the human sequence and 18426 in the orangutan sequence), the combination att occurs 239 times in humans and 238 times in orangutans and the G-C content is also similar (52.7%&nbsp;in humans and&nbsp;52.9% in orangutans).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To show 
the sequence lengths and G-C content in a bar chart (after reshaping with reshape2<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r5-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r5\">5<\/a><\/sup>):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em># show sequence length and GC content in each sequence (human and orangutan):<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>seq_lengths_human &lt;- lapply(human, length)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>seq_lengths_human_gc &lt;- lapply(human, GC)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>seq_lengths_orangutan &lt;- lapply(orangutan, length)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>seq_lengths_orangutan_gc &lt;- lapply(orangutan, GC)<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>lengths_human_plot &lt;- reshape2::melt(seq_lengths_human)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gcs_human_plot &lt;- reshape2::melt(seq_lengths_human_gc)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>lengths_orangutan_plot &lt;- reshape2::melt(seq_lengths_orangutan)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>gcs_orangutan_plot &lt;- reshape2::melt(seq_lengths_orangutan_gc)<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>plot_data &lt;- data.frame(human_length = lengths_human_plot, human_gc = gcs_human_plot, orangutan_length = lengths_orangutan_plot, orangutan_gc = gcs_orangutan_plot)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>plot_data<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> human_length.value human_length.L1 human_gc.value human_gc.L1<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>1 18577 NG_023302.1:4923-23499 0.5269419 NG_023302.1:4923-23499<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> orangutan_length.value orangutan_length.L1 orangutan_gc.value<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>1 18426 NC_036918.1:c70426964-70408539 0.5287637<\/em><\/span>\n<span 
style=\"color: #0000ff;\"><em> orangutan_gc.L1<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>1 NC_036918.1:c70426964-70408539<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>dev.new()<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>ggplot(plot_data) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_bar(aes(x = human_length.value, y = human_gc.value), stat = \"identity\", colour = \"black\", fill = \"blue\", width = 10) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_bar(aes(x = orangutan_length.value, y = orangutan_gc.value), stat = \"identity\", colour = \"black\", fill = \"orange\", width = 10) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_x_continuous(name = \"Sequence Length\", limits = c(18400, 18600), breaks = c(18400, 18500, 18600, 18700)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_y_continuous(name = \"G-C proportion\", limits = c(0.0, 0.6), breaks = c(0.0, 0.2, 0.4, 0.6)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>theme_bw()<\/em><\/span>\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"831\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1-1024x831.png\" alt=\"\" class=\"wp-image-3225\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1-1024x831.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1-300x244.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1-768x623.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1-1536x1247.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_1.png 1934w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A sliding window plot is useful to compare the G-C content in different parts of the sequence. 
Here, we compare the G-C content in blocks of 1000 nucleotides and plot the result with ggplot2<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r6-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r6\">6<\/a><\/sup>:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em># sliding window plot to show gc content per 1000 base pairs for sequences:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_starts &lt;- seq(1, length(human&#091;&#091;1]]) - 1000, by = 1000)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_n &lt;- length(human_starts) # how many chunks?<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_chunkGCs &lt;- numeric(human_n) # vector of zeros of length human_n<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>for (i in 1:human_n) {<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_chunk &lt;- human&#091;&#091;1]]&#091;human_starts&#091;i]:(human_starts&#091;i] + 999)]<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_chunkGC &lt;- GC(human_chunk)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_chunkGCs&#091;i] &lt;- human_chunkGC<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>}<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>human_chunkGCs<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] 0.730 0.521 0.495 0.534 0.470 0.427 0.456 0.452 0.507 0.497 0.526 0.555 0.584 0.525 0.550<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;16] 0.514 0.488 0.633<\/em><\/span>\n<span style=\"color: #ff0000;\"><em># and for orangutans:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_starts &lt;- seq(1, length(orangutan&#091;&#091;1]]) - 1000, by = 1000)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_n &lt;- length(orangutan_starts) # how many chunks?<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_chunkGCs &lt;- numeric(orangutan_n) # vector of zeros of length orangutan_n<\/em><\/span>\n<span 
style=\"color: #ff0000;\"><em>for (i in 1:orangutan_n) {<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_chunk &lt;- orangutan&#091;&#091;1]]&#091;orangutan_starts&#091;i]:(orangutan_starts&#091;i] + 999)]<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_chunkGC &lt;- GC(orangutan_chunk)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_chunkGCs&#091;i] &lt;- orangutan_chunkGC<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>}<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>orangutan_chunkGCs<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] 0.719 0.523 0.499 0.528 0.463 0.435 0.464 0.451 0.498 0.510 0.536 0.556 0.585 0.534 0.553<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;16] 0.528 0.494 0.638<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>dev.new()<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>ggplot() + <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_line(aes(x = human_starts, y = human_chunkGCs), colour = \"blue\") +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_line(aes(x = orangutan_starts, y = orangutan_chunkGCs), colour = \"orange\") +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_x_continuous(name = \"Sequence Position\", limits = c(0, 16000), breaks = c(5000, 10000, 15000)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_y_continuous(name = \"GC proportion\", limits = c(0.0, 0.6), breaks = c(0.0, 0.2, 0.4, 0.6)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>theme_bw()<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"820\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2-1024x820.png\" alt=\"\" class=\"wp-image-3230\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2-1024x820.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2-300x240.png 300w, 
https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2-768x615.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2-1536x1230.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_2.png 1956w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To show the amino acids the nucleotides code for:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>getTrans(human&#091;&#091;1]]) # this will give the standard one letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"S\" \"P\" \"A\" \"R\" \"P\" \"N\" \"G\" \"A\" \"A\" \"G\" \"P\" \"A\" \"A\" \"P\" \"P\" \"R\" \"W\" \"C\" \"P\" \"R\" \"G\" \"Q\" \"R\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>- - - -<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;6188] \"K\" \"A\" \"L\" \"H\" \"K\"<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>aaa(getTrans(human&#091;&#091;1]])) # will give the three letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"Ser\" \"Pro\" \"Ala\" \"Arg\" \"Pro\" \"Asn\" \"Gly\" \"Ala\" \"Ala\" \"Gly\" \"Pro\" \"Ala\" \"Ala\" \"Pro\" \"Pro\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>- - - -<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;6181] \"Gly\" \"Ala\" \"Ile\" \"Phe\" \"Ser\" \"Tyr\" \"Stp\" \"Lys\" \"Ala\" \"Leu\" \"His\" \"Lys\"<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>getTrans(orangutan&#091;&#091;1]]) # this will give the standard one letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"A\" \"R\" \"P\" \"N\" \"G\" \"A\" \"A\" \"G\" \"P\" \"A\" \"A\" \"P\" \"P\" \"R\" \"W\" \"C\" \"P\" \"R\" \"G\" \"P\" \"R\" \"P\" \"P\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>- - - -&nbsp;<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;6142] \"K\"<\/em><\/span>\n<span style=\"color: 
#ff0000;\"><em>aaa(getTrans(orangutan&#091;&#091;1]])) # will give the three letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"Ala\" \"Arg\" \"Pro\" \"Asn\" \"Gly\" \"Ala\" \"Ala\" \"Gly\" \"Pro\" \"Ala\" \"Ala\" \"Pro\" \"Pro\" \"Arg\" \"Trp\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>- - - - -&nbsp;<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;6136] \"Thr\" \"Lys\" \"Lys\" \"Leu\" \"Phe\" \"Thr\" \"Lys\"<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Traditionally, dot plots are used to compare DNA sequences. The seqinr package<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r7-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r7\">7<\/a><\/sup> does include a function to create dot plots.&nbsp;The number of nucleotides (wsize), step (wstep for overlap) and number of matches (nmatch) can be set. To compare chunks of 9 nucleotides, without overlap,&nbsp; in the DNA sequences of the IDH2 gene in humans and orangutans:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>dotPlot(human&#091;&#091;1]], orangutan&#091;&#091;1]], wsize = 9, wstep = 9,nmatch = 9, col = c(\"white\", \"black\"), xlab =deparse(substitute(human&#091;&#091;1]])), ylab =deparse(substitute(orangutan&#091;&#091;1]])))<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Depending on the computer, it can take some time for the comparison to be completed:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"815\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3-1024x815.png\" alt=\"\" class=\"wp-image-3235\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3-1024x815.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3-300x239.png 300w, 
https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3-768x612.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3-1536x1223.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_3.png 2042w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The plot produced by the standard function is a little difficult to assess, but this will be addressed later. If the sequences were the same, there would be a diagonal line&nbsp;from the origin of the plot to the upper right corner (similar to that seen in the dot plot&nbsp;<a href=\"https:\/\/pcool.dyndns.org\/index.php\/ladys-slipper-orchid\/\" data-type=\"page\" data-id=\"2239\">here<\/a>).&nbsp; There seems to be little resemblance. Why is this? The sequences are not of the same length and are not aligned. To compare the sequences, it is necessary to align them first.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compare sequences, first define a <em><strong>scoring matrix<\/strong><\/em>. 
Here a score of +2 is given for a match and a score of -1 for a mismatch:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>sigma &lt;- pwalign::nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>sigma # Print out the matrix<\/em><\/span>\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1f01f1\" class=\"has-inline-color\">   A  C  G  T\nA  2 -1 -1 -1\nC -1  2 -1 -1\nG -1 -1  2 -1\nT -1 -1 -1  2<\/mark><\/em><\/code><\/pre>\n\n\n\n<p class=\"is-style-text-annotation is-style-text-annotation--2 wp-block-paragraph\">If you used reshape2 or dplyr, you may need to restart R if you get a can&#8217;t unload package error message.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The sequences also need to be converted to strings:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">human_string &lt;- getSequence(human&#091;&#091;1]], as.string = TRUE)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">orangutan_string &lt;- getSequence(orangutan&#091;&#091;1]], as.string = TRUE) <\/span><\/em>\n<em><span style=\"color: #ff0000;\">human_string &lt;- toupper(human_string&#091;&#091;1]]) # convert to upper case<\/span><\/em>\n<em><span style=\"color: #ff0000;\">orangutan_string &lt;- toupper(orangutan_string&#091;&#091;1]]) # convert to upper case<\/span><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now align the strings using the scoring matrix and setting the gap penalties:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f00727\" class=\"has-inline-color\">comparison &lt;- pwalign::pairwiseAlignment(human_string, orangutan_string, substitutionMatrix = sigma, gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)\ncomparison<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#4407f0\" 
class=\"has-inline-color\">\nGlobal PairwiseAlignmentsSingleSubject (1 of 1)\npattern: TCCCCGGCAAGGCCCAATGGGGCGGCAGGCCCGGCAGCCCCGCCCCGGTGGTGCCCGCGCGGCCAGCGCCCGCCAGGCC...ATGTTTTGCATACTGTAATTTATATTGCCCTTGGAACACATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA\nsubject: ------GCAAGGCCCAATGGGGCGGCGGGCCCGGCAGCCCCGCCCCGGTGGTGTCCGCGCGGCCCGCGCCCGCCAGGCC...ATGTTTTGCATACTGTAATTTATATTGCCCTTGGAACACATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA\nscore: 32478 <\/mark><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Considering there are 18577 base pairs and a score of 2 is given for a match, a score of 32478 is very convincing for similarity!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A function to review the aligned sequences is not part of the standard packages. However, Avril Coghlan<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r8-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r8\">8<\/a><\/sup> has published a function for this on her <a href=\"http:\/\/a-little-book-of-r-for-bioinformatics.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noopener\">website<\/a>. The function can be seen on&nbsp;<a href=\"http:\/\/a-little-book-of-r-for-bioinformatics.readthedocs.io\/en\/latest\/src\/chapter4.html\" target=\"_blank\" rel=\"noopener\">this page<\/a>&nbsp;and can be downloaded from&nbsp;<a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/R_functions\/pairwise_alignment.txt\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. To use the function, copy and paste&nbsp;it&nbsp;into the R console and execute it. The function should now be available to R as &#8220;printPairwiseAlignment&#8221;. 
To show the alignment using the function:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fb0202\" class=\"has-inline-color\">printPairwiseAlignment(alignment=comparison) \nLoading required package: pwalign\n\nAttaching package: \u2018pwalign\u2019\n\nThe following objects are masked from \u2018package:Biostrings\u2019:\n\n    aligned, alignedPattern, alignedSubject, compareStrings, deletion, errorSubstitutionMatrices, indel, insertion, mismatchSummary, mismatchTable,\n    nedit, nindel, nucleotideSubstitutionMatrix, pairwiseAlignment, PairwiseAlignments, PairwiseAlignmentsSingleSubject, pattern, pid,\n    qualitySubstitutionMatrices, stringDist, unaligned, writePairwiseAlignments<\/mark>\n\n<mark style=\"background-color:rgba(0, 0, 0, 0);color:#1205f2\" class=\"has-inline-color\">&#091;1] \"GCAAGGCCCAATGGGGCGGCAGGCCCGGCAGCCCCGCCCCGGTGGTGCCCGCGCGGCCAG 60\"\n&#091;1] \"GCAAGGCCCAATGGGGCGGCGGGCCCGGCAGCCCCGCCCCGGTGGTGTCCGCGCGGCCCG 60\"\n&#091;1] \" \"<\/mark><\/em>\n.....\n.....<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1205f2\" class=\"has-inline-color\">\n18531\"\n&#091;1] \"TGTTTTACCTCAGCCAGTCAGTATGTTTTGCATACTGTAATTTATATTGCCCTTGGAACA 18386\"\n&#091;1] \" \"\n&#091;1] \"CATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA 18591\"\n&#091;1] \"CATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA 18446\"\n&#091;1] \" \"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This looks promising. 
To create a new dot plot with the aligned sequences:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fa0707\" class=\"has-inline-color\"><em>comparison@pattern # this is the human\n<\/em><\/mark><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#2e07fa\" class=\"has-inline-color\">&#091;7] GCAAGGCCCAATGGGGCGGCAGGCCCGGCAGCCCCGCCCCGGTGGTGCCCGCGCGGCCAGCGCCCGCCAGGCCCAGCGT...TATGTTTTGCATACTGTAATTTATATTGCCCTTGGAACACATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA <\/mark><\/em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fa0707\" class=\"has-inline-color\">\n<em>comparison@subject # this is the orangutan\n<\/em><\/mark><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#4707fa\" class=\"has-inline-color\">&#091;1] GCAAGGCCCAATGGGGCGGCGGGCCCGGCAGCCCCGCCCCGGTGGTGTCCGCGCGGCCCGCGCCCGCCAGGCCCAGCGT...TATGTTTTGCATACTGTAATTTATATTGCCCTTGGAACACATGGTGCCATATTTAGCTACTAAAAAGCTCTTCACAAAA <\/mark><\/em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fa0707\" class=\"has-inline-color\"><em>\n\nhuman_seq &lt;- as.character(comparison@pattern)\norangutan_seq &lt;- as.character(comparison@subject)\nhuman_seq &lt;- getSequence(human_seq)\norangutan_seq &lt;- getSequence(orangutan_seq)\ndotPlot(human_seq, orangutan_seq, wsize=9,wstep=9,nmatch=9,col=c('white', 'black'))<\/em><\/mark><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"786\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-1024x786.png\" alt=\"\" class=\"wp-image-3240\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-1024x786.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-300x230.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-768x590.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-1536x1179.png 1536w, 
https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_4-2048x1572.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This looks much better and the DNA sequences look very similar. However, the dot plot function included in the seqinr<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r9-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r9\">9<\/a><\/sup> package is not very versatile. It would be better to plot the comparison with ggplot2<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r10-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r10\">10<\/a><\/sup>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The main advantage of open source software is that the code of functions is available to the user. The code of the dotPlot function was changed so that it returns a matrix (with sequence 1 as rows and sequence 2 as columns) rather than a plot. The row names are the windows of the first sequence and the column names those of the second sequence. Copy and paste&nbsp;<a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/R_functions\/dot_plot_data.txt\" target=\"_blank\" rel=\"noreferrer noopener\">this function<\/a>&nbsp;(called dot_plot_data) into the R console and execute it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The new function (dot_plot_data) should now be available to R (it can also be&nbsp;viewed and downloaded&nbsp;from the <a href=\"https:\/\/pcool.dyndns.org\/index.php\/functions\/\" data-type=\"page\" data-id=\"24\">download page<\/a>). 
Calling the function to compare windows of nine&nbsp;nucleotides, requiring all&nbsp;nine to match, takes the same format as the dotPlot function from the seqinr package<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r11-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r11\">11<\/a><\/sup>:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">new_plot &lt;- dot_plot_data(human_seq, orangutan_seq, wsize = 9, wstep = 9, nmatch = 9)<\/span><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To&nbsp;show the structure (str) and class of the returned object:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>str(new_plot)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> logi &#091;1:2071, 1:2071] TRUE FALSE FALSE FALSE FALSE FALSE ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> - attr(*, \"dimnames\")=List of 2<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> ..$ : chr &#091;1:2071] \"GCAAGGCCC\" \"AATGGGGCG\" \"GCAGGCCCG\" \"GCAGCCCCG\" ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> ..$ : chr &#091;1:2071] \"GCAAGGCCC\" \"AATGGGGCG\" \"GCGGGCCCG\" \"GCAGCCCCG\" ...<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>class(new_plot)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] \"matrix\" \"array\"<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The data are now available as a matrix called new_plot. The row names are sequence 1 and the column names sequence 2. Unfortunately, ggplot2 can&#8217;t plot matrices directly. 
Consequently, the matrix needs to be converted to a data frame suitable for plotting.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, extract the row names and column names as vectors:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>sequence_x &lt;- rownames(new_plot)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>sequence_y &lt;- colnames(new_plot)<\/em><\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Secondly, remove the row and column names from the matrix (the names are not unique and otherwise ggplot2&nbsp;will group by them):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\">rownames(new_plot) &lt;- NULL<\/span>\n<span style=\"color: #ff0000;\">colnames(new_plot) &lt;- NULL<\/span><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Thirdly, use the reshape2 package<sup class=\"sup-ref-note\" id=\"note-zotero-ref-p2232-r12-o1\"><a class=\"sup-ref-note\" href=\"#zotero-ref-p2232-r12\">12<\/a><\/sup> to reshape the data to allow plotting (the row and column numbers are unique, representing the number in the sequence):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fe0404\" class=\"has-inline-color\">library(reshape2<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f60202\" class=\"has-inline-color\">)<\/mark><\/em>\n<em><span style=\"color: #ff0000;\">new_plot_2 &lt;- melt(new_plot)<\/span><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The data frame new_plot_2 has 3 columns (variables in tidy format).&nbsp;<strong>Var1<\/strong>&nbsp;represents sequence 1 (x coordinate) and&nbsp;<strong>Var2<\/strong>&nbsp;sequence 2 (y coordinate). The&nbsp;<strong>value&nbsp;<\/strong>variable contains the value stored in the matrix. 
To plot the data (the %&gt;% pipe requires the magrittr or dplyr package to be loaded):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">dot_plot &lt;- new_plot_2 %>%<\/span><\/em>\n<em><span style=\"color: #ff0000;\">ggplot(aes(x = Var1, y = Var2, fill = value)) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">geom_raster() +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_fill_manual(values = c(\"lightgray\", \"blue\"), name = \"Pair\", <\/span><\/em>\n  <em><span style=\"color: #ff0000;\">labels = c(\"Different\", \"Same\")) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_x_continuous(\"Human\") +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_y_continuous(\"Orangutan\") + <\/span><\/em>\n<em><span style=\"color: #ff0000;\">theme_bw() <\/span><\/em>\n\n<em><span style=\"color: #ff0000;\">dev.new()<\/span><\/em>\n<em><span style=\"color: #ff0000;\">dot_plot<\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"803\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5-1024x803.png\" alt=\"\" class=\"wp-image-3245\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5-1024x803.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5-300x235.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5-768x602.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5-1536x1204.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_5.png 1998w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If the sequences were shown on the axes, the axes would become too crowded. 
However, the plot can be windowed (here between 10 and 20):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">dot_plot_2 &lt;- new_plot_2 %>%<\/span><\/em>\n<em><span style=\"color: #ff0000;\">ggplot(aes(x = Var1, y = Var2, fill = value)) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">geom_raster() +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_fill_manual(values = c(\"lightgray\", \"blue\"), name = \"Pair\",<\/span><\/em>\n  <em><span style=\"color: #ff0000;\">labels = c(\"Different\", \"Same\")) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_x_continuous(\"Human\", breaks = 1: length(sequence_x), <\/span><\/em>\n  <em><span style=\"color: #ff0000;\">labels = sequence_x, <strong>limit = c(10,20)<\/strong>) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">scale_y_continuous(\"Orangutan\", breaks = 1: length(sequence_y), <\/span><\/em>\n  <em><span style=\"color: #ff0000;\">labels = sequence_y, <strong>limit = c(10,20)<\/strong>) + <\/span><\/em>\n<em><span style=\"color: #ff0000;\">theme_bw() <\/span><\/em>\n\n<em><span style=\"color: #ff0000;\">dev.new()<\/span><\/em>\n<em><span style=\"color: #ff0000;\">dot_plot_2<\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"577\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-1024x577.png\" alt=\"\" class=\"wp-image-3250\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-1024x577.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-300x169.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-768x432.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-1536x865.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/idh2_6-2048x1153.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" 
\/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Please make sure the necessary packages (seqinr and Biostrings) are&nbsp;installed&nbsp;as described to allow analysis. The Biostrings and pwalign packages are part of Bioconductor and installation is a little different: If you used reshape2 or dplyr, you may need to restart R if you get a can&#8217;t unload package error message. Furthermore, the ggplot2 and reshape2 [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"inline_featured_image":false,"footnotes":""},"class_list":["post-2232","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/comments?post=2232"}],"version-history":[{"count":8,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2232\/revisions"}],"predecessor-version":[{"id":4566,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2232\/revisions\/4566"}],"wp:attachment":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/media?parent=2232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}