{"id":2239,"date":"2018-04-02T15:07:24","date_gmt":"2018-04-02T14:07:24","guid":{"rendered":"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=2239"},"modified":"2025-06-29T09:35:19","modified_gmt":"2025-06-29T08:35:19","slug":"ladys-slipper-orchid","status":"publish","type":"page","link":"https:\/\/pcool.dyndns.org\/index.php\/ladys-slipper-orchid\/","title":{"rendered":"Lady&#039;s Slipper Orchid"},"content":{"rendered":"\n<p>Please make sure the necessary packages (seqinr<sup class='sup-ref-note' id='note-zotero-ref-p2239-r1-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r1'>1<\/a><\/sup> and Biostrings<sup class='sup-ref-note' id='note-zotero-ref-p2239-r2-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r2'>2<\/a><\/sup>) are<a href=\"https:\/\/pcool.dyndns.org\/index.php\/packages\/\" data-type=\"page\" data-id=\"22\">&nbsp;installed&nbsp;as described<\/a>&nbsp;to allow analysis. Furthermore, the ggplot2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r3-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r3'>3<\/a><\/sup> and reshape2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r4-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r4'>4<\/a><\/sup> packages should be loaded.<\/p>\n\n\n\n<p>The genetic code for the lady&#8217;s slipper orchid <a href=\"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=2239\">(<\/a><i>Cypripedium calceolus)<\/i>&nbsp;can be&nbsp;<a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/data_files\/ls_orchid.fasta\" target=\"_blank\" rel=\"noreferrer noopener\">downloaded<\/a>. Select all the text (Ctrl-A), copy (Ctrl-C) and paste (Ctrl-V) it into a text editor. Save the file as <strong>ls_orchid.fasta<\/strong> (<em><strong>without<\/strong><\/em> a txt extension!) in the working directory (folder) of R.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"195\" height=\"258\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/LS_orchid.jpg\" alt=\"\" class=\"wp-image-3270\"\/><\/figure>\n\n\n\n<p>Load the required package, import the fasta file as a list (orchid) and review the structure (str):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f20525\" class=\"has-inline-color\">library(seqinr)\norchid &lt;- read.fasta(file='~\/Desktop\/ls_orchid.fasta')\nstr(orchid)<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#2a05f1\" class=\"has-inline-color\">\nList of 94\n $ gi|2765658|emb|Z78533.1|CIZ78533: 'SeqFastadna' chr &#091;1:740] \"c\" \"g\" \"t\" \"a\" ...\n  ..- attr(*, \"name\")= chr \"gi|2765658|emb|Z78533.1|CIZ78533\"\n  ..- attr(*, \"Annot\")= chr \"&gt;gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\"\n.....\n.....\nP.purpuratum 5.8S rRNA gene and ITS1 and ITS2 DNA\"\n $ gi|2765564|emb|Z78439.1|PBZ78439: 'SeqFastadna' chr &#091;1:592] \"c\" \"a\" \"t\" \"t\" ...\n  ..- attr(*, \"name\")= chr \"gi|2765564|emb|Z78439.1|PBZ78439\"\n  ..- attr(*, \"Annot\")= chr \"&gt;gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA\"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The information is stored in a list. Every sequence can be&nbsp;viewed separately. To view the first sequence:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f00727\" class=\"has-inline-color\">orchid&#091;1]<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#0608ef\" class=\"has-inline-color\">\n$`gi|2765658|emb|Z78533.1|CIZ78533`\n  &#091;1] \"c\" \"g\" \"t\" \"a\" \"a\" \"c\" \"a\" \"a\" \"g\" \"g\" \"t\" \"t\" \"t\" \"c\" \"c\" \"g\" \"t\" \"a\" \"g\" \"g\" \"t\" \"g\" \"a\" \"a\" \"c\" \"c\" \"t\" \"g\" \"c\" \"g\" \"g\" \"a\" \"a\"\n.....\n.....\n&#091;694] \"t\" \"t\" \"c\" \"g\" \"g\" \"a\" \"t\" \"g\" \"t\" \"g\" \"a\" \"c\" \"c\" \"c\" \"c\" \"a\" \"g\" \"g\" \"t\" \"c\" \"a\" \"g\" \"g\" \"c\" \"g\" \"g\" \"g\" \"g\" \"g\" \"c\" \"a\" \"c\" \"c\"\n&#091;727] \"c\" \"g\" \"c\" \"t\" \"g\" \"a\" \"g\" \"t\" \"t\" \"t\" \"a\" \"c\" \"g\" \"c\"\nattr(,\"name\")\n&#091;1] \"gi|2765658|emb|Z78533.1|CIZ78533\"\nattr(,\"Annot\")\n&#091;1] \"&gt;gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\"\nattr(,\"class\")\n&#091;1] \"SeqFastadna\"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>and the last:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f50a1a\" class=\"has-inline-color\">orchid&#091;94]<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#0914f4\" class=\"has-inline-color\">\n$`gi|2765564|emb|Z78439.1|PBZ78439`\n  &#091;1] \"c\" \"a\" \"t\" \"t\" \"g\" \"t\" \"t\" \"g\" \"a\" \"g\" \"a\" \"t\" \"c\" \"a\" \"c\" \"a\" \"t\" \"a\" \"a\" \"t\" \"a\" \"a\" \"t\" \"t\" \"g\" \"a\" \"t\" \"c\" \"g\" \"a\" \"g\" \"t\" \"t\"\n.....<\/mark><\/em>\n.....<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#0914f4\" class=\"has-inline-color\">\n&#091;529] \"g\" \"t\" \"g\" \"g\" \"a\" \"c\" \"g\" \"a\" \"a\" \"c\" \"t\" \"a\" \"t\" \"g\" \"c\" \"t\" \"a\" \"c\" \"a\" \"a\" \"c\" \"a\" \"a\" \"a\" \"a\" \"t\" \"t\" \"g\" \"t\" \"t\" \"g\" \"t\" \"g\"\n&#091;562] \"c\" \"a\" \"a\" \"a\" \"t\" \"g\" \"c\" \"c\" \"c\" \"c\" \"g\" \"g\" \"t\" \"t\" \"g\" \"g\" \"c\" \"c\" \"g\" \"t\" \"t\" \"t\" \"a\" \"g\" \"t\" \"t\" \"g\" \"g\" \"g\" \"c\" \"c\"\nattr(,\"name\")\n&#091;1] \"gi|2765564|emb|Z78439.1|PBZ78439\"\nattr(,\"Annot\")\n&#091;1] \"&gt;gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA\"\nattr(,\"class\")\n&#091;1] \"SeqFastadna\"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>To show the total number of sequences:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f00505\" class=\"has-inline-color\">&gt; number_sequences &lt;- length(orchid)\n&gt; number_sequences\n<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3a05f0\" class=\"has-inline-color\">&#091;1] 94<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>To show the length of the first and the last sequence:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>length(orchid&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 740<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>length(orchid&#091;&#091;number_sequences]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 592<\/em><\/span><\/code><\/pre>\n\n\n\n<p>To create a table of the nucleotides and calculate the GC (Guanine &#8211; Cytosine) content (more stable as three hydrogen bonds):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">table &lt;- table(orchid&#091;&#091;1]])<\/span><\/em>\n<em><span style=\"color: #ff0000;\">table<\/span><\/em>\n<span style=\"color: #0000ff;\"><em>a c g t <\/em><\/span>\n<em><span style=\"color: #ff0000;\"><span style=\"color: #0000ff;\">144 200 241 155<\/span> <\/span><\/em>\n<em><span style=\"color: #ff0000;\">str(table)<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> 'table' int &#091;1:4(1d)] 144 200 241 155<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> - attr(*, \"dimnames\")=List of 1<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> ..$ : chr &#091;1:4] \"a\" \"c\" \"g\" \"t\"<\/em><\/span>\n<em><span style=\"color: #ff0000;\">table&#091;1]<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> a <\/em><\/span>\n<em><span style=\"color: #ff0000;\"><span style=\"color: #0000ff;\">144<\/span> <\/span><\/em>\n<em><span style=\"color: #ff0000;\">table&#091;2]<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> c <\/em><\/span>\n<em><span style=\"color: #ff0000;\"><span style=\"color: #0000ff;\">200<\/span> <\/span><\/em>\n<em><span style=\"color: #ff0000;\">table&#091;3]<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<em><span style=\"color: #ff0000;\"><span style=\"color: #0000ff;\">241<\/span> <\/span><\/em>\n<em><span style=\"color: #ff0000;\">table&#091;4]<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> t <\/em><\/span>\n<em><span style=\"color: #ff0000;\"><span style=\"color: #0000ff;\">155<\/span> <\/span><\/em>\n<em><span style=\"color: #ff0000;\"># calculate the gc content:<\/span><\/em>\n<em><span style=\"color: #ff0000;\">gc_content &lt;- (table&#091;3] + table&#091;2]) \/ sum(table)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">gc_content<\/span><\/em>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>0.5959459 <\/em><\/span><\/code><\/pre>\n\n\n\n<p>Or, the easy way:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>GC(orchid&#091;&#091;1]])<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 0.5959459<\/em><\/span><\/code><\/pre>\n\n\n\n<p>So, almost 60%. It is also easy to count single nucleotides or combinations thereof:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 1)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>a c g t <\/em><\/span>\n<em><span style=\"color: #0000ff;\">144 200 241 155<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 2)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt <\/em><\/span>\n<em><span style=\"color: #0000ff;\">41 27 39 37 39 66 61 33 53 70 73 45 11 36 68 40<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 3)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct <\/em><\/span>\n<span style=\"color: #0000ff;\"><em> 7 11 12 11 3 15 7 2 6 11 16 6 4 10 15 8 11 4 12 12 14 23 20 9 <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt <\/em><\/span>\n<span style=\"color: #0000ff;\"><em> 9 14 25 13 3 11 9 10 19 9 12 13 16 18 22 13 19 26 15 13 3 10 22 10 <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt <\/em><\/span>\n<em><span style=\"color: #0000ff;\"> 4 3 3 1 6 10 11 9 19 19 17 13 1 5 22 12<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em># the output is a table object. So the extract is easy:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 1)&#091;3]&nbsp;<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<span style=\"color: #0000ff;\"><em>241 <\/em><\/span><\/code><\/pre>\n\n\n\n<p>Or better, refer by name:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 1)&#091;\"g\"]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> g <\/em><\/span>\n<em><span style=\"color: #0000ff;\">241<\/span> <\/em>\n<span style=\"color: #ff0000;\"><em># similarly to count a specific triplets of nucleotides:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>count(orchid&#091;&#091;1]], 3)&#091;\"att\"]<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>att <\/em><\/span>\n<span style=\"color: #0000ff;\"><em> 8<\/em><\/span><\/code><\/pre>\n\n\n\n<p>Show the GC content for every sequence, reshape the data with reshape2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r5-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r5'>5<\/a><\/sup>&nbsp; and plot with ggplot2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r6-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r6'>6<\/a><\/sup>:<\/p>\n\n\n\n<p>Show sequence length and GC content in each sequence, reshape and plot the data:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f30808\" class=\"has-inline-color\">library(reshape2)<\/mark><\/em>\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f40505\" class=\"has-inline-color\">library(ggplot2)<\/mark><\/em>\n<span style=\"color: #ff0000;\"><em>seq_lengths &lt;- lapply(orchid, length)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>seq_lengths_gc &lt;- lapply(orchid, GC)<\/em><\/span>\n<em><span style=\"color: #ff0000;\">gcs_plot &lt;- melt(seq_lengths_gc)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">lengths_plot &lt;- melt(seq_lengths)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">plot_data &lt;- data.frame(length = lengths_plot, gc = gcs_plot)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">ggplot(plot_data, aes(x = length.value, y = gc.value)) +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">geom_point() +<\/span><\/em>\n<em><span style=\"color: #ff0000;\">theme_bw()<\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"804\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc-1024x804.png\" alt=\"\" class=\"wp-image-3325\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc-1024x804.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc-300x236.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc-768x603.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc-1536x1206.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_gc.png 2004w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>So the majority of sequences have a GC content of around 50%. There is however a group of longer sequences that have a GC content of more than 55%.<\/p>\n\n\n\n<p>Sliding window plots are often used to show how the GC content varies along the sequence. Here, we create a sliding window plot of the longest (38th) sequence per 100 base pairs:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em># sliding window plot to show&nbsp;GC content per 100 base pairs for sequence 38:<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>starts &lt;- seq(1, length(orchid&#091;&#091;38]]) - 100, by = 100)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>n &lt;- length(starts) # how many chunks?<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>chunkGCs &lt;- numeric(n) # null vector that has the same length as n<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>for (i in 1:n) {<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>chunk &lt;- orchid&#091;&#091;38]]&#091;starts&#091;i]:(starts&#091;i] + 99)]<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>chunkGC &lt;- GC(chunk)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>chunkGCs&#091;i] &lt;- chunkGC<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>}<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>chunkGCs<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] 0.43 0.60 0.55 0.45 0.59 0.57 0.49<\/em><\/span>\n\n<span style=\"color: #ff0000;\"><em>plot_data_2 &lt;- data.frame(location = starts, GC = chunkGCs)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>dev.new()<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>ggplot(plot_data_2, aes(x = location, y = GC)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_line() +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>theme_bw()<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"784\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-1024x784.png\" alt=\"\" class=\"wp-image-3300\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-1024x784.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-300x230.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-768x588.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-1536x1176.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchd_sliding_window-2048x1568.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>It is straight forward to get the single and triple letter amino acids coded for:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>getTrans(orchid&#091;&#091;38]]) # this will give the standard one letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"R\" \"N\" \"K\" \"V\" \"S\" \"V\" \"G\" \"E\" \"P\" \"P\" \"E\" \"G\" \"S\" \"L\" \"L\" \"R\" \"S\" \"H\" \"N\" \"N\" \"*\" \"S\" \"R\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>- - -<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;254] \"T\" \"P\" \"G\" \"Q\" \"*\" \"A\" \"T\" \"R\" \"*\" \"V\"<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>aaa(getTrans(orchid&#091;&#091;38]])) # will give the three letter abbreviation<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> &#091;1] \"Arg\" \"Asn\" \"Lys\" \"Val\" \"Ser\" \"Val\" \"Gly\" \"Glu\" \"Pro\" \"Pro\" \"Glu\" \"Gly\" \"Ser\" \"Leu\" \"Leu\"<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>-&nbsp; --<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;256] \"Gly\" \"Gln\" \"Stp\" \"Ala\" \"Thr\" \"Arg\" \"Stp\" \"Val\"<\/em><\/span><\/code><\/pre>\n\n\n\n<p>Dot plots are traditionally used to compare DNA sequences. The seqinr<sup class='sup-ref-note' id='note-zotero-ref-p2239-r7-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r7'>7<\/a><\/sup> package contains a function that creates a dot plot (called dotPlot). The number of nucleotides (wsize), step (wstep for overlap) and number of matches (nmatch) can be set. To compare single nucleotides of the 38th orchid sequence with itself:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>dev.new()<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>dotPlot(orchid&#091;&#091;38]], orchid&#091;&#091;38]], wsize = 1, wstep = 1, nmatch = 1)<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"855\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1-1024x855.png\" alt=\"\" class=\"wp-image-3305\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1-1024x855.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1-300x251.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1-768x642.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1-1536x1283.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_1.png 2028w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The diagonal line&nbsp;is uninterrupted, indicating the sequences are the same (obviously as the sequence was compared to itself).<\/p>\n\n\n\n<p>Or to compare triplets of nucleotides of the 38th orchid sequence with itself, specifying that all three nucleotides need to match (nmatch =3):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>dev.new()<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>dotPlot(orchid&#091;&#091;38]], orchid&#091;&#091;38]], wsize = 3, wstep = 3, nmatch = 3)<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"787\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3-1024x787.png\" alt=\"\" class=\"wp-image-3310\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3-1024x787.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3-300x231.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3-768x591.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3-1536x1181.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_3.png 1982w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The dot plot function included in the package is not very versatile. It would be better to plot with ggplot2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r8-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r8'>8<\/a><\/sup>. The main advantage of open source software is that the code of the functions is available to the user. The code of the dot plot was changed so that it returns a matrix (with sequence 1 as rows and sequence 2 as columns). The row names are the first sequence and the column names the second sequence. Copy and paste <a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/R_functions\/dot_plot_data.txt\" target=\"_blank\" rel=\"noreferrer noopener\">this function<\/a> (called dot_plot_data) into the R console.<\/p>\n\n\n\n<p>The new function (dot_plot_data) should now be available to R (it can also be&nbsp;<a href=\"https:\/\/pcool.dyndns.org\/index.php\/functions\/\" data-type=\"page\" data-id=\"24\" target=\"_blank\" rel=\"noreferrer noopener\">viewed and downloaded<\/a> from the download page). To access the function comparing three nucleotides and matching all three nucleotides takes the same format as the dotPlot function from the seqinr package<sup class='sup-ref-note' id='note-zotero-ref-p2239-r9-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r9'>9<\/a><\/sup>&nbsp;.<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>new_plot &lt;- dot_plot_data(orchid&#091;&#091;38]], orchid&#091;&#091;38]], wsize = 3, wstep = 3, nmatch = 3)<\/em><\/span>\n<em><span style=\"color: #0000ff;\">Modification of dotPlot function in package seqinr.<\/span><\/em>\n<em><span style=\"color: #0000ff;\">The function returns a matrix for plotting.<\/span><\/em>\n<em><span style=\"color: #0000ff;\">Row names contain Sequence1, Col names Sequence2.<\/span><\/em>\n<em><span style=\"color: #0000ff;\">Use the reshape2 package to convert and plot with ggplot2 and geom_raster.<\/span><\/em><\/code><\/pre>\n\n\n\n<p>To&nbsp;show the structure and class of the returned object:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>str(new_plot)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> logi &#091;1:263, 1:263] TRUE FALSE FALSE FALSE FALSE FALSE ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> - attr(*, \"dimnames\")=List of 2<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> ..$ : chr &#091;1:263] \"cgt\" \"aac\" \"aag\" \"gtt\" ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> ..$ : chr &#091;1:263] \"cgt\" \"aac\" \"aag\" \"gtt\" ...<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>class(new_plot)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;&#091;1] \"matrix\" \"array\"<\/em><\/span><\/code><\/pre>\n\n\n\n<p>The data is now available as a matrix called new_plot. The row names are sequence 1 and the column names sequence 2. Unfortunately, ggplot2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r10-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r10'>10<\/a><\/sup> can&#8217;t plot matrices directly. Consequently, the matrix need to be converted to a data frame suitable for plotting,<\/p>\n\n\n\n<p>First, extract the row names and column names as vectors:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">sequence_x &lt;- rownames(new_plot)<\/span><\/em>\n<em><span style=\"color: #ff0000;\">sequence_y &lt;- colnames(new_plot)<\/span><\/em><\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>Secondly, remove the row and column names from the matrix (the names are not unique and otherwise ggplot2<sup class='sup-ref-note' id='note-zotero-ref-p2239-r11-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r11'>11<\/a><\/sup> will group by them)<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">rownames(new_plot) &lt;- NULL<\/span><\/em>\n<em><span style=\"color: #ff0000;\">colnames(new_plot) &lt;- NULL<\/span><\/em><\/code><\/pre>\n\n\n\n<p>Thirdly, use the reshape2 package<sup class='sup-ref-note' id='note-zotero-ref-p2239-r12-o1'><a class='sup-ref-note' href='#zotero-ref-p2239-r12'>12<\/a><\/sup> to reshape the data to allow plotting (the row and column numbers are unique, representing the number in the sequence):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">new_plot_2 &lt;- melt(new_plot)<\/span><\/em><\/code><\/pre>\n\n\n\n<p>The data frame new_plot_2 has 3 columns (variables in tidy format). <strong>Var1<\/strong> represents sequence 1 (x coordinate) and <strong>Var2<\/strong> sequence 2 (y coordinate). The <strong>value&nbsp;<\/strong>variable contains the value stored in the matrix. To plot the data:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>dot_plot &lt;-\u00a0 ggplot(new_plot_2, <\/em><\/span>\n     <span style=\"color: #ff0000;\"><em>aes(x = Var1, y = Var2, fill = value)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_raster() +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_fill_manual(values = c(\"lightgray\", \"blue\"), name = \"Pair\", labels = c(\"Different\", \"Same\")) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_x_continuous(\"Sequence 1\") +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_y_continuous(\"Sequence 2\") + <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>theme_bw()<\/em><\/span> \n<span style=\"color: #ff0000;\"><em>dot_plot<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"649\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-1024x649.png\" alt=\"\" class=\"wp-image-3320\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-1024x649.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-300x190.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-768x487.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-1536x974.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot-2048x1298.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>It is possible to show the nucleotides on the axes. However, they will become too crowded. It is possible to zoom in though:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>dot_plot_2 &lt;- ggplot(new_plot_2, <\/em><\/span>\n     <span style=\"color: #ff0000;\"><em>aes(x = Var1, y = Var2, fill = value)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>geom_raster() +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_fill_manual(values = c(\"lightgray\", \"blue\"), name = \"Pair\", labels = c(\"Different\", \"Same\")) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_x_continuous(\"Sequence 1\", breaks = 1: length(sequence_x), labels = sequence_x, limit = c(10,20)) +<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>scale_y_continuous(\"Sequence 2\", breaks = 1: length(sequence_y), labels = sequence_y, limit = c(10,20)) + <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>theme_bw() <\/em><\/span>\n<span style=\"color: #ff0000;\"><em>dot_plot_2<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>Warning message:<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>Removed 69048 rows containing missing values (geom_raster).<\/em><\/span><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"788\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-1024x788.png\" alt=\"\" class=\"wp-image-3315\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-1024x788.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-300x231.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-768x591.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-1536x1183.png 1536w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/orchid_dotplot_ggplot_zoom-2048x1577.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Finally, to save fasta files is straight forward. To save the orchid sequences to a fasta file (orchid_2) in the working directory:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">write.fasta(<span class=\"s1\">names<\/span> = c(<span class=\"s2\">1<\/span>:<span class=\"s2\">94<\/span>), <span class=\"s1\">sequences<\/span> = <span class=\"s1\">orchid<\/span>, <span class=\"s1\">file.out<\/span> = <span class=\"s3\">\"orchid_2.fasta\"<\/span>)<\/span><\/em><\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Please make sure the necessary packages (seqinr and Biostrings) are&nbsp;installed&nbsp;as described&nbsp;to allow analysis. Furthermore, the ggplot2 and reshape2 packages should be loaded. The genetic code for the lady&#8217;s slipper orchid (Cypripedium calceolus)&nbsp;can be&nbsp;downloaded. Select all the text (Ctrl-A), copy (Ctrl-C) and paste (Ctrl-V) it into a text editor. Save the file as ls_orchid.fasta (without a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"inline_featured_image":false,"footnotes":""},"class_list":["post-2239","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2239","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/comments?post=2239"}],"version-history":[{"count":3,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2239\/revisions"}],"predecessor-version":[{"id":4552,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2239\/revisions\/4552"}],"wp:attachment":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/media?parent=2239"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}