DNA reading frame

Question

DNA is read in groups of three nucleotides (the letters you see in a sequence), called codons, for determining what protein sequence the portion of DNA will yield. I already used this in a previous question. A protein coding DNA sequence (CDS) will always start with ATG and end with a stop codon, either TAA, TAG or TGA.

This creates a notion of "frame" within a DNA sequence. Let's take this sequence for example:

ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT

Although it contains multiple stop codons, only one of them is in the same frame as the start codon ATG near the beginning of the sequence. Thus, the sequence should be decomposed as such:

AT ATG CGG CGT AAT GAA TAC TGC TAA GGC TTA TCG TGC ATT CT

As that is how it would be read by the cell's machinery.

This means that there are six frames in every DNA sequence: the three forward ones, and the three reverse frames. DNA is found as a double helix, with the second strand being the reverse complement of the first. This means it will be read backwards, and all As are swapped for Ts, all Gs are swapped for Cs, and vice versa. So my example sequence will become:

AGAATGCACGATAAGCCTTAGCAGTATTCATTACGCCGCATAT

Which creates another sequence starting with ATG and ending with a stop codon:

AGA ATG CAC GAT AAG CCT TAG CAG TAT TCA TTA CGC CGC ATA T

In bioinformatics, these are called Open Reading Frames (ORFs): any DNA sequence that starts with ATG and ends with a stop codon in the same frame.

The challenge: : write a function in as few bytes as possible that takes as input a DNA sequence and output all ORFs found, each with its start position in the original sequence (one-based), end position, frame (+/- [1;2;3]) and full DNA sequence of the ORF. So for the previous example:

input: ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT
output: 
3; 26; 3; ATGCGGCGTAATGAATACTGCTAA
40; 23; -1; ATGCACGATAAGCCTTAG

Separators between fields are required but can be as you wish.

A few more examples:

input: AATGGCGTAAATGCCTTGA
output: 
2; 10; 2; ATGGCGTAA
11; 19; 2; ATGCCTTGA

input: TTACAT
output:
6; 1; -1; ATGTAA

input: TTTTTTTTT
output:

input: ATGCATGCATAATTAA
output:
1; 12; 1; ATGCATGCATAA
5; 16; 2; ATGCATAATTAA

input: ATGATGTAA
output:
1; 9; 1; ATGATGTAA
4; 9; 1; ATGTAA

input: ATGCTGTAC
output:


```

Is the input guaranteed to always have a valid stop codon following every start codon? If not, what should we do if we reach the end of the string without finding a stop codon? (depending on the frame of the start codon, there may be 0 1 or 2 letters left.) — Level River St
– Level River St, Commented Oct 10 at 19:17
What is the output for ATGATGTAA? There are 2 valid sequences here aren't there? ATGATGTAA and ATGTAA (as far as I know, ATG (methionine) can occur in the middle of a protein and there are additional ways that the cell uses to decide which ATG's to start protein synthesis from. — Level River St
– Level River St, Commented Oct 10 at 19:21
@Arnauld two reading frames, one in +1 and one in +2. + is not required, I'll edit the Q — Whitehot
– Whitehot, Commented Oct 13 at 8:19
@LevelRiverSt That sequence has two reading frames. If there's no stop codon in the sequence then there's no ORF, it needs to end with a stop to qualify. — Whitehot
– Whitehot, Commented Oct 13 at 8:20

Neil · Accepted Answer · 2025-10-11 12:58:36Z

Retina, 148 bytes

$
¶$`
O$^`\G.

T`ACGT`Ro`^.+
Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?
$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&

Try it online! Link includes test cases. Explanation:

$
¶$`

Duplicate the input line.

O$^`\G.

T`ACGT`Ro`^.+

Reverse the first line and switch A with T and C with G.

Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?

Match all overlapping sequences, plus also capture the suffix or prefix respectively depending on whether this is the first or second line. $.1 = length of prefix + 1; $.2 = length of prefix; $#2 = 1 on the second line, 0 on the first; $.3 = frame - 1; $.7 = length of suffix + 1; $.8 = length of suffix; $#8 = 1 on the first line, 0 on the second.

$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&

Output the positions of the first and last character, the frame and the sequence. If the match is on the first line, the start position is the length of the suffix plus the length of the string, otherwise it is one more than the length of the prefix. Similarly if the match is on the second line then the end position is the length of the prefix plus the length of the string, otherwise it is one more than the length of the suffix.

Kevin Cruijssen · Accepted Answer · 2025-10-13 09:49:06Z

05AB1E, 84 bytes

Â.•∍–•uÂ‡‚εDŒʒg3Öy…ATGÅ?y.•5ʒŒΩœ •u3ôÅ¿àP}©kDU>X®€g+‚NiIgα>}`X3%Ni(<ë>}®)ø.γн}€н}í€`

Try it online or verify all test cases.

If we're allowed to output a pair of quartets where the regular and reversed DNA-ORFs are separated, the trailing 4 bytes can be removed:
Try it online or verify all test cases.

Explanation:

Â                # Bifurcate the (implicit) input; aka, Duplicate & Reverse copy
 .•∍–•           # Push compressed string "acgt"
   u             # Uppercase it
    Â            # Bifurcate it as well
     ‡           # Transliterate ("A" to "T"; "C" to "G"; "G" to "C"; "T" to "A")
      ‚          # Pair the two strings together
ε                # Map over this pair:
 D               #  Duplicate the current string
  Œ              #  Pop and push its substrings
   ʒ             #  Filter this list by:
    g            #   Check that its length
     3Ö          #   is divisible by 3
    y    Å?      #   Check that it starts with
     …ATG        #   string "ATG"
    y      Å¿à   #   Check that it ends with any of these three:
     .•5ʒŒΩœ •   #    Push compressed string "taatagtga"
        u        #    Uppercase it
         3ô      #    Split into triplets: ["TAA","TAG","TGA"]
    P            #   Take the product of the stack to verify all are truthy
   }©            #  After the filter(s): store it in variable `®` (without popping)
     k           #  Get the 0-based index of each in the full string
      DU         #  Store a copy of these indices in variable `X`
      >          #  +1 to make it a 1-based index
     X           #  Push 0-based indices `X` again
      ®          #  Push the list of substrings `®` again
       €g        #  Get the length of each substring
         +       #  Add it to the 0-based indices
     ‚           #  Pair these two lists together
      Ni         #  If the 0-based map-index is 1 (aka, it's the reversed string):
        Ig       #   Push the input-length
          α      #   Take its absolute difference with each value
           >     #   Increase each value by 1
       }`        #  After the if-statement: push both lists to the stack again
         X       #  Push 0-based indices `X` again
          3%     #  Modulo-3 each
            Ni   #  If the map-index is 1 (aka, it's the reversed string)
              (  #   Negate each
               < #   Then decrease each by 1
             ë   #  Else (it's the regular string):
              >  #   Increase each by 1 instead
             }   #  Close the if-else statement
       ®         #  Push the list of substrings `®` again
        )        #  Wrap all four lists on the stack into a list
         ø       #  Zip/transpose; swapping rows/columns
          .γ     #  Group each quartet by:
            н    #   Its first value, the 1-based index
           }€н   #  After the group-by: keep the first quartet of each group
}                # After the map:
  €`             # Flatten the pair of lists one level down
 í               # (without changing its order)
                 # (after which the result is output implicitly)

See this 05AB1E tip of mine (section How to compress strings not part of the dictionary?) to understand why .•∍–• is "acgt" and .•5ʒŒΩœ • is "taatagtga".

The pair of quartets option works nicely, I'm not expecting a specific output format — Whitehot
– Whitehot, Commented Oct 13 at 12:24

Neil · Accepted Answer · 2025-10-13 21:36:33Z

2

Charcoal, 102 bytes

Ｆ²«≔⭆⮌θ§ACGT⌕TGCκθＦ⌕ＡθATG«≔⪪✂θκＬθ¹¦³η≔⌊ΦＥ⁺T⪪×²AAG²⊕⌕ηλλζ¿ζ⟦⪫⟦⎇ι⊕κ⁻Ｌθκ⎇ι⁺κ×³ζ⊕⁻Ｌθ⁺κ×³ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧;

Try it online! Link is to verbose version of code. Explanation:

Ｆ²«

Loop twice, once for the reverse complement.

≔⭆⮌θ§ACGT⌕TGCκθ

Take the reverse complement of the string. (So on the second pass, we recover the original input.)

Ｆ⌕ＡθATG«

Loop over all occurrences of ATG in the string.

≔⪪✂θκＬθ¹¦³η

Split the suffix of the string into groups of three.

≔⌊ΦＥ⁺T⪪×²AAG²⊕⌕ηλλζ

Search the groups of three for each of TAA, TGA and TAG, and take the lowest codon count, if any.

¿ζ⟦⪫⟦⎇ι⊕κ⁻Ｌθκ⎇ι⁺κ×³ζ⊕⁻Ｌθ⁺κ×³ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧;

If a match was found then output the details of the match.

answered Oct 13 at 21:36

Neil

184k12 gold badges76 silver badges290 bronze badges

\$\begingroup\$ How is... "that" looking for TAA, TAG and TGA? \$\endgroup\$

Whitehot
– Whitehot

2025-10-15 08:33:49 +00:00
Commented Oct 15 at 8:33
1

\$\begingroup\$ @Whitehot You take the string AAG, repeat it making AAGAAG, split it into pairs of letters AA, GA, and AG, then prepend T to each, which is why I didn't list them in alphabetical order in my explanation. \$\endgroup\$

Neil
– Neil

2025-10-15 11:38:27 +00:00
Commented Oct 15 at 11:38
\$\begingroup\$ Damn that's smooth \$\endgroup\$

Whitehot
– Whitehot

2025-10-15 12:46:12 +00:00
Commented Oct 15 at 12:46
1

\$\begingroup\$ @Whitehot Anything to save a byte in code-golf ;-) \$\endgroup\$

Neil
– Neil

2025-10-15 13:10:39 +00:00
Commented Oct 15 at 13:10

Add a comment |

Arnauld · Accepted Answer · 2025-10-14 07:51:07Z

JavaScript (V8), 163 bytes

Expects a string and prints each match on a single space-separated line, using the field order described in the challenge.

s=>[1,k=-1].map(r=>[...S=s].map((c,i)=>S.slice(i).replace(/^ATG(...)*?T(AA|AG|GA)/,s=>print(k,k+~-s.length*r,r*i%3+r,s),k+=r,s=q[q.search(c)^1]+s),s=q="CGAT",k++))

Try it online!

Commented

s =>                            // s = input string
[1, k = -1]                     // k = ORF starting position, initialized to -1
.map(r =>                       // for r = 1 and r = -1:
  [...S = s]                    //   S = copy of s
  .map((c, i) =>                //   for each character c at index i in S:
    S.slice(i)                  //     take the substring of S starting at i
    .replace(                   //     search for:
      /^ATG(...)*?T(AA|AG|GA)/, //       a valid ORF at the start of this substring
      s =>                      //       if found, store it in s and
      print(                    //       print:
        k,                      //         starting position k
        k + ~-s.length * r,     //         ending position: k + (s.length - 1) * r
        r * i % 3 + r,          //         frame: (r * i) % 3 + r
        s                       //         matched sequence s
      ),                        //       end of print()
      k += r,                   //       update the starting position
      s =                       //       build the reverse sequence in s
        q[                      //       by prepending the complement base
          q.search(c) ^ 1       //       using c and the lookup string q
        ] + s                   //
    ),                          //     end of replace()
    s = q = "CGAT",             //     set the lookup string q and reset s for the
                                //     reverse sequence (*)
    k++                         //     increment k
  )                             //   end of inner map()
)                               // end of outer map()

_{(*) This is safe because C is put in front and cannot accidentally complete a 'stop' codon.}

\$\begingroup\$ Why this some difffer? \$\endgroup\$

l4m2
– l4m2

2025-10-14 18:36:18 +00:00
Commented Oct 14 at 18:36 — l4m2
– l4m2, Commented Oct 14 at 18:36
\$\begingroup\$ @l4m2 (c,s,_,$,i) should be (c,S,_,$,i), right? \$\endgroup\$

Arnauld
– Arnauld

2025-10-14 18:47:09 +00:00
Commented Oct 14 at 18:47 — Arnauld
– Arnauld, Commented Oct 14 at 18:47

Stack Exchange Network

DNA reading frame

4 Answers 4

Retina, 148 bytes

05AB1E, 84 bytes

Charcoal, 102 bytes

JavaScript (V8), 163 bytes

Commented

Your Answer

Linked

Hot Network Questions

DNA reading frame

4 Answers 4

Retina, 148 bytes

05AB1E, 84 bytes

Charcoal, 102 bytes

JavaScript (V8), 163 bytes

Commented

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions