5
\$\begingroup\$

DNA is read in groups of three nucleotides (the letters you see in a sequence), called codons, for determining what protein sequence the portion of DNA will yield. I already used this in a previous question. A protein coding DNA sequence (CDS) will always start with ATG and end with a stop codon, either TAA, TAG or TGA.

This creates a notion of "frame" within a DNA sequence. Let's take this sequence for example:

ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT

Although it contains multiple stop codons, only one of them is in the same frame as the start codon ATG near the beginning of the sequence. Thus, the sequence should be decomposed as such:

AT ATG CGG CGT AAT GAA TAC TGC TAA GGC TTA TCG TGC ATT CT

As that is how it would be read by the cell's machinery.

This means that there are six frames in every DNA sequence: the three forward ones, and the three reverse frames. DNA is found as a double helix, with the second strand being the reverse complement of the first. This means it will be read backwards, and all As are swapped for Ts, all Gs are swapped for Cs, and vice versa. So my example sequence will become:

AGAATGCACGATAAGCCTTAGCAGTATTCATTACGCCGCATAT

Which creates another sequence starting with ATG and ending with a stop codon:

AGA ATG CAC GAT AAG CCT TAG CAG TAT TCA TTA CGC CGC ATA T

In bioinformatics, these are called Open Reading Frames (ORFs): any DNA sequence that starts with ATG and ends with a stop codon in the same frame.

The challenge: : write a function in as few bytes as possible that takes as input a DNA sequence and output all ORFs found, each with its start position in the original sequence (one-based), end position, frame (+/- [1;2;3]) and full DNA sequence of the ORF. So for the previous example:

input: ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT
output: 
3; 26; 3; ATGCGGCGTAATGAATACTGCTAA
40; 23; -1; ATGCACGATAAGCCTTAG

Separators between fields are required but can be as you wish.

A few more examples:

input: AATGGCGTAAATGCCTTGA
output: 
2; 10; 2; ATGGCGTAA
11; 19; 2; ATGCCTTGA

input: TTACAT
output:
6; 1; -1; ATGTAA

input: TTTTTTTTT
output:

input: ATGCATGCATAATTAA
output:
1; 12; 1; ATGCATGCATAA
5; 16; 2; ATGCATAATTAA

input: ATGATGTAA
output:
1; 9; 1; ATGATGTAA
4; 9; 1; ATGTAA

input: ATGCTGTAC
output:


```
\$\endgroup\$
5
  • \$\begingroup\$ Related \$\endgroup\$ Commented Oct 10 at 17:02
  • 1
    \$\begingroup\$ Is the input guaranteed to always have a valid stop codon following every start codon? If not, what should we do if we reach the end of the string without finding a stop codon? (depending on the frame of the start codon, there may be 0 1 or 2 letters left.) \$\endgroup\$ Commented Oct 10 at 19:17
  • 3
    \$\begingroup\$ What is the output for ATGATGTAA? There are 2 valid sequences here aren't there? ATGATGTAA and ATGTAA (as far as I know, ATG (methionine) can occur in the middle of a protein and there are additional ways that the cell uses to decide which ATG's to start protein synthesis from. \$\endgroup\$ Commented Oct 10 at 19:21
  • 1
    \$\begingroup\$ @Arnauld two reading frames, one in +1 and one in +2. + is not required, I'll edit the Q \$\endgroup\$ Commented Oct 13 at 8:19
  • \$\begingroup\$ @LevelRiverSt That sequence has two reading frames. If there's no stop codon in the sequence then there's no ORF, it needs to end with a stop to qualify. \$\endgroup\$ Commented Oct 13 at 8:20

4 Answers 4

2
\$\begingroup\$

Retina, 148 bytes

$
¶$`
O$^`\G.

T`ACGT`Ro`^.+
Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?
$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&

Try it online! Link includes test cases. Explanation:

$
¶$`

Duplicate the input line.

O$^`\G.

T`ACGT`Ro`^.+

Reverse the first line and switch A with T and C with G.

Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?

Match all overlapping sequences, plus also capture the suffix or prefix respectively depending on whether this is the first or second line. $.1 = length of prefix + 1; $.2 = length of prefix; $#2 = 1 on the second line, 0 on the first; $.3 = frame - 1; $.7 = length of suffix + 1; $.8 = length of suffix; $#8 = 1 on the first line, 0 on the second.

$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&

Output the positions of the first and last character, the frame and the sequence. If the match is on the first line, the start position is the length of the suffix plus the length of the string, otherwise it is one more than the length of the prefix. Similarly if the match is on the second line then the end position is the length of the prefix plus the length of the string, otherwise it is one more than the length of the suffix.

\$\endgroup\$
2
\$\begingroup\$

05AB1E, 84 bytes

Â.•∍–•u‡‚εDŒʒg3Öy…ATGÅ?y.•5ʒŒΩœ •u3ôÅ¿àP}©kDU>X®€g+‚NiIgα>}`X3%Ni(<ë>}®)ø.γн}€н}í€`

Try it online or verify all test cases.

If we're allowed to output a pair of quartets where the regular and reversed DNA-ORFs are separated, the trailing 4 bytes can be removed:
Try it online or verify all test cases.

Explanation:

                # Bifurcate the (implicit) input; aka, Duplicate & Reverse copy
 .•∍–•           # Push compressed string "acgt"
   u             # Uppercase it
    Â            # Bifurcate it as well
     ‡           # Transliterate ("A" to "T"; "C" to "G"; "G" to "C"; "T" to "A")
      ‚          # Pair the two strings together
ε                # Map over this pair:
 D               #  Duplicate the current string
  Œ              #  Pop and push its substrings
   ʒ             #  Filter this list by:
    g            #   Check that its length
     3Ö          #   is divisible by 3
    y    Å?      #   Check that it starts with
     …ATG        #   string "ATG"
    y      Å¿à   #   Check that it ends with any of these three:
     .•5ʒŒΩœ •   #    Push compressed string "taatagtga"
        u        #    Uppercase it
         3ô      #    Split into triplets: ["TAA","TAG","TGA"]
    P            #   Take the product of the stack to verify all are truthy
   }©            #  After the filter(s): store it in variable `®` (without popping)
     k           #  Get the 0-based index of each in the full string
      DU         #  Store a copy of these indices in variable `X`
      >          #  +1 to make it a 1-based index
     X           #  Push 0-based indices `X` again
      ®          #  Push the list of substrings `®` again
       €g        #  Get the length of each substring
         +       #  Add it to the 0-based indices
     ‚           #  Pair these two lists together
      Ni         #  If the 0-based map-index is 1 (aka, it's the reversed string):
        Ig       #   Push the input-length
          α      #   Take its absolute difference with each value
           >     #   Increase each value by 1
       }`        #  After the if-statement: push both lists to the stack again
         X       #  Push 0-based indices `X` again
          3%     #  Modulo-3 each
            Ni   #  If the map-index is 1 (aka, it's the reversed string)
              (  #   Negate each
               < #   Then decrease each by 1
             ë   #  Else (it's the regular string):
              >  #   Increase each by 1 instead
             }   #  Close the if-else statement
       ®         #  Push the list of substrings `®` again
        )        #  Wrap all four lists on the stack into a list
         ø       #  Zip/transpose; swapping rows/columns
          .γ     #  Group each quartet by:
            н    #   Its first value, the 1-based index
           }€н   #  After the group-by: keep the first quartet of each group
}                # After the map:
  €`             # Flatten the pair of lists one level down
 í               # (without changing its order)
                 # (after which the result is output implicitly)

See this 05AB1E tip of mine (section How to compress strings not part of the dictionary?) to understand why .•∍–• is "acgt" and .•5ʒŒΩœ • is "taatagtga".

\$\endgroup\$
1
  • \$\begingroup\$ The pair of quartets option works nicely, I'm not expecting a specific output format \$\endgroup\$ Commented Oct 13 at 12:24
2
\$\begingroup\$

Charcoal, 102 bytes

F²«≔⭆⮌θ§ACGT⌕TGCκθF⌕AθATG«≔⪪✂θκLθ¹¦³η≔⌊ΦE⁺T⪪ײAAG²⊕⌕ηλλζ¿ζ⟦⪫⟦⎇ι⊕κ⁻Lθκ⎇ι⁺κ׳ζ⊕⁻Lθ⁺κ׳ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧; 

Try it online! Link is to verbose version of code. Explanation:

F²«

Loop twice, once for the reverse complement.

≔⭆⮌θ§ACGT⌕TGCκθ

Take the reverse complement of the string. (So on the second pass, we recover the original input.)

F⌕AθATG«

Loop over all occurrences of ATG in the string.

≔⪪✂θκLθ¹¦³η

Split the suffix of the string into groups of three.

≔⌊ΦE⁺T⪪ײAAG²⊕⌕ηλλζ

Search the groups of three for each of TAA, TGA and TAG, and take the lowest codon count, if any.

¿ζ⟦⪫⟦⎇ι⊕κ⁻Lθκ⎇ι⁺κ׳ζ⊕⁻Lθ⁺κ׳ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧; 

If a match was found then output the details of the match.

\$\endgroup\$
4
  • \$\begingroup\$ How is... "that" looking for TAA, TAG and TGA? \$\endgroup\$ Commented Oct 15 at 8:33
  • 1
    \$\begingroup\$ @Whitehot You take the string AAG, repeat it making AAGAAG, split it into pairs of letters AA, GA, and AG, then prepend T to each, which is why I didn't list them in alphabetical order in my explanation. \$\endgroup\$ Commented Oct 15 at 11:38
  • \$\begingroup\$ Damn that's smooth \$\endgroup\$ Commented Oct 15 at 12:46
  • 1
    \$\begingroup\$ @Whitehot Anything to save a byte in code-golf ;-) \$\endgroup\$ Commented Oct 15 at 13:10
2
\$\begingroup\$

JavaScript (V8), 163 bytes

Expects a string and prints each match on a single space-separated line, using the field order described in the challenge.

s=>[1,k=-1].map(r=>[...S=s].map((c,i)=>S.slice(i).replace(/^ATG(...)*?T(AA|AG|GA)/,s=>print(k,k+~-s.length*r,r*i%3+r,s),k+=r,s=q[q.search(c)^1]+s),s=q="CGAT",k++))

Try it online!

Commented

s =>                            // s = input string
[1, k = -1]                     // k = ORF starting position, initialized to -1
.map(r =>                       // for r = 1 and r = -1:
  [...S = s]                    //   S = copy of s
  .map((c, i) =>                //   for each character c at index i in S:
    S.slice(i)                  //     take the substring of S starting at i
    .replace(                   //     search for:
      /^ATG(...)*?T(AA|AG|GA)/, //       a valid ORF at the start of this substring
      s =>                      //       if found, store it in s and
      print(                    //       print:
        k,                      //         starting position k
        k + ~-s.length * r,     //         ending position: k + (s.length - 1) * r
        r * i % 3 + r,          //         frame: (r * i) % 3 + r
        s                       //         matched sequence s
      ),                        //       end of print()
      k += r,                   //       update the starting position
      s =                       //       build the reverse sequence in s
        q[                      //       by prepending the complement base
          q.search(c) ^ 1       //       using c and the lookup string q
        ] + s                   //
    ),                          //     end of replace()
    s = q = "CGAT",             //     set the lookup string q and reset s for the
                                //     reverse sequence (*)
    k++                         //     increment k
  )                             //   end of inner map()
)                               // end of outer map()

(*) This is safe because C is put in front and cannot accidentally complete a 'stop' codon.

\$\endgroup\$
2
  • \$\begingroup\$ Why this some difffer? \$\endgroup\$ Commented Oct 14 at 18:36
  • \$\begingroup\$ @l4m2 (c,s,_,$,i) should be (c,S,_,$,i), right? \$\endgroup\$ Commented Oct 14 at 18:47

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.