DNA is read in groups of three nucleotides (the letters you see in a sequence), called codons, which determine the protein sequence that a stretch of DNA encodes. I already used this in a previous question. A protein-coding DNA sequence (CDS) always starts with ATG and ends with a stop codon: TAA, TAG or TGA.
This creates a notion of "frame" within a DNA sequence. Let's take this sequence for example:
ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT
Although it contains multiple stop codons, only one of them is in the same frame as the start codon ATG near the beginning of the sequence. Thus, the sequence should be decomposed as follows:
AT ATG CGG CGT AAT GAA TAC TGC TAA GGC TTA TCG TGC ATT CT
This is how the cell's machinery would read it.
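To make the idea of a frame concrete, here is a minimal Python sketch that reproduces the decomposition above. The helper name `split_codons` is made up for illustration; it simply cuts the sequence at a given start index and groups everything after it in threes, keeping any leftover letters at either end.

```python
def split_codons(seq, start):
    """Split seq into a leading remainder, whole codons read from the
    0-based index `start`, and a trailing remainder (if any)."""
    head = seq[:start]
    body = seq[start:]
    usable = len(body) - len(body) % 3  # length covered by whole codons
    codons = [body[i:i + 3] for i in range(0, usable, 3)]
    tail = body[usable:]
    return ([head] if head else []) + codons + ([tail] if tail else [])


seq = "ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT"
start = seq.find("ATG")  # 0-based index of the first start codon
print(" ".join(split_codons(seq, start)))
# AT ATG CGG CGT AAT GAA TAC TGC TAA GGC TTA TCG TGC ATT CT
```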
Every DNA sequence therefore has six frames: three forward and three reverse. The reverse frames exist because DNA is found as a double helix, with the second strand being the reverse complement of the first: it is read backwards, with every A swapped for T, every G swapped for C, and vice versa. So my example sequence becomes:
AGAATGCACGATAAGCCTTAGCAGTATTCATTACGCCGCATAT
This strand contains another sequence starting with ATG and ending with a stop codon:
AGA ATG CAC GAT AAG CCT TAG CAG TAT TCA TTA CGC CGC ATA T
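The reverse complement described above can be sketched in a few lines of Python. The names `COMPLEMENT` and `reverse_complement` are illustrative, and the sketch assumes the input contains only the uppercase letters A, C, G and T.

```python
# Translation table swapping A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT"))
# AGAATGCACGATAAGCCTTAGCAGTATTCATTACGCCGCATAT
```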
In bioinformatics, these are called Open Reading Frames (ORFs): any DNA sequence that starts with ATG and ends with a stop codon in the same frame.
The challenge: write a function in as few bytes as possible that takes a DNA sequence as input and outputs all ORFs found, each with its start position in the original sequence (one-based), end position, frame (±1, ±2 or ±3; the + sign may be omitted, as in the examples) and the full DNA sequence of the ORF. So for the previous example:
input: ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT
output:
3; 26; 3; ATGCGGCGTAATGAATACTGCTAA
40; 23; -1; ATGCACGATAAGCCTTAG
Separators between fields are required but can be as you wish.
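For checking answers, an ungolfed reference might look like the Python sketch below. It is only one reading of the specification, not an official solution, and it makes some assumptions inferred from the examples: every ATG starts an ORF that runs to the first in-frame stop codon, the frame of a reverse ORF is counted from the start of the reverse complement, and forward ORFs are listed before reverse ones. The names `find_orfs`, `STOPS` and `COMPLEMENT` are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq):
    """Return (start, end, frame, orf) tuples; positions are one-based
    and always refer to the original input sequence."""
    n = len(seq)
    results = []
    strands = ((1, seq), (-1, seq.translate(COMPLEMENT)[::-1]))
    for sign, strand in strands:
        for i in range(n - 2):
            if strand[i:i + 3] != "ATG":
                continue
            # Walk codon by codon until the first in-frame stop codon.
            for j in range(i + 3, n - 2, 3):
                if strand[j:j + 3] in STOPS:
                    first, last = i + 1, j + 3  # one-based on this strand
                    if sign < 0:
                        # Map reverse-strand coordinates back to the input.
                        first, last = n - first + 1, n - last + 1
                    frame = sign * (i % 3 + 1)
                    results.append((first, last, frame, strand[i:j + 3]))
                    break
    return results


for orf in find_orfs("ATATGCGGCGTAATGAATACTGCTAAGGCTTATCGTGCATTCT"):
    print("; ".join(map(str, orf)))
# 3; 26; 3; ATGCGGCGTAATGAATACTGCTAA
# 40; 23; -1; ATGCACGATAAGCCTTAG
```

An ATG with no in-frame stop codon before the end of the strand yields nothing, which is why a sequence such as ATGCTGTAC produces empty output.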
A few more examples:
input: AATGGCGTAAATGCCTTGA
output:
2; 10; 2; ATGGCGTAA
11; 19; 2; ATGCCTTGA
input: TTACAT
output:
6; 1; -1; ATGTAA
input: TTTTTTTTT
output:
input: ATGCATGCATAATTAA
output:
1; 12; 1; ATGCATGCATAA
5; 16; 2; ATGCATAATTAA
input: ATGATGTAA
output:
1; 9; 1; ATGATGTAA
4; 9; 1; ATGTAA
input: ATGCTGTAC
output: