I want to use an HMM (trained with the forward-backward algorithm) for protein secondary structure prediction.
Basically, a three-state model is used: States = {H=alpha helix, B=beta sheet, C=coil}
and each state has an emission pmf given as a 1-by-20 vector (one probability per amino acid).
After running the forward-backward (Baum-Welch) algorithm on a "training set" of sequences, the expectation maximization converges to a (locally) optimal transition matrix (3-by-3, between the three states) and an emission pmf for each state.
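To make the setup concrete, here is a minimal sketch of how I picture the EM loop, written in plain Python rather than Excel. Everything specific in it is an assumption for illustration: the three toy sequences are made up (not real proteins), the starting parameters are random, and the names `forward_backward` and `em_step` are my own, not from any library.

```python
import math
import random

STATES = ["H", "B", "C"]            # helix, sheet, coil
AMINO = "ACDEFGHIKLMNPQRSTVWY"      # the 20 amino acids

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def forward_backward(obs, pi, A, E):
    """Scaled forward-backward pass for one sequence of symbol indices."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):                                  # forward pass
        for j in range(n):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prev * E[j][obs[t]]
        scale[t] = sum(alpha[t])                        # rescale to avoid underflow
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):                      # backward pass, same scaling
        for i in range(n):
            beta[t][i] = sum(A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    # gamma[t][i]: posterior probability of state i at position t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    # xi[i][j]: expected number of i -> j transitions, summed over positions
    xi = [[sum(alpha[t][i] * A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
               for t in range(T - 1)) for j in range(n)] for i in range(n)]
    loglik = sum(math.log(s) for s in scale)
    return gamma, xi, loglik

def em_step(seqs, pi, A, E):
    """One Baum-Welch update; returns new parameters and the log-likelihood of the old ones."""
    n, m = len(pi), len(E[0])
    pi_acc = [0.0] * n
    A_acc = [[0.0] * n for _ in range(n)]
    E_acc = [[0.0] * m for _ in range(n)]
    total = 0.0
    for obs in seqs:
        gamma, xi, ll = forward_backward(obs, pi, A, E)
        total += ll
        for i in range(n):
            pi_acc[i] += gamma[0][i]
            for j in range(n):
                A_acc[i][j] += xi[i][j]
            for t, sym in enumerate(obs):
                E_acc[i][sym] += gamma[t][i]
    return (normalize(pi_acc), [normalize(r) for r in A_acc],
            [normalize(r) for r in E_acc], total)

# Made-up toy sequences, for illustration only.
toy = ["MKVLAAGIVGE", "GSPEELLKKA", "TVYVFKGDGS"]
seqs = [[AMINO.index(c) for c in s] for s in toy]

rng = random.Random(0)              # arbitrary random starting point (assumption)
pi = normalize([rng.random() + 0.1 for _ in STATES])
A = [normalize([rng.random() + 0.1 for _ in STATES]) for _ in STATES]
E = [normalize([rng.random() + 0.1 for _ in AMINO]) for _ in STATES]

logliks = []
for _ in range(10):
    pi, A, E, ll = em_step(seqs, pi, A, E)
    logliks.append(ll)
print(logliks)
```

The log-likelihood should never decrease from one iteration to the next, which is one check that does not need reference values. The final matrices depend on the starting point, though, so a different seed can give a different answer.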
Does anyone know of a dataset (preferably very small) of sequences for which the "correct" values of the transition matrix and emission probabilities are known? I would like to run the forward-backward algorithm on that dataset in Excel and check whether I get the same result, to build confidence in my implementation.
And then move on to something less primitive than Excel :o)