Dirichlet Mixtures, the Dirichlet Process, and the Topography of Amino Acid Multinomial Space

The Dirichlet Process is used to estimate probability distributions that are mixtures of an unknown and unbounded number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we have used the Dirichlet Process to construct such distributions. The resulting mixtures describe multiple alignment data substantially better than do those previously derived. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on protein structure. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino-acid multinomial space.
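
To illustrate what a Dirichlet mixture over amino acid multinomials computes, the following Python sketch evaluates the marginal probability of one alignment column's letter counts under a mixture, along with the posterior weight of each component. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three-letter toy alphabet (standing in for the 20 amino acids), and all parameter values are hypothetical, and the set of components is fixed and finite rather than inferred with the Dirichlet Process.

```python
import math


def log_dirichlet_marginal(counts, alpha):
    """Log probability of an ordered sequence of observations with these
    per-letter counts, integrating over multinomials drawn from Dirichlet(alpha).
    (The multinomial coefficient is omitted; it is the same for every component.)"""
    a_sum = sum(alpha)
    n_sum = sum(counts)
    out = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for n, a in zip(counts, alpha):
        out += math.lgamma(a + n) - math.lgamma(a)
    return out


def component_log_joint(counts, weights, alphas):
    """Per-component log(weight * marginal likelihood of the counts)."""
    return [math.log(w) + log_dirichlet_marginal(counts, a)
            for w, a in zip(weights, alphas)]


def log_mixture_marginal(counts, weights, alphas):
    """Log probability of the counts under the whole mixture (log-sum-exp)."""
    logs = component_log_joint(counts, weights, alphas)
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))


def component_posterior(counts, weights, alphas):
    """Posterior probability of each component given the observed counts."""
    logs = component_log_joint(counts, weights, alphas)
    total = log_mixture_marginal(counts, weights, alphas)
    return [math.exp(x - total) for x in logs]


# Hypothetical toy mixture: two components over a three-letter alphabet,
# each favoring a different letter.
toy_weights = [0.5, 0.5]
toy_alphas = [[5.0, 0.5, 0.5], [0.5, 5.0, 0.5]]
toy_counts = [5, 0, 0]  # a column in which only the first letter is observed

print(component_posterior(toy_counts, toy_weights, toy_alphas))
```

In this toy setting, a column dominated by the first letter puts most of its posterior weight on the component that favors that letter. A mixture with hundreds of components works the same way, only with a much finer partition of multinomial space.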