Genome sequence alignments are a priceless resource for finding functional elements (protein-coding sequences, RNA structures, cis-regulatory elements, miRNA target sites, etc.) and charting evolutionary history. Many genome alignment algorithms have been developed. All of these algorithms require selection of various mundane but critical parameters. In the most classic approach to alignment (Smith-Waterman/BLAST), these parameters include the scoring matrix and gap costs, which determine alignment scores, and thus which alignments are produced. This study aims to reveal the influence of these and other parameters, and to guide their selection for accurate genome alignment.
In the classic alignment framework, it is necessary to choose an alignment score cutoff: low enough to find weak homologies, but high enough to avoid too many spurious alignments. A rational approach is to calculate the E-value (the expected number of alignments between two random sequences scoring above the cutoff) and to choose a cutoff that has an acceptable E-value. Surprisingly, this approach does not seem to be used for genome alignment (or if it is, it is not mentioned in method descriptions). The authors of BLASTZ tested their score cutoff by aligning two genomes after reversing, but not complementing, one of them. Homology between reversed and non-reversed DNA is (thought to be) impossible, so any alignments found in this test are spurious, and their number is a good measure of the spurious alignment rate. However, the test is inconvenient to repeat with each new pair of genomes.
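To make the two cutoff-selection strategies concrete, the following Python sketch illustrates both. It assumes the standard Karlin-Altschul form for local alignment scores, E = K m n exp(-lambda S), which the text above does not state explicitly. The function names and the parameter values (lam, K, and the sequence lengths) are hypothetical placeholders: in practice lambda and K depend on the scoring matrix, gap costs, and sequence composition, and for gapped alignment they generally have to be estimated empirically. The last two helpers show the difference between the reversed-but-not-complemented control and an ordinary reverse complement.

```python
import math


def expected_random_alignments(score, m, n, lam, K):
    """E-value under the assumed Karlin-Altschul form: E = K * m * n * exp(-lam * score).

    m and n are the lengths of the two sequences; lam and K are
    scoring-scheme-dependent constants (placeholders in this sketch).
    """
    return K * m * n * math.exp(-lam * score)


def score_cutoff_for_evalue(evalue, m, n, lam, K):
    """Invert the formula above: the score at which the E-value equals `evalue`."""
    return math.log(K * m * n / evalue) / lam


def reverse_without_complement(seq):
    """Reverse a DNA string without complementing it (the null control).

    The reversed strand cannot be homologous to any real DNA, so anything
    that aligns to it is assumed to be spurious.
    """
    return seq[::-1]


def reverse_complement(seq):
    """Ordinary reverse complement, shown only for contrast with the control."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the calculation.
    m = n = 3_000_000_000   # two genomes of roughly 3 Gb each
    lam, K = 0.2, 0.1       # placeholder constants, not fitted to any scoring scheme
    cutoff = score_cutoff_for_evalue(1.0, m, n, lam, K)
    print(f"score cutoff for E = 1: {cutoff:.1f}")
    print(reverse_without_complement("ACGTT"), reverse_complement("ACGTT"))
```

The E-value route needs only the sequence lengths and the two constants, whereas the reversal control requires a full alignment run against the reversed genome; counting the alignments that survive the cutoff in that run gives an empirical estimate of the spurious alignment rate.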