How to locate a group II intron

If you want to know whether your bacterial genome sequence contains a group II intron, here are some tips. Of course, we would be happy to help you evaluate your sequence if you run into problems. In general, group II introns are easy to find if they encode a reverse transcriptase, while ORF-less introns are more difficult to locate.

Does your DNA sequence produce a BLAST hit to group II intron-encoded RTs?
If you do a BLAST search at NCBI and your sequence gives a strong match to a group II intron-encoded protein, then the sequence probably contains either a group II intron or a fragment of one.
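As a rough illustration of screening the search results, the sketch below filters BLAST hits saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`). It assumes the hits are already stored as text; the function name `strong_hits`, the `Hit` record, and the E-value cutoff of 1e-10 are illustrative choices, not values recommended by this guide.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str     # subject (protein) identifier
    pident: float    # percent identity of the alignment
    qstart: int      # alignment start on the query (1-based)
    qend: int        # alignment end on the query (1-based)
    evalue: float
    bitscore: float


def strong_hits(tabular_text, max_evalue=1e-10):
    """Return hits with E-value <= max_evalue, best bit score first.

    Expects the default 12 tab-separated columns of BLAST tabular output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The cutoff is an assumed, adjustable threshold.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hit = Hit(f[1], float(f[2]), int(f[6]), int(f[7]),
                  float(f[10]), float(f[11]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.bitscore, reverse=True)
```

The query coordinates (`qstart`, `qend`) of the strongest hit give an approximate position for the intron-encoded ORF, which is the anchor used in the later steps.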

If your intron has at least 80% identity to a known intron, then it should be straightforward to locate the boundaries by a simple alignment. If there is a sudden break in the alignment before the end of the intron, then the unknown sequence is a fragment. A fragment is also indicated at the protein level if the ORF lacks parts of the RT domains (0-7) or domain X (the presence of the En domain depends on the subclass). Truncated group II introns are in fact more numerous in bacterial genomes than full-length introns, so this situation is not unusual. Even among full-length introns there are frequent frameshifts and/or stop codons in the ORFs, indicating that the intron is probably a "dead" mobile DNA.
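The identity check and the "sudden break" test can be sketched as follows, assuming the two sequences have already been aligned by some external tool and are supplied as equal-length strings with `-` for gaps. The function names and the window size are hypothetical, and identity is computed here over all aligned columns, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns (columns with two gaps are skipped)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # column carries no information
        columns += 1
        if x == y:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


def identity_profile(aligned_a, aligned_b, window=50):
    """Percent identity in consecutive, non-overlapping windows.

    A run of high values followed by an abrupt drop suggests that the
    alignment breaks off, as expected for an intron fragment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return [
        percent_identity(aligned_a[i:i + window], aligned_b[i:i + window])
        for i in range(0, len(aligned_a), window)
    ]
```

Comparing the overall value against the 80% guideline and then inspecting the profile for a drop is one simple way to separate a full-length relative from a truncated copy.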

If your intron is not closely related to a known intron, then the closest relative is still very useful because it provides a reference for finding the intron boundaries and folding the intron. For instance, if the closest relative has a IIA RNA structure, then the unknown intron should also have a IIA structure. Even if the entire intron sequence cannot be aligned, some of the conserved motifs should be alignable and identifiable (e.g. D5, the epsilon prime motif, the kappa motif, etc.).

Can intron domains 5 and 6 be located?
Domain 5 is the only RNA motif that can be identified reliably based on sequence. For each class of intron there is a consensus sequence and structure for domain 5 (see secondary structure section). Domain 5 is usually located within 40 nt of the IEP's stop codon, usually downstream of it. If domain 5 is found, then there should be a reasonable domain 6 directly downstream (follow the consensus structures for the structural subclass). The sequence of the closest relative is usually very helpful in assigning the 3' end.
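To narrow down where to look for domain 5, the sketch below simply cuts out the region around the IEP's stop codon. The function name is hypothetical, and the default domain 5 length of 34 nt is an assumption made here for padding the window; the consensus for the relevant intron class should be used to judge the actual match.

```python
def d5_search_window(seq, stop_codon_pos, max_offset=40, d5_length=34):
    """Return (start, end, subsequence) for the region likely to hold domain 5.

    stop_codon_pos is the 0-based position of the IEP stop codon. The window
    extends max_offset + d5_length nt on both sides, so that a domain 5 lying
    within max_offset nt of the stop codon (upstream or downstream) fits
    completely inside it. d5_length is an assumed typical length.
    """
    pad = max_offset + d5_length
    start = max(0, stop_codon_pos - pad)
    end = min(len(seq), stop_codon_pos + pad)
    return start, end, seq[start:end]
```

The returned subsequence can then be compared by eye, or by alignment to the closest relative, against the domain 5 consensus for the class.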

Identification of the 5' end.
Identification of the 5' boundary can be hard and is less reliable than identification of the 3' end. The consensus sequence for the 5' end is GUGYG, but there are often multiple candidate GUGYG motifs in the region of the expected start. The 5' end is usually located 400-700 bp upstream of the start codon for the IEP. The best bet is to locate the boundary based on a closely related intron. Otherwise, domain 1 has to be folded into a reasonable structure based on consensus structures. Briefly, the easiest way to fold the intron is to remove the ORF sequence from the intron and subject the remainder to RNA folding via MFOLD. The optimal folding is usually not entirely correct, but by scanning through the suboptimal foldings, a structure can often be identified that obeys most of the structural rules and is mostly right. The correct pairings can then be used as folding constraints in MFOLD in another cycle of folding. This iterative process requires a certain amount of judgment and practice, but it works well for most bacterial introns.
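Listing the candidate 5' ends can be sketched as a simple motif scan. The function below reports every GUGYG (GTGYG in DNA) that starts 400-700 nt upstream of the IEP start codon; the function name and the 0-based coordinate convention are assumptions for illustration, and the sequence is assumed to be given on the intron's coding strand.

```python
import re

# Zero-width lookahead so that overlapping motifs (e.g. GTGTGTG) are all found.
MOTIF = re.compile(r"(?=GTG[CT]G)")


def candidate_5prime_ends(seq, iep_start, min_up=400, max_up=700):
    """Return 0-based start positions of GTGYG motifs upstream of the IEP.

    iep_start is the 0-based position of the first base of the IEP start
    codon. A motif is reported when the distance from its first base to
    iep_start lies between min_up and max_up (inclusive).
    """
    dna = seq.upper().replace("U", "T")  # accept RNA or DNA input
    return [
        m.start()
        for m in MOTIF.finditer(dna)
        if min_up <= iep_start - m.start() <= max_up
    ]
```

Each reported position is only a candidate; the choice among them still depends on comparison with a related intron or on folding domain 1.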

Some tips for folding an intron:
It is important to identify a closely related intron RNA as a reference for folding, as well as an RNA consensus structure for the class of your intron.

Several RNA motifs or regions are most critical when folding an intron de novo. Finding domains 5 and 6 is the easiest and should be done first. If both domains cannot be found, then the putative intron might be an intron fragment.

The next most important region is the basal stem of domain 4. This is important because it defines the large ORF loop that is not part of the ribozyme and can be deleted from subsequent folding. The 3' half of the domain 4 stem is easily located, because it is just upstream of domain 5. The 5' half of the stem can be estimated as being located within a couple of hundred bp upstream or downstream of the ORF's start codon. It is usually in the region 100 bp upstream. The candidate sequence is pasted together with the sequence of domains 4/5/6, and the combined sequence is folded with MFOLD. The correct D4 stem is identified among the resulting foldings by comparison to the consensus structure. Having found the domain 4 stem, one can excise the extraneous loop sequence, so that in subsequent steps only the ribozyme sequence of <800 bp is folded.
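Once the two halves of the domain 4 stem have been chosen, excising the ORF loop is a matter of string slicing. The sketch below is a minimal illustration with hypothetical names and 0-based coordinates; retaining a few loop nucleotides on each side (`keep`) is an assumption made here so that the remaining hairpin can still close, not a rule from this guide.

```python
def excise_d4_loop(intron, stem5_end, stem3_start, keep=4):
    """Remove the domain 4 loop (ORF region) from an intron sequence.

    stem5_end   -- index just past the last base of the 5' half of the D4 stem
    stem3_start -- index of the first base of the 3' half of the D4 stem
    keep        -- loop nucleotides retained next to each stem half (assumed)
    """
    if not 0 <= stem5_end <= stem3_start <= len(intron):
        raise ValueError("stem coordinates are out of order or out of range")
    left = stem5_end + keep
    right = stem3_start - keep
    if left >= right:
        return intron  # loop is already short; nothing to remove
    return intron[:left] + intron[right:]
```

After excision, checking that `len(result) < 800` is a quick way to confirm that the remaining sequence is in the size range described above for the ribozyme.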

Note: The ORF's start codon and stop codon are not always in the expected location in the secondary structure. Sometimes the start is upstream of domain 4, sometimes the stop codon is in domain 5; so, it doesn't always work to simply excise the ORF from start codon to stop codon, and then fold the rest.

If the 5' end of the intron is not yet located, it can be found in much the same way as the domain 4 stem. A sequence is pasted together from the 5' exon sequence and the intron through domain 6. If the whole sequence is longer than 800 nt, then domains 4/5/6 can be omitted. After the RNA folding, one looks among the folding possibilities for an appropriate stem motif following the 5' GUGYG that could demarcate the basal helix of D1, again with reference to the consensus structure. Additional confirmation of the boundaries can come from identification of the IBS1 sequence at the end of the 5' exon, and from identification of the host gene sequence that will be spliced together during intron excision.

Basically, all of the "conserved regions" in the consensus secondary structure should be identifiable. Where two related sequences have different structures, one or both structures need to be adjusted so that all equivalent bases in the two sequences have the same pairings.

Confirmation of the intron boundaries
Once the boundaries are located (and sometimes before), it is useful to identify the host gene that is interrupted by the intron. Based on the deduced intron boundaries, the sequence can be spliced in silico, and the spliced exons can be used as a query in a BLAST search against a protein database. If the host gene product is found, it is straightforward to judge if the intron boundaries are correct.
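The in silico splicing step can be sketched as follows, assuming the genomic sequence is given on the intron's strand and the deduced boundaries are 0-based, end-exclusive coordinates. The function names, the optional `flank` limit on exon length, and the FASTA helper are illustrative assumptions.

```python
def splice_in_silico(genomic, intron_start, intron_end, flank=None):
    """Join the exons after removing genomic[intron_start:intron_end].

    If flank is given, only that many nucleotides of each exon are kept,
    which keeps the query for the database search short.
    """
    if not 0 <= intron_start < intron_end <= len(genomic):
        raise ValueError("intron coordinates are out of order or out of range")
    exon5 = genomic[:intron_start]
    exon3 = genomic[intron_end:]
    if flank is not None:
        exon5 = exon5[-flank:] if flank else ""
        exon3 = exon3[:flank]
    return exon5 + exon3


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

The resulting FASTA record can be submitted to a translated search against a protein database (for example blastx); an uninterrupted match to the host gene product across the splice junction supports the deduced boundaries.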