Advisor

McCarthy, Fiona M.

Committee Member

Nanduri, Bindu

Committee Member

Peterson, Daniel G.

Committee Member

Bridges, Susan M.

Committee Member

Burgess, Shane C.

Date of Degree

1-1-2012

Document Type

Dissertation - Open Access

Abstract

Advances in next-generation sequencing (NGS) technologies have significantly reduced the cost per sequenced base pair and increased the volume of sequence data. However, most currently used NGS technologies produce relatively short sequence reads (50-150 bp) compared to Sanger sequencing (~700 bp). This poses an additional challenge for data analysis, because shorter reads are more difficult to assemble. At this point, the production of sequencing data outpaces our capacity to analyze it. Newer NGS technologies capable of producing longer reads are emerging, which should simplify and speed up genome assembly. However, this will only increase the number of sequenced genomes that lack structural and functional annotation. In addition to multiple scientific initiatives to sequence thousands of genomes, personalized medicine centered on sequencing and analysis of individual human genomes will become more widely available. This poses a challenge for computer science and emphasizes the importance of developing new computational algorithms, methodologies, tools, and pipelines. This dissertation focuses on the development of software tools, methodologies, and resources to help address the need to process the large volumes of data generated by new sequencing technologies. The research concentrated on genome structure analysis, individual variation, and comparative biology. This dissertation presents: (1) the Short Read Classification Pipeline (SRCP) for preliminary genome characterization of unsequenced genomes; (2) a novel methodology for phylogenetic analysis of closely related organisms or strains of the same organism without a sequenced genome; and (3) a centralized online resource for standardized gene nomenclature. Using the SRCP and the methodology for initial phylogenetic analysis developed in this dissertation makes it possible to position an organism in its evolutionary context. This should facilitate identification of orthologs between species and paralogs within a species, even at the initial stage of analysis when only the exome is sequenced, and thus enable functional annotation by transferring gene nomenclature from well-annotated 1:1 orthologs, as required by the online standardized gene nomenclature resource developed in this dissertation. Thus, the tools, methodology, and resources presented here are tied together by following the initial analysis workflow for structural and functional annotation.

URI

https://hdl.handle.net/11668/20369
