Theses and Dissertations

Incorporating semantic and syntactic information into document representation for document clustering

Author

Yong Wang

Issuing Body

Mississippi State University

Advisor

Hodges, Julia E.

Committee Member

Hansen, Eric

Committee Member

Banicescu, Ioana

Committee Member

Boggess, Lois

Committee Member

Vaughn, Ray

Other Advisors or Committee Members

Bridges, Susan M.

Date of Degree

8-6-2005

Document Type

Dissertation - Open Access

Major

Computer Science

Degree Name

Doctor of Philosophy

College

James Worth Bagley College of Engineering

Department

Department of Computer Science

Abstract

Document clustering is a widely used strategy for information retrieval and text data mining. Traditional document clustering systems represent each document as a bag of independent words. In this project, we propose to enrich the document representation by incorporating semantic and syntactic information, which is identified by performing semantic and syntactic analysis on the raw text. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic and syntactic information can improve the performance of our document clustering system for most of our data sets, and that a statistically significant improvement can be achieved when both kinds of information are combined. Our experiments with compound words show that using compound words alone does not improve clustering performance for our data sets. When compound words are combined with the original single words, the combined feature set performs slightly better for most data sets, but this improvement is not statistically significant. To select the best clustering algorithm for our document clustering system, we compare several widely used clustering algorithms. Although the bisecting K-means method has advantages when working with large data sets, a traditional hierarchical clustering algorithm still achieves the best performance for our small data sets.
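To make the feature sets named in the abstract concrete, the following minimal Python sketch contrasts a bag of independent words with a feature set that combines single words and compound words. It is an illustration only, not the dissertation's implementation: the abstract does not say how compound words are identified, so adjacent word pairs are used here as an assumed stand-in, and all function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def single_word_features(text):
    """Bag of independent words: maps each term to its count."""
    return Counter(tokenize(text))


def compound_word_features(text):
    """Adjacent word pairs (assumption: a crude stand-in for compound words)."""
    tokens = tokenize(text)
    return Counter(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))


def combined_features(text):
    """Single-word features plus compound-word features in one vector."""
    return single_word_features(text) + compound_word_features(text)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical sample documents, not taken from the dissertation's data sets.
doc_a = "text data mining finds patterns in text"
doc_b = "mining text data for patterns"

print(cosine(single_word_features(doc_a), single_word_features(doc_b)))
print(cosine(combined_features(doc_a), combined_features(doc_b)))
```

A pairwise similarity such as this cosine score is the kind of input a hierarchical or K-means style clustering algorithm would consume. Which similarity measure the dissertation actually uses is not stated in the abstract.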

Temporal Coverage

2000-2009

URI

https://hdl.handle.net/11668/14883

Recommended Citation

Wang, Yong, "Incorporating semantic and syntactic information into document representation for document clustering" (2005). Theses and Dissertations. 2682.
https://scholarsjunction.msstate.edu/td/2682
