Semi-Automatic Labeling of Training Data Sets in Text Classification
- Nayereh Ghahreman
- Ahmad Dastjerdi
Abstract
The Web includes digital libraries and billions of text documents, and fast, simple search through this sizeable collection is important for users and researchers. Because manual or rule-based document classification is difficult and time-consuming, automatic classification systems are needed. Automatic text classification systems, in turn, demand extensive and well-formed training data sets. To provide these data sets, experts usually label numerous unlabeled documents by hand. Manual labeling is itself difficult and time-consuming, and human exhaustion and carelessness make mistakes possible. This study proposes semi-automatic creation of the training data set, in which only a small percentage of the documents in this extensive set is labeled manually and the remainder is labeled automatically. Results show that by manually labeling only ten percent of the training set, the remaining documents can be automatically labeled with 98 percent accuracy. It is worth mentioning that this reduction in accuracy is observed only on standard data sets; for large practical data sets, it is trivial compared to the accuracy reduction caused by human exhaustion and carelessness.
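The abstract does not say which classifier or labeling procedure the authors use, so the following is only a minimal sketch of the general idea: train on a small manually labeled seed set, then let the model label the remaining documents. The multinomial Naive Bayes model, the function names (`train_nb`, `predict_nb`, `semi_automatic_label`), and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # token frequencies per class
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the most probable label for text (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def semi_automatic_label(seed, unlabeled):
    """Label the unlabeled documents with a model trained on the seed set."""
    model = train_nb(seed)
    return [(text, predict_nb(model, text)) for text in unlabeled]


# Hypothetical toy data: the seed plays the role of the manually labeled part.
seed = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the processor runs the compiled code", "computing"),
    ("the compiler reported a syntax error", "computing"),
]
unlabeled = [
    "a goal in the final match",
    "the code failed with a compiler error",
]

labeled = semi_automatic_label(seed, unlabeled)
for text, label in labeled:
    print(f"{label}: {text}")
```

In a realistic setting the seed would be roughly the manually labeled ten percent of the corpus, and the automatically assigned labels would be checked against a held-out set before being used for training.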
- DOI:10.5539/cis.v4n6p48
This work is licensed under a Creative Commons Attribution 4.0 License.