Towards a Linguistic Stylometric Model for the Authorship Detection in Cybercrime Investigations
- Abdulfattah Omar
- Aldawsari Bader Deraan
Abstract
This study proposes an integrated framework that considers letter-pair frequencies and combinations along with the lexical features of documents as a means of identifying the authorship of short texts posted anonymously on social media. Taking a quantitative morpho-lexical approach, the study tests the hypothesis that letter information, or mapping, can identify unique stylistic features. As such, stable word combinations and morphological patterns can be used successfully for authorship detection in very short texts. This method offers significant potential in the fight against online hate speech, which is often posted anonymously and whose authorship is difficult to identify. The data analyzed come from a corpus of 12,240 tweets derived from 87 Twitter accounts. A self-organizing map (SOM) model was used to classify input patterns in the tweets that shared common features. Tweets grouped in a particular class displayed features that suggested they were written by a particular author. The results indicate that the classification accuracy of the proposed system was around 76%. Up to 22% of this accuracy was lost, however, when only distinctive words were used, and 26% was lost when the classification procedure was based solely on letter combinations and morphological patterns. Integrating letter-pairs and morphological patterns had the advantage of improving accuracy when determining the author of a given tweet. This indicates that combining different linguistic variables in a single system leads to better performance in classifying very short texts. It is also clear that the use of the SOM led to better clustering performance because of its capacity to integrate two different linguistic levels for each author profile.
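The general idea described in the abstract, combining letter-pair frequencies with lexical features and clustering the result with a self-organizing map, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify feature definitions, map size, or training parameters, so every function name (`letter_pairs`, `word_freqs`, `vectorize`, `train_som`, `best_unit`), the one-dimensional map, and all parameter values below are assumptions chosen only for illustration.

```python
import math
import random
from collections import Counter


def letter_pairs(text):
    """Relative frequencies of adjacent letter pairs inside each word (assumed feature)."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {"lp:" + k: v / total for k, v in counts.items()} if total else {}


def word_freqs(text):
    """Relative word frequencies, a simple stand-in for lexical features."""
    words = ["".join(c for c in w if c.isalpha()) for w in text.lower().split()]
    counts = Counter(w for w in words if w)
    total = sum(counts.values())
    return {"w:" + k: v / total for k, v in counts.items()} if total else {}


def vectorize(texts):
    """Merge both feature levels into one dense vector per text."""
    profiles = [{**letter_pairs(t), **word_freqs(t)} for t in texts]
    vocab = sorted(set().union(*profiles))
    return [[p.get(f, 0.0) for f in vocab] for p in profiles], vocab


def best_unit(units, x):
    """Index of the map unit closest to x (Euclidean distance)."""
    return min(range(len(units)), key=lambda j: math.dist(units[j], x))


def train_som(vectors, n_units=3, epochs=50, lr=0.5, seed=0):
    """Train a tiny one-dimensional SOM; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    units = [list(rng.choice(vectors)) for _ in range(n_units)]
    for t in range(epochs):
        decay = 1.0 - t / epochs
        sigma = max(n_units / 2 * decay, 0.5)  # neighbourhood radius
        for x in vectors:
            b = best_unit(units, x)
            for j, w in enumerate(units):
                # Units near the winner on the map move further toward x.
                h = math.exp(-((j - b) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * decay * h * (x[d] - w[d])
    return units


# Toy input: invented strings, not data from the study's corpus.
texts = ["so happy today", "so happy today", "meeting moved to monday"]
vectors, vocab = vectorize(texts)
units = train_som(vectors)
clusters = [best_unit(units, v) for v in vectors]
```

In this sketch, texts assigned to the same map unit are treated as stylistically similar. Dropping either `letter_pairs` or `word_freqs` from `vectorize` would give a single-level variant of the kind the abstract compares against, although the accuracy figures reported there cannot be reproduced from a toy example like this one.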
- DOI:10.5539/ijel.v9n5p182