Towards a Linguistic Stylometric Model for the Authorship Detection in Cybercrime Investigations

Abdulfattah Omar; Aldawsari Deraan

doi:10.5539/ijel.v9n5p182

Abdulfattah Omar
Aldawsari Bader Deraan

Abstract

This study proposes an integrated framework that considers letter-pair frequencies and combinations along with the lexical features of documents as a means of identifying the authors of short texts posted anonymously on social media. Taking a quantitative morpho-lexical approach, the study tests the hypothesis that letter information, or mapping, can identify unique stylistic features, and therefore that stable word combinations and morphological patterns can be used successfully for authorship detection in very short texts. This method offers significant potential in the fight against online hate speech, which is often posted anonymously and whose authorship is difficult to identify. The data come from a corpus of 12,240 tweets drawn from 87 Twitter accounts. A self-organizing map (SOM) model was used to classify input patterns in the tweets that shared common features. Tweets grouped in a particular class displayed features suggesting that they were written by a particular author. The results indicate that the classification accuracy of the proposed system was around 76%. Up to 22% of this accuracy was lost, however, when only distinctive words were used, and 26% was lost when the classification procedure was based solely on letter combinations and morphological patterns. Integrating letter pairs and morphological patterns thus improved accuracy in determining the author of a given tweet. This indicates that combining different linguistic variables in a single system leads to better performance in classifying very short texts. The use of the SOM also led to better clustering performance because of its capacity to integrate two different linguistic levels for each author profile.
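As an illustration of the kind of combined profile the abstract describes, the following minimal sketch builds letter-pair and word relative frequencies for a single tweet. It is not the authors' implementation: the function names (`letter_pairs`, `relative_freqs`, `tweet_profile`), the restriction to the letters a-z, and the use of relative frequencies are all assumptions made for the example, and the SOM clustering step is not shown.

```python
from collections import Counter
import re

# Assumption: tokens are runs of the letters a-z after lowercasing.
# A real tweet corpus would need handling for mentions, URLs, and other scripts.
WORD_RE = re.compile(r"[a-z]+")

def letter_pairs(text):
    """Return adjacent letter pairs found inside each word of `text`."""
    words = WORD_RE.findall(text.lower())
    return [w[i:i + 2] for w in words for i in range(len(w) - 1)]

def relative_freqs(items):
    """Map each distinct item to its share of the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def tweet_profile(text):
    """Combine letter-pair and word frequencies into one feature dict."""
    words = WORD_RE.findall(text.lower())
    profile = {f"pair:{k}": v for k, v in relative_freqs(letter_pairs(text)).items()}
    profile.update({f"word:{k}": v for k, v in relative_freqs(words).items()})
    return profile

profile = tweet_profile("No no, not now")
```

Profiles of this kind, once aligned to a shared vocabulary of pairs and words, could serve as the input vectors for a clustering model such as a SOM.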