Automatic Identification and Filtration of COVID-19 Misinformation

The spread of fake news through online media platforms has accelerated over the last decade. Misinformation and disinformation of every kind are propagated extensively through social media platforms such as Facebook and Twitter. With the present global pandemic ravaging the world and killing hundreds of thousands, fake news circulating on these platforms can exacerbate the situation. Unfortunately, there has been a great deal of misinformation and disinformation about the COVID-19 virus, the implications of which have been disastrous for people, countries, and economies. The right information is crucial in the fight against this pandemic and, in this age of data explosion, where terabytes of data are generated every minute, near-real-time identification and tagging of misinformation is essential to minimize its consequences. In this paper, the authors use a two-step approach based on Natural Language Processing (NLP) to classify whether a tweet is potentially misinforming. First, COVID-19-tagged tweets were filtered based on the presence of keywords formulated from a list of common misinformation spread about the virus. Second, a deep neural network (RNN) trained on openly available real and fake news datasets was used to predict whether the keyword-filtered tweets were factual or misinformed.


Introduction
Social media is a significant conduit for news and information in the modern media environment (P. Sharma and Kaur, 2017), with one in three people in the world engaging in social media, and two thirds of those on the internet using it (Ortiz-Ospina, 2019). Twitter had 255 million monthly active users in February 2014, at the start of the Ebola outbreak (Twitter, 2014). In February 2020, at the start of the COVID-19 outbreak, Twitter reported 166 million daily active users (Twitter, 2020). This shows the magnitude of reach that any information or misinformation can have on this one social media platform. As much as social media has proved to be a novel means of connecting people and disseminating information, it has also delivered too much information and misinformation, causing hysteria, mental distress, self-harm and, in a few cases, suicides (Rosenberg, Syed, and Rezaie, 2020).
Misinformation can be defined as a claim of fact that is currently false due to lack of scientific evidence (Chou, Oh, and Klein, 2018). Kouzy et al. (2020) state that tweet quality (misinformation vs. correct information) did not differ based on the number of likes or retweets, indicating that misinformation is as likely to spread and engage users as the truth. Moreover, more than 40% of all tweets include a URL, which presumably signals authenticity, yet only 0.4% of those are from highly credible health sources such as the CDC and WHO (Singh et al., 2020). As more and more people live in isolation, fearing the risk of outbreak, they are more likely to seek out information about the disease, which makes authenticity all the more important. This need has been acknowledged by the World Health Organization, which has partnered with several social media platforms and seven major tech companies (Facebook, Google, LinkedIn, Microsoft, Reddit, Twitter, and YouTube) that agreed to stamp out fraud and misinformation and to promote critical updates from healthcare agencies (Statt, 2020). Identifying hoaxes and rumors on online platforms and segregating them from scientific information can be achieved through NLP and advanced text analytics. Applying principles and lessons from the algorithms and methodologies used by online companies such as Amazon and Reddit to detect fake accounts and fake reviews could serve as a substantial means of identifying misinformation (Shu et al., 2017).
In this paper, we discuss our approach to classifying COVID-19-related misinformation on Twitter using a Recurrent Neural Network model that distinguishes real from fake news. We collected tweets related to COVID-19 and performed keyword-based filtration and categorization based on the presence of keywords related to common misinformation. The filtered tweets were then classified as factual or misinformed using a text classification model trained on real and fake news datasets.
The research by Sharma (2020) approaches the problem from multiple angles. The authors identify misinformation based on information cascades (the source tweets and the propagation information from the cluster of the retweet graph) and analyze the degree of falsehood and the varying or deliberate intents, thereby classifying the content into four categories: Unreliable (false, questionable, rumors and misleading news), Conspiracy (conspiracy theories and scientifically dubious news), Click-bait (exaggerated or misleading headlines to attract attention) and Political/Biased (written in support of a particular political orientation). They also analyzed the geographical spread and the sentiment of each of these categories (K. Sharma et al., 2020).
In contrast, the research by Shahi (2020) follows a two-step approach: first identifying the accounts involved in the spread of misinformation, followed by analysis (Shahi, Dirkson, and Majchrzak, 2020). Accounts are characterized by 1) whether they are 'bot' accounts, 2) whether they are associated with brands, and 3) their popularity or follower count. Information diffusion is analyzed using the speed of retweets as a proxy for the speed of propagation, followed by content analysis using hashtags, emojis and distinctive terms.

Methodology
A. Data Collection
COVID-19-related English-language tweets from the United States of America were collected for the period from March 20, 2020, to April 30, 2020. In total, 4 million tweets were obtained, of which more than 77% were geotagged. Table I summarizes the statistics and Table II lists the hashtags that were used.

B. Formulating a Keyword Corpus of Popular Misinformation Spread Around COVID-19
The IDeaS Center and CASOS Center released a curated list of stories containing inaccurate information about COVID-19, grouped into six categories: cure, nature of virus, conspiracy theories, emergencies, misbehavior, and good news (Carley, 2020). We extracted keywords from each of these categories using the RAKE library (a Python implementation of the Rapid Automatic Keyword Extraction algorithm built on NLTK) and stored them in separate files for the next steps.
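To make the keyword extraction step concrete, the following is a minimal pure-Python sketch of the RAKE idea: split the text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and score each phrase as the sum of its word scores. It is not the RAKE library itself; the short stopword list, the function names, and the example sentences are illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; the RAKE library
# relies on NLTK's much larger stopword list).
STOPWORDS = {
    "a", "an", "and", "are", "as", "by", "can", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Any run of punctuation ends the current fragment.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases it occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Longer phrases made of rarely repeated words score highest, which is why multi-word claims tend to rank above single common words such as "virus".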

C. Tweets Pre-processing and Categorization
The scraped Twitter data was stored in a file and cleaned by removing HTML tags using the Beautiful Soup library. The PhraseMatcher API from spaCy was then used to match the previously extracted keywords against the tweets; a match indicates a tweet that may contain misinformation related to one of the categories in Table III.
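The cleaning and matching step could look roughly like the sketch below, which strips HTML with Beautiful Soup and then uses spaCy's PhraseMatcher to tag a tweet with every category whose keywords it contains. The keyword lists, the case-insensitive matching, and the helper names `strip_html` and `categorize` are assumptions made for illustration; they are not the corpus or code used in the study.

```python
import spacy
from bs4 import BeautifulSoup
from spacy.matcher import PhraseMatcher

# Hypothetical keywords per category (placeholders, not the paper's corpus).
CATEGORY_KEYWORDS = {
    "cure": ["drinking bleach", "garlic water"],
    "conspiracy theories": ["bioweapon", "secret lab"],
}

# A blank English pipeline provides tokenization only; no trained model needed.
nlp = spacy.blank("en")

# attr="LOWER" makes matching case-insensitive (an assumption of this sketch).
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for category, keywords in CATEGORY_KEYWORDS.items():
    matcher.add(category, [nlp.make_doc(keyword) for keyword in keywords])


def strip_html(raw):
    """Remove HTML tags and return the visible text."""
    return BeautifulSoup(raw, "html.parser").get_text(separator=" ", strip=True)


def categorize(raw_tweet):
    """Return the sorted list of categories whose keywords occur in the tweet."""
    doc = nlp(strip_html(raw_tweet))
    return sorted({nlp.vocab.strings[match_id] for match_id, _start, _end in matcher(doc)})
```

A tweet that matches no keyword returns an empty list and would be dropped before classification.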

D. Classification Model
We used 35,918 news articles to train the model and 8,980 to test it. The shortest article in the training dataset is 32 words long and the longest is 51,894 words; the median length is 2,269 words. We truncated the articles to a maximum length of 300 words, tokenized them using the 10,000 most frequent words as the vocabulary, and padded shorter articles with zeros at the end. All the words in the text are converted to low-dimensional vectors, which are then stacked to create an embedding matrix $E_w \in \mathbb{R}^{d \times |v|}$, where $d$ is the dimension of the word vectors and $|v|$ is the vocabulary size. Each word is mapped to a 100-dimensional vector using pre-trained GloVe embeddings (Pennington, Socher, and Manning, 2014).
Next, these word embeddings are fed into Long Short-Term Memory (LSTM) cells (Graves, 2012). Each LSTM cell receives the word vector at the current time step as well as the output of the previous LSTM cell. In this fashion, the LSTM layer learns the patterns that distinguish fake news from fact. An advantage of the LSTM over a vanilla RNN is that it tackles the vanishing gradient problem, making it a widely popular choice for NLP applications that require long-range dependencies (Graves, Mohamed, and Hinton, 2013; Rao et al., 2018). Since this is a binary classification task, we used the binary cross-entropy loss function, which in its standard form is $L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$, where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability for article $i$. To predict the class, we used a sigmoid activation function, $\sigma(x) = 1/(1 + e^{-x})$, a non-linear function that squeezes its input into the range between 0 and 1.
The model architecture had the following layers in sequential order: input layer, embedding layer, LSTM layer with 128 hidden units, dropout layer with dropout value 0.2, LSTM layer with 64 hidden units, dropout layer with dropout value 0.2, fully connected layer with 32 units and ReLU activation function, and lastly a fully connected layer with 1 unit and sigmoid activation function. We compiled the model with the Adam optimizer; the architecture is presented in Figure 1.
The model, when trained on labeled data of true and fake news articles, gives an accuracy of 99% on the test data. The precision, recall and F1 score are reported in Table V.
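The truncation, vocabulary restriction, and zero-padding described above can be sketched in a few lines of plain Python. The whitespace tokenization, the choice to keep the first 300 tokens when truncating, and the function names are assumptions of this sketch; only the limits of 10,000 words and 300 tokens come from the text.

```python
from collections import Counter

MAX_WORDS = 10_000  # vocabulary size stated in the text
MAX_LEN = 300       # maximum article length stated in the text


def build_vocab(texts, max_words=MAX_WORDS):
    """Map the most frequent words to integer ids; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: idx
        for idx, (word, _count) in enumerate(counts.most_common(max_words), start=1)
    }


def encode(text, vocab, max_len=MAX_LEN):
    """Convert text to ids, skip out-of-vocabulary words, truncate, pad with zeros."""
    ids = [vocab[word] for word in text.lower().split() if word in vocab]
    ids = ids[:max_len]                       # keep the first max_len tokens (assumption)
    return ids + [0] * (max_len - len(ids))   # zero-pad at the end, as in the text
```

Every encoded article then has the same fixed length, which is what allows articles to be batched and fed to the embedding layer.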
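The layer sequence listed above could be written in Keras roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the source does not name the framework, the embedding matrix here is a random placeholder standing in for GloVe vectors, freezing the embedding layer is our choice, and one extra row is reserved for the padding id 0.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # top words kept, per the text
EMBED_DIM = 100      # GloVe vector size, per the text
MAX_LEN = 300        # padded article length, per the text


def build_model(embedding_matrix):
    """Stacked-LSTM binary classifier following the layer order in the text."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(
            VOCAB_SIZE + 1,  # +1 row for the padding id 0 (assumption)
            EMBED_DIM,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,  # assumption: pre-trained vectors kept frozen
        ),
        layers.LSTM(128, return_sequences=True),  # pass the full sequence on
        layers.Dropout(0.2),
        layers.LSTM(64),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # score between 0 and 1
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Placeholder matrix; in practice each row would be filled from a GloVe file.
placeholder_matrix = np.random.normal(size=(VOCAB_SIZE + 1, EMBED_DIM)).astype("float32")
model = build_model(placeholder_matrix)
```

The first LSTM layer uses `return_sequences=True` so that the second LSTM layer receives one vector per time step instead of only the final state.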
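For reference, precision, recall and F1 are computed from true positives, false positives and false negatives, as in the minimal sketch below. The function name and the toy labels in the usage note are illustrative only and are not results from this study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With toy labels `y_true = [1, 1, 0, 0]` and `y_pred = [1, 0, 1, 0]`, there is one true positive, one false positive and one false negative, so all three metrics equal 0.5.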

E. Making Predictions on Tweets
Once the tweets were filtered as described in Section III-C, they were tokenized, padded and passed through the fake news classification model described in Section III-D to obtain a score between 0 and 1. An appropriate threshold value can then be used to tag the misinformation-carrying tweets and thus counter the spread of COVID-19-related misinformation.

Results
When we set the prediction threshold to a high value, 0.8, we get genuine tweets, and when we set it to a low value, 0.2, we get the misinforming tweets related to COVID-19; examples can be viewed in the Tableau dashboard.
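The two thresholds can be combined into a simple tagging rule, sketched below. The 0.8 and 0.2 values come from the text; treating the boundaries as inclusive and labeling the scores in between as "uncertain" are assumptions of this sketch.

```python
GENUINE_THRESHOLD = 0.8  # scores at or above this are treated as genuine
MISINFO_THRESHOLD = 0.2  # scores at or below this are treated as misinformation


def tag_tweet(score):
    """Map a model score in [0, 1] to a coarse label."""
    if score >= GENUINE_THRESHOLD:
        return "genuine"
    if score <= MISINFO_THRESHOLD:
        return "misinformation"
    return "uncertain"  # assumption: the middle band is left untagged for review
```

Leaving a middle band untagged trades coverage for confidence: only tweets the model scores near either extreme receive a label.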

Conclusion
We observed that the LSTM-based sequential model can learn the tone, sentiment, and genuineness of a text when trained on labeled fake news data and can separate misinformation from real information with high accuracy. In the current study, we attempted to segregate COVID-19-related misinformation from 4M tweets collected over a 40-day period using a model pre-trained on a fake news dataset. The Tableau dashboard shows that with a higher threshold (0.8 in this case) we get genuine tweets that do not carry any misinformation, whereas a lower threshold (0.2) surfaces tweets containing misinformation.

Discussions and Limitations
This was an attempt to classify misinformation related to COVID-19 and counter its spread. We extracted tweets that indicate a possible misinformation theory based on keywords and used a model trained on a labeled dataset of true and fake news articles to classify them. The misinforming tweets identified by the model can be used to further train the model to identify the patterns in misinforming tweets related to COVID-19. In this way the model can learn new patterns and language related to COVID-19. A better approach would be to fact-check the tweets that the model predicts as misinforming and then feed the cleaner data back to the model for training. The current model is based on an LSTM approach and uses Twitter data of about 4M tweets collected over a period of about 40 days. In future work, we will collect data over a wider span of time and explore more advanced NLP models, including transformer-based models, for comparison with the current model.