How to combat fake news: the case for language data analysis


Fake news has always been part of public discourse, but the rapid development of technology has catapulted it to the top of the list of problems democratic societies currently face. As the web and social media have become primary sources of information, certain groups have exploited the freedom, anonymity and reach these services offer to spread false information disguised as legitimate reporting. Such information can harm individuals and society at large, because people readily believe claims without verifying them first, especially when those claims confirm existing biases. When fake news articles are not detected early, they spread rapidly, harming individuals, organizations and societies. The spread of fake news is often associated with extreme beliefs that contradict established science. There is a very recent example with which most people can relate: alongside the COVID-19 pandemic, the world had to combat an epidemic of fake news.

The role of AI and NLP in Fake News Detection

To tackle this growing issue, various technical approaches have been developed. Artificial Intelligence (AI) and Natural Language Processing (NLP) are already widely used to support business needs and provide specialized services to customers. Given the enormous amount of data produced constantly, which can exceed what humans are able to evaluate and analyze, AI plays a key role in present-day business decisions, and modern businesses use NLP strategically to gain a competitive edge. Because NLP can handle huge unstructured data sets from many domains, including education, healthcare, business and security, it is a powerful asset for fake news detection. NLP also offers multiple ways of processing language data, such as emotion analysis and detection, Named Entity Recognition (NER), Part-of-Speech (POS) tagging, chunking and Semantic Role Labeling, which can enhance decision-making and business efficiency across many languages (e.g. Greek, Arabic, Chinese). When it comes to fake news detection, the main goal is to develop a methodology that applies different classification algorithms to build a classification model that can be used as a scanner.


Detecting Fake News: The technical approaches

As always, the first step in this type of classification problem is data collection. For textual analysis and classification, the ideal dataset contains both fake and real news on the same subject. A balanced dataset, roughly 50% fake and 50% real articles, prevents the classifier from simply learning the majority class and makes its accuracy figures meaningful. The usual follow-up is preprocessing, in order to remove noise from the data.
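
A minimal preprocessing step might look like the following sketch, which lowercases the text, strips punctuation and tokenizes on whitespace. The function name and the exact cleaning rules are illustrative assumptions, not a prescribed pipeline; real projects often also remove stopwords or apply stemming:

```python
import re
import string

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, then tokenize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip().split()

preprocess("BREAKING: Scientists HIDE the truth!!!")
# → ['breaking', 'scientists', 'hide', 'the', 'truth']
```

Note that stripping all punctuation also removes intra-word hyphens, so a token such as "COVID-19" would become "covid19"; whether that is acceptable depends on the corpus.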

Next steps include feature selection and the train-test split. There are of course various ways to engineer features. From a language perspective it is particularly useful to select keywords that may suggest an article contains false information and then build a dictionary from the findings. The frequencies of these keywords can serve both as features and as indicators of the validity and accuracy of the texts. Examples of speech patterns, keywords and phrases that may suggest a text contains fake information include:

  • Derogatory terms
  • Extremely informal or unscientific use of language
  • Lack of references on scientific topics
  • Specific terminology and words associated with extreme or unscientific views
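
The dictionary idea above can be sketched as follows. The words in FAKE_MARKERS are hypothetical placeholders; in practice the dictionary would be derived from the fake-news portion of the collected corpus:

```python
from collections import Counter

# Hypothetical marker dictionary; a real one is built from the corpus.
FAKE_MARKERS = {"hoax", "sheeple", "coverup", "miracle", "shocking"}

def keyword_features(tokens: list[str]) -> dict[str, float]:
    """Count dictionary hits and their share of the document length."""
    counts = Counter(tokens)
    hits = sum(counts[w] for w in FAKE_MARKERS)
    return {
        "marker_hits": hits,
        "marker_ratio": hits / len(tokens) if tokens else 0.0,
    }

keyword_features(["this", "shocking", "hoax", "is", "a", "coverup"])
# → {'marker_hits': 3, 'marker_ratio': 0.5}
```

Both values can feed directly into a classifier as numeric features.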

Depending on the complexity of the language in question, tokenization and Part-of-Speech tagging can also be mobilized in order to extract the role of each word. Combining this with the bag-of-words approach can yield a classifier that evaluates information validity from term frequency. In practice, however, this method can prove problematic, because word order, which bag-of-words discards, carries meaning.
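
A bare-bones bag-of-words representation can be sketched with the standard library alone (the function names are illustrative; libraries such as scikit-learn provide equivalent, optimized vectorizers):

```python
from collections import Counter

def build_vocab(corpus: list[list[str]]) -> list[str]:
    """Sorted vocabulary over all tokenized documents."""
    return sorted({tok for doc in corpus for tok in doc})

def vectorize(doc: list[str], vocab: list[str]) -> list[int]:
    """Term-frequency vector; word order is discarded,
    which is exactly the weakness noted above."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

corpus = [["vaccines", "cause", "autism"], ["vaccines", "are", "safe"]]
vocab = build_vocab(corpus)
# vocab == ['are', 'autism', 'cause', 'safe', 'vaccines']
vectorize(corpus[0], vocab)
# → [0, 1, 1, 0, 1]
```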

Moving on to more context-aware representations, n-grams take word order into consideration, and with it local context, which is a key aspect of meaning. After feature selection the dataset is split and the classifier model is trained. Once testing verifies the results, the model can be applied to unseen data.
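
The two steps just mentioned, n-gram extraction and the train-test split, can be sketched as follows (function names and the 80/20 ratio are conventional choices, not requirements):

```python
import random

def ngrams(tokens: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Sliding window of n consecutive tokens, preserving local word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_test_split(data: list, test_ratio: float = 0.2, seed: int = 42):
    """Shuffle a copy and split; a fixed seed keeps experiments reproducible."""
    data = data[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

ngrams(["the", "earth", "is", "flat"], n=2)
# → [('the', 'earth'), ('earth', 'is'), ('is', 'flat')]
```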

We always have to keep in mind that the goal is the best possible performance and accuracy. A common way to achieve this is to define the categories carefully so that the classes do not overlap. Trying several classifiers can also provide additional insight into both the model and the data being processed. The most widely used models for textual analysis are the following:

  • KNN (k-Nearest Neighbors)
  • Naive Bayes
  • Support Vector Machine (SVM)
  • Random Forests
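
Of the models listed, Naive Bayes is the simplest to illustrate end to end. The following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing over tokenized documents; the class name, the toy corpus and its labels are invented for illustration, and production work would use a tested library implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs: list[list[str]], labels: list[str]):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {tok for doc in docs for tok in doc}
        return self

    def predict(self, doc: list[str]) -> str:
        def log_score(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.priors[c])
            for tok in doc:
                # Smoothing avoids zero probability for unseen words.
                p = (self.word_counts[c][tok] + 1) / (total + len(self.vocab))
                score += math.log(p)
            return score
        return max(self.classes, key=log_score)

# Toy corpus with invented labels, purely for demonstration.
docs = [["shocking", "hoax", "exposed"], ["study", "confirms", "findings"],
        ["miracle", "cure", "hoax"], ["peer", "reviewed", "study"]]
labels = ["fake", "real", "fake", "real"]
model = NaiveBayes().fit(docs, labels)
model.predict(["hoax", "cure"])
# → 'fake'
```

The same fit/predict pattern applies to the other listed classifiers, which is why they can be swapped in and compared on the same feature set.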

Limitations of the technical approaches

Of course, there are certain limitations to these approaches. Common problems that trouble the NLP community are relevant to fake news detection as well. Source material tends to be language specific, and most available resources exist only for the most widely spoken languages. In practice, anyone trying to train a model for a less common language would first have to collect a large amount of material in order to compose a proper dictionary as a basis for classification. Linguistic complexity should also be taken into account: phenomena such as morphological synthesis and derivation can reduce the accuracy of classifiers if they are ignored. To be successful, aspiring data scientists need to acquire specific linguistic knowledge in order to classify language data correctly.

The Future of Fake News Detection

We have presented several methods and classifiers that can be used to detect and label fake news efficiently and so limit the after-effects of its spread. We predict that applying fake news detection systems will become best practice prior to important political, business and even personal decisions, and accordingly we expect new systems to be developed and existing ones to be enhanced. As mentioned at the beginning of this article, fake news has always been around and is not going away any time soon, but hopefully science will equip mankind with the right tools to reduce its impact on decision-making, societal peace and progress.

Sounds interesting?

Get in touch now to schedule
an introductory chat.
No strings attached!