A Modern k-Nearest Neighbors (kNN) Algorithm for Effective Text Misinformation Detection
Abstract
Messaging apps like WhatsApp, LINE, and WeChat, are very popular in these days.
People use the apps to receive, send, or relay messages such as news, video clips, and comments.
They become an indispensable tool in our daily lives and much of the information we receive is from them.
However, not all messages we receive are true and the worst is we may relay the false messages to our friends, who may lose faith on us after finding the messages are not correct.
This research uses a modern k-Nearest Neighbors (kNN) algorithm to identify fake text messages.
The kNN algorithm, a supervised learning algorithm, classifies the query instance by taking simple majority of the
k nearest neighbors, which are determined based on the
distances from the query instance to the training samples.
Therefore, the accuracy of the kNN algorithms are highly dependent on the distances applied.
This research will use the following distance measurements to identify text misinformation:
- Number of keywords matched, which is the simplest method among all measurements, for example,
Message: Flu shots are placebo and do not reduce the risk of catching flu.
Query: The risk of getting flu is lowered by flu shots.
# of keywords matched: 5 (flu, shot, risk, reduce/lower, get/catch)
- Number of phrases matched, which considers the number and length of the phrases (sequences of keywords) matched other than counting the number of keywords matched, for example,
Message: Flu shots are placebo and do not reduce the risk of catching flu.
Query: The risk of getting flu is lowered by flu shots.
# of phrases matched: 2 (flu shot, risk catch/get flu)
- Longest common subsequence (LCS), which is the longest subsequence common to all sequences in a set of sequences (ordered lists of keywords), for example,
Message: Flu shots are placebo and do not reduce the risk of catching flu.
Query: The risk of getting flu is lowered by flu shots.
LCS: 3 (risk catch/get flu)
- Longest approximate common subsequence (LACS), which produces a maximum-gain approximate common subsequence of two sequences, for example,
Message: Flu shots are placebo and do not reduce the risk of catching flu.
Query: The risk of getting flu is lowered by flu shots.
LACS: 6-loss (flu shot risk catch/get reduce/lower flu)
The problem of misinformation has bothered people for a long time, and serious consequences (like affecting elections) have happened from time to time.
In order to solve the problem, the misinformation must be identified first.
However, no satisfactory method is found to detect misinformation completely so far.
This research tries to identify text misinformation by using a machine-learning method, a kNN algorithm, whose accuracy relies heavily on the distances between the query message and the training messages.
Several effective distance measurements are proposed in this research.
Keywords
mobile computing, security, fake news, misinformation, misinformation identification, kNN, k-nearest neighbors, machine learning, information retrieval
Conference