On Scalable and Robust Truth Discovery in Big Data Social Media Sensing Applications

Abstract:

Identifying trustworthy information in noisy data contributed by numerous unvetted sources on online social media (e.g., Twitter, Facebook, and Instagram) is a crucial task in the era of big data. This task, referred to as truth discovery, aims to identify the reliability of the sources and the truthfulness of the claims they make without knowing either a priori. In this work, we identify three important challenges that have not been well addressed in the current truth discovery literature. The first is “misinformation spread”, where a significant number of sources contribute to false claims, making truthful claims difficult to identify. On Twitter, for example, rumors, scams, and influence bots are common cases of sources colluding, intentionally or unintentionally, to spread misinformation and obscure the truth. The second is “data sparsity”, or the “long-tail phenomenon”, where a majority of sources contribute only a small number of claims, providing insufficient evidence to determine their trustworthiness. For example, in the Twitter datasets that we collected during real-world events, more than 90% of sources contributed to only a single claim. The third is scalability: many current solutions do not scale to large social sensing events because their truth discovery algorithms are centralized. In this paper, we develop a Scalable and Robust Truth Discovery (SRTD) scheme to address these three challenges. In particular, the SRTD scheme jointly quantifies the reliability of sources and the credibility of claims using a principled approach. We further develop a distributed framework that implements the proposed scheme using Work Queue in an HTCondor system. Evaluation results on three real-world datasets show that the SRTD scheme significantly outperforms state-of-the-art truth discovery methods in both effectiveness and efficiency.

Existing System:

Current truth discovery solutions do not fully address the “misinformation spread” problem, where a significant number of sources spread false information on social media. Many current truth discovery algorithms also depend heavily on accurate estimates of source reliability, which often require a reasonably dense dataset. However, “data sparsity”, or the “long-tail phenomenon” [44], is commonly observed in real-world applications. Finally, existing truth discovery solutions have not fully explored the scalability aspect of the problem [12], even though social sensing applications often generate large amounts of data during important events.
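To make the data sparsity point concrete, the following minimal sketch measures how long-tailed a dataset is by computing the fraction of sources that contribute to exactly one claim. The function name `single_claim_fraction` and the `(source_id, claim_id)` report format are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def single_claim_fraction(reports):
    """Return the fraction of sources that contribute to exactly one claim.

    `reports` is an iterable of (source_id, claim_id) pairs; this input
    format is a hypothetical choice for the example.
    """
    claims_by_source = defaultdict(set)
    for source, claim in reports:
        claims_by_source[source].add(claim)
    if not claims_by_source:
        return 0.0
    singles = sum(1 for claims in claims_by_source.values() if len(claims) == 1)
    return singles / len(claims_by_source)


# Toy data: three sources report once, one source reports on three claims.
toy_reports = [
    ("s1", "c1"), ("s2", "c1"), ("s3", "c2"),
    ("s4", "c1"), ("s4", "c2"), ("s4", "c3"),
]
print(single_claim_fraction(toy_reports))  # 0.75
```

A value close to 1.0 means that most sources have almost no history, so any per-source reliability estimate rests on very little evidence.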

Proposed System:

In this paper, we develop a Scalable and Robust Truth Discovery (SRTD) scheme to address the misinformation spread, data sparsity, and scalability challenges in big data social media sensing applications. To address misinformation spread, the SRTD scheme explicitly models behaviors that sources exhibit, such as copying/forwarding, self-correction, and spamming. To address data sparsity, the SRTD scheme employs a novel algorithm that estimates claim truthfulness from both a credibility analysis of the claim's content and the historical contributions of the sources that report the claim. To address scalability, we develop a lightweight distributed framework using Work Queue [4] and HTCondor [27], which together form a system that is shown to be both scalable and efficient in solving the truth discovery problem. We evaluate the SRTD scheme against state-of-the-art baselines on three real-world datasets collected from Twitter during recent events (Dallas Shooting in 2016, Charlie Hebdo Attack in 2015, and Boston Bombing in 2013). The evaluation results show that the SRTD scheme outperforms state-of-the-art truth discovery schemes: it accurately identifies truthful information in the presence of widespread misinformation and sparse data, and it significantly improves computational efficiency.
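The general idea of jointly estimating source reliability and claim truthfulness can be sketched as a simple iterative procedure. This is a generic illustration, not the SRTD algorithm itself: the function `estimate_truth`, the report tuple `(source, claim, attitude, weight)`, the smoothing toward a prior, and the toy data are all assumptions. Here `attitude` is +1 when a source asserts a claim and -1 when it denies it, and `weight` is a hypothetical stand-in for a content credibility score (for example, lowered for forwarded copies).

```python
from collections import defaultdict


def estimate_truth(reports, iterations=10, prior=0.5):
    """Alternate between scoring claims and updating source reliability.

    reports: list of (source, claim, attitude, weight) tuples, where
    attitude is +1 (asserts) or -1 (denies) and weight is in [0, 1].
    All of these conventions are assumptions made for this sketch.
    """
    reliability = defaultdict(lambda: prior)
    truth = {}
    for _ in range(iterations):
        # Step 1: each report votes with reliability * weight * attitude.
        score = defaultdict(float)
        for source, claim, attitude, weight in reports:
            score[claim] += reliability[source] * weight * attitude
        truth = {claim: s > 0 for claim, s in score.items()}

        # Step 2: reliability is the weighted share of a source's reports
        # that agree with the current estimates, smoothed with one
        # pseudo-report at the prior so sparse sources stay near it.
        agree = defaultdict(float)
        total = defaultdict(float)
        for source, claim, attitude, weight in reports:
            total[source] += weight
            if (attitude > 0) == truth[claim]:
                agree[source] += weight
        for source in total:
            reliability[source] = (agree[source] + prior) / (total[source] + 1.0)
    return truth, dict(reliability)


# Toy data: "c2" is pushed by one source plus three low-weight copies.
toy_reports = [
    ("a", "c1", +1, 1.0), ("b", "c1", +1, 1.0), ("c", "c1", -1, 1.0),
    ("a", "c2", -1, 1.0), ("b", "c2", -1, 1.0), ("c", "c2", +1, 1.0),
    ("d", "c2", +1, 0.2), ("e", "c2", +1, 0.2), ("f", "c2", +1, 0.2),
]
truth, reliability = estimate_truth(toy_reports)
print(truth)  # {'c1': True, 'c2': False}
```

In this toy example the down-weighted copies cannot outvote the two sources that report consistently. If every weight is set to 1.0 instead, the copies win the vote and "c2" is marked true, which shows why a plain reliability-weighted vote is fragile when misinformation is widely forwarded.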

Conclusion:

In this paper, we proposed a Scalable and Robust Truth Discovery (SRTD) framework to address the data veracity challenge in big data social media sensing applications. Our solution explicitly considers source reliability, report credibility, and a source’s historical behaviors to address the misinformation spread and data sparsity challenges in the truth discovery problem. We also designed and implemented a distributed framework using Work Queue and the HTCondor system to address the scalability challenge. We evaluated the SRTD scheme using three real-world data traces collected from Twitter. The empirical results show that our solution achieves significant gains in both truth discovery accuracy and computational efficiency compared to other state-of-the-art baselines. These results are important because they provide a scalable and robust approach to the truth discovery problem in big data social media sensing applications where data is noisy, unvetted, and sparse.

References:

[1] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516, 2008.

[2] S. Bhuta and U. Doshi. A review of techniques for sentiment analysis of Twitter data. In Proc. Int. Issues and Challenges in Intelligent Computing Techniques (ICICT) Conf., pages 583–591, Feb. 2014.

[3] J. Bian, Y. Yang, H. Zhang, and T.-S. Chua. Multimedia summarization for social events in microblog stream. IEEE Transactions on Multimedia, 17(2):216–228, 2015.

[4] P. Bui, D. Rajan, B. Abdul-Wahid, J. Izaguirre, and D. Thain. Work Queue + Python: A framework for scalable scientific ensemble applications. In Workshop on Python for High Performance and Scientific Computing at SC11, 2011.

[5] P.-T. Chen, F. Chen, and Z. Qian. Road traffic congestion monitoring in social media with hinge-loss Markov random fields. In Data Mining (ICDM), 2014 IEEE International Conference on, pages 80–89. IEEE, 2014.

[6] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. In Proceedings of the VLDB Endowment, pages 550–561, 2009.

[7] X. X. et al. Towards confidence in the truth: A bootstrapping based truth discovery approach. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’16, 2016.

[8] R. Farkas, V. Vincze, G. Mora, J. Csirik, and G. Szarvas. The CoNLL-2010 shared task: Learning to detect hedges and their scope in natural language text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, 2010.

[9] R. Feldman and M. Taqqu. A practical guide to heavy tails: statistical techniques and applications. Springer Science & Business Media, 1998.

[10] A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In Proc. of the ACM International Conference on Web Search and Data Mining (WSDM’10), pages 131–140, 2010.

[11] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian data analysis, volume 2. CRC Press, Boca Raton, FL, 2014.

[12] C. Huang, D. Wang, and N. Chawla. Scalable uncertainty-aware truth discovery in big data social sensing applications for cyber-physical systems. IEEE Transactions on Big Data, 2017.

[13] A. Karandikar. Clustering short status messages: A topic model based approach. PhD thesis, University of Maryland, Baltimore County, 2010.

[14] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins. The web as a graph: Measurements, models, and methods. In International Computing and Combinatorics Conference, pages 1–17. Springer, 1999.

[15] K. Lee, J. Caverlee, and S. Webb. Uncovering social spammers: social honeypots + machine learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 435–442. ACM, 2010.