The analysis of Canada's health through social media using machine learning
Abstract
Real-time online data processing is quickly becoming an essential tool in the
analysis of social media for political trends, advertising, public health awareness
programs and policy making. Traditionally, processes associated with offline analysis
are productive and efficient only when the data collection is a one-time process.
Current cutting-edge research, however, requires real-time data analysis, which comes
with its own set of challenges, particularly the efficiency of continuous data fetching
in the context of current NoSQL and relational databases. In this thesis, we
demonstrate a solution that effectively addresses the challenges of real-time analysis
using a configurable Elasticsearch search engine. We use a distributed
database architecture and pre-built indexing, and we standardize the Elasticsearch framework
for large-scale text mining. The results from the Elasticsearch engine are visualized
in near real time.
We apply our real-time data processing solution to social media
to conduct a large-scale health analysis in Canada.
Social media is a crucial source of information on a variety of topics
such as health, food, feedback on products, and many others. At present, people
use social media to share their daily lifestyles, for example, where they are
going, what exercise they are doing, or what they are eating. By analyzing the
information collected from these individuals, the health of the population can be
gauged. This analysis can become an integral part of the government’s efforts to
study the health of people on a large scale. Public health is becoming
the primary concern for many governments around the world, and they
believe it is necessary to analyze the present situation within the population before
creating any new policies. Traditionally, governments rely on door-to-door surveys,
such as a census, or on hospital information to decide their health policies.
This information is limited and sometimes takes a long time to collect and analyze
well enough to aid decision making. Our approach addresses these
problems using advances in natural language processing algorithms and
large-scale data analysis. Results show that the proposed method provides a solution
in less time and with the same accuracy as the traditional approach.