A data warehouse-oriented methodology for qualitative semi-structured web information and social networking sites' user status search
Abstract
Finding the most desired and useful information among the diverse content embedded
on webpages has become more challenging due to the rapid growth of websites and webpages,
dynamic changes and updates of information and content on webpages, the lack of well-formed
structure of webpage content and so on. Information search becomes a non-trivial task when the
plain text, hyperlink text, embedded images and videos that make up webpage content remain in
semi-structured form. Semi-structured webpage content does not have a predefined structure and remains
in the hierarchically nested HTML tags of a webpage body. Unlike structured webpage content,
heterogeneous semi-structured webpage content cannot be neatly formatted, organized and modeled
directly into a relational database. One of the most important information types on the web is web
users' emotion, expressed in user-posted statuses on social networking sites such as Facebook and Twitter.
A publicly posted user status is informative enough to reveal the user's daily thoughts, feelings and
emotions through textual self-description. The data warehouse-oriented methodology of semi-structured
webpage content extraction and modeling into a database introduces a simplified, less
labor-intensive XML-based semi-structured webpage content extraction technique. It overcomes the
limitations of existing pre-defined specification file and wrapper-based techniques in adapting to rapid
changes of webpage content and in extracting the same piece of information from different webpages
that have different nested HTML structures. This methodology also introduces a Multidimensional Fact
data modeling technique for storing semi-structured webpage content in a relational database.
Our implemented methodology ensures qualitative search results, in that hyperlinks to the most
desired webpages appear first, within a small fraction of a minute. [...]