The rapid growth of the web poses grand challenges for making sense of big web data:
large volume, high velocity, high variety, and unknown veracity.
In the physical world, a sensor is a converter that measures a physical quantity and converts
it into a signal that can be read by an observer or by an instrument—today, mostly electronic.
The WebSensor project aims to provide a platform on which both developers
and end users can easily build "web sensors": programmable "focused crawlers" that
continuously discover, extract, and aggregate structured information around a topic.
We developed a WebSensor platform based on Windows PowerShell and the .NET Framework.
It enables developers to easily create WebSensors that continuously extract information from the web
and generate time-series stream data.
The WebSensor platform has many built-in capabilities for extracting and collecting time-sequenced
data embedded in websites. These built-in capabilities include:
- Convenient wrapper generation for web pages (in just a few clicks)
- Automatic wrapper adaptation to page layout changes
- Easy configuration and execution
- Easy extension using a simple scripting language
- Easy management and retrieval of the collected data
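The idea of a wrapper that turns page snapshots into a time series can be illustrated with a minimal sketch. It is written in Python rather than PowerShell, and it is not the platform's actual API: the `WebSensor` class, the `observe` method, the regular-expression wrapper, and the sample HTML snapshots are all hypothetical and made up for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Reading:
    """One time-stamped value extracted from a page snapshot."""
    timestamp: datetime
    value: float


class WebSensor:
    """Hypothetical sensor: applies a wrapper (here, a regex) to page snapshots."""

    def __init__(self, topic: str, pattern: str) -> None:
        self.topic = topic
        self.wrapper = re.compile(pattern)
        self.series: List[Reading] = []

    def observe(self, html: str, when: datetime) -> None:
        match = self.wrapper.search(html)
        if match is None:
            # The wrapper no longer fits the page; this sketch just skips the
            # snapshot, whereas a real system would try to adapt the wrapper.
            return
        self.series.append(Reading(when, float(match.group(1))))


# Made-up snapshots of one page on three consecutive days.
snapshots = [
    (datetime(2012, 10, 1), '<div><span class="price">48.5</span></div>'),
    (datetime(2012, 10, 2), '<div><span class="price">49.0</span></div>'),
    (datetime(2012, 10, 3), '<div><p>page under maintenance</p></div>'),
]

sensor = WebSensor("price", r'<span class="price">([\d.]+)</span>')
for when, html in snapshots:
    sensor.observe(html, when)

values = [reading.value for reading in sensor.series]
```

In this sketch the third snapshot yields no reading because the wrapper does not match, which is the situation that automatic wrapper adaptation is meant to handle.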
WebSensors can be connected to form a sensor network for more complex analysis tasks
that involve multiple time-sequenced data streams.
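One simple way to combine the output of several sensors is to align their series on shared timestamps. The sketch below assumes each sensor's output is a mapping from date to value; the `join_series` helper and the sample values are hypothetical and do not describe how the platform itself links sensors.

```python
from datetime import date
from typing import Dict, List, Tuple


def join_series(
    a: Dict[date, float], b: Dict[date, float]
) -> List[Tuple[date, float, float]]:
    """Align two series on the dates they share, in chronological order."""
    shared = sorted(a.keys() & b.keys())
    return [(day, a[day], b[day]) for day in shared]


# Made-up output of two sensors with partially overlapping dates.
sensor_a = {date(2012, 10, 1): 1.0, date(2012, 10, 2): 2.0, date(2012, 10, 3): 3.0}
sensor_b = {date(2012, 10, 2): 20.0, date(2012, 10, 3): 30.0, date(2012, 10, 4): 40.0}

joined = join_series(sensor_a, sensor_b)
```

Only the dates present in both series survive the join, so a downstream analysis always sees paired values.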
We have used the WebSensor platform to gather large amounts of data related to "United States presidential election, 2012."
We analyzed the sentiment trends for "who will be the next president of the United States,"
and successfully predicted the election result.
The platform has also been used by Microsoft product teams,
such as the Bing News team and the Bing Relevance Measurement team.
Keywords: Big Data; Wrapper; Information Extraction; Time Series