Zhicheng Dou

WebSensor
With the rapid growth of the web, making sense of big web data poses major challenges: large volume, high velocity, high variety, and unknown veracity. In the physical world, a sensor is a converter that measures a physical quantity and converts it into a signal that can be read by an observer or by an instrument (today, mostly electronic). The WebSensor project aims to provide a platform on which both developers and end users can easily build "web sensors": programmable "focused crawlers" that continuously discover, extract, and aggregate structured information around a topic.

We developed a WebSensor platform based on Windows PowerShell and the .NET Framework. It enables developers to easily create WebSensors that continuously extract information from the web and generate time-series stream data. The WebSensor platform has many built-in capabilities for extracting and collecting time-sequenced data embedded in web sites. These built-in capabilities include:
- Convenient wrapper generation for web pages (with just a few clicks)
- Automatic wrapper adaptation to page layout changes
- Easy to configure and run
- Easy to extend using a simple scripting language
- Easy to manage and retrieve the data collected
WebSensors can be connected to form a sensor network for more complex analysis tasks that involve multiple time-sequenced data streams.
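The idea of a web sensor that turns a page into a time series can be illustrated with a minimal sketch. This is not the WebSensor platform's API (the platform itself is built on PowerShell and .NET); the `WebSensor` class, the `Reading` record, the `poll` method, and the regular-expression wrapper below are all hypothetical names chosen for illustration, and a canned page stands in for a real HTTP fetch.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class Reading:
    """One time-stamped observation emitted by a sensor."""
    timestamp: datetime
    value: float


class WebSensor:
    """Hypothetical sensor: fetch a page, apply a wrapper, record a reading."""

    def __init__(self, fetch: Callable[[], str], pattern: str):
        self.fetch = fetch                  # returns the page HTML as a string
        self.wrapper = re.compile(pattern)  # regex with one capturing group
        self.series: List[Reading] = []     # the time series collected so far

    def poll(self, now: Optional[datetime] = None) -> Optional[Reading]:
        """Run one extraction cycle; return None if the wrapper does not match."""
        match = self.wrapper.search(self.fetch())
        if match is None:
            return None  # e.g. the page layout changed and the wrapper broke
        reading = Reading(now or datetime.now(timezone.utc), float(match.group(1)))
        self.series.append(reading)
        return reading


# Usage with canned pages in place of a real HTTP fetch (assumed markup).
pages = iter(['<span id="price">12.5</span>', '<span id="price">13.0</span>'])
sensor = WebSensor(lambda: next(pages), r'<span id="price">([\d.]+)</span>')
sensor.poll()
sensor.poll()
values = [r.value for r in sensor.series]
```

In a real deployment `poll` would be called on a schedule, and a `None` result would be the signal that the wrapper needs to be adapted to a new page layout.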

We have used the WebSensor platform to gather huge amounts of data related to the "United States presidential election, 2012." We analyzed the sentiment trends for "who will be the next president of the United States," and successfully predicted the election results. The platform has also been used by Microsoft product teams, such as the Bing News team and the Bing Relevance Measurement team.

Keywords: Big Data; Wrapper; Information Extraction; Time Series
Project Q: Query-Centric Search

Search has a long document-centric tradition, in which "searching information" is equivalent to "searching documents". In Project Q, we explore a new query-centric search paradigm, which treats the query as an object and shifts search from "searching documents" to "searching queries". We have developed several effective query mining technologies and shown that deeply mining queries "without" time constraints can greatly improve search relevance and user experience. The concept of "query-centric search" has been widely accepted by the Microsoft Bing search ranking team, and it brings them new opportunities for improving ranking quality.
User Intent Understanding, Personalized Search, and Search Diversification
Studies show that the vast majority of queries to search engines are short and vague in specifying a user’s intent. Different users may have completely different information needs and goals when issuing exactly the same query. For example, User A may be looking for information about Apple the company by issuing the query "apple", while User B is looking for information about the fruit using the same query. When such a query is issued, search engines return a list of documents that mixes different topics, and it takes time for a user to find the information he or she wants. We studied two effective solutions to this problem:
- Personalized Search: it provides different search results based upon the preferences of users
- Search Result Diversification: it provides a list of results that cover as many aspects as possible, so that most users can be satisfied by the top results
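As a rough illustration of search result diversification, the sketch below re-ranks candidates greedily in the style of Maximal Marginal Relevance (MMR). This is a well-known generic technique, not necessarily the method developed in this work. The `diversify` function, the relevance scores, the aspect labels, and the document identifiers are all made-up assumptions for the "apple" example.

```python
from typing import List, Tuple

# Each candidate is (doc_id, relevance score, aspect label); all values are illustrative.
Candidate = Tuple[str, float, str]


def diversify(candidates: List[Candidate], k: int, lam: float = 0.5) -> List[str]:
    """Greedy MMR-style re-ranking.

    At each step, pick the candidate maximizing
        lam * relevance - (1 - lam) * redundancy,
    where redundancy is 1.0 if its aspect is already covered by a selected
    result and 0.0 otherwise.
    """
    remaining = list(candidates)
    selected: List[Candidate] = []

    def score(c: Candidate) -> float:
        _, relevance, aspect = c
        redundancy = 1.0 if any(aspect == s[2] for s in selected) else 0.0
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]


results = [
    ("apple.com", 0.9, "company"),
    ("apple-stock", 0.8, "company"),
    ("apple-nutrition", 0.6, "fruit"),
    ("apple-recipes", 0.5, "fruit"),
]
top = diversify(results, k=3)
# top == ["apple.com", "apple-nutrition", "apple-stock"]
```

With `lam = 0.5`, the second slot goes to a fruit result even though a second company result has higher relevance, so both interpretations of the query appear near the top.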
Deep Web Site and Web Page Understanding
Web pages are the main sources of information on the Web. Without a deep understanding of web pages, we cannot extract rich information from the Web. We take advantage of text, HTML DOM structure, and associated visual features, such as the font size, width, and height of a DOM element, to understand the information contained in web pages. We explored several problems related to this, such as:
- Website template generation and wrapper induction
- Repeated information (in text pattern or layout structure) detection and extraction
- Change pattern summarization and classification by analyzing multiple versions of a page
- Main content extraction from news pages
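To make the last item concrete, here is a minimal sketch of main content extraction using a simple link-density heuristic: blocks that are short or consist mostly of link text are treated as navigation or footer boilerplate. This is a generic baseline, not the method developed in this work, and it uses only text and DOM tags (no visual features). The names `BlockCollector` and `extract_main_content`, the thresholds, and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import List


class BlockCollector(HTMLParser):
    """Collects text per block-level element and counts how much of it is link text."""

    BLOCK_TAGS = {"p", "div", "li", "td", "article", "section"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[dict] = []  # every block seen, in document order
        self.open: List[dict] = []    # stack of currently open blocks
        self.link_depth = 0           # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            block = {"text": "", "link_chars": 0}
            self.blocks.append(block)
            self.open.append(block)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self.open:
            self.open.pop()
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open:
            return
        block = self.open[-1]  # attribute text to the innermost open block only
        block["text"] += (" " if block["text"] else "") + text
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_content(html: str, min_chars: int = 40,
                         max_link_density: float = 0.3) -> str:
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        length = len(block["text"])
        if length < min_chars:
            continue
        if block["link_chars"] / length > max_link_density:
            continue
        kept.append(block["text"])
    return "\n".join(kept)


page = """
<div id="nav"><a href="/">Home</a> <a href="/world">World</a> <a href="/sports">Sports</a></div>
<div id="story">
  <p>The city council approved the new transit plan on Tuesday after a long debate.</p>
  <p>Construction is expected to begin next spring, according to <a href="/x">officials</a>.</p>
</div>
<div id="footer"><a href="/about">About us</a> <a href="/terms">Terms of service and privacy</a></div>
"""
main_text = extract_main_content(page)
```

On this sample page the two story paragraphs are kept and the navigation and footer blocks are dropped. Real news pages would need more robust signals, such as the visual features mentioned above.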
WebStudio: Experimental Search Platform

WebStudio is an end-to-end experimental search system for facilitating search experiments on specific web data collections. In WebStudio, some default components are implemented. Users can customize major operations (including document parsing, page classification, index building, index serving, and front-end processing) in the E2E search engine, by adding their own experimental logic for testing ideas.
Web Search Ranking

Web search ranking quality is one of the most important elements for the success of a search engine. We collaborated closely with the Bing ranking team on developing effective technologies for improving the ranking quality of the Bing search engine.

Zhicheng Dou

Professor, Renmin University of China

Projects

Search Result Diversification

Query Facet/Dimension Mining

Internet Analysis Engine (Current Affairs Probe)

Past Projects