Zhicheng Dou

QDMiner

A query facet is a set of items that describe and summarize one important aspect of a query. A facet item is typically a word or a phrase. A query may have multiple facets that summarize the information about the query from different perspectives. Table 1 shows sample facets for some queries. Facets for the query “watches” cover knowledge about watches in five distinct aspects: brands, gender categories, supporting features, styles, and colors. The query “visit Beijing” has a facet about popular resorts in Beijing (tiananmen square, forbidden city, summer palace, ...) and a facet on travel-related topics (attractions, shopping, dining, ...).

Query facets provide interesting and useful knowledge about a query and thus can be used to improve search experiences in many ways. First, we can display query facets together with the original search results in an appropriate way, so that users can understand some important aspects of a query without browsing tens of pages. For example, a user could learn about different brands and categories of watches. We can also implement faceted search based on the mined query facets. Users can clarify their specific intent by selecting facet items, and the search results can then be restricted to the documents that are relevant to those items. For example, a user looking for a gift for his wife could drill down to women’s watches. Multiple groups of query facets are particularly useful for vague or ambiguous queries, such as “apple”: we could show the products of Apple Inc. in one facet and different types of the fruit apple in another. Second, query facets may provide direct information or instant answers that users are seeking. For example, for the query “lost season 5”, all episode titles are shown in one facet and main actors are shown in another. In this case, displaying query facets could save browsing time. Third, query facets may also be used to improve the diversity of the ten blue links. We can re-rank search results to avoid showing pages that are near-duplicates in terms of query facets at the top. Query facets also contain structured knowledge covered by the query, and thus they can be used in other fields besides traditional web search, such as semantic search or entity search.

We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and that query facets can be mined by aggregating these significant lists. We propose a systematic solution, which we refer to as QDMiner, to automatically mine query facets by extracting and grouping frequent lists from free text, HTML tags, and repeat regions within the top search results.
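The intuition that important items repeat across lists in different documents can be illustrated with a toy sketch. This is not QDMiner's actual weighting or clustering; it only counts, for each item, how many distinct documents contain it in some list and keeps the items seen in at least `min_docs` documents. The function names, the `min_docs` threshold, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def item_document_frequency(lists_per_doc):
    """Count, for each item, the number of distinct documents whose lists contain it."""
    docs_by_item = defaultdict(set)
    for doc_id, lists in lists_per_doc.items():
        for lst in lists:
            for item in lst:
                # Normalize case and whitespace so "Rolex" and "rolex" match.
                docs_by_item[item.strip().lower()].add(doc_id)
    return {item: len(docs) for item, docs in docs_by_item.items()}


def frequent_items(lists_per_doc, min_docs=2):
    """Return items found in at least `min_docs` documents, most frequent first."""
    freq = item_document_frequency(lists_per_doc)
    kept = [item for item, n in freq.items() if n >= min_docs]
    return sorted(kept, key=lambda item: (-freq[item], item))


# Hypothetical lists extracted from three search results for "watches".
toy_lists = {
    "doc1": [["Cartier", "Rolex", "Omega"], ["men", "women"]],
    "doc2": [["rolex", "omega", "seiko"]],
    "doc3": [["Men", "Women", "kids"], ["omega", "rolex"]],
}

if __name__ == "__main__":
    print(frequent_items(toy_lists))
```

A real system would additionally have to group the surviving items into separate facets (brands versus gender categories in this toy data), which this sketch does not attempt.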

Datasets

UserQ
RandQ

Folder structure

Queries: contains the queries in the dataset. Only the first two columns matter: the first column is the query ID, and the second column is the query text. Please ignore the other columns.
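A minimal sketch of loading such a file is shown here. The column delimiter is not stated above, so a tab-separated layout is assumed; the function name `read_queries` and the two sample rows are hypothetical.

```python
import csv
import io


def read_queries(handle, delimiter="\t"):
    """Return {query_id: query_text} from the first two columns; ignore the rest."""
    queries = {}
    for row in csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        queries[row[0].strip()] = row[1].strip()
    return queries


# Hypothetical file content: ID, query, and an extra column that is ignored.
sample = io.StringIO("03\tmobile phones\textra\n8\tbravo tv\n")
queries = read_queries(sample)

if __name__ == "__main__":
    print(queries)
```

The IDs are kept as strings so that a value such as `03` is preserved exactly and can be matched against file names like ID.xml.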

Judgment: human-created query facets. Each query has a separate file named ID.xml. A sample of the schema of these files is shown below.

<QueryDimensions QueryId="03">
    <Query>mobile phones</Query>
    <Dimensions>
        <Dimension Description="mobile phone models" Rating="I like this" RatingValue="2">
            <Items>
                <Item>samsung galaxy s</Item>
                <Item>htc wildfire</Item>
                <Item>htc desire</Item>
                <Item>apple iphone 4</Item>
                <Item>apple iphone</Item>
                ...
            </Items>
        </Dimension>
        <Dimension Description="mobile phone manufacturers or brands" Rating="I like this" RatingValue="2">
            <Items>
                <Item>nokia</Item>
                <Item>blackberry</Item>
                <Item>samsung</Item>
                <Item>lg</Item>
                <Item>motorola</Item>
                <Item>sony ericsson</Item>
                ...
            </Items>
        </Dimension>
        ...
    </Dimensions>
</QueryDimensions>
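A judgment file of this shape can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the sample (the `...` placeholders are left out because they are not valid XML). The function name `parse_judgment` and the dictionary keys are illustrative choices, and `RatingValue` is assumed to always be an integer.

```python
import xml.etree.ElementTree as ET

JUDGMENT_XML = """<QueryDimensions QueryId="03">
    <Query>mobile phones</Query>
    <Dimensions>
        <Dimension Description="mobile phone models" Rating="I like this" RatingValue="2">
            <Items>
                <Item>samsung galaxy s</Item>
                <Item>htc wildfire</Item>
            </Items>
        </Dimension>
        <Dimension Description="mobile phone manufacturers or brands" Rating="I like this" RatingValue="2">
            <Items>
                <Item>nokia</Item>
                <Item>blackberry</Item>
            </Items>
        </Dimension>
    </Dimensions>
</QueryDimensions>"""


def parse_judgment(xml_text):
    """Parse one judgment document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    facets = []
    for dim in root.iterfind("./Dimensions/Dimension"):
        facets.append({
            "description": dim.get("Description"),
            "rating": dim.get("Rating"),
            "rating_value": int(dim.get("RatingValue")),  # assumed integer
            "items": [(item.text or "").strip() for item in dim.iterfind("./Items/Item")],
        })
    return {
        "query_id": root.get("QueryId"),
        "query": root.findtext("Query"),
        "facets": facets,
    }


judgment = parse_judgment(JUDGMENT_XML)

if __name__ == "__main__":
    for facet in judgment["facets"]:
        print(facet["description"], facet["items"])
```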

Facets: query facets generated by QDMiner, with clustering parameters $Dia_{max}=0.6$ and $W_{min}=3$. A sample file and the C# classes describing its schema are shown below.

<QueryDimensions QueryId="8">
    <Query>bravo tv</Query>
    <Dimensions>
        <Dimension Id="0" Weight="683.50">
            <Items>
                <Item Value="top chef" Selected="True" Weight="7.11" />
                <Item Value="project runway" Selected="True" Weight="6.09" />
                <Item Value="shear genius" Selected="True" Weight="3.75" />
                <Item Value="double exposure" Selected="True" Weight="3.73" />           
                ...    
            </Items>
        </Dimension>       
        <Dimension Id="1" Weight="200.36">
            <Items>
                <Item Value="movies" Selected="True" Weight="6.50" />
                <Item Value="music" Selected="True" Weight="6.43" />
                <Item Value="entertainment" Selected="True" Weight="6.34" />
                <Item Value="news" Selected="True" Weight="5.80" />
            </Items>
        </Dimension>
        ...
    </Dimensions>
</QueryDimensions>

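    using System.Collections.Generic;

    // A single facet item.
    public class DimensionItem
    {
        public string ItemString;
        public double Weight;
        public bool IsSelected;
    }

    // One facet (dimension) together with its items.
    public class DimensionCompact
    {
        public string Label;
        public int DimensionId;
        public double Weight;
        public List<DimensionItem> DimensionItems;
    }

    // All facets generated for one query.
    public class QueryDimensionsCompact
    {
        public string Query;
        public string QueryId;
        public List<DimensionCompact> Dimensions;
    }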
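For readers not working in C#, a rough Python counterpart of these classes can be filled directly from a Facets file. This is a sketch under assumptions: the mapping of the `Value`, `Selected`, and `Weight` attributes to the fields below is inferred from the sample, the `Label` field is omitted because the sample XML shows no such attribute, and the function name `parse_facets` is hypothetical. The embedded XML is a shortened copy of the sample without the `...` placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

FACETS_XML = """<QueryDimensions QueryId="8">
    <Query>bravo tv</Query>
    <Dimensions>
        <Dimension Id="0" Weight="683.50">
            <Items>
                <Item Value="top chef" Selected="True" Weight="7.11" />
                <Item Value="project runway" Selected="True" Weight="6.09" />
            </Items>
        </Dimension>
        <Dimension Id="1" Weight="200.36">
            <Items>
                <Item Value="movies" Selected="True" Weight="6.50" />
            </Items>
        </Dimension>
    </Dimensions>
</QueryDimensions>"""


@dataclass
class DimensionItem:
    item_string: str
    weight: float
    is_selected: bool


@dataclass
class DimensionCompact:
    dimension_id: int
    weight: float
    items: List[DimensionItem] = field(default_factory=list)


@dataclass
class QueryDimensionsCompact:
    query: str
    query_id: str
    dimensions: List[DimensionCompact] = field(default_factory=list)


def parse_facets(xml_text):
    """Build the dataclasses above from one Facets XML document."""
    root = ET.fromstring(xml_text)
    result = QueryDimensionsCompact(root.findtext("Query"), root.get("QueryId"))
    for dim in root.iterfind("./Dimensions/Dimension"):
        facet = DimensionCompact(int(dim.get("Id")), float(dim.get("Weight")))
        for item in dim.iterfind("./Items/Item"):
            facet.items.append(DimensionItem(
                item_string=item.get("Value"),
                weight=float(item.get("Weight")),
                is_selected=item.get("Selected", "").lower() == "true",
            ))
        result.dimensions.append(facet)
    return result


facets = parse_facets(FACETS_XML)

if __name__ == "__main__":
    for facet in facets.dimensions:
        print(facet.dimension_id, facet.weight, [i.item_string for i in facet.items])
```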

All: the complete data, including the original search results, the lists extracted from each result, the weighted lists, the list clustering output, and so on. Please email Zhicheng Dou (dou at ruc.edu.cn) for more details.

Demo

Publications

Zhicheng Dou. Finding Dimensions for Queries. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 1311-1320. ACM, 2011.
Zhicheng Dou, Zhengbao Jiang, Sha Hu, Ji-Rong Wen, and Ruihua Song. Automatically Mining Facets for Queries from Their Search Results. IEEE Trans. Knowl. Data Eng. (TKDE), 28(2):385-397, 2016.
Sha Hu, Zhicheng Dou, Xiaojie Wang, Tetsuya Sakai, and Ji-Rong Wen. Search Result Diversification Based on Hierarchical Intents. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM '15), pages 63-72. ACM, New York, NY, USA, 2015. DOI=http://dx.doi.org/10.1145/2806416.2806455