Zhicheng Dou

A Dataset for Hierarchical Search Result Diversification

Sha Hu, sallyshahu@gmail.com
Zhicheng Dou, dou*AT*ruc*dot*edu*dot*cn
Renmin University of China

Overview

Welcome to Hierarchical Diversification. Our goal is to satisfy hierarchical user intents behind a web search query by generating diversified search rankings.

This page provides experimental data about hierarchical diversification, including TREC datasets, hierarchical intents, and experimental runs of state-of-the-art diversification algorithms and hierarchical diversification algorithms.

More details about hierarchical diversification can be found in this paper:
Search Result Diversification based on Hierarchical Intents, by Sha Hu, Zhicheng Dou, Xiaojie Wang, Tetsuya Sakai, and Ji-Rong Wen. In processing of CIKM 2015. [1]

Query Topics

The query topics come from four topic sets from TREC2009, TREC2010, TREC2011, and TREC2012, provided by TREC Web Track. Every topic set contains 50 topics, each of which includes three to eight subtopics with manually labeled relevance judgments. Topics 95 and 100 in the TREC 2010 topic set were removed as they lack diversity relevance judgments. The four datasets are merged into one and named as TREC data.

Hierarchical Intents

For each topic (query) in TREC, its two-level hierarchical intents are extracted by query suggestions from Google search engine.

For each topic, its query suggestions from Google are collected as the first-level subtopics. Next, each first-level subtopic is sent as a query to Google and its query suggestions are retrieved as the second-level subtopics. There are 1,696 first-level subtopics and 10,527 second-level subtopics collected for 194 queries.

We provide extracted subtopics in different formats:
two-level hierarchical subtopics are used for hierarchical diversification algorithms;
flat-list subtopics, including only first-level subtopics, only second-level subtopics, and all subtopics (merged by the obtained suggestions in two levels), are used for traditional diversification algorithms with subtopics in a flat list. (download)

Two-level hierarchical subtopics
Only first-level subtopics
Only second-level subtopics
All subtopics

=================================================================
<topic number=1>
    <query> obama family tree </query>
    <subtopic1 number=1>
	<suggestion> obama family tree bush </description>
	<subtopic2 number=1> barack obama family tree george bush </subtopic2>
	<subtopic2 number=2> bush cheney obama family tree </subtopic2>
	<subtopic2 number=3> bush clinton obama family tree </subtopic2>
    </subtopic1>
    <subtopic1 number=2> 
	<suggestion> obama family tree pictures </description>
	<subtopic2 number=1> barack obama family tree pictures </subtopic2>
	<subtopic2 number=2> barack obama's family tree photo </subtopic2>
    </subtopic1>
    ...
</topic>
=================================================================

Document Collection

The ClueWeb09 dataset is used for TREC data, following the official definitions of TREC Web Track Diversity Task. For each topic, top 1,000 documents are retrieved using the batch search service provided by Lemur Project. The documents are filtered by Waterloo Spam Filter with minimal spam score 70. The filtered documents are viewed as the initial non-diversified ranking results in diversification.

Non-diversified Baseline for TREC09-12 (download)

=================================================================
1 Q0 clueweb09-en0010-57-32937 1 -3.83334 indri
1 Q0 clueweb09-en0011-02-12744 2 -3.87516 indri
1 Q0 clueweb09-en0010-57-32259 3 -4.04239 indri
1 Q0 clueweb09-en0001-02-21397 5 -4.6418 indri
=================================================================
Note: the first column is the topic number; the second column is currently unused; 
the third column is the official document identifier of the retrieved document; 
the fourth column is the rank the document is retrieved, and the fifth column shows the score; 
the sixth column is the name of the result run.

Diversification Algorithms

We compare our hierarchical diversification models, HxQuAD and HPM2, with state-of-the-art diversification algorithms, including xQuAD, PM2, TxQuAD, TPM2, and ConceptH.

xQuAD [2] considers both the relevance between the document and the query, and the topic diversity of the document among the selected documents, which is explicitly estimated by calculating the coverage of matched subtopics for the document.

HxQuAD [1] extends xQuAD to hierarchical subtopics by redefining a multi-level diversity function. Besides balancing the relevance and the diversity, HxQuAD controls the impact of subtopics at different depth.

PM2 [3] promotes result diversities by two processes: finding the best subtopic by unsatisfied subtopic proportionality, and then choosing the best document based on the selected subtopic.

HPM2 [1] adapts PM2 to hierarchical subtopics. HPM2 selects one best subtopic for each level of the hierarchical subtopic tree, and combine them together to find the best document. It considers the impact of subtopics from different levels.

TxQuAD and TPM2 are term level diversification models [4], which split the original subtopics (used by xQuAD and PM2) into terms and use these terms as their subtopics.

ConceptH [5] exploits concept hierarchies to extract query subtopics in a flat list, utilizes hierarchical relations of subtopics to propose a structural similarity function for subtopics, and incorporates this function into the traditional xQuAD framework. This work was done in enterprise search domain, as it is hard to build high-quality concept hierarchies in web search. We directly take our subtopic hierarchy as the concept hierarchy in ConceptH.

Note1: All the traditional methods have a parameter to tune. Our hierarchical models have two parameters to tune. We use a 5-fold cross validation to tune these parameters in terms of ERR-IA@20 on TREC09-12 data.

Note2: All the experiment results are evaluated by official evaluation metrics at TREC Web Track: ERR-IA [6], alpha-NDCG [7], and NRBP [8]; and the official evaluation metric of NTCIR IMine task: D#-NDCG [9].

Experimental Runs

We experiment our hierarchical models and upper traditional models on TREC data.
Their input subtopics are different according to their default settings:

xQuAD and PM2: default subtopics are Google suggestions (first-level subtopics)

TxQuAD and TPM2: default subtopics are the splitted terms of first-level subtopics.

ConceptH: input concept hierarchies are our subtopic hierarchies

HxQuAD and HPM2: their subtopics are the hierarchical subtopics.

The experimental runs and evaluated results for all the algorithms with default settings can be download here (download). See Subsection 5.1 in [1].

In addition, we compare our hierarchical models with their corresponding models by using different levels of subtopics (See Subsection 5.2 in [1]). Their runs and results are:

HxQuAD and its corresponding models (download)

HPM2 and its corresponding models (download)

Note: suffixes _1st, _2nd, _all, and _h denote using first-level subtopics, second-level subtopics, all subtopics, and hierarchical subtopics for a model.

=================================================================
runid 	topic	ERR-IA@20   alpha-nDCG@20   NRBP   D#-nDCG@20
HxQuAD_h  1	0.240449     0.330604     0.250000     0.37814198
HxQuAD_h  2	0.000000     0.000000     0.000000     0.00000000
HxQuAD_h  3	0.040075     0.101052     0.007813     0.18449713
...
HxQuAD_h  200	0.367220     0.533428     0.285989     0.48345266
HxQuAD_h  amean	0.320628     0.422897     0.284462     0.43783065
=================================================================
Note: the first column is the topic number; the second column is currently unused; 
the third column is the official document identifier of the retrieved document; 
the fourth column is the rank the document is retrieved, and the fifth column shows the score; 
the sixth column is the name of the result run.

Released Data

Subtopics extracted from Goolge Suggestions: GSSubtopics.zip
Non-diversified Baseline on TREC09-12: Non-Diversified-Baseline.txt
Overall results by using default subtopics: OverallResult.zip
HxQuAD, xQuAD, and TxQuAD's results by using different subtopics: HxQuADs.zip
HPM2, PM2, and TPM2's results by using different subtopics: HPM2s.zip

If you use this dataset for research, please kindly cite the following paper:

S. Hu, Z. Dou, X. Wang, T. Sakai, and J. Wen. Search result diversification based on hierarchical intents. In CIKM, 2015. [pdf]

@inproceedings{Sha:15CIKM:HDiv,
 author = {Hu, Sha and Dou, Zhicheng and Wang, Xiaojie and Sakai, Tetsuya and Wen, Ji-Rong},
 title = {Search Result Diversification based on Hierarchical Intents},
 booktitle = {Proceedings of the 24th CIKM},
 year = {2015},
 }

References

R. L. Santos, C. Macdonald, and I. Ounis. Exploiting query reformulations for web search result diversification. In WWW, pages 881-890, 2010.
V. Dang and W. B. Croft. Diversity by proportionality: An election-based approach to search result diversification. In SIGIR, pages 65-74, 2012.
V. Dang and W. B. Croft. Term level search result diversification. In SIGIR, pages 603-612, 2013.
W. Zheng, H. Fang, and C. Yao. Exploiting concept hierarchy for result diversification. In CIKM, 2012.
O. Chapelle, D. Metlzer, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM, pages 621-630, 2009.
C.L. Clarke, M. Kolla, G.V. Cormack, O. Vechtomova, A. Ashkan, S. Buttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In SIGIR, pages 659-666, 2008.
C. L. Clarke, M. Kolla, and O. Vechtomova. An effectiveness measure for ambiguous and underspecified queries. In ICTIR, pages 188-199, 2009.
T. Sakai and R. Song. Evaluating diversified search results using per-intent graded relevance. In SIGIR, pages 1043-1052, 2011.