Learning to Rank for Web Search

I am currently applying statistical learning techniques to improve the ranking accuracy of search engines. The work covers four subtopics:
(1) Feature selection
A web search ranker has hundreds of features available, such as term frequency, document frequency, hyperlink information, static rank, and BM25 score. Some features add noise, and some are highly correlated with one another. We therefore pre-process the features to reduce both noise and computational cost, using feature selection to drop the noisy ones. Many feature selection algorithms exist in the literature. We will study how well these algorithms perform on ranking, and we will also propose methods tailored to web search.
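
To make the idea concrete, here is a minimal sketch of one simple filter-style selector. It is not the method studied in this project. It assumes that "noisy" can be approximated by a low correlation with the relevance label and "redundant" by a high correlation with a feature already kept. The function names, thresholds, and toy data are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(features, labels, min_relevance=0.1, max_redundancy=0.9):
    """Greedy filter: features is a dict mapping name -> list of values.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    # Score each feature by its absolute correlation with the relevance label.
    relevance = {name: abs(pearson(col, labels)) for name, col in features.items()}
    ranked = sorted(features, key=lambda name: relevance[name], reverse=True)

    kept = []
    for name in ranked:
        # Drop features that carry almost no signal about relevance.
        if relevance[name] < min_relevance:
            continue
        # Drop features that nearly duplicate a feature we already kept.
        if all(abs(pearson(features[name], features[k])) < max_redundancy for k in kept):
            kept.append(name)
    return kept


# Toy data: six documents with graded relevance labels (hypothetical values).
labels = [0, 1, 2, 1, 0, 2]
features = {
    "tf": [1, 3, 5, 3, 1, 6],
    "bm25": [2, 6, 10, 6, 2, 11],      # nearly a rescaled copy of tf
    "static_rank": [3, 1, 4, 2, 1, 4],
    "noise": [4, 1, 4, 9, 4, 4],       # uncorrelated with the labels
}
kept = select_features(features, labels)
print(kept)
```

On this toy data the filter keeps one of the two near-duplicate text features and discards the uncorrelated one. A real study would compare such filters against wrapper and embedded methods using ranking metrics rather than correlation.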

(2) Revising the loss function
Many state-of-the-art machine learning algorithms have been applied to web search, including RankSVM, RankBoost, and RankNet. We are interested in how to revise the loss functions of these algorithms to suit web search.
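
As background for what "revising the loss" would start from, the sketch below shows the standard pairwise surrogate losses behind these three algorithms: hinge loss (RankSVM), exponential loss (RankBoost), and logistic cross-entropy loss (RankNet). It shows only the unmodified losses, not any revision proposed here. The helper name `pairwise_loss` and the toy scores are hypothetical.

```python
import math


def hinge_loss(margin):
    """RankSVM-style loss on the score difference of a preference pair."""
    return max(0.0, 1.0 - margin)


def exponential_loss(margin):
    """RankBoost-style loss on the score difference of a preference pair."""
    return math.exp(-margin)


def logistic_loss(margin):
    """RankNet-style cross-entropy loss when the target preference is certain."""
    return math.log(1.0 + math.exp(-margin))


def pairwise_loss(scores, labels, loss):
    """Average a pairwise loss over all pairs (i, j) where i is more relevant than j."""
    total = 0.0
    pairs = 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                # margin > 0 means the model orders this pair correctly.
                total += loss(scores[i] - scores[j])
                pairs += 1
    return total / pairs if pairs else 0.0


# Three documents for one query; a higher label means more relevant.
labels = [2, 1, 0]
good_scores = [2.0, 1.0, 0.0]   # agrees with the labels
bad_scores = [0.0, 1.0, 2.0]    # reverses the ideal order

for loss in (hinge_loss, exponential_loss, logistic_loss):
    print(loss.__name__,
          pairwise_loss(good_scores, labels, loss),
          pairwise_loss(bad_scores, labels, loss))
```

All three losses treat every misordered pair alike regardless of where it occurs in the ranked list. That is one property a loss adapted to web search might change, since users mostly look at the top results.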

(3) Defining a new ranking function
The ranking function of most existing learning algorithms operates on a single document. We will study multivariate ranking functions, which take several documents as input.

(4) Learning from dependent samples
A basic assumption of statistical learning theory is that instances are independent and identically distributed. Unlike in traditional learning problems, the document-level samples in web search are interdependent: the rank position of a document depends not only on its own relevance score but also on the relevance scores of the other documents. How to learn from interdependent samples is a new research topic for machine learning.
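
The dependence described above can be seen in a few lines. This is a toy illustration with made-up scores, and `rank_positions` is a hypothetical helper: the first document keeps the same score in both calls, yet its rank position changes with the scores of the competing documents.

```python
def rank_positions(scores):
    """Return the 1-based rank position of each document, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positions = [0] * len(scores)
    for position, doc_index in enumerate(order, start=1):
        positions[doc_index] = position
    return positions


# Document 0 scores 0.7 in both result lists; only its competitors differ.
weak_competitors = rank_positions([0.7, 0.5, 0.2])
strong_competitors = rank_positions([0.7, 0.9, 0.8])
print(weak_competitors[0], strong_competitors[0])
```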

Relevance Propagation
The web is a huge database, and web search offers many challenges and opportunities. Many of our ideas and algorithms will be applied to web search, covering both general web search and vertical search such as paper search and image search. For topic distillation, one task in general web search, we proposed a novel concept, subsite retrieval, to improve search performance. Subsite retrieval extends our sitemap-based feature propagation method for TREC 2004. We also carried out a comparative study of several representative relevance propagation methods.



CBIR (content-based image retrieval) - This was my topic while I was an intern in the Media Computing Group at MSRA. We proposed an active feedback framework for CBIR that seamlessly combines subspace clustering and label propagation.
Narrowband beamforming - This was my bachelor's degree research topic. My supervisors were Prof. Hao ZHANG and Xu-Dong ZHANG. In this work, we proposed a criterion, "SNEE", to distinguish narrowband from broadband beamforming.