I am currently using statistical learning techniques
to improve the ranking accuracy of search engines. There are four subtopics:
(1) Feature selection
There are hundreds of features in the web search scenario,
such as term frequency, document frequency, hyperlink information, static ranking value, BM25 score, and so on.
Some features introduce noise, and some are highly correlated. We need to pre-process the features to
reduce noise and computational complexity. Through feature selection, we try to drop the noisy features.
There are many feature selection algorithms in the literature. On the one hand, we will study the performance
of these algorithms; on the other hand, we will propose methods tailored to web search applications.
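
As a rough illustration of dropping highly correlated features, here is a minimal sketch of a greedy correlation filter. It is not the method we propose; the helper names (`pearson`, `drop_correlated`), the 0.95 threshold, and the toy feature values are all assumptions made up for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(vx * vy)

def drop_correlated(columns, threshold=0.95):
    """Greedily keep a feature only if its absolute correlation with
    every already-kept feature is below `threshold`.
    `columns` maps a feature name to its list of values."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (made up): tf_scaled is an exact multiple of tf.
features = {
    "tf": [1.0, 2.0, 3.0, 4.0],
    "tf_scaled": [2.0, 4.0, 6.0, 8.0],
    "static_rank": [0.9, 0.1, 0.5, 0.3],
}
selected = drop_correlated(features)
```

With this toy input the redundant `tf_scaled` column is dropped and the other two are kept. A filter like this only addresses correlation; detecting noisy features would need a relevance signal as well.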
(2) Revise loss function
Many state-of-the-art machine learning algorithms have been applied to web search,
such as RankSVM, RankBoost, and RankNet. We are interested in how to revise the loss functions
of these algorithms for the web search scenario.
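
For context, the sketch below shows a pairwise logistic loss of the kind RankNet-style methods optimize: each pair of documents where one is more relevant than the other contributes log(1 + exp(-(s_i - s_j))). It only illustrates the sort of loss being discussed, not any revision of it; the function name and the toy scores and labels are assumptions for this example.

```python
import math

def pairwise_logistic_loss(scores, labels):
    """Mean of log(1 + exp(-(s_i - s_j))) over all pairs (i, j) of one
    query's documents where labels[i] > labels[j]."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy relevance labels (made up): higher means more relevant.
labels = [2, 1, 0]
good = pairwise_logistic_loss([3.0, 2.0, 1.0], labels)  # scores agree with labels
bad = pairwise_logistic_loss([1.0, 2.0, 3.0], labels)   # scores reversed
```

The loss is lower when the scores order the documents the same way the labels do, so `good` is smaller than `bad`.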
(3) Define new ranking function
The ranking function of most existing learning algorithms operates on a single document. We will study the ranking function with
(4) Learning from dependent samples
The basic assumption of statistical learning theory is that the instances are independent and identically distributed.
Unlike in traditional learning problems, the document-level samples in web search are inter-dependent.
The rank position of a document depends not only on its own relevance score
but also on the relevance scores of the other documents. How to learn from inter-dependent
samples is a new research topic for machine learning.
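
The dependence between documents can be seen in a tiny example: a document's score stays fixed, yet its rank position changes when another document's score changes. The helper `rank_position` and the scores below are hypothetical and serve only as an illustration.

```python
def rank_position(scores, index):
    """1-based position of document `index` when documents are
    sorted by score in descending order."""
    return 1 + sum(1 for s in scores if s > scores[index])

# Document 0 keeps the score 0.7 in both lists (toy values).
before = rank_position([0.7, 0.5, 0.2], 0)
after = rank_position([0.7, 0.9, 0.2], 0)  # document 1 now scores higher
```

Here document 0 moves from position 1 to position 2 although its own score is unchanged.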
The Web is a huge database, and web search presents many challenges
and opportunities. Many of our ideas and algorithms will be applied to web
search, including both general web search and vertical search, such as
paper search and image search. For one task of general web
search, topic distillation, we proposed a novel concept, subsite
retrieval, to improve search performance. Subsite
retrieval extends our sitemap-based feature
propagation method for TREC2004. We also conducted a comparative study of
several representative relevance propagation methods.
CBIR - This was my topic while I was an intern in the Media Computing
Group at MSRA. We proposed an active feedback framework for CBIR
that seamlessly combines subspace clustering and label propagation.
Narrowband beamforming - This was my bachelor's degree research topic.
My supervisors were Prof. Hao ZHANG and Xu-Dong ZHANG. In this work,
we proposed a criterion "SNEE" to differentiate narrowband and