MLRG/summer08
From ResearchWiki
Machine Learning Reading Group
Informal reading group at Utah on machine learning topics (led by Hal). Papers, discussion, etc. will be posted here.
Time: Wed 3:40-5:00pm (see schedule for each week)
Topic: Multitask learning, transfer learning and domain adaptation
Location: Graphics Annex
Schedule
| Date | Papers | Presenter |
|---|---|---|
| Historical Interest | ||
| W 07 May |
Rich Caruana (1997). Multitask learning. Machine Learning, 28, 41-75. | Hal |
| The Bayesian Way | ||
| W 14 May |
Rajat Raina, Andrew Y. Ng and Daphne Koller (2006). Constructing informative priors using transfer learning. ICML. | Cecily |
| W 21 May |
Kai Yu, Volker Tresp and Anton Schwaighofer (2005). Learning Gaussian Processes from Multiple Tasks. ICML. | Amit |
| W 28 May |
Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram (2007). Multi-task learning for classification with dirichlet process priors. JMLR. | Nathan |
| The Kernel Way | ||
| W 04 Jun |
Theodoros Evgeniou, Charles Micchelli, and Massimiliano Pontil (2005). Learning multiple tasks with kernel methods. J. Machine Learning Research, 6: 615--637. | Arvind |
| Learning Theory and Generalization Bounds | ||
| W 11 Jun |
Ben-David, S. and Schuller, R. (2003). Exploiting task relatedness for multiple task learning. (NIPS version) (Longer COLT Version). | Piyush |
| W 25 Jun |
Abernethy, Bartlett and Rakhlin (2007). Multitask Learning with Expert Advice. Technical Report (UC Berkeley). | Arvind |
| W 2 Jul |
M. M. Mahmud, Sylvian Ray (2007). Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations. NIPS. | Cecily |
| Domain Adaptation | ||
| W 16 Jul |
Rie Kubota Ando and Tong Zhang (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, Vol 6:1817-1853, 2005. | Seth |
| W 23 Jul |
John Blitzer, Ryan McDonald, and Fernando Pereira (2006). Domain Adaptation with Structural Correspondence Learning. Empirical Methods in Natural Language Processing - EMNLP 2006. | Nathan |
| W 30 Jul |
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira (2006). Analysis of Representations for Domain Adaptation. NIPS | Piyush |
| W 06 Aug |
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. & Kawanabe, M. (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS. | Chang |
| W 13 Aug |
Steffen Bickel, Michael Bruckner and Tobias Scheffer (2007). Discriminative Learning for Differing Training and Test Distributions. ICML. | |
| W 20 Aug |
Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten Borgwardt and Bernhard Scholkopf (2007). Correcting Sample Selection Bias by Unlabeled data. NIPS. | |
Paper Summaries
14 May:
This paper utilizes the 20 ng data. They learn weights for term vectors using logistic regression. I had a much nicer summary, but being wiki-ly challenged as I am , I erased it when my hotspot internet connection expired, and I don't have time now to generate a new one.
21 May 08: Learning Gaussian Processes from Multiple Tasks (Amit)
Introduction The problem of multi-task learning here is formulated on a hierarchical Bayesian framework. The framework exploits the equivalence between parametric linear models and non-parametric Gaussian models.
Structure of the Paper In Section 2 we will see how can we view linear models as Gaussian Processes(GPs). The advantages of using Gaussian Processes over linear models is GPs directly work on the kernel matrix for finite training points. Thus model complexity is independent of the dimensionality of x, which allows us to work even in infinite dimensional feature space. Second non-linear functions can be handled using kernels K(x,x')=<x,x'>. Section3 tells us how can we learn from multiple tasks using probabilistic framework. Section 4 gives us a generative story which uses linear models for multi-task. Here they assume a normal-inverse-Wishart distribution as the hyper prior over multiple tasks. This distribution is the conjugate prior for a multivariate Gaussian distribution. EM algorithm is used to estimate parameters and the complexity of the above model is O(kmd3). m is number of multi-tasks and k is number of EM iterations and d is the dimensionality of data. Section5 uses Gaussian processes to derive models both for transductive as well as inductive setting. This model is useful only when n < < d i.e. number of training points is comparatively lot less than dimensionality of feature space. The advantage of using a Gaussian process is its complexity only depends on just training data size rather than dimensionality of data size. the complexity of this model is O(kmn3). Rest of the paper talks about experiment section in which they show results on a toy problem and text categorization task. They found their Gaussian process model to be better than other models.
28 May 08: Multi-Task Learning for Classification with Dirichlet Process Priors (Nathan)
The Goal
The authors consider two cases of the Mult-task learning problem. The first is symmetric multi-task learning (SMTL) which deals with the situation of training classifiers for a set of tasks jointly. The second case is referred to as asymmetric multi-task learning (AMTL) which uses the posterior density function from the first case (SMTL) as the prior for a new task (not previous seen in the SMTL step.) The later has the advantage of not requiring all the data from the previous M tasks and thus is arguably more efficient.
The Paper
The authors produce two algorithms for inference in each situation described above, both algorithms are applications of hierarchical Bayesian models utilizing Dirichlet Process Priors (as the title might suggest :) In the symmetric case, a truncated stick-breaking formulation of the Dirichlet Process is used share information across tasks in the form of a prior distribution G from which the parameters for a task-specific logistic regression classifier is drawn. In short, the classifiers are trained on local (task-specific) training data, but also have access to the other tasks via the parameter drawn from the DP prior G. In order to actually compute G an approximation method named variational inference is used and will briefly be described.
For the asymmetrical case, essentially, the SMTL algorithm is ran for M tasks, then the posterior from the first step is used to formulate a prior for the M+1 task. In order to do inference over this model, a Markov Chain Monte Carlo algorithm is used, namely the Metropolis-Hastings algorithm. The AMTL approach is relatively inexpensive IF one already happens to have the posterior from a SMTL experiment laying around.
The Dirichlet Process and its formulation will also be addressed as well as a criterion for determining if tasks are similar that was also presented in this paper.
Other useful links
Overview of Dirichlet Processes
Variational Inference in DP Mixtures
Lots of Variational inference links
June 4, 08 : Learning Multiple Tasks with Kernel Methods (Arvind)
So far we have been looking at Bayesian way of doing a multi-task learning. In this paper, we will see how we can use kernel to learn multiple tasks. We all know how to use kernel to learn a single task (or a function ), in this paper, author extends this idea of learning a single task to learn multiple tasks simultaneously using special kernels called multi-task kernel functions.
Author first considers a linear case and shows that if all learning functions (tasks) are assumed to be linear (fl(x) = ul'x), then multi-tasks learning problem can be formulated in the form of a regularization problem (as given in eq 6):
-- eq(1)
Regularizer term J(u) controls the degree of relatedness among various tasks. Since it is hard to say anything about the regularizer if problem is cast in this way, author presents another formulation that is more intuitive and gives a better idea of task relatedness. We still have the linear function assumption. In this new formulation, feature vectors(weight vectors) of all tasks are combined into a new feature vector w that is common among all tasks. Linear function is given by fl(x) = w'Blx where w is common to all problems and Bl is task specific. Using this linear form of function gives us the following formulation of the multi-task problem:
-- eq(2)
Author shows that these two formulations are related. In eq(1), tasks-relatedness is given by the regularizer term J(u) while in eq(2), task relatedness is given by Bl.
Later, author shows three different examples of the problem for three different Bl. For all three examples, author gives the relation between J(u) and Bl . Finally to show the superiority of this algorithm, author compares it with single-task version of a kernel machine (learning one kernel function for each task) and one-task (learning one function for all tasks using all data combined.)
June 11, 08 : Exploiting Task Relatedness for Multiple Task Learning (Piyush)
So far, we have seen empirical evidences on how multitask learning can outperform single task learning, in certain cases. This paper is among the first ones that provided theoretical justifications to this fact.
The paper presents a framework that views task relatedness, not from an algorithmic perspective, but on the basic of similarity of example generating distributions, underlying each of the tasks. It defines a data generation mechanism that serves to determine the notion of task relatedness. Essentially, this definition of task relatedness makes use of a set of transformations F. It assumes that the data for each task can be thought of as been generated by a distribution obtained by picking a transformation function from this set and applying it to some "fixed" distribution P. Although this notion doesn't capture task relatedness for all MTL scenarios but holds for several interesting real world MTL problems. This notion helps to provide generalization error bounds for each task. This is in contrast with a previous result (corollary 13, this paper) which only bounds the average generalization error over the different tasks.
The paper defines a generalized VC-dimension parameter for the MTL case and shows that the generalized VC-dimension parameter can be significantly less than the standard VC-dimension. It provides an analysis giving the relationship of this parameter with the standard VC-dimension and provides conditions under which the generalization guarantees (based on per task sample size needed) for MTL are better than the case of single task learning.
June 25, 08 : Multitask Learning with Expert Advice (Arvind)
In this paper we see how we can do the multitask learning using many experts in online settings (sequential prediction). Problem can be stated as following: A forecaster (ultimate expert) is to make a sequence of predictions given access to a number of experts where each expert makes its own prediction at every round. The forecaster combines the predictions of the experts to form its own prediction, taking into account each expert's past performance. There are two variations of this online problem, one is sequential multitask problem where task specific examples come in a sequence (Forecaster is asked to make sequence of predictions on task one, then another sequence of predictions on task two and so on) and another is shifting multitask problem where order of tasks is arbitrary.
A single task version of the sequential prediction problem is when there is only one task and many experts. Forecaster makes predictions taking into account each expert's opinion. In multitask version there are many (K) tasks and authors claim that there is a trade off between the number of experts to be used for prediction and the complexity of the prediction algorithm. We can have a case where number of experts is equal to the number of tasks and each expert makes prediction for its own tasks (unrelated case) while in another case, we can just choose one expert making prediction for every task (fully related). We are interested in a case where number of experts (m) is more than one but less than number of tasks.
Once we have number of experts to be used for prediction, we can define a mapping π that maps each task to the one of the m experts and we call this mapping a hyperexpert. Given access to N experts, we would like to make prediction using these hyperexperts (note that these hyperexperts are m-sized subset of N experts which also include a mapping from tasks to experts). Since there are exponential number of subsets, it is hard to enumerate all subsets (or hyperexperts) and make the prediction the way we did for single task version. Authors provide an algorithm which makes use of sampling techniques to approximate the prediction which was hard in its exact form. Authors show experimentally that their algorithm gets good approximation and converge quickly to the exact case.

