Handbook for PhD Students

This PhD Handbook serves a dual purpose: it defines the research methodology of our group and gives general advice to students, and it sets out standards and processes which all students in the group are expected to strive for.

Experimental evaluation

Systematic evaluation is a central element of our research methodology. Much of the research in our group is empirical, which means that claims and hypotheses are tested in practice.

An important question to consider is what evaluation tasks (simulation environments) to use. The choice of evaluation tasks should be guided by your open problem, in that a solution to the problem is required to successfully complete the tasks. I call such tasks challenge tasks because they allow you to say "this task cannot be solved (well) with current methods due to the open problem". Using a non-contrived, ambitious challenge task gives your research a stronger justification and can force you to think outside the box to find novel ideas.

When designing experiments, have specific questions in mind which the experiments should answer. For example:

  • How does your method compare to other methods? (e.g. rewards, accuracy, sample complexity, compute time)
  • How does your method scale in certain dimensions? (e.g. number of agents, actions, states)
  • What is the contribution/effect of different components in your method? (ablation study)
  • How does your method perform if certain assumptions are violated?
  • If you observed a specific result and have a hypothesis that explains it, you may want to design additional experiments that specifically test this hypothesis.
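
To make this concrete, here is a minimal Python sketch of turning such questions into an experiment grid. The method names, the scaling dimension, and the run_experiment function are all hypothetical placeholders; the point is that every configuration is run over several independent seeds and the raw per-seed results are logged, so that comparison, scaling, and later statistical questions can all be answered from the same data.

```python
import itertools
import json
import random
import statistics

# Hypothetical experiment grid: the method names, scaling dimension, and
# run_experiment below are placeholders; substitute your own agents,
# environments, and training code.
METHODS = ["our_method", "baseline_a", "baseline_b"]
NUM_AGENTS = [2, 4, 8]   # example scaling dimension (e.g. number of agents)
SEEDS = range(10)        # independent runs per configuration

def run_experiment(method, num_agents, seed):
    """Placeholder for one training/evaluation run; returns the metrics of
    interest (e.g. final return, sample complexity, compute time)."""
    random.seed(seed)
    return {"final_return": random.gauss(mu=1.0 / num_agents, sigma=0.1)}

results = []
for method, n, seed in itertools.product(METHODS, NUM_AGENTS, SEEDS):
    metrics = run_experiment(method, n, seed)
    results.append({"method": method, "num_agents": n, "seed": seed, **metrics})

# Keep the raw per-seed results so that plots and significance tests can be
# produced later without re-running the experiments.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

# Quick summary: mean and standard deviation per (method, num_agents) pair.
for method, n in itertools.product(METHODS, NUM_AGENTS):
    returns = [r["final_return"] for r in results
               if r["method"] == method and r["num_agents"] == n]
    print(f"{method:12s} agents={n}: mean={statistics.mean(returns):.3f} "
          f"sd={statistics.stdev(returns):.3f}")
```

Storing the raw per-seed results, rather than only aggregated means, also makes later statistical analysis and plotting much easier.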

The use of baseline methods is crucial to set the bar for what constitutes good and bad performance. There are generally three kinds of baselines:

  • Best-case/worst-case: Best/worst-case baselines are useful to show the margin of possible improvement. Best/worst can mean different things depending on your work. For example, a best-case may be an exact method with "unlimited" compute time, or with access to privileged information that would normally not be available.
  • Alterations/ablations: If your method consists of several interacting components, it is important to show the effect of each component. For example, these may be elements of your network architecture and input, loss functions, or other sub-functions in your method. Ablation studies take your full method and remove or alter certain components to show how performance changes in these ablated/altered variants (a minimal sketch follows this list).
  • State-of-the-art: If possible, it is a good idea to include some recent state-of-the-art algorithms for comparison. Ideally, each algorithm should use individually optimised hyper-parameters to maximise its performance.
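
As an illustration of the ablation idea above, here is a minimal Python sketch (the component names and the evaluate function are purely hypothetical) which runs the full method alongside variants with individual components switched off:

```python
import random
import statistics

# Hypothetical components of a method; in practice these might be parts of the
# network architecture, auxiliary loss terms, or other sub-functions.
FULL_METHOD = {"use_attention": True, "use_aux_loss": True, "use_replay": True}

def ablations(full_config):
    """Yield the full method plus variants with one component switched off."""
    yield "full", dict(full_config)
    for component in full_config:
        ablated = dict(full_config)
        ablated[component] = False
        yield f"no_{component}", ablated

def evaluate(config, seed):
    """Placeholder for training and evaluating one configuration."""
    random.seed(seed)
    return random.gauss(mu=sum(config.values()), sigma=0.5)

for name, config in ablations(FULL_METHOD):
    returns = [evaluate(config, seed) for seed in range(10)]
    print(f"{name:16s} mean return = {statistics.mean(returns):.2f} "
          f"(sd = {statistics.stdev(returns):.2f})")
```

Driving ablations from a configuration like this makes it easy to add or remove components as your method evolves, and ensures every variant is evaluated under the same protocol.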

It is a good idea to think of sensible baselines early in your research rather than as an afterthought.

Once you have results from your experiments, you should use the right tools to understand the data. Use good visualisation, including second-order statistics (e.g. standard deviation/standard error). Statistical hypothesis tests should be used to establish whether observed differences are real or due to variance between runs. You should understand the meaning of statistical significance and how to conduct such tests correctly. See this guide on statistical hypothesis testing in RL research.
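
As a minimal sketch of such an analysis (the per-seed returns below are placeholder numbers, not real results), one can report means with standard errors and use Welch's t-test from scipy to check whether the difference between two algorithms is larger than seed-to-seed variance would explain:

```python
import numpy as np
from scipy import stats

# Hypothetical final returns over 10 independent seeds for two algorithms;
# in practice these come from your logged experiment results.
ours     = np.array([0.82, 0.79, 0.85, 0.90, 0.76, 0.88, 0.81, 0.84, 0.87, 0.80])
baseline = np.array([0.74, 0.78, 0.71, 0.80, 0.69, 0.77, 0.73, 0.75, 0.79, 0.72])

for name, x in [("ours", ours), ("baseline", baseline)]:
    std_err = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
    print(f"{name:8s} mean = {x.mean():.3f} +/- {std_err:.3f} (std. error)")

# Welch's t-test (does not assume equal variances): a small p-value indicates
# that the observed difference is unlikely to be explained by seed-to-seed
# variance alone.
t_stat, p_value = stats.ttest_ind(ours, baseline, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```

Depending on the distribution of results, a non-parametric test (e.g. Mann-Whitney U) or bootstrap confidence intervals may be more appropriate.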