PhD position Unsupervised Deep Representation Learning for Video

Published: 7 May
Deadline: 31 May
Location: Amsterdam

You cannot apply for this job anymore.

Job description

In recent years, with the advent of deep learning, video understanding, be it action or activity classification, video object recognition or object tracking, has advanced significantly. Interestingly, most of the progress has focused on stylized, short video segments of at most a few seconds, which can be and are strongly supervised. However, strong supervision is a significant constraint. It is already a constraint for learning representations from images, where much research is currently being undertaken on contrastive learning and similar methodologies. More importantly, video contains the time signal, which is presumably a strong signal for supervision, provided it is exploited correctly, for instance by taking into account the arrow of time, spatiotemporal continuity, or causality. In fact, for video, and spatiotemporal data in general, unsupervised learning is even more critical, given that:

  • the longer the video, the less relevant it is to have a single manually provided label that corresponds to the whole duration of the video;
  • be it for short or long videos, human annotations bias the model towards specific semantics, which are not necessarily the appropriate ones when training video deep networks. For one, motion or dynamic changes often have little to do with the attached label, and it is virtually impossible to manually annotate motion or dynamic changes in complex, in-the-wild videos;
  • current strong supervision does not exploit any of the spatiotemporal redundancy and continuity present in videos. As a result, training a standard video deep network nowadays requires an exorbitant amount of GPU compute and, most importantly, there is no sign that this could become more manageable with current approaches. It should be simpler than that;
  • strong supervision gives little information in the way of causality. However, for robust models and good generalization, taking cause and effect into account is important. Thankfully, videos and recordings over time offer great opportunities for exploiting cause and effect, also for learning models without manually provided labels;
  • it is very likely that strong and generalizable learning, even at an image level, can only be attained with unsupervised learning on spatiotemporal data and videos. What moves is what acts, and what acts is what influences the world and must be modelled;
  • current unsupervised learning methods from core machine learning are not designed to take the time dimension into account or exploit it optimally;
  • besides the aforementioned arguments, the practical one of having far more unannotated than annotated video data is also relevant and important.

These are only a few basic reasons why unsupervised learning in videos is of critical value, presumably even more so than for images. These differences alone, however, already show that for the next generation of deep neural networks for video we need fundamentally different and more nuanced learning objectives.

In this PhD position, we will research unsupervised deep representation learning for videos and complex spatiotemporal sequences. Our lab has significant experience in long-term video action classification and tracking, showing that decomposition of long spatiotemporal convolutions (Hussein, Gavves, Smeulders, 2019a), spatiotemporal graphs (Hussein, Gavves, Smeulders, 2019b), and Siamese networks (Tao, Gavves, Smeulders, 2015) are key to scalability in video. Inspired by prior work, and given the aforementioned (subset of) fundamental challenges, the research includes overarching questions like:

  • What are optimal unsupervised learning objectives for deep neural networks for short videos?
  • What are optimal unsupervised learning objectives for deep neural networks for long and complex videos?
  • Can unsupervised learning objectives exploit the spatiotemporal redundancy, continuity and causality to pretrain deep spatiotemporal neural networks?
  • What are optimal architectures to facilitate unsupervised representation learning in videos?
  • Is it possible to rely on unsupervised representation learning while still maintaining a reasonable computational budget?

You will be supervised by Dr. E. Gavves, Associate Professor at the University of Amsterdam (UvA). This project is financed by the H2020 ERC Starting Grant ‘EVA: Expectational Visual Artificial Intelligence’ and the NWO VIDI Grant ‘TIMING: Learning Time in Videos’.

What are you going to do

You will carry out research and development in the area of Deep Machine Learning and Vision. The research is embedded in the VISlab group at the UvA. Your tasks will be to:

  • develop new deep machine learning and/or computer vision methods for unsupervised deep representation learning for video;
  • collaborate with other researchers within the lab;
  • regularly present internally on your progress;
  • regularly present intermediate research results at international conferences and workshops, and publish them in proceedings (CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR) and journals (PAMI, IJCV, CVIU);
  • assist in relevant teaching activities;
  • complete and defend a PhD thesis within the official appointment duration of four years.


Requirements


  • an MSc degree in Artificial Intelligence, Computer Science, Engineering, (Applied) Mathematics or Physics, or a related field;
  • a strong background in computer vision, machine learning, and deep learning;
  • excellent programming skills, preferably in Python;
  • solid mathematical foundations, especially in statistics, calculus and linear algebra;
  • a highly motivated, passionate, creative and independent attitude;
  • strong communication, presentation and writing skills and an excellent command of English.

Prior publications in relevant machine learning, vision, or dynamical systems conferences or journals (NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, JMLR, PAMI, IJCV, CVIU) are advantageous.

Conditions of employment

Fixed-term contract: 18 months.

A temporary contract for 38 hours per week for the duration of 4 years (the initial contract will be for a period of 18 months and, after satisfactory evaluation, it will be extended to a total duration of 4 years). This should lead to a dissertation (PhD thesis). We will draft an educational plan that includes attendance at courses and (international) meetings. We also expect you to assist in teaching undergraduate and master's students.

Based on a full-time appointment (38 hours per week), the gross monthly salary will range from €2,395 in the first year to €3,061 in the last year, exclusive of 8% holiday allowance and 8.3% end-of-year bonus. A favourable tax agreement, the ‘30% ruling’, may apply to non-Dutch applicants. The Collective Labour Agreement of Dutch Universities is applicable.


University of Amsterdam

With over 5,000 employees, 30,000 students and a budget of more than 600 million euros, the University of Amsterdam (UvA) is an intellectual hub within the Netherlands. Teaching and research at the UvA are conducted within seven faculties: Humanities, Social and Behavioural Sciences, Economics and Business, Law, Science, Medicine and Dentistry. Housed on four city campuses in or near the heart of Amsterdam, where disciplines come together and interact, the faculties have close links with thousands of researchers and hundreds of institutions at home and abroad.  

The UvA’s students and employees are independent thinkers, competent rebels who dare to question dogmas and aren’t satisfied with easy answers and standard solutions. To work at the UvA is to work in an independent, creative, innovative and international climate characterised by an open atmosphere and a genuine engagement with the city of Amsterdam and society.


Faculty of Science - Informatics Institute

The Faculty of Science has a student body of around 7,000, as well as 1,600 members of staff working in education, research or support services. Researchers and students at the Faculty of Science are fascinated by every aspect of how the world works, be it elementary particles, the birth of the universe or the functioning of the brain.

The mission of the Informatics Institute is to perform curiosity-driven and use-inspired fundamental research in Computer Science. The main research themes are Artificial Intelligence, Computational Science and Systems and Network Engineering. Our research involves complex information systems at large, with a focus on collaborative, data-driven, computational and intelligent systems, all with a strong interactive component.

The position is with Dr. Efstratios Gavves, associate professor in the Video & Image Sense lab (VISlab) led by Prof. C. Snoek. VISlab is a world-leading lab on Computer Vision and Machine Learning, and has over 40 PhD students, postdoctoral researchers and faculty members working on a broad variety of core computer vision and core machine learning subjects: from action and object recognition or efficient spatiotemporal deep learning, to stochastic probabilistic models, temporal causality and graph neural networks. We strongly encourage collaboration within the lab. Other labs on Machine Learning and Computer Vision at the Informatics Institute include AMLab by Prof. M. Welling and CVlab by Prof. T. Gevers.


  • PhD scholarship
  • Natural sciences
  • max. 38 hours per week
  • €2,395 to €3,061 per month
  • University graduate
  • 21-335


University of Amsterdam (UvA)


Science Park 904, 1098 XH, Amsterdam