The field of computer vision has seen incredible progress, but some believe there are signs it is stalling. At the “Quo Vadis, Computer Vision?” workshop of the International Conference on Computer Vision (ICCV) 2023, researchers discussed what comes next for the field.
In this post we bring you the main takeaways from some of the best minds in the Computer Vision landscape who gathered for this workshop during ICCV23 in Paris.
Table of Contents
- Quo Vadis, Computer Vision?
- The Case Against Foundation Models
- Data over Algorithms
- Video can describe the world better than Text
- After Data-Centric, the User will be the core
- Bring back the fundamentals
- So, is Computer Vision dead?
Disclaimer: We went undercover into the workshop to bring you the most secret CAMRiP-quality insights! 🕵️
1. Quo Vadis, Computer Vision?
Computer vision has reached a critical juncture with the emergence of large generative models. This development is having a dual impact. On the one hand, it is opening new research avenues and attracting academics and businesses eager to capitalize on these innovations. On the other, the swift pace of advancement is causing uncertainty among computer vision researchers about where to focus next.
Many feel conflicted, wondering whether to chase progress in generative models or to keep working on more established computer vision problems. This ICCV 2023 workshop (see Figure 1) brought together experts like David Forsyth, Bill Freeman, and Jitendra Malik to discuss this pivotal moment.
In the following sections we provide highlights of the lively discussions that followed on how computer vision should adapt to and leverage generative models while still tackling core challenges in areas like video and embodied perception. There was consensus that thoughtfully combining the strengths of computer vision and generative models is key, rather than treating them as competing approaches.
2. The Case Against Foundation Models
MIT professor Bill Freeman provided three reasons why he doesn’t like foundation models:
Reason 1: They don’t tell us how vision works
In short, Bill Freeman argues that foundation models can solve vision tasks, but this achievement tells us nothing about how vision works: they remain black boxes.
Reason 2: They aren’t fundamental (and therefore not stable)
As shown in Figure 2, Professor Freeman suggests that foundation models are simply the latest trend, not a stable foundation to build on.
Reason 3: They separate academia from industry
Finally, Professor Freeman argues that foundation models create a boundary between academia (creative teams without large-scale resources) and industry (less imaginative teams with well-organized resources).
3. Data over Algorithms
Berkeley professor Alexei (Alyosha) Efros shared two ingredients for achieving true AI:
- Focus on data over algorithms: GigaGAN [1] showed that large datasets enable older architectures such as GANs to scale.
- Bottom-up emergence: data per se is mostly noise; what is crucial is the right kind of (high-quality) data.
He also argues that LLMs are winning because they are trained on virtually all available data for just a single epoch! (see Figure 3).
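To make the single-epoch point concrete, here is a minimal PyTorch sketch of what a “one pass over everything” training loop looks like. The tiny model, synthetic stream, and toy objective are our own placeholder assumptions for illustration, not the setup of any real LLM:

```python
# Minimal sketch of the "one epoch over all the data" recipe.
# Model, stream, and objective are toy stand-ins (assumptions).
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # stand-in for a large sequence model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def data_stream(num_batches=1000):
    """Yield each batch exactly once, as if streaming a web-scale corpus."""
    for _ in range(num_batches):
        yield torch.randn(32, 128)

# Note what is absent: there is no outer `for epoch in range(...)` loop.
# Every example is seen exactly once, so nothing is learned by repetition.
for x in data_stream():
    loss = ((model(x) - x) ** 2).mean()  # toy self-supervised objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```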
4. Video can describe the world better than Text
Berkeley professor Jitendra Malik offered an audacious take: video is a more efficient (and perhaps more effective) way to describe the world than text.
He supports this view by arguing that any book (see Figure 4 for some examples) can be represented more compactly using video (i.e. frames) than text (i.e. tokens): the same information can be conveyed far more efficiently.
Professor Malik believes video will put Computer Vision back on the map in the next few years.
5. After Data-Centric, the User will be the core
Princeton professor Olga Russakovsky provided fascinating insights on what comes next after the data-centric approach to machine learning.
She elegantly explained (Figure 5) how the field has evolved from a pure focus on models (circa 2000) to the current mantra of “data is king”, and argued that an era where the human (i.e. the user) is at the center is next.
For instance, she made the case for gathering truly representative data from all over the world rather than simply focusing on web data (see Figure 6).
6. Bring back the fundamentals
Finally, MIT professor Antonio Torralba gave a light-hearted talk in which he candidly shared his views on why curiosity is more important than performance (see Figure 8), especially in today’s LLM-driven world.
Professor Torralba argues that Computer Vision has been in this position before, with (mostly) outsiders confidently declaring that the field has stalled; yet time has proven that someone always comes up with a clever idea by focusing on the fundamentals rather than following the crowd.
7. So, is Computer Vision dead?
The ICCV23 workshop makes clear that rather than being dead, computer vision is evolving. As leading experts argued, promising directions lie in the interplay between vision and language models.
However, other frontiers also hold potential, like exploring when large vision models are actually needed, or providing granular control over frozen generative architectures, as described in one of the papers awarded the Marr Prize [2] at ICCV23.
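For intuition, [2] achieves that granular control by freezing the pretrained model and training a copy of its blocks, connected through zero-initialized convolutions. Below is a heavily simplified PyTorch sketch of that wiring; the stand-in block, channel sizes, and toy inputs are our assumptions for illustration, not the paper’s actual diffusion U-Net:

```python
# Simplified sketch of the idea in [2]: keep a pretrained block frozen,
# train a copy on a conditioning signal, and join the two with
# zero-initialized convolutions so training starts as an exact no-op.
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.control = copy.deepcopy(pretrained_block)  # trainable copy
        self.frozen = pretrained_block
        for p in self.frozen.parameters():
            p.requires_grad = False                     # lock original weights

        # 1x1 convs initialized to zero: at step 0 the control branch
        # contributes nothing, preserving the pretrained model's behavior.
        self.zero_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.zero_out = nn.Conv2d(channels, channels, kernel_size=1)
        for conv in (self.zero_in, self.zero_out):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, condition):
        h = self.control(x + self.zero_in(condition))   # conditioned branch
        return self.frozen(x) + self.zero_out(h)        # frozen path + control

# Toy usage with a stand-in "pretrained" block (an assumption for the demo):
block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), channels=8)
x = torch.randn(1, 8, 32, 32)       # feature map
edges = torch.randn(1, 8, 32, 32)   # conditioning signal, e.g. an edge map
print(block(x, edges).shape)        # torch.Size([1, 8, 32, 32])
```

The zero initialization is the design trick worth noting: it lets the control branch learn gradually without ever degrading the frozen model’s outputs at the start of training.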
While progress may require integrating the strengths of vision and language models, core computer vision problems remain open in areas like texture perception and peripheral vision, where the question of how to throw away information is still unresolved. With an influx of new researchers and industry interest, the field is poised to take on some of these questions.
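As a toy illustration of “throwing away information” (our own example, not one shown at the workshop), here is a minimal NumPy/SciPy sketch that discards peripheral detail by blurring an image more as distance from a fixation point grows; all parameters are arbitrary assumptions:

```python
# Toy peripheral-vision sketch: keep detail at the fovea, blur the periphery.
# All constants are arbitrary choices for the demo.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(img: np.ndarray, fix_y: int, fix_x: int, max_sigma: float = 6.0):
    """Return a copy of `img` whose blur grows with eccentricity."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(ys - fix_y, xs - fix_x)
    ecc /= ecc.max()  # normalize eccentricity to [0, 1]

    # Precompute a small stack of progressively blurred copies, then pick
    # a blur level per pixel according to its eccentricity.
    sigmas = np.linspace(0.0, max_sigma, 5)
    stack = np.stack([gaussian_filter(img, s) if s > 0 else img for s in sigmas])
    levels = np.clip((ecc * (len(sigmas) - 1)).round().astype(int),
                     0, len(sigmas) - 1)
    return np.take_along_axis(stack, levels[None], axis=0)[0]

# Toy usage on random "pixels"; fixate at the image center.
img = np.random.rand(64, 64)
out = foveate(img, 32, 32)
print(out.shape)  # (64, 64)
```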
References
[1] Scaling up GANs for Text-to-Image Synthesis
[2] Adding Conditional Control to Text-to-Image Diffusion Models
Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan
If you would like to know more about Tenyks, sign up for a sandbox account.