Food recognition is a major challenge in the field of computer vision, requiring models that can effectively handle the wide variability and complexity of food images. In this paper, we explore the use of vision transformers, a category of models based on self-attention mechanisms, to address the task of food recognition. We focus on training and fine-tuning different vision transformer architectures on Food2K, a large-scale dataset of food images with 2,000 categories. We compare the performance of vision transformers with convolutional neural networks (CNNs) on Food2K and Food101. In addition, we use state-of-the-art explainability techniques to highlight the regions of interest that vision transformers take into account when making a prediction. Our results show that vision transformers can achieve competitive results on food recognition tasks, with the added benefit that pre-training on Food2K improves their generalization capabilities and interpretability. This study highlights the potential of vision transformers in food computing, paving the way for future research in this field.
Bianco, S., Buzzelli, M., Chiriaco, G., Napoletano, P., Piccoli, F. (2023). Food Recognition with Visual Transformers. In 2023 IEEE 13th International Conference on Consumer Electronics - Berlin (ICCE-Berlin) (pp. 82-87). IEEE. DOI: 10.1109/ICCE-Berlin58801.2023.10375660.
Food Recognition with Visual Transformers
Bianco, Simone; Buzzelli, Marco; Napoletano, Paolo; Piccoli, Flavio
2023
File | Size | Format
---|---|---
Bianco-2023-ICCE Berlin-VoR.pdf (access: archive managers only; attachment type: Publisher’s Version (Version of Record, VoR); license: all rights reserved) | 387.08 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.