Generate descriptions of images and create text that matches them. New: zero-shot, instruction-guided vision-to-language generation.
Multimedia content