Vision-Language Model (VLM) Engineer
Join us as a Vision-Language Model Engineer to shape multimodal AI: design, train, and deploy models for image, text, and beyond. A collaborative, impact-driven role. Apply today.
We usually respond within two weeks
We are seeking a highly skilled Vision-Language Model (VLM) Engineer to design, develop, and deploy state-of-the-art multimodal AI systems. You will work at the intersection of computer vision and natural language processing, contributing to cutting-edge products that combine image and text understanding.
Key Responsibilities:
Design and implement vision-language models for tasks such as image captioning, visual question answering, and cross-modal retrieval
Train, fine-tune, and evaluate multimodal models using large-scale datasets
Optimize model performance for scalability and real-world deployment
Collaborate with cross-functional teams including data scientists, software engineers, and product managers
Stay up to date with the latest research in multimodal AI and apply it to production systems
Required Qualifications:
Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, or a related field
Strong experience with Python and deep learning frameworks (e.g., PyTorch or TensorFlow)
Solid understanding of machine learning, computer vision, and NLP concepts
Experience with multimodal models or related architectures (e.g., transformers)
Familiarity with handling large datasets and distributed training
Preferred Qualifications:
Experience with models such as CLIP, BLIP, or similar multimodal architectures
Knowledge of model deployment (Docker, APIs, cloud services)
Publications or contributions to AI research projects
Experience working with real-world AI applications
Locations: Istanbul
Remote status: Fully Remote
About Wide and Wise
Wide and Wise is a top recruitment agency with offices in Istanbul, Milan, and Dubai, connecting exceptional talent with leading companies across EMEA, MENA, and the US.