A week before Apple hosts its “Awe Dropping” event on September 9, the Cupertino-based tech giant released two new AI models – FastVLM and MobileCLIP2.
Available on the popular open-source platform Hugging Face, both models run locally on-device and respond in near real time. FastVLM is a vision language model (VLM) designed for fast, high-resolution image understanding, while MobileCLIP2 pairs vision with language by matching images against text descriptions.
Both models are optimized for Apple silicon and built on MLX, the company's own open-source machine learning framework, which offers a lightweight way to run and train models on Apple hardware.
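For readers who want to poke at the checkpoints themselves, here is a minimal sketch of pulling one onto your machine with the huggingface_hub client. The repo id "apple/FastVLM-0.5B" is an assumption based on the article; check the model card on Hugging Face for the exact name and the recommended way to load it.

```python
# Minimal sketch: download a checkpoint locally with huggingface_hub.
# The repo id below is an assumption; verify it on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="apple/FastVLM-0.5B")
print(f"Model files downloaded to: {local_dir}")
```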
Apple claims the new models are up to 85 times faster and 3.4 times smaller than comparable earlier work. In case you are wondering, MobileCLIP2 is a CLIP-family vision-language model, meaning it embeds images and text into a shared space so the two can be matched against each other.
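To make that concrete, here is a rough, illustrative sketch of what a CLIP-style model does: it scores how well each of a handful of candidate text descriptions matches an image. OpenAI's public CLIP checkpoint is used as a stand-in, since MobileCLIP2's own loading code may differ; see its Hugging Face model card for the real thing.

```python
# Illustrative sketch of CLIP-style image-text matching, using OpenAI's public
# CLIP checkpoint as a stand-in for MobileCLIP2 (whose loading code may differ).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # any local image
labels = ["a dog", "a cat", "a cup of coffee"]       # candidate descriptions

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)      # similarity as probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```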
If you think @Apple is not doing much in AI, you’re getting blindsided by the chatbot hype and not paying enough attention!
They just released FastVLM and MobileCLIP2 on @huggingface. The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision…
— clem 🤗 (@ClementDelangue) September 1, 2025
That is genuinely handy, since Apple could use it for describing what is in a picture, identifying objects, and even generating captions for images, all without compromising your privacy because everything runs on-device. As for FastVLM, Apple also released a lightweight version of the model, FastVLM-0.5B, which you can try in your preferred browser by heading over to Hugging Face.
Depending on your hardware, the model may take some time to load, but once it does, you can feed it live video from your camera or from a virtual camera app. It works as intended, although the running stream of descriptions can occasionally be hard to follow. If you are unsure what to ask, the demo also offers prompt suggestions in the bottom-left corner, such as “Describe what you see in one sentence”, “Identify any text or written content visible” and “Name the object I am holding in my hand.”
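Outside the browser, roughly the same loop can be sketched in a few lines: grab frames from a camera and ask a captioning model to describe each one. This is a hypothetical example, not Apple's demo code; BLIP is used as a stand-in captioner because FastVLM's own loading steps are documented on its model card and may require custom code, but the overall flow (frame in, description out) is the same.

```python
# Hypothetical sketch: caption webcam frames, similar to the demo's
# "Describe what you see in one sentence" suggestion. BLIP stands in
# for FastVLM here; the frame-in, description-out flow is the same.
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

cap = cv2.VideoCapture(0)  # default camera; a virtual camera app also shows up here
try:
    for _ in range(5):     # describe a handful of frames, then stop
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV delivers BGR frames; convert to RGB before handing them to the model.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = captioner(Image.fromarray(rgb))
        print(result[0]["generated_text"])
finally:
    cap.release()
```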
The FastVLM family of AI models also includes more powerful versions with 1.5 billion and 7 billion parameters, which deliver better answers at the cost of needing more capable hardware.