Apple’s Innovative Leap with Open-Source Multimodal AI – Ferret


In October 2023, researchers from Apple and Columbia University quietly released Ferret, an open-source multimodal large language model. The release on GitHub drew little attention at first, as it came without any official announcement. The code was published on October 30 together with Ferret-Bench, and the model checkpoints followed on December 14.

By releasing both the code and the weights, Apple took an unexpected turn, given its historical guardedness about tech releases. Ferret is available in 7-billion and 13-billion parameter versions. Its ability to analyze specific image regions and respond contextually to queries stands out. The smaller model is tailored for iOS devices, demonstrating Apple’s dedication to mobile efficiency.
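To put the two model sizes in perspective, a rough estimate of weight storage helps. The sketch below assumes that the language-model weights dominate memory and ignores activations and the image encoder. The function name and the 16-bit and 4-bit precisions are illustrative assumptions, not figures from the Ferret release.

```python
def weight_storage_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for model weights in decimal gigabytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts from the article; precisions are assumptions for illustration.
for name, params in [("7B", 7_000_000_000), ("13B", 13_000_000_000)]:
    for bits in (16, 4):
        size = weight_storage_gb(params, bits)
        print(f"{name} model at {bits}-bit: about {size:.1f} GB of weights")
```

At 16-bit precision the 7-billion parameter model needs roughly 14 GB for weights alone, which is why lower-precision formats matter for any on-device use.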

Apple's Push for AI Integration

Apple’s recent research papers on deploying Large Language Models (LLMs) on phones underline the company’s commitment to integrating more AI components into its devices. Ferret-Bench, the benchmark released alongside the code, helps researchers evaluate the model’s referring and grounding abilities across various use cases.

Ferret hints at Apple’s heightened commitment to transformer language models, signaling substantial enhancements for Siri and other language-related features. This model positions Apple as a frontrunner in multimodal AI capabilities, hinting at progress in AR/VR, camera technologies, and autonomous systems throughout Apple’s product range.

Technical Prowess and Performance

Trained on 8 Nvidia A100 GPUs, Ferret understands small image regions with minimal errors, showcasing Apple’s prowess in generative AI and multimodal capabilities. Trained on the GRIT dataset, it outshines its counterparts in referring and grounding tasks.
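Grounding results of this kind are commonly scored by comparing a predicted bounding box with a reference box using intersection over union (IoU). The sketch below shows that general metric, not Ferret’s own evaluation code. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_grounding(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when IoU reaches the (assumed) threshold."""
    return box_iou(pred, truth) >= threshold


print(is_correct_grounding((0, 0, 10, 10), (0, 0, 10, 8)))
```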

Ferret's Impact on Apple Devices

Ferret’s integration into Apple products promises revolutionary user experiences, including improved interactions with Siri, advanced visual searches, and enriched media understanding. Developers can leverage its capabilities for innovative applications across diverse domains.

Beyond Textual Comprehension

Ferret’s unique approach transcends textual comprehension, offering contextual responses by analyzing specific image regions. This sets a new standard in AI capabilities, providing deeper insights into visual content.

Ferret's Technical Aspects

Primarily a vision model, Ferret combines image and text understanding. Using CLIP ViT-L/14 as its image encoder, it comprehends image content, identifies specific areas accurately, and understands complex shapes and details.

Benchmarking Against GPT4RoI

Benchmarked against GPT4RoI, another region-aware multimodal model, Ferret performed better in various aspects, showcasing advanced multimodal understanding and interaction capabilities.

Apple's Commitment to AI

Apple’s strategic acquisitions in the AI realm reflect its commitment to machine learning. These acquisitions enhance Apple’s AI capabilities, driving product and service innovation.

Apple’s introduction of Ferret marks a breakthrough in machine learning. Surpassing GPT 4, Ferret’s advanced image identification has implications across industries. As Apple continues to unveil its AI efforts, we eagerly anticipate innovations that will redefine our interactions with technology, making them seamless and intuitive. Apple’s commitment to pushing the boundaries of what’s possible in technology remains evident, setting the stage for the next wave of groundbreaking developments.

The model uses a hybrid region representation and a spatial-aware visual sampler to enable fine-grained, open-vocabulary referring and grounding in a Multimodal Large Language Model. The hybrid representation pairs discrete coordinates with continuous visual features, so a region can be a point, a box, or a free-form shape. The researchers also created GRIT, a large-scale, hierarchical, and robust ground-and-refer instruction tuning dataset containing approximately 1.1 million samples. Ferret aims to refer and ground anything anywhere at any granularity: it accepts any form of region reference as input and can ground anything in its response.
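The idea of a hybrid region representation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Ferret’s implementation: the `HybridRegion` class, the `make_hybrid_region` helper, and the 1000-bin coordinate grid are hypothetical, and the sampled pixel positions merely stand in for the continuous features that the real model extracts with its spatial-aware visual sampler.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumed size of the discrete coordinate grid


@dataclass
class HybridRegion:
    box: Tuple[int, int, int, int]  # discrete part: binned (x1, y1, x2, y2)
    points: List[Tuple[int, int]]   # stand-in for the continuous part: pixels inside the shape


def make_hybrid_region(mask: List[List[bool]], max_points: int = 4) -> HybridRegion:
    """Build a region from a binary mask given as rows of booleans."""
    height, width = len(mask), len(mask[0])
    inside = [(x, y) for y in range(height) for x in range(width) if mask[y][x]]
    if not inside:
        raise ValueError("mask selects no pixels")

    def to_bin(value: int, size: int) -> int:
        # Map a pixel coordinate to a bin index that does not depend on image size.
        return min(NUM_BINS - 1, value * NUM_BINS // size)

    xs = [x for x, _ in inside]
    ys = [y for _, y in inside]
    box = (
        to_bin(min(xs), width),
        to_bin(min(ys), height),
        to_bin(max(xs) + 1, width),
        to_bin(max(ys) + 1, height),
    )
    # Keep an evenly spaced subset of the pixels that fall inside the shape.
    step = max(1, len(inside) // max_points)
    return HybridRegion(box=box, points=inside[::step][:max_points])


mask = [[1 <= x <= 2 and 1 <= y <= 2 for x in range(4)] for y in range(4)]
print(make_hybrid_region(mask))
```

The box alone cannot distinguish a free-form shape from the rectangle around it, which is why the representation also keeps information sampled from inside the shape.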

While Ferret is a potent tool, it has specific limitations. As a relatively recent model, it may lack the robustness of more established counterparts. Like many other Multimodal Large Language Models (MLLMs), Ferret can generate responses that could be deemed harmful.

Additionally, it is crucial to highlight that Ferret is distributed under a non-commercial license. Apple envisions augmenting Ferret’s capabilities in the future to include the ability to generate segmentation masks alongside bounding boxes.

1. How can developers leverage Ferret for innovative applications?

ANS: – Developers can experiment with Ferret by utilizing its open-source code and weights to build applications that benefit from its advanced generative AI capabilities. Whether enhancing image-based interactions, implementing visual search functionalities, or creating innovative solutions across various domains, Ferret offers a versatile platform for developers to explore and integrate its capabilities.

2. What impact does Ferret have on Apple devices and user experiences?

ANS: – Ferret’s integration into Apple products holds the potential to revolutionize user experiences. The anticipated impacts are improved image-based interactions with Siri, advanced visual search functionalities, and enriched media understanding. Understanding how Ferret enhances Apple devices provides insights into the evolving landscape of AI integration in consumer technology.

