Streamline AI Workloads with Meta Llama 3.3 70B on Amazon SageMaker

Introduction

Meta Llama 3.3 70B is now available on Amazon SageMaker JumpStart. This release marks a notable breakthrough in large language model (LLM) efficiency: it delivers performance comparable to the much larger Llama 3.1 405B while requiring significantly fewer computational resources. Designed for cost-effective inference, Llama 3.3 70B offers up to five times more cost-efficient inference operations than its larger counterparts, making it an ideal choice for production deployments.

In this post, we explore how to deploy the Llama 3.3 70B model efficiently on Amazon SageMaker, leveraging advanced features to optimize performance and manage costs. With its enhanced attention mechanism and refined training process, including Reinforcement Learning from Human Feedback (RLHF), the model handles a wide range of tasks efficiently and accurately.

The following figure summarizes the benchmark results (source).

Getting started with Amazon SageMaker JumpStart

Amazon SageMaker JumpStart is a machine learning (ML) hub that helps you get started with ML quickly. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. The models are fully customizable for your use case with your data, and you can deploy them into production through either the UI or the SDK.
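As a quick illustration, you can also browse the JumpStart catalog programmatically before choosing a model. The sketch below uses the SageMaker Python SDK's notebook utilities; the filter value is an assumption and may need adjusting for your SDK version:

```python
# A minimal sketch: list Meta models available in SageMaker JumpStart.
# Requires the SageMaker Python SDK (pip install sagemaker).
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# The filter string "framework == meta" is an assumption; adjust as needed.
for model_id in list_jumpstart_models(filter="framework == meta"):
    print(model_id)  # e.g. a Llama text-generation model ID
```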

There are two straightforward ways to deploy Llama 3.3 70B with Amazon SageMaker JumpStart: programmatically, using the Amazon SageMaker Python SDK, or through the user-friendly Amazon SageMaker JumpStart UI. Let's examine both approaches so you can select the one that best fits your goals.

Steps to Deploy Llama 3.3 70B through the Amazon SageMaker JumpStart UI

You can access the SageMaker JumpStart UI through Amazon SageMaker Studio or Amazon SageMaker Unified Studio. Follow these steps to deploy Llama 3.3 70B:

  1. In Amazon SageMaker Unified Studio, select JumpStart models from the Build menu.
  2. Search for Meta Llama 3.3 70B.
  3. Choose the Meta Llama 3.3 70B model.
  4. Choose Deploy.
  5. Accept the end-user license agreement (EULA).
  6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
  7. Choose Deploy.

Wait for the endpoint's status to change to InService. You can then use the model to perform inference.

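Once the endpoint is InService, you can invoke it from your own code. Below is a minimal sketch using boto3; the endpoint name is hypothetical, and the payload schema assumes the typical JumpStart Llama text-generation interface:

```python
# A minimal inference sketch using boto3 (assumes the endpoint is InService).
# The endpoint name below is hypothetical; JumpStart Llama endpoints
# typically accept an "inputs" string plus generation "parameters".
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Summarize the benefits of efficient LLM inference.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9},
}

response = runtime.invoke_endpoint(
    EndpointName="meta-llama-3-3-70b-endpoint",  # hypothetical name
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))
```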

Steps to Deploy Llama 3.3 70B using the Amazon SageMaker Python SDK

For teams that want to automate deployment or integrate with existing MLOps pipelines, the model can be deployed with the Amazon SageMaker Python SDK, as shown below.
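Here is a minimal sketch using the SDK's JumpStartModel class; the model ID and instance type are assumptions and should be verified against the JumpStart catalog:

```python
# A minimal deployment sketch with the SageMaker Python SDK.
# The model ID is an assumption; verify it against the JumpStart catalog.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-3-3-70b-instruct")

# Deploying Llama models requires accepting Meta's end-user license agreement.
predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.g5.48xlarge",  # or ml.p4d.24xlarge, as in the UI flow
)

# Quick smoke test against the new endpoint.
response = predictor.predict({
    "inputs": "What is speculative decoding?",
    "parameters": {"max_new_tokens": 128},
})
print(response)

# Clean up when finished to stop incurring charges:
# predictor.delete_endpoint()
```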

Optimize deployment with Amazon SageMaker AI

Amazon SageMaker provides several powerful features to optimize the deployment and performance of models like Llama 3.3 70B, ensuring cost-effectiveness and efficiency in production environments:

  1. Speculative Decoding: By default, Amazon SageMaker JumpStart deploys Llama 3.3 70B with speculative decoding enabled to increase throughput. This technique accelerates generative AI inference by having a smaller draft model propose tokens that the larger model verifies in parallel, reducing wait times and improving performance. Learn more about how speculative decoding improves throughput on Amazon SageMaker.
  2. Fast Model Loader: This feature uses a novel weight-streaming approach that drastically reduces model initialization time. By streaming weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader significantly shortens startup and scaling times, bypassing the traditional step of loading the entire model into memory first.
  3. Container Caching: Amazon SageMaker's container caching optimizes how model containers are handled during scaling. Pre-caching container images removes the need for time-consuming downloads when scaling out, reducing latency and improving system responsiveness, which is particularly valuable for large models like Llama 3.3 70B.
  4. Scale to Zero: This feature automatically adjusts compute capacity based on actual usage. During periods of inactivity, endpoints can scale down completely and then scale back up quickly when demand returns, optimizing costs for fluctuating workloads or for running multiple models simultaneously (see the sketch after this list).
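As a rough sketch of how Scale to Zero can be configured, the snippet below registers an Application Auto Scaling target that lets an inference component's copy count drop to zero when idle. It assumes the endpoint was created with inference components, and the component name is hypothetical:

```python
# A rough sketch of Scale to Zero via Application Auto Scaling.
# Assumes the endpoint uses inference components; the component name
# "llama-3-3-70b-component" is hypothetical.
import boto3

autoscaling = boto3.client("application-autoscaling")

# Allow the component's copy count to drop to zero when idle.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama-3-3-70b-component",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=2,
)
```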

By leveraging these Amazon SageMaker AI features, businesses can efficiently deploy and manage Llama 3.3 70B, maximizing performance and cost-effectiveness while running large language models at scale with minimal overhead.

Conclusion

Combining Llama 3.3 70B with Amazon SageMaker AI's sophisticated inference capabilities is a strong option for production deployments. By leveraging features like Fast Model Loader, Container Caching, and Scale to Zero, businesses can achieve excellent performance and cost-effectiveness for their LLM deployments. The optimization tools within Amazon SageMaker AI significantly improve model initialization, scaling, and resource management, allowing organizations to deploy large language models like Llama 3.3 70B at scale with minimal overhead.

Additionally, the efficiency gains of Llama 3.3 70B, which offers performance comparable to the much larger Llama 3.1 405B, mean businesses can achieve high-quality inference at a fraction of the cost, making it an ideal solution for cost-sensitive production environments.

With its powerful architecture, refined training methodology, and seamless integration with Amazon SageMaker, Llama 3.3 70B provides organizations with a scalable and affordable option to meet their generative AI needs.

Drop a query if you have any questions regarding Amazon SageMaker AI, and we will get back to you quickly.

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft's Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. What is Llama 3.3 70B, and how does it differ from larger models?

ANS: – Llama 3.3 70B is a more efficient version of the Meta Llama model, providing performance similar to the larger Llama 3.1 405B model but with significantly lower computational requirements. It is designed to offer cost-effective inference operations, making it ideal for production deployments.

2. How does Amazon SageMaker optimize LLaMA 3.3 70B deployment?

ANS: – Amazon SageMaker features like Fast Model Loader, Container Caching, and Scale to Zero streamline model initialization, scaling, and resource management, optimizing deployment for cost and performance.

WRITTEN BY Aayushi Khandelwal

Aayushi, a dedicated Research Associate pursuing a Bachelor's degree in Computer Science, is passionate about technology and cloud computing. Her fascination with cloud technology led her to a career in AWS Consulting, where she finds satisfaction in helping clients overcome challenges and optimize their cloud infrastructure. Committed to continuous learning, Aayushi stays updated with evolving AWS technologies, aiming to impact the field significantly and contribute to the success of businesses leveraging AWS services.
