Revolutionize Customer Satisfaction with tailored reward models for your business on Amazon SageMaker

As more powerful large language models (LLMs) are used to perform a variety of tasks with greater accuracy, the number of applications and services that are being built with generative artificial intelligence (AI) is also growing. With great power comes responsibility, and organizations want to make sure that these LLMs produce responses that align with their organizational values and provide the same unique experience they always intended for their end-customers.

Evaluating AI-generated responses presents challenges. This post discusses techniques to align them with company values and build a custom reward model using Amazon SageMaker. By doing so, you can provide customized customer experiences that uniquely reflect your organization’s brand identity and ethos.

Challenges with out-of-the-box LLMs

Out-of-the-box LLMs provide high accuracy, but often lack customization for an organization’s specific needs and end-users. Human feedback varies in subjectivity across organizations and customer segments. Collecting diverse, subjective human feedback to refine LLMs is time-consuming and unscalable.

This post showcases a reward modeling technique to efficiently customize LLMs for an organization by programmatically defining reward functions that capture preferences for model behavior. We demonstrate an approach to deliver LLM results tailored to an organization without intensive, continual human judgment. The techniques aim to overcome customization and scalability challenges by encoding an organization’s subjective quality standards into a reward model that guides the LLM to generate preferable outputs.

Objective vs. subjective human feedback

Not all human feedback is the same. We can categorize human feedback into two types: objective and subjective.

Any human being who is asked to judge the color of the following boxes would confirm that the left one is a white box and the right one is a black box. This judgment is objective and doesn’t vary from person to person.

Determining whether an AI model’s output is “great” is inherently subjective. Consider the following color spectrum. If asked to describe the colors on the ends, people would provide varied, subjective responses based on their perceptions. One person’s white may be another’s gray.

This subjectivity poses a challenge for improving AI through human feedback. Unlike objective right/wrong feedback, subjective preferences are nuanced and personalized. The same output could elicit praise from one person and criticism from another. The key is acknowledging and accounting for the fundamental subjectivity of human preferences in AI training. Rather than seeking elusive objective truths, we must provide models exposure to the colorful diversity of human subjective judgment.

Unlike traditional model tasks such as classification, which can be neatly benchmarked on test datasets, assessing the quality of a sprawling conversational agent is highly subjective. One human’s riveting prose is another’s aimless drivel. So how should we refine these expansive language models when humans intrinsically disagree on the hallmarks of a “good” response?

The key is gathering feedback from a diverse crowd. With enough subjective viewpoints, patterns emerge on engaging discourse, logical coherence, and harmless content. Models can then be tuned based on broader human preferences. There is a general perception that reward models are often associated only with Reinforcement Learning from Human Feedback (RLHF). Reward modeling, in fact, goes beyond RLHF, and can be a powerful tool for aligning AI-generated responses with an organization’s specific values and brand identity.

Reward modeling

You can choose an LLM and have it generate numerous responses to diverse prompts, then ask human labelers to rank those responses. Diversity among labelers is important, and clear labeling guidelines are critical: without explicit criteria, judgments become arbitrary. Useful dimensions include coherence, relevance, creativity, factual correctness, logical consistency, and more. In the following example, each human labeler ranks the candidate responses from most favorite (labeled 1) to least favorite (labeled 3). Each column holds one labeler’s rankings, illustrating how differently humans can perceive the same set of possible responses from the LLM.

By compiling these subjective ratings, patterns emerge on what resonates across readers. The aggregated human feedback essentially trains a separate reward model on writing qualities that appeal to people. This technique of distilling crowd perspectives into an AI reward function is called reward modeling. It provides a method to improve LLM output quality based on diverse subjective viewpoints.
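To make this concrete, the following minimal sketch shows one way to distill per-labeler rankings into the (chosen, rejected) pairs a reward model trains on. The response names and the majority-vote rule are illustrative assumptions, not code from the accompanying notebook:

from itertools import combinations
from collections import Counter

# Hypothetical rankings: for one prompt, each labeler orders three
# candidate responses from most preferred (first) to least preferred
rankings = [
    ["response_a", "response_b", "response_c"],  # labeler 1
    ["response_a", "response_c", "response_b"],  # labeler 2
    ["response_b", "response_a", "response_c"],  # labeler 3
]

# Count how often each response beats another across all labelers
wins = Counter()
for ranking in rankings:
    for better, worse in combinations(ranking, 2):
        wins[(better, worse)] += 1

# Keep a (chosen, rejected) pair when a majority of labelers agree
pairs = [(b, w) for (b, w), n in wins.items() if n > len(rankings) / 2]
print(pairs)  # [('response_a', 'response_b'), ('response_a', 'response_c'), ('response_b', 'response_c')]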

Solution overview

In this post, we detail how to train a reward model based on organization-specific human labeling feedback collected for various prompts tested on the base FM. The following diagram illustrates the solution architecture.

For more details, see the accompanying notebook.

Prerequisites

To successfully train a reward model, you need the following:

A large dataset with prompts and ranked responses from human labelers that reflects your organizational and end-user needs. For this post, we store the dataset in an Amazon Simple Storage Service (Amazon S3) bucket.
A small language model with a numerical (scalar) output head, such as OPT-2.7B or Falcon-7B (a decoder-only model of approximately 6 GB is sufficient).
A mechanism to run distributed training. For this post, we use SageMaker.
An AWS Identity and Access Management (IAM) role associated with the Amazon SageMaker Studio user profile that has access to the S3 bucket holding the curated dataset. The standard SageMaker IAM role will suffice for this post. Refer to Amazon SageMaker Identity-Based Policy Examples for guidance on best practices and examples of identity-based policies for SageMaker.
A SageMaker domain. You can quickly spin up a SageMaker domain and set up a single user for launching the SageMaker Studio notebook environment you’ll need to complete the model training. For instructions on setting up your environment, see Quick onboard to Amazon SageMaker domain.

Launch SageMaker Studio

Complete the following steps to launch SageMaker Studio:

On the SageMaker console, choose Studio in the navigation pane.
On the Studio landing page, select the domain and user profile for launching Studio.
Choose Open Studio.
To launch SageMaker Studio, choose Launch personal Studio.

Let’s see how to create a reward model locally in a SageMaker Studio notebook environment by using a pre-existing model from the Hugging Face model hub.

Prepare a human-labeled dataset and train a reward model

When doing reward modeling, getting feedback data from humans can be expensive, because reward modeling requires dedicated feedback from human workers rather than data collected during regular system use. How well your reward model performs depends on the quality and quantity of that human feedback.

We recommend using AWS-managed offerings such as Amazon SageMaker Ground Truth. It offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the machine learning (ML) lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, either through a self-service or AWS-managed offering.

For this post, we use the IMDB dataset to train a reward model that provides a higher score for text that humans have labeled as positive, and a lower score for negative text.
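The preparation function in the following code references a tokenizer and an args object holding hyperparameters. A minimal setup might look like the following sketch; the seq_length value is an assumption for illustration:

from datasets import load_dataset
from transformers import AutoTokenizer
from types import SimpleNamespace

# Load the IMDB training split (positive and negative movie reviews)
raw_dataset = load_dataset("imdb", split="train")

# Tokenizer matching the base model loaded later in this post
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Simple container for the hyperparameters the preparation code reads
args = SimpleNamespace(seq_length=512)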

We prepare the dataset with the following code:

def create_custom_dataset(raw_dataset):
    df = raw_dataset.to_pandas()
    negative_df = df[df['label']==0]
    positive_df = df[df['label']==1]
    negative_df = negative_df.drop(
        columns=['label']).rename(
        columns={'text': 'rejected'})
    # shuffle the data
    positive_df = positive_df.sample(
        frac=1, random_state=0).reset_index(
        drop=True).drop(columns=['label']).rename(
        columns={'text': 'chosen'})
    joined_df = negative_df.join(positive_df)

    def tokenize_fn(texts, max_length=args.seq_length):
        encoded = tokenizer(
            texts,
            padding='max_length',
            max_length=max_length,
            truncation=True,
            add_special_tokens=False,
        )
        return encoded

    rejected_encoded = tokenize_fn(joined_df.rejected.values.tolist())
    joined_df['rejected_input_ids'] = rejected_encoded['input_ids']
    joined_df['rejected_attention_mask'] = rejected_encoded['attention_mask']
    encoded_chosen = tokenize_fn(joined_df.chosen.values.tolist())
    joined_df['chosen_input_ids'] = encoded_chosen['input_ids']
    joined_df['chosen_attention_mask'] = encoded_chosen['attention_mask']

    train_dataset = Dataset.from_pandas(joined_df, preserve_index=False)

    return train_dataset.with_format("torch")
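With that setup in place, building the training dataset is a single call (a usage sketch):

# Build the pairwise (chosen, rejected) training dataset
train_dataset = create_custom_dataset(raw_dataset)
print(train_dataset[0].keys())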

The following example shows a sample record from the prepared dataset, which includes references to rejected and chosen responses. We have also embedded the input ID and attention mask for the chosen and rejected responses.

{'rejected': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />",
'chosen': "This is a great movie. I love it more each time i watch. Most comedies can get pretty lame because you know all the gags, but mystery men has so much integrity in the writing and characterization that watching once again — as Ben Stiller tears at the hood ornament of the limo, or Hank Azaria says good-bye to Louise Lasser, or Geoffrey Rush flashes his fuhrer choreography, or Tom Waits mumbles while he watches the news report, or Janeane Garofalo refuses a kiss from Paul Reubens — is a pleasure. This is pitch perfect ensemble acting. The story develops directly and consistently, the action sequences are creative and not too dominant, all the set-ups payoff by the end. Seriously, if you've seen it and it's been a while, watch it again, and if you haven't then get started. You can't watch it again until you've seen it the first time. (Wes Studi, William H. Macy, the tryouts scene. Too much good stuff!)",
'rejected_input_ids': tensor([1106, 129, 7, ..., 1, 1, 1]),
'rejected_attention_mask': tensor([1, 1, 1, ..., 0, 0, 0]),
'chosen_input_ids': tensor([713, 16, 10, ..., 1, 1, 1]),
'chosen_attention_mask': tensor([1, 1, 1, ..., 0, 0, 0])}

Load the pre-trained model

In this case, we use the OPT-1.3b (Open Pre-trained Transformer Language Model) model from Hugging Face, available in Amazon SageMaker JumpStart. If you want to do all of the training locally on your notebook instead of distributed training, you need to use an instance with enough accelerator memory. We run the following training on a notebook running on an ml.g4dn.xlarge instance:

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    set_seed,
)
from datasets import Dataset, load_dataset
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    'facebook/opt-1.3b',
    torch_dtype=torch.bfloat16,
    device_map="auto",
    num_labels=1,
)

Define the custom trainer function

In the following code snippet, we create a custom trainer that calculates how well a model is performing on a task:

from torch import nn
from transformers import Trainer
import torch.nn.functional as F

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        chosen_input_ids = inputs['chosen_input_ids']
        chosen_attention_mask = inputs['chosen_attention_mask']
        rejected_input_ids = inputs['rejected_input_ids']
        rejected_attention_mask = inputs['rejected_attention_mask']
        r_w = model(chosen_input_ids, chosen_attention_mask).logits
        r_l = model(rejected_input_ids, rejected_attention_mask).logits
        outputs = (r_w, r_l)
        # Pairwise ranking loss: push the chosen score above the rejected score
        loss = -F.logsigmoid(r_w - r_l).mean()
        return (loss, outputs) if return_outputs else loss

This trainer compares the model’s scores for two sets of inputs: the chosen responses and the rejected ones. It uses the gap between those scores to measure how well the model distinguishes preferred from dispreferred data, and adjusts the model to improve that distinction. The CustomTrainer class extends the standard Trainer class provided by the transformers library, overriding the loss computation to handle the paired chosen and rejected input sequences this task requires. See the following code:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="reward_model",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=False,
    do_predict=False,
    evaluation_strategy="no",
    learning_rate=5e-5,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    remove_unused_columns=False,
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()

The TrainingArguments in the provided code snippet are used to configure various aspects of the training process for an ML model. Let’s break down the purpose of each parameter, and how they can influence the training outcome:

output_dir – Specifies the directory where the trained model and associated files will be saved. This parameter helps organize and store the trained model for future use.
overwrite_output_dir – Determines whether to overwrite the output directory if it already exists. Setting this to True allows for reusing the same directory without manual deletion.
do_train – Indicates whether to perform training. If set to True, the model will be trained using the provided training dataset.
do_eval and do_predict – Control whether to perform evaluation and prediction tasks, respectively. In this case, both are set to False, meaning only training will be conducted.
evaluation_strategy – Defines when evaluation should be performed during training. Setting it to “no” means evaluation will not be done during training.
learning_rate – Specifies the learning rate for the optimizer, influencing how quickly or slowly the model learns from the data.
num_train_epochs – Sets the number of times the model will go through the entire training dataset during training. One epoch means one complete pass through all training samples.
per_device_train_batch_size – Determines how many samples are processed in each batch during training on each device (for example, GPU). A smaller batch size can lead to slower but more stable training.
gradient_accumulation_steps – Controls how many batches of gradients are accumulated before the model’s parameters are updated. This lets you simulate a larger effective batch size (here, 2 × 32 = 64 samples per update) on memory-constrained hardware.
remove_unused_columns – Specifies whether unused columns in the dataset should be removed before processing, optimizing memory usage.

By configuring these parameters in the TrainingArguments, you can influence various aspects of the training process, such as model performance, convergence speed, memory usage, and overall training outcome based on your specific requirements and constraints.

When you run this code, it trains the reward model based on the numerical representation of subjective feedback you gathered from the human labelers. A trained reward model will give a higher score to LLM responses that humans are more likely to prefer.

Use the reward model to evaluate the base LLM

You can now feed the response from your LLM to this reward model, and the numerical score produced as output tells you how well the LLM’s response aligns with the subjective organizational preferences embedded in the reward model. The following diagram illustrates this process. You can use this number as a threshold for deciding whether or not the response from the LLM can be shared with the end-user.

For example, let’s say we created a reward model to avoid toxic, harmful, or inappropriate content. If a chatbot powered by an LLM produces a response, the reward model can then score the chatbot’s responses. Responses with scores above a predetermined threshold are deemed acceptable to share with users; scores below the threshold mean the content should be blocked. This lets us automatically filter chatbot content that doesn’t meet standards we want to enforce. To explore more, see the accompanying notebook.
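As an illustration, the following sketch scores a single response with the trained reward model and applies a threshold. The score_response helper and the threshold value are hypothetical; in practice, you would calibrate the threshold on held-out labeled examples:

import torch

def score_response(text, model, tokenizer, max_length=512):
    # Return the reward model's scalar score for one response
    inputs = tokenizer(
        text,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        return model(**inputs).logits.item()

THRESHOLD = 0.0  # hypothetical cutoff calibrated on validation data

candidate = "Thanks for reaching out! Here is how to reset your password."
if score_response(candidate, model, tokenizer) >= THRESHOLD:
    print("Response meets the standard; share it with the end-user.")
else:
    print("Response falls below the standard; block or regenerate it.")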

Clean up

To avoid incurring future charges, delete all the resources that you created. Delete the deployed SageMaker models, if any, and stop the SageMaker Studio notebook you launched for this exercise.

Conclusion

In this post, we showed how to train a reward model that predicts a human preference score from the LLM’s response. This is done by generating several outputs for each prompt with the LLM, then asking human annotators to rank or score the responses to each prompt. The reward model is then trained to predict the human preference score from the LLM’s response. After the reward model is trained, you can use the reward model to evaluate the LLM’s responses against your subjective organizational standards.

As an organization evolves, its reward functions must evolve alongside changing organizational values and user expectations. What defines a “great” AI output is subjective and continually shifting. Organizations need flexible ML pipelines that continually retrain reward models with updated rewards reflecting the latest priorities and needs. This space is continuously evolving: direct preference-based policy optimization, tool-augmented reward modeling, and example-based control are popular alternative techniques for aligning AI systems with human values and goals.

We invite you to take the next step in customizing your AI solutions by engaging with the diverse and subjective perspectives of human feedback. Embrace the power of reward modeling to ensure your AI systems resonate with your brand identity and deliver the exceptional experiences your customers deserve. Start refining your AI models today with Amazon SageMaker and join the vanguard of businesses setting new standards in personalized customer interactions. If you have any questions or feedback, please leave them in the comments section.

About the Author

Dinesh Kumar Subramani is a Senior Solutions Architect based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning, and is a member of the technical field community within Amazon. Dinesh works closely with UK Central Government customers to solve their problems using AWS services. Outside of work, Dinesh enjoys spending quality time with his family, playing chess, and exploring a diverse range of music.

Amazon Personalize launches new recipes supporting larger item catalogs with lower latency

Personalized customer experiences are essential for engaging today’s users. However, delivering truly personalized experiences that adapt to changes in user behavior can be both challenging and time-consuming. Amazon Personalize makes it straightforward to personalize your website, app, emails, and more, using the same machine learning (ML) technology used by Amazon, without requiring ML expertise. With the recipes—algorithms for specific use cases—provided by Amazon Personalize, you can deliver a wide array of personalization, including product or content recommendations and personalized ranking.

Today, we are excited to announce the general availability of two advanced recipes in Amazon Personalize, User-Personalization-v2 and Personalized-Ranking-v2 (v2 recipes), which are built on the cutting-edge Transformers architecture to support larger item catalogs with lower latency.

In this post, we summarize the new enhancements, and guide you through the process of training a model and providing recommendations for your users.

Benefits of new recipes

The new recipes offer enhancements in scalability, latency, model performance, and functionality.

Enhanced scalability – The new recipes now support training with up to 5 million item catalogs and 3 billion interactions, empowering personalization for large catalogs and platforms with billions of usage events.
Lower latency – The new recipes offer lower inference latency and faster training times on large datasets, which can reduce the delay experienced by your end-users.
Performance optimization – Amazon Personalize testing showed that v2 recipes improved recommendation accuracy by up to 9% and recommendation coverage by up to 1.8x compared to previous versions. A higher coverage means Amazon Personalize recommends more of your catalog.
Return item metadata in inference responses – The new recipes enable item metadata by default without extra charge, allowing you to return metadata such as genres, descriptions, and availability in inference responses. This can help you enrich recommendations in your user interfaces without extra work. If you use Amazon Personalize with generative AI, you can also feed the metadata into prompts. Providing more context to large language models can help them gain a deeper understanding of product attributes to generate more relevant content.
Highly automated operations – Our new recipes are designed to reduce your overhead for training and tuning the model. For example, Amazon Personalize simplifies training configuration and automatically selects the optimal settings for your custom models behind the scenes.

Solution overview

To use the User-Personalization-v2 and Personalized-Ranking-v2 recipes, you first need to set up Amazon Personalize resources. Create your dataset group, import your data, train a solution version, and deploy a campaign. For full instructions, see Getting started.

For this post, we follow the Amazon Personalize console approach to deploy a campaign. Alternatively, you can build the entire solution using the SDK approach. You can also get batch recommendations with an asynchronous batch flow. We use the MovieLens public dataset and User-Personalization-v2 recipe to show you the workflow.

Prepare the dataset

Complete the following steps to prepare your dataset:

Create a dataset group. Each dataset group can contain up to three datasets: users, items, and interactions, with the interactions dataset being mandatory for User-Personalization-v2 and Personalized-Ranking-v2.
Create an interactions dataset using a schema.
Import the interactions data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3).
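As a brief illustration of steps 2 and 3 using the AWS SDK for Python (Boto3), the following sketch creates an interactions schema; the schema name is a placeholder, and the field names follow the standard Amazon Personalize interactions schema:

import boto3
import json

personalize = boto3.client("personalize")

# Minimal interactions schema: user, item, and event timestamp
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

schema_response = personalize.create_schema(
    name="movielens-interactions-schema",  # placeholder name
    schema=json.dumps(interactions_schema),
)
print(schema_response["schemaArn"])

You would then call create_dataset with this schema ARN and create_dataset_import_job pointing at your S3 data, as described in Getting started.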

Train a model

After the dataset import job is complete, you can analyze data before training. Amazon Personalize Data analysis shows you statistics about your data as well as actions you can take to meet training requirements and improve recommendations.

Now you’re ready to train your model.

On the Amazon Personalize console, choose Dataset groups in the navigation pane.
Choose your dataset group.
Choose Create solutions.
For Solution name, enter your solution name.
For Solution type, select Item recommendation.
For Recipe, choose the new aws-user-personalization-v2 recipe.

In the Training configuration section, for Automatic training, select Turn on to maintain the effectiveness of your model by retraining it on a regular cadence.

Under Hyperparameter configuration, select Apply recency bias. Recency bias determines whether the model should give more weight to the most recent item interactions data in your interactions dataset.

Choose Create solution.

If you turned on automatic training, Amazon Personalize will automatically create your first solution version. A solution version refers to a trained ML model. When a solution version is created for the solution, Amazon Personalize trains the model backing the solution version based on the recipe and training configuration. It can take up to 1 hour for the solution version creation to start.

Under Custom resources in the navigation pane, choose Campaigns.
Choose Create campaign.

A campaign deploys a solution version (trained model) to generate real-time recommendations. Campaigns created with solutions trained on v2 recipes are automatically opted-in to include item metadata in recommendation results. You can choose metadata columns during an inference call.

Provide your campaign details and create your campaign.

Get recommendations

After you create or update your campaign, you can get a recommended list of items that users are more likely to interact with, sorted from highest to lowest.

Select the campaign and choose View details.
In the Test campaign results section, enter the User ID and choose Get recommendations.

The following table shows a recommendation result for a user that includes the recommended items, relevance score, and item metadata (Title and Genre).

Your User-Personalization-v2 campaign is now ready to feed into your website or app and personalize the journey of each of your customers.
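If you prefer to retrieve recommendations programmatically rather than through the console test pane, a sketch like the following uses the Amazon Personalize Runtime API; the campaign ARN, user ID, and metadata column names are placeholders:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/my-v2-campaign",
    userId="123",
    numResults=10,
    # Campaigns on v2 recipes can return item metadata columns
    metadataColumns={"ITEMS": ["TITLE", "GENRE"]},
)

for item in response["itemList"]:
    print(item["itemId"], item.get("score"), item.get("metadata"))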

Clean up

Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete campaigns, datasets, and dataset groups via the Amazon Personalize console or using the Python SDK.

Conclusion

The new Amazon Personalize User-Personalization-v2 and Personalized-Ranking-v2 recipes take personalization to the next level with support of larger item catalogs, reduced latency, and optimized performance. For more information about Amazon Personalize, see the Amazon Personalize Developer Guide.

About the Authors

Jingwen Hu is a Senior Technical Product Manager working with AWS AI/ML on the Amazon Personalize team. In her spare time, she enjoys traveling and exploring local food.

Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.

Pranesh Anubhav is a Senior Software Engineer for Amazon Personalize. He is passionate about designing machine learning systems to serve customers at scale. Outside of his work, he loves playing soccer and is an avid follower of Real Madrid.

Tianmin Liu is a senior software engineer working for Amazon personalize. He focuses on developing recommender systems at scale using various machine learning algorithms. In his spare time, he likes playing video games, watching sports, and playing the piano.

Abhishek Mangal is a software engineer working for Amazon Personalize. He works on developing recommender systems at scale using various machine learning algorithms. In his spare time, he likes to watch anime and believes One Piece is the greatest piece of storytelling in recent history.

Yifei Ma is a Senior Applied Scientist at AWS AI Labs working on recommender systems. His research interests lie in active learning, generative models, time series analysis, and online decision-making. Outside of work, he is an aviation enthusiast.

Hao Ding is a Senior Applied Scientist at AWS AI Labs and is working on advancing the recommender system for Amazon Personalize. His research interests lie in recommendation foundation models, Bayesian deep learning, large language models, and their applications in recommendation.

Rishabh Agrawal is a Senior Software Engineer working on AI services at AWS. In his spare time, he enjoys hiking, traveling and reading.

Get started with Amazon Titan Text Embeddings V2: A new state-of-the-art embeddings model on Amazon Bedrock

Embeddings are integral to various natural language processing (NLP) applications, and their quality is crucial for optimal performance. They are commonly used in knowledge bases to represent textual data as dense vectors, enabling efficient similarity search and retrieval. In Retrieval Augmented Generation (RAG), embeddings are used to retrieve relevant passages from a corpus to provide context for language models to generate informed, knowledge-grounded responses. Embeddings also play a key role in personalization and recommendation systems by representing user preferences, item characteristics, and historical interactions as vectors, allowing calculation of similarities for personalized recommendations based on user behavior and item embeddings. As new embedding models are released with incremental quality improvements, organizations must weigh the potential benefits against the associated costs of upgrading, considering factors like computational resources, data reprocessing, integration efforts, and projected performance gains impacting business metrics.

In September of 2023, we announced the launch of Amazon Titan Text Embeddings V1, a multilingual text embeddings model that converts text inputs like single words, phrases, or large documents into high-dimensional numerical vector representations. Since then, many of our customers have used the V1 model, which supported over 25 languages, accepted inputs of up to 8,192 tokens, and output vectors of 1,536 dimensions for high accuracy and low latency. The model was made available as a serverless offering via Amazon Bedrock, simplifying embedding generation and integration with downstream applications. We published a follow-up post on January 31, 2024, and provided code examples using AWS SDKs and LangChain, showcasing a Streamlit semantic search app.

Today, we are happy to announce Amazon Titan Text Embeddings V2, our second-generation embeddings model for Amazon Bedrock. The new model is optimized for the most common use cases we see with many of our active customers, including RAG, multi-language, and code embedding use cases. The following table summarizes the key differences compared to V1.

Feature                           | Amazon Titan Text Embeddings V1 | Amazon Titan Text Embeddings V2
Output dimension support          | 1,536                           | 256, 512, 1,024
Language support                  | 25+                             | 100+
Unit vector normalization support | No                              | Yes
Price per million tokens          | $0.10                           | $0.02 ($0.00002 per 1,000 tokens)

With these new features, we expect many more customers to choose Amazon Titan Text Embeddings V2 to build common generative artificial intelligence (AI) applications. In this post, we discuss the benefits of the V2 model, how to conduct your own evaluation of the model, and how to migrate to using the new model.

Let’s dig in!

Benefits of Amazon Titan Text Embeddings V2

Amazon Titan Text Embeddings V2 is the second-generation embeddings model for Amazon Bedrock, optimized for the most common use cases we have seen with our customers. Key features include:

Optimized for RAG solutions
Flexible embedding sizes
Improved multilingual support and code

Embeddings have become an integral part of various NLP applications, and their quality is crucial for achieving optimal performance.

The large language model (LLM) landscape is rapidly evolving, with leading providers offering increasingly powerful and versatile embedding models. Although incremental improvements in embedding quality may seem modest at the high level, the actual benefits can be significant for specific use cases. For example, in a recommendation system for a large ecommerce platform, a modest increase in recommendation accuracy could translate into significant additional revenue.

A common way to select an embedding model (or any model) is to look at public benchmarks; an accepted benchmark for measuring embedding quality is the MTEB leaderboard. The Massive Text Embedding Benchmark (MTEB) evaluates text embedding models across a wide range of tasks and datasets. MTEB encompasses 8 different embedding tasks, covering a total of 58 datasets and 112 languages. In this benchmark, 33 different text embedding models were evaluated on the MTEB tasks. A key finding from the benchmark was that no single text embedding method emerged as the clear leader across all tasks and datasets. Each model exhibited strengths and weaknesses depending on the specific embedding task and data characteristics. This highlights the need for continued research into developing more versatile and robust text embedding techniques that can perform well across diverse use cases and language domains.

Although this is a useful benchmark, we caution our enterprise customers with the following considerations:

Although the MTEB leaderboard is widely recognized, it provides only a partial assessment by focusing solely on accuracy metrics and overlooking crucial practical factors like inference latency and model capabilities. The leaderboard rankings combine and compare embedding models across different vector dimensions, making direct and fair model comparisons challenging.
Additionally, the leaders on this accuracy-centric leaderboard change frequently as new models are continually introduced, providing a shifting and incomplete perspective on practical model performance trade-offs that real-world applications must consider beyond just accuracy numbers.
Lastly, costs need to be weighed against the expected benefits and performance improvements in the specific use case. A small gain in accuracy may not justify the significant overhead and opportunity costs of transitioning embeddings models, especially in large-scale, business-critical applications. Enterprises should perform a rigorous cost-benefit analysis to make sure the projected performance uplift from an updated embeddings model provides sufficient return on investment (ROI) to offset the migration costs and operational disruption.

In summary, start with evaluating the benchmark scores, but don’t decide until you have done your own due diligence.

Benchmark results

The Amazon Titan Text Embeddings V2 model can output embeddings of various sizes. Using a smaller size reduces your memory footprint, which translates directly into cost savings. The default size is 1,024 dimensions, compared to V1’s 1,536, implying a direct reduction of approximately 33%; this matters because vector database storage is a major cost component of a RAG solution. In our internal testing, we found that using the 256-dimension output resulted in only about 3.24% accuracy loss while yielding a fourfold saving due to the size reduction. Running our evaluation on MTEB datasets, we found Amazon Titan Text Embeddings V2 to perform competitively, with scores such as 57.5 on reranking tasks. With the model trained on over 100 languages, it’s no surprise that it achieves a score of 55 on the MIRACL multilingual dataset and an overall weighted average MTEB score of 60.37. Full MTEB scores are available on the MTEB leaderboard.

However, we strongly encourage you to run your own benchmarks with your own dataset to understand the operational metrics. A sample notebook showing how to run the benchmarks against the MTEB datasets is hosted here. The key steps involved are:

Choose a representative set of data to embed and keywords to search.
Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
Carry out a similarity search using your preferred vector comparison method (such as Euclidean distance or cosine similarity).
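For the similarity step, cosine similarity is a common choice. The following self-contained sketch uses toy vectors standing in for real Titan V2 embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy unit vectors standing in for Titan V2 embeddings. When the model
# is invoked with normalize=True, the returned vectors are unit length,
# so a plain dot product already equals cosine similarity.
doc_vec = np.array([0.6, 0.8])
query_vec = np.array([0.8, 0.6])
print(cosine_similarity(doc_vec, query_vec))  # 0.96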

Use Amazon Titan Text Embeddings V2 on Amazon Bedrock

The new Amazon Titan Text Embeddings V2 model is available through the fully managed, serverless experience on Amazon Bedrock. You can use the model through either the Amazon Bedrock REST API or the AWS SDK. The required parameters are the text that you want to generate the embeddings of and the modelID parameter, which represents the name of the Amazon Titan Text Embeddings model. Furthermore, now you can specify the output size of the vector, which is a significant feature of the V2 model.

Throughput has been a key requirement for running large ingestion workloads, and the Amazon Titan Text Embeddings model supports batching via Bedrock Batch to increase the throughput for your workloads. The following code is an example using the AWS SDK for Python (Boto3):

import boto3
import json

# Create the connection to Bedrock
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
)

# Define prompt and model parameters
prompt_data = """Priority should be funding retirement through ROTH/IRA/401K over HSA extra. You need to fund your HSA for reasonable and expected medical expenses. """
modelId = "amazon.titan-embed-text-v2:0"
accept = "application/json"
contentType = "application/json"

sample_model_input = {
    "inputText": prompt_data,
    "dimensions": 256,
    "normalize": True
}

body = json.dumps(sample_model_input)

# Invoke model
response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)

response_body = json.loads(response.get('body').read())
embedding = response_body.get("embedding")

# Print response and embedding
print(f"The embedding vector has {len(embedding)} values\n{embedding[0:3] + ['...'] + embedding[-3:]}")

The full notebook is available on the GitHub repo.

With Amazon Titan Text Embeddings, you can input up to 8,192 tokens, allowing you to work with phrases or entire documents based on your use case. The model returns output vectors with dimensions ranging from 256–1,024 without sacrificing accuracy, while also optimizing for storage cost and low latency. Typically, larger content window models are tuned for accuracy at the expense of latency because they’re used in asynchronous workloads. However, even with its larger content window, Amazon Titan Text Embeddings achieves low latency, and with batching, it delivers higher throughput for your workloads.

Run your own benchmarking

We always encourage our customers to perform their own benchmarking using their documents or the standard MTEB datasets and evaluation. For a sample of how to use the MTEB, see the GitHub repo. This notebook shows you how to load the dataset and set up evaluation for your specific use case (task) and run the benchmarking. If you run the benchmarking with your dataset, the typical steps involved are:

Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
Run similarity searches using your preferred distance metrics based on your choice of vector database.

A sample notebook showing how to use an in-memory database is available in the GitHub repo. This is a sample setup and should not be used for your production workloads where you would be connecting to robust vector database offerings like Amazon OpenSearch Serverless.

Migrate to Amazon Titan Text Embeddings V2

The cost and performance advantages provided by the V2 model are compelling reasons to consider reindexing your existing vector embeddings using V2. Let’s explore a few examples to illustrate the potential benefits, focusing solely on embedding costs.

Use case 1: High volume of searches

This first use case pertains to customers with a high volume of searches. The details are as follows:

Scenario:

1 million documents, 100 million chunks, 1,000 average tokens per chunk
100,000 searches per day, 1,000 token size for search

One-time cost:

Number of tokens: 100,000 million
Price per million tokens: $0.02
Reindexing cost: 100,000 * $0.02 = $2,000

Ongoing monthly savings (compared to V1):

Tokens embedded per month: 30 * 100,000 * 1,000 = 3,000 million
Savings per month (when migrating from V1 to V2): 3,000 * ($0.10 - $0.02) = $240

For this use case, the one-time reindexing cost of $2,000 will likely break even within 8–9 months through the ongoing monthly savings.
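The same arithmetic generalizes. A small helper (a sketch, with the V1 and V2 prices per million tokens from the table above) makes it easy to test your own numbers:

def breakeven_months(reindex_tokens_millions, monthly_tokens_millions,
                     v1_price=0.10, v2_price=0.02):
    # Months for the monthly V2 savings to repay the one-time reindexing cost
    reindex_cost = reindex_tokens_millions * v2_price
    monthly_savings = monthly_tokens_millions * (v1_price - v2_price)
    return reindex_cost / monthly_savings

# Use case 1: 100,000M tokens reindexed, 3,000M tokens embedded per month
print(breakeven_months(100_000, 3_000))  # ~8.3 months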

Use case 2: Ongoing indexing

This use case is for customers with ongoing indexing. The details are as follows:

Scenario:

500,000 documents, 50 million chunks, average 1,000 tokens per chunk
10,000 (2%) new documents added per month
1,000 searches per day, 1,000 token size for search

One-time cost:

Number of tokens: 50,000 million
Price per million tokens: $0.02
Reindexing cost: 50,000 * $0.02 = $1,000

Ongoing monthly savings (compared to V1):

Tokens embedded per month for storage: 10,000 new documents * 100 chunks per document * 1,000 tokens per chunk = 1,000 million
Tokens embedded per month for search: 30 * 1,000 * 1,000 = 30 million
Savings per month (vs. V1): 1,030 * ($0.10 - $0.02) = $82.40

For this use case, the one-time reindexing cost of $1,000 nets an estimated monthly savings of $82.40, implying a break-even point of roughly 12 months.

These calculations do not account for the additional savings due to the reduced storage size (up to four times) with V2. This could translate into further cost savings in terms of your vector database storage requirements. The extent of these savings will vary depending on your specific data storage needs.

Conclusion

In this post, we introduced the new Amazon Titan Text Embeddings V2 model, with superior performance across various use cases like retrieval, reranking, and multilingual tasks. You can potentially realize substantial cost savings and performance improvements by reindexing your vector embeddings using the V2 model. The specific benefits will vary based on factors such as the volume of data, search traffic, and storage requirements, but the examples discussed in this post illustrate the potential value proposition. Amazon Titan Text Embeddings V2 is available today in the us-east-1 and us-west-2 AWS Regions.

About the authors

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pradeep Sridharan is a Senior Solutions Architect at AWS. He has years of experience in digital business transformation—designing and implementing solutions to drive market competitiveness and revenue growth across multiple sectors. He specializes in AI/ML, Data Analytics and Application Modernization and Migration. Pradeep is based in Arizona (US).

Anuradha Durfee is a Senior Product Manager at AWS working on generative AI. She has spent the last five years working on natural language understanding and is motivated by enabling life-like conversations between humans and technology. Anuradha is based in Boston, MA.

SECNAV Del Toro Names Next Big Deck Amphib USS Helmand Province

Bougainville (LHA-8) at Ingalls Shipbuilding launched on Sept. 30, 2023. USNI News Photo

WASHINGTON, D.C. – Secretary of the Navy Carlos Del Toro has named the next America-class big deck amphibious warship after the Helmand province campaign in Afghanistan.
“In keeping with naval tradition of naming our Navy’s amphibious assault ships after U.S. Marine Corps battles,” he said during a keynote at the Modern Day Marine conference on Thursday.
“I am honored to announce today that the future LHA-10 will be named USS Helmand Province, recognizing the bravery and sacrifice of our Marines and Sailors who fought for almost 20 years in the mountains of Afghanistan.”

Following Del Toro’s announcement, Marine Corps Commandant Gen. Eric Smith spoke about his experience commanding Regimental Combat Team 8, which fought in Helmand in 2011.

“Helmand province holds a unique place in the hearts of this generation of Marines,” Smith said.
“From 2009 to 2014 this region was the center of efforts to give stability and security to a troubled land. Helmand province as many of you know, it was not just any theater of war. It was the heart of the opium trade, a Taliban stronghold, and the terrain is rugged and formidable as any. And yet, that our Marines and sailors and allies and partners showed what it means to be the tip of the spear.”

Del Toro named Smith’s wife Trish Smith as Helmand Province’s sponsor.

In 2022, Del Toro named LHA-9 after the first and second Battle of Fallujah in Iraq. In November, the Navy awarded a $130 million advanced procurement contract to HII’s Ingalls Shipbuilding in Mississippi.

The 45,000-ton ship will be the third Flight I America-class ship following Bougainville (LHA-8) and Fallujah.

The Flight Is will have a well deck capable of carrying two Landing Craft Air Cushion hovercraft. The first two Americas – USS America (LHA-6) and USS Tripoli (LHA-7) – were built without well decks and oriented around Marine Corps aviation assets like the F-35B Lightning II Joint Strike Fighter and the MV-22B Osprey tiltrotor.

Report to Congress on Polar Security Cutter Program

The following is the April 29, 2024, Congressional Research Service report, Coast Guard Polar Security Cutter (Polar Icebreaker) Program: Background and Issues for Congress.

From the report

Required number of polar icebreakers. A 2023 Coast Guard fleet mix analysis concluded that the service will require a total of eight to nine polar icebreakers, including four to five heavy polar icebreakers and four to five medium polar icebreakers, to perform its polar (i.e., Arctic and Antarctic) missions in coming years.

Current operational polar icebreaker fleet. The operational U.S. polar icebreaking fleet currently consists of one heavy polar icebreaker, Polar Star, and one medium polar icebreaker, Healy. A second Coast Guard heavy polar icebreaker, Polar Sea, suffered an engine casualty in June 2010 and has been nonoperational since then. Polar Star and Polar Sea entered service in 1976 and 1977, respectively, and are now well beyond their originally intended 30-year service lives. The Coast Guard plans to extend Polar Star’s service life until the delivery of at least the second Polar Security Cutter (PSC; see next paragraph).

Polar Security Cutter (PSC). The Coast Guard PSC program aims to acquire four or five new PSCs (i.e., heavy polar icebreakers), to be followed at some later point by the acquisition of new Arctic Security Cutters (ASCs) (i.e., medium polar icebreakers). The Coast Guard in 2021 estimated PSC procurement costs in then-year dollars as $1,297 million (i.e., about $1.3 billion) for the first ship, $921 million for the second ship, and $1,017 million (i.e., about $1.0 billion) for the third ship, for a combined estimated cost of $3,235 million (i.e., about $3.2 billion). The procurement of the first two PSCs is fully funded. The Coast Guard’s proposed FY2024 budget requested $170.0 million in continued procurement funding for the PSC program. The Coast Guard’s proposed FY2025 budget requests no procurement funding for the PSC program. The Coast Guard originally aimed to have the first PSC delivered in 2024, but the ship’s estimated delivery date has been delayed repeatedly and may now occur no earlier than 2029. Another potential issue concerns the accuracy of the PSC’s estimated procurement cost, given the PSC’s size and internal complexity as well as cost growth in other Navy and Coast Guard shipbuilding programs. The PSC’s estimated procurement cost per weight is roughly half that of the Navy’s LPD-17 Flight II and LHA amphibious ships. These amphibious ships are equipped with expensive combat system equipment that is not included in the PSC design, but whether this would account for all of the difference in cost per weight between the PSC design and the two amphibious ship designs is not clear. If substantial cost growth occurs in the PSC program, it could raise a question regarding whether to grant some form of contract relief to the PSC shipbuilder.

Commercially available polar icebreaker (CAPI). The Coast Guard’s proposed FY2024 budget requested $125.0 million in procurement funding for the purchase of an existing CAPI that would be modified to become a Coast Guard polar icebreaker. The Coast Guard’s proposed FY2025 budget requests no procurement funding for CAPI, but the Coast Guard’s FY2025 Unfunded Priorities List (UPL) includes an item for $25.0 million in procurement funding for the ship.

Great Lakes icebreaker (GLIB). The Coast Guard’s proposed FY2024 budget proposed to initiate a new procurement program for procuring a new GLIB that would have capabilities similar to those of Mackinaw, the Coast Guard’s existing heavy GLIB. The Coast Guard’s proposed FY2024 budget requested $55.0 million in initial procurement funding for the ship, and the Coast Guard’s FY2024 UPL included an item for an additional $20.0 million in procurement funding for the ship. The Coast Guard’s proposed FY2025 budget requests no procurement funding for GLIB, but the Coast Guard’s FY2025 UPL includes an item for $25.0 million in procurement funding for the ship.

Download the document here.

Navy Air Defense Mission in the Red Sea Makes Case for Directed Energy Weapons, Says VCJCS Grady

Vice Chairman of the Joint Chiefs Adm. Christopher Grady. DoD Photo

Downing Iranian-supplied missiles and drones with multi-million dollar SM-2 missiles to protect shipping in the Red Sea and Gulf of Aden is a bad exchange that must change, the vice chairman of the Joint Chiefs of Staff said Wednesday.

“It has been an air-defense fight” in which the Navy and Air Force, along with allies and partners in Operation Prosperity Guardian, have largely prevailed in demonstrating “how we bring defense in depth,” Adm. Christopher Grady said during a U.S. Naval Institute-CSIS Maritime Security Dialogue.

To change the cost-benefit equation, he wants more directed energy systems deployed “where a drop of fuel becomes a weapon” to destroy attacking unmanned systems.

For the Navy, in particular, he said Red Sea operations have shown how “the ships, carrier and air wing” can “learn quickly and fast” in responding to evolving threats that have included ship hijackings, unmanned surface and subsurface vessels’ attacks, in addition to missile and unmanned aerial vehicle strikes.

But “the solution [in the Red Sea] is not a military solution,” he said, referring to the larger conflict between Israel and Hamas that began in October. The fighting in Gaza shows no signs of ending soon. The Iran-backed Houthis in Yemen, when they began attacking merchant shipping heading to and from the Suez Canal, said their strikes would be limited to vessels delivering goods to Israel.

As months passed, the attacks became indiscriminate, including on U.S. Navy ships participating in Operation Prosperity Guardian, an international effort by more than 20 nations, including the United Kingdom, Canada, and Australia, to protect merchant shipping in the region.
“I would like to see more from concerned stakeholders,” Grady added.

As part of Prosperity Guardian, the U.S. and U.K. have carried out airstrikes on suspected missile launch sites and assembly facilities in Yemen that have produced mixed results. Since the first attacks in the fall, an estimated 70 percent of the maritime traffic that routinely passed through those waterways has changed course to sail around Africa rather than risk a transit near Yemen.

The Houthis have now extended their missile attacks into the Indian Ocean, according to press reports.

“I don’t know if [the Houthi missile and unmanned systems’ attacks] deter” merchantmen from sailing in those waters, but they have forced commercial shipping companies to consider what routes to take, Grady said.

When asked to evaluate how air defense worked on April 13, when Iran retaliated against Israel for targeting Iran’s Syrian embassy, Grady said that like the Aegis destroyers, Israel, allies and partners “did their jobs.”

Iran fired more than 200 drones and cruise missiles, but only a few made it through Israeli defenses.

“Years of training together” paid off in knowing “who’s going to shoot what, when. You don’t do that overnight,” he added.

As for the impact of Iran firing “one-way drones” on Israeli targets, he said they were “not very successful.”

Grady said Ukraine’s need for air defense is an area “that concerns me most.” The $60 billion aid package passed after a six-month delay in Congress is coming at a time when Russia has adopted a “we’re coming after critical infrastructure and the electric grid” strategy to alter the course of the war in its third year.

The package also addresses immediate needs, like artillery and 155 mm shells, long-range munitions, electronic warfare systems and unmanned capabilities.

Grady said both the Russian and Ukrainian militaries “are learning organizations” and understand the value of “never underestimating your enemy” to adapt. The war has seen forces “weaponizing” iPhones and employing unmanned systems in the air and on the water.

While both sides still use the Soviet tactic of “shoot and then move,” relying heavily on artillery to clear the way for an infantry assault, unmanned aerial vehicles have stymied Russia’s massed armor attacks from the beginning, he said.

The increased use of electronic warfare to jam GPS targeting has also changed throughout the war. “Early on, we didn’t see EW,” but now “it’s certainly one of the battlefield characteristics” in Ukraine, Grady said.

With 18 months left to serve in his position, Grady said he wants to strengthen the joint requirements process. Grady said he and his two immediate predecessors have taken steps to reduce the stovepiped process of committing to individual service-specific systems and shift to a portfolio approach in the Pentagon and on the Joint Requirements Oversight Council, which includes all of the service vice chiefs.

The combatant commanders’ need for hardware and software quickly versus the services looking at the future creates “a constructive tension” over requirements, he said. Grady wants to “put teeth in the JROC,” where the services would follow through on its decisions.

“Traceability” through a “scorecard” would allow the secretary of defense and the chairman of the Joint Chiefs to see if and how a gap is closed. During his remaining time in office, he doesn’t expect to see a change in the Goldwater-Nichols law that restructured the services’ and Pentagon’s role.

House Lawmakers Pushing for 2 Virginia Subs in FY 2025, CNO Franchetti Gives Details on Boxer Repair

Virginia-class submarine USS Oregon (SSN 793) transits the Thames River during routine operations in Groton, Conn., on Oct. 6, 2022. US Navy Photo

A group of 120 House lawmakers are asking the House Appropriations defense subcommittee to add another Virginia-class attack submarine to the Navy’s Fiscal Year 2025 shipbuilding budget.
The group, led by Rep. Joe Courtney (D-Conn.), argued the Navy’s purchase of one Virginia in FY 2025 puts submarine suppliers at risk and sets the Navy back in its goals for the program, according to a letter to HAC-D chair Rep. Ken Calvert (R-Calif.) and ranking member Rep. Betty McCollum (D-Minn.).

“While the FY25 budget request includes substantial investments in the nationwide submarine industrial base, there is no alternative to stabilize the supply chain other than consistent procurement of two Virginia-class submarines in FY 2025,” reads the letter.
“The proposal to request one attack submarine is contrary to the Department of Defense’s National Defense Industrial Strategy, which cites procurement instability as a systemic challenge. This proposal is also an alarming deviation from the Virginia-class procurement profile in the FY 2024 Future Years Defense Plan and 30 Year Shipbuilding Plan.”

The service funded one Virginia-class as part of its March budget request. Navy officials justified the move by pointing to the backlog of submarine work at builders General Dynamics Electric Boat and HII’s Newport News Shipbuilding that translates to the yards delivering 1.3 boats a year. Instead of funding a second attack boat, the Navy set aside money for advanced procurement to support submarine suppliers. The request is seeking $3.6 billion for the FY 2025 boat and an additional $3.7 billion in advanced procurement money for boats in FY 2026 and 2027.

Courtney, the ranking member of the House Armed Services Committee’s seapower and projection forces subcommittee, and others argue that not funding the second boat will hurt suppliers that aren’t part of the advanced procurement pool. Courtney said his staff estimates those suppliers will be out about $1 billion.

During a Wednesday hearing before the House Armed Services Committee, Secretary of the Navy Carlos Del Toro defended the decision to buy one boat based on the Electric Boat and Newport News delivery rate, pointing to the delivery of Virginia-class attack boat New Jersey (SSN-796) last week.

“I’m trying to work with industry to increase the production rates,” Del Toro said during the hearing.
“New Jersey, for example, was delivered just last week, and it was delivered almost three years late. If all the submarines that we had ordered actually had been delivered on time … we’d actually have five additional submarines in our fleet today to be able to meet our operational needs.”

Del Toro also said the advanced procurement money is not meant to replace the work a new submarine contract would give suppliers.

“The purpose of advanced procurement money … isn’t to fully fund all the vendors that are in the supply chain,” Del Toro said during the hearing.
“It’s to fund those vendors that are most critical to the supply chain. I don’t think there’s ever been a confirmation that we can support full funding of all the vendors across the entire spectrum.”

Rep. Rob Wittman (R-Va.) said that asking for only one submarine in the budget could send the wrong signal to Australia as part of the AUKUS nuclear submarine agreement. Canberra is set to buy three to five Virginia-class submarines for the Royal Australian Navy.

“Now the Australians look at that and they go well, wait a minute, we thought we had an AUKUS agreement here… We thought we were going to be able to buy some from the United States?” Wittman asked during the hearing.
“If you are an Australian looking at this you’d go, ‘is the U.S. really serious about this [agreement]?’”

During the hearing, Chief of Naval Operations Adm. Lisa Franchetti gave some additional details on the planned in-water repair of big-deck amphibious warship USS Boxer (LHD-4) at Naval Station San Diego, Calif. Boxer left in early April but was unable to continue its deployment to the Western Pacific due to damage to the starboard rudder.

“[Boxer] has a bearing on her starboard rudder that is not in good condition, so it needs to be replaced,” Franchetti told the House panel.
“We are evaluating the different procedures that will be done to repair her – right now about a four to six-week repair. We look to be able to finish that repair pier-side – the bearing is available – and then get her back out on deployment.”

Marine Corps Commandant Gen. Eric Smith told Rep. Trent Kelly (R-Miss.) how losing a big deck would affect a deployment.

“We’re designed to operate on a three-ship, amphibious ready group, one big deck LHA or LHD and … two LPDs,” Smith said.
“When you lose your big deck, you lose most of your aviation assets and you lose your crisis response force.”

The three-ship Boxer Amphibious Ready Group was scheduled to leave in January with the 15th Marine Expeditionary Unit embarked, but only USS Somerset (LPD-25) left on time, requiring the Navy and Marine Corps to retool their participation in several Western Pacific exercises. The third ship in the ARG, USS Harpers Ferry (LSD-49), joined Somerset in the South China Sea for the recent Balikatan 2024 exercise series.

Earlier this week, Vice Chief of Naval Operations Adm. Jim Kilby told the House Armed Services readiness subcommittee that the Navy is having difficulty maintaining older big-deck amphibious ships like Boxer.

“We found our amphib ships – the big decks in particular with steam plants – are having larger growth work than most of our ships and it’s a challenge because of availability of parts, artisans, etc.,” Kilby told the panel on Tuesday.

Natural language boosts LLM performance in coding, planning, and robotics

Large language models (LLMs) are becoming increasingly useful for programming and robotics tasks, but for more complicated reasoning problems, the gap between these systems and humans looms large. Without the ability to learn new concepts like humans do, these systems fail to form good abstractions — essentially, high-level representations of complex concepts that skip less-important details — and thus sputter when asked to do more sophisticated tasks.

Luckily, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have found a treasure trove of abstractions within natural language. In three papers to be presented at the International Conference on Learning Representations this month, the group shows how our everyday words are a rich source of context for language models, helping them build better overarching representations for code synthesis, AI planning, and robotic navigation and manipulation.

The three separate frameworks build libraries of abstractions for their given task: LILO (library induction from language observations) can synthesize, compress, and document code; Ada (action domain acquisition) explores sequential decision-making for artificial intelligence agents; and LGA (language-guided abstraction) helps robots better understand their environments to develop more feasible plans. Each system is a neurosymbolic method, a type of AI that blends human-like neural networks and program-like logical components.

LILO: A neurosymbolic framework that codes

Large language models can be used to quickly write solutions to small-scale coding tasks, but cannot yet architect entire software libraries like the ones written by human software engineers. To take their software development capabilities further, AI models need to refactor (cut down and combine) code into libraries of succinct, readable, and reusable programs.

Refactoring tools like the previously developed MIT-led Stitch algorithm can automatically identify abstractions, so, in a nod to the Disney movie “Lilo & Stitch,” CSAIL researchers combined these algorithmic refactoring approaches with LLMs. Their neurosymbolic method LILO uses a standard LLM to write code, then pairs it with Stitch to find abstractions that are comprehensively documented in a library.

LILO’s unique emphasis on natural language allows the system to do tasks that require human-like commonsense knowledge, such as identifying and removing all vowels from a string of code and drawing a snowflake. In both cases, the CSAIL system outperformed standalone LLMs, as well as a previous library learning algorithm from MIT called DreamCoder, indicating its ability to build a deeper understanding of the words within prompts. These encouraging results point to how LILO could assist with things like writing programs to manipulate documents like Excel spreadsheets, helping AI answer questions about visuals, and drawing 2D graphics.

“Language models prefer to work with functions that are named in natural language,” says Gabe Grand SM ’23, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author on the research. “Our work creates more straightforward abstractions for language models and assigns natural language names and documentation to each one, leading to more interpretable code for programmers and improved system performance.”

When prompted on a programming task, LILO first uses an LLM to quickly propose solutions based on data it was trained on, and then the system slowly searches more exhaustively for outside solutions. Next, Stitch efficiently identifies common structures within the code and pulls out useful abstractions. These are then automatically named and documented by LILO, resulting in simplified programs that can be used by the system to solve more complex tasks.
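
To make the shape of that loop concrete, the following Python sketch walks through one synthesize-compress-document round. It is a minimal illustration under stated assumptions, not the published LILO implementation: `llm_propose_solutions`, `stitch_compress`, and `llm_name_and_document` are hypothetical stand-ins for the LLM synthesis step, Stitch’s compression step, and the LLM naming and documentation step.

```python
# Hypothetical sketch of one LILO-style round: synthesize programs with
# an LLM, compress shared structure into abstractions (Stitch's role),
# then name and document the abstractions so later prompts can use them.
# All three helpers are illustrative stand-ins, not real LILO/Stitch APIs.

def llm_propose_solutions(task: str, library: dict) -> list[str]:
    # Stand-in for an LLM writing candidate programs for a task,
    # conditioned on the abstractions already in the library.
    return [f"solve({task!r})"]

def stitch_compress(programs: list[str]) -> list[str]:
    # Stand-in for Stitch: find structure shared across programs and
    # lift it out as reusable abstractions. Here: shared call heads.
    return sorted({p.split("(")[0] for p in programs})

def llm_name_and_document(abstraction: str) -> tuple[str, str]:
    # Stand-in for LLM-assigned names and docstrings, the step the
    # researchers credit with making the library easier to prompt with.
    return abstraction, f"Auto-documented helper: {abstraction}"

def lilo_round(tasks: list[str], library: dict) -> dict:
    programs = [p for t in tasks for p in llm_propose_solutions(t, library)]
    for abstraction in stitch_compress(programs):
        name, doc = llm_name_and_document(abstraction)
        library[name] = doc
    return library

print(lilo_round(["draw_snowflake", "remove_vowels"], {}))
```

The division of labor is the point the sketch preserves: the LLM handles the fuzzy, language-conditioned steps (synthesis, naming, documentation), while the symbolic compressor handles exact structure-finding.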

The MIT framework writes programs in domain-specific programming languages, like Logo, a language developed at MIT in the 1970s to teach children about programming. Scaling up automated refactoring algorithms to handle more general programming languages like Python will be a focus for future research. Still, their work represents a step forward for how language models can facilitate increasingly elaborate coding activities.

Ada: Natural language guides AI task planning

Just like in programming, AI models that automate multi-step tasks in households and command-based video games lack abstractions. Imagine you’re cooking breakfast and ask your roommate to bring a hot egg to the table — they’ll intuitively abstract their background knowledge about cooking in your kitchen into a sequence of actions. In contrast, an LLM trained on similar information will still struggle to reason about what it needs to build a flexible plan.

Named after the famed mathematician Ada Lovelace, who many consider the world’s first programmer, the CSAIL-led “Ada” framework makes headway on this issue by developing libraries of useful plans for virtual kitchen chores and gaming. The method trains on potential tasks and their natural language descriptions, then a language model proposes action abstractions from this dataset. A human operator scores and filters the best plans into a library, so that the best possible actions can be implemented into hierarchical plans for different tasks.
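
That propose-score-filter-plan cycle can be sketched in a few lines of Python. This is a hedged illustration rather than the published Ada code: `propose_abstractions`, `human_score`, and the toy planner below stand in for the LLM proposals, operator scoring, and hierarchical planning described above.

```python
# Hypothetical sketch of the Ada-style loop: an LLM proposes action
# abstractions from natural-language task descriptions, a human operator
# scores them, and high-scoring ones enter a library used for planning.

def propose_abstractions(task_description: str) -> list[str]:
    # Stand-in for an LLM proposing candidate high-level actions.
    return [f"high_level_action_for({task_description!r})"]

def human_score(abstraction: str) -> float:
    # Stand-in for operator feedback; Ada keeps only well-scored plans.
    return 0.9

def build_library(task_descriptions: list[str], threshold: float = 0.5) -> list[str]:
    library = []
    for desc in task_descriptions:
        for abstraction in propose_abstractions(desc):
            if human_score(abstraction) >= threshold:
                library.append(abstraction)
    return library

def hierarchical_plan(goal: str, library: list[str]) -> list[str]:
    # Toy planner: select library actions relevant to the goal and leave
    # the low-level steps to a downstream controller.
    relevant = [a for a in library if goal.split()[0] in a]
    return relevant or library[:1]

lib = build_library(["place chilled wine in a cabinet", "craft a bed"])
print(hierarchical_plan("place wine", lib))
```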

“Traditionally, large language models have struggled with more complex tasks because of problems like reasoning about abstractions,” says Ada lead researcher Lio Wong, an MIT graduate student in brain and cognitive sciences, CSAIL affiliate, and LILO coauthor. “But we can combine the tools that software engineers and roboticists use with LLMs to solve hard problems, such as decision-making in virtual environments.”

When the researchers incorporated the widely-used large language model GPT-4 into Ada, the system completed more tasks in a kitchen simulator and Mini Minecraft than the AI decision-making baseline “Code as Policies.” Ada used the background information hidden within natural language to understand how to place chilled wine in a cabinet and craft a bed. The results indicated a staggering 59 and 89 percent task accuracy improvement, respectively.

With this success, the researchers hope to generalize their work to real-world homes, where Ada could assist with other household tasks and aid multiple robots in a kitchen. For now, its key limitation is its use of a generic LLM, so the CSAIL team wants to apply a more powerful, fine-tuned language model that could assist with more extensive planning. Wong and her colleagues are also considering combining Ada with a robotic manipulation framework fresh out of CSAIL: LGA (language-guided abstraction).

Language-guided abstraction: Representations for robotic tasks

Andi Peng SM ’23, an MIT graduate student in electrical engineering and computer science and CSAIL affiliate, and her coauthors designed a method to help machines interpret their surroundings more like humans, cutting out unnecessary details in a complex environment like a factory or kitchen. Just like LILO and Ada, LGA has a novel focus on how natural language leads us to those better abstractions.

In these more unstructured environments, a robot will need some common sense about what it’s tasked with, even with basic training beforehand. Ask a robot to hand you a bowl, for instance, and the machine will need a general understanding of which features are important within its surroundings. From there, it can reason about how to give you the item you want. 

In LGA’s case, humans first provide a pre-trained language model with a general task description using natural language, like “bring me my hat.” Then, the model translates this information into abstractions about the essential elements needed to perform this task. Finally, an imitation policy trained on a few demonstrations can implement these abstractions to guide a robot to grab the desired item.
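
A minimal sketch of that three-stage pipeline appears below. The helpers are hypothetical stand-ins, not the published LGA implementation: `language_to_abstraction` plays the role of the pre-trained language model picking out task-relevant features, and `imitation_policy` plays the role of the demonstration-trained policy.

```python
# Hypothetical sketch of the LGA pipeline: a language model turns a task
# description into a state abstraction (which scene features matter),
# and an imitation-learned policy acts on that abstracted state.

def language_to_abstraction(task: str, scene_features: list[str]) -> list[str]:
    # Stand-in for the pre-trained LM deciding which features are
    # task-relevant (the hat matters; the toaster does not).
    return [f for f in scene_features if f in task]

def imitation_policy(abstract_state: list[str]) -> str:
    # Stand-in for a policy trained on a few demonstrations over
    # abstracted states rather than raw observations.
    return f"grasp({abstract_state[0]})" if abstract_state else "explore()"

scene = ["hat", "mug", "toaster"]
abstract_state = language_to_abstraction("bring me my hat", scene)
print(imitation_policy(abstract_state))  # grasp(hat)
```

The design choice worth noting is that abstraction happens before policy learning, so the few demonstrations only need to cover the features that actually matter for the task.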

Previous work required a person to take extensive notes on different manipulation tasks to pre-train a robot, which can be expensive. Remarkably, LGA guides language models to produce abstractions similar to those of a human annotator, but in less time. To illustrate this, LGA developed robotic policies to help Boston Dynamics’ Spot quadruped pick up fruits and throw drinks in a recycling bin. These experiments show how the MIT-developed method can scan the world and develop effective plans in unstructured environments, potentially guiding autonomous vehicles on the road and robots working in factories and kitchens.

“In robotics, a truth we often disregard is how much we need to refine our data to make a robot useful in the real world,” says Peng. “Beyond simply memorizing what’s in an image for training robots to perform tasks, we wanted to leverage computer vision and captioning models in conjunction with language. By producing text captions from what a robot sees, we show that language models can essentially build important world knowledge for a robot.”

The challenge for LGA is that some behaviors can’t be explained in language, making certain tasks underspecified. To expand how they represent features in an environment, Peng and her colleagues are considering incorporating multimodal visualization interfaces into their work. In the meantime, LGA provides a way for robots to gain a better feel for their surroundings when giving humans a helping hand. 

An “exciting frontier” in AI

“Library learning represents one of the most exciting frontiers in artificial intelligence, offering a path towards discovering and reasoning over compositional abstractions,” says assistant professor at the University of Wisconsin-Madison Robert Hawkins, who was not involved with the papers. Hawkins notes that previous techniques exploring this subject have been “too computationally expensive to use at scale” and have an issue with the lambdas, or keywords used to describe new functions in many languages, that they generate. “They tend to produce opaque ‘lambda salads,’ big piles of hard-to-interpret functions. These recent papers demonstrate a compelling way forward by placing large language models in an interactive loop with symbolic search, compression, and planning algorithms. This work enables the rapid acquisition of more interpretable and adaptive libraries for the task at hand.”

By building libraries of high-quality code abstractions using natural language, the three neurosymbolic methods make it easier for language models to tackle more elaborate problems and environments in the future. This deeper understanding of the precise keywords within a prompt presents a path forward in developing more human-like AI models.

MIT CSAIL members are senior authors for each paper: Joshua Tenenbaum, a professor of brain and cognitive sciences, for both LILO and Ada; Julie Shah, head of the Department of Aeronautics and Astronautics, for LGA; and Jacob Andreas, associate professor of electrical engineering and computer science, for all three. The additional MIT authors are all PhD students: Maddy Bowers and Theo X. Olausson for LILO, Jiayuan Mao and Pratyusha Sharma for Ada, and Belinda Z. Li for LGA. Muxin Liu of Harvey Mudd College was a coauthor on LILO; Zachary Siegel of Princeton University, Jiahai Feng of the University of California at Berkeley, and Noa Korneev of Microsoft were coauthors on Ada; and Ilia Sucholutsky, Theodore R. Sumers, and Thomas L. Griffiths of Princeton were coauthors on LGA.

LILO and Ada were supported, in part, by MIT Quest for Intelligence, the MIT-IBM Watson AI Lab, Intel, U.S. Air Force Office of Scientific Research, the U.S. Defense Advanced Research Projects Agency, and the U.S. Office of Naval Research, with the latter project also receiving funding from the Center for Brains, Minds and Machines. LGA received funding from the U.S. National Science Foundation, Open Philanthropy, the Natural Sciences and Engineering Research Council of Canada, and the U.S. Department of Defense.

Chinese Aircraft Carrier Fujian Leaves for First Set of Sea Trials

Chinese aircraft carrier Fujian. Xinhua Photo

China’s third aircraft carrier Fujian (18) left Shanghai on Wednesday morning to conduct its first sea trial, according to a report by People’s Liberation Army News. Meanwhile, the People’s Liberation Army Navy’s (PLAN) first batch of female pilot trainees carried out their first solo flights on Apr. 25.
Fujian left Jiangnan Shipyard at 8 a.m. on Wednesday, according to PLA News, with the sea trial intended to test and verify the reliability and stability of the carrier’s power, electrical and other systems. No details were given as to the location or duration of the sea trials, but the China Maritime Safety Administration issued a navigational hazard safety notice for an area 80 miles from Shanghai starting on Wednesday and concluding on May 9. The PLA News report stated that since the carrier was launched in 2022, its construction had proceeded on schedule and it had completed mooring trials and equipment adjustments, meeting the technical requirements to sail for sea trials.

The 80,000-ton carrier is China’s first CATOBAR (Catapult Assisted Take-Off But Arrested Recovery) carrier, in contrast to CNS Liaoning (16) and CNS Shandong (17), which both use ski jumps to assist aircraft launches. Fujian also uses EMALS (the Electromagnetic Aircraft Launch System) to launch its aircraft. Currently, only the U.S. Gerald R. Ford-class carriers feature EMALS, though the French next-generation carrier PANG (porte-avions de nouvelle génération), set to enter service in 2038, will also employ it.

Shandong conducted nine sea trials from May 2018 to November 2019 before it was commissioned in December 2019, though it remains to be seen whether Fujian will conduct the same number of trials over a similar period.

Fujian is expected to enter service by late next year or in 2026, allowing the PLAN’s carrier strike groups (CSGs) to maintain a higher deployment tempo. Neither the Liaoning nor the Shandong CSG has conducted a deployment this year. Liaoning is working its way back to operational readiness after coming out of a year-long refit that began in February 2023. Shandong has remained in its home base in Sanya, conducting in-port drills and crew training since December last year, when it returned from a month of carrier aviation pilot training in northern China.

In March, Yuan Huazhi, political commissar of the PLAN, told Chinese media that China would announce a fourth carrier soon and would also reveal whether it would be nuclear powered or conventionally powered like its existing three carriers. So far, no official announcement has been made.

With a third and potentially a fourth carrier, the PLAN’s carrier aviation force will need to expand, prompting the service to open pilot recruitment to women for the first time in April 2023. The first batch of female pilot trainees carried out their first solo flights on Apr. 25 at the PLA Naval Aviation University in Yantai, according to a PLA Daily report.

The initial report did not disclose how many trainees made the flights, though a second PLA Daily report stated that all trainees completed their solo flights successfully and that, during the hour-long flights, instructors on the ground did not have to issue any corrections to the trainee pilots. All the trainee pilots were born after the year 2000, according to PLA Daily.

PLA Daily also reported that in the summer, the female trainee pilots will carry out advanced flight training, including instrument flying, navigation, formation flying and night flying. In its 2023 recruitment announcement, the PLAN stated that after two months of basic training, cadet pilots would undergo three to four years of flight training at the PLA Naval Aviation University before graduating for assignment. At the earliest, then, China will have its first batch of female naval aviators in late 2026.