We created Phoenix to help developers ship quality LLM apps with confidence. What began as a set of tools for observability has grown into a platform for developing, testing, and improving LLM pipelines. We want Phoenix to be a homebase for teams exploring LLM apps, and with that in mind, we are excited to announce a slew of new features that get us closer to achieving that goal.
We’re excited to kick off Phoenix 2.0 Launch Week (7/12-7/17)! 🚀
Join us over the next week as we ship a major new Phoenix feature each day.
⭐Star us on Github to catch all of our new Phoenix updates, and follow us on 🕊️ X/Twitter to make sure you don’t miss any of our announcements!
We can’t wait to show you what the team has been working on!
To start things off, we’re partnering with LlamaIndex to release a new deployment method for Phoenix: Hosted Phoenix aka LlamaTrace. Up until now, you’ve been able to deploy Phoenix locally in a notebook, through the CLI, or via a Docker container on your own infrastructure. Hosted Phoenix gives a new option for a persistent, accessible, collaborative instance of Phoenix hosted for you.
This instance can be set up directly through the web, no code required, and will automatically store traces and spans between sessions. Invite other team members to collaborate on your projects – which will be even more powerful after you see some of the announcements later this week.
For those looking for more customization and control, our existing deployment options aren’t going anywhere.
Check out this video for a walkthrough of how Hosted Phoenix works, or get started with hosted Phoenix here!
Now that we’ve made it easier to set up persistent projects, we wanted to amp up our experimentation and iteration capabilities. And to do that, we needed a way to easily move data into and out of the platform. Enter Datasets.
Datasets are a new core feature in Phoenix that live alongside your projects. They can be imported, exported, created, curated, manipulated, and viewed within the platform, and should make a few flows much easier:
For more details on using datasets see our documentation or example notebook!
Experiments build on top of our existing dataset and evaluation features to provide full iteration workflow in Phoenix. Start from a dataset of example cases, define your evaluators, then iterate through as many prompt variations, model combinations, or chain structures as you’d like – Experiments will collect all the results side by side in the Phoenix UI so that you can quickly choose the optimal approach or identify problem areas.
We’re extremely excited about the power of Hosted Phoenix + Datasets + Experiments to provide a full, ready-to-use workbench to test and iterate on your LLM pipelines. For a more detailed run through of Experiments in action, check out our docs, example notebook, or see a full video demo below:
We’ve released a new built-in evaluator to Phoenix that measures the performance of function or tool calling within LLMs!
We’ve found that issues with function calling can arise in 3 main areas:
1. 🗺️ Routing – deciding which function to call
2. 🔢 Parameter extraction – pulling out the right parameters from the question
3. 🧑💻 Function generation – actually generating the function code
This new eval checks for each of these error cases, and helps direct you where to focus. This new evaluator can be used within an experiment or on its own.
Check out more details in the walkthrough below, or see here for an example notebook.
Stay tuned over the next few days for two more updates!
To cap off our launch week, we’re releasing our integration with Guardrails AI. Guardrails can be applied to either user input methods (e.g. jailbreak attempts) or LLM output messages (e.g. relevance), and can trigger either retries in the pipeline or default responses.
Partnering with the Guardrails team, we’ve built support to capture spans and traces on their existing suite of Guards. You’ll be able to view the invocation of these guards within Phoenix alongside your LLM calls and create Datasets based on guards triggering.
And we’ve gone one step further by developing our own Guard within Guardrails’ ecosystem: ArizeDatasetEmbeddings. This Guard will take a set of “bad” responses, generate embeddings for each, then compare these embeddings against user or LLM messages. If a response is too similar, it triggers the Guard.
Check out the video below to see this integration in action, or see our example notebook here:
For those who missed it, check out our town hall session where we went through all these updates live and answered a ton of community questions.
Get the latest news, expertise, and product updates from Phoenix .