
Open-Source Tracing and Evaluation

Trace, evaluate, and iterate on generative AI applications

[Trace waterfall: a chain span ("query") flagged "hallucinated", containing a retriever span ("retrieve") flagged "irrelevant", an embedding span, a chain span ("synthesize"), and an llm span, each with its latency]

Top AI Builders Use Phoenix

Ship your LLM Apps with quality and confidence

Traces and sessions

Tracing out of the box

Easily collect LLM app data with automatic instrumentation.

[Trace detail: a user query ("Are your branches open on Veterans Day?", topic: Hours), a 1.02s retrieval step returning 9 chunks, and the retrieved chunk previews, e.g. "Company owned locations are closed on federal holidays, however franchised locations..."]
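
To make "out of the box" concrete, here is a minimal sketch of pointing an app at Phoenix: launch the local server, register an OpenTelemetry tracer provider, and auto-instrument the OpenAI client so every call is traced. It assumes the arize-phoenix and openinference-instrumentation-openai packages are installed; the model name is a placeholder.

# Minimal sketch: launch Phoenix locally and auto-instrument OpenAI calls.
# Assumes arize-phoenix and openinference-instrumentation-openai are installed.
import openai
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()                # serve the Phoenix UI at http://localhost:6006
tracer_provider = register()   # route OpenTelemetry spans to Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Every call below is now captured as a trace, with no manual spans required.
client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Are your branches open on Veterans Day?"}],
)
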
Evaluation framework

Evals in one command

Laser-fast, pre-tested eval templates that are easy to customize to any task.

[Eval results: Faithfulness 0.66 (48%), Correctness 0.67 (33%)]
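
As a sketch of what "one command" can look like in code, the snippet below runs Phoenix's pre-built hallucination template over a dataframe of query/context/answer rows. The column names, sample rows, and judge model are illustrative assumptions.

# Sketch: score rows for hallucination with a pre-tested eval template.
# Assumes the arize-phoenix-evals package and an OpenAI API key.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame({
    "input": ["Are your branches open on Veterans Day?"],
    "reference": ["Company owned locations are closed on federal holidays..."],
    "output": ["All branches are open on Veterans Day."],
})

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),  # placeholder judge model
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # ["hallucinated", "factual"]
)
print(results["label"])  # one label per row
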
Dataset creation

Datasets for Prompt Testing

Save, curate and build test sets for prompt templates, prompt iteration and fine-tuning.
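
A minimal sketch of saving such a test set, assuming a running Phoenix server reachable by px.Client(); the dataset name and columns are hypothetical.

# Sketch: upload a small test set to Phoenix for prompt iteration.
import pandas as pd
import phoenix as px

examples = pd.DataFrame({
    "question": ["Are your branches open on Veterans Day?"],
    "expected_answer": ["Company owned locations are closed on federal holidays."],
})

dataset = px.Client().upload_dataset(
    dataset_name="branch-hours-qa",  # hypothetical dataset name
    dataframe=examples,
    input_keys=["question"],
    output_keys=["expected_answer"],
)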

Prompt & Variable Management

Easy iteration on your prompts

Easily test new prompt changes against your data for greater confidence before deployment.

system
You are an expert Q&A system that is trusted around the world. Always answer the query using the provided context information, and not prior knowledge.
user
Context information is below. ...
user
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
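
One way to test a prompt change against saved data is the experiments API; the sketch below is a rough outline under assumed names (the hypothetical dataset from the previous section, a stubbed LLM call, and an exact-match evaluator), not the product's exact workflow.

# Sketch: replay a revised prompt against a saved dataset and score it.
import phoenix as px
from phoenix.experiments import run_experiment

dataset = px.Client().get_dataset(name="branch-hours-qa")  # hypothetical name

def answer_with_new_prompt(question: str) -> str:
    # Stand-in for a real LLM call using the revised prompt template.
    return "Company owned locations are closed on federal holidays."

def task(input):
    return answer_with_new_prompt(input["question"])

def matches_expected(output, expected) -> bool:
    # Exact-match evaluator; swap in an LLM judge for fuzzier tasks.
    return output == expected["expected_answer"]

experiment = run_experiment(dataset, task, evaluators=[matches_expected])
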
RAG analysis & benchmarking

Clustering of your data

Uncover semantically similar questions, chunks, or responses using embeddings to find poor performance.
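
A small sketch of loading query embeddings into Phoenix so that semantically similar queries cluster in the UI; the column names and the toy 3-dimensional vectors are assumptions for illustration.

# Sketch: visualize query embeddings in Phoenix to find clusters of
# poorly performing inputs. Toy 3-d vectors stand in for real embeddings.
import pandas as pd
import phoenix as px

df = pd.DataFrame({
    "query": ["Are your branches open on Veterans Day?", "What are your holiday hours?"],
    "query_vector": [[0.12, -0.03, 0.48], [0.10, -0.01, 0.52]],
})

schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="query_vector",
        raw_data_column_name="query",
    ),
)
px.launch_app(primary=px.Inferences(dataframe=df, schema=schema))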

OSS + OpenTelemetry

Proudly open source

Built on top of OpenTelemetry, Phoenix is agnostic of vendor, framework, and language – granting you the flexibility you need in today’s generative landscape.

# Manually instrument a streaming chat endpoint with OpenInference span attributes.
# (Fragment from a FastAPI handler: tracer, request, data, chat_engine, and
# parse_chat_data are assumed to be defined elsewhere in the application.)
from fastapi.responses import StreamingResponse
from openinference.semconv.trace import SpanAttributes
from opentelemetry import trace

span = tracer.start_span(
    "chat", attributes={SpanAttributes.OPENINFERENCE_SPAN_KIND: "CHAIN"}
)
with trace.use_span(span, end_on_exit=False):
    last_message_content, messages = await parse_chat_data(data)
    span.set_attribute(SpanAttributes.INPUT_VALUE, last_message_content)
    response = await chat_engine.astream_chat(last_message_content, messages)

    async def event_generator():
        full_response = ""
        async for token in response.async_response_gen():
            if await request.is_disconnected():
                break
            full_response += token
            yield token
        # Record the full streamed output on the span before ending it.
        span.set_attribute(SpanAttributes.OUTPUT_VALUE, full_response)
        span.end()

    return StreamingResponse(event_generator(), media_type="text/plain")

Phoenix enables new workflows for LLM app developers

Iterate on your LLM workflow and deploy to production with confidence

01

Trace

02

Evaluate

03

Iterate

"Just came across Arize-phoenix, a new library for LLMs and RNs that provides visual clustering and model interpretability. Super useful."
Lior Sinclair, AI Researcher
"As LLM-powered applications increase in sophistication and new use cases emerge, deeper capabilities around LLM observability are needed to help debug and troubleshoot. We’re pleased to see this open-source solution from Arize, along with a one-click integration to LlamaIndex, and recommend any AI engineers or developers building with LlamaIndex check it out."
Jerry Liu, CEO and Co-Founder, LlamaIndex
"This is something that I was wanting to build at some point in the future, so I’m really happy to not have to build it. This is amazing."
Tom Matthews, Machine Learning Engineer at Unitary.ai
"Arize's Phoenix tool uses one LLM to evaluate another for relevance, toxicity, and quality of responses. The tool uses "Traces" to record the paths taken by LLM requests (made by an application or end user) as they propagate through multiple steps. An accompanying OpenInference specification uses telemetry data to understand the execution of LLMs and the surrounding application context. In short, it's possible to figure out where an LLM workflow broke or troubleshoot problems related to retrieval and tool execution."
Lucas Mearian, Senior Reporter, ComputerWorld
"Large language models...remain susceptible to hallucination — in other words, producing false or misleading results. Phoenix, announced today at Arize AI’s Observe 2023 summit, targets this exact problem by visualizing complex LLM decision-making and flagging when and where models fail, go wrong, give poor responses or incorrectly generalize."
Shubham Sharma, VentureBeat

Ready to get started?

Start now
"Phoenix is a much-appreciated advancement in model observability and production. The integration of observability utilities directly into the development process not only saves time but encourages model development and production teams to actively think about model use and ongoing improvements before releasing to production. This is a big win for management of the model lifecycle."
Christopher Brown, CEO and Co-Founder of Decision Patterns and a former UC Berkeley Computer Science lecturer
"Phoenix integrated into our team’s existing data science workflows and enabled the exploration of unstructured text data to identify root causes of unexpected user inputs, problematic LLM responses, and gaps in our knowledge base."
Yuki Waka, Application Developer, Klick
"We are in an exciting time for AI technology including LLMs. We will need better tools to understand and monitor an LLM’s decision making. With Phoenix, Arize is offering an open source way to do exactly just that in a nifty library."
Erick Siavichay, Project Mentor, Inspirit AI

Subscribe to stay up to date