Category: Generative AI

Creating and Validating Synthetic Datasets for LLM Evaluation & Experimentation

Category: Generative AI, Large Language Models
Posted by Evan Jolley | September 16, 2024

Thanks to John Gilhuly for his contributions to this piece. Liking Phoenix? Please consider giving us a star on GitHub! ⭐️ Synthetic datasets are artificially created datasets that are designed to mimic real-world information. Unlike naturally occurring data, which is gathered from actual events or interactions, synthetic datasets are generated using algorithms, rules, or other…

Evaluating an Image Classifier

Category: Generative AI, Large Language Models, Phoenix
Posted by John Gilhuly | September 3, 2024

Phoenix supports multi-modal evaluation and tracing. In this tutorial, we’ll take advantage of that to walk through the process of setting up an image classification experiment using Phoenix. This involves uploading a dataset, creating an experiment to classify the images, and evaluating the model’s accuracy. We’ll be using OpenAI’s GPT-4o-mini model for the classification task. …

Trace Your Haystack Application

Category: Generative AI, Phoenix
Posted by Steve Herzog | August 19, 2024

Haystack is an open-source framework for building LLM applications, retrieval-augmented generative pipelines and search systems that work intelligently over large document collections. Haystack makes it very easy to get up and running quickly with search and RAG LLM apps. Combining Haystack with Phoenix gives you the tools to not only build these apps quickly, but…

How To Use Annotations To Collect Human Feedback On Your LLM Application

Category: Generative AI, Phoenix
Posted by Steve Herzog | August 15, 2024

Liking Phoenix? Please consider giving us a ⭐ on GitHub! Cast your mind back to the early days of mainstream AI development – a whopping seven years ago. NVIDIA stock was just over $1 per share. “Transformers” meant Optimus Prime, not a world-changing technical leap forward. And the primary way of evaluating AI applications was…

LLM Function Calling: Evaluating Tool Calls In LLM Pipelines

Category: Generative AI, Large Language Models, LLMOps, Phoenix
Posted by Steve Herzog | July 16, 2024

Function calling is an essential part of any AI engineer’s toolkit, enabling builders to enhance a model’s utility at specific tasks. As more LLM applications leveraging tool calls get deployed into production, the task of effectively evaluating their performance in LLM pipelines becomes more critical. What Is Function Calling In AI? First launched by OpenAI…

LlamaIndex’s Newly-Released Instrumentation Module + Phoenix Integration

Category: Generative AI, Large Language Models, LLMOps, Phoenix, Use-Case
Posted by Evan Jolley | July 1, 2024

Due to the black box nature of LLMs and the importance of tasks they’re being trusted to handle, intelligent monitoring and optimization tools are essential to ensure they operate efficiently and effectively. The integration of Arize Phoenix with LlamaIndex’s newly released instrumentation module offers developers unprecedented power to fine-tune performance, diagnose issues, and enhance the…

Evaluate RAG with LLM Evals and Benchmarks

Category: Generative AI, Large Language Models, LLMOps, Phoenix, Use-Case
Posted by Shittu Olumide | March 6, 2024 | Tags: llamaindex, llm evaluation, phoenix

Recently, I attended a workshop organized by Arize AI titled “RAG Time! Evaluate RAG with LLM Evals and Benchmarking.” Hosted by Amber Roberts – ML Growth Lead at Arize AI, and Mikyo King – Head of Open Source at Arize AI, the talks provided valuable insights into an important field of study. Miss the event?…

Evaluating and Analyzing Your RAG Pipeline with Ragas

Category: Generative AI, Large Language Models, LLMOps, Phoenix, Product
Posted by Shahul ES | February 20, 2024

This article is co-authored by Mikyo King, Founding Engineer and Head of Open Source at Arize AI, and Xander Song, AI Engineer at Arize AI. Building a baseline for a RAG pipeline is not usually difficult, but enhancing it to make it suitable for production and ensuring the quality of your responses is almost always…

Calling All Functions: Benchmarking OpenAI Function Calling and Explanations

Category: Generative AI, Large Language Models, LLMOps, Phoenix, Use-Case
Posted by Amber Roberts | December 7, 2023

This piece is co-authored by Roger Yang, Software Engineer at Arize AI. Observability in third-party large language models (LLMs) is largely approached with benchmarking and evaluations since models like Anthropic’s Claude, OpenAI’s GPT models, and Google’s PaLM 2 are proprietary. In this blog post, we benchmark OpenAI’s GPT models with function calling and explanations against…

LLM Tracing and Observability

Category: Generative AI, Large Language Models, LLMOps, Phoenix, Product
Posted by Amber Roberts | October 2, 2023 | Tags: LLM observability

What is LLM App Tracing? The rise of large language model (LLM) application development has enabled developers to move quickly in building applications powered by LLMs. The abstractions created by these frameworks can accelerate development, but also make it hard to debug an LLM app. This is where Arize Phoenix, a popular open-source library for…