Evaluating an Image Classifier
Phoenix supports multi-modal tracing and evaluation. In this tutorial, we’ll take advantage of that to set up an image classification experiment in Phoenix: uploading a dataset, running an experiment that classifies each image, and evaluating the model’s accuracy. We’ll use OpenAI’s GPT-4o-mini model for the classification task.
This guide assumes you have an OpenAI API key ready to go.
- View the full notebook here
- Loving Phoenix? Consider giving us a star on Github. It really helps us keep the lights on! ⭐️
Setting Up Your Environment
To get started, you’ll need to install the necessary dependencies:
pip install -q "arize-phoenix>=4.29.0" openinference-instrumentation-openai openai datasets
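If you haven’t already exported your OpenAI API key, you can set it in your environment now. This is a minimal sketch; adapt it to however you normally manage secrets:

import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")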
Connecting to Phoenix
Next, we’ll connect to Phoenix. There are many different ways to host Phoenix, from accessing a cloud instance to self-hosting. The snippet below will connect to a cloud instance of Phoenix if you have PHOENIX_API_KEY set in your environment variables. Otherwise, it will launch a local instance.
import os

if "PHOENIX_API_KEY" in os.environ:
    # Connect to a Phoenix Cloud instance
    os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
else:
    # Launch a local Phoenix instance
    import phoenix as px
    px.launch_app()

from phoenix.otel import register

tracer_provider = register()
Loading the Dataset
We’ll be using a sample image classification dataset from Hugging Face. After loading the dataset, we’ll convert the image bytes to base64-encoded data URLs, a format that both Phoenix and the OpenAI API accept.
import base64

import phoenix as px
from datasets import load_dataset

df = load_dataset("huggingface/image-classification-test-sample")["train"].to_pandas()

# Convert each image to a base64-encoded data URL
df['img'] = df['img'].apply(lambda x: x['bytes'])
df['img'] = df['img'].apply(lambda x: base64.b64encode(x).decode('utf-8'))
df['img'] = df['img'].apply(lambda x: 'data:image/png;base64,' + x)
We’ll also map the numerical labels to more meaningful class names:
label_map = {0: 'airplane', 1: 'automobile', 2: 'bird', 3: 'cat', 4: 'deer', 5: 'dog', 6: 'frog', 7: 'horse', 8: 'ship', 9: 'truck'}
df['label'] = df['label'].map(label_map)
Verifying the Dataset
Before proceeding, let’s take a quick look at the first image in the dataset to ensure everything is in order:
from IPython.display import display, Image
image_data = df.loc[0, 'img'].split(',')[1]
image_bytes = base64.b64decode(image_data)
display(Image(data=image_bytes))
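It’s also worth confirming that the label mapping applied cleanly. A quick way to do this in the same notebook is to look at the distribution of class names:

# Count how many rows fall into each class; any NaN here would indicate
# a numeric label that wasn't covered by label_map
print(df['label'].value_counts(dropna=False))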
Uploading the Dataset to Phoenix
Now that our dataset is ready, we can upload it to Phoenix. This dataset will be used as the test cases for our experiment.
import datetime

test_cases = px.Client().upload_dataset(
    dataset_name=f"image-classification-test-sample-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
    dataframe=df,
    input_keys=["img"],
    output_keys=["label"],
)
Defining the Experiment Task
We’ll use OpenAI’s GPT-4o-mini model for the classification task. First, we instrument the OpenAI client so that every call it makes is traced in Phoenix.
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)
Next, we define our task, which takes an image and classifies it:
from openai import OpenAI


def task(input):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image? Your answer should be a single word. The word should be one of the following: " + ", ".join(label_map.values())},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": input["img"],
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
    )
    output_label = response.choices[0].message.content.lower()
    return output_label
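Before running the full experiment, it can be helpful to sanity-check the task on a single row of the dataframe (this assumes your OPENAI_API_KEY is set):

# Try the task on the first example and compare against the ground-truth label
example = df.iloc[0]
print("predicted:", task({"img": example["img"]}))
print("expected: ", example["label"])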
Setting Up Evaluators
For this experiment, our evaluators will be very simple, since we already have ground truth labels in our original dataset to compare to. We simply need to check if the model’s output matches the expected label.
def matches_expected_label(expected, output):
    return expected["label"] == output
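The task already lowercases the model’s output, but LLMs occasionally add punctuation or stray whitespace. If you see spurious failures, a slightly more forgiving evaluator (a sketch, not part of the original tutorial) could normalize the output before comparing:

def loosely_matches_expected_label(expected, output):
    # Strip whitespace and trailing punctuation from the model output
    # before comparing it to the ground-truth label
    return expected["label"] == output.strip().strip(".!\"'")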
Running the Experiment
With everything set up, we can now run the experiment. This function processes each image in the dataset, classifies it, and evaluates the results:
from phoenix.experiments import run_experiment

import nest_asyncio
nest_asyncio.apply()

run_experiment(
    task=task,
    evaluators=[matches_expected_label],
    dataset=test_cases,
    experiment_description="Image classification experiment",
    experiment_metadata={"model": "gpt-4o-mini"},
)
From here, you can modify your task or classification code as much as you’d like, and re-run the experiment to see how your new code compares to this baseline.
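For example, you might tighten the prompt and cap the output length, then run the same experiment against the same dataset to compare results. The variant below is just an illustration of the pattern, not part of the original tutorial:

# A hypothetical variant of the task with a stricter prompt and shorter output
def task_v2(input):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Classify this image. Respond with exactly one of: " + ", ".join(label_map.values())},
                    {"type": "image_url", "image_url": {"url": input["img"]}},
                ],
            }
        ],
        max_tokens=10,
    )
    return response.choices[0].message.content.lower()

run_experiment(
    task=task_v2,
    evaluators=[matches_expected_label],
    dataset=test_cases,
    experiment_description="Image classification experiment - tightened prompt",
    experiment_metadata={"model": "gpt-4o-mini"},
)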
Additional Resources
For more information, check out the Phoenix documentation.