Performance Report
- Updated on 18 Feb 2025
This article applies to these versions of LandingLens:
LandingLens | LandingLens on Snowflake |
---|---|
✓ | ✓ |
Clicking an evaluation set score on the Models page opens the Performance Report tab (you can also click a model on the Models page and then select this tab).
This report shows how the model performed on the selected evaluation set (not on the entire dataset). You can select a different set from the Evaluation Set drop-down menu.
Performance Report
Analyze Model Performance
Watch the following video to learn how to use the Performance Report and related tools to analyze and improve model performance.
Adjust Threshold
If you have an Object Detection or Segmentation project, you can see how the model performs on the evaluation set with different Confidence Thresholds. To do this:
- Open the Performance Report.
- Click Adjust.
Click "Adjust"
- Change the Confidence Threshold by using the slider or entering a value in the text box.
- If you want to see a full performance report for the selected threshold, click Generate a New Report.
See How Different Confidence Thresholds Impact Performance
- LandingLens creates a new performance report for the selected threshold. (This is a temporary report. If you close and then later reopen the report, the data will be for the original Confidence Threshold.)
See the Full Performance Report for the Selected Confidence Threshold
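Conceptually, changing the Confidence Threshold filters out predictions whose confidence falls below the cutoff before the metrics are recalculated. The following is a minimal sketch of that idea; the prediction structure and values are assumptions made for illustration, not LandingLens code or output.

```python
# Minimal sketch of confidence-threshold filtering (illustrative only; not LandingLens code).
# The prediction structure below is an assumption made for this example.
predictions = [
    {"label": "Scratch", "confidence": 0.93},
    {"label": "Scratch", "confidence": 0.61},
    {"label": "Dent", "confidence": 0.34},
]

def apply_confidence_threshold(preds, threshold):
    """Keep only predictions at or above the Confidence Threshold."""
    return [p for p in preds if p["confidence"] >= threshold]

# A higher threshold keeps fewer (but more confident) predictions,
# which typically raises Precision and lowers Recall.
print(len(apply_confidence_threshold(predictions, 0.3)))  # 3 predictions kept
print(len(apply_confidence_threshold(predictions, 0.7)))  # 1 prediction kept
```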
Overall Score for the Evaluation Set
The Performance Report includes a score for the evaluation set (and not for the entire dataset). The type of score depends on the project type:
Object Detection and Classification: F1 Score
The Performance Report includes the F1 score for Object Detection and Classification projects.

For Object Detection, the F1 score combines precision and recall into a single score, creating a unified measure that assesses the model’s effectiveness in minimizing false positives and false negatives. A higher F1 score indicates the model is balancing the two factors well. LandingLens uses micro-averaging to calculate the F1 score.
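For reference, the standard F1 formula, with Precision and Recall micro-averaged over the true positive, false positive, and false negative counts of all classes, is:

$$ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$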
For Classification, the F1, Precision, and Recall scores are identical. This is because Classification models have only two prediction outcomes: "Correct" and "Misclassified". Therefore, the F1, Precision, and Recall scores for Classification models are all calculated using this algorithm:
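As a sketch of that shared calculation (assuming each image has exactly one ground truth class and one predicted class):

$$ F1 = \text{Precision} = \text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{Misclassified}} $$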
Segmentation: Intersection Over Union (IoU)
The Performance Report includes the Intersection over Union (IoU) score for Segmentation projects.

Intersection over Union (IoU) measures the accuracy of the model by quantifying the overlap between the predicted and actual masks in an image. A higher IoU indicates better agreement between the ground truth and the predicted mask. LandingLens does not include the implicit background class or use micro-averaging when calculating the IoU.
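For reference, the standard IoU definition, applied here to the pixels of the predicted and ground truth masks, is:

$$ IoU = \frac{|\text{Prediction} \cap \text{Ground Truth}|}{|\text{Prediction} \cup \text{Ground Truth}|} $$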
Precision Score for Evaluation Set
The Performance Report includes the Precision score for the evaluation set (and not for the entire dataset).

Precision is the model’s ability to be correct when it makes a prediction. Precision answers the natural language question, “When the model makes a prediction, how often is it correct?” The higher the Precision score, the more accurate the model’s predictions are.
For Object Detection and Segmentation, Precision is calculated using this algorithm:
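In standard terms (where a true positive is a correct prediction and a false positive is a prediction with no matching label), the calculation is:

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$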
For Classification, the F1, Precision, and Recall scores are identical. This is because Classification models have only two prediction outcomes: "Correct" and "Misclassified". Therefore, the F1, Precision, and Recall scores for Classification models are all calculated using the same algorithm, shown in the F1 Score section above.
Recall Score for Evaluation Set
The Performance Report includes the Recall score for the evaluation set (and not for the entire dataset).

Recall is the model’s ability to find all objects of interest. Recall answers the natural language question, “Of all the labels (ground truths) in the dataset, what percent of them are found by the model?” It conveys how well the model identifies all the actual positive instances in the dataset. The higher the Recall score, the lower the chance that the model produces a false negative.
For Object Detection and Segmentation, Recall is calculated using this algorithm:
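In standard terms (where a false negative is a label the model failed to find), the calculation is:

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$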
For Classification, the F1, Precision, and Recall scores are identical. This is because Classification models have only two prediction outcomes: "Correct" and "Misclassified". Therefore, the F1, Precision, and Recall scores for Classification models are all calculated using the same algorithm, shown in the F1 Score section above.
Download CSV of Evaluation Set
For Object Detection and Classification projects, click Download CSV to download a CSV file with information about the images in the evaluation set. The CSV includes several data points for each image, including the labels ("ground truth") and the model's predictions.

CSV Data for Evaluation Set
The CSV includes the information described in the following table.
Item | Description | Example |
---|---|---|
Image ID | Unique ID assigned to the image. | 30243316 |
Image Name | The file name of the image uploaded to LandingLens. | sample_003.jpg |
Image Path | The URL of where the image is stored. | s3://path/123/abc.jpg |
Model ID | Unique ID assigned to the model. | a3c5e461-0786-4b17-b0a8-9a4bfb8c1460 |
Model Name | The name of the model in LandingLens. | Model-06-04-2024_5 |
GT_Class | The classes you assigned to the image (ground truth or “GT”). For Object Detection, this also includes the number of objects you labeled. | {"Scratch":3} |
PRED_Class | The classes the model predicted. For Object Detection, this also includes the number of objects predicted. If the model didn't predict any objects, the value is {"null":1}. | {"Scratch":2} |
Model_Correct | TRUE if the model's prediction matched the original label (ground truth or “GT”); FALSE if it didn't. Only applicable to Classification projects. | TRUE |
PRED_Confidence | The model's confidence score for its prediction. Only applicable to Classification projects. | 0.9987245 |
GT-PRED JSON | The JSON output comparing the original labels (ground truth or "GT") to the model's predictions. For more information, go to JSON Output. | {"gtDefectName":"No Fire","predDefectName":"No Fire","predConfidence":0.9684047} |
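As an illustration of how the exported data can be used, the following sketch loads a downloaded CSV with pandas and summarizes Classification results. The file name is hypothetical, and the column names are taken from the table above; adjust them if your export uses different headers.

```python
import pandas as pd

# Hypothetical file name; use the path of the CSV you downloaded from LandingLens.
df = pd.read_csv("evaluation_set.csv")

# Model_Correct is TRUE/FALSE for Classification projects (see the table above).
is_correct = df["Model_Correct"].astype(str).str.upper() == "TRUE"
print(f"Correct: {is_correct.sum()}/{len(df)} ({is_correct.mean():.1%})")

# List the misclassified images along with ground truth and predicted classes.
misclassified = df[~is_correct]
print(misclassified[["Image Name", "GT_Class", "PRED_Class", "PRED_Confidence"]])
```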
Confusion Matrix
The Performance Report includes a Confusion Matrix that counts ground truth labels versus model predictions. The confusion matrix shown here is for the selected evaluation set.
The y-axis represents each ground truth label. The x-axis represents each possible model prediction.
Each cell shows the count of instances that correspond to a particular ground truth class / predicted class pair. For example, in the image below, the model correctly predicted the class "Wheat" 6 times and misclassified it 2 times.

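To make the counting rule concrete, here is a minimal sketch (not LandingLens code) that builds a confusion matrix from lists of ground truth and predicted classes. The class names and values are assumptions for illustration only.

```python
from collections import Counter

# Assumed example data: one ground truth class and one predicted class per image.
ground_truth = ["Wheat", "Wheat", "Wheat", "Rice", "Corn", "Wheat"]
predicted    = ["Wheat", "Corn",  "Wheat", "Rice", "Corn", "Wheat"]

# Count each (ground truth, prediction) pair, mirroring the cells of the matrix.
cells = Counter(zip(ground_truth, predicted))

# Print a simple text version of the matrix: rows are ground truth, columns are predictions.
classes = sorted(set(ground_truth) | set(predicted))
print("GT \\ Pred".ljust(10) + "".join(c.ljust(8) for c in classes))
for gt in classes:
    print(gt.ljust(10) + "".join(str(cells.get((gt, pred), 0)).ljust(8) for pred in classes))
```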
Precision Score for Class
The Precision score for each class is listed along the x-axis. Precision answers the natural language question, “When the model predicts Class A, how often is it correct?”
The Precision score for a class is the percentage of instances in which the model correctly predicted the class, out of all instances in which the model predicted that class. It is calculated using this equation:
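In standard notation, for a given class:

$$ \text{Precision}_{\text{class}} = \frac{\text{True Positives}_{\text{class}}}{\text{True Positives}_{\text{class}} + \text{False Positives}_{\text{class}}} $$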
For example, let’s calculate the Precision score for the Wheat class in the image below. The model predicts Wheat 7 times. Of those, 6 are correct (True Positives) and 1 is incorrect (False Positives). When we plug those numbers into the Precision equation, we see that the Precision for this class is 85.7%.
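Using the numbers from that example:

$$ \text{Precision}_{\text{Wheat}} = \frac{6}{6 + 1} = \frac{6}{7} \approx 0.857 = 85.7\% $$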

Recall Score for Class
The Recall score for each class is listed along the y-axis. Recall answers the natural language question, “Of all the Class As in the dataset, what percent of them are found by the model?”
The Recall score for a class is the percentage of actual instances of the class that the model correctly predicted. It is calculated using this equation:
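In standard notation, for a given class:

$$ \text{Recall}_{\text{class}} = \frac{\text{True Positives}_{\text{class}}}{\text{True Positives}_{\text{class}} + \text{False Negatives}_{\text{class}}} $$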
For example, let’s calculate the Recall score for the Wheat class in the image below. The dataset has 8 instances of Wheat. The model predicts 6 instances correctly (True Positives) and 2 instances incorrectly (False Negatives). When we plug those numbers into the Recall equation, we see that the Recall for this class is 75.0%.
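Using the numbers from that example:

$$ \text{Recall}_{\text{Wheat}} = \frac{6}{6 + 2} = \frac{6}{8} = 0.75 = 75.0\% $$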

Use Colors to Help Interpret Performance
Each cell has a color that can quickly help you identify correct classifications and errors. Darker colors indicate a higher number, and lighter colors indicate a lower number.
For example, if the model correctly predicts all instances, only the cells on the diagonal will be blue and have non-zero values. See the following image as an example.

If any off-diagonal cells contain values, start by looking for the darker colors to understand where the model is making errors. For example, let’s say you want to evaluate performance for the Rice class using the confusion matrix below. The model correctly predicted 1 instance and misclassified 3 instances. Consider looking at the instances that were misclassified as Corn first, since that cell is darker (and has a higher number) than the Wheat cell.

Click a Cell for Detailed Information
Click a cell in the confusion matrix to see detailed information for that ground truth / prediction pairing. For example, click the Wheat/Corn cell in the image below to see the image that was labeled as Wheat but predicted as Corn.
The table on the left shows the ground truth / prediction pairings for all images in the evaluation set. The section on the right shows the images that represent the cell you clicked.

Analyze All Images
Click Analyze All Images to see all images with their ground truth labels and predictions. Click an image to see a larger version.
For Object Detection and Segmentation, LandingLens shows a side-by-side comparison of the ground truth labels and predictions on each image in the dataset.

For Classification, LandingLens shows each image, the ground truth class, and whether the model predicted the class correctly (green check mark) or not (red "x").
