Analyze and Compare Models
- Updated on 05 Aug 2024
This article applies to these versions of LandingLens:
LandingLens | LandingLens on Snowflake |
---|---|
✓ | ✓ |
Use the Models page to analyze and compare model performance across multiple datasets in a project. The Models page gives you the tools to:
- Analyze how a model performed. Quickly see how the model performed on its Train, Dev, and Test sets. You can also view the model's Loss chart, Validation chart, F1 or IoU score, and predictions. For more information, go to Model Information.
- See how a model performs on different datasets. When you train a model, you can see how it performed on the dataset it was trained with. On the Models page, you can add more datasets (called "evaluation sets") and run the models on those images to see how the model performs. To get started, go to Evaluation Sets.
- Compare two models. When you run a model comparison, LandingLens shows you the differences in F1 or IoU score and the number of correct and incorrect predictions. You can use this information to fine-tune your labels, datasets, and hyperparameters. To compare models, go to Compare Two Models.
- Deploy models: After analyzing and comparing model performance, choose which model or models you want to deploy. Go to Cloud Deployment.
How do I use the Models table to see what model is the best for my project?
You can use the Models table to quickly evaluate model performance across different datasets. You can also see how the same model performs on those datasets when different confidence thresholds are applied.
There is no one-size-fits-all solution, but quickly comparing model performance can help you identify 1) what model works best for your use case and 2) what models might need better images or labels.
Here are some considerations:
- If two models have the same confidence threshold but different scores on the same datasets, view the predictions for the model with the lower score. Are the labels correct? Do you need more images of a specific class?
- If a model has a higher score on a dataset that is most like your real-world scenario, that model might be the best one for your use case.
Models Table Overview
Here's a quick orientation to the Models table:
# | Item | Description |
---|---|---|
1 | Model | The model name and training method (customized or default). |
2 | Evaluation sets | These columns are your evaluation sets: sets of images used to evaluate model performance. The model's Train, Dev, and Test sets display by default. You can add more datasets and run the models on those sets. Each cell shows the F1 score (for Object Detection and Classification projects) or the IoU score (for Segmentation projects). |
3 | Confidence Threshold | The Confidence Threshold for the model. The confidence score indicates how confident the model is that its prediction is correct. The confidence threshold is the minimum confidence score a prediction must meet for the model to return it as a prediction. Typically, a lower confidence threshold means you see more predictions, and a higher confidence threshold means you see fewer. When LandingLens creates a model, it selects the confidence threshold that yields the best F1 score across all labeled data. |
4 | Deployment | Deploy the model via Cloud Deployment. If the model has been deployed with Cloud Deployment, an icon for each endpoint displays. |
5 | More Actions | Favorite, deploy, and delete models, or copy the Model ID. |
Model Information
The Model column displays the model name and its training method:
- Default configuration: Trained using Fast Training.
- Customized configuration: Trained using Custom Training.
Click the cell to see the model's Training Information and Performance Report.
A model can have multiple rows. For example, if you deploy a model and select a confidence threshold that is not the default one, then two rows for the model display in the table. The first row has the default confidence threshold, and the second has the custom confidence threshold.
For example, in the screenshot below, the default confidence threshold is 0.71, and the custom confidence threshold is 0.99.
Training Information
Clicking a model on the Models page opens the Training Information tab. This tab shows basic information about the model and the dataset it was trained on.
Highlights include:
# | Item | Description |
---|---|---|
1 | Loss Chart | The Loss chart is calculated on the Train split, which is the split that the model trains on. During model training, LandingLens calculates the error between the ground truth and the predictions, which is called loss. This chart shows the loss over time (in seconds). If the model improves during the training process, the line goes down toward 0 over time. |
2 | Validation Chart | The Validation chart is calculated on the Dev split. This chart displays when the model was trained using Custom Training and the Dev split has at least 6 images. If the model improves during the validation process, the line goes up over time. The line looks slightly different for each project type, because each project type uses a different validation metric. |
3 | Trained From | The name of the dataset snapshot that the model was trained on. |
4 | Split | Shows how many images are in each split. |
5 | View Images | Click View Images to see the dataset snapshot that the model was trained on. |
6 | Hyperparameters, Transforms, and Augmentations | The configurations used to train the model. For Fast Training (default configuration), this includes Hyperparameters, which are the number of epochs and the model size. For Custom Training (customized configuration), this also includes any Transforms and Augmentations. For more information about these configurations, go to Custom Training. |
Performance Report
Clicking an evaluation set score on the Models page opens the Performance Report tab (you can also click a model on the Models page and then select this tab).
This report shows how the model performed on the selected evaluation set (and not for the entire dataset). You can select different sets from the Evaluation Set drop-down menu.
The bottom part of the report compares the ground truth (your labels) to the model's predictions. You can filter by prediction type (False Positive, False Negative, Mis-Classification, and Correct Predictions) and sort by model performance.
The Performance Report and Build Tab May Have Different Results
The results in the Performance Report might be different than the results in the Build tab. This is because the Performance Report is based on a specific version of a dataset—the images and labels never change.
However, the results on the Build tab are “live” and might change based on any updates to images or labels.
For example, let’s say that you train a model and create an evaluation set based on the dataset currently in the Build tab. You then add images and labels. This leads to the performance and results being different, as shown in the screenshots below.
Adjust Threshold
To see how the model performs on the evaluation set with a different Confidence Threshold, click Adjust Threshold and select a different score.
Overall Score for the Evaluation Set
The Performance Report includes a score for the evaluation set (and not for the entire dataset). The type of score depends on the project type:
Object Detection and Classification: F1 Score
The Performance Report includes the F1 score for Object Detection and Classification projects.
The F1 score combines precision and recall into a single score, creating a unified measure that assesses the model’s effectiveness in minimizing false positives and false negatives. A higher F1 score indicates the model is balancing the two factors well. LandingLens uses micro-averaging to calculate the F1 score.
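To make the micro-averaging concrete, the sketch below pools true positives (TP), false positives (FP), and false negatives (FN) across all classes and then computes a single precision, recall, and F1. The class names and counts are invented for illustration; this is not LandingLens code:

```python
def micro_f1(counts):
    """Micro-averaged F1: pool TP/FP/FN across all classes first,
    then compute precision, recall, and F1 once on the pooled counts."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class detection counts (illustrative numbers only)
counts = {
    "Scratch": {"tp": 8, "fp": 2, "fn": 1},
    "Dent":    {"tp": 4, "fp": 1, "fn": 2},
}
print(round(micro_f1(counts), 3))  # 0.8
```

Because micro-averaging pools every instance before computing the score, classes with many labeled objects influence the F1 score more than rare classes.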
Segmentation: Intersection Over Union (IoU)
The Performance Report includes the Intersection over Union (IoU) score for Segmentation projects.
Intersection over Union (IoU) measures the accuracy of the model by measuring the overlap between the predicted and actual masks in an image. A higher IoU indicates better agreement between the ground truth and the predicted mask. LandingLens does not include the implicit background class in the IoU calculation, and it does not use micro-averaging.
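As an illustration, IoU divides the number of pixels where the predicted and ground-truth masks overlap by the number of pixels covered by either mask. A minimal sketch with toy 0/1 masks (not LandingLens code):

```python
def mask_iou(pred, gt):
    """IoU between a predicted mask and a ground-truth mask, both given
    as same-sized 2D grids of 0/1 values: |intersection| / |union|."""
    intersection = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += p & g  # pixel is in both masks
            union += p | g         # pixel is in at least one mask
    return intersection / union if union else 0.0

pred = [[1, 1, 0],
        [0, 1, 0]]
gt   = [[1, 1, 0],
        [0, 1, 1]]
print(mask_iou(pred, gt))  # 0.75: 3 overlapping pixels out of 4 covered
```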
Download CSV of Evaluation Set
For Object Detection and Classification projects, click Download CSV to download a CSV of information about the images in the evaluation set. The CSV includes several data points for each image, including the labels ("ground truth") and model's predictions.
CSV Data for Evaluation Set
The CSV includes the information described in the following table.
Item | Description | Example |
---|---|---|
Image ID | Unique ID assigned to the image. | 30243316 |
Image Name | The file name of the image uploaded to LandingLens. | sample_003.jpg |
Image Path | The URL of where the image is stored. | s3://path/123/abc.jpg |
Model ID | Unique ID assigned to the model. | a3c5e461-0786-4b17-b0a8-9a4bfb8c1460 |
Model Name | The name of the model in LandingLens. | Model-06-04-2024_5 |
GT_Class | The classes you assigned to the image (ground truth or "GT"). For Object Detection, this also includes the number of objects you labeled. | {"Scratch":3} |
PRED_Class | The classes the model predicted. For Object Detection, this also includes the number of objects predicted. If the model didn't predict any objects, the value is {"null":1}. | {"Scratch":2} |
Model_Correct | If the model's prediction matched the original label (ground truth or “GT”), the value is TRUE. If the model's prediction didn't match the original label (ground truth or “GT”), the value is FALSE. Only applicable to Classification projects. | TRUE |
PRED_Confidence | The model's confidence score for its prediction. Only applicable to Classification projects. | 0.9987245 |
GT-PRED JSON | The JSON output comparing the original labels (ground truth or "GT") to the model's predictions. For more information, go to JSON Output. | {"gtDefectName":"No Fire","predDefectName":"No Fire","predConfidence":0.9684047} |
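As an illustration of working with this file, the sketch below parses two rows in the format above and computes classification accuracy from the Model_Correct column. The rows, file contents, and class names are made up; only the column names come from the table above:

```python
import csv
import io
import json

# Two sample rows (illustrative values; only a subset of the columns shown)
sample = """Image Name,GT_Class,PRED_Class,Model_Correct
sample_003.jpg,"{""Fire"": 1}","{""Fire"": 1}",TRUE
sample_004.jpg,"{""Fire"": 1}","{""No Fire"": 1}",FALSE
"""

correct = total = 0
for row in csv.DictReader(io.StringIO(sample)):
    gt = json.loads(row["GT_Class"])      # e.g. {"Fire": 1}
    pred = json.loads(row["PRED_Class"])
    total += 1
    if row["Model_Correct"] == "TRUE":
        correct += 1
    else:
        print(f"{row['Image Name']}: expected {gt}, predicted {pred}")

print(f"accuracy: {correct / total:.2f}")  # accuracy: 0.50
```

The GT_Class and PRED_Class cells are JSON strings, so `json.loads` gives you the class-to-count mapping for each image.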
Evaluation Sets
The Models table shows how each model performs on different sets of images. These image sets are called evaluation sets, because they're used to evaluate model performance.
The default evaluation sets are the Train, Dev, and Test splits for the models. You can add evaluation sets.
Click a cell to see the Performance Report for that evaluation set.
Evaluation Set Scores
A good indication that a model performs well is that its Train and Dev set scores are high and similar to each other.
The score for the Train set might be higher than the scores for the other splits, because these are the images that the model trains on. It is normal for the Train set score to be less than 100% because models usually make mistakes during the training process.
In fact, a score of 100% on the Train set might indicate overfitting, especially if the Dev set score is much lower. If the two scores are very different, try adding more images to these sets.
Similarly, the score for the Test set might be lower than the scores for the other splits, because the model is not trained on these images.
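The overfitting heuristic above can be sketched as a small check. The 0.99 and 0.15 cutoffs are illustrative assumptions, not thresholds used by LandingLens:

```python
def overfitting_warning(train_score, dev_score, gap=0.15):
    """Flag possible overfitting: a (near-)perfect Train score combined
    with a much lower Dev score. The 0.99 and 0.15 cutoffs are
    illustrative, not LandingLens values."""
    return train_score >= 0.99 and (train_score - dev_score) > gap

print(overfitting_warning(1.00, 0.70))  # True: large Train/Dev gap
print(overfitting_warning(0.92, 0.89))  # False: scores are close
```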
The following image and table explain the evaluation set scores.
# | Item | Description |
---|---|---|
1 | Percentage | Shows the F1 score (for Object Detection and Classification projects) or the IoU score (for Segmentation projects). Learn more about these scores in Overall Score for the Evaluation Set. |
2 | -- | The subset doesn't have any images. If you don't assign splits to a dataset before you train a model, LandingLens automatically assigns images to the Train and Dev splits, but not the Test split. Therefore, you will see "--" for the Test split in that situation. |
3 | Blank | The model hasn't run on the set yet. To run the model, hover over the cell and click Evaluate. For more information, go to Run the Model on a "Blank" Set. |
Run the Model on a "Blank" Set
If an evaluation set cell is blank, hover over the cell and click Evaluate. The model runs inference on the images in that evaluation set and displays the score.
Add Evaluation Sets and Run Models on Them
By default, each model's performance score for its Train, Dev, and Test set scores displays in the Models table. You can add more datasets. These are called evaluation sets, because they're used to evaluate model performance.
To add an evaluation set:
- Open the project to the Models tab.
- Click Add Evaluation Set. If you've already dismissed this message, click + in the table header.
- Select a snapshot.
- If you want to run the model only on one of the splits, click that split.
- Click Add to the Table.
- LandingLens adds a column for that dataset. To run a model on the dataset, hover over the cell and click Evaluate. (To prevent slowing down the system, LandingLens doesn't automatically run each model on the evaluation sets. Click Evaluate for each model / evaluation set combination that you want to run.)
- The model runs inference on the images in that evaluation set and displays the F1 or IoU score.
- Click the percentage to open the Performance Report.
Archive Evaluation Sets
You can archive evaluation sets. This removes the evaluation set column from the Models table. You can later add the evaluation set to the table again.
To archive an evaluation set:
- Open the project to the Models tab.
- Hover over the area to the left of the evaluation set name.
- Click the Archive icon that appears.
- Click Yes on the pop-up window to confirm the action.
Confidence Threshold
The Confidence Threshold column shows the Confidence Threshold for that model.
The confidence score indicates how confident the model is that its prediction is correct. The confidence threshold is the minimum confidence score a prediction must meet for the model to return it as a prediction. Typically, a lower confidence threshold means you see more predictions, and a higher confidence threshold means you see fewer.
When LandingLens creates a model, it selects the confidence threshold with the best F1 score for all labeled data.
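The effect described above can be sketched as a simple filter over a model's raw predictions. The prediction data and threshold values are invented for illustration; this is not how LandingLens is implemented internally:

```python
def filter_predictions(predictions, threshold):
    """Keep only predictions whose confidence score meets the threshold.
    A lower threshold keeps more predictions; a higher one keeps fewer."""
    return [p for p in predictions if p["confidence"] >= threshold]

preds = [
    {"class": "Scratch", "confidence": 0.95},
    {"class": "Scratch", "confidence": 0.74},
    {"class": "Dent",    "confidence": 0.40},
]
print(len(filter_predictions(preds, 0.71)))  # 2 predictions survive
print(len(filter_predictions(preds, 0.99)))  # 0 predictions survive
```

This is why the same model can appear with different scores in the Models table: changing the threshold changes which predictions count, which in turn changes the F1 or IoU score.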
Cloud Deployment
The Deployment column allows you to deploy a model via Cloud Deployment, and to see how many times the model has been deployed via Cloud Deployment.
To start the deployment process, click the Deploy or + button in the Deployment column. For more information, go to Cloud Deployment.
A Cloud icon displays for each deployment. Click an icon to see the deployment details for the model. LandingLens cycles through seven colors for the Cloud icon.
More Actions
In the last column, you can:
Favorite Models
To mark a model as a "favorite", click the Favorite (star) icon. This changes the star color to yellow, so that you can easily see which models in the table you've marked as favorites. You can favorite multiple models. To unfavorite a model, click the Favorite icon again.
To filter by favorites, select the Only show favorite models checkbox.
Copy Model ID
If you're deploying a model via Docker, the Model ID is included in the deployment command. The Model ID tells the application which model to download from LandingLens. To locate the Model ID on the Models page, click the Actions (...) icon and select Copy Model ID.
Delete Models
You can delete a model from the table. This action removes the model only from the table; you can still deploy it and access it from other areas in LandingLens, like Dataset Snapshots.
To delete a model, click the Actions (...) icon and select Delete. A model can't be re-added to this table after it's been deleted.
Compare Two Models
The Models page is a great way to get a high-level view of how different models performed on multiple datasets at once. However, if you'd like to see more details about how two specific models compare, use the Compare Models tool. The Compare Models tool is a great way to evaluate performance on multiple iterations of the model. It can help you identify if you need to improve your labels, datasets, and hyperparameters.
When you run the Compare Models tool, you set a baseline model and a candidate model. LandingLens then shows you if the candidate model performed better or worse for each prediction outcome (False Positive, False Negative, Correct, etc). You can even see a side-by-side comparison of how the baseline and candidate models performed on each image in the dataset.
Run the Compare Models Tool
To compare two models:
- Open the project to the Models tab.
- Hover over the cell for one of the models you want to compare and click the Compare icon that appears. This model will be the baseline in the comparison. In other words, the second model will be compared as either better or worse than this model.
- Click the cell for the second model you want to compare. This model will be the candidate in the comparison. In other words, this model will be compared as either better or worse than the first model. Note: Want to switch the baseline and candidate models? Click the Switch icon.
- Click Compare.
- The Compare Models window opens and shows the difference in performance between the two models.
Model Performance
The top of the Compare Models window shows scores for the baseline and candidate models. Click the link below a model to see the Training Information for that model.
The score type depends on the project type:
- Object Detection: F1 score
- Segmentation: IoU (Intersection over Union)
- Classification: F1 score
The window also shows the difference in score between the two models. Click the link below the score difference to see the dataset snapshot that the models were evaluated on.
Compare Training Settings
In the Compare Models window, click Compare Training Settings.
This opens a table with a side-by-side comparison of the settings used to train each model. Differences in settings are highlighted.
Confusion Matrices
By default, the Compare Models window compares the two models using a confusion matrix for each prediction outcome. A confusion matrix is a table that visualizes the performance of an algorithm—in this case, the two computer vision models that you're comparing.
First, the data is grouped into tables (confusion matrices) based on prediction outcome. The prediction outcomes include:
- False Positive: The model predicted that an object of interest was present, but the model was incorrect. This is only applicable to Object Detection and Segmentation projects.
- False Negative: The model predicted that an object of interest was not present, but the model was incorrect. This is only applicable to Object Detection and Segmentation projects.
- Misclassified: The model correctly predicted that an object of interest was present, but it predicted the wrong class.
- Correct: The model’s prediction was correct. This includes True Positives and True Negatives.
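The outcome definitions above can be sketched as a small function. This is a simplification: in Object Detection and Segmentation, matching a prediction to a ground-truth object also involves localization, which is omitted here. `None` stands for "no object of interest":

```python
def outcome(gt_class, pred_class):
    """Categorize one prediction against its ground truth, following the
    outcome definitions above. None means "no object" on that side."""
    if gt_class is None and pred_class is not None:
        return "False Positive"   # predicted an object that isn't there
    if gt_class is not None and pred_class is None:
        return "False Negative"   # missed an object that is there
    if gt_class != pred_class:
        return "Misclassified"    # found the object, but wrong class
    return "Correct"              # true positive or true negative

print(outcome(None, "Scratch"))    # False Positive
print(outcome("Scratch", None))    # False Negative
print(outcome("Scratch", "Dent"))  # Misclassified
print(outcome("Dent", "Dent"))     # Correct
```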
Ground Truth and Predictions
Each confusion matrix focuses on a specific prediction outcome (False Positive, False Negative, etc.). Each row in a matrix represents an instance of that outcome in the baseline model, the candidate model, or both. The first column is the Ground truth, which is the labeled class on the image in the dataset. The second column is the Prediction, which is the class that the baseline or candidate model predicted incorrectly.
Baseline, Candidate, and Differences
For each Ground Truth / Prediction pairing in a confusion matrix, LandingLens shows how each model performed and how the candidate model either improved or got worse. This information is displayed in the Baseline, Candidate, and Differences columns.
The values in the Baseline and Candidate columns depend on the project type:
- Object Detection and Classification: The number of times that the model made that prediction for the specific Ground Truth / Prediction pairing.
- Segmentation: The number of pixels for which the model made that prediction for the specific Ground Truth / Prediction pairing.
The Differences column shows if the candidate model improved or got worse, when compared to the baseline model. The following table describes the possible outcomes in the Differences column.
Outcome | Description |
---|---|
Green | The candidate performed better than the baseline. |
Red | The candidate performed worse than the baseline. |
Fixed | The baseline made errors, but the candidate did not. In other words, the candidate "fixed" all of the issues that the baseline had. |
New Error | The baseline did not make errors, but the candidate did. In other words, the candidate introduced a "new error type" that wasn't present in the baseline. |
Percentage | Both the baseline and candidate made errors. In this case, the Difference is calculated as: ((candidate - baseline) / baseline) * 100 |
-- | This is only applicable to the Correct category. Either the baseline or candidate made mistakes, but the other model did not. |
For example, this is how the Differences column looks in an Object Detection project:
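The Fixed / New Error / Percentage rules above can be sketched as a small function over error counts. This is a simplification of the table: the handling of the case where neither model made errors is an assumption, and the Correct category (which uses "--" differently) is not modeled:

```python
def difference(baseline_errors, candidate_errors):
    """Differences cell for one Ground Truth / Prediction pairing,
    following the Fixed / New Error / Percentage rules above."""
    if baseline_errors > 0 and candidate_errors == 0:
        return "Fixed"       # candidate fixed all of the baseline's errors
    if baseline_errors == 0 and candidate_errors > 0:
        return "New Error"   # candidate introduced errors the baseline lacked
    if baseline_errors == 0 and candidate_errors == 0:
        return "--"          # assumption: nothing to compare on either side
    change = (candidate_errors - baseline_errors) / baseline_errors * 100
    return f"{change:+.0f}%"

print(difference(10, 0))   # Fixed
print(difference(0, 3))    # New Error
print(difference(10, 6))   # -40%
print(difference(10, 15))  # +50%
```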
View Images for a Confusion Matrix
Click View next to the Differences column of a confusion matrix (or simply click the row) to see the images included in that Ground Truth / Prediction pairing. Click an image to see a larger version of it.
View Images - Overlays
Images have overlays that show the relevant predictions for the confusion matrix. If the model missed an object of interest, the overlay is white. Otherwise, the overlay colors correlate to the class colors.
The overlay formatting is different for each project type, as described in the following sections.
View Images - Object Detection
Each relevant prediction displays as an overlay. The number of predictions displays in the bottom right corner of the image. Some confusion matrices have additional overlays, as described in the following table.
Confusion Matrix | Overlay Description | Example |
---|---|---|
False Positive | The overlay includes the confidence score of the prediction. | |
False Negative | The overlay includes "Missed", because the model predicted that an object of interest was not present, but the model was incorrect. | |
Correct | The overlay includes the confidence score of the prediction. |
View Images - Segmentation
There are overlays over the regions that the model predicted incorrectly. Note that the overlay does not show the full prediction, only the part that was wrong for this specific Ground Truth / Prediction pairing.
The overlay format is slightly different for each confusion matrix, as described in the following table.
Confusion Matrix | Overlay Description | Example |
---|---|---|
False Positive | There is a purple striped overlay over the regions that the model predicted incorrectly. | |
False Negative | There is a white striped overlay over the regions that the model "missed". | |
Correct | There is a purple striped overlay over the regions that the model predicted correctly. |
View Images - Classification
Because classes are assigned to an entire image in Classification projects, it doesn't make sense to show the predictions as an overlay. Therefore, LandingLens shows the images and lists the ground truth and predictions next to the images.
Compare All Images
Click All Images to see a visual comparison of all images. This shows three versions of each image in the evaluation dataset:
- Ground truth: The original image with the ground truth (labels) that you added.
- Baseline model: The original image with the baseline model's predictions.
- Candidate model: The original image with the candidate model's predictions. LandingLens highlights an image if the candidate model performed better or worse than the baseline for that specific image.
Click a set of images to see more information about those images.
Failed Models
If the model training process fails, you can access more information about that failure to help you troubleshoot. Each project has a Failed Jobs page that shows a list of model training processes that failed in the past 7 days.
To access a project's Failed Jobs page:
- Open the project to the Models tab.
- Click the Actions (...) icon and select Failed Jobs.
- Information about model training jobs that failed in the past 7 days displays.