How to use test files to analyze my AI model

Index:
- Use
- Results
- File update
- Versioning

The test file is a list of critical interactions that you want the NLP model to classify. The model itself is built from the knowledge base (intents and entities). The purpose of this file is to validate the model's accuracy, more specifically, to ensure that the model correctly identifies the intents behind the chatbot's most critical interactions. Critical questions are interactions related to skills (and content) that the chatbot cannot, under any circumstances, fail to answer. The recommendation is to collect real user interactions that fall within these critical topics. Tip: use the filters on the Upgrade screen to find these interactions.

This file is important because it lets you validate changes made to the knowledge base, ensuring that those changes have no negative impact on the model, that is, everything that was recognized correctly before continues to be recognized correctly.

The file must be in .csv format: the first column contains the questions and the second contains the id of the intent the model is expected to recognize for each question. You can use the BLiP Build AI Model Analysis File tool to build this file easily (a minimal example of the layout is sketched after the Results section below).

Use

The file is used on the AI Model Analysis screen, where you can create the report with the AI model's evaluation metrics. Choose the File option and follow the instructions. Keep in mind that, to generate the report, BLiP must send the questions to the model, which can incur costs depending on the provider used.

Results

The metrics presented in the report are:
- Accuracy
- Precision
- Recall
- F1 Score
- Average reliability
- Correctly classified
- Misclassified
- Top False Positives
- Top False Negatives

When the report is created with the test file, the metrics should have the values in the table below:

Metric                 Expected value
Accuracy               1.00
Precision              1.00
Recall                 1.00
F1 Score               1.00
Average reliability    Variable
Correctly classified   100%
Misclassified          0%
Top False Positives    None
Top False Negatives    None

The average reliability is variable because it is the average of the reliability (confidence) scores returned by the provider when analyzing each question in the test file. If any of the other metrics differs from the table, the model is not answering all the questions correctly. In that case, check the Top False Positives and Top False Negatives tabs, where you can identify which intent was expected and which was actually recognized.

In addition, a Confusion Matrix is generated, where you can identify points of confusion between intents. The top row represents the expected intents, while the left column shows the recognized intents. For example: 10 questions were expected to be recognized as Curiosities, but only 5 were. There is therefore confusion between the Curiosities intent and the Whats, Basic Signs, and How to Learn intents, since one question was recognized as Whats, another as Basic Signs, and three others as How to Learn. In the ideal scenario, only the main diagonal of the confusion matrix is different from 0 (zero), and this is the scenario to aim for whenever the test file is used to generate the report.
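To illustrate the two-column layout described above, here is a minimal sketch in Python that writes a test file by hand. The phrases and intent ids (basic-signs, whats, how-to-learn) are hypothetical placeholders; confirm the exact format (for example, whether a header row is expected) against the output of the BLiP Build AI Model Analysis File tool.

```python
import csv

# Hypothetical critical interactions mapped to the intent ids the model
# should recognize; replace with real user phrases and your own intent ids.
test_cases = [
    ("how do i sign my name", "basic-signs"),
    ("what does this gesture mean", "whats"),
    ("where can i take a sign language course", "how-to-learn"),
]

# The test file is a two-column .csv: question first, expected intent id
# second. No header row is written here; adjust if your tooling expects one.
with open("test-file.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(test_cases)
```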
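As a rough illustration of how these metrics and the confusion matrix are derived, the sketch below reproduces the Curiosities example with scikit-learn. This is not the BLiP report itself: the intent names are placeholders, and the report's averaging scheme and matrix orientation may differ. It only shows how comparing expected versus recognized intents yields the numbers above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Expected intents from the test file vs. intents the model recognized,
# mirroring the Curiosities example: 10 expected, only 5 recognized correctly.
expected = ["curiosities"] * 10
recognized = (["curiosities"] * 5 + ["whats"] + ["basic-signs"]
              + ["how-to-learn"] * 3)

labels = ["curiosities", "whats", "basic-signs", "how-to-learn"]

accuracy = accuracy_score(expected, recognized)  # 5/10 = 0.50, below 1.00
precision, recall, f1, _ = precision_recall_fscore_support(
    expected, recognized, labels=labels, average="weighted", zero_division=0)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")

# In scikit-learn, rows are expected (true) intents and columns are
# recognized (predicted) intents; the BLiP report may use the transposed
# layout. In the ideal scenario only the main diagonal is non-zero.
print(confusion_matrix(expected, recognized, labels=labels))
```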
File update

The test file must contain critical questions related to skills (and content) that the chatbot cannot, under any circumstances, fail to answer. Therefore, every time the model is trained and published, interactions must be added to the file to test what was changed in the base (as long as the change is something critical). Note that you should not add the exact example that was added to an intent, but rather an interaction that tests the NLP model's ability to generalize when something similar is sent to the chatbot. Furthermore, the recommendation is that the file be updated (and operated) by the same person who modified the knowledge base (intents and entities) or, at most, by someone who is aware of the changes that were made.

Versioning

For file version control, it is recommended that each version be named with the day and time of publication of the model it tests, so that there is a clear relationship between the file version and the respective model; for example, a hypothetical naming pattern such as test-file_2023-01-31_14h56.csv ties the file to the model published on that date and time.

If the recommendations in this document are followed, the person responsible for the evolution of the knowledge base (and, consequently, of the NLP model) will be able to validate the modifications made to the base, ensuring that, overall, the model has evolved rather than regressed. This also creates a way to ensure that the model responds correctly to what the customer expects; if something is found that the model does not answer, it should be treated as a model improvement, not a bug.

For more information, see the discussion on the subject in our community or the videos on our channel. 😃

Related articles
- Setting up your AI model in the Chatbot
- Creating entities and intents
- How to improve my AI model
- How to reset users through Beholder?
- How to use Content Assistant