BoolQ
BoolQ is a reading comprehension dataset containing 16K yes/no questions (3.3K in the validation set). BoolQ features naturally occurring questions, meaning they are generated in an unprompted setting, with each question accompanied by a passage.
To learn more about the dataset and its construction, you can read the original paper here.
Arguments
There are two optional arguments when using the BoolQ
benchmark:
- [Optional]
n_problems
: the number of problems for model evaluation. By default, this is set to 3270 (all problems). - [Optional]
n_shots
: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
Example
The code below assesses a custom mistral_7b
model (click here to learn how to use ANY custom LLM) on 10 problems in BoolQ
using 3-shot CoT prompting.
from deepeval.benchmarks import BoolQ
# Define benchmark with n_problems and shots
benchmark = BoolQ(
n_problems=10,
n_shots=3,
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score
for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The model's score, based on exact matching, is calculated by determining the proportion of questions for which the model produces the precise correct answer (i.e. 'Yes' or 'No') in relation to the total number of questions.
As a result, utilizing more few-shot prompts (n_shots
) can greatly improve the model's robustness in generating answers in the exact correct format and boost the overall score.