LogiQA

LogiQA is a comprehensive dataset designed to assess an LLM's logical reasoning capabilities, encompassing various types of deductive reasoning, including categorical and disjunctive reasoning. It features 8,678 multiple-choice questions, each paired with a reading passage. To learn more about the dataset and its construction, you can read the original paper here.

info

LogiQA is derived from publicly available logical comprehension questions from China's National Civil Servants Examination. These questions are designed to evaluate candidates' critical thinking and problem-solving skills.

Arguments

There are two optional arguments when using the LogiQA benchmark:

  • [Optional] tasks: a list of tasks (LogiQATask enums) specifying which reasoning types to evaluate the model on. By default, this is set to all tasks. The full list of LogiQATask enums can be found here.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
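
For instance, here is a minimal sketch of the default configuration, which follows directly from the defaults above (all tasks, 5-shot prompting):

from deepeval.benchmarks import LogiQA

# Default configuration: evaluates all LogiQA tasks with 5-shot prompting
benchmark = LogiQA()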

Example

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on categorical reasoning and sufficient conditional reasoning using 3-shot prompting.

from deepeval.benchmarks import LogiQA
from deepeval.benchmarks.tasks import LogiQATask

# Define benchmark with specific tasks and shots
benchmark = LogiQA(
    tasks=[
        LogiQATask.CATEGORICAL_REASONING,
        LogiQATask.SUFFICIENT_CONDITIONAL_REASONING,
    ],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed via exact matching: the proportion of questions for which the model produces the exact correct multiple-choice letter (e.g. 'A' or 'C'), out of the total number of questions.
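
To make the exact-matching computation concrete, here is a hypothetical illustration (not deepeval's internal code, and the answers are made up) of how such a proportion is derived:

# Hypothetical model predictions and gold answers for four questions
predictions = ["A", "C", "B", "D"]
gold_answers = ["A", "B", "B", "D"]

# Exact matching: count letter-for-letter matches, divide by total questions
overall_score = sum(p == g for p, g in zip(predictions, gold_answers)) / len(gold_answers)
print(overall_score)  # 0.75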

tip

Because scoring relies on exact matching, using more few-shot examples (n_shots) helps the model produce answers in the exact expected format, which can noticeably boost the overall score.

LogiQA Tasks

The LogiQATask enum covers the reasoning categories included in the LogiQA benchmark.

from deepeval.benchmarks.tasks import LogiQATask

logiqa_tasks = [LogiQATask.CATEGORICAL_REASONING]

Below is the comprehensive list of available tasks:

  • CATEGORICAL_REASONING
  • SUFFICIENT_CONDITIONAL_REASONING
  • NECESSARY_CONDITIONAL_REASONING
  • DISJUNCTIVE_REASONING
  • CONJUNCTIVE_REASONING
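
Any subset of these tasks can be passed to the tasks argument. As a sketch, the snippet below restricts evaluation to disjunctive and conjunctive reasoning (mistral_7b is assumed to be the same custom model from the example above):

from deepeval.benchmarks import LogiQA
from deepeval.benchmarks.tasks import LogiQATask

# Evaluate only disjunctive and conjunctive reasoning
benchmark = LogiQA(
    tasks=[
        LogiQATask.DISJUNCTIVE_REASONING,
        LogiQATask.CONJUNCTIVE_REASONING,
    ]
)
benchmark.evaluate(model=mistral_7b)  # mistral_7b: your custom model
print(benchmark.overall_score)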