Our methodology and evaluation criteria

This benchmark is designed to measure the additional value Wordsmith provides on top of best-in-class foundation models by tailoring them to the specific vertical use case of in-house legal.

Our solution focuses on the problems legal teams face in their day-to-day work: commercial contracting, legal document review and handling “business as usual” questions from inside and outside the organization. We have not trained our own foundation model; instead, we built a system on top of existing large language models, enhanced with customised legal ontologies, large and diverse knowledge bases and tools, so that we can deliver exceptional performance for lawyers across these use cases.

To understand and measure the value our system provides, we have built, and will continue to expand, a set of custom evaluations that we believe represent the key challenges in-house lawyers face in their day-to-day work. As we built these evaluations, we compared the performance of our system against both vanilla large language models and available Retrieval Augmented Generation (RAG) solutions.

The comparison was made against ground truth answers provided by human lawyers, and the correctness of each AI answer was manually validated by a lawyer. The percentages below represent the number of answers deemed correct against the ground truth, divided by the total number of questions in the evaluation dataset.

  • Wordsmith: 84%

  • Gemini 1.5 Pro: 63%

  • GPT-4 Turbo: 48%

  • Claude 3 Opus: 45%

  • Custom GPT (GPT-4 powered RAG): 44%

  • Mistral Large: 41%

  • Cohere Command R+: 39%

  • Grounded Cohere Command R+ (RAG): 34%

*Percentage of questions in our evaluation dataset that each model answered correctly
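
As a rough illustration of how these percentages are derived, the sketch below computes accuracy as correct answers over total questions. The record structure and the manual `is_correct` flag are illustrative assumptions, not our production grading pipeline.

```python
# Minimal sketch of the scoring arithmetic: accuracy = correct answers / total questions.
# The record shape and the "is_correct" flag (set by a reviewing lawyer) are assumptions.

def accuracy(graded_answers: list[dict]) -> float:
    """Fraction of answers a reviewing lawyer marked as matching the ground truth."""
    if not graded_answers:
        return 0.0
    correct = sum(1 for a in graded_answers if a["is_correct"])
    return correct / len(graded_answers)

# Example: 84 of 100 answers judged correct -> 84%
graded = [{"question_id": i, "is_correct": i < 84} for i in range(100)]
print(f"{accuracy(graded):.0%}")  # 84%
```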

Evaluation dataset details

In our evaluation set we focused on commercial agreements (see example here) and real questions we identified through user research with our customers.

Example questions from our dataset are:

  • Do we have a precedent of reducing late payment fees?

    • Answer: Yes, there is a precedent for reducing late payment fees. In an agreement with Acme, the late payment charge was reduced from 6% to 2% interest rate.

  • Which contracts have a general liability cap of more than 10x the annual subscription fee?

    • Answer: The following contracts have a general liability cap higher than 10x the annual fees:

      • The contract for Globex Corporation. The general liability cap is USD 25,824, and 10 times the annual fees is USD 11,880.

      • The contract for Initech. The general liability cap is EUR 12,438, and 10 times the annual fees is EUR 11,880.

Our evaluation dataset contains several dozen questions similar to the above, with varying levels of complexity (see some more here).
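
For context, each evaluation item pairs a question with a lawyer-written ground-truth answer and the contracts it relies on. The structure below is a hypothetical sketch of such an item; the field names and identifiers are illustrative only, not the actual schema of our dataset.

```python
from dataclasses import dataclass

# Hypothetical shape of one evaluation item; field names are illustrative assumptions.
@dataclass
class EvalItem:
    question: str            # the question an in-house lawyer would ask
    ground_truth: str        # reference answer written by a human lawyer
    contract_ids: list[str]  # contracts the answer relies on (placeholder IDs)

item = EvalItem(
    question="Do we have a precedent of reducing late payment fees?",
    ground_truth=(
        "Yes. In an agreement with Acme, the late payment charge was "
        "reduced from 6% to 2% interest."
    ),
    contract_ids=["acme-agreement"],  # placeholder identifier
)
```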

For the purpose of this measurement we limited our dataset to fewer than 20 contracts so that they fit within the 128K context window of models like GPT-4 Turbo and Claude 3 Opus, allowing us to compare Wordsmith against LLMs that see all the contracts in one shot.
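
To gauge whether a contract set fits in a 128K-token window, one can count tokens with a tokenizer such as tiktoken. The sketch below uses the cl100k_base encoding as an approximation; it is not part of our benchmark code, and exact counts vary by model and prompt formatting.

```python
import tiktoken

# Rough check that the concatenated contracts fit in a 128K-token context window.
# cl100k_base is used as an approximation; real limits also depend on the prompt itself.
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(contract_texts: list[str], limit: int = 128_000) -> bool:
    total = sum(len(enc.encode(text)) for text in contract_texts)
    return total <= limit

# Example with placeholder contract texts:
contracts = ["AGREEMENT between ...", "MASTER SERVICES AGREEMENT ..."]
print(fits_in_context(contracts))
```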

Solutions that use RAG (OpenAI Custom GPTs and Cohere Grounded Command R+) can scale to a large number of contracts, though their performance degrades as the number grows. As such, this benchmark would be further biased towards Wordsmith if we expanded the evaluation to a dataset larger than the context window of the foundation models.

Results

Current state-of-the-art large language models and Retrieval Augmented Generation (RAG) solutions show relatively low performance on the real questions in-house legal teams face, with only Gemini 1.5 Pro giving correct answers to more than half of the queries. At the same time, Wordsmith's approach shows a significant improvement over the vanilla large language models and RAG solutions we tested against.