Compare Telosnex to Other AI Apps
TL;DR
See the Pricing and Benchmarks sections for details.
Pricing
The key difference between Telosnex and other AI apps is that Telosnex offers a choice of AI providers, and you purchase credits to use them. Virtually every other AI app provides access to only one provider.
Telosnex works somewhat like Costco. You pay a small fee to become a member, and then you buy things at cost. Or you can stay a guest and pay more, which lets you try before you buy.
Telosnex's membership is $10/month. For chat AIs (LLMs), $1 of credits buys roughly 15 hours of the small LLM, or 1 hour of the largest LLM.
Telosnex also built a way to run LLMs on your device easily, for free. Additionally, developers can use the API keys they already have from a provider in Telosnex for free.
Cheaper and better? How?
Telosnex can be an excellent deal because we're a small team: we didn't have to find and pay for a huge team to launch quickly.
Additionally, we're not interested in charging for the AI itself: we believe the more you can use AI, the more you'll use Telosnex, the more value you'll get from it, and the more likely you are to stick around, enjoy it, and recommend it to others.
Why credits?
We went with the credits model because we saw major distortions caused by the monthly "all you can eat" model, and dishonesty that made us uncomfortable as customers.
- Providers heavily limit usage of the large AI, usually with multi-hour waits to use it again. This is unacceptable for professional use: it's very frustrating to be in the middle of complex work and have to stop.
- Providers are incentivized to use the worst AI possible that you won't complain about, and to give the AI less information to read. This dramatically affects AI search engine performance, more than we imagined.
- It also causes issues with transparency and control, such as not showing the messages that carry information from your files and searches in the conversation. Hiding those messages avoids revealing that not all the information you expect is being sent to the AI. This is unacceptable because you are responsible for what you use from AI: you shouldn't have to guess whether it saw everything you thought was necessary.
Benchmarks
Evaluation was performed between May 30th, 2024 and June 2nd, 2024.
We compared Telosnex to other AI apps on benchmarks evaluating complex professional reasoning with high stakes and liability. The two datasets we used were LegalBench and MedQA, as both law and medicine require high levels of knowledge recall paired with complex reasoning.
Datasets
- LegalBench is an open, collaborative effort by the legal community to develop a benchmark for legal reasoning. It includes 162 types of tasks, each with many questions. It is maintained by the Hazy Research Group at Stanford University.
- MedQA is a collection of questions from professional medical licensing exams. It contains 13,000 questions from the United States Medical Licensing Examination (USMLE), and another 48,000 from Chinese and Taiwanese medical licensing exams.
Samples
Both datasets contain tens of thousands of questions and would be prohibitively expensive to run in their entirety. We calculated the sample size required to achieve a 95% confidence level and a 5% margin of error.
- LegalBench: required a sample size of 170. A Dart script randomly selected 2 questions from each of the 162 tasks, for a total of 324 questions. We decided to use all 324 questions, rather than the 170-question sample, in order to provide full coverage of all tasks.
- MedQA: required a sample size of 194. We used numbergenerator.org to generate random question numbers between 1 and the total number of questions in the dataset, and a Dart script selected the corresponding questions.
Sample sizes were then refined based on the observed population proportion, i.e. the share of answers Telosnex got correct. The desired confidence level was raised from 95% to 99.9% for MedQA, while the 5% margin of error was maintained. This indicated a sample size of 98 for LegalBench and 85 for MedQA.
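For reference, the standard calculation here is Cochran's formula with a finite-population correction. The sketch below is illustrative only, with hypothetical inputs; it is not the exact script we ran:

```dart
/// Cochran's sample size formula with a finite-population correction.
/// Illustrative only; the actual confidence levels, margins of error, and
/// proportions used are described in the text above.
int requiredSampleSize({
  required double z,             // e.g. 1.96 for a 95% confidence level
  required double marginOfError, // e.g. 0.05
  required double proportion,    // 0.5 is the most conservative estimate
  required int populationSize,
}) {
  final n0 = (z * z * proportion * (1 - proportion)) /
      (marginOfError * marginOfError);
  // Finite-population correction.
  return (n0 / (1 + (n0 - 1) / populationSize)).ceil();
}

void main() {
  // Hypothetical inputs, purely to show the shape of the calculation.
  print(requiredSampleSize(
      z: 1.96, marginOfError: 0.05, proportion: 0.5, populationSize: 10000));
}
```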
Providers
Briefly, the AIs used were:
- Perplexity: Perplexity AI is the app most analogous to Telosnex, as it also feeds web pages from search to an AI to get answers.
- Perplexity Pro: $40/month. Perplexity Pro reads more web pages and uses a better AI than the free version.
- Google's Gemini 1.5 Pro: $20/month. Released May 2024. Google's top-tier AI. It was accessed through Google's API.
- Anthropic's Claude Opus: Released March 2024. Anthropic's top-tier AI. It was accessed through Anthropic's API.
- OpenAI's GPT-4o: $20/month. Released May 2024. OpenAI's top-tier AI. It was accessed through OpenAI's API.
- Telosnex: Used with the default providers for LLM and Search: GPT-4o and Serper, a search API that uses Google's search engine.
- Telosnex Free: Used with an on-device AI, Llama 3 8B, and Serper.
We ran the benchmarks on the leading AI providers as of June 2024. We use the LMSYS Chatbot Arena leaderboard to discuss provider quality, as it is the most widely accepted metric for comparing AIs.
Other evaluations end up with virtually the same results: Google's, Anthropic's, and OpenAI's latest and most expensive AIs are all at the top, with some differences in order.
Thus, Google's Gemini 1.5 Pro, Anthropic's Claude Opus, and OpenAI's GPT-4o were all selected.
Additionally, Perplexity was included. It is a well-known and well-loved AI search engine. It scores poorly on LMSYS, but it is very similar to Telosnex in that it couples AI with search, and our initial hypothesis was that the low score might reflect subjective perceptions of wording or speed rather than the correctness of its answers.
Of course, Telosnex was included, as it is the AI we are evaluating.
Two evaluations of Telosnex were performed. One used the default provider in Telosnex, GPT-4o. The other used Llama 3 8B, an on-device AI that Telosnex offers for free.
Question Formatting
0-shot for MedQA, some N-shot for LegalBench.
AI evaluations are often "N-shot", meaning the AI is given N examples of a task and then asked to perform the task. This increases scores, but is not representative of real-world use.
Thus, we use 0-shot for MedQA. The alternative would be, for example, injecting 3 question/answer pairs before the question being evaluated.
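To make the distinction concrete, here is a rough sketch of how a prompt builder differs between the two; the helper is hypothetical, not our actual evaluation harness:

```dart
/// Builds an N-shot prompt by prepending N worked examples to the question.
/// With an empty [examples] list, this degenerates to the 0-shot prompt used
/// for MedQA. Purely illustrative.
String buildPrompt(String question, List<MapEntry<String, String>> examples) {
  final buffer = StringBuffer();
  for (final example in examples) {
    buffer
      ..writeln('Question: ${example.key}')
      ..writeln('Answer: ${example.value}')
      ..writeln();
  }
  buffer
    ..writeln('Question: $question')
    ..write('Answer:');
  return buffer.toString();
}
```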
For LegalBench, we use the question as written. Many of the 162 tasks are already formatted as N-shot. For example:
- a task where the input is a contract clause, and the question is "Does the clause specify a minimum order size or minimum amount or units per time period that one party must buy from the counterparty?"
- a task where the input is a question in a courtroom transcript, and the task is to label the function it serves, where the choices are "Background, Clarification, Communicate, Criticism, Humor, Implications, Support"
Chain of Thought for MedQA, not for LegalBench.
Another core prompt engineering technique is to encourage the AI to "think" about the question, then answer it. We readily include this in our prompts, as it is simple and representative of real-world use.
We do not use this for LegalBench, as the questions are already formatted as N-shot. LegalBench is formatted as:
(question)
Provide the final label/answer that I left blank, that is all I need from you.
MedQA is formatted as:
(question)
Take a deep breath, work through the problem step by step, and after working through it, select your final answer
(answer choices)
Search Results Prompt
Perplexity's prompt introducing search results is unknown. It seems to include snippets from up to 20 web pages from 3 search queries.
Telosnex's is documented below. It is the last message before the AI's answer.
The entire chat history submitted looks like:
- User: (prompt)
- User: (search results, up to 3000 words, 3 search queries, 10 documents)
- User: (search results prompt)
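A minimal sketch of how that history could be assembled is below; the function and field names are illustrative, not Telosnex's actual code:

```dart
/// Sketch of the message list submitted for a search-grounded answer.
/// Names are illustrative, not Telosnex's internal API.
List<Map<String, String>> buildChatHistory({
  required String prompt,
  required String searchResults,       // up to ~3000 words from 10 documents
  required String searchResultsPrompt, // the template documented below
}) {
  return [
    {'role': 'user', 'content': prompt},
    {'role': 'user', 'content': searchResults},
    {'role': 'user', 'content': searchResultsPrompt},
  ];
}
```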
Telosnex Search Results Prompt
The information above is search results you'll be using the accomplish the following task:
# Incoming Request
{{ prompt }}
# Task
You must provide a response to the request, drawing only upon search results that are directly relevant to answering the question. It is crucial that you carefully evaluate the applicability of each search result and only include citations to those that provide valuable and pertinent information. Irrelevant search results should not be mentioned or cited under any circumstances. In case it's helpful when you're sifting through results, the current date is {{datetime}}, you don't need to mention that to the user unless they asked directly.
# Rules
- Do not repeat the intent, the user knows it already.
- Use advanced Markdown through the text for formatting.
- Assume the character of an expert professional familiar with the subjects in the intent.
- Speak authoritatively and concisely.
- Think carefully when reasoning about dates, ex. if a query is for upcoming events, that would imply it expects answers of events with dates *after* today.
- Thoroughly analyze each search result to determine its relevance to the request.
- Add citations -- using the format [^N^], where N is the snippet # of the citation.
- If you find yourself unable to locate search results that are indisputably relevant to the request and can be cited to support your response, simply provide a complete answer to the best of your knowledge without including any citations. It is better to submit a response without citations than to cite irrelevant information.
- Use your expert-level knowledge to supplement the search results. Don't share links from your knowledge, knowledge about current events / news, and any knowledge before 2021 - Its {{datetime}}, but you last learned new information in 2021. Rely on the search results for links and current events. Explain concisely, densely.
- Demonstrate your subject-level knowledge by sharing your step by step thinking through all factors mentioned in the intent.
- Include data points, p-values, and quantitative information mentioned in the articles, reinforcing the credibility of your writing.
- Make sure to cite your findings to the matching note you got them from.
- At the end of your response, DO NOT INCLUDE MARKDOWN FOOTNOTES. Pause silently when you have finished. I will then provide them.
- Reinforce credibility by actively pointing out things that could seem like errors: ex. referencing studies about children when providing an answer for an adult.
- Respond in well-formatted & well-designed Markdown.
The other AIs do not use search results.
Inference Settings
LLMs have settings that influence the AI's behavior, such as temperature, max tokens, and top p.
Perplexity does not provide this information or allow the user to change it.
The other providers require these settings for an API call. A temperature of 0.5 and a limit of 2000 max tokens were used for all providers.
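For illustration, this is roughly what a chat-completions request with those settings looks like; the helper is a sketch, not Telosnex's internal client:

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

/// Sketch of a chat-completions call with the settings used in the benchmark.
/// Illustrative only; not Telosnex's production code.
Future<String> askGpt4o(String apiKey, List<Map<String, String>> messages) async {
  final response = await http.post(
    Uri.parse('https://api.openai.com/v1/chat/completions'),
    headers: {
      'Authorization': 'Bearer $apiKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'model': 'gpt-4o',
      'messages': messages,
      'temperature': 0.5, // used for all providers
      'max_tokens': 2000, // used for all providers
    }),
  );
  final decoded = jsonDecode(response.body) as Map<String, dynamic>;
  return decoded['choices'][0]['message']['content'] as String;
}
```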
Fresh Context
The AI was given a fresh context for each question. This is a common practice in AI evaluations, as it is representative of real-world use.
Perplexity does not provide this information or allow the user to change it. A new conversation was started for each question.
Preventing Cheating
Telosnex was coded to exclude web page snippets that matched more than 5 consecutive words from the question before sending them to the AI. All providers except Perplexity were tested through Telosnex, so they all had this automated guard.
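A simplified sketch of that guard is below; the helper name and tokenization are illustrative, not the exact production code:

```dart
/// Returns true if [snippet] shares a run of more than [maxRun] consecutive
/// words with [question]; such snippets were dropped before being sent to
/// the AI. A simplified sketch of the guard.
bool leaksQuestion(String question, String snippet, {int maxRun = 5}) {
  List<String> words(String s) =>
      s.toLowerCase().split(RegExp(r'\W+')).where((w) => w.isNotEmpty).toList();

  final q = words(question);
  final s = words(snippet);
  final run = maxRun + 1; // "more than 5" means a 6-word run

  // Collect every run of `run` consecutive words from the question.
  final questionRuns = <String>{
    for (var i = 0; i + run <= q.length; i++) q.sublist(i, i + run).join(' '),
  };

  // Check whether any run in the snippet also appears in the question.
  for (var i = 0; i + run <= s.length; i++) {
    if (questionRuns.contains(s.sublist(i, i + run).join(' '))) return true;
  }
  return false;
}
```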
Perplexity was hand-checked several times across the ~600 questions. We found only 1 instance of a web page snippet containing the question and its answer. Perplexity's snippets were accessed by clicking the 3-dot menu below the answer, then clicking "View Sources".
Results Dataset
Results were recorded in two spreadsheets, shared below:
The format is roughly one row per question. The columns are:
- question
- formatted question (as discussed in the previous section)
- correct answer
- 2 columns for each AI:
  - The AI's answer, or blank if the AI was correct. N/A is used when the AI did not provide an answer that was one of the choices, even after a follow-up prompt.
  - The other column says either "correct" or "incorrect", based on the correct answer and AI answer columns.
Results
| Name | Legal | Med | Average |
|---|---|---|---|
| Perplexity | 58% | 61% | 60% |
| Telosnex Local | 67% | 79% | 73% |
| Perplexity Pro | 67% | 79% | 73% |
| Gemini Pro 1.5 | 82% | 68% | 75% |
| Perplexity Pro x GPT-4o | 76% | 86% | 81% |
| Claude Opus | 89% | 76% | 82% |
| GPT-4o | 77% | 91% | 84% |
| Telosnex | 90% | 95% | 93% |
Further Directions, or, 'How could search results make LLMs *worse*?'
We see that Telosnex is the best AI for both LegalBench and MedQA.
The results also left us curious about a few things.
Telosnex's most direct competitor, Perplexity Pro, is the only tool on the list with similar semantics: finding search results and then asking an AI to answer a question after selecting snippets from the search results.
We were happy to see Telosnex doing better than Perplexity Pro's defaults, but the size of the gap was puzzling. We then found that Perplexity Pro has a setting to change its LLM to GPT-4o, the same model Telosnex uses by default.
We ran the benchmarks again, and Perplexity Pro with GPT-4o did significantly better than Perplexity Pro with its default AI.
However, Perplexity Pro with GPT-4o did not beat plain GPT-4o. That implies that having Perplexity search and hand snippets to GPT-4o produces worse answers than GPT-4o alone. This is a significant finding: it implies the search results are not helping the AI.
This brings to mind a recent paper from Google Research, "A Comprehensive Evaluation of Tool-Assisted Generation Strategies," which found that LLMs given the option to fetch search documents before answering performed worse than a plain LLM.
This also brings to mind specific behavior patterns we observed in Perplexity, e.g. ending up in an infinite loop of generating new search queries, or searching for the name of an attached document encoded as base64.
Perplexity also frequently issued search queries that were actually responses to "internal dialogue", i.e. an invisible prompt Perplexity provided to the AI before answering the question. That prompt asks the LLM to answer yes or no to 3 separate questions:
- Does the query require personalization?
- Does the query require multi-step thinking?
- Are the sources provided sufficient?
Our hypothesis is that LLMs given the option to search genuinely perform worse than a standard LLM; however, LLMs that always search perform better than a standard LLM.
This unintuitive result points to a tension in AI services: they would prefer not to include search documents, as skipping them is cheaper. However, the AI is an unreliable narrator as to whether it needs the search documents: in fact, the documents provide a small, but crucial, qualitative edge.
We further hypothesize that simply having access to language in-domain to the question, as given by the search results, provides a small, but critical, reduction in perplexity for reasoning that leads to the correct answer.