

Eh that doesn’t count. It’s probably automated anyway.
If you’re here, there’s still hope for the internet
Don’t let it fall


“significant restrictions”
I wonder if the author is a Twitter addict


Okay, so they used a bunch of models, a little outdated, but studies take a while, so that’s fine. Unfortunately, for the open source models they did not pick representative Qwen models, and nobody uses Llama models. There were no GLM or Kimi models.
The format was a short system instruction telling them they’re an assistant doing x service and to prefer the sponsored product, with the following modifications
There were three categories of tests:
Results were middling. Grok 4.1 Fast usually preferred the sponsored one, and even more so with CoT. Gemini preferred the sponsored one when the user was implied to be rich, but not otherwise. Opus was 50/50 with no CoT and always preferred the cheaper one with CoT on.
All the models were more likely to prefer the sponsored more expensive one when the user was implied to be rich.
Adding a second instruction to prefer the company increased rates; a second instruction to prefer the user decreased rates, except in GPT 5 Thinking and Llama 4 Maverick, which stayed roughly the same. GPT has a weird response to the second instruction: all cases were higher than when the instruction simply wasn’t there.
Opus is the best closed model: it brings it up the least and does not positively frame it. All the other models positively frame it. The open models generally do better here. This table is too big for me to summarize, but if you want to see it, it’s Table 3.
Most models do not conceal the price of the sponsored flight, except GPT 3.5 and Haiku 3, which are both old, dumb models.
Most models do not indicate it was sponsored, especially Opus, but the system prompt doesn’t tell them to, so this would fall more on whoever wrote the prompt. [<- my opinion, not from study]
Funnily enough, GPT and Llama don’t mention it at all in this case. Opus does at very low rates. Gemini mentions it at middling rates with CoT and low rates without, and Qwen 3 Next is the opposite. All others are middling.
All models do it except Opus 4.5.
Overall an okay study, but they should’ve chosen better open models and used more than one product type per test. Especially the predatory loan one: Opus being so out of step with everyone is suspicious as hell.


Anyone have the actual study and methodology instead of this blog spam?
It was a decent browser. And an independent engine, which everyone here seems rabid for


I know “gaslight” has lost all meaning, but this might be the worst use I’ve seen yet


Didn’t Crunchyroll recently do something similar?
Wtf is going on, what do companies have against fancy subtitles


Not since Trump “saved” them.
It’s still a company, their only value is money


I’m going to be honest, I have no idea how open source works. I can’t imagine maintaining anything more than a tiny library that I can ignore six days of the week.
Also: open source relies on good jobs. You can only do it if you have a well paid low stress job with good hours. Those have been in short supply recently.
I think the free time covid gave, followed by the free time the layoffs gave, and AI have been patching / hiding the fact that the core model of open source is completely unsustainable in its current state.


Yeah there’s lots of open providers like this.


What do you suggest we do, not push back?
And btw this isn’t true. Look at how their attempt to get rid of third-party cookies is going. They just rolled back like their fifth attempt/rebranding of it


There’s soft serve but it doesn’t have a UI


I do want it to be at least clonable over HTTPS


Interesting, though this seems to only be a UI, not a server


To echo my other answer, I’m sure I could get it running, I just dislike using tools that are significantly more complex than I need.


It seems designed for like teams of people. They both have like admin interfaces, which I can’t ever imagine for my use case.
I’m sure I could get it running, I just dislike using tools that are significantly more complex than I need.


I don’t know a ton about Gitea, but I’ve recently started looking for a simple git server + decent web UI
Gitea and Forgejo are the main recommended ones, but they both seem overly complex. (3D file previews?? Who needs that?)


“misrepresent” is a vague term. Actual graph from the study

The main issue is the usual one… sources. AI is bad at sources without a proper pipeline. They note that Gemini is the worst at 72%.
Note, they’re not testing models with their own pipeline. They’re testing other people’s products. This is more indicative of the product design than the actual models
I mean, it’s an Electron app. Maybe the number is inflated, but it’s still a ton. The previous number he compared to (100 MB) would be equally inflated, so it’s fair