This looks much like Replicate (https://replicate.com/). Has anyone tried it? How's the experience with cold starts?
We have models that are crucial but don't warrant dedicated hosting. We're looking for an AWS Lambda type of service, but for a fine-tuned llama2-13b. Any suggestions? Would try out Cloudflare AI too.
Previously (including several comments from CF folks including cofounder eastdakota): https://news.ycombinator.com/item?id=37674097
There isn't much about pricing, but this fragment suggests it will be economical mostly for light use cases.
>“Currently, customers are paying for a lot of idle compute in the form of virtual machines and GPUs that go unused,”
I'm definitely looking forward to having a lot more competition in the "pay as you go LLM AI" space. Especially services that use models one can download and run on your own hardware once a good use case has been developed.
Is a few milliseconds of latency really a problem for current LLMs? They are already so slow that users are used to waiting tens of seconds for a response anyway. I feel like until the actual latency of LLMs improves to sub-second, this is not a product that's worth the price.
One of the offerings is language translation where latency might matter. Though I don't know how fast it is.
Cloudflare doesn't currently have a "not edge" worker, so anything they offer has to be "edge".
The expected pricing is very strange:
Regular Twitch Neurons (RTN) - running wherever there's capacity at $0.01 / 1k neurons
Fast Twitch Neurons (FTN) - running at nearest user location at $1.25 / 1k neurons
Neurons are a way to measure AI output that always scales down to zero. To give you a sense of what you can accomplish with a thousand neurons, you can: generate 130 LLM responses, 830 image classifications, or 1,250 embeddings.
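Those quoted numbers can be turned into rough per-task dollar costs. A minimal Python sketch, assuming the announced rates ($0.01 / 1k neurons regular, $1.25 / 1k neurons fast) and the stated neuron-to-task conversions:

```python
# Back-of-the-envelope per-task costs from the quoted Workers AI numbers.
# The rates and conversion factors below are taken from the announcement
# as quoted in this thread; treat them as assumptions, not confirmed pricing.

RTN_PER_1K = 0.01  # $ per 1k neurons, "Regular Twitch"
FTN_PER_1K = 1.25  # $ per 1k neurons, "Fast Twitch"

# Tasks you can accomplish with 1,000 neurons, per the announcement.
TASKS_PER_1K_NEURONS = {
    "llm_response": 130,
    "image_classification": 830,
    "embedding": 1250,
}

def cost_per_task(task: str, rate_per_1k: float) -> float:
    """Dollar cost of one task at a given $/1k-neuron rate."""
    neurons_per_task = 1000 / TASKS_PER_1K_NEURONS[task]
    return neurons_per_task / 1000 * rate_per_1k

for task in TASKS_PER_1K_NEURONS:
    print(f"{task}: regular ${cost_per_task(task, RTN_PER_1K):.6f}, "
          f"fast ${cost_per_task(task, FTN_PER_1K):.6f}")
```

At these rates the fast tier is exactly 125x the regular tier, which is where the ~100x figures further down the thread come from.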
Who came up with this? This is ridiculous. I understand the underlying issues but would still prefer a metric like seconds of utilization multiplied by the size of the worker.
Besides this, the "expected pricing" section doesn't actually state prices, just the pricing model. I have the feeling this is not going to be competitive with platforms like Vast.ai.
Depending on how many tokens a typical response uses, pricing will vary wildly, but a rough estimate puts the fast one as more expensive than ChatGPT 3.5 and the cheap one as way cheaper.
Quality will likely be heaps worse than ChatGPT 3.5, given it's Llama 2 7B.
It's $0.96 per 100 fast chat responses.
It's $0.0076 per 100 slow chat responses.
ChatGPT 3.5 with 50 tokens input, 50 tokens output will give you $0.02 per 100 responses. If the LLM responses are 500 tokens in and 500 tokens out, then you get $0.20 per 100 responses.
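The GPT-3.5 side of the comparison falls out of the per-token rates. A quick sketch, assuming gpt-3.5-turbo's late-2023 list prices of $0.0015 per 1k input tokens and $0.002 per 1k output tokens:

```python
# Rough gpt-3.5-turbo cost per 100 responses.
# Token prices are the late-2023 list rates and are an assumption here.

INPUT_PER_1K = 0.0015   # $ per 1k input tokens
OUTPUT_PER_1K = 0.002   # $ per 1k output tokens

def gpt35_cost_per_100(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of 100 responses at the assumed per-token rates."""
    per_response = (tokens_in / 1000 * INPUT_PER_1K
                    + tokens_out / 1000 * OUTPUT_PER_1K)
    return per_response * 100

print(gpt35_cost_per_100(50, 50))    # ~$0.0175, i.e. the ~$0.02 figure
print(gpt35_cost_per_100(500, 500))  # ~$0.175, i.e. the ~$0.20 figure
```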
I presume people will flock to the cheap version when they can't afford ChatGPT 3.5's price, even at lower quality.
So running fast is >100x more expensive? That's too big a difference.
On the other hand, if it reflects their costs, I'm very happy to have an option that is 100x cheaper, rather than a more strategic one that raises the lower price by 10x.