Switzerland to release an Open-source, fully transparent and privacy-preserving AI Model

Quotes from the announcement:

Currently in final testing, the model will be downloadable under an open license. The model focuses on transparency, multilingual performance, and broad accessibility

The model will be fully open: source code, and weights will be publicly available, and the training data will be transparent and reproducible, supporting adoption across science, government, education, and the private sector. This approach is designed to foster both innovation and accountability

Open LLMs are increasingly viewed as credible alternatives to commercial systems, most of which are developed behind closed doors in the United States or China

Multilingual by design
A defining characteristic of the LLM is its fluency in over 1000 languages

Some data:

  • The model is trained on the “Alps” supercomputer at CSCS, one of the world’s most advanced AI platforms, equipped with over 10,000 NVIDIA Grace Hopper Superchips
  • The model will be released in two sizes—8 billion and 70 billion parameters
  • Training on over 15 trillion high-quality training tokens

(whatever that means…maybe somebody can enlighten us)

and of course the “Swiss touch”:

Responsible data practices
The LLM is being developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act
…
respecting web crawling opt-outs during data acquisition

  • 100 percent carbon-neutral electricity to train the model
6 Likes

To put it simply, tokens are in this context units of data. There are many ways to tokenize data and many levels at which it can be done (characters, words, sub-words…). If it’s words, a short word can be a single token but a longer one may be split into several. The combinations and relations between the tokens (think networks of data points) are what the statistical models learn (creating a database/model of sorts). There are several texts online if you want more in-depth info; one to start with: https://medium.com/@jimcanary/ai-tokens-explained-understanding-context-windows-and-processing-a99ca2dd9142
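To make that concrete, here’s a toy sketch (made-up vocabulary, not a real tokenizer like the BPE ones actual LLMs use) of how short words can stay whole while longer words get split into sub-word pieces:

```python
# Toy tokenizer: greedily split a word into the longest known pieces.
# The vocabulary here is invented purely for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "un", "believ", "able", "token", "ization"]

def toy_tokenize(word: str) -> list[str]:
    pieces, rest = [], word
    while rest:
        # longest vocabulary piece the remaining text starts with,
        # falling back to a single character if nothing matches
        match = max((p for p in vocab if rest.startswith(p)), key=len, default=rest[0])
        pieces.append(match)
        rest = rest[len(match):]
    return pieces

for w in ["cat", "unbelievable", "tokenization"]:
    print(w, "->", toy_tokenize(w))
# cat -> ['cat']                            (one token)
# unbelievable -> ['un', 'believ', 'able']  (three tokens)
# tokenization -> ['token', 'ization']      (two tokens)
```

So “15 trillion tokens” just means the training text, once split into pieces like these, adds up to about 15 trillion of them.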

8 billion is considered small and 70 billion is mid-sized. Very roughly, 8B models have needed about (at least) 8 GB of GPU memory to run, so that model is just about usable on common computers. The trend over the last year has been that bigger isn’t always better as algorithms and fine-tuning have advanced - again simplifying, a model made specifically for something can be better than one that is made to do everything and includes everything plus the kitchen sink.
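The 8 GB figure is just back-of-the-envelope arithmetic: number of parameters times bytes per parameter (which depends on quantization), plus some overhead. A rough sketch:

```python
# Very rough rule of thumb for GPU memory needed to run a model locally:
# parameters * bytes per parameter, plus ~20% overhead for activations,
# KV cache, etc. (the exact overhead varies a lot in practice).
def approx_memory_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    return params_billion * bytes_per_param * overhead

for label, bytes_per in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{label}: 8B ~{approx_memory_gb(8, bytes_per):.1f} GB, "
          f"70B ~{approx_memory_gb(70, bytes_per):.1f} GB")
# prints roughly: fp16: 8B ~19 GB, 70B ~168 GB; 8-bit: ~10 / ~84; 4-bit: ~5 / ~42
```

So an 8B model fits on a common GPU once it’s quantized to 8-bit or lower, while the 70B one really needs workstation- or server-class hardware.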

4 Likes

Actually I’m using this one:

https://lmarena.ai/

1 Like

The model is not available for testing (benchmarking, really) at that site just yet - it’s still being finalized (no idea what the lag will be before it appears there, if it does). And remember, those sites should not be used for anything sensitive (and if you’re using them regularly for something other than testing, you are wasting their resources). That being said, I recommend the side-by-side comparison - that kind of empirical testing is easier to understand than what the “accuracy percentages” from task tests tell most people.

1 Like

Are there already usable open-source and fully transparent models that can be tried (and that neither feed on your data nor are commercially oriented)?

1 Like

Three different things there. The open-sourceness and transparency of models has been a bit of a sliding scale, and fully open (in all aspects) models are rare - a lot of close tries and even more marketing “open”. We’ve had a couple of threads where these have been discussed. But that’s just the models.

A system or a service that uses the model (offers it for use) is another thing, and those are the ones possibly gathering data (a good model can sit in a bad system). A model doesn’t actively learn at the moment you use it; a system gathers data, and that is later used to train a new version of the model. That is also why a “bad model in a good system” is possible: you can set up your own system (own server, own computer) where that black-box model is sandboxed. Due to the black-box nature of the models (they are so complex), it’s possible (although I haven’t come across reports of it) that a model could try to use system weaknesses (or just normal features) to send data to a third party.

Not a straight answer, I know. I wouldn’t recommend any one AI service as such - it kind of depends on your risk profile, the data you are using with the service, and whether you trust it. Using your own system (like Ollama or maybe Alpaca) may be pretty safe, but the catch is the limited resources and model size - is that sufficient vs. your security and privacy needs? For a truly secure combination, you’d need a trusted and tested good model, an auditable and safe system, and a combination of security features to keep an eye on and limit the model, data, user, and network.
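If you do go the local route, the nice property is that the model answers over a local socket and nothing has to leave the machine. A minimal sketch, assuming Ollama is installed with its default local HTTP API and a model already pulled (the model name here is just an example):

```python
# Query a locally running Ollama server over its default local port.
# Nothing leaves the machine as long as the server is local and trusted.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",   # whatever model you have pulled locally
    "prompt": "In one sentence, why does local inference help privacy?",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```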

4 Likes

Thank you. The situation you describe is pretty much what I thought it was presently.
Since I don’t want to install a model on my own hardware, I am stuck with using a “service” (AI chatbot), and I haven’t done so yet for confidentiality and ethical reasons.
This is why I was eagerly waiting for some truly ethical and privacy-respecting service to emerge. This Swiss initiative (which they claim will also be a broadly available service) looks good for my purpose and it may well be what I was waiting for to enter the AI Age (whatever that means…)

3 Likes

Maybe with the emergence of truly ethical models, truly secure, privacy-respecting, and transparent services will emerge too. Undoubtedly the catch will then be that they can’t operate without charging for the service, since there’s no data to sell. What’s the appropriate sum you’d be willing to pay for such a service? The computing isn’t cheap: AI is compute-intensive [using a lot of electricity, often forcing others onto less-green sources, and also drinking water for cooling, so AI use itself is often not ethical].

[edit to add: Some already like the NextCloud services, and they’ve added AI as an additional offering, with several options for how to do that, which seems like a believable way to do it - the user has control, payment is part of other services, and the system and models aren’t the worst. Seems relatively OK. I have to say I was at one point expecting Purism to add AI as part of Librem One.]

2 Likes

Zero.

Is it still open source and legal with an “opt-out” approach? My copyright is not lost just because I didn’t take some action like opting out…

Still, better than what we see elsewhere as LLM, but I would not even call it ethical or fully open.

Money is the key question here. If there is a valid financing model and everything is transparent, it’s very likely they do what they write. They may make private use free while companies and other commercial users have to pay, or they just get enough donations. If it is fully free, it is maybe an investment until they have a huge market share and can start building a monopoly. That may be good for today but not for tomorrow, so in that case I would also recommend not using it. If it costs no money, it costs data. Whatever it is, check it out before using it and keep in mind that even the best promises can be a lie.

I would just take the costs myself and run the model locally, but it also depends on the available hardware.

2 Likes

Could we, as the Purism community, develop a top ethical open-source AI, maybe asking for money on a crowdfunding platform?

1 Like

Ethical is easy: just don’t do stupid sh*t. I know, right - blindingly obvious! :see_no_evil:
The “top” part is a bit harder. Training takes a lot of data (and not all of it is free - especially if it needs to be good quality), and that has to be stored and managed. Then there is the computation cost - sure, you don’t need the most powerful bitbarn if you’re willing to spend more time, but that’s still going to cost. The cheap Chinese models that caught the big US players off guard still cost millions even when done cheaply (and that doesn’t include the years of research and development they did before that).

I’m not against the idea, but there are quite substantial hurdles to starting from zero with just crowdfunding. Maybe start with a specialist mini or small model, specifically for IoT devices and Linux phones, or some similar limited project (still a bit of a challenge)…? Or first define an (acceptable) level of ethicality and of “top” that enthusiasts and the general public can agree on, a standard? Although there is some work being done on that front already, if I’m not mistaken, by bigger communities (including the Linux Foundation), so maybe supporting them (or some other org) would be wiser…

2 Likes

Or, maybe, jump to human brain simulation? I think it’ll be the future, the next step!

Projects like these (a toy sketch of the kind of dynamics they simulate follows the list):

  1. NEST simulator
  2. Brian2
  3. SpiNNaker + sPyNNaker
  4. Nengo
  5. BindsNET
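For a flavour of what these projects simulate, here is a toy leaky integrate-and-fire neuron in plain Python (not using any of the listed simulators’ APIs, which do this at vastly larger scale and with proper neuron models):

```python
# Toy leaky integrate-and-fire neuron: the membrane potential v leaks toward
# the input current, and each time it crosses a threshold it "spikes" and resets.
dt, tau = 0.001, 0.02          # time step and membrane time constant (seconds)
v_thresh, v_reset = 1.0, 0.0   # spike threshold and reset value (arbitrary units)
current = 1.2                  # constant input drive

v, spike_times = 0.0, []
for step in range(1000):                   # one simulated second
    v += dt / tau * (current - v)          # leaky integration toward the input
    if v >= v_thresh:
        spike_times.append(step * dt)      # record the spike time
        v = v_reset                        # reset after the spike

print(f"{len(spike_times)} spikes in 1 s of simulated time")
```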
1 Like

Top ethical begins with the training data set. Here are some thoughts on kinds of data and whether it is okay to use them:

  • Biometric data (faces, voices, fingerprints, …) is not okay, no matter the license (even CC0), as long as there is no specific agreement with the person who owns the original biometrics. If the person is already dead, we cannot ask, so we cannot use it (especially since we do not know whether using it would be ethically okay, we should just avoid it).
  • CC0 and public domain material can always be used when it contains no biometrics.
  • CC-BY and anything with more requirements starts to become too complicated: compiling a list of millions of artists that has to be shipped with the model, and everyone generating an image/text/sound/… would have to pass the list along with the output, too. I guess that should already be avoided.
  • Data donations are totally fine.
  • Data contracts are also totally fine (paying for training data). But here we do not have enough money to do so.
  • Our very own data.

So the first step is very clear: training data has to be created by hand, not by crawlers or similar, to make sure the data is really ethical. Crawlers themselves are also not ethical as long as there is no agreement with the crawled page, because they externalize costs onto all the crawled internet pages.
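For what it’s worth, “respecting web crawling opt-outs” (as the Swiss announcement puts it) at minimum means checking a site’s robots.txt before fetching anything. A minimal sketch with the Python standard library (the bot name and URLs are placeholders):

```python
# Check a site's robots.txt opt-out before fetching a page for a training corpus.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.org/robots.txt")
robots.read()

if robots.can_fetch("MyTrainingDataBot", "https://example.org/some/article"):
    print("allowed to fetch")
else:
    print("site has opted out of crawling; skip it")
```

Of course that is only the technical floor; it says nothing about copyright or consent, which is the point being made above.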

Such a database is very hard to achieve, so it is very unlikely we can build it. If we could, we would already be very close to the final goal. We would not even need to code and train the AI, because we could sell or give away that data. We could create licenses that only allow the data to be used for open-source AI and that do not allow harmful models, like mass-surveillance stuff etc.

The much simpler option would be to create specialized AIs, like an AI that saves energy in a technical process (a render pipeline or similar). But that is something completely different from LLM use cases.

1 Like

Pfft, a “model” is not the source nor a specification. Seriousness aside, a “model” is a smaller representation of something else.

2 Likes

duck.ai (duck duck go) has good agreements with some providers, check it out!

1 Like

I came across an interesting research article, in beta phase, that looks at the ethicality and openness questions of AI systems. I’m only halfway through skimming it, but it collects most of the salient points and has done some work on how, for instance, data rights holders see the various aspects - a good hub for sources, if nothing else: MusGO: A Community-Driven Framework for Assessing Openness in Music-Generative AI
From the article, about the relevant categories in their framework: “13 openness categories: 8 essential and 5 desirable”, which seems like a manageable number of categories. Still, “training data”, for instance, could maybe be divided into several sub-categories: how the data was acquired (spiders, bought, pirated, free, public domain etc.), whether it was filtered or quality controlled (enhanced, censored etc.), what the licenses and approvals look like (do potentially involved people know that their data is used, and could they object, etc.), whether personal info is included and how personal it is, and similar areas. As these models for what should be required come along later, it becomes a challenge to gather, after the fact, all the info one wants to know to assess the AI services.
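Just to make that sub-category idea concrete, a hypothetical checklist (this is not MusGO’s actual rubric) for the “training data” category could look like this:

```python
# Hypothetical sub-categories for "training data" openness and a naive score:
# the fraction of sub-categories the model's documentation actually covers.
training_data_openness = {
    "acquisition method documented": True,     # crawled, bought, donated, public domain...
    "filtering / quality control documented": True,
    "licenses and opt-outs documented": False,
    "personal data handling documented": False,
}

score = sum(training_data_openness.values()) / len(training_data_openness)
print(f"training-data openness: {score:.0%}")   # -> training-data openness: 50%
```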

1 Like

This new AI ecosystem (not very ‘eco’ per se), like anything created before, brings entirely new classes of attacks. Remember RowHammer (bit flips in DRAM)? Well, now meet GPUHammer!

Dubbed GPUHammer, the attacks mark the first-ever RowHammer exploit demonstrated against NVIDIA’s GPUs (e.g., NVIDIA A6000 GPU with GDDR6 Memory), causing malicious GPU users to tamper with other users’ data by triggering bit flips in GPU memory.
The most concerning consequence of this behavior, University of Toronto researchers found, is the degradation of an artificial intelligence (AI) model’s accuracy from 80% to less than 1%.
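For intuition on why a single flipped bit can be that destructive, here is a toy NumPy sketch (just an illustration of float bit layout, not the GPUHammer attack itself): flipping one exponent bit turns an ordinary weight into an astronomically large number, which then propagates through every layer that uses it.

```python
# One flipped exponent bit turns a normal float32 weight into ~2.5e38.
import numpy as np

weights = np.array([0.75, -1.2, 0.01], dtype=np.float32)
bits = weights.view(np.uint32)     # reinterpret the same bytes as integers
bits[0] ^= np.uint32(1 << 30)      # flip one high exponent bit of the first weight
print(weights)                     # prints roughly [2.55e+38, -1.2, 0.01]: the model is now garbage
```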

Accuracy IMO will be the measure of the effectiveness of this new AI frenzy

It’s a clear sign that GPUHammer isn’t just a memory glitch—it’s part of a broader wave of attacks targeting the core of AI infrastructure, from GPU-level faults to data poisoning and model pipeline compromise

In the end, we will just have to trust our own human brains…

2 Likes

Ok, we’re doomed :scream:

3 Likes

or:
our brains were too weak…so we invented AI and now we are doomed

1 Like