Currently in final testing, the model will be downloadable under an open license. The model focuses on transparency, multilingual performance, and broad accessibility.
The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible, supporting adoption across science, government, education, and the private sector. This approach is designed to foster both innovation and accountability.
Open LLMs are increasingly viewed as credible alternatives to commercial systems, most of which are developed behind closed doors in the United States or China.
Multilingual by design
A defining characteristic of the LLM is its fluency in over 1000 languages
Some data:
The model is trained on the "Alps" supercomputer at CSCS, one of the world's most advanced AI platforms, equipped with over 10,000 NVIDIA Grace Hopper Superchips
The model will be released in two sizes - 8 billion and 70 billion parameters
Training on over 15 trillion high-quality training tokens
(whatever that means… maybe somebody can enlighten us)
and of course the "Swiss touch":
Responsible data practices
The LLM is being developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act
…
respecting web crawling opt-outs during data acquisition
100 percent carbon-neutral electricity to train the model
To put it simply, tokens are in this context units of data. There are many ways to tokenize data, and there are many levels at which it can be done (characters, words…). If it's words, a short word can be a single token but a longer one may be several. The combinations and relations (think networks of data points) between the tokens are what the statistical models are learning (creating a database/model). There are several texts online with more in-depth info; one to start with: https://medium.com/@jimcanary/ai-tokens-explained-understanding-context-windows-and-processing-a99ca2dd9142
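If you want to see tokenization concretely, here's a minimal sketch using the tiktoken library (OpenAI's BPE tokenizer, used purely for illustration - the Swiss model's own tokenizer hasn't been published, so treat the specifics as an assumption):

```
# pip install tiktoken
import tiktoken

# A widely used byte-pair-encoding (BPE) tokenizer, purely for illustration;
# the Swiss model's own tokenizer is unknown here.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)

print(token_ids)                              # the list of integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids])   # the text piece each ID maps back to

# Short, common words tend to be one token; long or rare words several:
print(len(enc.encode("cat")))                             # typically 1
print(len(enc.encode("antidisestablishmentarianism")))    # typically several
```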
8 billion is considered small and 70 billion is mid-sized. Very roughly, the 8B models have needed about (at least) 8 GB of GPU memory to run, so that model is just about usable on common computers. The trend over the last year has been that bigger isn't always better as algorithms and fine-tuning have advanced - again simplifying, a model made specifically for something can be better than one that is made to do everything and includes everything + the kitchen sink [links are irrelevant to AI, just a bit of history].
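That "about 8 GB" figure is basically parameter count times bytes per weight. A rough back-of-envelope sketch (weights only - activations, KV cache and runtime overhead come on top, so these are lower bounds):

```
def rough_weight_memory_gb(n_params_billion: float, bytes_per_weight: float) -> float:
    """Weights-only lower bound; ignores activations, KV cache and runtime overhead."""
    return n_params_billion * 1e9 * bytes_per_weight / 1024**3

for label, bpw in [("fp16/bf16", 2), ("8-bit quantized", 1), ("4-bit quantized", 0.5)]:
    print(f"8B  @ {label:>15}: ~{rough_weight_memory_gb(8, bpw):6.1f} GB")
    print(f"70B @ {label:>15}: ~{rough_weight_memory_gb(70, bpw):6.1f} GB")
```

Which is why the 8B model squeezes into roughly 8 GB only once it's quantized to around 8 bits per weight, while the 70B one stays firmly in server territory.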
The model is not available for testing (benchmarking really) at that site just yet - it's still being finalized (no idea what the lag is before it might appear there). And remember, those sites should not be used for anything sensitive (and if you're using them regularly for something other than testing, you are wasting their resources). That being said, I recommend using the side-by-side comparison - this kind of empirical testing is more understandable than what the "accuracy percentages" from task tests tell most people.
Three different things there. The open-sourceness and transparency of models has been a bit of a sliding scale, and fully open (in all aspects) models are rare - a lot of close tries and even more marketing "open". We've had a couple of threads where these have been discussed. But that's just the models. A system or a service that uses a model (offers it for use) is another thing, and those are the ones possibly gathering data (a good model can sit in a bad system). A model doesn't actively learn at the moment of use; a system gathers data, and that data is then used to train a new version of the model. That is why a "bad model in a good system" is possible: you can set up your own system (own server, own computer) where that black-box model is sandboxed. Due to the black-box nature of the models (they are so complex), it's possible (although I haven't come across reports of it) that a model could try to use system weaknesses (or just normal features) to send data to a third party.
Not a straight answer, I know. I wouldn't recommend any one AI service as such - it kinda depends on your risk profile, the data you are using with the service, and whether you trust it. Using your own system (like Ollama or maybe Alpaca) may be pretty safe, but the catch is the limited resources and model size - is that sufficient versus your security and privacy needs? For a truly secure combination, you'd need a trusted and tested good model, an auditable safe system, and a combination of security features to keep an eye on and limit the model, data, user and network.
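For context, "using your own system" with something like Ollama just means a small server running on your own machine; nothing leaves localhost unless you let it. A minimal sketch of talking to a locally running Ollama instance (the model name is only an example - the Swiss model isn't in Ollama's library, at least not yet):

```
# Assumes Ollama is installed and running locally, and a model has been pulled,
# e.g.:  ollama pull llama3.1:8b   (example model, not the Swiss one)
import json
import urllib.request

payload = {
    "model": "llama3.1:8b",   # swap in whatever model you actually pulled
    "prompt": "Explain in one sentence why local inference helps privacy.",
    "stream": False,          # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

The prompt and the reply stay on your own machine, which is what makes the sandboxing argument above workable - the remaining questions are whether the model you pulled is any good and whether your hardware can run it.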
Thank you. The situation you describe is pretty much what I thought it was presently.
Since I don't want to install a model on my own hardware, I am stuck with using a "service" (AI chatbot), and I haven't done so yet for confidentiality and ethical reasons.
This is why I was eagerly waiting for some truly ethical and privacy-respecting service to emerge. This Swiss initiative (which they claim will also be a broadly available service) looks good for my purpose and it may well be what I was waiting for to enter the AI Age (whatever that means…)
Maybe with the emergence of truly ethical models there will also emerge truly secure, privacy-respecting and transparent services. Undoubtedly the catch will then be that they can't operate without charging money for the service, since there's no data to sell. What's the appropriate sum you'd be willing to pay for such a service (the computing ain't cheap: AI is compute-intensive [using a lot of electricity, often forcing others onto less-green sources, and also drinking water for cooling, so AI use itself is often not ethical])?
[edit to add: Some already like the NextCloud services, and they've added AI as an additional service, with several options for how to do that, which seems like a believable way to do it - the user has control, payment is part of other services, and the system and models aren't the worst. Seems relatively OK. Have to say I was at one point expecting Purism to add AI as part of Librem One.]
Is it still open source and legal with an "opt-out" approach? My copyright is not lost just because I take no action such as opting out…
Still, better than what we see elsewhere in LLMs, but I would not even call it ethical or fully open.
Money is the key question here. If there is a valid financing model and everything is transparent, it's very likely they do what they write. They may make private use free while companies and other commercial users have to pay, or they may simply get enough donations. If it is fully free, it may be an investment until they have a huge market share and can start building a monopoly. That may be good for today, but not for tomorrow, so in that case I would recommend not using it. If it costs no money, it costs data. Whatever it is, check it out before using it and keep in mind that even the best promises can be a lie.
I would just take the costs myself and run the model locally, but it also depends on the available hardware.
Ethical is easy: just don't do stupid sh*t. I know, right - blindingly obvious!
The "top" is a bit harder. Training takes a lot of data (and not all of it is free, especially if it needs to be good quality) and that has to be stored and managed. Then there is the computation cost - sure, you don't need the most powerful bitbarn if you're willing to spend a bit more time, but that's still gonna cost. The cheap Chinese ones that caught the big US players off guard still spent millions when they were doing them cheaply (and that didn't include the years of research and development they did before that).
I'm not against the idea, but there are quite substantial hurdles to starting from zero with just crowdsourcing. Maybe start with a specialist mini or small model, specifically for IoT devices and Linux phones, or some similar limited project (still a bit of a challenge)…? Or first define an (acceptable) level of ethicality and a "top" that enthusiasts and the general public can agree on - a standard? Although there is some work done on that front already, if I'm not mistaken, by bigger communities (including linuxfoundation), so maybe supporting them (or some other org) would be wiser…
Top-ethical begins with the training data set. Here are some thoughts on kinds of data and whether it is okay to use them:
Biometric data (faces, voices, fingerprints, …) is not okay, no matter the license (even CC0), as long as there is no specific agreement with the person who owns the original biometrics. If the person is already dead, we cannot ask, so we cannot use it (and since we do not know whether using it would be ethically okay, we should just avoid it).
CC0 and public-domain material can always be used when it contains no biometrics.
CC-BY and anything with more requirements starts to become too complicated: maintaining a list of millions of artists that has to be shipped with the model, and everyone generating an image/text/sound/… has to pass the list along with the output, too. I guess that should already be avoided.
Data donations are totally fine.
Data contracts are also totally fine (paying for training data). But here we do not have enough money to do so.
Our very own data.
So the first step is very clear: training data has to be created by hand, not by crawlers or the like, to make sure the data is really ethical. And crawlers themselves are not ethical as long as there is no agreement with the crawled page, because they externalize costs onto all the crawled internet pages.
Such a database is very hard to build, so it is very unlikely we can do it. If we could, we would already be very close to the final goal. We would not even need to code and train the AI, because we could sell or give away that data. We could create licenses that only allow it to be used for open-source AI and that do not allow harmful models, like mass-surveillance stuff, etc.
The much simpler option would be to create specialized AIs, like an AI that saves energy in a technical process (a render pipeline or similar). But this is something completely different from LLM use cases.
I came across an interesting research article, still in beta phase, that looks at the ethicality and openness questions of AI systems. I'm only halfway through skimming it, but it collects most of the salient points and has done some work on how, for instance, data rights holders see the various aspects - a good hub for sources, if nothing else: MusGO: A Community-Driven Framework for Assessing Openness in Music-Generative AI
From the article, about the relevant categories in their framework: "13 openness categories: 8 essential and 5 desirable", which seems like a number of categories one can actually comprehend. But "training data", for instance, could still perhaps be divided into several sub-categories: how the data was acquired (crawled, bought, pirated, free, public domain, etc.), whether it was filtered or quality-controlled (enhanced, censored, etc.), how the licenses and approvals stand (do the people potentially involved know that their data is used, and could they object, etc.), whether personal info is included and how personal it is, and similar areas. As these frameworks for what should be required come later, it becomes a challenge to gather, after the fact, all the info one wants to know in order to assess the AI services.
This new AI ecosystem (not very "eco" per se), like anything created before, brings entirely new classes of attacks. Remember RowHammer (bit flips in the DRAM)? Well, now meet GPUHammer!
Dubbed GPUHammer, the attacks mark the first-ever RowHammer exploit demonstrated against NVIDIA's GPUs (e.g., NVIDIA A6000 GPU with GDDR6 Memory), causing malicious GPU users to tamper with other users' data by triggering bit flips in GPU memory.
The most concerning consequence of this behavior, University of Toronto researchers found, is the degradation of an artificial intelligence (AI) modelâs accuracy from 80% to less than 1%.
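For intuition on why a handful of flipped bits can crater a model's accuracy: weights are stored as floating-point numbers, and flipping a single exponent bit changes a weight by orders of magnitude. A small illustration in plain Python (nothing GPU-specific, just the IEEE-754 arithmetic):

```
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 representation of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

weight = 0.0123  # a typical small neural-network weight
print(flip_bit(weight, 3))    # low mantissa bit: the value barely changes
print(flip_bit(weight, 30))   # high exponent bit: the weight blows up by many orders of magnitude
```

One such flip in a heavily used weight corrupts everything computed through it, which is how an accuracy drop like the 80%-to-under-1% figure above becomes possible.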
Accuracy IMO will be the measure of the effectiveness of this new AI frenzy
It's a clear sign that GPUHammer isn't just a memory glitch - it's part of a broader wave of attacks targeting the core of AI infrastructure, from GPU-level faults to data poisoning and model pipeline compromise.
In the end, we will just have to trust our own human brains…