Hard Drive Improvement

So, here’s the idea: we put a chip between the SSD and the motherboard whose sole function is to cache and compress. As an intermediate step for an NVMe drive, we could fit 16 GB of RAM onto that chip; half of this space would hold a compression dictionary, and the other half would be used for caching. It could also have some flash storage, say a 128 GB dictionary that can be accessed quite fast (parallel lines instead of one block), containing another, deeper compression dictionary. With compression like this, a 4 TB SSD could hold so much video that it may seem impossible. I suggested something like this for networking too, on the Ubuntu forums. With both of those compressions, videos become a really small thing to transfer. Trust me. We need to form some association to design the algorithm and select the material to compress into that dictionary.

1 Like

Hmm. I have not personally been in a situation where NVMe was not fast enough. Instead, the problems that I tend to have today are often problems of software rather than hardware. If I had an army of copies of myself with infinite money to feed them, and they wrote tons of software for my Purism hardware, it would basically be perfect. I don’t have that, though.

I wasn’t able to accomplish this part of your request because we were conversing on the public internet. Could you try posting some more on the Purism forums to boost your trust level(s)?

1 Like

It’s not just about the speed, it’s also about compression. If a hard drive can store more data, you spend less on hard drives.

1 Like
  1. Most good SSDs already have a DRAM cache. Adding a separate cache would probably not have a significant effect.

  2. Some filesystems have built-in compression → e.g. BTRFS. This consumes CPU resources, but still generally speeds up access. It would be better if the compression were hardware accelerated (some systems have this). Nonetheless the compression should be done by the filesystem and not the disk. See below.

  3. Compression:
    a. Not every file compresses the same.
    – Video files like MP4 are already compressed and, so, do not compress well. Using standard libraries (gzip, zstd, xz, …) to compress further often only gains a few percent.
    – Encrypted files do not compress well. Why? The basic idea is that encrypted data has high entropy (i.e. appears random) and high entropy data is, by definition, not as compressible.
    b. Because we like to have our filesystems encrypted, if we are going to have compression on the hard drive, the compression has to be done by the filesystem before the data is encrypted. If it is done at the stage of the SSD hardware writes, the data is already encrypted and, because of the encrypted-data point in (a), it doesn’t compress well.

Thus, I think your idea is naive. That said, having a dedicated HW compression chip that the filesystem knew how to use/access would be a good idea. IBM and others have done this as part of their network storage devices.
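For anyone who wants to see point 3 in action, here is a minimal sketch (Python, standard library only) comparing how well highly redundant text compresses versus high-entropy data. os.urandom() merely stands in for encrypted or already-compressed content, so the exact numbers will vary, but the gap should be dramatic.

```python
# Minimal sketch: redundant text shrinks a lot; high-entropy data barely shrinks.
import os
import zlib

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(data, level=9)) / len(data)

# Highly redundant text: compresses very well.
text = b"the quick brown fox jumps over the lazy dog\n" * 10_000

# os.urandom() stands in for encrypted (or already well-compressed) data:
# high entropy, so a general-purpose compressor gains almost nothing.
random_like = os.urandom(len(text))

print(f"plain text  : {ratio(text):.3f} of original size")
print(f"high entropy: {ratio(random_like):.3f} of original size")
```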

2 Likes

I’m talking about a big pre-shared dictionary that would, for sure, allow videos to be compressed even further. The DRAM cache is necessary to store that big dictionary.
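For what it’s worth, here is a minimal sketch of what a pre-shared-dictionary scheme looks like in software, assuming the third-party python-zstandard package (pip install zstandard). The samples directory and the 1 MiB dictionary size are placeholders; a real shared dictionary for video would have to be trained on far more, and far more representative, data.

```python
# Minimal sketch of shared-dictionary compression with python-zstandard.
# Both sides (the "middle chip" and the host) hold the same trained dictionary,
# so only the dictionary-compressed payload has to be stored.
import pathlib
import zstandard as zstd

# Train a dictionary from sample files resembling the data to be stored.
samples = [p.read_bytes() for p in pathlib.Path("samples").glob("*") if p.is_file()]  # hypothetical dir
shared_dict = zstd.train_dictionary(1024 * 1024, samples)  # 1 MiB dictionary

compressor = zstd.ZstdCompressor(dict_data=shared_dict)
decompressor = zstd.ZstdDecompressor(dict_data=shared_dict)

payload = samples[0]  # one of the samples used as a stand-in payload
compressed = compressor.compress(payload)
assert decompressor.decompress(compressed) == payload
print(f"{len(payload)} -> {len(compressed)} bytes")
```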

Also, if it were an NVMe card that interacts with the NVMe SSD as a middle step, the computer could use a different driver than a block device and treat the card as a filesystem instead of block space, and the card would handle its own filesystem on the real drive. It could also support various forms of encryption that way.

It could also be RAID-ready, so one NVMe card goes into the motherboard and has slots for 2 or 4 NVMe drives.
Then everything that can be done with a block device can be done with this RAID: it writes data in a pattern so the real data lands on only one of the two disks, and then, when idle, it clones itself. That would be the best RAID available today.

Also, having faster DRAM in this could mean that files which are read very often could be streamed, via the card’s write hooks, straight from its RAM on read, even before they have been written down to the drive.

I could surely design hardware that does all of this.

Thanks for your attention.

1 Like

Great, let us know when you have reached production status and the firmware and hardware are both open-source too.

1 Like

To add to that …

this can be solved by having the disk do compress-then-encrypt but “we” don’t like that because it usually makes the encryption a black box. So it isn’t really acceptable unless either

  • you trust the disk manufacturer, or
  • all the software and firmware is open source

and you don’t need any of the features of host-based-encryption.

To further that latter thought … one of the problems with disk-based encryption is that inevitably the disk becomes abandonware, which means that it will not be tracking the latest improvements in security. So, for example, as the world moves away from PBKDF2 to Argon2 for key-slot key derivation, your disk would stay behind and be more vulnerable than it needs to be.

Worse still, the procedure for upgrading disk firmware is often baroque at best and Windows-only at worst.

Some of these observations could also apply to a smart disk that did compression but not encryption.

Perhaps the OP should try this him-/her-self.

I took a random JPEG and compressed it with gzip - and I achieved a compression (reduction) of 0.3% - which is hardly worth worrying about.
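If anyone wants to reproduce that test, a sketch along these lines should do; the file name is just a placeholder for whatever JPEG (or MP4, ZIP, …) is lying around.

```python
# Minimal sketch: report how much gzip shrinks a single file.
import gzip
import pathlib

path = pathlib.Path("photo.jpg")  # hypothetical file; substitute your own
data = path.read_bytes()
compressed = gzip.compress(data, compresslevel=9)
reduction = 100 * (1 - len(compressed) / len(data))
print(f"{path.name}: {len(data)} -> {len(compressed)} bytes ({reduction:.1f}% smaller)")
```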

Also concerning is that individual file formats can use domain-specific compression (optimised for the actual nature of the uncompressed data effectively contained therein) whereas a disk is usually relegated to applying the same compression algorithm across the entire disk, without regard to the nature of the content (as the disk has no concept of the file system, much less the file content).

The following is not a rigorous claim but it is my impression that no one bothers much about compression in hardware these days because the central CPU has become so fast.

Presumably the dictionary would need to be stored permanently on the disk somewhere also.

2 Likes

IMO, and from my knowledge of computer history, compressing disks is not all that useful.

My first real purchased computer was a Mac and there was a program called RAM Doubler, I think, that compressed RAM on the fly, giving you “more” memory than was in the computer. This was useful as a way to eke a little more life out of old hardware since RAM expansion was rather limited. (E.g. I like to “brag” that I bought a 32MB SIMM for my Macintosh LC III for $1,200 in the 1990s—Yes, 32 MEGAbytes, 0.032GB.) While RAM slots are still limited and address width can prevent you from indefinitely doubling your memory, secondary storage has become nearly immune to this issue, and you can stick a drive with 1000x the capacity of the largest drive available at the time of a computer’s production (e.g. that LC III came with a 160MB drive, but I could easily have put a 1.6TB drive in its place without issue, while the RAM had a hard limit of 32MB.)

It seems Moore’s “Law” has been pretty consistent when it comes to memory and storage. As such, for any fixed price, the amount of storage you get goes up and up each year, and there is no reason this can’t continue for a long time to come. For instance, I started doing backups with tapes and then CD-ROM burners, but it turned out I was limited to the size of each media type and those strategies quickly became non-viable. Now I use hard drives: whatever about $150 will buy. These days that’s about an 8TB spinning drive.

So compressing data for storage is kind of a fool’s errand: the additional processing cost and speed reduction are just not worth it when, in a year’s time, any additional space you gain from compression is overshadowed by the newest storage at the same price.

2 Likes

Just to add to that … I’ve already mentioned JPEG files but the following I believe are all compressed using PKZIP format (zip format):

  • LibreOffice document formats (“OpenDocument”)
  • Microsoft Office 2007 document formats (“OpenXML”) - I still have a few legacy documents that I have never converted
  • Java archive files (.jar files)

PDF is more complex because, as a “programming language”, the content is free to be compressed or not compressed - but I would guess that most PDF documents use compression, and a PDF may contain embedded images that are themselves stored compressed.

Throw in some MP3 files (domain-specific compression). Or FLAC if you prefer.

It wouldn’t surprise me if the substantial majority of the disk space used by my files that are not installed software is already occupied by files that are compressed internally - which would severely limit the utility of disk compression.

Email would benefit because most email is either ASCII (for me) or Base-64 encoded binary attachments - both of which will compress with any half-decent algorithm.
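As a rough illustration of that last point: Base-64 uses only 64 symbols, i.e. about 6 bits of information per 8-bit character, so even when the underlying attachment is incompressible, a general-purpose compressor should claw the encoded form back to roughly three quarters of its size. A minimal sketch, standard library only:

```python
# Minimal sketch: base64-encoded random bytes recompress to roughly 75-80%
# of the encoded size, even though the underlying data is incompressible.
import base64
import os
import zlib

raw = os.urandom(1_000_000)        # stands in for a binary attachment
encoded = base64.b64encode(raw)    # what actually travels in the email
rezipped = zlib.compress(encoded, level=9)

print(f"raw bytes    : {len(raw)}")
print(f"base64       : {len(encoded)}")
print(f"re-compressed: {len(rezipped)}")
```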

So if I were to boot using squashfs that would just about cover everything (except email). :slight_smile:

3 Likes

I must argue that if you have 100 similar MP3s stored as MP3s, they take up the file size of 100 MP3s, but if you’ve got whole-disk compression, it takes advantage of the fact that there are compressible similarities across multiple files.

1 Like

It depends on what you mean by “there are compressible similarities across multiple files”. For common usage this just isn’t true. You should take an album (a dozen MP3s) and create a tarfile (similar to a disk) and see how much you can compress it with any algorithm you want vs. compressing each file individually. Even an album that contains two different versions (i.e. different recordings; e.g. the same song with a guest artist and possibly a different tempo or arrangement) of the same song does not compress appreciably. [Edit: If you are talking about 100 MP3 files that are all variations/filters/cuts of what was the exact same recording, then, yes, one can compress that. This is just not common outside of a recording studio.]
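For anyone who wants to run that experiment, here is a minimal sketch (Python, standard library only). The album path is a placeholder; xz/LZMA at preset 9 is used because its 64 MiB dictionary is large enough to span several tracks, which is the most favourable case for finding cross-file similarities.

```python
# Minimal sketch: compress an album track-by-track, then as one tar archive,
# and compare the totals. The directory path is a placeholder.
import io
import lzma
import pathlib
import tarfile

album = pathlib.Path("~/Music/some-album").expanduser()  # hypothetical path
tracks = sorted(album.glob("*.mp3"))

# Size when each track is compressed on its own.
individual = sum(len(lzma.compress(t.read_bytes(), preset=9)) for t in tracks)

# Size when the whole album is packed into one tar stream and compressed once,
# giving the compressor a chance to exploit similarities across files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for t in tracks:
        tar.add(str(t), arcname=t.name)
whole = len(lzma.compress(buf.getvalue(), preset=9))

original = sum(t.stat().st_size for t in tracks)
print(f"original bytes      : {original}")
print(f"compressed per file : {individual}")
print(f"compressed as one   : {whole}")
```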

Please do this for yourself since until you do, I fear you won’t understand.

There are algorithms that can do a bit better than the standard compression techniques when trying to find global similarities, but these are generally incredibly time intensive (not bounded in principle, but typically 2-3 orders of magnitude slower, i.e. a factor of 100 or 1000). See, for example, the winners of Hutter’s Prize ( Hutter Prize - Wikipedia ).

1 Like

But even then, I was also off by an order of magnitude: with disk compression you would actually have 100,000 MP3s there, and there’s a huge possibility you’ll fit even more than that.

Just like they lied at the Clay Institute, for which I optimized their unpairable-students problem, they lie here too: you won’t get the money. Possibly.

1 Like

In regard to directories of MP3s, say an album: I do not think you’ll be able to compress them to smaller than 95% of the original MP3 size. If you have, say, 50 albums (roughly 500 MP3s), I would say the same. You should really try it. While I think you understand the methodology of compression, I believe your expectations regarding compression are naive, and I think you will learn a lot by trying.

In regard to prizes being a lie: your comment about the Clay Institute lying is strange. Presumably you’re talking about one of the Millennium Prize problems. I find it far more likely that you’ve misunderstood something than that you’ve actually succeeded. I’m a mathematician and know the (now former) President of the Clay Institute well.

While the Hutter Prize has been deceptively advertised, it is real. It is 5,000 Euros for each 1% improvement over the previous year … with a total maximum of 500,000 Euros. Each year they award a prize based on the improvement. I think they’ve paid out a total of 30,000 Euros, and the maximum payout has been 9,000 Euros so far. I encourage you to try that too — I gave it a try and found it very instructive.

1 Like

6 posts were split to a new topic: Question: Is P = NP?

I would need to get 1 terabyte of data from the internet; with my provider that would take something like 7 months.

1 Like

You don’t have any mp3’s hanging around? You could try it with, for example, 10 albums of mp3’s.

1 Like

Making claims without evidence or proof is not going to convince anyone about your credibility, and neither will sharing your hunches about your perception of others.

Prove it so we can move this thread forward.

1 Like

That is true in a specific circumstance, and there are disk compression schemes that operate like that.

However as @Privacy2 points out, “similar” likely won’t be good enough to detect for compression in the case of music. You have “radio edits” and “club mixes” and “remixes”. Even “cover versions”. Which might all meet the definition of “similar”. But won’t look similar at the level of the disk block.

The circumstance where this should work is that you have a large number of compilation albums and eventually you have entirely duplicated tracks. Those could compress under “disk block deduplication”, or using .tar.gz (or similar) as implied by @Privacy2. But I’m not certain that gzip (or similar) is actually going to be good enough to detect the duplicates. (See paragraph after the next.)
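One cheap way to test the duplicated-tracks case without special hardware is to estimate block-level deduplication directly: hash the data in fixed-size blocks and count how many blocks are unique. (gzip would not find whole duplicate tracks in a tar anyway, since DEFLATE can only match against the previous 32 KB.) A minimal sketch, with the directory and block size as placeholders:

```python
# Minimal sketch: estimate block-level deduplication savings over a directory
# by hashing fixed-size blocks and counting unique ones.
import hashlib
import pathlib

root = pathlib.Path("~/Music").expanduser()  # hypothetical directory
block_size = 4096  # 4 KiB, a typical filesystem block size

total = 0
unique = set()
for path in root.rglob("*"):
    if not path.is_file():
        continue
    with path.open("rb") as f:
        while block := f.read(block_size):
            total += 1
            unique.add(hashlib.sha256(block).digest())

print(f"blocks total : {total}")
print(f"blocks unique: {len(unique)}")
print(f"deduplication would keep ~{100 * len(unique) / max(total, 1):.1f}% of the data")
```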

Another complication … in my case … is that a significant fraction of MP3s have been converted from CD. So identical uncompressed sound data from a CD might produce a different MP3 if converted with different software / version e.g. converted at quite different times - or if converted with different conversion options.

However, as @Privacy2 implies, the proof of the pudding is in the eating. No compression algorithm is any good unless you can show that in real-world use cases it actually achieves compression.

I don’t think you need to build any hardware. Just simulate it in software first to see what the results are. If you are getting good results then it may be worth implementing in hardware. If you aren’t getting good results then you saved yourself the effort in making the hardware.

1 Like

This post was flagged by the community and is temporarily hidden.

1 Like