filesystems - Why is there such a big difference between "Size" and "Size on disk"?

  • thelastblack

    As you can see below, there is a huge difference between the Size and Size on disk fields for my folder. Why is that?

    Screenshot showing 50,875 files in 1,504 folders, 105 MB being 1.43 GB on disk

    I know that Size on disk should be a little more than Size because of allocation units in Windows, but why that much of a difference? Could it be because of the large number of files?

    BTW, this folder is on my Android phone's SD card. Inside it, my maps app stores its cached maps; the app gets its maps from Google Maps.

  • Answers
  • Bob

    I will be assuming that you are using the FAT/FAT32 filesystem here, since you mention this is an SD card. NTFS and exFAT behave similarly with regard to allocation units. Other filesystems might differ, but they aren't supported on Windows anyway.

    If you have a lot of small files, this is certainly possible. Consider this:

    • 50,000 files.

    • 32 kB cluster size (allocation unit size), which is the maximum for FAT32

    Ok, now the minimum space taken is 50,000 * 32,000 = 1.6 GB (using SI prefixes, not binary, to simplify the maths). The space each file takes on the disk is always a multiple of the allocation unit size - and here we're assuming each file is actually small enough to fit within a single unit, with some (wasted) space left over.

    If each file averaged 2 kB, you'd get about 100 MB total - but you're also wasting 15x that (30 kB per file) on average due to the allocation unit size.
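    You can sanity-check this arithmetic from a Command Prompt; set /a evaluates integer expressions and prints the result. (This uses the exact binary values, 32 kB = 32,768 bytes and 2 kB = 2,048 bytes, so the totals differ slightly from the rounded figures above.)

    C:\> set /a "50000 * 32768"
    1638400000

    C:\> set /a "50000 * 2048"
    102400000

    That's roughly 1.6 GB on disk to hold roughly 100 MB of actual data.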


    In-depth explanation

    Why does this happen? Well, the FAT32 filesystem needs to keep track of where each file is stored. If it were to keep a list of every single byte, the table (like an address book) would grow at the same speed as the data - and waste a lot of space. So what they do is use "allocation units", also known as the "cluster size". The volume is divided into these allocation units, and as far as the filesystem is concerned, they cannot be subdivided - those are the smallest blocks it can address. Much like you have a house number, but your postman doesn't care how many bedrooms you have or who lives in them.

    So what happens if you have a very small file? Well, the filesystem doesn't care whether the file is 0 kB, 2 kB or even 15 kB; it'll give it the least space it can - in the example above, that's 32 kB. Your file is only using a small amount of this space, and the rest is basically wasted, yet still belongs to the file - much like a bedroom you leave unoccupied.
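    In code terms, the rule is simply "round up to the next multiple of the allocation unit size". A quick sketch at a Command Prompt, using the 2 kB average file size assumed above (integer division does the rounding):

    C:\> set /a "(2048 + 32768 - 1) / 32768 * 32768"
    32768

    The 2 kB file still occupies a full 32 kB unit on disk.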

    Why are there different allocation unit sizes? Well, it becomes a tradeoff between having a bigger table (address book, e.g. saying John owns a house at 123 Fake Street, 124 Fake Street, 666 Satan Lane, etc.), or more wasted space in each unit (house). If you have larger files, it makes more sense to use larger allocation units - because a file doesn't get a new unit (house) until all others are filled up. If you have lots of small files, well, you're going to have a big table (address book) anyway so may as well give them small units (houses).

    Large allocation units, as a general rule, will waste a lot of space if you have lots of small files. There usually isn't a good reason to go above 4 kB for general use.
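    To put numbers on that tradeoff, here is the space wasted by 50,000 files averaging 2 kB each, comparing 4 kB against 32 kB allocation units (same set /a sketch as above):

    C:\> set /a "50000 * (4096 - 2048)"
    102400000

    C:\> set /a "50000 * (32768 - 2048)"
    1536000000

    Roughly 100 MB wasted with 4 kB units versus roughly 1.5 GB wasted with 32 kB units.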


    Fragmentation?

    Fragmentation shouldn't waste space in this manner. Large files may be fragmented, i.e. split up into multiple allocation units, but each unit is filled before the next one is started. Defragging might save a little space in the allocation tables, but it isn't your specific issue.


    Possible solutions

    As gladiator2345 suggested, your only real options at this point are to live with it or reformat with smaller allocation units.

    Your card might be formatted in FAT16, which has a smaller limit on table size (2^16 = 65,536 clusters) and therefore needs much larger allocation units to address a larger volume - it tops out at 2 GB with 32 kB allocation units, since 65,536 × 32 kB = 2 GB. Source courtesy of Braiam. If that is the case, you should be able to safely reformat as FAT32.

  • Braiam

    This is one of those situations where compressing/archiving into a single file may help. What Bob said in his answer is true, but the solution may be easier than reformatting the disk as other answers suggest. If you compress or archive the directory (using zip, tar, or any other method), the file system will see a single big file instead of several smaller ones. Even without compression you will get back almost 1.4 GiB of space, because all those "small files" will be counted as a single big file.

    Inside this, my maps app stores its cached maps and the app gets its map from Google Maps

    Maybe you should discuss with the developers using an archive or a database instead of multiple files. This will probably also leave the disk less fragmented and will surely save space, especially on a NAND flash drive. If you explain the ridiculous situation where 100 MB of payload/useful data becomes 1.4 GiB, it should be clear that something is wrong with how the data is stored, and the developers should come up with a nicer solution.
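    As a rough sketch of the archiving idea: current Windows 10/11 releases bundle bsdtar, so assuming the cache lives in a folder named Cache (a made-up name for illustration), one command replaces thousands of small files with a single compressed archive:

    C:\> tar -czf cache.tar.gz Cache

    On older systems without the bundled tar, a third-party tool such as 7-Zip achieves the same thing: one big file, one run of allocation units.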

  • Approaching minimums

    In case anyone is confronted with this problem, it is also useful to know that another reason for a big difference between file size and size on disk is the use of alternate data streams (ADS).

    To my knowledge, this applies only to NTFS. ADS are known for both legitimate and illegitimate uses:

    • to tag a file as downloaded from the Internet
    • to store metadata (Microsoft wanted to include some Apple OS features, like not relying on the file extension to determine a file's type)
    • to hide data or code in the context of malware.

    ADS in short: any NTFS file can hold multiple data streams (think of them as "subfiles"). One is the main stream, used by Windows Explorer and other Windows tools; it holds the usual content of the file. Alternate data streams may contain other information, exactly like the main stream, but they cannot be handled directly by Windows tools (in particular, Explorer displays the file size as the size of the main stream alone, regardless of the size of any ADS), so you have to use specialized tools or code to write, read, and locate ADS.

    The main point: when you observe a big file-size difference, don't overlook the possibility of ADS and hidden malware.


    To safely experiment with ADS, try this at a Command Prompt...

    Create and then display the content of a file in the root of C:

    C:\> echo The main data stream> test.txt
    C:\> type test.txt
    

    Result:

    The main data stream
    

    Now add an ADS with the same method; just specify the ADS name in addition to the file name:

    C:\> echo The secret message> test.txt:secret
    

    You have just hidden the secret message in the file. Note that the file size shown in Explorer has not changed, even though we added bytes to the ADS "secret".

    Try to display the ADS content:

    C:\> type test.txt:secret
    

    Result:

    The filename, directory name, or volume label syntax is incorrect.
    

    CMD's type command is not able to display the content of the ADS. We will use Notepad instead:

    C:\> notepad test.txt:secret
    

    In Notepad we can see the content of the ADS:

    The secret message
    

    You can also hide a full executable in an ADS of an innocent-looking text file and run it at any time - a convenient hiding place for hackers :-)
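    Finally, to locate ADS without third-party tools, the dir command has had an /R switch since Windows Vista that lists every stream of a file. With the test.txt created above (output abbreviated; 22 and 20 bytes are the two echoed strings plus their CR/LF):

    C:\> dir /R test.txt
    ...
                22 test.txt
                20 test.txt:secret:$DATA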

  • Kevin Panko

    The problem may be because of the cluster size.

    According to Microsoft:

    If you are not using NTFS compression for any files or folders contained on the volume, the difference between SIZE and SIZE ON DISK is wasted space because of a larger-than-necessary cluster size. You should attempt to use an optimal cluster size so that the SIZE ON DISK value is as close to the SIZE value as possible. An excessive discrepancy between the SIZE ON DISK and the SIZE value is an indication that the default cluster size is too large for the average file size that you are storing on the volume, and that it should be decreased. This can be done only by backing up the volume and then reformatting the volume by using the format command and the /a switch to specify the appropriate allocation size: IE: format D: /a:2048 (This example uses a 2-KB cluster size).

    Try formatting your drive with a smaller cluster size.
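    For example, assuming the card mounts as E: (back up first - formatting erases everything):

    C:\> format E: /FS:FAT32 /A:4096

    This reformats the volume as FAT32 with 4 kB allocation units.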

  • Matias N Goldberg

    I see many people recommending reformatting your drive with a smaller cluster size. Since this is an SD card, note that many vendors pre-format the card with the recommended cluster size to match the NAND's cluster size (keeping both in sync is very important for optimum read/write performance and for reducing wear).

    You can't change the NAND's cluster size (it's a physical attribute of your SD card's hardware).

    First, run chkdsk on your SD card to be sure the size-report problem doesn't stem from a corrupted filesystem.
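    For example, assuming the card mounts as E:, the /f switch tells chkdsk to fix any errors it finds:

    C:\> chkdsk E: /f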

    Second, I'd suggest you report the bug to the Google Maps devs, since they are the ones to blame here. They should be using a superior storage method. Fixing it should also make the app run faster on many devices, thanks to less I/O and filesystem-driver activity.

  • CyberSkull

    This is a general issue with many filesystems. There are two factors at work here: the maximum number of "blocks" a filesystem can handle per logical volume, and the physical restrictions of the storage medium. Only one file can be allocated to any given block (files generally take as many blocks as they need). So a 64-byte text file can easily take anywhere from 4 kB to 32 kB, depending on the block size of the filesystem it resides on.

    One way to think about this is think of each block in the filesystem as a box, and the filesystem as a room. All your boxes are the same size, and you try to fit as many as you can in a room. If you fit them all in with more room left over, you have to get bigger boxes so that the room is filled completely with boxes.

    One of the rules for putting things in boxes is that you can't put two unrelated things in a box; they have to be part of the same document. So if I were to type up a page of text, it would have its own box. If my typed text had so many pages that I couldn't fit it all in one box, I'd simply find another box and continue putting pages in there instead, repeating until I'd filed all my pages. I'd also write down which boxes I'd used for that document and the order to read them in.

    Depending on how I organize the boxes, I might only have enough room in my manifest for a certain number of them. So if I had a big room to fill but could list only a small number of boxes, I'd have to use very large boxes to reach the room's capacity.

    So in that case my one page document would still occupy a single box, with nothing else sharing it.

    The same situations play out amongst various storage solutions. FAT32 can only manage what is considered a low number of "boxes" on today's huge hard drives, so it ends up with very large "boxes" to compensate for this.

  • Archimedes Trajano

    Aside from cluster sizes, you can also have a discrepancy due to the following conditions:

    • Compressed or encrypted files can take up a different amount of space than their logical file size.
    • Linked files will report the file size once for every link (n links means n times the size) in the logical total, but the physical space used is usually less.
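    Both effects are easy to reproduce on NTFS. A minimal sketch (test.txt stands for any existing file; fsutil usually requires an elevated prompt):

    C:\> compact /c test.txt
    C:\> fsutil hardlink create link.txt test.txt

    The first command compresses the file, so its size on disk can drop below its logical size; the second creates a hard link, so tools that total logical sizes may count the same physical bytes twice.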
  • kriss

    You should have a look at the Block suballocation entry on Wikipedia. That is exactly what's happening to you. Using a file system with support for tail packing is a file-system-level solution to this problem, besides changing the allocation cluster size.

    Both have the inconvenience of requiring you to reformat the disk.

    In some cases, merely storing those files in an archive will fix the problem (the small files also get compressed, on top of no longer losing space at the end of each file). The inconvenience is spending some time on decompression.

    Another option, if you have so many small files because of some specific application-related problem, is to store the software's data using another method (maybe a database). But of course that's a solution for programmers, not end users.

    http://en.wikipedia.org/wiki/Tail_packing


  • Related Question

    filesystems - Why isn't write-access to NTFS partitions provided with Mac OS X?
  • Matias Nino

    Does Microsoft simply not allow it or is it because Apple refuses to pay licensing?

    I know there are software workarounds, but my question is simply WHY?


  • Related Answers
  • Synetech

    It is a licensing issue. NTFS is Microsoft's proprietary format and they hold the rights to it.

    A developer asked on the MSDN forums about getting a license to the NTFS specs for an app he was writing but was unable to get that information. There is (limited) information available in the Technical Reference.

    The Wikipedia page on NTFS mentions a couple of third-party solutions:

    Mac OS X v10.3 and later include read-only support for NTFS-formatted partitions. The GPL-licenced NTFS-3G also works on Mac OS X through FUSE and allows reading and writing to NTFS partitions. A proprietary solution for Mac OS X with read/write access is "Paragon NTFS for Mac OS X".[23]

  • EvilChookie

    I haven't found any concrete evidence online saying why or why not; however, I believe you're correct that it's a licensing issue.

    Snow Leopard ships HFS+ drivers for use with Boot Camp and Windows - meaning you can use your HFS+-formatted drives in Windows.

    Perhaps we'll see a similar trend from Microsoft regarding NTFS - a possibility, given that Exchange support is now coming to Snow Leopard.

  • Paul Schifferer

    I don't think it's necessarily a licensing issue. The Linux kernel is able to read NTFS filesystems, and can also write to them (though they consider that "dangerous"). Mac OS X has the ability to read NTFS, but not write.

  • las3rjock

    From the NTFS Wikipedia page:

    Details on the implementation's internals are not released, which makes it difficult for third-party vendors to provide tools to handle NTFS.

    This suggests that it's some sort of licensing issue. Were it not for the NTFS-3G project, there would probably be no way for most non-Windows operating systems (Linux, Mac OS X, etc.) to write to NTFS partitions.