Chapter 1 - File Systems.

Gary Kildall, the man that should have been Bill Gates

It is 2202, and your great-great-grandchild has a school project: find out something interesting or notable about an ancestor and about how life was in 2020. The first question they ask is ‘where was their data stored?’. Today, when we ask that question about prehistoric societies, we look to markings etched on cave walls, or objects stowed in containers buried under the rubble of time. When we ask it about modern history, we search libraries and archives. ‘Facebook’s servers’ is the reply your posterity is looking for. Their data was stored on large companies’ servers or, more accurately, magnetized into sectors on hard disks and accessible only through the decoding provided by file systems. Every piece of information that we generate, consume or are consumed by in the modern world is captured within a file system of some sort, and that is an anchor of our data-driven reality. Just think about that for a second. It doesn’t matter whether you use a Mac, Windows or Linux computer, or a mobile device. Any fundamental change to the technology used to magnetize data into 1s and 0s, or to how that information is retrieved and processed in a file system, will affect how you use technology and change what Sonny might find when searching for you two hundred years hence. At the same time, any event that fundamentally alters this very well-calibrated process of writing information to disk, and reading it from disk into memory for display, will limit what the future will find of our activity today. For all our disaster recovery planning, that is quite a humbling thought. The purpose of this book is to introduce the data scientist and the generally interested reader to the fundamental anchors of modern computing that are relevant to their work as current or aspiring data handlers. Another purpose is to consider some of the encores that the world has experienced with respect to these enabling technologies.
For instance, magnetic writing to disks was an encore of using transistors to hold electrical charges to store data. Transistors were an encore of punching holes in punch cards with standardized layouts, and so the wheel of time turns backwards to the very first etchings in a rock face and forward to the font face you are staring at on this screen.

Files started as Files

One of the earliest ways files were stored was on sequential tape drives, kind of like a roll of camera film, itself pretty much obsolete, or, to put it even more simply, like a roll of cellotape coated with a special material that could hold data. These tape systems, sold by companies like IBM and DEC, had to be kept in stable environments because they were incredibly sensitive. (DEC was acquired by Compaq in 1998, Compaq was acquired by HP in 2002, and HP split from Hewlett Packard Enterprise in 2015.) Too much heat, a nearby magnet, or just bad luck could wipe them clean.

And if you’ve never heard someone freak out about bringing a magnet near their computer, I can probably guess your age. I’ve got my own war stories—like placing a diskette too close to my car speakers after spending hours at a public internet café, only to get home and find the entire thing unreadable. If you’ve never had to suffer through that, consider yourself lucky!

Before modern file systems, organizing files was a completely different game. There were no folders, no directories—just a massive list of files stored in one place. Early operating systems, such as those in mainframe environments, used a single-level directory where all files were stored in a single space without subdirectories. This meant every file had to have a unique name, making organization difficult as the number of files grew. Early computer users had to manually keep track of everything, sometimes even writing down where things were stored. It wasn’t until systems like Multics came along that the idea of a hierarchical file system took root, where you could have folders inside folders—something so obvious now but revolutionary at the time.

Hierarchical File Systems

Hierarchical file systems refer to the idea that data on your computer is stored in a tree structure, starting from a ‘root’ directory and dropping down levels into folders, with files being terminal locations within folders (you can’t go any deeper than a file in the file system, but you can hide things inside the files themselves, of course: a ghostly reflection in the background of your selfie, an embedded cat image or, if you’re unlucky, a piece of malware patiently waiting for its moment - shudders!). But I digress. Multics was the first operating system to use a hierarchical file system, and as a result many of the command-line commands we use today are inherited from Multics. The vision for the modern file system was first clearly established by some of the developers of Multics in a 1965 paper called “A General-Purpose File System for Secondary Storage”. The file system was formalized to allow users to hold information outside of the computing environment when not in use, so that existing memory could be dedicated to those who were ‘on line’, which is what being on a computer used to be called before the internet was invented. A section of this original paper stands out for how it has shaped the cloak of invisibility that masks file systems to this day. Daley of MIT and Neumann of Bell Labs write, “The basic structure of the file system is independent of machine considerations. Within a hierarchy of files, the user is aware only of symbolic addresses. All physical addressing of a multilevel complex of secondary storage devices is done by the file system, and is not seen by the user.”
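To make the tree idea concrete, here is a minimal Python sketch. It builds a small hierarchy in a temporary folder and then walks it from the root down. The folder and file names are made up for illustration, and the helper functions `build_demo_tree` and `render_tree` are my own inventions, not part of any library.

```python
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> None:
    """Create a small, made-up hierarchy of folders and files under root."""
    (root / "Documents" / "School").mkdir(parents=True)
    (root / "Pictures").mkdir()
    (root / "Documents" / "School" / "project.txt").write_text("ancestors")
    (root / "Pictures" / "selfie.jpg").write_bytes(b"")


def render_tree(folder: Path, depth: int = 0) -> list[str]:
    """Return one indented line per entry, descending into folders only."""
    lines = []
    for entry in sorted(folder.iterdir()):
        marker = "/" if entry.is_dir() else ""
        lines.append("  " * depth + entry.name + marker)
        if entry.is_dir():  # files are terminal: there is nothing below them
            lines.extend(render_tree(entry, depth + 1))
    return lines


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stands in for the 'root' directory
    build_demo_tree(root)
    tree_lines = render_tree(root)

print("\n".join(tree_lines))
# Documents/
#   School/
#     project.txt
# Pictures/
#   selfie.jpg
```

Notice that the program only ever deals in names. Where the bytes physically sit on the disk never comes up, which is exactly the point Daley and Neumann were making.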

Little wonder, then, that the file system idea has proved so portable between different machines and operating systems over time. Little wonder, too, that the workings of file systems have been relegated to the world of computer scientists, while users have learned to take for granted that files will exist and can be moved as needed between devices, through email, across the internet and so on. At its simplest, a file system manages the process of copying in an object known as a file, and maintains addressing information about which specific blocks on a storage device hold the bytes that make up that file. Similarly, file systems have objects known as folders that ‘contain’ files and other folders, and the file system stores and manages, for every folder on the computer, the record of which folders and files it contains. A critical part of the guarantee that a file will be there after we copy it into a system comes from the effectiveness of file systems at managing data transfer and keeping records of where data is stored. There is a window of time between when you initiate a file copy and when the file successfully rests on your disk. On most systems that window is represented by a progress dialog that lingers when copying large files but only flashes for smaller ones.
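The record keeping described above can be sketched as a toy model in Python. This is an illustration only, not how any real file system is implemented: the class `ToyFileSystem`, its tiny four-byte blocks and its dictionaries are all assumptions I have made to keep the idea visible.

```python
class ToyFileSystem:
    """A made-up, in-memory model: a 'disk' of fixed-size blocks plus
    tables recording which blocks hold each file and what each folder holds."""

    def __init__(self, block_count: int = 16, block_size: int = 4):
        self.block_size = block_size
        self.disk = [None] * block_count  # the storage device
        self.block_table = {}             # file path -> list of block numbers
        self.folders = {"/": []}          # folder path -> names it contains

    def write_file(self, folder: str, name: str, data: bytes) -> None:
        # Cut the file's bytes into block-sized chunks.
        chunks = [data[i:i + self.block_size]
                  for i in range(0, len(data), self.block_size)]
        free = [n for n, block in enumerate(self.disk) if block is None]
        if len(chunks) > len(free):
            raise OSError("disk full")
        used = free[:len(chunks)]
        for number, chunk in zip(used, chunks):
            self.disk[number] = chunk
        # The bookkeeping: remember where the bytes went and who contains them.
        self.block_table[folder + name] = used
        self.folders[folder].append(name)

    def read_file(self, folder: str, name: str) -> bytes:
        # The user supplies a symbolic name; the block numbers stay hidden.
        return b"".join(self.disk[n] for n in self.block_table[folder + name])


fs = ToyFileSystem()
fs.write_file("/", "notes.txt", b"hello world")
print(fs.block_table)                 # {'/notes.txt': [0, 1, 2]}
print(fs.read_file("/", "notes.txt")) # b'hello world'
```

Eleven bytes with four-byte blocks need three blocks, and the table is the only thing that knows which three.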

One of the things to know is that older file systems struggled to scale with growing data needs because of limits on file size and other issues such as file fragmentation. The FAT32 file system, used by operating systems such as Windows 95, Windows 98, Windows ME, and early versions of Windows XP, was limited to 4GB files and 2TB partitions, making it impractical for modern workloads. Many older thumb drives ran on FAT32, for instance, but then again thumb drives with up to 4GB of storage were a premium luxury at the time these operating systems were developed. In the early 2000s, a 1GB thumb drive cost around $400. File systems have not remained static since the 1960s or even the early 2000s. For instance, older file systems, such as ext3, FAT32, and NTFS, typically overwrite data in place, meaning that changes are applied directly to existing storage blocks. This approach can lead to issues if a crash or power failure occurs in the middle of a write operation, potentially corrupting files. Some of these systems (ext3 and NTFS, for example) rely on a process known as journaling to reduce the risk of corruption, but recovery is not always guaranteed. Journaling works by recording intended changes in a dedicated journal before applying them to the main file system, ensuring a more structured recovery process in case of failure. However, while journaling reduces the likelihood of corruption, it does not eliminate it entirely, and data loss can still occur if the journal itself becomes compromised or if an unexpected failure happens before changes are fully committed. Newer Copy-on-Write (COW) file systems, such as ZFS, Btrfs, and Bcachefs, take data integrity a step further by eliminating the risks associated with in-place modifications. Instead of overwriting existing data, COW file systems write changes to a new location and only update metadata once the write is complete, ensuring that the previous data remains untouched until the new version is safely stored. This approach not only improves reliability but also enables powerful features such as snapshots, which allow users to revert to previous states without significant overhead. Yes, the online rollback capabilities that you have within your Dropbox or Box folder could soon be available offline for your entire hard drive. [footnote]
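The difference between the two strategies is easier to see side by side. The sketch below is a toy, in-memory model in Python, not the actual design of ext3, NTFS, ZFS or Btrfs. The class names `JournaledStore` and `CowStore`, and the `crash_before_apply` flag used to simulate a power failure, are hypothetical.

```python
class JournaledStore:
    """Overwrites blocks in place, but logs each intended change first."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.journal = []  # pending (index, new_value) records

    def write(self, index, value, crash_before_apply=False):
        self.journal.append((index, value))  # 1. record the intent
        if crash_before_apply:
            return                           # simulated power failure
        self.blocks[index] = value           # 2. apply the change in place
        self.journal.clear()                 # 3. mark the change committed

    def recover(self):
        """Replay any journal entries left behind by a crash."""
        for index, value in self.journal:
            self.blocks[index] = value
        self.journal.clear()


class CowStore:
    """Never overwrites a block: each write produces a new version."""

    def __init__(self, blocks):
        self.current = tuple(blocks)  # the 'metadata pointer' to live data
        self.snapshots = {}

    def snapshot(self, name):
        # A snapshot is just another reference to the existing version.
        self.snapshots[name] = self.current

    def write(self, index, value):
        new_version = list(self.current)
        new_version[index] = value          # the change goes to a new location
        self.current = tuple(new_version)   # the pointer flips only at the end

    def rollback(self, name):
        self.current = self.snapshots[name]


journaled = JournaledStore(["a", "b", "c"])
journaled.write(1, "B", crash_before_apply=True)
journaled.recover()
recovered = journaled.blocks
print(recovered)          # ['a', 'B', 'c']

cow = CowStore(["a", "b", "c"])
cow.snapshot("before")
cow.write(1, "B")
after_write = cow.current
cow.rollback("before")
after_rollback = cow.current
print(after_write)        # ('a', 'B', 'c')
print(after_rollback)     # ('a', 'b', 'c')
```

In the journaled version the old value is gone once the write lands, so the journal is the only safety net. In the copy-on-write version the old version is never touched, which is why a snapshot costs almost nothing to keep.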

File System Navigation

Before graphical file managers, interacting with files meant entering commands into a terminal window. Every action required a command, and if you forgot the syntax, good luck finding that one file buried in a subdirectory. Early computer users had to type commands just to navigate folders, open files, or move things around. It worked, but it wasn’t exactly intuitive.

Then came Xerox PARC in the 1970s, where researchers reimagined how people should interact with computers. Instead of forcing users to memorize commands, they introduced the desktop metaphor: files were represented as icons, and you could move them around just like physical documents. This innovation appeared in the Xerox Alto and later the Xerox Star, making them the first systems where you could visually browse files and folders instead of typing your way through directories.

Apple took this idea mainstream in 1984 with the Macintosh Finder. The Finder gave users a way to see their files in a structured, folder-based system that felt natural—click to open, drag to move, drop to delete. It was a revolution in usability, and that same basic structure still defines how we interact with files today, whether in Windows Explorer, macOS Finder, or even cloud storage interfaces. The file system may have been the foundation, but the file manager was the bridge that made computers accessible to everyone.

Today, when a computer user thinks of accessing a file, they imagine opening a file browser and double-clicking on a name and icon to launch the desired document, or launching the application first and looking through recent files (if you’re motion stingy like me). The idea of accessing files without a file manager, through typed commands or lines of code instead, is therefore something that new analysts may struggle with when they realize that they have to provide a path to a file in order to work with it in data analysis, e.g. when using packages like Pandas or even base Python or R. Although visual, GUI-based navigation is the norm, it is not necessary to use a visual file manager to work with files on your computer, and sometimes we can’t even do that. There are a lot of cool and important tasks that you can perform by interacting directly with files in your file system using a shell (also called Bash, Cmd, command line, terminal, etc.). In the driver lab for this chapter, you will be exploring some basic file operations using only the terminal.
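Here is roughly what ‘providing a path’ looks like in base Python. This is a minimal sketch: the folder name `projects`, the file name `sales.csv` and its contents are invented for the example, and everything happens inside a temporary folder so nothing on your machine is touched.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A path is just the chain of folder names that leads to the file.
    data_path = Path(tmp) / "projects" / "sales.csv"  # hypothetical names
    data_path.parent.mkdir(parents=True)
    data_path.write_text("region,units\nnorth,12\nsouth,7\n")

    # No file manager involved: the program is told where the file lives.
    with data_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

    # With Pandas installed, pandas.read_csv(data_path) takes the same path.

print(rows)
# [{'region': 'north', 'units': '12'}, {'region': 'south', 'units': '7'}]
```

The double-click has been replaced by a line of text that says exactly where to look.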

<aside>

Driver Lab: Create a New File Using Bash/Cmd

In this skills lab, you are going to perform some basic tasks on your file system using nothing but your keyboard and the computer terminal. If you’ve watched any detective movies where there is a geek providing hacking support as a sidekick to the actor who actually moves around the place, then you know that no serious geek ever uses a mouse. Everything they do has to be achieved by the clackity-clack of their mechanical keyboards and the triumphant press of the enter key. You are going to experience that in this lab.

What you need to start:

A computer (any operating system, any memory configuration, enough storage space to launch the computer successfully)
No internet connection is required for this lab

Steps:

1. Launch the command prompt. You may have to do a little digging to achieve this. Think about how you currently access programs on this computer. Where are the programs stored? A file system, right? Correct. But more specifically, where are the references to the currently installed programs kept? Your computer probably has some sort of programs menu. That’s where to start your digging. The program you want should be called something like ‘terminal’ or ‘cmd’ or ‘shell’.
2. Before you do anything, I want you to pause and take it in for a moment, then read every single word on the terminal screen you launched. A small history lesson here. Imagine you were an excited 14-year-old whose parents splurged on the latest technology in 1996 and purchased you a strange contraption called a computer, specifically a XXXXX. What you currently see on the screen is basically what you would have seen after powering on that computer and letting it launch. The world has come a long way from just a monochrome screen and text as the input into computers, as the world behind that terminal window clearly indicates, but it is worth pausing and acknowledging that history. Perhaps one day, our children will peer weirdly behind augmented heads-up displays at computer screens and finger-based input and think the exact same thoughts you are thinking now. What a thought, right? But I digress.
3. If you read every single word on the screen, you should see a variation of the following pieces of information: a computer name and a user name, the software version of your terminal application, and an indication of the directory or folder in your file system where you are currently ‘located’ (update for Windows/Ubuntu). Now, I want you to imagine yourself as a creeping reptile that is currently in one spot within the file system of your computer (technically you are in memory, but you are pointing at a single location on the disk: the directory you are located in). Note that the file system is represented as it currently stands, with all the documents and pictures and videos you have saved on it right now.
4. Next we are going to make some changes within your file system, first to existing documents and then by creating new documents from within the file system. The next step is to view all the files within the directory you are currently pointed at. The command to do this is ls (or dir if you are on Windows). This command, like many others, is inherited from the Multics operating system of the 1960s. At the current moment, I am not allowed to show my clairvoyant skills by the (insert name of futuristic spy agency for monitoring supertalented individuals), but I can guess that you got some output from running the ls command in your home directory, which is where the terminal would have pointed you naturally when launched through your programs menu. The first hurdle to leap over is to distinguish between what is a file in your output and what is a directory. For a first-time user of the terminal, your memory will serve you well. Do any of the listed names seem familiar? Like a Documents folder, or a Downloads folder? Decide on one of the names that is definitely a folder and type cd folder_name to point inside that directory. When you have done this, type in the previous command for listing files again to view what’s inside that directory.

In the very niche case that the folder you decided to point into was not a single-word folder name like Documents or Downloads, first of all I’d like to congratulate you for being an extremely curious-minded explorer. Then I’d like to say your reward for this curiosity is a quick explanation of the fact that spaces, while great for legibility and humans, are terrible for computer addressing systems. A space is not actually empty: it is a character in its own right, and the shell treats it as the separator between the parts of a command. So in order to change directories into a directory whose name has spaces, you have to either enclose the entire directory name in quotes, or (in Bash) escape the space characters with a backslash.

5. Now that we are pointing within a different folder from the one the terminal launched us into, and have viewed the files within that folder, I’m sure you are itching to know what comes next. I want you to pick an unfortunate file this time, and make a copy of it. To achieve this, I want you to put together the required command yourself. Think of the structure you have already used to switch directories. It had an instruction part, cd, and an argument part, the name of the location to move into. Now you are going to put together the instruction cp (or copy if you are on Windows) with the arguments ‘name of file to copy’ and ‘name of new file after the copy’. If you are feeling extra repetitive, you can type in those exact words, but unless you have an actual file called ‘name of file to copy’, nothing useful is going to happen when you run this command. You will have to substitute the description ‘name of file to copy’ with the actual filename, e.g. checkinticket.pdf, which is an actual file on my computer at the time of authoring this chapter.

The way that the computer can differentiate different types of files from each other (without peeking into the file) is by using the file extension. The letters after the dot in the filename I presented above (three of them, in this case) represent the type of file it is. The file extension differs by the type of file, e.g. Microsoft Word documents have the .doc or .docx extension, and images take the extension of the encoding standard used to save the color information, e.g. .jpeg, .png, .gif and so on. Note that when you use your GUI to look at files, the information that is conveyed by file extensions has been substituted with a file icon, a nice picture that lets you know what type of file you are looking at, and file extensions are usually hidden by default. There are corners of the internet where you may find people railing against this decision by major operating system providers to hide file extensions by default, but I get it. I mean, I don’t like it, but I get it. More on that in Chapter XXX.

6.

</aside>
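For readers who will later do this from code rather than the terminal, the lab’s moves (list, change directory, copy, read the extension) can be sketched in Python as well. This is an illustration under assumptions: the folder ‘Holiday Photos’ and the file ticket.pdf are made-up stand-ins created in a temporary folder, not files you are expected to have.

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)  # stand-in for your home directory
    (home / "Documents").mkdir()
    (home / "Holiday Photos").mkdir()  # a folder name containing a space
    (home / "Holiday Photos" / "ticket.pdf").write_bytes(b"%PDF")
    (home / "notes.txt").write_text("hi")

    # Like ls or dir, but labelling each entry as a folder or a file.
    listing = {entry.name: ("folder" if entry.is_dir() else "file")
               for entry in home.iterdir()}

    start = Path.cwd()
    try:
        # Like cd "Holiday Photos": the name is one value, so the space is safe.
        os.chdir(home / "Holiday Photos")
        # Like cp ticket.pdf ticket_copy.pdf
        shutil.copy("ticket.pdf", "ticket_copy.pdf")
        copies = sorted(p.name for p in Path.cwd().iterdir())
        # The extension is whatever follows the last dot in the name.
        extension = Path("ticket_copy.pdf").suffix
    finally:
        os.chdir(start)  # step back out so the temporary folder can be removed

print(listing)
print(copies)     # ['ticket.pdf', 'ticket_copy.pdf']
print(extension)  # .pdf
```

Each call mirrors one command you typed: an instruction, followed by the names it should act on.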