Alternative Title: Data Persistence - Designing for Scalability and Growth

<aside> 🤖

CARMA CHRONICLES

The harsh fluorescent lights of the mainframe room hummed a monotonous tune as Dennis walked in, the stale, slightly metallic scent of electronics a familiar greeting. It was 1965, and the cutting edge of computing looked less like sleek personal devices and more like an imposing wall of blinking lights and whirring tape drives. Dennis, a programmer for a large insurance company, settled into his chair in front of the terminal, its green text glowing faintly.

With a practiced motion, he pulled a worn spiral-bound notebook from his briefcase, its cover creased and dog-eared. He flipped it open to a page crammed with handwritten entries, codes, and cryptic notes. Today’s task, meticulously jotted down: PROCESS_Q4_PREMIUMS_BATCH_7_V3. He needed to run the weekly premium calculation batch job for a specific client group, which required accessing and updating a large data file.

He typed the command to list the files on the system, and the screen instantly filled with a dizzying cascade of filenames. There were hundreds, perhaps thousands, all in one long, undifferentiated list. No folders. No directories. Just a seemingly endless scroll of characters, a digital sprawl representing every program, every dataset, every user's active work, and every system utility. Dennis squinted, his finger running down his notebook page, locating his target.

He began pressing the 'down' key on the keyboard, the screen scrolling slowly, one line at a time. SYSTEM_LOG_ARCHIVE_FEB_65… UTILITY_SORT_MODULE_V2_FINAL… ACCOUNTS_PAYABLE_JAN_CLOSING_REPORT… And then, MARTHA_TEST_PROGRAM_DRAFT_V2. He paused, a flicker of curiosity crossing his face. Martha from accounting. What was she testing now? Her file was sitting right there, alongside critical system files and other users’ current projects. He resisted the urge to inspect it, knowing it would be a time sink. "I guess I'll check when I'm done," he murmured to himself, pressing 'down' again. He scrolled past ADMIN_BACKUP_SCRIPT, JOHN_REPORT_TEMPLATE_FINAL, until finally, his own entry appeared.

He spent the next few hours meticulously inputting commands, monitoring the batch process, and cross-referencing output with figures in his notebook. Each step was precise, unforgiving; a single typo could derail the entire operation or, worse, corrupt someone else’s data. The risk was always present, a silent partner in the room.

As the late afternoon sun cast long shadows across the room, Dennis completed his batch job. He reviewed the final status codes (all bb, indicating success) and logged off the system. He didn't bother to delete his working file; there was no concept of 'his' space anyway, no personal folder to clean. His file, PROCESS_Q4_PREMIUMS_BATCH_7_V3, now simply sat amongst the countless others, waiting for the next user, or perhaps for Martha to accidentally overwrite it during her next test. He gathered his notebook, already planning tomorrow's session, and stepped out into the evening, leaving his digital breadcrumbs scattered in the singular, flat universe of the mainframe.

</aside>

The Evolution of Data Organization: From Flat Files to Hierarchical Systems

You might wonder, if data is already stored on disk as files (as we learned in Chapter 2) in certain formats (as we learned in Chapter 3), why do we even need databases? The answer lies in the increasing complexity and demands placed on data.

In the earliest operating systems, all of a mainframe user's files lived in a single flat namespace; there were no folders at all. Just pause for a minute and think of this in terms of your current devices. Imagine all the files on your computer could only be stored in one place! You already find it difficult to remember all the files on your computer; now imagine that every time you wanted to find a file you couldn’t go to ‘Downloads’ or ‘Documents’ but had to scan one big sprawl of file names and pick yours out. This is why people carried notebooks in which they wrote down the names of the files they were working on: they couldn’t just create a notepad file and save it to their ‘Desktop’ with notes to themselves. Worse, because storage on these early systems was scarce, users routinely deleted files to free space, and a user deleting the wrong file (which they had full access to) meant lost data or non-functional programs for other users of these multi-user systems.

Fortunately, this horror scenario was transformed by the creation of the hierarchical file system. Special folders could now be set aside for system files, like the familiar ‘Program Files’ folder on Windows or the ‘Applications’ folder on a Mac, where new programs were installed and kept separate from the rest of a user’s files. This innovation simplified a lot about how computers could be shared by multiple people: ‘user folders’ could be linked to specific users, keeping one person’s files organized and separate from everyone else’s documents. What a difference a little hierarchy can make!

However, as it tends to go with humans, we are never satisfied for long. People started to wonder if not only files, but the data contained within files, could be stored using similar tree-like or hierarchical structures. Why stop at creating a single file for all the information that needs to be computed?

In 1961, the folks at IBM had the opportunity to explore this idea in more detail. When North American Aviation (later North American Rockwell) won the bid to build the spacecraft for the Apollo program to send a man to the moon and back, they needed a system to manage the large bills of materials (BOMs) associated with the construction of the spacecraft. If you’ve never seen a bill of materials, let me describe what it looks like. A bill of materials lists everything you need to construct a given quantity of a final product. At the very top is a description of the item being made, followed by a table of the component parts, their cost, the quantity required, and any other details needed to identify each part uniquely. For instance, the bill of materials for a chair with metal legs might include the type of metal, its thickness and length, and the number of pieces if the metal is sold in fixed lengths. It will also include the screws and bolts required for the chair, the material for the seat, any special paints needed to finish it, and so on.

IBM needed to build a system to capture this sort of hierarchical information for individual parts of the spacecraft. Some parts had to be built and then used as components of other, larger parts of the craft, so the system needed referential links between entries. This hierarchical, tree-like structure became the foundation of the design of the ‘Information Control System’ (ICS). Because the format of a bill of materials is standard, IBM also created a companion language to interact with the system, Data Language/Interface, written DL/I and read as ‘Data Language One’, which was used for adding new information to a specific bill of materials or reading back stored data.

The ICS system introduced the idea that application code should be separate from the data, and that a management layer should ‘watch over’ the data in an application, for instance by providing response codes that report on the status of data operations. During DL/I processing, the system reports events by issuing status codes, each with an associated programmer response to fix the issue. DL/I status codes are two-character alphanumeric codes, ranging from AA (destination of command wrongly specified) to Z0 (invalid data found in the input). Famously, the DL/I response when a transaction completed successfully was… nothing: two blanks (bb), which meant you could proceed to the next step. The seeds of the perennial programming idea that ‘Success is Silent’ and ‘no news is good news’ can be seen in this technology. IBM successfully delivered the system for NASA, playing a key role in landing a man on the moon. In 1968, shortly before Neil Armstrong walked on the moon in July 1969, IBM released ICS and DL/I under the more descriptive name IMS (Information Management System). As of 2005, IBM IMS (which retained the same core hierarchical structure) was being used by 95% of Fortune 1000 companies.
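The ‘success is silent’ convention is easy to sketch in modern code. The helper below is purely illustrative: the code table is a tiny invented subset based on the two codes mentioned above, not real DL/I output:

```python
# Sketch of the "success is silent" convention: a call hands back a
# two-character status code, and blanks mean success. The codes and
# messages here are illustrative stand-ins, not real DL/I responses.
PROGRAMMER_RESPONSES = {
    "AA": "destination of command wrongly specified",
    "Z0": "invalid data found in the input",
}

def check_status(status_code):
    """Proceed silently on two blanks; raise with guidance otherwise."""
    if status_code == "  ":   # 'bb' -- two blanks: success
        return                # no news is good news
    hint = PROGRAMMER_RESPONSES.get(status_code, "see status-code manual")
    raise RuntimeError(f"status {status_code!r}: {hint}")

check_status("  ")  # completes silently, so the program moves on
```

Calling `check_status("Z0")` would instead raise an error carrying the programmer response, mirroring how a DL/I programmer would look up the code and act on it.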

The Relational Model: Bringing Structure and Relationships to Data

While hierarchical databases like IMS were powerful for specific use cases, they often struggled with the flexibility needed for more complex, interconnected data. Imagine if a supplier updated the material used in a component—say, switching from aluminum to carbon fiber—and that component appeared in dozens or even hundreds of product assemblies. In a hierarchical system, you’d need to find every single entry across the system and manually update the material, much like having to open every individual file in a cabinet just to update a pharmacy’s new address. This kind of redundancy and manual labor was not only inefficient, it was error-prone and difficult to scale.

This growing frustration paved the way for a revolutionary idea: the relational model. Instead of embedding the same data in multiple places, relational databases allowed you to store each piece of information just once and link it to other data using relationships. Now, if a supplier's details changed, you only had to update it in one place—and every system that depended on that information would reflect the change automatically. This shift wasn't just about convenience; it fundamentally changed how we design, query, and maintain data, opening the door for the flexible, scalable applications we rely on today.

Conceived by Edgar F. Codd at IBM in 1970, the relational model introduced a remarkably elegant way to organize data. Rather than locking data into rigid, pre-defined hierarchies, this model used simple, two-dimensional tables—much like spreadsheets—to represent real-world entities such as Customers, Products, or Orders. The real magic lay in how these tables connected: through common fields—primary keys in one table and foreign keys in another. These links acted like clues in a treasure hunt, allowing you to follow relationships across different parts of your dataset. But this flexibility introduced complexity: how would users express these connections and extract insights without having to write custom code for every question? In other words, the move to relational databases in the 1970s wasn’t just a technical upgrade—it represented a fundamental shift in how people interacted with data. Keep this in mind, because today’s rise of generative AI is creating a similar transformation in how we engage with information.
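The primary-key/foreign-key linkage described above can be sketched with SQLite, which ships with Python's standard library. The table and column names here are invented for illustration:

```python
import sqlite3

# A minimal sketch of primary and foreign keys. Table names, columns,
# and data are illustrative, not from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: uniquely names a row
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        -- foreign key: each order points at exactly one customer row
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Martha')")
conn.execute("INSERT INTO orders VALUES (100, 1, 42.50)")

# The foreign key stops us from referencing a customer that doesn't exist.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.00)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The customer's name lives in exactly one row; every order merely points at it, so updating the name in `customers` updates it everywhere, which is precisely the de-duplication the relational model promised.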

But let’s go back to the 1970s. Everyone could intuitively grasp a hierarchical structure—like placing files in folders, or organizing lines of text into chapters and sections. That mental model came naturally. But relational databases introduced a new paradigm: storing interlinked relationships in neatly structured tables. This approach required a clear logic to avoid overwhelming complexity. Rather than finding all relevant information about a topic in a single hierarchy, users now had to work across multiple tables and reassemble the pieces using shared identifiers. Once all the data was centralized, a new challenge emerged: how could users efficiently join records across different tables?

Enter SQL, or Structured Query Language, a breakthrough invented by IBM researchers Donald D. Chamberlin and Raymond F. Boyce in the early 1970s. SQL was designed to be a declarative language—meaning that users could state what they wanted, rather than having to spell out every step to retrieve it. This was a big deal. Rather than navigating nested records or writing custom programs for each report, SQL let people describe sets of data and how they should relate.

At the heart of SQL is the concept of the join—a way to combine rows from two or more tables based on a related column. For example, if you had a table of customers and a table of orders, a simple JOIN operation would let you pull up every order alongside the name and contact details of the customer who placed it. No more duplication, no more manual tracking across systems—just clean, composable logic that mirrored how we think about relationships in the real world.
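The customers-and-orders join just described can be run end to end with SQLite from Python. The schema and rows are invented for illustration; the JOIN itself is standard SQL:

```python
import sqlite3

# Illustrative data: two customers, three orders pointing at them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            name TEXT, email TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id),
                         item TEXT);
    INSERT INTO customers VALUES (1, 'Dennis', 'dennis@example.com');
    INSERT INTO customers VALUES (2, 'Martha', 'martha@example.com');
    INSERT INTO orders VALUES (100, 1, 'Desk lamp');
    INSERT INTO orders VALUES (101, 2, 'Notebook');
    INSERT INTO orders VALUES (102, 1, 'Tape reel');
""")

# Declarative: we state WHAT we want -- every order alongside its
# customer's details -- not how to navigate records to find it.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.email, o.item
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)  # e.g. (100, 'Dennis', 'dennis@example.com', 'Desk lamp')
```

Dennis's name and email are stored once but appear with both of his orders in the result: the join reassembles the relationship at query time instead of duplicating the data at storage time.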