"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
- Sherlock Holmes, A Scandal in Bohemia
"In general, we look for a new law by the following process: First we guess it (theorize); then we compute the consequences of the guess to see what would be implied if this law that we guessed is right; then we compare the result of the computation to nature, with experiment or experience, compare it directly with observation, to see if it works. If it disagrees with experiment, it is wrong.”
<aside> 👥
As a professor of Information Systems, I feel a professional obligation to pause here. We often treat Information Systems (IS) as a purely technical discipline—a matter of wiring, coding, and storage. But at its core, IS is a discipline concerned with the deep implications of technology as an artifact and the reality it creates.
The tension between the two quotes above—Sherlock Holmes (who demands data before theory) and Richard Feynman (who demands theory before observation)—is not just a difference of opinion. It represents a fundamental philosophical divide in how we "know" things in the digital age.
Before you start creating new data, consider three striking notes on the nature of digital reality:
In the traditional scientific method (Feynman’s view), research is linear: you form a hypothesis (theory), and only then do you collect data to test it. To collect data without a theory is considered "fishing"—a statistical sin.
However, in the modern business environment, the sheer mass of Big Data breaks this model. We now hold far more data than we could ever theorize about a priori: we have petabytes of logs before we have a single hypothesis.
This creates a fundamental "bipolar" quality to being an IS academic. From a strict scientific perspective, mining this pre-existing data looks wildly opportunistic—we are shooting arrows and drawing bullseyes around where they land. Yet, from a professional/pragmatic perspective (Holmes’ view), ignoring this massive evidence because you "didn't guess it first" is negligent. The modern analyst must learn to live in this tension: respecting the rigor of science while exploiting the opportunity of the archive.
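To make the "arrows and bullseyes" risk concrete, here is a minimal simulation sketch. It correlates many columns of pure noise with an outcome that is also pure noise and counts how many clear a conventional significance cutoff. The function names, the sample sizes, and the cutoff of 0.361 (roughly the |r| needed for p < 0.05, two-sided, with 30 observations) are illustrative assumptions, not part of any particular toolkit.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(n_features=200, n_rows=30, threshold=0.361, seed=42):
    """Correlate many pure-noise features with a pure-noise outcome.

    Every "hit" is a false discovery by construction, because no feature
    has any real relationship with the outcome.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson_r(feature, outcome)) > threshold:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 200 noise features look 'significant'")
```

With a 5% cutoff, roughly one feature in twenty should pass by chance alone. An analyst who searches the archive first and writes the hypothesis afterward will find "patterns" even where none exist. One common safeguard is to test each mined pattern on data that was held back from the search.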
When we query a database, we assume the data is the reality. This leads to deep issues of Epistemology—the philosophical study of how we know what we know.
When I say "This individual is a Customer," I am making a massive philosophical leap. What I actually know is that "This string of text (a name) was entered into a digital ledger next to a floating-point number (a price)."
What if they used a coupon?
What if it was a free trial?
What if a spouse used their email address?
To the database, they are a "Customer." To reality, they might be a "Freeloader" or "The Wrong Person." Managers rarely concern themselves with these distinctions, assuming the data row is a perfect stand-in for the human being. It never is.
We presume that a digital action (a click) signifies a specific human intent (a thought). But just as people lie, data lies.
Consider the "Add to Cart" button. In a simple model, a click here signifies "Intent to Purchase." But what if the product page hides the price with a message saying "Add to Cart to view price"?
Suddenly, the meaning of the action shifts. The user is not saying "I want this"; they are saying "I am curious about the cost." If you automate your inventory ordering based on "Add to Cart" volume without interpreting the context of the speech act, you will overstock your warehouse. Data is not just a number; it is a language, and it requires a manager's interpretation to separate the signal from the noise.
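One way to keep that context attached to the data is sketched below. It assumes each click is logged with a hypothetical `price_hidden` flag recording whether the page required "Add to Cart" to reveal the price. The `CartEvent` class and `intent_signal` function are invented for illustration. Treating every hidden-price click as mere curiosity is a deliberately crude simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class CartEvent:
    product_id: str
    # Hypothetical flag: True if the page hid the price until "Add to Cart".
    price_hidden: bool


def intent_signal(events, product_id):
    """Split add-to-cart clicks into price curiosity vs. plausible purchase intent."""
    clicks = [e for e in events if e.product_id == product_id]
    curiosity = sum(1 for e in clicks if e.price_hidden)
    return {
        "raw_clicks": len(clicks),
        "price_curiosity": curiosity,
        "plausible_intent": len(clicks) - curiosity,
    }


events = [
    CartEvent("lamp", price_hidden=True),
    CartEvent("lamp", price_hidden=True),
    CartEvent("lamp", price_hidden=False),
    CartEvent("desk", price_hidden=False),
]
summary = intent_signal(events, "lamp")
print(summary)
# {'raw_clicks': 3, 'price_curiosity': 2, 'plausible_intent': 1}
```

An inventory rule driven by `raw_clicks` would order three lamps. One driven by `plausible_intent` would order one. The right number sits somewhere in between, and finding it is the manager's interpretive job rather than the database's.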
The "lying click" exemplifies a broader class of deceptive user interface tactics known as dark patterns—intentional design choices that manipulate users into actions they might not otherwise take, often by obscuring information, exploiting cognitive biases, or creating false urgency.
In the e-commerce context, forcing a user to "Add to Cart to view price" is a classic hidden information dark pattern: it tricks curiosity into commitment, inflating perceived purchase intent in your analytics while potentially frustrating customers and eroding trust.
Other common variants include pre-checked subscription boxes during checkout (sneaking in recurring fees), countdown timers that reset on refresh (fabricating scarcity), or roach motel designs where canceling a service is far harder than signing up.
As analytics managers, we must recognize that data born from such upstream manipulations isn't just noisy—it's ethically tainted, leading to misguided strategies (like overproducing based on inflated "intent" signals) and increasing regulatory scrutiny under laws prohibiting deceptive practices.
True insight requires clean, consensual signals; dark patterns may boost short-term metrics but poison long-term reality.
These cautions are not academic navel-gazing; they prevent costly mistakes when creating data for high-stakes decisions. Consider the emerging era of AI agents: shopping agents can act on a user's behalf (as bots do), often pending a final authorization or some other processing step. As they spread, the lying click will become the norm for many digital interactions, and the analytics manager will have to question the data even more carefully.
Now that this professional caution is out of the way, let us talk about what happens when the analyst does not readily find the data they need and has to go out and create new data.
</aside>
In Chapter 9, we played the role of the Harvester. We walked into the corporate storehouse, checked the shelves, and gathered the ingredients that were already there (Internal and External data).
But what happens when the cupboard is bare?
This is the Explorer's dilemma. You have a burning business question—"Why are customers leaving?" or "Will this new product price work?"—and you turn to your Data Warehouse only to find silence. The transactional systems record what happened, but they rarely record why it happened or what might happen if we changed something.
In this chapter, we shift from gathering to creation. We will explore the four fundamental modes of creation (Asking, Observing, Experimenting, and Inferring) and the cutting-edge frontier of Synthetic Data.
Before picking a tool, we must understand the texture of the data we are trying to create. A classic dilemma in analytics is choosing between Behavioral Data (what people do) and Attitudinal Data (what people say).
Consider the Churn Prediction problem. You want to know which customers are about to cancel.
Attitudinal (what people say): you send a survey asking, "How likely are you to cancel?"
Behavioral (what people do): you analyze the "Digital Exhaust." You look for customers who haven't logged in for 20 days.
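The behavioral check can be sketched in a few lines. This assumes a simple mapping from customer id to last login date. The function name, the sample ids, and the dates are invented for illustration, and the 20-day threshold is the one used in the example above.

```python
from datetime import date


def inactive_customers(last_login_by_customer, today, min_idle_days=20):
    """Return ids of customers who have not logged in for at least min_idle_days."""
    return sorted(
        customer_id
        for customer_id, last_login in last_login_by_customer.items()
        if (today - last_login).days >= min_idle_days
    )


# Hypothetical login records: customer id -> date of most recent login.
last_logins = {
    "c001": date(2024, 3, 1),
    "c002": date(2024, 3, 28),
    "c003": date(2024, 2, 10),
}
at_risk = inactive_customers(last_logins, today=date(2024, 3, 31))
print(at_risk)
# ['c001', 'c003']
```

The rule records only that a customer went quiet. It says nothing about why, which is the gap attitudinal data tries to fill.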