Most enterprise AI initiatives fail not because the technology doesn't work, but because nobody did the discovery work. Teams jump straight to model selection and prompt engineering. Six months later they have a demo that impressed the board and a production system that doesn't exist. The discovery phase is where you figure out whether AI will actually solve your problem, what it will cost to run, and what has to change in your organization to make it work.
I have watched this pattern play out at more than a dozen companies. A VP reads about GPT transforming their industry. An internal champion gets budget. An engineering team spins up a prototype in three weeks. The demo looks great. Then the prototype sits on a shelf because nobody answered the questions that matter: Does this solve a real business problem? Can we get the data into the right format? Who owns this system when it goes wrong? What does it cost at production volume?
These are not engineering questions. They are discovery questions. And they need to be answered before anyone writes a single line of AI code.
TL;DR
- 85% of enterprise AI projects never reach production, usually because the team skipped discovery and jumped straight to building
- Two weeks of structured discovery (workflow audit, data assessment, cost modeling) saves months of rework and prevents six-figure surprises
- Start with your most expensive, error-prone workflows, not with "where can we use AI?"
- Define success in business terms (e.g., "cut processing time from 5 days to 8 hours") before choosing any technology
- A focused discovery deliverable covering use cases, data readiness, architecture, costs, risks, governance, and a 90-day roadmap is the difference between a plan and a hope
Why Discovery Gets Skipped
There are three forces that kill the discovery phase before it starts.
The first is pressure to show results. The board approved an AI budget. The CEO mentioned it in the earnings call. The VP who championed it needs something to demo at the next quarterly review. In that environment, spending two weeks on research and planning feels like stalling. So teams skip straight to building, because a working prototype is a tangible artifact and a strategy document is not.
The second is vendor pressure. Every AI vendor wants to get you into a proof of concept as fast as possible. Not because POCs are the right starting point, but because POCs create switching costs. Once your team has spent eight weeks building on a platform, the sunk cost makes it hard to walk away. Vendors know this. "Just try it" is a sales strategy, not engineering advice.
The third is the "just try GPT" culture. Large language models are so accessible that anyone with an API key can build a demo in an afternoon. That accessibility creates the illusion that production AI is easy. The demo works on five test cases. It breaks on the five hundred edge cases you haven't tested yet. The gap between a demo and a production system is where most AI projects go to die.
The cost of skipping discovery is not a theoretical risk. It is a measurable expense. You either build the wrong thing (solving a problem nobody actually has) or you build the right thing in a way that cannot scale. Either way, you end up starting over, and restarting after six months of engineering work costs far more than two weeks of upfront research.
Audit Your Workflows First
The single most common mistake I see in enterprise AI planning is starting with the technology. "Where can we use AI?" is the wrong first question. The right first question is: What are our most expensive, most error-prone, or most time-consuming workflows?
Start with a workflow audit. Map every process that touches the problem space you are considering. Not at the abstract level. At the level of what a person actually does, step by step, when they complete this task today.
A claims processing team at an insurance company does not just "process claims." They receive a document, read it, cross-reference it against policy details, check for missing information, classify the claim type, route it to the right adjuster, draft an initial assessment, and flag anomalies. Each of those steps has a different profile for AI suitability.
Once you have the workflow mapped, classify each step into one of three categories:
- Mechanical steps that follow clear rules and rarely require judgment. Document classification, data extraction from structured forms, routing based on known criteria. AI can handle these reliably today.
- Assisted steps where judgment is required but AI can do the heavy lifting. Drafting responses that a human reviews before sending, summarizing complex documents, flagging anomalies for human investigation. AI handles the volume; humans handle the quality control.
- Human-only steps that require empathy, complex negotiation, novel problem solving, or regulatory sign-off. AI should not touch these. Trying to automate them creates more risk than value.
This classification is where I spend the most time with clients, because the boundaries are not always obvious. A step that looks mechanical might have edge cases that require deep domain expertise. A step that looks like it requires judgment might actually follow a decision tree that can be codified. The only way to know is to sit with the people who do the work.
The output of this audit is a prioritized list of workflow steps, ranked by a combination of volume (how often does this happen), cost (how much does it cost per occurrence), and error rate (how often does it go wrong today). The steps at the top of that list are your AI candidates. Everything else is a distraction.
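That ranking can be sketched in a few lines of code. This is an illustrative scoring scheme, not a standard formula; the step names, volumes, and weights below are hypothetical placeholders you would replace with figures from your own audit.

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    name: str
    category: str               # "mechanical" | "assisted" | "human_only"
    monthly_volume: int         # how often this step happens per month
    cost_per_occurrence: float  # fully loaded cost in dollars
    error_rate: float           # fraction of occurrences that go wrong today

def priority_score(step: WorkflowStep) -> float:
    # Monthly cost exposure, inflated by the error rate: frequent,
    # expensive, error-prone steps rise to the top of the list.
    return step.monthly_volume * step.cost_per_occurrence * (1 + step.error_rate)

def rank_candidates(steps: list[WorkflowStep]) -> list[WorkflowStep]:
    # Human-only steps are excluded from AI candidacy entirely.
    candidates = [s for s in steps if s.category != "human_only"]
    return sorted(candidates, key=priority_score, reverse=True)

# Hypothetical steps from a claims workflow:
steps = [
    WorkflowStep("classify claim type", "mechanical", 12_000, 1.50, 0.08),
    WorkflowStep("draft initial assessment", "assisted", 12_000, 9.00, 0.12),
    WorkflowStep("negotiate settlement", "human_only", 900, 80.00, 0.05),
]
ranked = rank_candidates(steps)
```

The point of the exercise is not the arithmetic. It is that the ranking forces you to put numbers on volume, cost, and error rate for every step, which is exactly the data the audit should produce.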
Map Your Data Landscape
Every AI system is a data system. The model is the least interesting part. The data that feeds it determines whether it works.
I run a data readiness assessment with every client, and the results are almost always sobering. Companies overestimate their data readiness by a wide margin. They say "we have ten years of customer data." What they actually have is ten years of data spread across four systems that don't talk to each other, with inconsistent field names, duplicate records, and three different date formats.
The data readiness assessment covers four areas:
Data inventory
What data do you actually have? Not what you think you have. What exists, where does it live, and who owns it? For each data source, document the format, the update frequency, the volume, and the access method. This sounds tedious. It is. It also prevents you from building an AI system that depends on data you cannot access.
Accessibility
Can the data be accessed programmatically? Data behind an API is gold. Data in a database with proper access controls is silver. Data in spreadsheets on someone's desktop is a problem. Data in email threads is a bigger problem. Data in someone's head is not data at all. If the answer is "someone exports a CSV every Tuesday," that is a pipeline dependency you need to plan for.
Quality assessment
Pull a representative sample from each data source. Check it for completeness, consistency, accuracy, and timeliness. I typically check 200 to 500 records per source. That is enough to identify systemic quality issues without turning the assessment into a data cleaning project.
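A minimal version of the completeness and consistency checks looks like this; the field names and the sample records are hypothetical, and a real assessment would add accuracy and timeliness checks on top.

```python
from datetime import datetime

def completeness(records: list[dict], required_fields: list[str]) -> dict[str, float]:
    # Fraction of sampled records where each required field is present and non-empty.
    total = len(records)
    return {f: sum(1 for r in records if r.get(f) not in (None, "")) / total
            for f in required_fields}

def format_consistency(records: list[dict], field: str, fmt: str = "%Y-%m-%d") -> float:
    # Fraction of non-empty values that parse under one expected date format;
    # anything below 1.0 means mixed formats your pipeline must normalize.
    values = [r[field] for r in records if r.get(field)]
    ok = 0
    for v in values:
        try:
            datetime.strptime(v, fmt)
            ok += 1
        except ValueError:
            pass
    return ok / len(values) if values else 0.0

# Hypothetical 3-record sample (a real check runs on 200 to 500):
sample = [
    {"claim_id": "C-1001", "filed": "2024-01-05"},
    {"claim_id": "C-1002", "filed": "05/01/2024"},   # mixed date format
    {"claim_id": "",       "filed": "2024-02-01"},   # missing identifier
]
```

Running these two functions over a few hundred records per source is usually enough to surface the "three different date formats" class of problem before it becomes a pipeline surprise.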
Gap analysis
Compare what you have to what the AI system will need. Some gaps can be filled by joining existing data sources. Some require new data collection. Some are dealbreakers. I worked with a financial services firm that wanted to build an AI system for personalized investment recommendations. They needed transaction history, risk profile data, and market sentiment signals. They had the first two. The third would have required a six-figure data licensing agreement they had not budgeted for. That is a gap you want to find in week two, not month six.
Define Success in Business Terms
"Deploy an AI model" is not a success metric. Neither is "implement machine learning" or "use LLMs in our workflow." These are activities, not outcomes. If you cannot define success without mentioning technology, you are not ready to build.
Good success metrics are operational and measurable:
- Reduce customer response time from 4 hours to 15 minutes
- Cut manual data entry by 80% while maintaining 99% accuracy
- Decrease claims processing time from 5 days to 8 hours
- Reduce compliance review backlog from 6 weeks to 1 week
- Increase first-call resolution rate from 45% to 75%
Each of these metrics has three properties that matter. They describe a business outcome, not a technical activity. They are measurable with data you already collect. And they have a clear "before" number, which means you can prove impact after deployment.
I draw a hard line between vanity metrics and operational metrics. Vanity metrics make the project look good in a slide deck. Operational metrics tell you whether the system is actually working.
| Vanity Metric | Operational Metric |
|---|---|
| "We processed 10,000 documents with AI" | "Error rate dropped from 12% to 2%" |
| "Our model has 95% accuracy on test data" | "Customer escalations decreased by 40%" |
| "We reduced response time by 80%" | "Customer satisfaction scores increased from 3.2 to 4.1" |
| "We handle 3x more volume" | "Cost per transaction decreased from $4.50 to $1.20" |
The distinction matters because vanity metrics can mask a failing system. You can process 10,000 documents with AI and still have a higher error rate than the manual process. Operational metrics force you to measure what actually matters.
Define your success metrics during discovery, not after deployment. Write them down. Get sign-off from the business stakeholders, not just the technical team. These metrics become the contract between the AI project and the organization.
Identify the Humans in the Loop
Every production AI system needs human oversight. That is not a philosophical statement about the limits of AI. It is a practical observation based on how production systems actually work.
The question is not whether you need humans in the loop. The question is where in the loop, how many, and what authority they have. Companies that skip this question during discovery end up in one of two failure modes.
The first failure mode is full autonomy. The AI system makes decisions with no human review. This works until it doesn't. A fully autonomous customer-facing system that starts generating incorrect information will do so at machine speed, creating hundreds of bad interactions before anyone notices.
The second failure mode is the opposite: the AI system is built, but nobody creates a process for using it. There is no clear workflow for when to check AI output or what to do when the AI gets something wrong. The system sits unused because the people who are supposed to use it don't know how it fits into their day.
During discovery, I map the human oversight structure for every proposed use case:
- Review points. Where in the workflow does a human check AI output before it reaches a customer or enters a system of record? What are they checking for?
- Escalation paths. When the AI produces output that the reviewer is unsure about, what happens next? Who makes the final call? What is the turnaround time?
- Error ownership. When the AI makes a mistake that reaches a customer, who is accountable? Not "the AI team" (too vague) but a specific person with a specific remediation process.
- Feedback loops. How do the humans in the loop report issues back to the team that maintains the AI system? Is there a structured process, or do problems get reported ad hoc over Slack?
- Volume capacity. If the AI system handles 1,000 interactions per day and 5% require human review, that is 50 reviews per day. Who handles those reviews? Do they have the capacity, or does this create a new bottleneck?
The oversight design has direct implications for the technical architecture. If every output needs human review, the system needs a review queue with an approval workflow. If only flagged items need review, the system needs confidence scoring. If escalations go to a specialist team, the system needs routing logic. These requirements need to be captured during discovery, because they affect the build timeline and cost.
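The confidence-based routing and the capacity check can both be sketched directly. The thresholds and the reviews-per-person figure below are placeholders to be calibrated against your own quality data, not recommended values.

```python
import math
from enum import Enum

class Route(Enum):
    AUTO_APPROVE = "auto_approve"    # ships without human review
    REVIEW_QUEUE = "review_queue"    # a generalist reviewer checks it
    SPECIALIST = "specialist"        # escalated to the expert team

def route_output(confidence: float, auto_threshold: float = 0.95,
                 review_threshold: float = 0.70) -> Route:
    # High-confidence output ships; mid-confidence goes to the review
    # queue; low-confidence escalates. Thresholds are placeholders.
    if confidence >= auto_threshold:
        return Route.AUTO_APPROVE
    if confidence >= review_threshold:
        return Route.REVIEW_QUEUE
    return Route.SPECIALIST

def daily_review_headcount(volume_per_day: int, review_fraction: float,
                           reviews_per_person_per_day: int) -> int:
    # The capacity question from the text: 1,000 interactions at a 5%
    # review rate is 50 reviews a day, and someone has to absorb them.
    reviews = volume_per_day * review_fraction
    return math.ceil(reviews / reviews_per_person_per_day)
```

Writing the routing down this explicitly during discovery also exposes the architectural requirements: a confidence score from the model, a queue for the reviewers, and routing logic for the escalations.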
Model Your Costs Before You Build
The vendor quote is just the starting point. The real cost picture includes infrastructure, monitoring, governance, and the team to run it. None of that is scary if you plan for it upfront. It only becomes a problem when it surprises you mid-project.
Here is what a realistic cost model includes:
Inference costs at scale
This is the cost most people think about, and even this one gets underestimated. In our model routing benchmarks, we found a 64x cost spread between the cheapest and most expensive models for the same task. At 1,000 users, the difference between naive model selection and optimized routing was $1.96 million per year. If you haven't benchmarked your specific workloads against multiple models, your cost estimate is a guess.
Inference costs also scale nonlinearly with input size. A system that processes 500-word documents costs dramatically less per call than one that processes 50,000-word documents. Know your token volumes before you build your cost model.
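To make that scaling concrete, here is a back-of-envelope cost function. The per-million-token prices are hypothetical placeholders, not any vendor's actual rates, and the tokens-per-word ratio is a rough rule of thumb for English text.

```python
def monthly_inference_cost(calls_per_day: int, avg_input_tokens: int,
                           avg_output_tokens: int, input_price_per_mtok: float,
                           output_price_per_mtok: float, days: int = 30) -> float:
    # Prices are dollars per million tokens. Rough rule of thumb:
    # ~1.3 tokens per English word, so a 500-word document is ~650 tokens.
    daily = (calls_per_day * avg_input_tokens / 1e6 * input_price_per_mtok
             + calls_per_day * avg_output_tokens / 1e6 * output_price_per_mtok)
    return daily * days

# Same 1,000 calls/day and the same placeholder prices ($3/M input,
# $15/M output); only the input document length changes.
short_docs = monthly_inference_cost(1000, 650, 300, 3.0, 15.0)     # ~500 words
long_docs = monthly_inference_cost(1000, 65_000, 300, 3.0, 15.0)   # ~50,000 words
```

With these placeholder numbers the long-document workload costs roughly 30x the short-document one at identical call volume, which is why knowing your token volumes matters more than knowing the per-token price.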
Hidden costs
These are the costs that don't appear on the vendor invoice but show up in your operating budget:
- Monitoring and observability. You need logging, alerting, and dashboards. Someone needs to watch them. Budget for tooling and at least 10 to 15 hours per week of engineering time for ongoing monitoring.
- Prompt engineering and maintenance. Prompts degrade as models update. Someone needs to test, tune, and version-control prompts on an ongoing basis.
- Data pipeline maintenance. The data feeding your AI system needs to keep flowing, stay clean, and adapt to schema changes upstream.
- Retraining and evaluation. If you fine-tune models, you need periodic retraining. Even with off-the-shelf models, you need periodic evaluation to confirm they still meet your quality bar.
- Governance and compliance. Audit trails, access controls, data retention policies, bias monitoring. The regulatory burden is increasing, and it requires dedicated effort.
Pilot costs vs. production costs
A pilot with 10 users and 100 transactions per day costs almost nothing in inference. The same system at 1,000 users and 70,000 transactions per day costs real money. More importantly, production requires infrastructure that pilots don't: load balancing, failover, rate limiting, caching, queue management, and SLA monitoring. I budget production infrastructure at 3 to 5x the pilot infrastructure cost, depending on the reliability requirements.
The cost model should be a living spreadsheet with three scenarios: optimistic, expected, and pessimistic. If the pessimistic scenario is still within budget, you have a viable project. If the expected scenario is already at the edge of budget, you have a risk.
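The living spreadsheet can start as something this simple. Every figure below is illustrative, not a benchmark; the break-even function reuses the $4.50-to-$1.20 per-transaction example from the metrics table.

```python
import math

def annual_cost(inference: float, infrastructure: float,
                monitoring_hours_per_week: float, eng_hourly_rate: float,
                governance: float) -> float:
    # Roll the hidden costs into one annual number.
    monitoring = monitoring_hours_per_week * 52 * eng_hourly_rate
    return inference + infrastructure + monitoring + governance

scenarios = {  # illustrative figures only
    "optimistic":  annual_cost(60_000, 40_000, 10, 120, 15_000),
    "expected":    annual_cost(120_000, 60_000, 12, 120, 25_000),
    "pessimistic": annual_cost(250_000, 100_000, 15, 120, 40_000),
}

def verdict(scenarios: dict[str, float], budget: float) -> str:
    # The viability test from the text: pessimistic within budget is a
    # viable project; expected already at the edge is a risk.
    if scenarios["pessimistic"] <= budget:
        return "viable"
    if scenarios["expected"] <= budget:
        return "at risk"
    return "over budget"

def breakeven_volume(annual_fixed: float, manual_cost_per_txn: float,
                     ai_cost_per_txn: float):
    # Annual transaction volume at which per-transaction savings cover
    # the fixed cost; None means the unit economics never pay it back.
    savings = manual_cost_per_txn - ai_cost_per_txn
    return math.ceil(annual_fixed / savings) if savings > 0 else None
```

The value of the three-scenario structure is that it turns "can we afford this?" into a single comparison against the pessimistic number, which is a decision leadership can actually make.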
Privacy and Compliance Constraints
Privacy and compliance constraints are not a section in the project plan. They are the frame that determines the shape of everything else. Get them wrong and you don't have a system that needs modifications. You have a system that needs to be rebuilt.
The fundamental question is: What data can leave your network? The answer to that question determines your entire architecture.
If the answer is "all of it" (rare), you can use cloud APIs from any provider, store data in vendor-hosted systems, and optimize purely for cost and performance.
If the answer is "some of it" (common), you need a classification system that separates data into tiers. Public data can go to cloud APIs. Internal data might require a private endpoint with a BAA or DPA. Regulated data (PII, PHI, financial records) might need to stay on-premise entirely. That classification drives your architecture: you may need different providers or deployment models for different data tiers.
If the answer is "none of it" (heavily regulated industries), you are looking at self-hosted models and on-premise infrastructure. Self-hosted models are viable. We have benchmarked them. But they come with tradeoffs in speed, quality, and operational complexity that need to be planned for.
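The tiered classification above can be sketched as a lookup from field sensitivity to deployment target. The field sets and target names are placeholders; a real classification comes from your data dictionary and your compliance team, not from code.

```python
from enum import Enum

class DataTier(Enum):
    PUBLIC = "public"          # may go to any cloud API
    INTERNAL = "internal"      # private endpoint with a BAA or DPA
    REGULATED = "regulated"    # stays on-premise

# Placeholder field sets; substitute your own data dictionary.
REGULATED_FIELDS = {"ssn", "diagnosis", "account_number"}
INTERNAL_FIELDS = {"employee_id", "internal_notes"}

DEPLOYMENT = {
    DataTier.PUBLIC: "cloud_api",
    DataTier.INTERNAL: "private_endpoint",
    DataTier.REGULATED: "on_premise",
}

def classify(fields: set[str]) -> DataTier:
    # The most sensitive field present determines the tier.
    if fields & REGULATED_FIELDS:
        return DataTier.REGULATED
    if fields & INTERNAL_FIELDS:
        return DataTier.INTERNAL
    return DataTier.PUBLIC

def deployment_for(fields: set[str]) -> str:
    return DEPLOYMENT[classify(fields)]
```

Even this toy version makes the architectural consequence visible: a single workload that touches one regulated field inherits the on-premise requirement for that data path.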
Beyond the data residency question, the compliance assessment during discovery should cover:
- Regulatory requirements. HIPAA, SOC 2, GDPR, CCPA, industry-specific regulations. Each one imposes specific requirements on data handling, logging, and audit trails.
- Logging and audit trails. What needs to be logged? For how long? Who can access the logs? Many enterprises discover during deployment that their AI system doesn't produce the audit trail their compliance team requires. Discover this during discovery, not during the compliance review.
- Model provenance. Some regulated industries require documentation of which model produced which output, when, and with what inputs. This requires versioned model tracking and input/output logging that most prototypes don't support.
- Bias and fairness requirements. If your AI system makes decisions that affect people (hiring, lending, insurance, healthcare), you may have legal obligations to demonstrate that the system does not discriminate. This requires testing infrastructure and ongoing monitoring that should be scoped during discovery.
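The model provenance requirement above can be met with a record as simple as this. It is a minimal sketch that hashes the raw text so the audit log does not become a second store of sensitive data; some regulatory regimes require the full inputs and outputs instead, so confirm the shape with your auditors first.

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(model_id: str, model_version: str,
                      prompt: str, output: str) -> dict:
    # One append-only audit entry: which model and version produced
    # which output, when, with hashes standing in for the raw text.
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "model_version": model_version,
        "input_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }

# Hypothetical usage:
rec = provenance_record("claims-summarizer", "2024-06-01",
                        "claim text ...", "summary ...")
```

Scoping this logging during discovery is cheap. Retrofitting it into a prototype that never captured model versions is not.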
Compliance surprises after months of engineering work are one of the most common and most expensive patterns in enterprise AI. The fix is simple: bring your compliance team into the discovery phase, not the deployment review.
The Discovery Deliverable
A discovery phase is only valuable if it produces an actionable deliverable. Not a 100-page report that nobody reads. A focused document that gives leadership enough information to make a go/no-go decision and gives engineering a clear starting point.
The discovery deliverable I produce for clients takes about two weeks to assemble. It covers seven sections:
1. Prioritized use cases
A ranked list of AI opportunities, scored by business impact, technical feasibility, and data readiness. The top two or three use cases get detailed treatment. The rest go on a backlog with clear criteria for when to revisit them.
2. Data readiness assessment
For each prioritized use case, a detailed assessment of data availability, quality, and gaps. Includes specific remediation steps for any data issues that need to be resolved before building, with estimated timelines and costs.
3. Architecture recommendation
The recommended technical architecture, including model selection (or model routing strategy), deployment model (cloud API vs. self-hosted vs. hybrid), data pipeline design, and integration points with existing systems. This is not a final architecture. It is a starting architecture with documented decision points that will need validation during the build phase.
4. Cost model
A three-scenario cost model (optimistic, expected, pessimistic) covering inference costs, infrastructure, monitoring, maintenance, and personnel. Includes break-even analysis: at what volume or what efficiency gain does the AI system pay for itself?
5. Risk assessment
Technical risks (model quality, data dependencies, integration complexity), organizational risks (change management, skills gaps, stakeholder alignment), and compliance risks (regulatory requirements, data privacy, audit obligations). Each risk gets a severity rating, a likelihood estimate, and a mitigation plan.
6. Governance framework outline
Who owns the AI system? Who monitors it? What are the escalation paths? How is model performance tracked? What triggers a review or a rollback? This does not need to be a complete governance policy during discovery. It needs to be a clear enough outline that the organization knows what governance decisions need to be made before production deployment.
7. 90-day implementation roadmap
A week-by-week plan for the first 90 days after the go decision. Covers team formation, data preparation, environment setup, initial model testing, integration development, human-in-the-loop process design, and the criteria for moving from pilot to production. Concrete enough to start executing on day one.
That document is typically 15 to 25 pages. It represents two weeks of focused work: stakeholder interviews, workflow mapping, data assessment, technical evaluation, and cost modeling. It is the difference between starting a build with a clear plan and starting with a vague hope.
The Bottom Line
The discovery phase is not a delay. It is the fastest path to a production system that actually works.
Every week spent on discovery saves a month of rework. That is not a platitude. It is arithmetic. A two-week discovery phase costs a small fraction of an AI initiative's budget, plus internal stakeholder hours for interviews and reviews. Rebuilding a failed AI system costs six to twelve months of engineering time, plus the opportunity cost of a system that isn't delivering value.
The companies that skip discovery end up with expensive demos. A working prototype that the board loved in Q2, a stalled deployment in Q3, a quiet write-off in Q4. The technology was never the problem. The preparation was.
The companies that do discovery end up with systems that compound in value. The first use case works and delivers measurable results. The second use case builds on the infrastructure and governance framework from the first. By the third, the cost per deployment drops and the time to production shrinks. That compounding effect is only possible when the foundation is solid.
The discovery phase requires someone who has seen enough AI deployments to know which questions matter and which are distractions. The frameworks in this article are what I walk through with every client before a single line of code gets written.