How to Evaluate Any AI Implementation Provider (A Framework You Can Use Today)

A five-question evaluation framework for AI implementation providers, separating real methodology from repackaged prompt libraries


    MIT’s Project NANDA analyzed 300 enterprise AI deployments and found that 95% produced zero measurable return. Not low return. Zero. RAND Corporation puts the broader failure rate above 80%. S&P Global reports that 42% of companies scrapped most of their AI initiatives in 2025, up from 17% the year before.

    Those numbers are bad enough on their own. But here is the part nobody talks about: those companies evaluated providers before they bought. They sat through demos. They checked references. They compared pricing. They still ended up with AI that produces the same generic output they could get for free.

    The evaluation itself failed. Not because they asked bad questions, but because they never asked the one question that determines everything.

    The Question Nobody Asks

    Most evaluation frameworks focus on the wrong layer. Team size. Client logos. Technology stack. Feature comparisons. Demo quality. None of those tell you what the AI will actually know when it produces output for your business.

    And that is the only thing that determines whether the output sounds like yours or sounds like everyone else’s.

    Every AI tool on the market uses the same underlying models. The models are commodities. GPT-4, Claude, Gemini. They all pull from the same ocean of internet data. The difference between generic output and output that reflects your business is not the model. It is what the model knows when it answers.

    If nobody built a real foundation underneath, the output will be the internet’s statistical average. And if the provider did build a foundation but filled it with every methodology they could find, proven and unproven, tested and theoretical, blended together with no curation, the output will sound confident and mean nothing. Both failures look fine in a demo. Both collapse in production.

    Here are five questions that will separate real AI implementation from a repackaged prompt library. Each one probes a different layer of the provider’s work. Each one has a version of “good” and a version of “run.” You can use them on every provider you evaluate. Including us.

    Question 1: “What Is the AI Actually Built On?”

    This is the question that matters more than every other question combined.

    A provider who built something real can describe, specifically, what knowledge goes into the foundation. Not “we customize it for your business.” The actual architecture. What sources. What methodology. What frameworks. What structured, calibrated knowledge the AI draws from when it produces output.

    Why it matters. AI trained on nothing specific produces nothing specific. That is not a flaw in the technology. It is exactly how the technology works. If the implementation does not put something better than the internet’s statistical average into the foundation, the output will sound like what your competitor gets using ChatGPT with no implementation at all.

    This is the root cause behind the statistic that 86% of marketers rewrite every piece of AI content before publishing. The AI was never given anything real to build on. So it produces what it has: the average of everything, distinctive to nothing.

    What a good answer sounds like. The provider names the specific frameworks, research, and operational logic built into the foundation. They can trace the output back to a specific methodology with a specific rationale. The knowledge is curated, not blended from every source available. It is auditable. And it is not something you could replicate by typing a better prompt.

    What the red flag sounds like. “We use proprietary prompts.” “Our prompt engineers have years of experience.” “We fine-tune the AI for your brand.” Any answer that describes how they ask the AI to perform rather than what the AI actually knows when it answers. If the answer is about the question, not the knowledge, you are looking at a prompt library in a nice package.

    Here is a simple test. Ask the provider: “If I used the same AI model with no implementation at all and just typed a really good prompt, what would yours know that mine would not?” If they cannot give you a specific, concrete answer, there is no foundation. You are buying a demo.
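
    If you want to see what that test is probing, here is a minimal sketch in code. The function names and the foundation fields are placeholders invented for illustration, not any provider's actual architecture. Notice that the model call is identical in both cases; the only thing that changes is what the model is given to work with.

    ```python
    # Illustrative sketch only. call_model() stands in for whichever chat API a
    # provider happens to use; the foundation fields below are invented placeholders.

    def call_model(prompt: str) -> str:
        """Placeholder for a call to a commodity language model."""
        return f"[model output for a {len(prompt)}-character prompt]"

    def generic_draft(topic: str) -> str:
        # No implementation: the model answers from the internet's statistical average.
        return call_model(f"Write a sales email about {topic}.")

    def grounded_draft(topic: str, foundation: dict) -> str:
        # An implementation: the same model, but the request is grounded in
        # curated knowledge the open internet does not have.
        context = "\n".join([
            f"Sales methodology: {foundation['methodology']}",
            f"Voice rules from your best-performing emails: {foundation['voice']}",
            f"Decision logic from your best salesperson: {foundation['decision_logic']}",
        ])
        return call_model(
            f"Using only the knowledge below, write a sales email about {topic}.\n\n{context}"
        )
    ```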

    Question 2: “How Do You Learn My Business?”

    Every business has methodology that lives in people, not documents. Your best salesperson’s decision logic. Your operations manager’s triage process. The way your best emails actually sound versus what your brand guide says they should sound like. The gap between the documented version of your business and the real version is where the value lives.

    If the provider’s process does not uncover that gap, the AI will be built on the surface version of your company. And the surface version is exactly what generic AI already has access to.

    Why it matters. The depth of the provider’s learning process determines the depth of the AI’s output. Every provider starts somewhere. A form. A call. A review of your materials. The question is not how it starts. The question is how deep it goes after that. Calibration requires understanding your business at a level that most providers never reach.

    What a good answer sounds like. The provider describes a process with layers. The initial intake is the beginning, not the end. After the surface information is collected, there is a defined process for going deeper: interviewing your best people, examining your strongest existing work (not the brand guide, but the actual communications that produce results), uncovering methodology that has never been documented, and mapping the gaps where instinct has never been structured into systems. The process produces specific artifacts the buyer can see and validate before the AI is built.

    What the red flag sounds like. The red flag is not the form. It is when the form is the finish line. If the entire learning process could be completed in a single call or a single submission, the provider is building on the surface version of your business. “We review your website and brand materials” is not discovery. “We do a kickoff call and then start building” is not discovery. Discovery means the provider’s understanding of your business goes deeper than what you could hand them in a document. If it does not, the AI will reflect exactly that depth. No more.

    Question 3: “What Does ‘Customization’ Actually Mean in Your Process?”

    Almost every AI implementation provider uses the word “customized.” The word has been stretched to the point of meaninglessness. This question forces the provider to define what they actually do.

    Why it matters. There is a spectrum. At one end: a prompt library with your company name dropped into the blanks. In the middle: adjusting tone, format, and channel settings. At the other end: integrating structured methodology, voice architecture, and operational logic into what the AI knows when it answers.

    Most providers operate at the first two levels and describe it using the language of the third.
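
    As a rough sketch of that spectrum, the example below shows what actually changes at each level. Every value in it is invented for a fictional company; the thing to notice is that the first two levels change the blanks and the settings, while only the third changes the knowledge itself.

    ```python
    # Level 1: a prompt library with your company name dropped into the blanks.
    TEMPLATE = "Write a LinkedIn post for {company} about {topic}."
    level_1 = TEMPLATE.format(company="Acme Logistics", topic="on-time delivery")

    # Level 2: surface settings layered on top of the same generic request.
    level_2_settings = {"tone": "confident", "format": "short paragraphs", "channel": "LinkedIn"}

    # Level 3: structured methodology and operational logic the model draws on
    # when it answers. All values here are invented example entries.
    level_3_foundation = {
        "positioning": "We sell guaranteed dock-to-dock transit times, not freight capacity.",
        "proof_points": ["on-time rate verified from your own operations data"],
        "voice": "Plain sentences. Numbers before adjectives. No logistics jargon.",
        "never_say": ["end-to-end solutions", "supply chain excellence"],
    }
    ```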

    The FTC has noticed. In 2024 and 2025, the agency filed twelve AI-washing cases, including “Operation AI Comply,” which targeted firms selling AI services powered by nothing more than template libraries. A 2019 study of 2,830 European AI startups found that 40% did not use real AI at all. They relied on simple automation and classic software tools dressed up in AI language.

    What a good answer sounds like. The provider can show you the difference between output before and after their work, and the difference is not cosmetic. The knowledge the AI draws from is different. The decision logic is different. The output reflects things the AI could not know from the internet alone. They can point to structural changes, not surface adjustments.

    What the red flag sounds like. “We customize the tone and voice.” “We configure the AI to match your style.” “We train it on your content.” Tone is a setting. Style is a setting. Content is raw material. None of those are methodology. If customization means adjusting surface-level controls, you are paying implementation prices for work you could do yourself in an afternoon.

    Question 4: “What Happens After You Deploy It?”

    This question separates providers who sell projects from providers who build systems. A project has an end date. A system gets better over time.

    Why it matters. AI implementations that are static on day one are static on day 300. Your business evolves. Your market evolves. Customer language evolves. Your competitors adapt. If the implementation has no mechanism for feeding performance data back into the foundation, the output quality plateaus the moment the provider finishes the project.

    Production AI models degrade over time as data, users, and environments change. The technical term is model drift. But the concept is simpler than the term: the world moves and the AI does not. Without a defined process for continuous improvement, you are paying for a snapshot of your business, not a system that grows with it.
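
    One concrete, simplified shape that loop can take: track a quality signal you already collect, such as the share of drafts published without rewrites, and compare a recent window against the baseline measured at launch. The metric, numbers, and threshold below are invented placeholders; the point is that without some check like this, nobody even notices when the output goes stale.

    ```python
    from statistics import mean

    def drift_check(baseline_scores, recent_scores, tolerance=0.10):
        """Flag drift when a recent window of a quality metric falls more than
        `tolerance` (e.g. 10%) below the baseline measured at deployment.

        The metric is whatever you already track: approval rate, reply rate,
        the share of drafts published without rewrites, and so on.
        """
        baseline = mean(baseline_scores)
        recent = mean(recent_scores)
        drifted = recent < baseline * (1 - tolerance)
        return {"baseline": baseline, "recent": recent, "drifted": drifted}

    # Invented example: share of drafts approved without rewrites, by week.
    launch_window = [0.72, 0.75, 0.71, 0.74]   # measured right after deployment
    current_window = [0.61, 0.58, 0.63, 0.60]  # six months later

    print(drift_check(launch_window, current_window))
    # Prints both averages and drifted=True for these numbers.
    ```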

    What a good answer sounds like. The provider describes a specific mechanism for ongoing optimization. They can explain how performance data gets fed back into the AI’s foundation. They have a defined process with defined intervals, not a vague promise. And their pricing model reflects it. If the provider has no financial incentive to improve the implementation after deployment, they will not improve it.

    What the red flag sounds like. “We provide 30 days of support after launch.” “We offer a maintenance retainer.” “You can contact us if you need changes.” Support is not optimization. Maintenance is not improvement. If the post-deployment plan sounds like a help desk, the implementation is designed to be static. Ask yourself: six months from now, will this AI be better than it is today? If the provider cannot explain specifically how, the answer is no.

    Question 5: “How Is Your Compensation Tied to My Results?”

    This question tests alignment. It is the simplest question on the list and the most revealing.

    Why it matters. If the provider gets paid the same amount whether the implementation produces results or not, the incentive is to deliver and move on. If some portion of the provider’s compensation is tied to measurable business outcomes, the provider has a structural reason to keep improving the implementation. Not a goodwill reason. A financial one.

    73% of consulting clients now prefer pricing models tied to measurable outcomes rather than time spent. The market is moving in this direction because accountability produces better results. When the provider’s revenue depends on the buyer’s performance, the provider invests in foundations, not shortcuts.

    What a good answer sounds like. The pricing structure combines upfront investment with ongoing performance alignment. The upfront fee is real because the work is real and costs real money to do well. The performance component matters because it means the provider wins when the buyer wins. Not all of the compensation needs to be at risk. But if none of it is, the provider has no structural incentive to care whether the AI performs after handoff.

    What the red flag sounds like. “We charge a flat project fee.” “We bill hourly.” “Our retainer covers ongoing support.” None of these are inherently wrong for certain kinds of work. But if there is no performance-tied component at all, ask yourself: if the AI produced mediocre output for six months, would the provider feel it? If the answer is no, the provider is selling a project. Not a partnership.

    What the Pattern Tells You

    The power of these five questions is the pattern across all five. Not any single answer.

    A provider who has strong answers to all five is operating at a different level than a provider who stumbles on two or three. And you will know within the first 15 minutes of a conversation which category you are dealing with.

    Most providers will answer Question 1 with confidence and vagueness. They will describe their learning process in generalities. They will use the word “customized” without defining it. They will mention support without describing optimization. And they will not volunteer how their compensation connects to your results.

    That pattern is the signal. Not because those providers are dishonest. But because their model does not require them to have better answers. Their process is not built around methodology depth, deep discovery, structural calibration, ongoing improvement, or outcome alignment. It is built around delivery. And delivery without depth is exactly how 80% to 95% of AI implementations produce zero measurable return.

    Use This Framework on Us

    These five questions are not a sales pitch. They are a filter.

    Use them on every provider you evaluate. Use them on us. If any provider, Built on Proof included, cannot give you specific, concrete, verifiable answers to all five, keep looking.

    The companies in the 5% to 20% that succeed with AI are not luckier. They are more rigorous about what they buy. This framework is the starting point for that rigor.

    If you want to see what the difference looks like with your own content, take the Proof Test. Your content. Two outputs. You decide.
