Do AI coding tools train on your code? What to verify before you roll out

A clear answer to whether AI coding tools train on your code, and the contract terms, settings, and data-handling questions engineering leaders should verify before approving one.

It is the first question security asks about any AI coding tool, and the one engineers most want a straight answer to: does this thing train on our code?

The short version: it depends on the product tier, the account settings, and the contract you signed, not on the vendor's reputation. Most major business and enterprise plans now state they do not use your code or prompts to train their models, and many offer zero-retention options. But the consumer free tier of the same product often says the opposite, and defaults are not always the safe choice. The difference between safe and exposed is usually a plan and a setting, not a brand.

We help engineering teams answer this properly: not with a vibe, but with the specific things you can verify before a tool touches your repositories. Here is what to check.

What "training on your code" actually means

The phrase bundles three distinct concerns, and they need separating because a tool can be fine on one and not on another.

Concern | The real question | Why it matters
Model training | Is your code used to train or fine-tune the provider's models? | Your logic could influence outputs shown to other customers
Retention | How long are prompts and completions stored, and where? | Stored data is data that can be breached, subpoenaed, or mishandled
Human review | Can provider staff read your prompts, e.g. for abuse monitoring? | A person seeing proprietary code is a different risk than a model

A vendor saying "we don't train on your data" answers only the first. A complete answer covers all three, and you should ask for all three in writing.
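
One way to keep the three concerns from collapsing into one is to record the vendor's written answers as three separate fields. The sketch below is illustrative only: the class name, field names, and the `unanswered_concerns` helper are assumptions made for this example, not part of any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VendorAnswers:
    """Written answers from a vendor. None means not yet answered in writing."""
    trains_on_customer_code: Optional[bool] = None
    retention_days: Optional[int] = None          # 0 would mean zero retention
    human_review_possible: Optional[bool] = None


def unanswered_concerns(answers: VendorAnswers) -> list[str]:
    """Return the concerns the vendor has not yet addressed in writing."""
    missing = []
    if answers.trains_on_customer_code is None:
        missing.append("model training")
    if answers.retention_days is None:
        missing.append("retention")
    if answers.human_review_possible is None:
        missing.append("human review")
    return missing


# A vendor that only says "we don't train on your data":
partial = VendorAnswers(trains_on_customer_code=False)
print(unanswered_concerns(partial))  # ['retention', 'human review']
```

An empty list is the point at which the review can move on to the contract. Anything else is a question still owed an answer.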

The plan tier usually decides the default

The single biggest factor is which tier you are on. The pattern across the market is consistent enough to plan around, even though specifics vary by vendor and change over time.

  • Consumer / free tiers frequently use your inputs to improve the product by default, sometimes with an opt-out buried in settings. These are the plans individual engineers sign up for on their own, which is exactly how shadow AI usage creates exposure nobody approved.
  • Business / team tiers typically state they do not train on your content, and often retain data only briefly for abuse monitoring.
  • Enterprise tiers usually add contractual data-processing terms, zero or short retention options, and sometimes the ability to run in your own cloud tenancy.

The implication is blunt: an engineer using the free tier of a tool and an organisation on the enterprise tier of the same tool can be in completely different risk positions. The brand on the logo tells you almost nothing.

Verify it, do not assume it

Reputation is not a control. Before a tool is approved, get specific answers and keep them on file.

  • Read the data-handling terms for your exact plan, not the marketing page. The relevant language is in the terms of service, DPA, and trust/security documentation.
  • Ask the vendor in writing: Do you train on our prompts or code? What is the retention period? Is there human review, and can it be disabled? Is a zero-retention mode available?
  • Check the actual settings. Many tools have an organisation-level toggle for training and telemetry. Confirm the safe setting is enabled and that engineers cannot silently override it.
  • Confirm sub-processors and region. Find out who else processes the data and where it is stored, which matters for GDPR obligations if any personal data reaches a prompt.
  • Get it in the contract. A toggle can change with a product update. A contractual commitment is what you rely on when it matters.

This is the core of a vendor security review for AI tools: the answers exist, but only if you ask for them as part of approval rather than after an incident.
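
The "check the actual settings" step can be made repeatable rather than a one-off look at an admin page. The sketch below assumes you can export your organisation's settings as a dictionary. The setting names are hypothetical, because every tool names and exposes these differently.

```python
# Hypothetical setting names; real tools name and expose these differently.
REQUIRED_SETTINGS = {
    "training_on_customer_data": False,
    "telemetry_enabled": False,
    "member_override_allowed": False,
}


def audit_settings(actual: dict) -> list[str]:
    """Return findings for settings that are missing or differ from the safe value."""
    findings = []
    for name, required in REQUIRED_SETTINGS.items():
        if name not in actual:
            findings.append(f"{name}: not present in export, verify manually")
        elif actual[name] != required:
            findings.append(f"{name}: expected {required}, found {actual[name]}")
    return findings


# Example export where telemetry was left on and the override setting is absent.
exported = {"training_on_customer_data": False, "telemetry_enabled": True}
for finding in audit_settings(exported):
    print(finding)
```

Run on a schedule, a check like this can catch a toggle that changed with a product update. It complements the contractual commitment; it does not replace it.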

Reduce what you send in the first place

Even with the best terms, the lowest-risk data is the data that never leaves. Two habits cut exposure regardless of the vendor's policy.

  • Keep secrets and personal data out of prompts. Tokens, credentials, customer records, and production data should not be pasted into a tool, whatever its retention policy. Enforce this with pre-commit and prompt-scrubbing tooling, not willpower.
  • Scope what the tool can see. Where a tool indexes your repository, exclude the directories it does not need (secrets, infrastructure, sensitive business logic) using the tool's ignore configuration.

These are the same instincts that govern secure AI-assisted coding generally: minimise what is exposed, so the consequences of a bad policy or a breach are smaller.
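
Both habits can be sketched in a few lines. This is a minimal illustration, not a production scanner: the two regular expressions cover only an AWS-style access key ID and simple `name = value` assignments, and the exclusion globs are hypothetical stand-ins for the directories mentioned above. Real ignore files have their own syntax, so treat `is_excluded` as a way to reason about scope rather than a drop-in for any tool's configuration.

```python
import re
from fnmatch import fnmatch

# Illustrative patterns only; a maintained secret scanner covers far more cases.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"(?i)\b(?:password|secret|token|api_key)\s*[=:]\s*\S+"),
]

# Hypothetical exclusions mirroring the directories named above.
EXCLUDED_GLOBS = ["secrets/*", "infra/*", "*.pem", ".env"]


def scrub_prompt(text: str) -> str:
    """Replace anything matching a known secret pattern before the prompt is sent."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def is_excluded(path: str) -> bool:
    """Return True if the path should be kept out of the tool's index."""
    return any(fnmatch(path, glob) for glob in EXCLUDED_GLOBS)


print(scrub_prompt("key AKIAIOSFODNN7EXAMPLE here"))  # key [REDACTED] here
print(is_excluded("infra/terraform/main.tf"))         # True
print(is_excluded("src/app.py"))                      # False
```

Pattern matching only catches what it has a pattern for, which is why the rule about keeping sensitive data out of prompts still has to exist in policy.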

Put the answer in your policy

Once you have verified the facts, write them down so every engineer inherits the decision instead of re-litigating it.

  • The approved tools and plan tiers, named explicitly: the enterprise tier is approved, the free tier is not.
  • The required settings, so training and telemetry are off where they should be.
  • The data rules: what must never go into a prompt, on any tool.

That is exactly the job of an AI usage policy: turn a verified answer into a default, so the safe path is the easy one and nobody has to guess.
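
If it helps to make the policy checkable, the three parts can be written down as data. The sketch below is one possible shape, with a made-up tool name and setting names. It is not a standard format.

```python
from dataclasses import dataclass


@dataclass
class AIUsagePolicy:
    approved: set              # (tool, tier) pairs that are explicitly approved
    required_settings: dict    # setting name -> required value
    never_in_prompts: tuple    # data categories banned from prompts on any tool


# "example-assistant" and the setting names are placeholders for illustration.
POLICY = AIUsagePolicy(
    approved={("example-assistant", "enterprise")},
    required_settings={"training": "off", "telemetry": "off"},
    never_in_prompts=("tokens", "credentials", "customer records", "production data"),
)


def is_approved(policy: AIUsagePolicy, tool: str, tier: str) -> bool:
    """A tool is approved only for the exact tier named in the policy."""
    return (tool, tier) in policy.approved


print(is_approved(POLICY, "example-assistant", "enterprise"))  # True
print(is_approved(POLICY, "example-assistant", "free"))        # False
```

Keying approval on the tool and tier together encodes the main point of this article: the same product can be approved on one plan and not on another.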

Our view

"Does it train on my code" is the right question, but the answer is never the brand. It is the plan, the settings, and the contract. The market has matured: business and enterprise tiers of major tools generally do not train on your content and offer real retention controls. The exposure that remains comes from consumer tiers used unofficially, defaults left unchecked, and sensitive data that never needed to be sent.

Verify the terms for your exact plan in writing, set the controls, minimise what you send, and write the result into policy. Do that and the question stops being a worry and becomes a settled fact, which is where a security question should end up.

Frequently asked questions

Do AI coding tools train on our code?

It depends on the plan tier, the account settings, and the contract, not the vendor's brand. Most major business and enterprise plans explicitly state they do not use your code or prompts to train models, and many offer zero-retention options. The consumer or free tier of the same product often says the opposite, and defaults are not always the safe choice.

What is the difference between model training, data retention, and human review in AI coding tools?

These are three distinct concerns. Model training asks whether your code influences the provider's model outputs shown to other customers. Retention asks how long prompts and completions are stored and where. Human review asks whether provider staff can read your prompts, for example for abuse monitoring, which is a meaningfully different risk from a model processing your code. A vendor saying "we don't train on your data" answers only the first; you need all three answered in writing.

How do we verify that an AI coding tool won't use our source code for training?

Read the data-handling terms for your exact plan, not the marketing page: the terms of service, DPA, and trust documentation. Ask the vendor in writing about training use, retention periods, human review, and zero-retention availability. Then confirm the organisation-level training and telemetry toggle is set correctly and that engineers cannot silently override it. Get the commitment in the contract, not just a setting, because a toggle can change with a product update.

What data should we never send to an AI coding tool regardless of the vendor's policy?

Tokens, credentials, customer records, and production data should never be pasted into any AI coding tool, whatever the retention policy states. Enforce this with pre-commit and prompt-scrubbing tooling rather than relying on willpower. Where a tool indexes your repository, use its ignore configuration to exclude secrets, infrastructure, and sensitive business logic directories.
