6 LLM Prompting Techniques for Data Scientists and Engineers in 2026
Six techniques matched to six failure modes, including inconsistent output formats, shallow reasoning, instruction drift, and more.
I’ve been leaning heavily into agentic coding this year using Claude Code, Cursor, and similar tools to the point where I’m writing far more natural language than actual code.
This shift forced me to take prompting seriously in a way I hadn’t before.
I went back through recent arXiv papers and engineering blogs from Anthropic, OpenAI, and Perplexity to fill my knowledge gap, and came out with six techniques I now use regularly.
Each one fixes a specific failure mode. The right technique depends on which failure you’re actually hitting.
1. Chain-of-Thought (on the Right Model)
This is the one that changed most in the last year.
Chain-of-Thought (CoT) asks the model to reason step by step before giving a final answer. On standard chat models (GPT-5.5, Claude Sonnet 4.6), adding “walk through your reasoning before answering” surfaces intermediate steps you can verify before the model commits to a conclusion.
Reasoning models work differently. o3, o4-mini, and Opus 4.7 run an extended thinking chain before writing a single word of their response. They explore multiple approaches, backtrack, and self-correct internally. Telling them to “think step by step” doesn’t help, but rather constrains the model to the reasoning path you’ve described instead of a better one it would find on its own.
For reasoning models, OpenAI recommends leaving reasoning to the model and instead structuring prompts around three things:
Goal
Constraints
Evidence
Here’s an example:
This function is supposed to return only rows where score > 0.5, but it keeps returning an empty DataFrame. I’ve verified the input DataFrame has rows with score above 0.5. Here’s the function. Find the bug.
Notice that this prompt includes no instructions on how to reason but instead contains the following:
Goal: “filter by score”
Observed failure + verified context: empty output, confirmed it’s not a data issue
Evidence: the code
2. Structured Outputs (for Consistent, Parseable Responses)
This is the most important one if you’re building anything with LLM API calls.
When I was building an AI language learning app, early responses came back in the right format. After a few turns, the model started mixing explanations in different languages and dropping fields, which broke the JSON parser in my frontend.
Defining the expected structure in your prompt and telling the model to return only that gets you ~90% compliance on capable models. When that’s not good enough for production, use native structured outputs. For example, OpenAI supports structured outputs with response_format: {type: "json_schema", ...} for certain models.
Both give guaranteed compliance and aren’t probabilistic.
3. Role Prompting (for Domain-appropriate Responses)
The same question gets genuinely different answers depending on the persona you assign.
When I ask an LLM to explain why a model’s precision dropped, I get standard debugging steps, i.e. check data distribution, inspect feature importances etc.
When I tell it to respond as a senior ML engineer presenting the same issue to a product team, it leads with business impact, drops the jargon, and ends with a recommendation on what to do next.
It’s the same model and same problem. The persona shapes who the model thinks it’s talking to, and consequently how it responds.
4. Negative Prompting (for Removing Default Behaviors)
LLM outputs frequently include:
Hedged findings
Bullet points
Caveats at the end of everything
Phrases like “it’s worth noting that...”
You can be explicit in your prompt about what not to include:
”Do not hedge findings with ‘may’ or ‘could.’ Do not use bullet points. State results as observed.”
These run as hard constraints during generation. The model can’t softly ignore them the way it interprets style suggestions like “be concise” loosely.
I use this most often when writing summaries, cleaning up API responses, or generating SQL.
5. Attentive Reasoning Queries (for Multi-turn Instruction Adherence)
Rules you set in a system prompt lose influence as the conversation grows.
By turn fifteen, your original constraints are thousands of tokens back. The model doesn’t forget them, but they’re just far enough back that they carry less weight. This is instruction drift, and it’s a structural property of how LLMs process context.
Attentive Reasoning Queries (ARQ) (arXiv:2503.03669) address this by embedding a structured checklist directly into the prompt that wraps each user message with a short list of questions the model must answer before generating its response. Because the checklist is injected at every turn (not just once at the top), it stays in the model’s immediate context where it actually influences output.
For a customer-facing support agent, the checklist might look like this:
Before responding:
1. Is this a billing or payment question? → If yes, don’t answer it. Direct the user to billing@company.com.
2. Am I about to make a promise on behalf of the company (e.g., “we will fix this by...”)? → If yes, remove it.
3. Is my response in English? → If not, translate it before sending.
Then respond.You include this block in the system prompt or inject it into the prompt template that wraps every user turn.
Across 87 multi-turn agent scenarios, ARQ reached 90.2% instruction adherence, compared to 86.1% for CoT and 81.5% for direct prompting (arXiv:2503.03669).
6. Verbalized Sampling (for Output Diversity)
Most modern LLMs are fine-tuned using human feedback where reviewers score different responses, and the model learns to produce what reviewers preferred. This is why models converge toward the most “typical” output. Ask for five ideas and you often get the same idea five times with slightly different wording.
Verbalized Sampling (arXiv:2510.01171) works around this by changing how you phrase the request. Instead of asking for a response, you ask for multiple responses with their probabilities.
“Generate 5 approaches to this feature engineering problem, with their corresponding probabilities.”
Asking for probabilities forces the model to represent its full range of options rather than picking the single safest one. You get a more diverse selection, including a high-confidence default, a few credible alternatives, and sometimes an unexpected angle you wouldn’t have thought to ask for separately.
In creative writing experiments from the paper, Verbalized Sampling boosted diversity by 1.6–2.1× over direct prompting and improved human evaluation scores by 25.7% (arXiv:2510.01171).
How to Use These
Start by identifying which failure mode you’re actually hitting. Here’s a quick recap:
Model skips reasoning steps → CoT (check your model type first)
Output format keeps changing → structured outputs
Generic answer for the wrong audience → role prompting
Unwanted hedges or formatting → negative prompting
Rules drift in long sessions → Attentive Reasoning Queries (ARQ)
Same answer every time → verbalized sampling
Good prompting is also what makes agentic coding work. Agents run these same techniques in a loop, at scale, with less room to catch a bad one before it compounds.
I’ve got two posts coming this month on agentic coding specifically, and most of what they cover builds on this foundation.
Paid subscribers get implementation guides for the agentic coding posts coming this month: exact prompts, model configurations, and the mistakes I already made so you don't have to.
Which of these prompting techniques are you already using? Which ones are new to you?


