As the use of large language models (LLMs) expands from individuals to organizations, the need to manage prompts with the same rigor as traditional source code becomes increasingly important.
This article discusses three organizational practices—prompt codification, quantitative evaluation using a Golden Dataset, and establishing security governance frameworks—as well as the evolving role of engineers.
It organizes thoughts on transforming AI from a "personal convenience tool" to a "reliable organizational asset."
Prompt as Code: The Three Pillars of Organizational Practices
1. Codifying and Centrally Managing Prompts
Prompts are "instruction sets for AI" and represent critical assets that determine system behavior. These should be managed in Git, just like source code, rather than in Excel or document tools.
Implementation Pattern:
Defining prompts in formats like YAML or JSON and enabling external variable injection provides the following benefits:
- Reusability: Use the same prompt in multiple places
- Testability: Conduct automated tests by changing variables
- Change History: Track changes via Git
- Review: Ensure quality through pull requests
2. Quantitative Evaluation Using a Golden Dataset
A system is needed to determine whether "AI output has improved or degraded" based on data rather than subjective judgment.
Definition of a Golden Dataset: A test dataset that includes real-world inquiries from production, typical use cases, and edge cases (boundary conditions).
| Input Example | Expected Output | Evaluation Criteria |
|---|---|---|
| "I want to return this." | Explanation of return policy + steps | Accuracy, tone |
| "When will my delivery arrive?" | Instructions to check delivery status | Conciseness |
| "This is a complaint!" | Apology + escalation | Appropriate response |
Whenever prompts are updated, tests must be run against the Golden Dataset to ensure no performance regressions occur. Integrating this into CI/CD (Continuous Integration/Continuous Delivery) pipelines prevents quality degradation.
LLM-as-a-Judge (AI-based Evaluation): Manually checking hundreds of outputs is impractical. Therefore, using a more advanced LLM as an evaluator has become a mainstream approach.
Example evaluation prompt:
3. Establishing Security and Governance Frameworks
Organizational frameworks are required to address the unique security risks of AI systems.
Key Risks:
| Risk | Description | Countermeasures |
|---|---|---|
| Prompt Injection | Overwriting AI instructions with malicious input | Input validation, context isolation |
| Data Leakage | Mishandling of PII in inputs/outputs | Automatic PII detection and masking |
| Inappropriate Output | Generation of discriminatory or harmful content | Implementation of guardrail models |
Organizational Measures:
- Establish an AI Governance Committee: A cross-functional approval process involving IT, legal, risk management, and business units
- Define Accountability: Appoint an "AI System Owner" for each AI use case
- Maintain Audit Logs: Record and ensure traceability of all AI communications
Formalizing Knowledge: AGENTS.md and SKILL.md
A system for documenting organizational standard processes and enabling AI agents to learn from them.
- AGENTS.md: The "constitution" of the project. Defines agent roles, available tools, and decision-making criteria
- SKILL.md: Specific workflows, such as "PR review procedures" or "incident response flows"
This prevents reliance on individual expertise and streamlines onboarding for new team members.
The Evolving Role of Engineers: Three Paradigm Shifts
From Writing Code → Designing Environments
Traditional engineering focused on "writing precise instructions (code)." Engineering in the AI era focuses on "designing environments where AI can function effectively."
Prompts, data, tools, memory, and evaluation criteria—all these are components of the "environment":
- Prompts: Clarify instructions for AI
- Data (RAG): Organize reference information for AI
- Tools (MCP): Define external functions accessible to AI
- Memory: Design the information AI should retain (short-term/working/long-term memory)
- Evaluation Criteria: Define metrics for assessing AI output quality (Golden Dataset)
The engineer's role is to combine these elements effectively to build an "environment" where AI can consistently create value. Coding skills remain important but are only a subset of the overall responsibilities.
From Individual Skill → Organizational Process Design
While individual ability to write excellent prompts is important, the essence lies in building systems that enable the entire team to maintain a consistent level of quality.
In traditional software development, processes like code reviews, testing, and CI/CD have been used to prevent reliance on individual expertise and ensure quality. A similar approach is needed for AI development:
- Git management and change tracking for prompts
- Automated regression testing with a Golden Dataset
- Formalizing organizational knowledge with AGENTS.md and SKILL.md
- Standardizing security checks (PII detection, injection prevention)
From Striving for Perfection → Driving Continuous Improvement
Traditional code is deterministic, always returning the same output for the same input. However, AI is probabilistic, and perfect initial settings do not exist.
What matters is the ability to rapidly iterate through monitoring, evaluation, and improvement cycles:
- Monitoring: Real-time monitoring of AI output quality (drift detection)
- Evaluation: Regular performance measurement using the Golden Dataset
- Improvement: Adjusting prompts and retesting when issues are detected
Organizations that can execute this "Build-Measure-Learn" loop on a weekly or daily basis will gain a competitive edge.
Moreover, AI systems cannot be managed by engineers alone. Collaboration with legal, compliance, domain experts, and product managers is essential. Beyond technical accuracy, engineers are also responsible for organizational consensus-building and selecting tools accessible to non-engineers (e.g., no-code editing environments like PromptLayer).
A Platform Engineering Perspective
Practices such as prompt code management and evaluation using a Golden Dataset align closely with the principles of platform engineering.
Instead of reinventing the wheel, verified prompt templates can be centrally provided as a "Golden Path" for engineers to use via self-service when needed. This reduces time spent on individual trial-and-error and improves the overall developer experience (DX) across the organization.
Conclusion: The Paradigm Shift in Engineering
Engineering in the AI era is transforming from "writing code" to "designing environments where AI can function effectively."
Three Key Transformations:
- Prompts are the new code: Rigorous management and testing are essential
- Evaluation must be data-driven, not subjective: Golden Dataset and LLM-as-a-Judge
- From individual skills to organizational systems: Formalizing knowledge with AGENTS.md/SKILL.md
The Evolution Path for Engineers:
- AI Users → AI Trainers → AI Co-Designers
By establishing the right organizational frameworks and equipping engineers with the necessary skills, organizations can transform LLMs from experimental tools into assets that continuously generate business value.