Rethinking Engineering in the Age of AI

As the use of large language models (LLMs) expands from individuals to organizations, the need to manage prompts with the same rigor as traditional source code becomes increasingly important.

This article discusses three organizational practices—prompt codification, quantitative evaluation using a Golden Dataset, and establishing security governance frameworks—as well as the evolving role of engineers.

It organizes thoughts on transforming AI from a "personal convenience tool" to a "reliable organizational asset."

Prompt as Code: The Three Pillars of Organizational Practices

1. Codifying and Centrally Managing Prompts

Prompts are "instruction sets for AI" and represent critical assets that determine system behavior. These should be managed in Git, just like source code, rather than in Excel or document tools.

Implementation Pattern:

Defining prompts in formats like YAML or JSON and enabling external variable injection provides the following benefits:

Reusability: Use the same prompt in multiple places
Testability: Conduct automated tests by changing variables
Change History: Track changes via Git
Review: Ensure quality through pull requests

2. Quantitative Evaluation Using a Golden Dataset

A system is needed to determine whether "AI output has improved or degraded" based on data rather than subjective judgment.

Definition of a Golden Dataset: A test dataset that includes real-world inquiries from production, typical use cases, and edge cases (boundary conditions).

Input Example	Expected Output	Evaluation Criteria
"I want to return this."	Explanation of return policy + steps	Accuracy, tone
"When will my delivery arrive?"	Instructions to check delivery status	Conciseness
"This is a complaint!"	Apology + escalation	Appropriate response

Whenever prompts are updated, tests must be run against the Golden Dataset to ensure no performance regressions occur. Integrating this into CI/CD (Continuous Integration/Continuous Delivery) pipelines prevents quality degradation.

LLM-as-a-Judge (AI-based Evaluation): Manually checking hundreds of outputs is impractical. Therefore, using a more advanced LLM as an evaluator has become a mainstream approach.

Example evaluation prompt:

3. Establishing Security and Governance Frameworks

Organizational frameworks are required to address the unique security risks of AI systems.

Key Risks:

Risk	Description	Countermeasures
Prompt Injection	Overwriting AI instructions with malicious input	Input validation, context isolation
Data Leakage	Mishandling of PII in inputs/outputs	Automatic PII detection and masking
Inappropriate Output	Generation of discriminatory or harmful content	Implementation of guardrail models

Organizational Measures:

Establish an AI Governance Committee: A cross-functional approval process involving IT, legal, risk management, and business units
Define Accountability: Appoint an "AI System Owner" for each AI use case
Maintain Audit Logs: Record and ensure traceability of all AI communications

Formalizing Knowledge: AGENTS.md and SKILL.md

A system for documenting organizational standard processes and enabling AI agents to learn from them.

AGENTS.md: The "constitution" of the project. Defines agent roles, available tools, and decision-making criteria
SKILL.md: Specific workflows, such as "PR review procedures" or "incident response flows"

This prevents reliance on individual expertise and streamlines onboarding for new team members.

The Evolving Role of Engineers: Three Paradigm Shifts

From Writing Code → Designing Environments

Traditional engineering focused on "writing precise instructions (code)." Engineering in the AI era focuses on "designing environments where AI can function effectively."

Prompts, data, tools, memory, and evaluation criteria—all these are components of the "environment":

Prompts: Clarify instructions for AI
Data (RAG): Organize reference information for AI
Tools (MCP): Define external functions accessible to AI
Memory: Design the information AI should retain (short-term/working/long-term memory)
Evaluation Criteria: Define metrics for assessing AI output quality (Golden Dataset)

The engineer's role is to combine these elements effectively to build an "environment" where AI can consistently create value. Coding skills remain important but are only a subset of the overall responsibilities.

From Individual Skill → Organizational Process Design

While individual ability to write excellent prompts is important, the essence lies in building systems that enable the entire team to maintain a consistent level of quality.

In traditional software development, processes like code reviews, testing, and CI/CD have been used to prevent reliance on individual expertise and ensure quality. A similar approach is needed for AI development:

Git management and change tracking for prompts
Automated regression testing with a Golden Dataset
Formalizing organizational knowledge with AGENTS.md and SKILL.md
Standardizing security checks (PII detection, injection prevention)

From Striving for Perfection → Driving Continuous Improvement

Traditional code is deterministic, always returning the same output for the same input. However, AI is probabilistic, and perfect initial settings do not exist.

What matters is the ability to rapidly iterate through monitoring, evaluation, and improvement cycles:

Monitoring: Real-time monitoring of AI output quality (drift detection)
Evaluation: Regular performance measurement using the Golden Dataset
Improvement: Adjusting prompts and retesting when issues are detected

Organizations that can execute this "Build-Measure-Learn" loop on a weekly or daily basis will gain a competitive edge.

Moreover, AI systems cannot be managed by engineers alone. Collaboration with legal, compliance, domain experts, and product managers is essential. Beyond technical accuracy, engineers are also responsible for organizational consensus-building and selecting tools accessible to non-engineers (e.g., no-code editing environments like PromptLayer).

A Platform Engineering Perspective

Practices such as prompt code management and evaluation using a Golden Dataset align closely with the principles of platform engineering.

Instead of reinventing the wheel, verified prompt templates can be centrally provided as a "Golden Path" for engineers to use via self-service when needed. This reduces time spent on individual trial-and-error and improves the overall developer experience (DX) across the organization.

Conclusion: The Paradigm Shift in Engineering

Engineering in the AI era is transforming from "writing code" to "designing environments where AI can function effectively."

Three Key Transformations:

Prompts are the new code: Rigorous management and testing are essential
Evaluation must be data-driven, not subjective: Golden Dataset and LLM-as-a-Judge
From individual skills to organizational systems: Formalizing knowledge with AGENTS.md/SKILL.md

The Evolution Path for Engineers:

AI Users → AI Trainers → AI Co-Designers

By establishing the right organizational frameworks and equipping engineers with the necessary skills, organizations can transform LLMs from experimental tools into assets that continuously generate business value.