Background
This article is about how to enter a new field quickly. The method used here is seeking truth from facts, with three steps: investigation, practice, and iteration through review. The concrete example in this article is agent engineering in the AI domain.
- Investigation: what problems appeared when entering the AI field, how to approach it, and what initial assumptions were formed through investigation
- Conclusions: what core judgments were formed after investigation and practice
- Practical engineering: how those judgments were derived, designed, and validated through a real agent engineering project
Chinese version: 我进入 AI 领域的方式,以及 Agent SDK 的设计
Investigation
Exploring and understanding what AI is
- Understand the basic principles of AI and its capability boundaries
- Understand engineering objects such as prompt, agent, context, and MCP
- Understand where AI is effective in real delivery and where it loses control
Deriving a human-AI collaboration model from AI’s characteristics
- Decide which problems are suitable for AI and which must remain human-led
- Reassign engineering focus toward requirement discovery, problem decomposition, architecture design, and execution control
- Form a practical way of working where humans and AI collaborate on engineering tasks
Validating these judgments in practice and turning them into tools
- Use real projects to validate whether the previous judgments hold
- Turn effective methods into process, runtime, and tooling
- Make those capabilities reusable, extensible, and sustainable
Conclusions from AI Project Retrospectives
- From the AI technology perspective: AI behaves more like a group of highly capable but not fully controllable executors. The key is to define goals, paths, and constraints first, then progressively compress highly uncertain problems into more concrete and more deterministic ones.
- From the perspective of personal capability focus: in the AI era, more value shifts upward into requirement discovery, problem definition, problem decomposition, solution design, process control, and result acceptance.
- From the perspective of project practice: several projects in different directions point to the same conclusion. AI works better inside systems that are already decomposed, constrained, and supported by documents and process than as a boundary-free replacement for an entire workflow.
| Project | Domain | Main Goal | Current Conclusion |
|---|---|---|---|
| agent_runtime | Infra | Provide a shared runtime foundation for agent systems, unifying session, orchestration, context, tooling, model, observability, and storage capabilities | Stable AI engineering depends not only on model calls, but on a controllable runtime |
| sdlc | Software delivery | Connect requirement, design, implementation, and verification into a staged delivery chain, and support continuous update based on existing artifacts | AI fits better into decomposed, constrained, and document-backed engineering processes than into full end-to-end replacement |
| travel direction project | Product / business | Validate AI capability in complex user scenarios such as requirement discovery, planning, and continuous task carrying | AI improves information processing and plan generation efficiency, but the real difficulty still lies in requirement discovery and problem definition under complex scenarios |
Practical Engineering
Why I Started with an Agent SDK
Conclusion: an Agent SDK is the right place to start because it is the most suitable layer for carrying the common problems in AI engineering.
- Runtime hosting
- Interface boundaries
- Orchestration control
- Context governance
- Tool integration
- Model integration
- Permission control
- Observability
It is foundational enough, general enough, and suitable enough to become a shared base reused by later projects.
SDK Design
Requirement Definition and Decomposition
From a requirement perspective, this Agent SDK is not solving one isolated feature problem. It is solving a group of recurring needs that repeatedly appear in real agent engineering. The goal is to compress these needs into shared foundational capabilities, so upper-layer projects no longer rebuild runtime foundations repeatedly and can instead evolve on top of unified boundaries.
A need for unified runtime hosting: an agent system cannot stop at one-shot model calls. As soon as it enters multi-turn conversation, tool use, state carrying, and result write-back, it needs a unified capability to host session lifecycle, execution flow, and runtime state. Otherwise every project handles these basics separately, at high cost and with poor consistency.
A need for stable external interface boundaries: upper-layer projects need stable, clear, and reusable entry points rather than direct exposure to internal implementation details. In other words, what external users need first is a stable boundary, not a larger pile of capabilities. Only with consistent access boundaries can later capability growth, runtime evolution, and cross-project reuse stay manageable.
A need for controllable agent orchestration and execution control: different requests need different execution paths, and execution also needs explicit stopping conditions, failure handling, and state update mechanisms. The system therefore needs unified orchestration and execution control rather than scattering execution decisions across business logic, temporary rules, and prompts.
A need for context governance: multi-turn agent systems naturally depend on context, but context is not just a single large input. Conversation history, stage memory, retrieval results, and context budget constraints are different by nature. If they are mixed together carelessly, the system quickly loses explainability and maintainability. What is really needed is not “more context”, but governable context.
A need for MCP / tool integration and execution: agents do more than generate text. They also need to access and execute external capabilities. Tool registration, discovery, dispatch, and execution should therefore become shared foundational capabilities rather than private implementations rebuilt by each project.
A need for permission control and safety boundaries: tool calls and external capability access naturally introduce safety and permission risks. The system must define not only what can be executed, but also what must be blocked. Without this layer of control, stronger capabilities create larger risks.
A need for model integration and abstraction: models are an important input to agent systems, but upper-layer projects should not directly absorb differences between model providers and invocation styles. What they really need is a unified model access boundary that hides switching, invocation differences, and output normalization.
A need for execution tracing and runtime fact recording: AI systems cannot be judged only by final results. Execution steps, key events, tool calls, failure points, and resource usage all need to be recorded so the system becomes traceable, analyzable, and verifiable.
A need for future evolution: this SDK will not stop at its current capability boundary. It will continue to extend into retrieval, memory, checkpoint, compression, and multi-agent capabilities. That means clear room for evolution must exist from the beginning.
Architecture Design
Once the requirements are made clear, the next step is to map them into an implementable and extensible technical architecture. Overall, this SDK adopts a layered runtime architecture. Different layers are responsible for interface exposure, lifecycle control, orchestration, context governance, capability integration, model integration, observability, and persistence.
Overall architecture:
[Figure: layered architecture diagram. The layers are described one by one below.]
Application Layer
Responsibilities:
- Provide terminal or external application entry points, so the SDK has a directly usable entry form
- Handle user input/output and runtime access, keeping interaction logic decoupled from internal runtime capabilities
Included modules:
- TerminalSessionDemo: provides a terminal interaction entry for reading input, invoking runtime APIs, and presenting results
- External Application Integrations: entry forms for external apps or host systems to connect to the SDK
Interface Layer
Responsibilities:
- Expose stable session and execution interfaces, so the external access boundary does not shift as the runtime evolves
- Hide internal runtime details, so upper-layer projects do not couple directly to internal implementation
Included modules:
- Api: defines and exposes the external boundary of the SDK
- RuntimeApi: defines stable lifecycle interfaces such as session create, open, and close
- ISession: defines session access, runtime state query, and execution entry points
- Agent / Session API Contracts: unify the contracts exposed externally
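As a rough illustration of such a boundary, the sketch below expresses the lifecycle and session contracts as Python protocols with a toy in-memory implementation. The article only names `RuntimeApi` and `ISession`; the method names (`create_session`, `open_session`, `close_session`, `run`, `state`) and the `EchoSession` and `InMemoryRuntimeApi` classes are assumptions made for this example, not the SDK's actual signatures.

```python
from dataclasses import dataclass, field
from typing import Protocol
import uuid


class ISession(Protocol):
    """Hypothetical session contract: state query plus an execution entry point."""

    session_id: str

    def state(self) -> dict: ...
    def run(self, user_input: str) -> str: ...


class RuntimeApi(Protocol):
    """Hypothetical lifecycle contract: create, open, and close sessions."""

    def create_session(self) -> ISession: ...
    def open_session(self, session_id: str) -> ISession: ...
    def close_session(self, session_id: str) -> None: ...


@dataclass
class EchoSession:
    """Toy session; a real one would delegate to the runtime controller."""

    session_id: str
    turns: list[str] = field(default_factory=list)

    def state(self) -> dict:
        return {"session_id": self.session_id, "turn_count": len(self.turns)}

    def run(self, user_input: str) -> str:
        self.turns.append(user_input)
        return f"echo: {user_input}"


class InMemoryRuntimeApi:
    """Keeps sessions in a dict so callers only ever see the two contracts."""

    def __init__(self) -> None:
        self._sessions: dict[str, EchoSession] = {}

    def create_session(self) -> ISession:
        session = EchoSession(session_id=uuid.uuid4().hex)
        self._sessions[session.session_id] = session
        return session

    def open_session(self, session_id: str) -> ISession:
        return self._sessions[session_id]  # KeyError if closed or unknown

    def close_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```

Because callers depend only on the two protocols, the in-memory class could later be swapped for a persistent runtime without changing upper-layer code.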
Runtime Controller Layer
Responsibilities:
- Own the session lifecycle, giving multi-turn systems a unified execution unit
- Centralize execution flow and runtime state, avoiding control logic scattered outside the runtime
- Normalize results and runtime scheduling through a unified execution boundary
Included modules:
- Runtime: runtime initialization, session lifecycle management, and unified entry dispatch
- AgentSession: the main execution chain, state carrying, and result normalization for a single session
- RunCheckpoint: reserved boundaries for checkpoint, resume, and background execution
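The idea of centralizing stopping conditions, failure handling, and result normalization can be sketched as a small bounded loop. This is an illustration only: `run_with_limits`, `RunResult`, and the three status strings are hypothetical names, and the step function stands in for whatever one round of agent execution does.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A step receives the 1-based step index and returns a final answer,
# or None to signal that execution should continue.
Step = Callable[[int], Optional[str]]


@dataclass
class RunResult:
    """Normalized result: every run ends in exactly one of these statuses."""

    status: str  # "completed", "max_steps", or "failed"
    output: Optional[str]
    steps: int
    error: Optional[str] = None


def run_with_limits(step: Step, max_steps: int = 5) -> RunResult:
    """Drive a step function until it answers, fails, or hits the step limit."""
    for index in range(1, max_steps + 1):
        try:
            answer = step(index)
        except Exception as exc:  # failure handling lives in the controller
            return RunResult("failed", None, index, error=str(exc))
        if answer is not None:
            return RunResult("completed", answer, index)
    return RunResult("max_steps", None, max_steps)
```

Keeping the limit and the failure path in one place means callers always receive a `RunResult`, whatever happened inside the step.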
Agent Orchestration Layer
Responsibilities:
- Handle routing and execution-mode orchestration for agent requests
- Select the proper execution path for different requests instead of forcing all requests into one pattern
Included modules:
- AgentSelector: chooses execution mode based on request characteristics and runtime state
- ChatAgent: handles direct conversational requests
- ReActAgent: handles iterative execution with tool use and observation feedback
- PEOAgent: handles staged plan-execute-observe execution
- MultiAgentProtocol: reserved protocol boundary for multi-agent collaboration
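A minimal sketch of the routing idea follows, assuming the request already carries two signals. The `needs_tools` and `multi_stage` flags, the `Mode` enum, and the rule order are assumptions for illustration; the article does not describe how `AgentSelector` actually decides.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CHAT = "chat"    # direct conversational reply
    REACT = "react"  # iterative tool use with observation feedback
    PEO = "peo"      # staged plan-execute-observe


@dataclass
class Request:
    text: str
    needs_tools: bool = False  # hypothetical signal, e.g. from intent routing
    multi_stage: bool = False  # hypothetical signal for staged work


def select_mode(request: Request) -> Mode:
    """Rule-based routing: staged work first, then tool use, else plain chat."""
    if request.multi_stage:
        return Mode.PEO
    if request.needs_tools:
        return Mode.REACT
    return Mode.CHAT
```

Even a rule table this small keeps the routing decision in one inspectable function instead of spreading it across prompts and business logic.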
Context Governance Layer
Responsibilities:
- Manage multiple context sources in multi-turn systems
- Control context assembly, budgeting, and trimming
Included modules:
- SessionTranscript: carries session history as the base of multi-turn context
- RuntimeMemory: carries stage memory and summarized state
- RetrievalProvider: provides retrieval-based context input
- ContextAssembler: assembles context from multiple sources
- ContextBudgetPolicy: controls context budget, trimming, and constraints
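To make "assembly under a budget" concrete, here is a small sketch that keeps context items by priority until a token budget is used up. The `ContextItem` fields, the word-count token estimate, and the skip-if-too-large rule are all assumptions; a real policy would use the model's tokenizer and probably summarize rather than drop.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    source: str    # "transcript", "memory", or "retrieval"
    text: str
    priority: int  # lower number = kept first


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep items in priority order while they fit inside the token budget."""
    kept: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost > budget:
            continue  # does not fit; a later, smaller item may still fit
        kept.append(item)
        used += cost
    return kept
```

Tagging each item with its source is what keeps the assembled context explainable: it is always possible to say which source contributed which part and what was left out.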
Capability and Tooling Layer
Responsibilities:
- Carry external capability access and invocation
- Unify tool registration, dispatch, and execution boundaries
- Manage permission control and execution environments
Included modules:
- McpGateway: receives tool invocation requests and dispatches them
- McpToolRegistry: manages tool registration, discovery, and lookup
- RuntimePermissionPolicy: performs permission checks and capability constraints before execution
- ExecutionEnvironment: defines local, sandboxed, or remote execution boundaries
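The split between registration, permission checking, and dispatch might look like the sketch below. It does not implement the MCP protocol itself; `ToolRegistry`, `AllowListPolicy`, `Gateway`, and the deny-by-default allow list are hypothetical stand-ins for the modules named above.

```python
from typing import Callable

Tool = Callable[..., object]


class PermissionDenied(Exception):
    """Raised when a tool call is blocked before execution."""


class ToolRegistry:
    """Registration and lookup only; no execution logic lives here."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        self._tools[name] = tool

    def lookup(self, name: str) -> Tool:
        return self._tools[name]  # KeyError for unknown tools


class AllowListPolicy:
    """Deny by default: only explicitly allowed tool names may run."""

    def __init__(self, allowed: set[str]) -> None:
        self._allowed = allowed

    def check(self, name: str) -> None:
        if name not in self._allowed:
            raise PermissionDenied(name)


class Gateway:
    """Single dispatch point: permission check first, then lookup and call."""

    def __init__(self, registry: ToolRegistry, policy: AllowListPolicy) -> None:
        self._registry = registry
        self._policy = policy

    def invoke(self, name: str, **kwargs: object) -> object:
        self._policy.check(name)
        return self._registry.lookup(name)(**kwargs)
```

Routing every call through one `invoke` method is what makes "what must be blocked" enforceable: a registered tool that is not on the allow list never executes.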
Model Integration Layer
Responsibilities:
- Unify model integration patterns
- Hide differences across providers and invocation styles from upper layers
Included modules:
- ModelFactory: creates model instances and selects integration methods
- Provider Adapters: isolate provider differences
- StreamingEventAdapter: normalizes streaming events and outputs
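One way to picture the unified model boundary is a factory that maps provider names to adapters sharing a single interface. The `ChatModel` protocol, its `complete` method, and the `register`/`create` methods are assumptions for this sketch; only a mock provider is shown, since the article does not name the real ones.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Hypothetical unified boundary that upper layers depend on."""

    def complete(self, prompt: str) -> str: ...


class MockModel:
    """Deterministic provider for tests; always returns a canned reply."""

    def __init__(self, reply: str = "mock reply") -> None:
        self._reply = reply

    def complete(self, prompt: str) -> str:
        return self._reply


class ModelFactory:
    """Maps a provider name to a builder so callers never import adapters."""

    def __init__(self) -> None:
        self._builders: dict[str, Callable[[], ChatModel]] = {}

    def register(self, provider: str, builder: Callable[[], ChatModel]) -> None:
        self._builders[provider] = builder

    def create(self, provider: str) -> ChatModel:
        if provider not in self._builders:
            raise ValueError(f"unknown provider: {provider}")
        return self._builders[provider]()
```

A mock provider behind the same interface also gives the rest of the runtime a deterministic model to test against, which matters when real model output is probabilistic.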
Observability Layer
Responsibilities:
- Record execution process and key events
- Provide the foundation for tracing, analysis, and verification
Included modules:
- Trace: records key execution events and invocation chains
- Metrics: collects runtime metrics, resource consumption, and invocation statistics
- Diagnostics / Usage: supplements runtime diagnostics and usage facts
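Recording execution steps as facts can be sketched with a context manager that logs each span's name, outcome, and duration. The `Trace.span` API and the `TraceEvent` fields are hypothetical; the sketch keeps events in memory, whereas a real runtime would pass them to the data layer.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    name: str
    status: str  # "ok" or "error"
    duration_ms: float
    attrs: dict = field(default_factory=dict)


class Trace:
    """Collects events in memory, in the order the spans finish."""

    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    @contextmanager
    def span(self, name: str, **attrs):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise  # record the failure but never swallow it
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            self.events.append(TraceEvent(name, status, elapsed, attrs))
```

Wrapping a tool or model call in `with trace.span("tool_call", tool="add"):` records the attempt whether it succeeds or raises, so failure points show up in the same event stream as successes.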
Data Layer
Responsibilities:
- Persist runtime data, so that runtime facts are retained reliably
- Unify the persistence boundary for transcript, memory, checkpoint, trace, and metrics, and provide stable data foundations for future evolution
Included modules:
- Storage: defines the unified persistence interface
- File / Remote Persistence Backends: provide file-based or remote persistence implementations
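A unified persistence boundary with a file backend could be as small as the sketch below. The `append`/`read_all` methods, the `kind` argument, and the JSON Lines layout are assumptions chosen for illustration; the article does not specify the actual `Storage` interface or file format.

```python
import json
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """Hypothetical persistence boundary shared by transcript, trace, and so on."""

    def append(self, kind: str, record: dict) -> None: ...
    def read_all(self, kind: str) -> list[dict]: ...


class FileStorage:
    """Stores each record kind as a JSON Lines file under one directory."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, kind: str) -> Path:
        return self._root / f"{kind}.jsonl"

    def append(self, kind: str, record: dict) -> None:
        with self._path(kind).open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

    def read_all(self, kind: str) -> list[dict]:
        path = self._path(kind)
        if not path.exists():
            return []
        with path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
```

Since transcript, memory, checkpoint, trace, and metrics would all go through the same two methods, a remote backend would only need to implement this protocol to replace the file one.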
Delivery Validation
The current implementation status can be summarized as follows:
| Layer | Module | Status | Notes |
|---|---|---|---|
| Application Layer | TerminalSessionDemo | Basic implementation | A terminal entry already exists for manual running and session-flow validation |
| Interface Layer | Api / RuntimeApi / ISession | Basic implementation | Session-oriented interface boundaries already exist, but still in a foundational form |
| Interface Layer | AgentApi / IAgent | Basic implementation | A direct-agent entry has been added as a complement to session APIs |
| Runtime Controller Layer | Runtime / AgentSession | Basic implementation | Session lifecycle, execution chain, state updates, and result normalization are in place, but more complex control scenarios remain to be extended |
| Runtime Controller Layer | RunCheckpoint | Partially complete | Checkpoint boundaries exist, but resume, retry, and background execution are still pending |
| Agent Orchestration Layer | AgentSelector / intent routing | Partially complete | Routing capability exists, but strategy and modes can continue to evolve |
| Agent Orchestration Layer | ChatAgent / ReActAgent / PEOAgent | Basic implementation | Three major execution paths are already landed, though still at a foundational orchestration level |
| Agent Orchestration Layer | MultiAgentProtocol | Reserved for extension | Reserved in architecture but not part of the current main execution path |
| Context Governance Layer | SessionTranscript / RuntimeMemory / ContextAssembler / ContextBudgetPolicy | Partially complete | Basic support for history, stage memory, context assembly, and budget control exists, but finer governance in complex scenarios is still pending |
| Context Governance Layer | RetrievalProvider | Partially complete | The retrieval boundary exists, but more complete and rigorous retrieval / RAG support is still pending |
| Capability and Tooling Layer | McpGateway / McpToolRegistry / RuntimePermissionPolicy / ExecutionEnvironment | Basic implementation | Tool integration, dispatch, permission control, and execution boundaries are in place, but still foundational |
| Model Integration Layer | ModelFactory / StreamingEventAdapter | Basic implementation | Unified model access and streaming-event adaptation exist, while model-side capability can continue to evolve |
| Model Integration Layer | Provider Adapters | Partially complete | Mock and current providers are connected, but provider coverage remains limited |
| Observability Layer | Trace / Metrics | Basic implementation | Key runtime events and metrics have entered the main path, but observability depth can still be improved |
| Observability Layer | Diagnostics / Usage | Partially complete | Basic runtime facts are recorded, but more complete diagnostics are still pending |
| Data Layer | Storage / File Persistence | Basic implementation | Unified storage interfaces and file persistence exist, but the data layer is still mainly focused on basic persistence |
| Data Layer | Remote Persistence Backends | Reserved for extension | Architectural boundaries exist, but remote persistence is not implemented yet |
Summary of the Agent SDK
What this SDK design and implementation taught
- Shared foundational capability: an Agent SDK does not face isolated problems. It faces a set of recurring problems that appear together, including interface boundaries, execution control, context governance, tool integration, model invocation, observability, and persistence. The value of the framework lies in separating these concerns and turning them into shared infrastructure.
- Capability control: an Agent SDK is not only responsible for bringing external capabilities into the system. It must also control how those capabilities are used, including MCP boundaries, tool registration and dispatch, permission checks, and execution environment constraints.
- Memory management: multi-turn agent systems naturally need history carrying and stage memory, but those signals cannot grow without bound and cannot simply be mixed together. Transcript, memory, retrieval, and context budget need to be treated as separate governance problems.
- Scenario hosting: the key job of an Agent SDK is not just model access. It is to host real scenario requirements such as multi-turn state, tool execution, permission constraints, memory management, and result write-back, and organize them into a runnable system.
Comparison with industry Agent SDKs
Common points:
- They solve the same category of problems: all of them deal with multi-turn runtime hosting, tool organization, permission boundaries, runtime recording, and building relatively stable control over uncertain model behavior.
- They share a similar architectural skeleton: all of them need stable interface boundaries, unified runtime hosting, orchestration and routing, tool and model integration, observability, and persistence.
Differences:
- Different maturity and capability depth: compared with mature industry SDKs, this implementation is still foundational. It still has clear gaps in strategy maturity, breadth of capability, stability in complex scenarios, multi-agent support, long-term memory, and complex permission governance.
- Different priority: the current priority of this SDK is to validate what architectural boundaries, runtime layering, and capability organization an agent system should have at the engineering level, rather than to maximize feature completeness first.
Comparison with previous engineering experience
Common points:
- Problem and requirement analysis: the system still has to answer what problem it solves, who it serves, and where its boundaries are.
- Stable external interfaces: how external users access the system and how internal implementation evolves remain core engineering questions.
- Capabilities must be organized: which capabilities belong in shared infrastructure and which belong to upper-layer business logic is still a key architectural problem.
- Clear module layering: interface, control, capability, and data layers still need explicit boundaries and dependency discipline.
- Common runtime-system concerns: lifecycle management, capability boundaries, state carrying, observability, and persistence are still classic runtime-system concerns.
Differences:
- LLM uncertainty: model outputs are probabilistic by nature. The same input does not always produce the same result, which directly affects control, debugging, and verification.
- Context dynamism: context keeps evolving with history, memory, retrieval results, and budget constraints. This is not only a matter of more context, but of harder memory management: what to keep, what to compress, and what should enter the current execution round.
- Dynamic execution paths: whether to call tools, which tools to call, whether to continue, and when to stop often need to be decided at runtime.
- Open capability boundaries: agent systems actively connect to external tools, external data sources, and execution environments, which significantly amplifies safety and permission-governance complexity.
- Higher verification requirements: it is not enough to verify final results. The process, state, and runtime facts must also be trustworthy and traceable.
Summary
When entering a new field, the priority is not to chase trends first. The priority is to keep correcting judgment through investigation, practice, and iteration through review. The method used here is seeking truth from facts.
- Investigation: first understand what it is, what its strengths and weaknesses are, and what industry solutions and experience already exist.
- Practice: form judgments from the investigation, then validate them through concrete engineering.
- Iteration through review: summarize and review practice results, supplement the earlier investigation, update the existing judgments, and continue validating them.
The Agent SDK is one of the earliest engineering foundations produced by this methodology. It already shows that the direction is valid, that the architectural skeleton is complete even though several modules are still foundational or reserved for extension, and that it can continue evolving as a shared base for later projects.