Introduction
At IBM Infrastructure & Technology, we design and operate the systems that keep the world running. From high-resiliency mainframes and hybrid cloud platforms to networking, automation, and site reliability, our teams ensure the performance, security, and scalability that clients and industries depend on every day. Working in Infrastructure & Technology means tackling complex challenges with curiosity and collaboration. You'll work with diverse technologies and colleagues worldwide to deliver resilient, future-ready solutions that power innovation. With continuous learning, career growth, and a supportive culture, IBM provides the opportunities to build expertise and shape the infrastructure that drives progress.
Your role and responsibilities
- Design, develop, and maintain scalable backend services for IBM Storage Insights, supporting observability, analytics, automation, AI-assisted workflows, and enterprise platform capabilities.
- Build robust backend services, APIs, and integration layers using Java, Python, and related frameworks, with a focus on performance, reliability, maintainability, and secure engineering practices.
- Develop RESTful, event-driven, and asynchronous service interfaces that integrate with distributed systems, data pipelines, monitoring platforms, and enterprise services.
- Contribute to backend architecture for multi-tenant, cloud-native systems, including service decomposition, API design, data modeling, caching, resiliency, and operational observability.
- Implement AI/ML-enabled capabilities, including LLM-backed workflows, agentic services, retrieval-augmented generation, entity and intent extraction, and automation use cases.
- Contribute to Agentic AI engineering with depth in one or more specialized areas such as LangGraph-based orchestration, tool-calling design, agent harness development, Langfuse-based observability, LLM evaluation, RAG pipelines, prompt/version management, guardrails, or production tracing. The role values demonstrated depth in at least one such area over broad but shallow familiarity across many frameworks.
- Build and maintain Python-based services, automation utilities, evaluation harnesses, and data-processing components to support AI/ML and backend engineering use cases.
- Collaborate with product management, architects, platform teams, SREs, and other engineering squads to deliver production-grade capabilities for enterprise customers.
- Ensure that delivered solutions meet high standards for scalability, security, reliability, performance, testability, and operational supportability.
- Contribute selectively to frontend or dashboard integrations where required, while primarily focusing on backend services, platform engineering, and AI-enabled capabilities.
Required education
Bachelor's Degree
Preferred education
Master's Degree
Required technical and professional expertise
- Around 4-6 years of hands-on software engineering experience, with strong backend programming experience in Java and/or Python.
- Strong understanding of backend development, including API design, service orchestration, concurrency, error handling, logging, testing, and production troubleshooting.
- Experience building RESTful APIs, asynchronous services, event-driven integrations, and backend components that interact with distributed systems.
- Strong programming fundamentals, including object-oriented design, data structures, algorithms, design patterns, database concepts, networking basics, and system design.
- Experience with SQL and/or NoSQL databases, including schema design, query development, indexing concepts, transactions, and data access patterns.
- Working knowledge of microservices architecture, containerized deployments, cloud-native development, and service-to-service communication.
- Hands-on experience with Python for backend development, automation, data processing, AI/ML integration, or engineering productivity use cases.
- Exposure to AI/ML or Generative AI engineering, including integrating LLM APIs, designing prompts, invoking tools, building evaluation harnesses, or implementing agent-based workflows.
- Demonstrated depth in at least one Agentic AI engineering area is preferred, such as agent orchestration, tool-use design, RAG implementation, evaluation harnesses, observability/tracing, guardrails, or production LLM workflow debugging.
- Ability to write clean, modular, testable, and maintainable code with appropriate unit, integration, and regression test coverage.
- Experience working in agile engineering teams with strong ownership of design, implementation, code reviews, debugging, and production-readiness.
- Basic working knowledge of JavaScript/TypeScript and frontend frameworks such as React is useful, but the primary expectation is strong backend engineering capability.
Preferred technical and professional experience
- Experience with Java backend frameworks such as Spring Boot, Quarkus, Micronaut, or similar enterprise backend technologies.
- Experience building Python services using frameworks such as FastAPI, Flask, or similar backend/API frameworks.
- Deep specialization in one or more Agentic AI engineering areas, rather than only general awareness of multiple tools. Examples include:
  - Agent orchestration: LangGraph, LangChain, stateful workflows, planner-worker patterns, multi-agent coordination.
  - Agent harnesses: test harnesses, tool simulators, scenario replay, regression suites, deterministic evaluation flows.
  - LLM observability: Langfuse, tracing, prompt/version tracking, latency and cost instrumentation, failure analysis.
  - Tool-calling and integrations: API tool design, schema design, action validation, secure execution, retry and fallback patterns.
  - RAG and retrieval systems: embeddings, vector databases, chunking strategies, ranking, grounding, and response quality evaluation.
  - AI safety and reliability: guardrails, confidence scoring, fallback design, hallucination mitigation, production monitoring.
  - Model evaluation: golden datasets, automated evals, human-in-the-loop review, quality metrics, and release gates.
- Experience with observability, monitoring, telemetry, analytics platforms, or infrastructure management systems.
- Experience with CI/CD pipelines, DevOps practices, container orchestration, automated testing, and production deployment workflows.
- Experience with high-scale, secure, multi-tenant enterprise platforms.
- Familiarity with distributed data processing, streaming/event systems, caching, message queues, and backend performance optimization.
- Exposure to frontend dashboards, visualization libraries, or React-based UI integration is a plus, but not the core focus of the role.
Years of Experience: 3-8