Toronto, Ontario
Job Summary
Support engineers are needed to onboard into the AIDA ecosystem and ensure high availability and reliability of all AIDA features across a global user base. The engineer(s) will be responsible for proactive monitoring, incident triage, troubleshooting, and resolution, along with delivering small enhancements and bug fixes across Python services, web applications, data/analytics pipelines, and cloud infrastructure.
This role requires a strong production support mindset, hands-on troubleshooting skills, and the ability to work in a shift-based model for near-24x7 coverage.
Key Responsibilities
Production Monitoring & Uptime (Primary)
Monitor all features/services within the AIDA ecosystem to ensure uptime, stability, and performance
Perform proactive checks on health dashboards, logs, alerts, and metrics; identify issues before users are impacted
Own incident management (triage, diagnosis, mitigation, resolution) and drive restoration within SLA
Conduct root cause analysis (RCA) and implement corrective/preventive actions (CAPA)
Maintain and improve runbooks, SOPs, knowledge articles, and on-call procedures
Support Operations (L2/L3)
Handle support tickets and production issues including functional issues, system errors, integration failures, and data pipeline disruptions
Manage escalations effectively, coordinating with product/engineering, infrastructure, and vendor teams as required
Track reliability KPIs (availability, MTTR, incident trends) and contribute to continuous improvement
Enhancements & Fixes (Secondary)
Deliver small enhancements, configuration changes, and bug fixes in:
Python services/scripts and automation
Web UI components (Angular/React) and/or backend (.NET where applicable)
Data workflows (Dataiku pipelines/recipes, job scheduling)
AWS infrastructure/app services
Release & Change Support
Support deployments (lower environments through prod), validate smoke tests, and assist in release readiness activities
Ensure proper change documentation and rollback readiness for production changes
Skill Requirements
Required Technical Skills
Python: troubleshooting, scripting, API debugging, automation
Web/App Support: exposure to Angular or React, and understanding of web app troubleshooting (frontend/backend)
Backend exposure: knowledge of .NET is desirable (or ability to troubleshoot service-side issues)
Dataiku: ability to monitor and troubleshoot Dataiku jobs, pipelines, failures, and scheduling
AWS: working knowledge of cloud monitoring, logs, and typical services (e.g., IAM, EC2/ECS/EKS, S3, CloudWatch, Lambda, depending on the stack)
Observability & Support Tools: logging/monitoring, alerting systems, ticketing tools (ServiceNow/Jira), dashboards