VP/AVP, Site Reliability Engineer, Group Consumer Banking and Big Data Analytics Technology, Technology & Operations
Group Technology and Operations (T&O) enables and empowers the bank with an efficient, nimble and resilient infrastructure through a strategic focus on productivity, quality & control, technology, people capability and innovation. In Group T&O, we manage the majority of the Bank's operational processes and aspire to delight our business partners through our multiple banking delivery channels.
Responsibilities
- Facilitate/Drive recovery calls for major incidents and coordinate with multiple teams to drive the resolution.
- Communicate on major incidents and provide regular updates to stakeholders.
- Ensure preventive and detective measures for the applications are identified and implemented.
- Automation of manual activities/processes for Production teams (automation experience required).
- Identifies persistent or recurring problems and recommends creative solutions.
- Great people skills to build and manage a high-performing team.
- Strong communication skills; collaborates well within a global team and ensures proper handoff of incidents and details.
- Ensure incidents are escalated and facilitated to enable efficient and timely service restorations.
- Drives Root Cause Analysis (RCA) with technology partners after incident resolution and facilitates RCA reviews.
- Manages the identification and development of monitoring and improvements (process/systemic) to improve the reliability of Production systems.
- Implements SRE practices in CORE Banking.
- Automate processes and remove toil.
- Build automation and Observability tools to detect, troubleshoot and recover systems faster and improve Production systems reliability and resiliency.
- Build predictive tools and solutions using Machine Learning capabilities. Make best use of available logs and instrumentation.
Requirements
- Minimum 12 years of experience, of which at least 7 years in Production Management, preferably in the banking industry.
- Experience managing Open Systems (ecosystems) and multi-country production environments.
- Good command of production infrastructure and of performance monitoring and reporting tools.
- SRE: implement Site Reliability Engineering principles for performance, reliability, monitoring and alerting in the Production environment.
- Over 7 years of experience in Incident Management, Change Management and Problem Management.
- Hands-on experience in Unix/Linux/Shell/Python scripting.
- Familiar with applications such as Xcelerate/TBMS systems, Quadient or equivalent statement generation engines, JBoss, MariaDB, EDB Postgres, Java (on the Linux operating system) and Kafka.
- Experience in supporting critical applications using API-driven, cloud-native technologies (e.g. AWS, PCF, OpenShift or Kubernetes).
- Good knowledge of Java, Spring Boot and Python.
- Good working experience with Elasticsearch, Logstash, Grafana/Kibana, AppDynamics, etc.
- Production automation: experience automating manual activities/processes for Production teams is required.
- Good knowledge of Machine Learning algorithms such as Regression, Classification, Decision Trees and Random Forests, and of Bagging and Boosting techniques such as XGBoost.
- Good communication skills, a positive work attitude and self-motivation; able to work independently as well as in a team.
- Good experience in running automation and improvements.
- Performing at the highest level in competencies such as:
- Achieving Excellence (4/5)
- People Management (4/5)
- Solution Driven (4/5)
- Relationship (4/5)
- Innovative (4/5)
- Purpose-Driven (4/5)
We offer a competitive salary and benefits package and the professional advantages of a dynamic environment that supports your development and recognises your achievements.