Site Reliability Engineer
Job Description
Site Reliability Engineering:
- Act as the Subject Matter Expert (SME) in a variety of enterprise monitoring technologies and solutions.
- Analyze the organization's operational and functional needs and translate them into monitoring solutions that remain compliant with standard IT policies and procedures.
- Build a catalog with detailed descriptions of system monitoring parameters and integrate them to optimize the overall value and effectiveness.
- Manage the life cycle (onboarding, maintenance, migration and retirement) of the monitoring tools in use (for example, IPCenter, Dynatrace, AppDynamics, Splunk and SCOM) and maximize the benefit from each tool. Administer and provide software support for monitoring tools, and perform the necessary customization and implementation with any of the tool suites.
- Handle the day-to-day administration of the monitoring platform, with a focus on improvements that reduce alert volumes without compromising system stability and availability.
- Maintain and support infrastructure Monitoring environment to ensure the highest availability while reducing the impact of incidents.
- Collaborate with stakeholders across Moody's IT and business teams on projects and support initiatives, and build automated solutions that help detect, log and resolve events and problems that could cause service disruptions.
- Perform and provide certification or feedback on Production Readiness for various technology solutions owned by the organization.
- Conduct in-depth evaluations of monitoring and alert data to assist with the diagnosis of infrastructure and application problems.
- Test, recommend and implement new monitoring technologies. Retire underused and outdated monitoring technologies that carry higher costs and/or diminishing returns.
- Develop governance reports and analyze IT performance data using Tableau or Power BI.
- Maintain the knowledge (documentation), reports and other artifacts in a central repository (ServiceNow Knowledge Base).
Disaster Recovery Leadership:
- Manage the strategy, design, implementation, execution, automation, documentation and communication of business continuity and disaster recovery plans and processes that ensure the seamless and successful failover, security and integrity of data, applications, databases, infrastructure systems and other related technologies.
- Partner with internal and external stakeholders and supplier teams to understand the Infrastructure Service organization's objectives and challenges, as well as the needs of the Business Continuity Management (BCM) and Disaster Recovery (DR) functions, and address them to deliver organizational goals.
- Own, streamline, optimize, automate, document and continuously enhance the BCM and DR plans and the corresponding tasks.
- Own, plan, execute and report on the scheduled BCM and DR exercises, with successful cross-functional coordination, matrix resource management, crisp communication, enhanced documentation and change management.
- Reduce the overall execution duration and the risk of failure, and maximize the success rate, by automating tasks and eliminating redundant steps in the workflows.
- Establish and oversee the successful delivery of DR plan roadmaps with Moody's internal and supplier IT teams, Info Risk, Audit and other stakeholders as applicable.
- Conduct risk analysis to identify critical operations and systems that are core to continued business services in the event of a disruption and include them in the DR planning scope for successful delivery and risk mitigation.
- Develop and deploy training, documentation, and communication of disaster procedures to the organization.
- Own and manage the relevant contracts with suppliers for off-site and other resources required for the execution of Enterprise Monitoring and Disaster Recovery responsibilities.
- Review suppliers' SLA and SLO details and be accountable for their improvement.
- Ensure that organizational goals and milestones are met while adhering to approved budgets.
- Develop, enhance and enforce knowledge of organization's IT Service Management processes.
Minimum education and work experience required for this position include:
- BS Computer Science or related technical discipline (or equivalent experience).
- At least 5 years of hands-on experience in Monitoring and Disaster Recovery execution.
- Competent in networking principles and OS operation and maintenance.
- Experienced in design/implementation for reliability, availability, scalability and performance.
- Development skills in at least two programming or scripting languages such as Java, Python, Perl, Shell or SQL, along with experience with containers and APIs (provide GitHub account details or code examples, if available).
- Experience with installing, configuring and maintaining monitoring software such as IPCenter (or equivalent), Dynatrace, AppDynamics, Splunk, SCOM, VMware vROps, AWS CloudWatch, Nagios and Azure Monitoring.
- Solid working knowledge of both Windows and Linux Operating Systems, file and directory structures, commands, command-line interfaces and utilities.
- Knowledge of IT best practices as they relate to the following areas: IT infrastructure monitoring, data networks, IT security, virtualization, web servers, cloud and storage technologies.
- Ability to use Excel for analysis and reporting (pivot tables, charts and tables) with macros/VBA, and tools like Tableau or Power BI.
- Proficiency in ITSM (ITIL v3 Foundation knowledge).
- Experience in cloud environments such as Azure, AWS, Google or private cloud would be a plus.
- Understanding of containerization and microservices, such as Docker and Kubernetes, would be a plus.
- Working knowledge of ServiceNow would be preferred.
- Strong communication, presentation, analytical and problem-solving skills required. Must be able to understand technical issues and their solutions, communicate them effectively to multiple stakeholder groups and influence the outcome.
- Strong customer focus with project management and follow-up skills.
- This is not a 9 am to 5 pm job. Candidate must be willing to work during non-standard business hours and weekends - on demand and onsite, if necessary.
Moody's is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, gender, age, religion, national origin, citizen status, marital status, physical or mental disability, military or veteran status, sexual orientation, gender identity, gender expression, genetic information, or any other characteristic protected by law. Moody's also provides reasonable accommodation to qualified individuals with disabilities in accordance with applicable laws. If you need to inquire about a reasonable accommodation, or need assistance with completing the application process, please email firstname.lastname@example.org. This contact information is for accommodation requests only and cannot be used to inquire about the status of applications.
For San Francisco positions, qualified applicants with criminal histories will be considered for employment consistent with the requirements of the San Francisco Fair Chance Ordinance. For New York City positions, qualified applicants with criminal histories will be considered for employment consistent with the requirements of the New York City Fair Chance Act. For all other applicants, qualified applicants with criminal histories will be considered for employment consistent with the requirements of applicable law.
Candidates for Moody's Corporation may be asked to disclose securities holdings pursuant to Moody's Policy for Securities Trading and the requirements of the position. Employment is contingent upon compliance with the Policy, including remediation of positions in those holdings as necessary.