Hemanth K_9 yrs_Sr. Data Engineer
SUMMARY
With over 9 years of extensive experience in data engineering, I specialize in designing and implementing robust data
pipelines for efficient acquisition, transformation, and loading (ETL) using Python and cloud technologies,
particularly AWS and GCP. I have a proven track record of orchestrating seamless on-prem to cloud data migrations
while ensuring data integrity and security. Proficient in monitoring tools and techniques, I excel in proactively
identifying and resolving complex issues in data processing workflows, optimizing performance and reliability.
Skilled in a wide array of tools and technologies including the Hadoop ecosystem, virtualization, and NoSQL
databases, I leverage my problem-solving abilities to tackle intricate data challenges and drive actionable insights.
PROFESSIONAL EXPERIENCE
9+ years of extensive hands-on experience with Hadoop Ecosystem stack including HDFS, MapReduce, Sqoop,
Hive, Pig, HBase, Oozie, Scala, Flume, ETL, Tomcat, Zookeeper, AWS, SQL, Flink, Datorama and Spark.
Strong Hadoop and platform support experience with major Hadoop distributions such as Cloudera, Hortonworks, Amazon EMR, Google Cloud Dataproc, and Azure HDInsight.
Completed bachelor’s at Jawaharlal Nehru Technological University, Kakinada, India in Electronics and
Communication Engineering (2014).
Experience with different Hadoop distributions, including Cloudera and Hortonworks Data Platform (HDP).
Comfortable working with various facets of the Hadoop ecosystem for real-time and batch processing of structured and unstructured data.
Built and managed a scalable and secure data lake using AWS Lake Formation and Amazon S3, implementing
fine-grained access controls and data lifecycle policies to manage petabytes of data efficiently.
Developed and deployed serverless applications using AWS Lambda to automate data processing tasks, such as
real-time data validation and transformation, significantly reducing infrastructure management overhead and
improving scalability.
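For illustration only, a minimal sketch of a Lambda-style validation handler of the kind described above, assuming records arrive as newline-delimited JSON via an S3 put event; the bucket layout, required fields, and quarantine prefix are hypothetical:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Assumed schema for illustration; not the actual production contract.
    REQUIRED_FIELDS = {"record_id", "event_time", "amount"}

    def handler(event, context):
        """Validate newline-delimited JSON landing in S3 and quarantine bad rows."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            rows = [json.loads(line) for line in body.splitlines() if line.strip()]
            bad = [r for r in rows if not REQUIRED_FIELDS.issubset(r)]
            if bad:
                # Park invalid payloads under a hypothetical quarantine/ prefix for review.
                s3.put_object(Bucket=bucket, Key=f"quarantine/{key}",
                              Body=json.dumps(bad).encode("utf-8"))
        return {"status": "ok", "validated_objects": len(event.get("Records", []))}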
Involved in creating new Flink jobs in Python to analyze and transform large volumes of streaming data, deploying the jobs on OpenShift and Datorama for scalability and reliability.
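As an illustration of the "Flink jobs in Python" pattern, a minimal PyFlink sketch; the in-memory source and job name are placeholders rather than the production pipeline:

    from pyflink.datastream import StreamExecutionEnvironment

    def main():
        env = StreamExecutionEnvironment.get_execution_environment()
        # A small in-memory collection stands in for the real streaming source.
        events = env.from_collection(["click:home", "click:search", "view:home"])
        (events
            .map(lambda e: tuple(e.split(":", 1)))   # split "type:page" pairs
            .filter(lambda kv: kv[0] == "click")     # keep click events only
            .print())
        env.execute("sample-flink-transform")

    if __name__ == "__main__":
        main()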
Streamlined data migration processes by utilizing AWS DataSync for efficient transfer of large datasets between on-premises and AWS storage, and utilized AWS Database Migration Service (DMS) to seamlessly migrate and replicate databases with minimal downtime.
Implemented robust security measures by configuring IAM roles and policies to enforce least privilege access,
leveraging KMS for encrypting sensitive data, and utilizing Amazon CloudWatch for real-time monitoring and
alerting to ensure compliance and system integrity.
Utilized R programming to conduct advanced statistical analysis on large datasets, identifying key trends and
patterns.
Implemented a scalable ETL pipeline by leveraging AWS Glue for data extraction, transformation, and loading,
and AWS Lambda for real-time data validation and trigger-based processing, resulting in enhanced data
workflow automation and reduced processing time.
Utilized Data Mesh principles and Datorama, and established standardized APIs for seamless integration of data products across the organization.
Played a pivotal role in optimizing and automating the data pipeline by integrating Jenkins for continuous
integration and Autosys for job scheduling, enhancing workflow efficiency and reliability.
Enhanced system monitoring and security by implementing Amazon CloudWatch for real-time performance
metrics and alerts, and AWS CloudTrail for comprehensive auditing of AWS account activity, ensuring
compliance and operational visibility.
Formulated versioning policies and lifecycle rules, ensuring consistent labeling and incrementation of software
versions while optimizing resource allocation and enhancing user experience through effective release, support,
and retirement strategies.
Implemented monitoring and alerting systems to proactively identify and address data quality issues within the
Lakehouse architecture.
Used dbt to debug complex chains of queries by splitting them into multiple models and macros that can be tested separately.
Utilized Palantir Foundry and Docker for the runtime environment of the CI/CD system to build, test, and deploy.
Implemented and managed data storage solutions utilizing AWS S3 and Glacier, including setting up storage
buckets, configuring lifecycle policies for data retention and archival, and optimizing storage costs through
intelligent data tiering and retrieval strategies.
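A boto3 sketch of the kind of lifecycle/tiering rule described above; the bucket name, prefix, and retention periods are assumptions:

    import boto3

    s3 = boto3.client("s3")

    # Transition objects under a hypothetical raw/ prefix to Glacier after 30 days
    # and expire them after a year.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-lake-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-raw-data",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )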
Implemented mechanisms to capture and propagate metadata version information, enhancing transparency and
reproducibility in data transformations.
Designed and developed robust big data applications leveraging key components of the Cloudera suite, including components for real-time data streaming, HDFS for distributed storage, HBase for NoSQL data storage, Kudu for analytical data storage, Zookeeper for distributed coordination, Hive for data warehousing, and Impala for high-performance querying.
Experience with NoSQL databases like HBase as well as other ecosystem components like Zookeeper, Oozie, Impala, Spark Streaming/SQL, and Tomcat.
Experience in moving large amounts of log, streaming event, and transactional data using Flume.
Developed real-time data streaming solutions using Apache Flink to process and analyze data streams with low-
latency requirements.
Good experience with Hive concepts like Static/Dynamic Partitioning, Bucketing, Managed, and External
Tables, join operations on tables.
Experience working with Spark transformations and actions on RDDs, Spark SQL, and DataFrames in PySpark.
Very good development experience with Agile methodology.
Designed, developed, and maintained complex data pipelines using GCP services such as Dataflow and Apache
Beam to process and transform terabytes of data daily.
Experience in managing scalable datasets, tables, and views in BigQuery, optimizing storage and query
performance for large-scale data analytics.
Hands-on experience using various GCP components such as Dataflow with the Python SDK, Dataproc, BigQuery, Cloud Composer (Airflow), Google Workspace (formerly G Suite) service account impersonation, Cloud IAM, Cloud Pub/Sub, Cloud Functions for handling function-as-a-service requests, Cloud Data Fusion, Cloud Storage (GCS), and Cloud Data Catalog.
Automated tasks using scripts, Cloud Functions, and Cloud Scheduler, and maintained infrastructure as code with
Terraform.
Utilized Google Cloud Storage for the secure, scalable storage of unstructured and semi-structured data, ensuring
high availability and durability.
Automated routine data processing tasks and workflows using scripts and GCP automation tools such as Cloud
Functions and Cloud Scheduler, reducing manual effort and minimizing errors.
Developed real-time data streaming solutions using Cloud Pub/Sub, Dataflow, and Cloud Functions, enabling
immediate processing and analytics of incoming data streams.
Implemented security best practices, including data encryption at rest and in transit, IAM (Identity and Access
Management) policies, and VPC (Virtual Private Cloud) configurations to protect sensitive data.
Integrated data from various sources including Cloud Pub/Sub, Cloud SQL, and third-party APIs, ensuring
seamless data flow and synchronization across systems.
Led the design and implementation of scalable and secure data storage solutions using Azure Data Lake Storage.
Spearheaded the deployment of Azure Databricks for large-scale data processing and analytics.
Documented data engineering processes, created comprehensive guides, and generated performance reports using
Data Studio.
TECHNICAL SKILLS
Hadoop Ecosystem: MapReduce, HBase, Hive, Pig, Sqoop, Zookeeper, Oozie, Flume, Hue, Kafka, AWS EMR, Spark, Spark-SQL, Scala, Jenkins, Flink, HDFS, PySpark, YARN, Impala, Kinesis
Virtualization & Cloud Tools: Amazon EC2, Amazon VPC, Amazon EBS, Amazon S3, AWS Lambda, Amazon RDS, Amazon Route 53, Amazon CloudWatch, AWS IAM, AWS Direct Connect, Google Compute Engine, Google Cloud SQL, Google Cloud IAM
Programming Languages: C, C++, Java (core), UNIX Shell Scripting, Python, R, PySpark, Scala, SQL
Data Engineering Tools: Apache Airflow, Apache NiFi, Talend, Informatica, Apache Flink, Apache Beam, Dataflow, Cloud Data Fusion, Cloud Composer
Cloud Data Processing & Analytics: AWS Glue, Amazon Redshift, Amazon EMR, Amazon QuickSight, Amazon Athena, AWS Lake Formation, BigQuery, Dataflow, Pub/Sub, Dataproc, Vertex AI
Certifications:
WORK EXPERIENCE
Description:
USAA (United Services Automobile Association), a distinguished financial services institution catering to military
members and their families, stands out for its unwavering dedication to customer service and deep-rooted ties to the
military community. Throughout my tenure, I played a pivotal role in transforming data management processes,
driving actionable insights, and propelling business growth. From orchestrating seamless data migrations to modern
cloud environments like AWS Snowflake to implementing real-time analytics solutions leveraging PySpark streaming,
my contributions consistently optimized operations and enhanced decision-making capabilities. By championing
cutting-edge technologies and spearheading the development of tailored data integration systems, I empowered USAA
to remain agile, secure, and responsive to evolving business needs, ultimately reinforcing its position as a leader in the
financial services industry.
Responsibilities:
Worked with Hadoop Ecosystem components like HBase, Sqoop, ZooKeeper, Oozie, Hive and Pig with
Cloudera Hadoop distribution.
Leveraged AWS cloud services such as Amazon EC2 and RDS to create a scalable infrastructure capable of
handling a 50% increase in workloads during peak mortgage application periods, reducing application
processing times by 30%.
Successfully migrated 2 TB of data from IBM DataStage to the AWS Snowflake environment using DBT Cloud, improving data processing efficiency by 40% and reducing overall data migration time by 25%.
Implemented windowed aggregations and stateful processing in PySpark streaming to derive real-time
insights from streaming data.
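A condensed PySpark Structured Streaming sketch of the windowed-aggregation pattern above; the Kafka brokers, topic, window size, and watermark are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-aggregates").getOrCreate()

    # Hypothetical Kafka source; broker and topic names are placeholders.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "transactions")
           .load())

    events = raw.select(F.col("timestamp").alias("event_time"),
                        F.col("value").cast("string").alias("payload"))

    # 5-minute tumbling windows with a 10-minute watermark for late-arriving data.
    counts = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(F.window("event_time", "5 minutes"))
              .count())

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()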
Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
Utilized AWS Data Pipeline to schedule an Amazon EMR cluster for cleaning and processing over 500 GB of
web server logs stored in an Amazon S3 bucket, reducing data processing time by 35% and enhancing the
accuracy and timeliness of web analytics for better decision-making.
Designed and implemented a tailored data integration system for Ford Automotive, optimizing data flow across diverse systems using ETL processes with AWS Glue, Apache Spark, and Datorama, achieving a 25% reduction in data processing times.
Developed and implemented data orchestration workflows for ingesting, processing, and transforming data
within the Lakehouse architecture.
Utilized AWS Database Migration Service (DMS) for a near-zero downtime migration, ensuring a smooth
transition from PostgreSQL to Aurora with minimal impact on the application.
Demonstrated expertise in Bash scripting by creating automation scripts for data processing tasks,
streamlining workflows for tasks such as file manipulation, data transformation, and database interactions,
enhancing efficiency in data engineering processes.
Designed and implemented tailored indexing strategies in Elasticsearch, optimizing mappings, custom
analyzers, and index settings to balance indexing speed and query performance, showcasing expertise in data
optimization for specific use cases.
Led the adoption of Amazon EKS for container orchestration, streamlining the deployment and management
of containerized data services, reducing deployment times by 40%, and optimizing resource allocation to
achieve a 20% cost reduction.
Developed ETL to handle data transformations, missing value imputations, and outlier detection, ensuring
data quality and consistency across the organization.
Led the development of a RESTful data API for a complex data analytics platform using Spring Boot and
MongoDB.
Wrote Python and shell scripts for various deployment and automation processes, and wrote MapReduce programs in Python with the Hadoop Streaming API.
Involved in cluster coordination services through Zookeeper and in adding new nodes to an existing cluster.
Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
Integrated the mortgage processing system with external credit reporting agencies and property valuation
services.
Conducted in-depth performance analysis using monitoring tools and performance profiling on Red Hat Enterprise Linux to identify and analyze bottlenecks in the data warehouse environment.
Implemented efficient ETL processes using AWS Glue, automating data transformations and reducing processing times by 30%, and leveraged EMR for large-scale data processing, optimizing performance and resource utilization.
Implemented end-to-end data ingestion and processing pipeline on AWS, orchestrating data import from
various sources to Amazon S3, then automated loading into Amazon Redshift using Python Lambda functions,
optimizing data storage and analytics workflows.
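A simplified sketch of a Lambda function loading new S3 objects into Redshift through the Redshift Data API; the cluster, database, table, and IAM role identifiers are hypothetical:

    import boto3

    redshift_data = boto3.client("redshift-data")

    def handler(event, context):
        """Issue a COPY for each new S3 object notification (all names are placeholders)."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            copy_sql = (
                f"COPY analytics.events FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
                "FORMAT AS PARQUET"
            )
            redshift_data.execute_statement(
                ClusterIdentifier="example-cluster",
                Database="analytics",
                DbUser="etl_user",
                Sql=copy_sql,
            )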
Proficient in Unified Data Analytics with Databricks, utilizing Databricks Workspace, managing notebooks, and employing Delta Lake with Python and Spark SQL.
Designed and implemented real-time data event notification systems using AWS SNS, ensuring immediate and reliable distribution of data alerts and messages to multiple endpoints such as email, SMS, and Lambda.
Experience with Palantir Foundry and data warehouses (Azure SQL and Amazon Redshift/RDS).
Integrated Flink with data warehousing solutions like Apache Hadoop, HDFS, or cloud-based platforms to
facilitate data storage, querying, and analytics.
Exported the analyzed data to the databases such as Teradata, MySQL and Oracle using Sqoop for
visualization and to generate reports for the BI team.
Developed ETL workflow which pushes web server logs to an Amazon S3 bucket.
Collected and aggregated large amounts of log data using Apache Flume and staged data in HDFS for further analysis.
Implemented and optimized Spark-based ETL workflows, leveraging Azure Databricks clusters for parallel
processing and optimizing job performance.
Created ETL workflows between data warehouses such as Snowflake and Redshift via Alteryx.
Implemented a high-performance data storage and processing solution using Apache Parquet format,
optimizing for columnar storage and compression.
Developed real-time data streaming solutions using Apache Flink to process and analyze data streams with
low-latency requirements.
Provisioned the highly available EC2 Instances using Terraform and Ansible Playbooks.
Worked on SnowSQL and Snowpipe and converted Talend Joblets to support Snowflake functionality.
Configured the OpenShift environment to dynamically scale the Flink cluster based on workload and ensured high availability of the jobs.
Enabled real-time data querying and analysis by setting up AWS Athena, empowering business analysts to
derive insights directly from the data lake.
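An illustrative boto3 call of the Athena setup mentioned above; the database, table, and results location are assumptions:

    import boto3

    athena = boto3.client("athena")

    # Submit an ad hoc query against the data lake (names are placeholders).
    response = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS n FROM applications GROUP BY status",
        QueryExecutionContext={"Database": "data_lake"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    print("Query execution id:", response["QueryExecutionId"])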
Spearheaded the development of a scalable, cost-effective data warehouse system on AWS, achieving up to 40% cost savings and ensuring fault tolerance with automated snapshots and multi-AZ deployment; queried multiple databases such as Snowflake, Netezza, UDB, and MySQL for data processing.
Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL, and was involved in end-to-end implementation of ETL logic.
Environment: HDFS, MapReduce, Cloudera, HBase, Athena, EKS, Hive, DBT Cloud, Apache Parquet, Pig, YARN, MSK, Elasticsearch, OpenShift, Kibana, Sqoop, Spring Boot, Lakehouse, Spark, Spark-SQL, Bash scripting, MongoDB, Scala, Flume, Azure Container, RHEL, Redshift, Oozie, Zookeeper, Datorama, AWS, Maven, Linux, Bitbucket, UNIX shell scripting, Apache Flink, Python, Data Lake, ad hoc queries, AWS Glue, AWS DMS, AWS SCT, Teradata, Oracle Fusion, Tableau, SnowSQL, Snowpipe, PySpark, and AWS SNS.
Description:
First Citizens Bank leverages innovative technologies to ensure security, efficiency, and customer satisfaction,
spanning 500+ branches across 30 states since its founding in 1898. During my tenure, the bank transformed its data
infrastructure by implementing advanced distributed data processing pipelines, seamless on-premises and cloud data
migrations, and real-time data integration using Apache Flink and Kubernetes. By designing and optimizing ETL
pipelines, enhancing data security and integrity, and leveraging monitoring tools to proactively address system issues,
I contributed to the bank's robust data analytics capabilities and improved operational efficiency. My work ensured
scalable, reliable data solutions, facilitating insightful financial services across the organization.
Responsibilities:
Designed and implemented distributed data processing pipelines using Dataflow, BigQuery, Python, Cloud
Composer, Pub/Sub, and Data Fusion to ingest customer behavioral data and financial histories into Cloud Storage
for analysis, enhancing risk assessment and customer insights.
Leveraged Data Fusion as an ETL tool to perform transformations, event joins, and pre-aggregations of financial
transaction data before storing the data onto BigQuery, facilitating complex financial analytics.
Implemented monitoring and alerting mechanisms in dbt Cloud (Data Build Tool) to proactively identify and
address issues in data transformation processes, ensuring the accuracy of financial data reporting.
Created declarative pipelines using Cloud Build, incorporating stages for data ingestion, transformation, and
loading (ETL) processes, to streamline financial data processing workflows.
Integrated Apache Flink with data warehouses such as Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse to load and analyze real-time financial data alongside batch data, improving the ability to detect fraudulent transactions.
Collaborated with the operations team to deploy and monitor the Bigtable cluster, making fine-tuned adjustments
based on real-world usage patterns to support large-scale financial data storage.
Installed and configured a Hadoop cluster of Hortonworks Data Platform using Ambari Server and maintained
it.
Demonstrated a good understanding of Dataflow architecture, including Stream and Batch processing, to process
financial transactions in real-time.
Set up GCP environments with Dataflow, managed Dataflow templates for business analytics, and managed workflows in Cloud Composer.
Installed and configured a Hadoop cluster on Google Cloud Dataproc using initialization actions and maintained
it, supporting large-scale financial data processing.
Integrated Google Kubernetes Engine (GKE) with Cloud Composer, leveraging KubernetesPodOperator for
running containerized tasks within the Kubernetes cluster, enhancing the automation of financial data pipelines.
Implemented comprehensive monitoring and alerting using Google Cloud Operations Suite (formerly
Stackdriver) and Flink's metrics to ensure system reliability, crucial for maintaining financial data integrity.
Leveraged Docker to containerize data processing applications, ensuring consistency across development, testing,
and production environments, thereby enhancing the reproducibility of financial data workflows.
Developed robust ETL pipelines integrating Elasticsearch, utilizing tools like Logstash and custom scripts to
efficiently extract, transform, and load financial data into Elasticsearch indices, demonstrating proficiency in
managing data flow within the broader financial data ecosystem.
Designed an end-to-end architecture leveraging Cloud Dataflow, Pub/Sub, and BigQuery to enable seamless
streaming of financial data from source systems, improving the speed and accuracy of financial data analysis.
Developed multiple POCs using Dataflow and deployed them on GCP, comparing the performance of Dataflow with BigQuery and SQL. Involved in the end-to-end implementation of ETL logic, including financial transaction processing.
Designed and implemented end-to-end ETL pipelines using Dataflow, processing petabytes of financial data from
various sources, including Cloud Storage, Pub/Sub, and relational databases.
Developed Dataflow scripts to move data between Pub/Sub and BigQuery, and implemented workload management in BigQuery to prioritize basic financial dashboard queries over more complex, longer-running ad hoc queries.
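A condensed Apache Beam (Python SDK) sketch of a Pub/Sub-to-BigQuery streaming pipeline of the kind described in this section; the project, subscription, and table identifiers are placeholders, and the target table is assumed to already exist:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (p
             | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                   subscription="projects/example-project/subscriptions/txn-sub")
             | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
             | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                   "example-project:finance.transactions",
                   create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

    if __name__ == "__main__":
        run()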
Used Data Fusion for data cleansing and developed data pipelines to extract data from various sources and load it into BigQuery, improving the quality of financial data analytics.
Designed and implemented custom Extract, Transform, Load (ETL) processes using the R language to preprocess
and clean diverse financial data sources.
Developed a workflow in Cloud Composer to automate the tasks of loading data into BigQuery and pre-
processing with Dataflow and BigQuery, enhancing the automation of financial data processing.
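A trimmed-down Airflow DAG sketch of the Composer workflow described above; the bucket, dataset, and table names are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="load_financial_data",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Load the daily extract from GCS into a staging table (names are placeholders).
        load_to_bq = GCSToBigQueryOperator(
            task_id="gcs_to_bigquery",
            bucket="example-finance-bucket",
            source_objects=["daily/{{ ds }}/*.parquet"],
            destination_project_dataset_table="example-project.finance.staging_transactions",
            source_format="PARQUET",
            write_disposition="WRITE_TRUNCATE",
        )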
Implemented monitoring and alerting systems to track data lake health, performance, and data availability, taking
measures to address issues and ensure the reliability of financial data.
Worked with various data formats, including Avro, Parquet, JSON, and CSV, to support diverse financial data
types.
Environment: GCP, Dataflow, BigQuery, Cloud Composer, Pub/Sub, Data Fusion, Cloud Storage, DBT Cloud, Cloud Build, OpenShift, Docker, Google Kubernetes Engine (GKE), ETL, Bigtable, Cloud Dataproc, Cloud Operations Suite (formerly Stackdriver), Apache Flink, R, PySpark, Elasticsearch, Avro, Parquet, JSON, CSV, Cloud SQL, Cloud Spanner, Google Cloud IAM, Cloud Functions, Vertex AI, Google Cloud VPC, Cloud Scheduler, Google Cloud Monitoring, Google Cloud Logging, and Google Cloud Run.
Data Engineer
Optum, Sacramento, CA. November 2017 – September 2020
Description:
Optum is a healthcare services and technology company that offers a wide range of solutions to improve healthcare
delivery, management, and outcomes. It is a subsidiary of UnitedHealth Group, one of the largest healthcare
organizations in the world. The project's motto is to utilize the power of big data and cutting-edge technologies to
efficiently manage, process, and analyze vast datasets, enabling data-driven insights and decision-making while
ensuring data quality and system reliability.
Responsibilities:
Experience with Apache big data components like HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari, and NiFi.
Led the design and implementation of a comprehensive data integration framework for a life sciences
organization, harmonizing diverse datasets from genomics, proteomics, and clinical sources.
Implemented data streaming with AWS Kinesis, setting up Kinesis Data Streams to ingest and process real-time data streams and accommodate high-throughput scenarios.
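A minimal boto3 producer sketch for the Kinesis ingestion described above; the stream name and record shape are assumptions:

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def publish_event(event: dict) -> None:
        """Push one JSON event onto a hypothetical Kinesis stream."""
        kinesis.put_record(
            StreamName="example-claims-stream",
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event.get("member_id", "unknown")),
        )

    publish_event({"member_id": 42, "claim_amount": 120.50})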
Designed and implemented a robust data ingestion and processing pipeline for real-time streaming data using
Spring Boot and Apache Kafka.
Implemented and optimized NoSQL data solutions using Azure Cosmos DB, ensuring low-latency, globally
distributed access to data.
Optimized bioinformatics workflows by identifying bottlenecks and implementing performance
improvements and implemented robust security protocols to safeguard sensitive genomic and clinical data,
ensuring compliance with data protection regulations (e.g., HIPAA, GDPR).
Applied hands-on experience in designing data architectures and models using tools such as Erwin and
Lucidchart.
Enhanced performance by employing Cassandra's batch loading capabilities and optimizing the ETL workflows to minimize data movement and processing overhead.
Designed and implemented data warehousing solutions using Azure Synapse Analytics, ensuring high-
performance analytics, and reporting.
Implemented Apache Airflow for end-to-end data pipeline automation, enhancing troubleshooting and
resilience with detailed logs and notifications.
Addressed data quality concerns by incorporating Airflow sensors that wait for specific conditions before proceeding with downstream tasks.
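An illustrative Airflow snippet of that sensor-gated pattern, using a core PythonSensor so the readiness check stays generic; the DAG id, intervals, and check logic are assumptions:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor

    def source_data_ready(**context) -> bool:
        """Hypothetical readiness check; in practice this would query the landing zone."""
        return True

    def run_transform(**context):
        print("downstream transformation runs only after the sensor succeeds")

    with DAG(
        dag_id="quality_gated_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        wait_for_data = PythonSensor(
            task_id="wait_for_source_data",
            python_callable=source_data_ready,
            poke_interval=300,      # re-check every 5 minutes
            timeout=6 * 60 * 60,    # give up after 6 hours
        )
        transform = PythonOperator(task_id="transform", python_callable=run_transform)
        wait_for_data >> transform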
Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume analytics; also reviewed use cases before onboarding them to HDFS.
Responsible for loading, managing, and reviewing terabytes of log files using the Ambari web UI.
Involved in writing rack topology scripts and Java MapReduce programs to parse raw data.
Migrated from JMS Solace to Apache Tomcat and used Zookeeper to manage synchronization, serialization, and coordination across the cluster.
Implemented and optimized Nextflow scripts within NF-Core pipelines for bioinformatics data processing,
ensuring reproducibility and scalability. Collaborated with domain experts to enhance pipeline efficiency and
reliability in the genomics data analysis domain.
Integrated SonarQube into CI/CD pipelines, automating code quality checks as part of the build and
deployment process to ensure continuous code improvement.
Managed Spark clusters, optimized ETL pipelines, and achieved significant performance improvements,
while also integrating Spark with data warehousing solutions and embracing real-time data processing.
Utilized AWS S3 as a central data lake, storing raw and transformed data in Amazon S3 buckets for scalable and cost-effective storage.
Implemented and maintained federated data architectures to ensure scalability and efficiency using Data
Mesh.
Spearheaded the design and implementation of comprehensive data security and governance solutions within
the Cloudera ecosystem, utilizing key components such as Apache Atlas for metadata management, Apache
Ranger for access control policies, Ranger KMS for encryption key management, and Key Trustee Server
(KTS) for secure key storage.
Developed and maintained Perl scripts for data processing and transformation, handling large datasets
efficiently and ensuring data quality and consistency.
Installed, configured, and maintained a Hadoop cluster based on the business requirements.
Used Sqoop to migrate data between traditional RDBMS and HDFS. Ingested data, from MS SQL, Teradata,
and Cassandra databases.
Advanced knowledge of Amazon Redshift and MPP database concepts.
Defined and deployed monitoring, metrics, and logging systems on AWS.
Proficiency in Perl, Shell, and other scripting languages with a focus on automation, system administration,
and data manipulation, resulting in increased efficiency and productivity.
Used Nifi to automate the data flow between disparate systems. Designed dataflow models and complicated
target tables to obtain relevant metrics from various sources.
Designed and optimized data workflows to leverage SQS to decouple data producers from consumers,
ensuring high throughput and efficient resource utilization in ETL processes.
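A compact boto3 sketch of the SQS-based decoupling described above; the queue URL and message shape are assumptions:

    import json

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-etl-queue"

    # Producer side: enqueue a unit of work.
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"s3_key": "raw/2023/01/01/data.csv"}))

    # Consumer side: pull work, process it, then delete the message.
    response = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
    for msg in response.get("Messages", []):
        payload = json.loads(msg["Body"])
        print("processing", payload["s3_key"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])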
Developed Bash scripts to get log files from FTP server and executed Hive jobs to parse them.
Performed data analysis using HiveQL, Pig Latin and custom MapReduce programs in Java.
Enhanced scripts of existing Python modules and wrote APIs to load the processed data into HBase tables.
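A small sketch of the kind of HBase write API referenced above, using the happybase client as an assumed choice; the Thrift host, table, and column family are placeholders:

    import happybase

    # Connect through an HBase Thrift gateway (host name is hypothetical).
    connection = happybase.Connection("hbase-thrift.example.internal")
    table = connection.table("processed_events")

    def save_event(row_key: str, payload: dict) -> None:
        """Write one processed record; the 'cf' column family is assumed."""
        table.put(
            row_key.encode("utf-8"),
            {f"cf:{k}".encode("utf-8"): str(v).encode("utf-8") for k, v in payload.items()},
        )

    save_event("evt#0001", {"status": "processed", "score": 0.87})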
Implemented change data capture (CDC) mechanisms to capture and propagate incremental changes in source data to Cassandra, reducing the need for full data reloads and minimizing downtime during updates.
Migrated ETL jobs to Pig scripts to apply joins, aggregations, and transformations.
Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop
dashboards, workbooks, and complex aggregate calculations.
Used Jenkins for CI/CD and SVN for version control.
Environment: Hortonworks 2.0, Hadoop, ETL, AWS Kinesis, Perl, Kudu, Impala, Hive v1.0.0, HBase, Sqoop v1.4.4, Pig v0.12.0, Zookeeper, Tomcat v0.8.1, NiFi, Spring Boot, Azure Synapse, Golang, Ambari, Data Lake, Nextflow, Python, SQL, SonarQube, Redshift, Java, Teradata, MS SQL, Cassandra, Power BI, Airflow, Atlas, Ranger, Ranger KMS, KTS, Jenkins, SVN, Jira, and AWS SQS.
Responsibilities:
Installed and configured Hadoop clusters in fully distributed and pseudo-distributed modes, utilizing Scala and SQL for data extraction from SQL Server and MySQL.
Configured Hadoop stack on EC2 servers, facilitating seamless data transfer between S3 and EC2 instances.
Managed Compute on Azure Cloud, focusing on developing machine learning libraries for data analysis and
visualization.
Developed MapReduce jobs in Java and Python for efficient data preprocessing and Kafka applications for
monitoring consumer lag.
Leveraged Spark for processing unstructured data, configuring Spark Streaming to handle real-time data from
Kafka and store it in HDFS.
Explored AWS Cloud services such as EC2, S3, EBS, RDS, and VPC for real-time data streaming solutions.
Developed applications utilizing Hadoop Ecosystem components and maintained SQL code for SQL Server
databases.
Installed, configured, and maintained Hadoop clusters using Apache and Cloudera (CDH4) distributions and AWS, working with diverse data types in BigBench.
Environment: Hadoop, HDFS, MapReduce, Hive, Sqoop, GitHub, Kafka, Scala, Pig, NiFi, Hortonworks, Cloudera, HBase, Spark, PySpark, Oozie, Cassandra, Python, Shell Scripting, AWS EMR, EC2.