Manoj DE
CAREER HIGHLIGHTS
Around 10 years of professional IT experience using Snowflake, Java, Python, and big data frameworks, with expertise in big data development on the Hadoop and Spark ecosystems. Experienced in developing data engineering pipelines and integrating machine learning workflows for distributed architectures in e-commerce and finance, specializing in search, recommendations, and personalization platforms.
Expertise in implementing data and machine learning pipelines using GCP Dataproc, AWS Step Functions, Google Kubernetes Engine (GKE), and Apache Spark on Azure Databricks.
Experienced with cloud-based data warehousing and analytics platforms, including AWS Glue, Snowflake, and Google Cloud Platform (GCP).
Skilled in orchestrating complex data workflows using Apache Airflow, ensuring efficient and reliable data
processing across multiple environments.
Developed and optimized machine learning models using AWS SageMaker, integrating with other AWS services for end-to-end ML solutions.
Deep understanding of Snowflake's unique features, including automatic scaling, data sharing, and the Time Travel and Fail-safe capabilities, and experience leveraging these features to meet specific business requirements.
Software development experience on cloud computing platforms including Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP).
Expertise in working with Azure cloud services such as Blob Storage, Data Lake Storage, Databricks, Synapse, Data Factory pipelines, Event Hubs, and HDInsight.
Extensive experience in big data processing and analytics using Apache Spark, focusing on large-scale data
transformations and real-time data processing.
Implemented a Retrieval-Augmented Generation (RAG) indexing pipeline using LangChain and ChatGPT, enhancing AI agent workflows for improved context retrieval and response generation.
Developed high-performance APIs using FastAPI to serve AI models and facilitate seamless integration with front-end applications (a minimal serving sketch appears at the end of this section).
Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, data munging, and machine learning.
Experience in the design, development, and implementation of big data applications using Hadoop ecosystem frameworks and tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Scala, PySpark, Storm, HBase, Kafka, Flume, NiFi, Impala, Oozie, ZooKeeper, and Airflow.
Demonstrated success in increasing user engagement by 30% and driving revenue growth by 20% through the
implementation of data-driven personalization and recommendation systems.
Developed high-performance APIs using Cloud Functions and Cloud Run, facilitating seamless integration of AI
models with front-end applications.
Implemented data quality checks and data lineage tracking using Google Cloud Data Catalog, enhancing data governance and facilitating easier discovery of datasets across the organization.
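Illustrative sketch (not production code): a minimal FastAPI prediction endpoint of the kind described above. The model artifact path, request/response schemas, and scikit-learn-style model are hypothetical placeholders.

```python
# Minimal sketch of a FastAPI endpoint serving a prediction model.
# The model path, request schema, and predict() interface are illustrative
# placeholders, not the actual production service.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib  # assumes a scikit-learn style model saved with joblib

app = FastAPI(title="model-serving-sketch")
model = joblib.load("model.joblib")  # hypothetical artifact path


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    score: float


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # scikit-learn models expect a 2-D array: one row per sample
    score = float(model.predict([req.features])[0])
    return PredictResponse(score=score)
```

If the file is saved as app.py, it can be run locally with "uvicorn app:app --reload".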
TECHNICAL SKILLS
GCP Components (Data & AI/ML): Dataproc, Compute Engine, Google Kubernetes Engine (GKE), BigQuery, Vision AI, Translation AI, Cloud TPUs, AI Platform Notebooks, Vertex AI, AutoML
AWS Components (Data & AI/ML): SageMaker, Bedrock, EKS, EC2, S3, IAM, RDS, Lambda, CloudWatch, CloudFormation, VPC, Secrets Manager, App Mesh, Config, Athena
Databases: Oracle RDBMS, MySQL, PostgreSQL, Cassandra
Big Data Tools: Snowflake, Apache Airflow, Apache Spark, Hadoop, Hive, HBase, AWS EMR, Azure Databricks
Programming Languages: Python, Java, Scala, Groovy, JavaScript, SQL, C++, Kotlin, Lua, HTML, CSS
Generative AI / Prompt Engineering: ChatGPT-4o, Llama 3.1, BERT, RAG techniques, LangChain, LangGraph, few-shot learning, chain-of-thought prompting, MongoDB Atlas, Neo4j, Apache Solr, Pinecone
CI/CD & Infrastructure: Nginx, GraphQL, Docker, GitHub Container Registry, Jenkins, Akamai, Kubernetes, Ansible
PROFESSIONAL EXPERIENCE
Responsibilities:
Architected and implemented robust data pipelines on Google Cloud Platform (GCP) and created interactive Tableau dashboards that let clients track key metrics from Salesforce databases, aiding decision-making.
Coordinated a team of 8 software engineers on a project to modify 4 existing database management systems,
ensuring seamless integration and improved performance.
Implemented multiple data pipelines to ingest data into GCP BigQuery from diverse sources, including SFTP, JDBC databases, Teradata, and REST APIs.
Developed Python scripts for parsing complex JSON and XML files, making the data accessible and usable for
analysis within Snowflake.
Utilized Hive SQL scripts to perform data transformations and pre-processing tasks required for in-depth analysis
within Snowflake.
Designed and implemented data processing tasks using PySpark, including data ingestion, merging, enrichment,
and loading into target data destinations in Snowflake.
Built data pipelines that load data from web servers using Kafka and the Spark Streaming API, ensuring a continuous flow of data for analysis within Snowflake.
Leveraged Spark for interactive queries, real-time data processing, and integration with popular NoSQL
databases, managing vast volumes of data efficiently within Snowflake.
Designed and developed ETL/ELT solutions using Dataflow and Dataproc, ensuring data integrity and consistency across systems, and leveraged BigQuery for efficient data warehousing and analytics.
Orchestrated complex data workflows using Cloud Composer (managed Apache Airflow), coordinating data processing across multiple technologies and automating data flows (see the illustrative DAG sketch at the end of this role).
Analyzed client requirements and collaborated closely with data scientists, data owners, and data lake teams to propose and implement effective solutions for new data sources.
Utilized advanced programming skills in SQL, PL/SQL, Python, and Linux bash scripting to optimize data
processes and solve complex data-related challenges.
Managed and administered application servers on GCP, optimizing system performance, implementing
monitoring with Cloud Monitoring, and ensuring IT system security through IAM and VPC Service Controls.
Developed and maintained comprehensive technical and project documentation, supporting the creation of
specifications, testing procedures, and error analysis to ensure system reliability and accuracy.
Implemented data quality checks and monitoring using Cloud Logging and Data Catalog, establishing governance policies to maintain data accuracy and reliability.
Leveraged GCP's machine learning capabilities, integrating ML models into data pipelines using Vertex AI and BigQuery ML to enable advanced analytics and predictive modeling.
Provisioned and managed GCP infrastructure resources using Terraform and Cloud Deployment Manager,
ensuring consistent and repeatable deployments.
Stayed current with the latest GCP technologies and best practices, driving innovation in data engineering
solutions and contributing to the modernization of data lakes and data warehouses.
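Illustrative sketch (not from the production codebase): a minimal Airflow 2.x DAG of the kind run on Cloud Composer above. The task names, callables, schedule, and targets are hypothetical placeholders standing in for the real ingest, transform, and load steps.

```python
# Sketch of an Airflow 2.x DAG coordinating an ingest -> transform -> load flow.
# The callables and dataset names are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_from_sftp(**context):
    # placeholder: pull files from SFTP into a GCS landing bucket
    ...


def transform_with_spark(**context):
    # placeholder: submit a PySpark job (e.g. via Dataproc) to clean/enrich data
    ...


def load_to_bigquery(**context):
    # placeholder: load curated output into a BigQuery table
    ...


with DAG(
    dag_id="ingest_transform_load_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_from_sftp)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load", python_callable=load_to_bigquery)

    ingest >> transform >> load
```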
Environment: Python, Data Lake, ETL, Groovy, Snowflake, Hadoop, Spark, Spark SQL, Spark Streaming, Scala, Hive, HBase, MySQL, HDFS, Shell Scripting, Databricks, Data Pipeline
Bed Bath & Beyond Inc. | Union, NJ May 2022 – Feb 2023
Senior Data Engineer
Responsibilities:
Architected, built, and maintained robust data pipelines using AWS services including AWS Glue, AWS Data
Pipeline, AWS Lambda, and AWS Step Functions, ensuring efficient data processing and transformation at scale.
Constructed scalable data warehouses and lakehouses on AWS, utilizing Amazon Redshift and Amazon S3 to store and manage large datasets efficiently, optimizing for performance and cost-effectiveness.
Developed and optimized ETL/ELT processes using AWS Glue and AWS Data Pipeline, ensuring data integrity and
consistency across systems while meeting strict SLAs.
Demonstrated extensive knowledge of Apache NiFi, configuring and utilizing various processors for data pre-processing so that incoming data is standardized and formatted as required before loading into Snowflake.
Implemented comprehensive data quality checks and monitoring using Amazon CloudWatch and AWS Glue Data Quality, establishing governance policies to maintain data accuracy and reliability.
Continuously optimized data pipelines for performance and cost-effectiveness, utilizing techniques such as query
optimization in Redshift, S3 partitioning, and caching strategies.
Implemented Spark using Scala and Spark SQL for expedited data testing and processing, enhancing the
performance and scalability of Snowflake-based analytics.
Worked on AWS S3 bucket integration for application and development projects.
Scheduled Airflow Directed Acyclic Graphs (DAGs) to orchestrate the execution of multiple Hive and Pig jobs,
effectively managing data processing workflows based on time and data availability in Snowflake.
Collaborated closely with data analysts, data scientists, and business stakeholders to translate data
requirements into technical solutions, leveraging AWS's analytics ecosystem.
Provisioned and managed AWS infrastructure resources using Infrastructure as Code tools such as AWS CloudFormation and Terraform, ensuring consistent and repeatable deployments.
Integrated machine learning models into data pipelines using Amazon SageMaker and AWS Glue, enabling advanced analytics and predictive modeling capabilities.
Created interactive dashboards and reports using Amazon QuickSight and Tableau, providing actionable insights from Redshift and S3 datasets to support data-driven decision making.
Processed large datasets using Apache Spark on Amazon EMR and AWS Glue, optimizing big data workloads for
efficiency and scalability.
Leveraged serverless computing services such as AWS Lambda to build scalable and cost-effective data pipelines, reducing operational overhead.
Automated data pipelines and infrastructure provisioning using AWS CodePipeline and AWS CodeBuild, significantly improving efficiency and reducing manual effort.
Designed and implemented data lakehouse architectures using AWS Lake Formation, combining the flexibility of data lakes with the performance of data warehouses.
Created and maintained a comprehensive data catalog using the AWS Glue Data Catalog, improving data discoverability and understanding across the organization.
Built real-time data processing pipelines using Amazon Kinesis and AWS Lambda, enabling immediate insights from streaming data sources (see the handler sketch at the end of this role).
Identified and implemented cost-saving strategies for AWS data solutions, including the use of spot instances,
reserved instances, and resource rightsizing, resulting in significant cost reductions.
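Illustrative sketch (not production code): a minimal AWS Lambda handler for a Kinesis trigger, as used in the real-time pipelines above. The event fields follow the standard Kinesis-to-Lambda event format, which delivers record payloads base64-encoded; the message schema and downstream write are hypothetical placeholders.

```python
# Sketch of an AWS Lambda handler consuming a Kinesis event.
# Kinesis delivers record payloads base64-encoded; the JSON schema and the
# downstream destination here are illustrative placeholders.
import base64
import json


def handler(event, context):
    processed = 0
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # placeholder transformation: keep only events that carry a user_id
        if "user_id" in message:
            # in the real pipeline this would write to S3 / Redshift / DynamoDB
            processed += 1
    return {"processed": processed}
```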
Environment: GCP, Hadoop, Snowflake, AWS, Databricks, ETL, Python, NiFi, HDFS, MapReduce, Hive, Spark.
Responsibilities:
Architected and implemented data engineering and ML pipelines leveraging Azure Data Lake Storage Gen2 for raw data storage, Azure Event Grid for event-driven processing, Azure Databricks for model training, and Azure Blob Storage for artifact management, significantly enhancing end-to-end machine learning workflow efficiency.
Developed sophisticated data processing workflows using Apache Airflow, optimizing data ingestion and feature engineering pipelines. Implemented high-performance data retrieval mechanisms in MongoDB and leveraged Apache Spark for large-scale mathematical operations crucial for machine learning model training.
Involved in the development of Python APIs to dump the array structures in the processor at the failure point for debugging. Used Chef to deploy and configure Elasticsearch, Logstash, and Kibana (ELK) for log analytics, full-text search, and application monitoring in integration with AWS Lambda and CloudWatch.
Built import and export data jobs to copy data to and from HDFS using Sqoop, and developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
Engineered a robust MLOps infrastructure using Ansible for configuration management, Docker for containerization, and Kubernetes for orchestration, ensuring scalable and reproducible machine learning model deployments across development, staging, and production environments.
Designed and implemented advanced machine learning models using Python and Scala, integrating Apache Spark for complex data transformations and distributed computing. Developed custom algorithms for feature selection, model interpretability, and automated hyperparameter tuning.
Led the development of a real-time machine learning system using Azure Event Hubs for data ingestion, Azure
Stream Analytics for real-time processing, and Azure Machine Learning for model serving, enabling low-latency
predictions for critical business applications.
Implemented a comprehensive model monitoring and retraining pipeline using Azure Monitor and Azure Functions, automatically detecting model drift and triggering retraining to maintain accuracy in production (see the drift-check sketch at the end of this role).
Developed a custom AutoML framework using Azure Machine Learning, automating model selection, feature engineering, and hyperparameter optimization, significantly reducing time-to-model for new machine learning projects.
Architected a scalable feature store using Azure Synapse Analytics, enabling efficient feature sharing across multiple machine learning projects and reducing redundancy in feature engineering efforts.
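Illustrative sketch (not production code): the core drift-check idea behind the monitoring and retraining pipeline above, expressed with a two-sample Kolmogorov-Smirnov test in plain Python. The p-value threshold, feature arrays, and retraining hook are hypothetical placeholders; the production pipeline used Azure Monitor and Azure Functions.

```python
# Sketch of a simple feature-drift check: compare recent feature values against
# a training baseline with a two-sample KS test and flag drift when the p-value
# falls below a threshold. The threshold and retraining hook are placeholders.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # illustrative cutoff


def detect_drift(baseline: np.ndarray, recent: np.ndarray) -> bool:
    """Return True when the recent distribution differs from the baseline."""
    statistic, p_value = ks_2samp(baseline, recent)
    return p_value < P_VALUE_THRESHOLD


def maybe_trigger_retraining(baseline: np.ndarray, recent: np.ndarray) -> None:
    if detect_drift(baseline, recent):
        # placeholder: in production this would kick off the retraining pipeline
        print("Drift detected: triggering retraining job")


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    baseline = rng.normal(0.0, 1.0, size=5_000)
    recent = rng.normal(0.5, 1.0, size=5_000)  # shifted distribution
    maybe_trigger_retraining(baseline, recent)
```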
Environment: GCP, Hadoop, Snowflake, AWS, Databricks, ETL, Python, NiFi, HDFS, MapReduce, Hive, Spark.
Responsibilities:
Architected and implemented scalable search personalization pipelines using Databricks and Apache Spark, processing large volumes of user interaction data to deliver tailored search results.
Developed and optimized ETL processes using Spark SQL and Delta Lake to ingest, clean, and transform diverse data sources, including clickstream data, user profiles, and product catalogs.
Implemented real-time feature engineering using Structured Streaming in Databricks, enabling dynamic updates to user preference models and improving search relevance.
Leveraged MLflow on Databricks to manage the lifecycle of machine learning models for search ranking, including experiment tracking, model versioning, and automated deployment, enhancing search results with personalized product suggestions (see the MLflow tracking sketch at the end of this role).
Utilized Databricks Delta Live Tables to create and maintain reliable data pipelines for continuous updates to search indexes and user embeddings.
Optimized query performance and resource utilization in Databricks by implementing partitioning strategies, caching, and Adaptive Query Execution techniques.
Collaborated with data scientists to integrate advanced NLP models using Spark NLP, improving search query
understanding and semantic matching capabilities.
Implemented an ML pipeline linking Jupyter notebooks to a diverse data lake (S3, EFS, RDS, DynamoDB, Redshift, EMR, Glue, and Lake Formation), ensured data availability checks, and stored artifacts/ML models in an accessible external file system for upstream processes.
Developed massive-scale applications using Apache Spark and Solr within GCP and Lucidworks Fusion.
Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDFs, Pig, Sqoop, and Spark.
Designed and defined data workflows using Oozie, specifying the sequence of actions and dependencies to achieve specific data processing and transformation tasks.
Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra (see the streaming sketch at the end of this role).
Optimized Impala queries for performance, considering factors such as query complexity, data distribution, and hardware resources.
Worked extensively with AWS cloud services such as EC2, S3, EBS, and RDS.
Used AWS services such as EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.
Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
Designed, implemented, and supported the enterprise big data platforms for search and recommendation networks.
Developed new and enhanced SBT and Maven projects using Scala, Java, Python, C++, and JavaScript.
Developed a Python framework integrated with AWS S3, SQS, RDS, and Snowflake for continuous extraction and loading of data from several sources, and hosted the application on Elastic Beanstalk with Auto Scaling.
Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
Imported data from AWS S3 into Spark RDDs to perform transformations and actions on those RDDs.
Worked on ETL migration services by developing and deploying AWS Lambda functions to create a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
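Illustrative sketch (not production code): a minimal Spark Structured Streaming job reading JSON events from Kafka, in the spirit of the near-real-time learner data model above. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, schema, and Parquet sink are hypothetical placeholders (the production job persisted to Cassandra).

```python
# Sketch of a Structured Streaming job that reads JSON events from Kafka,
# parses them, and writes micro-batches to a sink. Topic, schema, and the
# Parquet sink path are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("query", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "search-events")              # placeholder topic
    .load()
)

# Kafka values arrive as bytes: cast to string and parse the JSON payload
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/tmp/search_events")              # placeholder sink
    .option("checkpointLocation", "/tmp/checkpoints/search_events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```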
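Another illustrative sketch (not production code): MLflow experiment tracking for a ranking-style model, as described above. The experiment name, synthetic data, model choice, and metric are hypothetical placeholders.

```python
# Sketch of MLflow experiment tracking: log parameters, a validation metric,
# and the fitted model. Experiment name, data, and model choice are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("search-ranking-sketch")

# synthetic stand-in data for a click/no-click ranking signal
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

params = {"n_estimators": 200, "learning_rate": 0.05}

with mlflow.start_run():
    model = GradientBoostingClassifier(**params).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    mlflow.log_params(params)
    mlflow.log_metric("val_auc", auc)
    mlflow.sklearn.log_model(model, "model")
```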
Environment: Hadoop, EMR, HDFS, Hive, Impala, Sqoop, Oozie, Apache Kafka, Oracle, MySQL, UNIX, ETL, Spark and
Scala.
Responsibilities:
Assisted in developing semantics for convolutional neural networks to build an artificial system to detect patterns.
Designed, built, and maintained a database to analyze the life cycle of the training layer of this neural network.
Reduced white noise in the training layer by 25% with the help of advanced feature filters.
Environment: Python, SQL, Django, Pyquery, PostgreSQL, Eclipse, Git, Linux, Shell Scripting
United Online Software Development Pvt. Limited, India June 2014 to May 2015
Java Developer
Followed Agile methodology and attended daily and weekly Scrum meetings to report working status.
Used J2EE design patterns like Façade, Singleton, Strategy and Service Locator, etc.
Implemented Java 8 Streams, lambda expressions, predicates, functional interfaces, method references, filters, collections, and default methods.
Implemented thread safety using the Java 8 ExecutorService, the Lock API, synchronization, and multithreading.
Developed RESTful web services to interact with third-party vendors and payment exchanges.
Environment: Java 1.5, J2EE 1.5, JDBC, JAXB, XML, ANT, Apache Tomcat 5.0, Oracle 8i, JAX-RS, Jersey, JUnit, PL/SQL, UML, Eclipse
CERTIFICATIONS
1. Develop GenAI Apps with Gemini and Streamlit Skill Badge, by Google
https://www.credly.com/badges/4660620e-0284-4499-a045-291fc5147980/linked_in_profile
2. Accelerating End-to-End Data Science Workflows by NVIDIA
https://learn.nvidia.com/certificates?id=u4v4nKH9SvWzbC6oSgxzzw
3. Supervised Machine Learning: Regression and Classification, by DeepLearning.AI
https://www.coursera.org/account/accomplishments/records/8GG7FPBHXMXS
4. Machine Learning with Python, by freeCodeCamp.org
https://www.freecodecamp.org/certification/ManojBusam/machine-learning-with-python-v7
5. Oracle Certified Associate, Java SE 8 Programmer I, issued by Oracle Corp.
https://github.com/manojbusam/JavaSE/blob/main/oracle_cert.pdf
EDUCATION
1. Bachelor of Technology in Electronics & Communications Engineering (B.Tech in ECE)
Hyderabad, INDIA, May 2014