Big data notes

These notes cover MongoDB, a NoSQL document database that offers schema flexibility, scalability, and developer agility, allowing for dynamic data structures and efficient handling of unstructured data. MongoDB supports operations such as querying, updating, inserting, and deleting documents, as well as indexing for improved query performance, and its aggregation framework enables complex data processing through a pipeline of stages. The second half of the notes covers Apache Hive, a data warehousing system that provides a SQL-like interface (HiveQL) for querying data stored in Hadoop.


What is MongoDB?

MongoDB is a popular NoSQL (Not Only SQL) document database designed for scalability and
developer agility. Unlike traditional relational databases that store data in tables with fixed
schemas, MongoDB stores data in flexible, JSON-like documents. This fundamental difference
underpins its key advantages.
Core Advantages Over Traditional Relational Databases (RDBMS):
1.​ Schema Flexibility: This is a major differentiator. In RDBMS, you define a rigid schema
(tables, columns, data types) before inserting data. Altering this schema in a large
database can be complex and time-consuming. MongoDB, on the other hand, allows
documents within the same collection (analogous to a table) to have different structures.
You can add new fields, remove existing ones, or change data types without disrupting
the entire database or requiring schema migrations.
2.​ Scalability: MongoDB is designed for horizontal scaling. This means you can distribute
your database across multiple servers (sharding) to handle large volumes of data and
high traffic. RDBMS often face challenges in scaling horizontally and typically rely on
vertical scaling (upgrading to more powerful hardware), which can be expensive and has
limitations.
3.​ Developer Agility: The flexible schema aligns well with agile development
methodologies. Developers can evolve the data model as application requirements
change without the overhead of schema migrations. This speeds up the development
process and reduces friction between developers and database administrators.
4.​ Performance for Certain Use Cases: For applications dealing with unstructured or
semi-structured data, or those requiring rapid iteration, MongoDB can offer better
performance. The document structure allows you to embed related data within a single
document, reducing the need for complex joins that can be performance bottlenecks in
RDBMS.
5.​ High Availability: MongoDB offers built-in features for high availability through replica
sets. These are clusters of MongoDB servers that provide redundancy and automatic
failover, ensuring that your application remains available even if one server fails.

Schema Flexibility in MongoDB:

Schema flexibility in MongoDB means that documents within a collection do not need to adhere
to a predefined structure. Each document can have its own unique set of fields and data types.
Impact on Application Development:
●​ Faster Iteration: Developers can adapt the data model quickly as application features
evolve without the need for lengthy schema alteration processes. This accelerates the
development cycle.
●​ Easier Handling of Diverse Data: Applications often deal with entities that have varying
attributes. MongoDB's flexibility makes it easier to store and manage such diverse data
without forcing it into a rigid tabular structure.
●​ Reduced Impedance Mismatch: The document structure closely mirrors how objects are
often represented in programming languages (like JSON objects). This reduces the
"impedance mismatch" between the application code and the database, simplifying data
mapping and manipulation.
Impact on Data Evolution:
●​ Graceful Schema Evolution: As your application grows and data requirements change,
you can add new fields or modify existing ones in your documents without affecting
existing data. You can handle different versions of your data within the same collection.
●​ Simplified Data Integration: When integrating data from different sources with varying
structures, MongoDB's flexibility makes it easier to ingest and work with this
heterogeneous data.
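
As a small illustration of this flexibility, the following shell snippet stores two differently shaped documents in one collection (the products collection and its fields are hypothetical, used only for this example):

db.products.insertMany([
  { name: "Laptop", price: 999.99, specs: { cpu: "i7", ramGB: 16 } },
  { name: "Gift Card", price: 25.00, expiryDate: ISODate("2026-01-01") }
])
// Both documents live in the same collection even though their fields differ;
// no schema alteration or migration step is needed.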

Document Structure in MongoDB:

MongoDB stores data in BSON (Binary JSON) documents. A document is a set of key-value pairs, similar to a JSON object.
●​ Key: A string that identifies a field within the document.
●​ Value: Can be various data types, including other nested documents and arrays.
How it Facilitates the Representation of Complex, Hierarchical Data:
The ability to embed documents and arrays within other documents allows you to represent
complex, hierarchical relationships in a natural and efficient way. For example, you can embed
an array of address objects within a customer document or embed product details within an
order document. This reduces the need for joins to retrieve related data, which can improve
query performance for certain use cases.
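
For instance, a customer document embedding an array of address sub-documents might look like the sketch below (collection and field names are illustrative, not from the notes above):

db.customers.insertOne({
  name: "Alice",
  addresses: [                       // embedded array of sub-documents
    { type: "home", city: "New York", zip: "10001" },
    { type: "work", city: "Brooklyn", zip: "11201" }
  ]
})
// A single find() on customers returns the addresses as well, so no join is needed.
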
Comparison of RDBMS and MongoDB Concepts:
RDBMS     MongoDB              Description
Table     Collection           A grouping of related data.
Row       Document             A single record containing a set of fields (key-value pairs).
Column    Field                A specific attribute or piece of data within a record (document).
                               The "schema" for these fields is flexible within a collection.
Schema    Dynamic Schema       The structure and data types of the data. In MongoDB, this is
                               flexible at the document level.
Join      Embedding/Linking    Mechanisms for relating data. MongoDB favors embedding related
                               data but also supports referencing (linking) between documents.

Five Key Data Types Supported by MongoDB:


1.​ String: Standard UTF-8 strings. Example: "name": "John Doe"
2.​ Integer: Various integer types (32-bit signed, 64-bit signed). Example: "age": 30
3.​ Double: 64-bit floating-point numbers. Example: "price": 99.99
4.​ Boolean: Represents true or false values. Example: "isActive": true
5.​ Array: An ordered list of values. The values in an array can be of any data type. Example:
"hobbies": ["reading", "hiking", "coding"]
Querying Documents with Basic Selection Criteria

In MongoDB, you use the find() method on a collection to query for documents. You can provide
a query document as the first argument to find() to specify selection criteria. This query
document uses operators to define conditions on the fields of the documents.
Example:
Suppose you have a users collection with documents like this:
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a01"), "name": "Alice", "age": 30, "city": "New York" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a02"), "name": "Bob", "age": 25, "city": "London" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a03"), "name": "Charlie", "age": 30, "city": "Paris" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a04"), "name": "David", "age": 22, "city": "New York" }

To find all users who live in "New York", you would use the following query:
db.users.find({ city: "New York" })

This would return the documents for Alice and David.


You can use various operators within the query document:
* Equality: { field: value } (e.g., { age: 30 })
* Greater Than ($gt): { field: { $gt: value } } (e.g., { age: { $gt: 25 } })
* Greater Than or Equal To ($gte): { field: { $gte: value } } (e.g., { age: { $gte: 30 } })
* Less Than ($lt): { field: { $lt: value } } (e.g., { age: { $lt: 25 } })
* Less Than or Equal To ($lte): { field: { $lte: value } } (e.g., { age: { $lte: 22 } })
* Not Equal To ($ne): { field: { $ne: value } } (e.g., { city: { $ne: "London" } })
* In ($in): { field: { $in: [value1, value2, ...] } } (e.g., { city: { $in: ["New York", "Paris"] } })
* Not In ($nin): { field: { $nin: [value1, value2, ...] } } (e.g., { city: { $nin: ["London", "Berlin"] } })
* Logical AND ($and): Implicitly used when multiple conditions are in the same query
document. { age: 30, city: "New York" } (finds users who are 30 AND live in New York) or
explicitly: { $and: [ { age: 30 }, { city: "New York" } ] }
* Logical OR ($or): { $or: [ { age: 25 }, { city: "Paris" } ] } (finds users who are 25 OR live in
Paris)
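
Combining several of these operators against the sample users collection shown earlier, a query for users who are at least 25 and live in either New York or Paris could look like this (a sketch; the two conditions are joined by the implicit AND):

db.users.find({
  age: { $gte: 25 },                       // implicit AND between conditions
  city: { $in: ["New York", "Paris"] }
})
// Matches Alice (30, New York) and Charlie (30, Paris) in the sample data.
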
Projection in MongoDB Queries
Projection in MongoDB allows you to specify which fields to include or exclude in the documents
returned by a query. This is useful for limiting the amount of data transferred and focusing on the
fields your application actually needs. The second argument to the find() method is the
projection document.
* Include Fields: To include specific fields, set their value to 1 in the projection document. The
_id field is included by default; to exclude it, you must explicitly set _id: 0.
* Exclude Fields: To exclude specific fields (other than _id), set their value to 0. You cannot mix
inclusion and exclusion specifications in the same projection document (except for excluding _id
when including other fields).
Example:
To retrieve only the name and city of all users:
db.users.find({}, { name: 1, city: 1, _id: 0 })

This would return documents like:


{ "name": "Alice", "city": "New York" }
{ "name": "Bob", "city": "London" }
{ "name": "Charlie", "city": "Paris" }
{ "name": "David", "city": "New York" }

To retrieve all fields except the age:


db.users.find({}, { age: 0 })

This would return documents like:


{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a01"), "name": "Alice", "city": "New York" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a02"), "name": "Bob", "city": "London" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a03"), "name": "Charlie", "city": "Paris" }
{ "_id": ObjectId("64489a7e2c1f7b3b0e0b1a04"), "name": "David", "city": "New York" }

Update Operations in MongoDB


MongoDB provides several methods to update documents in a collection:
* updateOne(): Updates a single document that matches the filter.
* updateMany(): Updates all documents that match the filter.
* replaceOne(): Replaces a single document that matches the filter with a new document.
These methods take at least two arguments:
* Filter Document: Specifies the criteria to identify the document(s) to update.
* Update Document: Defines the modifications to be made. Update documents use update
operators to specify the changes.
Common Update Operators:
* $set: Updates the value of a field or creates the field if it doesn't exist.
* $unset: Removes a specified field.
* $inc: Increments or decrements the value of a field by a specified amount.
* $mul: Multiplies the value of a field by a specified amount.
* $rename: Renames a field.
* $min: Updates the value of a field to a specified value if the specified value is less than the
current value.
* $max: Updates the value of a field to a specified value if the specified value is greater than
the current value.
* $push: Appends a value to an array.
* $addToSet: Adds a value to an array only if the value is not already present.
* $pull: Removes all instances of a value from an array.
* $pullAll: Removes multiple values from an array.
Example:
Suppose we want to update Bob's age to 26 in the users collection:
db.users.updateOne(
{ name: "Bob" },
{ $set: { age: 26 } }
)

To increment the age of all users in "New York" by 1:


db.users.updateMany(
{ city: "New York" },
{ $inc: { age: 1 } }
)
To rename the city field to location for Charlie:
db.users.updateOne(
{ name: "Charlie" },
{ $rename: { city: "location" } }
)
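
The array operators follow the same pattern. Assuming the user documents also carry a hobbies array (not part of the sample documents above, so purely illustrative), $addToSet and $pull could be used like this:

// Add "hiking" only if it is not already in Alice's hobbies array
db.users.updateOne(
  { name: "Alice" },
  { $addToSet: { hobbies: "hiking" } }
)

// Remove every occurrence of "coding" from Alice's hobbies array
db.users.updateOne(
  { name: "Alice" },
  { $pull: { hobbies: "coding" } }
)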

Inserting Documents into a MongoDB Collection

You can insert documents into a MongoDB collection using the following methods:
* insertOne(): Inserts a single document.
* insertMany(): Inserts multiple documents.
Inserting a Single Document:
The insertOne() method takes one argument: the document to be inserted. MongoDB
automatically adds an _id field with a unique ObjectId if it's not already present in the document.
Example:
To insert a new user named "Eve" into the users collection:
db.users.insertOne({ name: "Eve", age: 28, city: "Berlin" })

Inserting Multiple Documents:


The insertMany() method takes one argument: an array of documents to be inserted.
Example:
To insert two new users, "Frank" and "Grace":
db.users.insertMany([
{ name: "Frank", age: 35, city: "Tokyo" },
{ name: "Grace", age: 21, city: "Sydney" }
])

The insertMany() method also accepts an optional ordered option. If ordered: true (default),
MongoDB will stop inserting if one of the inserts fails. If ordered: false, MongoDB will continue to
insert the remaining documents even if some fail.
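For example, to keep inserting the remaining documents even if one of them fails (say, due to a duplicate _id), you can pass ordered: false as the options document; the new user documents below are illustrative:

db.users.insertMany(
  [
    { name: "Henry", age: 40, city: "Madrid" },
    { name: "Ivy", age: 33, city: "Rome" }
  ],
  { ordered: false }   // continue with the remaining documents if one insert fails
)
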
Deleting Documents from a MongoDB Collection
You can delete documents from a MongoDB collection using the following methods:
* deleteOne(): Deletes a single document that matches the filter.
* deleteMany(): Deletes all documents that match the filter.
These methods take one argument: the filter document specifying the criteria for the documents
to be deleted.
Deleting a Single Document:
To delete the first user named "Bob" found in the collection:
db.users.deleteOne({ name: "Bob" })

Deleting Multiple Documents:


To delete all users who live in "Berlin":
db.users.deleteMany({ city: "Berlin" })

Deleting All Documents in a Collection:


To delete all documents from a collection, you can pass an empty filter to deleteMany():
db.users.deleteMany({})
Caution: Be very careful when using deleteMany({}) as it will permanently remove all data from
the specified collection.

Indexes in MongoDB and Their Importance

Indexes in MongoDB are special data structures that store a small portion of the data in a way
that is easy to traverse. They are used to optimize the performance of queries. Without indexes,
MongoDB must perform a collection scan, i.e., examine every document in a collection to select
those that match the query statement. If the collection is large, this can be very inefficient and
time-consuming.
Why are indexes crucial for query performance?
Indexes improve query performance by:
* Reducing the amount of data scanned: When a query has an index that supports the query
criteria, MongoDB can use the index to locate the matching documents without scanning the
entire collection. This significantly reduces the I/O operations and the amount of data that needs
to be processed in memory.
* Speeding up sorting: Indexes can also be used to efficiently sort the results of a query if the
sort order matches the index order. Without an index, MongoDB might need to load all the
matching documents into memory and then sort them, which can be slow for large result sets.
Think of an index in a database like the index in the back of a book. Instead of reading the entire
book to find information on a specific topic, you can look it up in the index, which points you
directly to the relevant pages. Similarly, a database index points MongoDB directly to the
documents that match your query.
Basic Example of Creating an Index:
You can create an index on one or more fields in a MongoDB collection using the createIndex()
method.
Suppose you frequently query the users collection based on the name field. To create an index
on the name field, you would run the following command in the MongoDB shell:
db.users.createIndex({ name: 1 })

Here, { name: 1 } specifies that you want to create an index on the name field in ascending
order (the 1 indicates ascending order; -1 would indicate descending order).
You can also create compound indexes on multiple fields. For example, if you often query based
on both city and age:
db.users.createIndex({ city: 1, age: 1 })

The order of fields in a compound index matters: a query can use the index only if it filters on a prefix of the indexed fields, so { city: 1, age: 1 } supports queries on city alone or on city and age together, but not on age alone.
To list all the indexes on a collection, you can use the getIndexes() method:
db.users.getIndexes()
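
To check whether a query actually uses an index, you can attach explain() to the query; an IXSCAN stage in the reported plan indicates an index scan, while COLLSCAN indicates a full collection scan. A quick sketch against the index created above:

db.users.find({ name: "Alice" }).explain("executionStats")
// Look for an IXSCAN stage on the { name: 1 } index in the winning plan.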

MongoDB Aggregation Framework

The MongoDB aggregation framework is a powerful tool for processing and transforming data
through a pipeline of multiple stages. Each stage in the pipeline performs a specific operation on
the input documents. The output of one stage is passed as the input to the next stage.
Purpose in Data Analysis:
The aggregation framework enables complex data analysis tasks directly within the database. It
allows you to:
* Filter: Select specific documents based on criteria.
* Group: Group documents based on one or more fields and perform calculations on each
group.
* Project: Reshape documents by adding new fields, removing existing ones, or renaming
fields.
* Sort: Order documents based on specific fields.
* Limit: Restrict the number of documents in the output.
* Unwind: Deconstruct array fields to output a separate document for each element in the array.
* Lookup: Perform left outer joins with other collections.
By performing these operations within the database, you can reduce the amount of data that
needs to be transferred to the application for analysis, leading to more efficient and scalable
data analysis.
Example (Conceptual):
Suppose you want to find the average age of users in each city in the users collection. You
could use an aggregation pipeline like this:
* $group: Group documents by the city field and calculate the average of the age field for each
group.
* $project: Reshape the output to include the _id (city) and the calculated average age.
The actual MongoDB syntax for this would look something like:
db.users.aggregate([
{ $group: { _id: "$city", averageAge: { $avg: "$age" } } },
{ $project: { _id: 1, averageAge: 1 } }
])
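
Further stages can be chained in the same pipeline. For instance, a sketch that first filters to adult users and then sorts the cities by their computed average age:

db.users.aggregate([
  { $match: { age: { $gte: 18 } } },                         // filter early to reduce work
  { $group: { _id: "$city", averageAge: { $avg: "$age" } } },
  { $sort: { averageAge: -1 } }                              // highest average age first
])
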

What is Hive?

Hive is a data warehousing system built on top of Hadoop. It provides a SQL-like interface
called HiveQL (HQL) to query and analyze large datasets stored in Hadoop Distributed File
System (HDFS) or other Hadoop-compatible file systems (like Amazon S3). Hive essentially
translates these SQL-like queries into MapReduce jobs (or other execution engines like Tez or
Spark) that Hadoop can execute.
How does it enable querying data stored in Hadoop?
Hive acts as a layer of abstraction over Hadoop. It allows users familiar with SQL to interact with
Hadoop data without needing to write complex Java MapReduce code. When you submit an
HQL query:
* Hive parses the query.
* It compiles the query into a series of MapReduce tasks (or tasks for other execution engines).
* Hadoop executes these tasks in parallel across the cluster.
* The results are then returned to the user.
Hive provides a schema on top of the unstructured or semi-structured data in Hadoop, allowing
you to define tables and columns, making it easier to query and analyze the data as if it were in
a traditional relational database.
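
As a rough illustration (the table definition, delimiter, and HDFS path are assumptions for this sketch, not from the original notes), defining an external table over delimited files in HDFS and querying it could look like this:

-- Define a schema over existing comma-delimited files in HDFS
CREATE EXTERNAL TABLE employees (
  employee_id INT,
  name        STRING,
  department  STRING,
  salary      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees';

-- Hive compiles this into MapReduce (or Tez/Spark) tasks behind the scenes
SELECT department, AVG(salary) FROM employees GROUP BY department;
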
Key Components of the Hive Architecture and Their Respective Roles:

[Block diagram: Hive architecture (clients, HiveServer2 with driver and compiler, metastore, execution engine, HDFS)]

* Hive Client: This is the interface through which users interact with Hive. It can be a
command-line interface (CLI), a web UI (like Hive View in older versions or Beeswax), or other
applications using JDBC/ODBC drivers. The client submits HQL queries to the Hive Server.
* Hive Server (HiveServer2): This is a service that receives queries from clients, compiles them,
and coordinates their execution. It acts as a central point of access for Hive. HiveServer2 is the
more modern and recommended server compared to the older HiveServer.
* Metastore: This is a central repository that stores metadata about the Hive tables, such as
their schema (column names and data types), their location in HDFS, and other properties. The
metastore is crucial because it allows Hive to know how the data is structured and where it
resides without having to infer it from the data files themselves. Common metastore
implementations use relational databases like MySQL or PostgreSQL.
* Driver: This component within the Hive Server is responsible for:
* Parsing the HQL query.
* Planning the execution of the query.
* Optimizing the execution plan.
* Compiler: The compiler translates the HQL query into an execution plan, which is a set of
MapReduce tasks (or tasks for other engines).
* Execution Engine: This is the component that actually executes the tasks defined in the
execution plan. By default, Hive uses MapReduce as the execution engine, but it can also be
configured to use other engines like Apache Tez or Apache Spark, which can offer performance
improvements.
* Hadoop Distributed File System (HDFS): This is the underlying storage layer for Hadoop
where the actual data being queried by Hive resides. Hive knows the location of the data files
for each table from the metadata stored in the metastore.
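
For example, a client usually reaches HiveServer2 over JDBC; with the beeline CLI such a connection looks roughly like the line below (hostname, database, and user are placeholders; 10000 is the default HiveServer2 port):

beeline -u "jdbc:hive2://hiveserver-host:10000/default" -n hive_user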

How does Hive provide a SQL-like interface (HQL) for interacting with data in Hadoop?

Hive provides a SQL-like interface (HQL) by:


* Syntax Familiarity: HQL's syntax is very similar to standard SQL. Users familiar with SQL can
easily write queries in HQL to interact with data in Hadoop. It supports common SQL constructs
like SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN, etc.
* Abstraction of Complexity: Hive hides the underlying complexity of Hadoop and MapReduce.
Users don't need to understand how MapReduce jobs are written or executed. Hive takes care
of translating the HQL queries into these low-level operations.
* Schema on Read: Hive applies a schema to the data only when a query is executed
(schema-on-read). You define the structure of your tables in the metastore, and Hive uses this
schema to interpret the data files in HDFS during query processing. This allows you to treat
unstructured or semi-structured data as if it were in structured tables.
* Data Transformation Capabilities: While HQL is SQL-like, it also includes features specific to
data warehousing and transformation, such as support for different file formats (e.g., CSV,
Parquet, ORC), data partitioning, and bucketing to improve query performance.
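
As a sketch of the partitioning mentioned above (table, columns, and partition key are illustrative), each partition value becomes its own directory in HDFS, and queries that filter on it can skip the other partitions entirely:

CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)    -- each country value gets its own HDFS subdirectory
STORED AS PARQUET;

-- Only the country=US partition is read; other partitions are pruned
SELECT SUM(amount) FROM sales WHERE country = 'US';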
Five Fundamental Data Types Available in Hive:
Hive supports a range of primitive and complex data types. Here are five fundamental ones:
* INT (Integer): A 32-bit signed integer. Similar to int in many programming languages.
* BIGINT: A 64-bit signed integer. Used for larger integer values.
* FLOAT: A single-precision floating-point number.
* DOUBLE: A double-precision floating-point number. Offers more precision than FLOAT.
* STRING: A sequence of characters. Can be of arbitrary length. Similar to VARCHAR or TEXT
in SQL.
Hive also supports other primitive types like BOOLEAN, TINYINT, SMALLINT, DECIMAL, DATE,
and TIMESTAMP, as well as complex types like ARRAY, MAP, and STRUCT for representing
more structured data within a single column.
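
A table definition touching each of these types might look like the following sketch (the table and column names are illustrative):

CREATE TABLE user_profiles (
  user_id   BIGINT,
  age       INT,
  rating    FLOAT,
  balance   DOUBLE,
  full_name STRING,
  interests ARRAY<STRING>     -- complex type: an ordered list of strings in one column
);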

Concept of Hive File Formats

In Hive, file formats define how data within a table is stored on the underlying file system
(typically HDFS). Hive supports various file formats, each with its own characteristics in terms of
storage efficiency, query performance, and data compression capabilities. When you create a
Hive table, you specify the file format using the STORED AS clause.
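For example (a sketch with illustrative table and column names), the same data could be declared as a plain text table or as a columnar ORC table:

-- Row-oriented text storage
CREATE TABLE logs_text (ts STRING, msg STRING)
STORED AS TEXTFILE;

-- Columnar, compressed storage of the same data
CREATE TABLE logs_orc (ts STRING, msg STRING)
STORED AS ORC;

-- Populate the ORC table from the text table
INSERT INTO TABLE logs_orc SELECT ts, msg FROM logs_text;
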
Why is the choice of file format important for efficiency?
The choice of file format significantly impacts the efficiency of Hive queries for several reasons:
* Storage Space: Different file formats offer varying levels of data compression. Formats like
Parquet and ORC are columnar formats that often achieve better compression than row-based
formats like TextFile or SequenceFile. Reduced storage space translates to lower storage costs
and faster data transfer within the Hadoop cluster.
* Query Performance (Data Retrieval):
* Row-based vs. Columnar: Row-based formats store all the data for a single row together.
While this is good for retrieving all columns of a few rows, it can be inefficient when you only
need a subset of columns, as the entire row needs to be read. Columnar formats, on the other
hand, store data for each column together. This allows Hive to read only the necessary columns
for a query, significantly reducing I/O and improving query speed, especially for analytical
queries that often involve aggregations over a few columns.
* Data Skipping: Some advanced file formats like Parquet and ORC include metadata that
allows Hive to skip entire blocks of data that are not relevant to the query, further enhancing
performance.
* Data Serialization and Deserialization: Different file formats have different mechanisms for
serializing (writing) and deserializing (reading) data. Some formats are more efficient in this
process than others, affecting both data loading and query execution times.
* Data Splittability: For Hadoop to process data in parallel, the input files need to be splittable
into smaller chunks that can be processed by different MapReduce tasks. Plain text files are
splittable when uncompressed but not when compressed with codecs such as gzip, whereas formats
like ORC and Parquet are designed to remain splittable even when compressed.
RCFile Format in Hive

RCFile (Record Columnar File) is a hybrid row-columnar storage format in Hive. It was designed
to provide a balance between the row-based approach (suitable for processing entire records)
and the columnar approach (suitable for selective column retrieval and compression).
Advantages in terms of data storage and retrieval:
* Storage Efficiency (Compression): RCFile achieves better compression than plain text files
because data within each column chunk tends to be more homogeneous, allowing for more
effective compression algorithms.
* Faster Data Retrieval for Column Subsets: While not as efficient as pure columnar formats
like Parquet or ORC, RCFile is better than row-based formats when only a subset of columns is
needed. Hive can read only the necessary column chunks, reducing I/O compared to reading
entire rows.
* Improved Data Locality: By storing column data together in chunks, RCFile can improve data
locality for queries that access multiple columns within the same row group.

How RCFile Works (Simplified):

RCFile organizes table data into row groups. Within each row group, the data is stored in a
columnar fashion. This means that for a set of rows, the values of the first column are stored
together, followed by the values of the second column, and so on. This structure allows for both
efficient row-level operations within a row group and benefits of columnar storage when
accessing specific columns across multiple row groups.
Basic SELECT Query in Hive Query Language (HQL)
A basic SELECT query in HQL is similar to SQL. It allows you to retrieve data from one or more
tables.
Example:
Suppose you have a Hive table named employees with columns like employee_id, name,
department, and salary. To select all columns and all rows from this table, you would use:
SELECT * FROM employees;

To select only the name and salary columns:


SELECT name, salary FROM employees;

Filtering Data in HQL using the WHERE Clause


The WHERE clause in HQL is used to filter rows based on specified conditions. It allows you to
retrieve only the rows that meet certain criteria.
Example:
To select the names of all employees in the 'Sales' department from the employees table:
SELECT name FROM employees WHERE department = 'Sales';

You can use various comparison operators (=, >, <, >=, <=, != or <>), logical operators (AND,
OR, NOT), and other conditions (IN, BETWEEN, LIKE, IS NULL, IS NOT NULL) in the WHERE
clause, just like in SQL.
For example, to find employees with a salary greater than 50000:
SELECT name, salary FROM employees WHERE salary > 50000;

To find employees in either the 'Sales' or 'Marketing' department:


SELECT name, department FROM employees WHERE department = 'Sales' OR department = 'Marketing';

Sorting and Ordering Data in Hive using HQL

You can sort the results of a Hive query using the ORDER BY and SORT BY clauses.
* ORDER BY: Sorts the entire result set globally based on the specified column(s). This
operation typically involves a single reducer, which can be slow for very large datasets.
* SORT BY: Sorts the data within each reducer. If there are multiple reducers (which is often the
case for large datasets), the final output will have the data sorted within each reducer's partition
but not necessarily globally sorted.
You can specify the sort order as ascending (ASC, which is the default) or descending (DESC).
Example using ORDER BY:
To select all employees and order them by salary in ascending order:
SELECT * FROM employees ORDER BY salary ASC;

To order them by salary in descending order:


SELECT name, salary FROM employees ORDER BY salary DESC;

To order by department first (ascending) and then by salary (descending) within each
department:
SELECT name, department, salary FROM employees ORDER BY department ASC, salary DESC;

Example using SORT BY:


To sort employees by name within each reducer (the number of reducers might depend on Hive
configuration):
SELECT * FROM employees SORT BY name ASC;

Often, SORT BY is used in conjunction with DISTRIBUTE BY to control which reducer processes which rows, allowing for more controlled sorting before a final aggregation or output.
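
A sketch of that combination on the employees table used above:

SELECT name, department, salary
FROM employees
DISTRIBUTE BY department     -- rows with the same department go to the same reducer
SORT BY department, salary DESC;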

Built-in Functions in Hive

Hive provides a rich set of built-in functions that can be used in HQL queries to perform various
operations on data, such as string manipulation, mathematical calculations, date and time
operations, and aggregations.
Example of a commonly used function and its purpose:
The COUNT() function is an aggregate function that returns the number of rows in a group or
the total number of rows if no grouping is specified.
Purpose: To count the number of records that meet certain criteria.
Example:
To count the total number of employees in the employees table:
SELECT COUNT(*) FROM employees;

To count the number of employees in each department (using GROUP BY):


SELECT department, COUNT(*) FROM employees GROUP BY department;
Other commonly used built-in functions include:
* String functions: CONCAT(), SUBSTR(), UPPER(), LOWER(), LENGTH()
* Mathematical functions: ROUND(), FLOOR(), CEIL(), ABS()
* Date and time functions: YEAR(), MONTH(), DAY(), TO_DATE()
* Aggregate functions: SUM(), AVG(), MIN(), MAX()
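
A few of these combined in one query might look like the sketch below (hire_date is an assumed column that does not appear in the employees examples above):

SELECT
  UPPER(name)                     AS name_upper,
  CONCAT(name, ' - ', department) AS label,
  ROUND(salary / 12, 2)           AS monthly_salary,
  YEAR(hire_date)                 AS hire_year
FROM employees;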

User Defined Function (UDF) in Hive

A User Defined Function (UDF) in Hive is a function that is written by users (typically in Java)
and registered with Hive to extend the functionality of HQL. UDFs allow you to perform custom
data processing or analysis that is not available through Hive's built-in functions.
Why and when would you need to create one?
You would need to create a UDF in Hive in scenarios where:
* Custom Logic Required: You need to implement specific data transformation or analysis logic
that is not supported by the existing built-in functions. This could involve complex calculations,
data parsing in a specific format, or applying custom business rules.
* Integration with External Systems or Libraries: You need to integrate Hive with external
systems or libraries. For example, you might want to use a specific Java library for data
validation or for interacting with an external API within your Hive queries.
* Performance Optimization for Specific Tasks: In some cases, implementing a custom function
in Java might offer better performance for certain operations compared to complex combinations
of built-in functions.
Example Scenario:
Suppose you have a column in your Hive table that contains a custom date format that Hive's
built-in date functions cannot parse directly. You could write a UDF in Java that takes this
custom date string as input and returns a standard date format that Hive can understand. You
would then register this UDF in Hive and use it in your HQL queries.
Steps to Create and Use a UDF (Simplified):
* Write the UDF in Java: Create a Java class that extends one of the UDF base classes
provided by Hive (e.g., org.apache.hadoop.hive.ql.exec.UDF). This class will contain a method
(typically named evaluate) that implements your custom logic.
* Compile the Java Code: Compile your Java code into a JAR file.
* Register the UDF in Hive: In the Hive CLI or through a Hive client, use the ADD JAR
command to make your JAR file available to Hive. Then, use the CREATE TEMPORARY
FUNCTION (for the current session) or CREATE FUNCTION (to make it permanent in the
metastore) command to register your UDF with a name that you can use in your HQL queries.
* Use the UDF in HQL: Once registered, you can call your UDF in your SELECT, WHERE, or
other clauses just like any built-in function.
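
A minimal sketch of these steps for the custom date scenario above. The package, class, JAR path, function name, and the raw_date column are all hypothetical, and the org.apache.hadoop.hive.ql.exec.UDF base class is used because it is the one mentioned in these notes:

// CustomDateUDF.java (hypothetical example)
package com.example.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class CustomDateUDF extends UDF {
  // Convert an assumed custom format like "20240131" into "2024-01-31"
  public Text evaluate(Text input) {
    if (input == null || input.getLength() != 8) {
      return null;                 // unparseable values become NULL
    }
    String s = input.toString();
    return new Text(s.substring(0, 4) + "-" + s.substring(4, 6) + "-" + s.substring(6, 8));
  }
}

-- After compiling the class into a JAR (path and names are illustrative):
ADD JAR /tmp/custom-date-udf.jar;
CREATE TEMPORARY FUNCTION parse_custom_date AS 'com.example.hive.CustomDateUDF';
SELECT parse_custom_date(raw_date) FROM some_table;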
