Task - Level1 Database Module
1. Problem: We are given some turbine failure data and need to perform some analytics
on it.
a. Create a SAP data table with the following columns:
i. Alert_id
ii. Turbine_id
iii. Alert_start_date
iv. Alert_end_date
v. Farm_name (name of the farm where the turbine is located)
vi. Fail_component (name of the component that failed)
vii. Fail_window
viii. Fail_within_fail_window
ix. fail_within_ninty_days
x. fn_check_gen
b. Check the count of null turbines (rows where Turbine_id is null).
c. Count the rows where Fail_within_fail_window is `yes` as a new column `fail_window`.
d. Count the rows where Fail_within_fail_window is `yes` or `no` as a new column
`non_pending`.
e. Count the rows where fail_within_ninty_days is `yes` as a new column `TP`.
f. Count the rows where fn_check_gen is `yes` as a new column `FN`.
g. Create a new column fp = non_pending - TP.
h. Calculate the precision (%) = TP * 100 / non_pending.
i. Calculate the fail_window (%) = fail_window * 100 / non_pending.
j. Calculate the reliability (%) = TP * 100 / non_pending.
File link: turbine_data.csv
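Steps (b)-(j) can be sketched in plain Python over an in-memory sample. The rows below are hypothetical stand-ins; in practice they would come from turbine_data.csv or the table created in step (a):

```python
# Hypothetical sample rows (real data would be loaded from turbine_data.csv).
rows = [
    {"Turbine_id": "T1", "Fail_within_fail_window": "yes", "fail_within_ninty_days": "yes", "fn_check_gen": "no"},
    {"Turbine_id": "T2", "Fail_within_fail_window": "no",  "fail_within_ninty_days": "no",  "fn_check_gen": "yes"},
    {"Turbine_id": None, "Fail_within_fail_window": "yes", "fail_within_ninty_days": "no",  "fn_check_gen": "no"},
]

null_turbines = sum(1 for r in rows if r["Turbine_id"] is None)                        # (b)
fail_window   = sum(1 for r in rows if r["Fail_within_fail_window"] == "yes")          # (c)
non_pending   = sum(1 for r in rows if r["Fail_within_fail_window"] in ("yes", "no"))  # (d)
TP            = sum(1 for r in rows if r["fail_within_ninty_days"] == "yes")           # (e)
FN            = sum(1 for r in rows if r["fn_check_gen"] == "yes")                     # (f)
FP = non_pending - TP                                                                  # (g)

precision_pct   = TP * 100 / non_pending            # (h)
fail_window_pct = fail_window * 100 / non_pending   # (i)
reliability_pct = TP * 100 / non_pending            # (j) is the same formula as (h) in this spec
```

The same counts translate directly into SQL `COUNT(CASE WHEN ... THEN 1 END)` aggregates once the table from step (a) exists.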
2. Problem: Table 1: IoT devices submit live statuses to the database every second,
as a code from 0-5.
Columns: "timestamp", "source_device", "status"
Table 2:
Hierarchical metadata of the IoT devices, with levels 1-5:
- Company (L1) -> country (L2) -> region/state (L3) -> hub (L4) -> device (L5)
Columns: "child_id", "level", "parent_id"
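The (child_id, level, parent_id) table can be turned into an ancestor lookup with a simple parent-pointer walk; the IDs below are illustrative, not taken from the real metadata file:

```python
# Hypothetical metadata rows: (child_id, level, parent_id); the root has no parent.
meta = [
    ("acme",  1, None),     # company (L1)
    ("in",    2, "acme"),   # country (L2)
    ("south", 3, "in"),     # region/state (L3)
    ("h1",    4, "south"),  # hub (L4)
    ("d1",    5, "h1"),     # device (L5)
]
parent = {child: p for child, _, p in meta}

def ancestors(node):
    """Walk parent pointers from a node up to the root, nearest ancestor first."""
    chain = []
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain
```

This mapping is what lets per-device statuses be rolled up to hub, region, country, and company level.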
Analytical Rules:
- For any second, the status of a device is whatever is available in the
database, since every IoT device updates its status every second.
- For any second, the status of a hub is the status that occurs most often
among the statuses of the underlying IoT devices; if two statuses occur
the same number of times, the larger one is taken. The same rule applies
at every level.
For example:
- A hub (h1) has 4 underlying IoT devices and 3 of them have status 2;
the status of hub h1 for that second is 2.
- A hub (h2) has 4 underlying IoT devices, 2 of them have status 2 and
2 of them have status 3 for a second; for that second the status of
h2 is 3.
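The per-second roll-up rule above (maximum occurrences, ties broken by the larger status code) can be sketched as:

```python
from collections import Counter

def rollup_status(statuses):
    """Parent status for one second: the most frequent child status;
    on a tie in counts, the larger status code wins."""
    counts = Counter(statuses)
    # max over (count, status): highest count first, larger status breaks ties.
    return max(counts, key=lambda s: (counts[s], s))
```

Applied per second at each level of the hierarchy, this reproduces both the h1 and h2 examples.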
- For any longer duration, such as 10 minutes, the most continuous status is
the status for that complete duration; if two statuses were active for the
same duration, the greater status is the resultant status.
For example: a device has status 2 for 2 continuous seconds, then status 4
for the next 30 continuous seconds, and finally status 2 again for the
next 28 seconds; so for those 60 seconds (1 min) the status of the device
is 4.
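Reading "most continuous" as the longest unbroken run, which matches the 60-second example (a total-duration reading would also give 4 there, since 2 totals 30 seconds and ties lose to the larger code), the duration rule can be sketched as:

```python
from itertools import groupby

def duration_status(statuses):
    """Status over a window: the status with the longest single continuous
    run; ties on run length go to the larger status code."""
    # One (run_length, status) pair per unbroken run of equal statuses.
    runs = [(len(list(g)), s) for s, g in groupby(statuses)]
    best_len, best_status = max(runs)  # tuple order: longest run, then larger status
    return best_status
```

For the 5-minute and parameterized-duration views, the same function is applied to each device's per-second status sequence within the window.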
Views:
Get analytics over the data:
- Status of each device at each level for every second.
- Status of each device at each level for every 5 minutes.
- Status of each device at each level for a given duration (can be a
procedure, or a query with the duration as a changeable parameter).
Data file: data.csv
3. Problem:
Description: There are multiple files containing statuses of some devices at
different times. All the files are CSV files (with columns: rn (row number),
id, ts (timestamp), status).
Aim: We need to generate a single file out of these, sorted over the
timestamp and id columns in that priority order.
The time taken to process is the main benchmark.
Hint:
- Read all the files using multiprocessing.
- Use merge sort to sort the data.
Files: data.csv, hie_data.csv
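A sketch of the hinted pipeline: read the inputs in parallel, sort each one, then k-way merge the sorted runs while timing the whole step. `ThreadPoolExecutor` stands in for multiprocessing so the example is self-contained (for CPU-bound parsing, swap in `ProcessPoolExecutor` behind an `if __name__ == "__main__"` guard), and in-memory file objects stand in for the real CSV files:

```python
import csv
import heapq
import io
import time
from concurrent.futures import ThreadPoolExecutor

def read_sorted(f):
    """Read one CSV stream and return its rows sorted by (ts, id).
    ts is compared lexically here; parse it for real timestamps."""
    rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: (r["ts"], r["id"]))
    return rows

def merge_files(streams):
    """Read all inputs in parallel, k-way merge the sorted runs,
    and report the elapsed time as the benchmark."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(read_sorted, streams))
    merged = list(heapq.merge(*runs, key=lambda r: (r["ts"], r["id"])))
    return merged, time.perf_counter() - t0
```

`heapq.merge` is the merge step of merge sort: because every run is already sorted, a single pass with a small heap produces the globally sorted output, and `elapsed` gives the processing-time benchmark the problem asks for.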