data_cheat
df.info()
df.describe()
df.isnull().sum()
df.nunique()
df.duplicated()
df.duplicated('Column_name')
df.drop_duplicates(subset=['Column_name'])
Data profiling is the process of examining, analyzing, and creating useful summaries of data.
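The summary commands listed above can be tried on a small frame. This is a minimal sketch: the toy data below is hypothetical and only borrows column names from the student dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not the student dataset from the notebook.
df = pd.DataFrame({
    "Student_ID": ["34221", "87656", None, "34224"],
    "Gender": ["M", "F", "F", "F"],
    "AGE": [16, 14, 15, 16],
    "Score": [6.5, 6.8, 5.0, 2.3],
})

df.info()                               # prints dtypes and non-null counts
numeric_summary = df.describe()         # statistics for numeric columns only
missing_per_column = df.isnull().sum()  # missing values per column
distinct_per_column = df.nunique()      # distinct non-null values per column

print(numeric_summary)
print(missing_per_column)
print(distinct_per_column)
```

Note that `describe()` skips the object columns by default, and `nunique()` does not count missing values.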
In [6]: #df.head()
# to see the full dataset
pd.set_option("display.max_rows",None)
df
Out[6]:
ID Student_ID Gender AGE Score CLASS
14 15.0 34235 F 14 3.5 N
# DATA TYPES
In [7]: df.info()  # prints the summary and returns None, so there is nothing to assign
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 29 non-null float64
1 Student_ID 29 non-null object
2 Gender 30 non-null object
3 AGE 30 non-null int64
4 Score 30 non-null float64
5 CLASS 30 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 1.5+ KB
In [8]: # Convert the float ID column to int
# df['ID'] = df['ID'].astype(int)  # raises an error while ID still contains a missing value
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 30 non-null int64
1 Student_ID 29 non-null object
2 Gender 30 non-null object
3 AGE 30 non-null int64
4 Score 30 non-null float64
5 CLASS 30 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 1.5+ KB
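The step that turns the float ID column into int64 is not shown in the output above. A minimal sketch of two common options follows, on hypothetical toy data. Filling with 0 is an assumption here (it is consistent with the ID 0 row that appears in a later output); the nullable `Int64` dtype is an alternative that keeps the gap visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing ID.
df = pd.DataFrame({"ID": [1.0, 2.0, np.nan, 4.0]})

# A direct cast fails because NaN has no integer representation.
try:
    df["ID"].astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Option 1 (assumed placeholder value): fill the gap, then cast.
filled = df["ID"].fillna(0).astype(int)

# Option 2: pandas' nullable integer dtype keeps the value missing.
nullable = df["ID"].astype("Int64")

print(cast_failed)
print(filled.tolist())
print(nullable.isna().sum())
```

A placeholder such as 0 can collide with a real ID, so the nullable dtype is often the safer choice when the column is used as a key.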
Out[9]:
ID AGE Score
Out[10]: ID 29
Student_ID 26
Gender 2
AGE 4
Score 11
CLASS 8
dtype: int64
Out[11]: ID 0
Student_ID 1
Gender 0
AGE 0
Score 0
CLASS 0
dtype: int64
Out[12]:
ID Student_ID Gender AGE Score CLASS
12 2 34221 M 16 6.5 y
Out[13]:
ID Student_ID Gender AGE Score CLASS
11 0 87656 F 14 6.8 y
12 2 34221 M 16 6.5 y
13 14 34224 F 16 2.3 N
In [14]: # Drop duplicates based on the 'Student_ID' column
df_no_duplicates_Student_ID = df.drop_duplicates(subset=['Student_ID'])
print(df_no_duplicates_Student_ID)
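Both `duplicated` and `drop_duplicates` accept a `keep` argument that decides which copy counts as the original. The sketch below uses hypothetical toy data to show the difference.

```python
import pandas as pd

# Hypothetical toy data with one repeated Student_ID.
df = pd.DataFrame({
    "Student_ID": ["34221", "87656", "34221", "34224"],
    "Score": [6.5, 6.8, 6.5, 2.3],
})

# Default keep="first": only the later copies are flagged.
later_copies = df[df.duplicated("Student_ID")]

# keep=False flags every row that shares its Student_ID with another row.
all_copies = df[df.duplicated("Student_ID", keep=False)]

# keep="last" retains the final occurrence of each Student_ID.
keep_last = df.drop_duplicates(subset=["Student_ID"], keep="last")

print(later_copies)
print(all_copies)
print(keep_last)
```

Inspecting `all_copies` before dropping anything makes it easier to check whether the repeated rows really are identical records.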
PySpark
In [15]: from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan
In [17]: # Load data into a Spark DataFrame
read.csv("/Users/pragatigupta/Documents/AI And ML/Linkedin Post/Student Dataset/Student_Score
In [18]: df.head(5)
+-------+------------------+------------------+------+------------------+------------------+-----+
|summary|                ID|        Student_ID|Gender|               AGE|             Score|CLASS|
+-------+------------------+------------------+------+------------------+------------------+-----+
|  count|                29|                29|    30|                30|                30|   30|
|   mean|15.241379310344827| 41860.82142857143|  null|15.566666666666666|              5.73| null|
| stddev| 9.276141332987368|20171.267078682085|  null| 1.072648457158112|1.5783339098665028| null|
|    min|                 1|             12744|     F|                14|               2.0|    Y|
|    max|                30|              Null|     M|                17|               7.1|   y |
+-------+------------------+------------------+------+------------------+------------------+-----+
root
|-- ID: integer (nullable = true)
|-- Student_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- AGE: integer (nullable = true)
|-- Score: double (nullable = true)
|-- CLASS: string (nullable = true)
+----+----------+------+---+-----+-----+
| ID|Student_ID|Gender|AGE|Score|CLASS|
+----+----------+------+---+-----+-----+
|null| 87656| F| 14| 6.8| y |
+----+----------+------+---+-----+-----+
In [22]: null_count = df.filter(col("Student_ID").isNull()).count()
null_count
Out[22]: 1
+---+-----+
| ID|count|
+---+-----+
| 2| 2|
+---+-----+
SQL
In [39]: # sqlite3 is part of the Python standard library, so no pip install is needed
# Total count
Total_IDs = sqldf("SELECT COUNT(*) AS Total_IDs FROM df")
Total_IDs
# Duplicates: number of rows per ID
Total_IDs = sqldf("SELECT ID, COUNT(*) AS Total_IDs FROM df GROUP BY ID")
Total_IDs
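The same two queries can be run without `sqldf` by loading the frame into an in-memory SQLite database. This is a minimal sketch on hypothetical toy data. The `HAVING` clause is an addition that is not in the notebook's query: it keeps only the IDs that occur more than once, which is what a duplicate check usually needs.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data with one repeated ID.
df = pd.DataFrame({"ID": [1, 2, 2, 3], "Score": [6.8, 6.5, 6.5, 2.3]})

con = sqlite3.connect(":memory:")
df.to_sql("df", con, index=False)  # copy the frame into a table named df

# Total row count.
total = pd.read_sql_query("SELECT COUNT(*) AS Total_IDs FROM df", con)

# IDs that occur more than once, with their counts.
duplicates = pd.read_sql_query(
    "SELECT ID, COUNT(*) AS n FROM df GROUP BY ID HAVING COUNT(*) > 1",
    con,
)
con.close()

print(total)
print(duplicates)
```

With a plain `GROUP BY ID`, every ID gets a row and the duplicates have to be picked out by eye; filtering in SQL returns only the offending IDs.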