Is your feature request related to a problem? Please describe.
Currently, Danfo.js does not have a method to identify duplicate rows in a DataFrame, which is a common data manipulation task. This limitation can make it challenging to clean or preprocess data efficiently, especially when working with large datasets. For example, in pandas, the df.duplicated() method is highly useful for flagging duplicate rows, but no such equivalent exists in Danfo.js.
The getDuplicate method in utility is array-specific and doesn't directly address DataFrame row duplication.
For Series: The dropDuplicates method works on Series and can help remove duplicate values but isn't designed for identifying duplicate rows in a DataFrame.
Describe the solution you'd like
I would like Danfo.js to implement a df.duplicated() method for DataFrames, similar to pandas. This method should return a boolean Series indicating whether each row in the DataFrame is a duplicate of a previous row. The method should also include parameters such as:
subset: Specify columns to consider when identifying duplicates.
keep: Define which occurrence of a duplicated row, if any, is left unmarked ('first', 'last', or false to mark every occurrence as True).
Describe alternatives you've considered
An alternative would be to manually implement a custom function to compare rows and identify duplicates. However, this approach is less efficient and may lead to inconsistent or error-prone implementations across different projects. Providing a built-in method would standardize and simplify the process for all users.
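For reference, a minimal sketch of such a manual workaround in plain JavaScript is shown here. The helper name flagDuplicateRows is hypothetical and not part of Danfo.js. It assumes the rows are available as an array of plain objects that all share the same keys, and it only covers the "flag everything after the first occurrence" case.

```javascript
// Hypothetical helper, not part of Danfo.js: flags rows that repeat an earlier row.
// `rows` is an array of plain objects; `subset` optionally limits the compared columns.
function flagDuplicateRows(rows, subset) {
  // Assumption: every row has the same keys as the first row.
  const columns = subset ?? Object.keys(rows[0] ?? {});
  const seen = new Set();
  return rows.map((row) => {
    // Serialize the compared values so that 1 and "1" produce different keys.
    const key = JSON.stringify(columns.map((name) => row[name]));
    const isDuplicate = seen.has(key);
    seen.add(key);
    return isDuplicate;
  });
}

const rows = [
  { col1: 1, col2: 2, col3: "A" },
  { col1: 1, col2: 2, col3: "B" },
  { col1: 3, col2: 4, col3: "A" },
  { col1: 1, col2: 2, col3: "A" },
];

const allColumns = flagDuplicateRows(rows);
const twoColumns = flagDuplicateRows(rows, ["col1", "col2"]);
console.log(allColumns); // [ false, false, false, true ]
console.log(twoColumns); // [ false, true, false, true ]
```

Because the key is built with JSON.stringify, values such as NaN and undefined both serialize to null and would be treated as equal. Every project that writes its own version has to make choices like this one, which is part of why a built-in method would help.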
Additional context
This feature would align Danfo.js closely with pandas, making it easier for users transitioning from Python to JavaScript.
The implementation could leverage existing internal methods for row/column comparisons to ensure optimal performance.
This feature is particularly useful in data cleaning workflows and preprocessing pipelines.
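Proposed API:
// Assumes the Node.js build of Danfo.js; in the browser, use dfd.DataFrame instead.
const { DataFrame } = require("danfojs-node");
// Sample DataFrame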
const data = [
{ col1: 1, col2: 2, col3: 'A' },
{ col1: 1, col2: 2, col3: 'B' },
{ col1: 3, col2: 4, col3: 'A' },
{ col1: 1, col2: 2, col3: 'A' }
];
const df = new DataFrame(data);
// Find duplicates considering only 'col1' and 'col2'
const duplicates = df.duplicated({ subset: ['col1', 'col2'], keep: 'first' });
console.log(duplicates);
// Expected values of the returned boolean Series: [false, true, false, true]
// Explanation:
// - Row 1 is not duplicate.
// - Row 2 is duplicate of Row 1 based on 'col1' and 'col2'.
// - Row 3 is not duplicate.
// - Row 4 is duplicate of Row 1 based on 'col1' and 'col2'.
Parameters Explained
subset: Specifies the columns to consider when checking for duplicates. In this example, only col1 and col2 are considered.
keep: Determines which occurrence of a duplicated row is marked as False:
'first' (default): Marks every occurrence as True except the first.
'last': Marks every occurrence as True except the last.
false: Marks every occurrence as True, including the first and the last.
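To make the three keep modes concrete, here is a rough sketch of the semantics described above in plain JavaScript. The function duplicatedMask is a hypothetical stand-in that works on an array of row objects. It is not an existing Danfo.js function, and an actual implementation could look quite different.

```javascript
// Hypothetical sketch of the proposed `keep` semantics; not an existing Danfo.js function.
function duplicatedMask(rows, { subset, keep = "first" } = {}) {
  if (keep !== "first" && keep !== "last" && keep !== false) {
    throw new Error('keep must be "first", "last", or false');
  }
  // Assumption: every row has the same keys as the first row.
  const columns = subset ?? Object.keys(rows[0] ?? {});
  const keys = rows.map((row) => JSON.stringify(columns.map((name) => row[name])));

  // Record the first and last position at which each key occurs.
  const firstIndex = new Map();
  const lastIndex = new Map();
  keys.forEach((key, i) => {
    if (!firstIndex.has(key)) firstIndex.set(key, i);
    lastIndex.set(key, i);
  });

  return keys.map((key, i) => {
    // A key that occurs only once is never a duplicate.
    if (firstIndex.get(key) === lastIndex.get(key)) return false;
    if (keep === "first") return i !== firstIndex.get(key);
    if (keep === "last") return i !== lastIndex.get(key);
    return true; // keep === false: every occurrence is flagged
  });
}

const sample = [
  { col1: 1, col2: 2, col3: "A" },
  { col1: 1, col2: 2, col3: "B" },
  { col1: 3, col2: 4, col3: "A" },
  { col1: 1, col2: 2, col3: "A" },
];
const subset = ["col1", "col2"];

const keepFirst = duplicatedMask(sample, { subset, keep: "first" });
const keepLast = duplicatedMask(sample, { subset, keep: "last" });
const keepNone = duplicatedMask(sample, { subset, keep: false });
console.log(keepFirst); // [ false, true, false, true ]
console.log(keepLast);  // [ true, true, false, false ]
console.log(keepNone);  // [ true, true, false, true ]
```

Rows 1, 2 and 4 share the same col1 and col2 values, so only the choice of keep decides which of them is left unmarked. Row 3 is unique and stays False in every mode.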
This feature would be very useful in filtering and processing data efficiently, similar to pandas' duplicated() method. Let me know if you'd like further clarifications!