Add df.duplicated() method for DataFrame similar to pandas #667

Open
RahulDas-dev opened this issue Apr 30, 2025 · 0 comments

Is your feature request related to a problem? Please describe.
Currently, Danfo.js does not have a method to identify duplicate rows in a DataFrame, which is a common data manipulation task. This limitation can make it challenging to clean or preprocess data efficiently, especially when working with large datasets. For example, in pandas, the df.duplicated() method is highly useful for flagging duplicate rows, but no such equivalent exists in Danfo.js.

Two existing helpers come close but do not cover this case:

  • getDuplicate (utility): works on plain arrays only and does not address DataFrame row duplication.
  • dropDuplicates (Series): removes duplicate values from a Series but is not designed for identifying duplicate rows in a DataFrame.

Describe the solution you'd like
I would like Danfo.js to implement a df.duplicated() method for DataFrames, similar to pandas. This method should return a boolean Series indicating whether each row in the DataFrame is a duplicate of a previous row. The method should also include parameters such as:

  • subset: Specify columns to consider when identifying duplicates.
  • keep: Define which occurrence, if any, is left unmarked ('first', 'last', or false). Every other duplicate is marked as True.

Describe alternatives you've considered
An alternative would be to manually implement a custom function to compare rows and identify duplicates. However, this approach is less efficient and may lead to inconsistent or error-prone implementations across different projects. Providing a built-in method would standardize and simplify the process for all users.
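For reference, the manual workaround described above might look like the following minimal sketch. It works on a plain array of row objects rather than a Danfo.js DataFrame and covers only the "flag every repeat of an earlier row" case. The name markDuplicateRows is a hypothetical helper, not part of Danfo.js.

```javascript
// Hypothetical helper (not part of Danfo.js): returns one boolean per row,
// true when an identical row has already been seen earlier in the array.
function markDuplicateRows(rows) {
  const seen = new Set();
  return rows.map((row) => {
    // Serialize [column, value] pairs in sorted column order so that the
    // property order of the object does not affect the comparison.
    const key = JSON.stringify(
      Object.keys(row).sort().map((col) => [col, row[col]])
    );
    const isDuplicate = seen.has(key);
    seen.add(key);
    return isDuplicate;
  });
}

const rows = [
  { col1: 1, col2: 2 },
  { col1: 1, col2: 2 },
  { col1: 3, col2: 4 },
];

console.log(markDuplicateRows(rows)); // [ false, true, false ]
```

Each project that needs this has to decide on its own how to build the row key and how to treat missing values, which is the kind of inconsistency a built-in method would avoid.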

Additional context
This feature would align Danfo.js closely with pandas, making it easier for users transitioning from Python to JavaScript.
The implementation could leverage existing internal methods for row/column comparisons to ensure optimal performance.
This feature is particularly useful in data cleaning workflows and preprocessing pipelines.

Proposed API:

// Assumes the Node.js build of Danfo.js
const { DataFrame } = require("danfojs-node");

// Example 1: compare entire rows
const df = new DataFrame([
    { col1: 1, col2: 2 },
    { col1: 1, col2: 2 },
    { col1: 3, col2: 4 },
]);

// duplicated() would return a boolean Series; .values exposes the underlying array
const duplicates = df.duplicated();
console.log(duplicates.values); // Expected output: [false, true, false]

// Example 2: compare only selected columns
const data = [
    { col1: 1, col2: 2, col3: 'A' },
    { col1: 1, col2: 2, col3: 'B' },
    { col1: 3, col2: 4, col3: 'A' },
    { col1: 1, col2: 2, col3: 'A' }
];

const df2 = new DataFrame(data);

// Find duplicates considering only 'col1' and 'col2'
const subsetDuplicates = df2.duplicated({ subset: ['col1', 'col2'], keep: 'first' });

console.log(subsetDuplicates.values);
// Expected output: [false, true, false, true]

// Explanation:
// - Row 1 is not a duplicate.
// - Row 2 is a duplicate of Row 1 based on 'col1' and 'col2'.
// - Row 3 is not a duplicate.
// - Row 4 is a duplicate of Row 1 based on 'col1' and 'col2'.

Parameters Explained

  1. subset: Specifies the columns to consider when checking for duplicates. In this example, only col1 and col2 are considered.
  2. keep: Determines which occurrence in a group of duplicate rows is marked False (that is, kept):
      • 'first' (default): Marks every duplicate as True except the first occurrence.
      • 'last': Marks every duplicate as True except the last occurrence.
      • false: Marks all duplicates as True, including the first and last occurrences.
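
To make the subset and keep semantics concrete, here is a standalone sketch that applies them to a plain array of row objects. It is only an illustration of the proposed behavior under the definitions above, not Danfo.js code. The function name duplicatedMask and its options object are assumptions made for this example.

```javascript
// Hypothetical standalone sketch of the proposed semantics; not Danfo.js code.
function duplicatedMask(rows, { subset = null, keep = "first" } = {}) {
  if (!["first", "last", false].includes(keep)) {
    throw new Error('keep must be "first", "last", or false');
  }

  // Build a comparison key from the chosen columns (all columns by default).
  const keys = rows.map((row) => {
    const cols = subset ?? Object.keys(row).sort();
    return JSON.stringify(cols.map((col) => [col, row[col]]));
  });

  // Record the first and last position at which each key occurs.
  const first = new Map();
  const last = new Map();
  keys.forEach((key, i) => {
    if (!first.has(key)) first.set(key, i);
    last.set(key, i);
  });

  return keys.map((key, i) => {
    if (first.get(key) === last.get(key)) return false; // key occurs once
    if (keep === "first") return i !== first.get(key);
    if (keep === "last") return i !== last.get(key);
    return true; // keep === false: flag every row in the duplicate group
  });
}

const sample = [
  { col1: 1, col2: 2, col3: "A" },
  { col1: 1, col2: 2, col3: "B" },
  { col1: 3, col2: 4, col3: "A" },
  { col1: 1, col2: 2, col3: "A" },
];
const subset = ["col1", "col2"];

console.log(duplicatedMask(sample, { subset, keep: "first" })); // [ false, true, false, true ]
console.log(duplicatedMask(sample, { subset, keep: "last" }));  // [ true, true, false, false ]
console.log(duplicatedMask(sample, { subset, keep: false }));   // [ true, true, false, true ]
```

A mask like this can then drive filtering, for example sample.filter((_, i) => !mask[i]) to keep only the unflagged rows.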

This feature would be very useful in filtering and processing data efficiently, similar to pandas' duplicated() method. Let me know if you'd like further clarifications!
