GCP: Add prefix and bulk operations to GCSFileIO #8168

Merged: 12 commits into apache:master on Aug 5, 2023

Conversation

bryanck (Contributor) commented Jul 27, 2023

This PR adds bulk and prefix operations to GCSFileIO, and has GCSFileIO implement the SupportsBulkOperations and SupportsPrefixOperations interfaces.
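For context, here is a sketch of the two interface shapes from Iceberg's org.apache.iceberg.io package (method signatures as I understand them; the actual GCSFileIO wiring is in this PR's diff):

  public interface SupportsPrefixOperations extends FileIO {
    // List every file under the given prefix.
    Iterable<FileInfo> listPrefix(String prefix);

    // Delete every file under the given prefix.
    void deletePrefix(String prefix);
  }

  public interface SupportsBulkOperations extends FileIO {
    // Delete the given paths, reporting failures in bulk.
    void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException;
  }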

github-actions bot added the GCP label Jul 27, 2023
private int gcsDeleteBatchSize;

public GCPProperties() {
gcsDeleteBatchSize = GCS_DELETE_BATCH_SIZE_DEFAULT;
Contributor:

Probably OK to set this here; alternatively, just initialize it inline on the line above: private int gcsDeleteBatchSize = GCS_DELETE_BATCH_SIZE_DEFAULT;

bryanck (Contributor, Author):

Sure, I updated this. I was just following the pattern in S3FileIO.
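For illustration, the resulting inline-initialization pattern would look roughly like this (a sketch; the accessor name matches the deleteBatchSize() call used later in this thread, and the default value shown is illustrative, not the actual constant):

  public class GCPProperties {
    // Illustrative default; the real constant is defined in GCPProperties.
    public static final int GCS_DELETE_BATCH_SIZE_DEFAULT = 50;

    // Initialized inline, per the review suggestion, instead of in the constructor.
    private int gcsDeleteBatchSize = GCS_DELETE_BATCH_SIZE_DEFAULT;

    public int deleteBatchSize() {
      return gcsDeleteBatchSize;
    }
  }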

.iterator();
}

private long getCreateTimeMillis(Blob blob) {
Contributor:

nit: Iceberg typically doesn't use getXyz methods. Maybe `createTimeMillisFrom(Blob)`?

bryanck (Contributor, Author):

Thanks, I changed this to createTimeMillisFrom.
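For illustration, the renamed helper might look like this (a sketch, assuming google-cloud-storage's Blob#getCreateTime() accessor, which returns the creation time in epoch millis; the actual body is in the PR diff):

  private long createTimeMillisFrom(Blob blob) {
    // Blob#getCreateTime() returns a nullable Long of epoch millis.
    Long createTime = blob.getCreateTime();
    return createTime != null ? createTime : 0L;
  }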

internalDeleteFiles(() -> Streams.stream(pathsToDelete).map(BlobId::fromGsUtilUri).iterator());
}

private void internalDeleteFiles(Iterable<BlobId> blobIdsToDelete) {
Contributor:

I was wondering whether we should pass the Stream<BlobId> through to this method, to have something like:

  @Override
  public void deletePrefix(String prefix) {
    internalDeleteFiles(
        Streams.stream(listPrefix(prefix))
            .map(fileInfo -> BlobId.fromGsUtilUri(fileInfo.location())));
  }

  @Override
  public void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException {
    internalDeleteFiles(Streams.stream(pathsToDelete).map(BlobId::fromGsUtilUri));
  }

  private void internalDeleteFiles(Stream<BlobId> blobIdsToDelete) {
    Streams.stream(Iterators.partition(blobIdsToDelete.iterator(), gcpProperties.deleteBatchSize()))
        .forEach(batch -> client().delete(batch));
  }

bryanck (Contributor, Author):

Yes, that is an improvement; I made this change.
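Combined with failure tracking, the stream-based version could end up roughly like this (a sketch, assuming Storage#delete(Iterable<BlobId>) returns one Boolean per blob and that Iceberg's BulkDeletionFailureException takes a failure count):

  private void internalDeleteFiles(Stream<BlobId> blobIdsToDelete) {
    // AtomicInteger so the count can be mutated from inside the lambda.
    AtomicInteger failureCount = new AtomicInteger(0);
    Streams.stream(
            Iterators.partition(blobIdsToDelete.iterator(), gcpProperties.deleteBatchSize()))
        .forEach(
            batch -> {
              // Storage#delete returns one Boolean per blob; false means not deleted.
              List<Boolean> results = client().delete(batch);
              failureCount.addAndGet((int) results.stream().filter(deleted -> !deleted).count());
            });
    if (failureCount.get() > 0) {
      throw new BulkDeletionFailureException(failureCount.get());
    }
  }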

nastra (Contributor) left a comment:

mostly LGTM, just left a few minor comments

String path3 = "list/skip/data3.dat";
storage.create(BlobInfo.newBuilder(TEST_BUCKET, path3).build());

assertThat(StreamSupport.stream(io.listPrefix(gsUri("list/")).spliterator(), false).count())
Contributor:

sorry I missed this initially, but there's no need to wrap this in a stream. This (and other places) can be simplified to assertThat(io.listPrefix(gsUri("list/"))).hasSize(3);

Contributor:

I noticed there are also existing tests in that file that do the same thing; could you please update those as well?

bryanck (Contributor, Author):

The intent was to avoid using the FileIO we're testing to perform the assertion, and to use the Google client directly instead; I assume the same is true for the existing tests.
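For illustration, an assertion that goes through the Google client directly might look like this (a sketch; storage and TEST_BUCKET as in the test snippet above, with BlobListOption.prefix as the listing filter):

  // Count objects via the Storage client rather than the FileIO under test.
  Page<Blob> blobs = storage.list(TEST_BUCKET, Storage.BlobListOption.prefix("list/"));
  assertThat(Streams.stream(blobs.iterateAll()).count()).isEqualTo(3);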

nastra (Contributor) left a comment:

LGTM, thanks @bryanck

nastra merged commit 3f94949 into apache:master on Aug 5, 2023
nastra pushed a commit to nastra/iceberg that referenced this pull request on Aug 15, 2023
* Spark: Update antlr4 to match Spark 3.4 (apache#7824)

* Parquet: Revert workaround for resource usage with zstd (apache#7834)

* GCP: fix single byte read in GCSInputStream (apache#8071)

* GCP: fix byte read in GCSInputStream

* add test

* Parquet: Cache codecs by name and level (apache#8182)

* GCP: Add prefix and bulk operations to GCSFileIO (apache#8168)

* AWS, GCS: Allow access to underlying storage client (apache#8208)

* spotless