grant pba*.zip and yyyy.zip processing - TransformerCli fails · Issue #55 · USPTO/PatentPublicData · GitHub
Open
ThomasHeliase opened this issue May 2, 2017 · 1 comment
Comments

ThomasHeliase commented May 2, 2017

Attempting to process older Greenbook-format grant_bibliographic files fails with either a Java heap space error (java.lang.OutOfMemoryError) or a java.util.NoSuchElementException.

Is this a gap in coverage, or were these files never intended to be processed, in which case is there a better source?

Good examples are the 1990 and 1998 files: a single .dat file covers the whole year, and weekly pba*.zip files are also supplied, which don't load either.

http://patentscur.reedtech.com/downloads/GrantRedBookBib/1990/1990.zip
http://patentscur.reedtech.com/downloads/GrantRedBookBib/1998/pba19980106_wk01.zip
http://patentscur.reedtech.com/downloads/GrantRedBookBib/1998/1998.zip
(The GrantRedBookBib subfolder name in the source URL appears misleading, as the text files, when manually extracted, are clearly APS.)

Both zips fail in TransformerCli with a NoSuchElementException, and the file appears to be misclassified as CpcMasterFile format. Log:

2017-05-02 18:03:57,678 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:03:57,709 INFO [ main] 1998.zip TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1998\1998.zip
2017-05-02 18:03:57,709 INFO [ main] 1998.zip PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:03:57,724 INFO [ main] 1998.zip ZipReader - Reading zip file: C:\data\out\uspto\grant_bibliographic\1998\1998.zip
Exception in thread "main" java.util.NoSuchElementException
    at gov.uspto.common.file.archive.ZipReader.next(ZipReader.java:122)
    at gov.uspto.patent.bulk.DumpFile.open(DumpFile.java:65)
    at gov.uspto.patent.bulk.DumpFileXml.open(DumpFileXml.java:31)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:166)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)
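The exception is raised from ZipReader.next, which may mean the reader found no archive entry it was prepared to return; that reading is a guess and is not confirmed anywhere in this thread. A quick way to see what the archive actually contains is to list its entries with the JDK's own java.util.zip.ZipFile. The class name ZipEntryLister below is made up for illustration and is not part of PatentPublicData.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of PatentPublicData.
public class ZipEntryLister {

    /** Returns the names of all entries in the given zip archive. */
    public static List<String> listEntries(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipEntryLister C:\data\out\uspto\grant_bibliographic\1998\1998.zip
        for (String name : listEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

If the listing shows only a .dat or .txt entry, that would at least be consistent with the archive holding APS text rather than XML.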

Manually extracting the text file from the zip doesn't get much further. In the case of 1998, the file is too large for my VM (1.7GB):

2017-05-02 18:05:23,347 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:05:23,378 INFO [ main] 1998.dat TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1998\1998.dat
2017-05-02 18:05:23,378 INFO [ main] 1998.dat PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:05:23,378 INFO [ main] 1998.dat PatentDocFormatDetect - PatentType fromContent: Greenbook
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at gov.uspto.patent.bulk.DumpFileXml.read(DumpFileXml.java:66)
    at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:92)
    at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:1)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:173)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)
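The trace shows the heap filling up inside a StringBuilder in DumpFileXml.read, so the whole file may be accumulating in memory while the reader waits for an XML boundary that never arrives in APS text; this is an interpretation of the trace, not something the maintainers confirm here. As a sanity check that the extracted file can be read in constant memory, the sketch below streams it line by line and counts records. It assumes each APS/Greenbook record begins with a line starting with "PATN", which should be verified against the actual data. ApsRecordCounter is a hypothetical name, not a class from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PatentPublicData.
public class ApsRecordCounter {

    // Assumption: every record starts with a line beginning with "PATN".
    static final String RECORD_START = "PATN";

    /** Counts record-start lines while holding only one line in memory at a time. */
    public static long countRecords(Reader source) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(RECORD_START)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 maps every byte to a character, so odd bytes cannot abort the read.
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            System.out.println(countRecords(reader));
        }
    }
}
```

A non-zero count from a run that stays within the default heap would suggest the file itself is readable and that the memory problem lies in how it is being split into documents.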

Or, using the 1990 file, the file size appears OK but the file fails to parse:

2017-05-02 18:06:59,785 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:06:59,826 INFO [ main] 1990.dat TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1990\1990.dat
2017-05-02 18:06:59,828 INFO [ main] 1990.dat PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:06:59,831 INFO [ main] 1990.dat PatentDocFormatDetect - PatentType fromContent: Greenbook
Exception in thread "main" java.lang.NullPointerException
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:175)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)

Contributor

bgfeldm commented Feb 7, 2019

Try renaming the file to start with "pftaps".

Rename 1990.zip to pftaps1990.zip
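For a folder of yearly files, the rename can be scripted. The sketch below uses only java.nio.file from the JDK; the class PftapsRename and its method names are invented for illustration, and the only thing taken from the advice above is the "pftaps" prefix itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PatentPublicData.
public class PftapsRename {

    static final String PREFIX = "pftaps";

    /** Returns the sibling path whose file name starts with "pftaps"; unchanged if it already does. */
    public static Path withPrefix(Path file) {
        String name = file.getFileName().toString();
        if (name.startsWith(PREFIX)) {
            return file;
        }
        return file.resolveSibling(PREFIX + name);
    }

    /** Renames the file on disk if needed and returns its new location. */
    public static Path rename(Path file) throws IOException {
        Path target = withPrefix(file);
        return target.equals(file) ? file : Files.move(file, target);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PftapsRename ./download/1990.zip
        System.out.println(rename(Paths.get(args[0])));
    }
}
```

Files.move fails with FileAlreadyExistsException if the target already exists, which avoids silently overwriting an earlier download.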

In the future I may introduce an option to manually provide the patent type.

Also, try using the new transformer:

gov.uspto.bulkdata.cli.Transformer --input="./download/pftaps1990.zip" --skip=0 --limit=0 --type="json_flat" --outDir="./target/output" --bulkKV=true --outputBulkFile=true

Fetched URL: http://github.com/USPTO/PatentPublicData/issues/55
