oopcorenlp_corpus – Developer's Guide

Dependencies

oopcorenlp_corpus has a direct dependency on oopcorenlp, and thus a transitive dependency on Stanford CoreNLP.

oopcorenlp_corpus also depends on many external Java libraries. These dependencies are specified in pom.xml and managed by Maven.

All Java dependencies are available from Maven Central: https://search.maven.org/. Please respect the licensing of any third-party libraries.

There are several important non-Java dependencies, mostly related to storage and indexing.

The corpus batch steps create scratch files and require an unstructured storage engine to store these objects. oopcorenlp_corpus provides two implementations of this functionality: file system and Amazon S3. If you choose the S3 storage engine, you will need an AWS account and a bucket.

The corpus batch itself requires a storage engine to store its own state. oopcorenlp_corpus provides four implementations of this functionality: file system, Amazon S3, PostgreSQL, and MongoDB.

The corpus batch steps that run after the oopcorenlp step require a structured storage engine to store the analysis data. oopcorenlp_corpus provides two implementations of this functionality: PostgreSQL and MongoDB.
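The pairing of storage needs and available engines described above can be summarized as a small lookup table. The following is only an illustrative sketch: the names Backend, StorageRole, and StorageSupport are hypothetical and are not classes from oopcorenlp_corpus. It simply encodes which engines this guide lists for each kind of storage, so a chosen combination can be checked before a batch is configured.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; none of these types come from oopcorenlp_corpus.
enum Backend { FILE_SYSTEM, AMAZON_S3, POSTGRESQL, MONGODB }

// The three kinds of storage the guide describes.
enum StorageRole { SCRATCH_OBJECTS, BATCH_STATE, ANALYSIS_DATA }

final class StorageSupport {
    private static final Map<StorageRole, Set<Backend>> SUPPORTED =
            new EnumMap<>(StorageRole.class);

    static {
        // Unstructured scratch objects: file system or Amazon S3.
        SUPPORTED.put(StorageRole.SCRATCH_OBJECTS,
                EnumSet.of(Backend.FILE_SYSTEM, Backend.AMAZON_S3));
        // Batch state: all four engines listed in the guide.
        SUPPORTED.put(StorageRole.BATCH_STATE,
                EnumSet.allOf(Backend.class));
        // Structured analysis data: PostgreSQL or MongoDB.
        SUPPORTED.put(StorageRole.ANALYSIS_DATA,
                EnumSet.of(Backend.POSTGRESQL, Backend.MONGODB));
    }

    private StorageSupport() {}

    // Returns true if the guide lists the backend for the given role.
    static boolean isSupported(StorageRole role, Backend backend) {
        return SUPPORTED.get(role).contains(backend);
    }

    // Returns a read-only view of the backends listed for the given role.
    static Set<Backend> backendsFor(StorageRole role) {
        return Collections.unmodifiableSet(SUPPORTED.get(role));
    }
}
```

For example, StorageSupport.isSupported(StorageRole.ANALYSIS_DATA, Backend.FILE_SYSTEM) returns false, reflecting that the analysis data needs a structured engine.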

For automated build scripts, visit oopcorenlp_ci.

Please respect the terms of service and copyright of any third parties.