Bulk import from an S3 bucket
Load data from files in an existing Amazon Simple Storage Service (Amazon S3) bucket with maximum performance.
aidbox.bulk/load-from-bucket
It allows loading data from a set of .ndjson.gz files in an AWS bucket directly into the Aidbox database with maximum performance.
Be careful: you should run only one replica of Aidbox while using the aidbox.bulk/load-from-bucket operation.
Each file must consist of resources of the same type. The file name must start with the name of the resource type, may be followed by an arbitrary postfix, and must have the .ndjson extension. Files can be placed in subdirectories of any level; files with the wrong path structure are ignored. Every resource in the .ndjson files MUST contain an id property. A sample bucket layout is sketched below.
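For illustration, a bucket layout satisfying these rules might look like this (the bucket name and file names are hypothetical):

```
s3://your-bucket-id/
  Patient.ndjson.gz                 # resource type name, no postfix
  fhir/1/Patient-part2.ndjson.gz    # postfix and subdirectories are allowed
  fhir/1/Observation.ndjson.gz
  notes.txt                         # wrong path structure: ignored
```

Each line inside Patient.ndjson.gz is then one complete Patient resource with an id, for example: {"id":"pt-1","resourceType":"Patient"}.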
Parameters

- method (required): :aidbox.bulk/load-from-bucket
- params (not required): object with the following structure (see the example request after this list):
  - bucket (required): defines your bucket connection string in the format s3://<bucket-name>
  - thread-num: defines how many threads will process the import. The default is 4.
  - account: credentials:
    - access-key-id (required): AWS key ID
    - secret-access-key (required): AWS secret key
    - region (required): AWS bucket region
  - disable-idx?: the default is false. Allows dropping all indexes for the resources whose data are going to be loaded. Indexes will be restored at the end of a successful import. All information about dropped indexes is stored in DisabledIndex resources.
  - drop-primary-key?: the default is false. The same as the previous parameter, but drops the primary key constraint on the resource tables. This parameter disables all checks for duplicates for imported resources.
  - upsert?: the default is false. If upsert? is false, the import fails with an error for files that violate the id uniqueness constraint; if true, records in the database will be overridden with records from the import. Even when upsert? is true, it is still not allowed to have more than one record with the same id in one import file. Setting this option to true will decrease performance.
  - scheduler: possible values are optimal and by-last-modified; the default is optimal. Establishes the order in which the files are processed. The optimal value provides the best performance. by-last-modified should be used with thread-num = 1 to guarantee a stable order of file processing.
  - prefixes: array of prefixes specifying which files should be processed. Example: with the value ["fhir/1/", "fhir/2/Patient"], only files from the directory "fhir/1" and Patient files from the directory "fhir/2" will be processed.
  - connect-timeout: the default is 0. Specifies the number of milliseconds after which the file is considered failed if a connection to the resource could not be established (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
  - read-timeout: the default is 0. Specifies the number of milliseconds after which the file is considered failed if there is no data available to read (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
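To illustrate the parameters above, here is a request sketch that assumes Aidbox's standard /rpc endpoint with a YAML body; the bucket name, credentials, and region are placeholders:

```
POST /rpc
content-type: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  thread-num: 4
  scheduler: optimal
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1
```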
For each file being imported via the load-from-bucket method, Aidbox creates a LoaderFile resource. To find out how many resources were imported from a file, check its loaded field.
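Assuming LoaderFile resources are exposed through Aidbox's regular REST search API (only the loaded field is confirmed by this page; the other fields shown are illustrative), per-file progress could be checked like this:

```
GET /LoaderFile

# each LoaderFile resource is expected to carry a loaded counter, e.g.:
#   file: fhir/1/Patient-part2.ndjson.gz
#   loaded: 15000
```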
On launch, aidbox.bulk/load-from-bucket checks whether files from the bucket were already planned for import and decides what to do:

- If an ndjson.gz file has a related LoaderFile resource, the loader skips this file.
- If there is no related LoaderFile resource, Aidbox creates one and puts the file into the queue.

In order to import a file one more time, delete the related LoaderFile resource and relaunch aidbox.bulk/load-from-bucket, as sketched below.
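A minimal re-import sketch, assuming LoaderFile resources can be deleted through the regular REST API (the resource id is hypothetical):

```
DELETE /LoaderFile/<loader-file-id>

POST /rpc
content-type: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1
```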
Files are processed completely; the loader doesn't support partial re-import.
aidbox.bulk/load-from-bucket-status

Returns the status and progress of the import for the specified bucket. Possible states are: in-progress, completed, interrupted.
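A request sketch for the status call, assuming the same /rpc endpoint; the bucket value is a placeholder and the response shape is illustrative:

```
POST /rpc
content-type: text/yaml

method: aidbox.bulk/load-from-bucket-status
params:
  bucket: s3://your-bucket-id

# illustrative response:
#   result:
#     status: in-progress
```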