Bulk import from an S3 bucket
Load data from files in an existing Amazon Simple Storage Service (Amazon S3) bucket with maximum performance.
This operation loads data from a set of `.ndjson.gz` files in an AWS bucket directly into the Aidbox database with maximum performance.
Be careful: you should run only one replica of Aidbox while using this operation.
1. The file must consist of resources of the same type.
2. The file name must start with the name of the resource type; an optional postfix may follow, and the `.ndjson` extension is required. Files can be placed in subdirectories of any level. Files with a wrong path structure will be ignored.
3. Every resource in the `.ndjson` files MUST contain an `id` property.
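The requirements above can be checked when producing the files. A minimal sketch in Python (the file name and resource content are hypothetical examples, not part of the Aidbox API):

```python
import gzip
import json

# All resources in one file share a single type, the file name starts
# with that type, and every resource carries an "id" property.
patients = [
    {"id": "pt-1", "resourceType": "Patient", "name": [{"family": "Doe"}]},
    {"id": "pt-2", "resourceType": "Patient", "name": [{"family": "Roe"}]},
]

with gzip.open("Patient.ndjson.gz", "wt", encoding="utf-8") as f:
    for resource in patients:
        assert "id" in resource  # required by the loader
        f.write(json.dumps(resource) + "\n")

# Read the file back to verify one resource per line.
with gzip.open("Patient.ndjson.gz", "rt", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(len(lines))  # 2
```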
Accepts an object with the following structure:
- `bucket` * defines your bucket connection string in the format `s3://your-bucket-id`.
- `thread-num` defines how many threads will process the import. The default is 4.
- `access-key-id` * AWS key ID
- `secret-access-key` * AWS secret key
- `region` * AWS bucket region
- `disable-idx?` the default is `false`. When `true`, all indexes for the resources whose data are going to be loaded are dropped; the indexes are restored at the end of a successful import. Information about the dropped indexes is stored until they are restored.
- `drop-primary-key?` the default is `false`. The same as the previous parameter, but drops the primary key constraint on the resource tables. This parameter disables all duplicate checks for imported resources.
- `upsert?` the default is `false`, in which case importing files that violate the `id` uniqueness constraint fails with an error. If `true`, records in the database are overridden with records from the import. Even when `true`, it is still not allowed to have more than one record with the same `id` in one import file. Setting this option to `true` decreases performance.
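Because duplicate `id`s within a single file are rejected even with `upsert?` enabled, it can be worth pre-checking files before upload. A hypothetical helper (not part of Aidbox):

```python
import json

def duplicate_ids(ndjson_lines):
    """Return the set of ids that occur more than once in an NDJSON payload."""
    seen, dups = set(), set()
    for line in ndjson_lines:
        rid = json.loads(line)["id"]
        if rid in seen:
            dups.add(rid)
        seen.add(rid)
    return dups

lines = [
    '{"id": "pt-1", "resourceType": "Patient"}',
    '{"id": "pt-2", "resourceType": "Patient"}',
    '{"id": "pt-1", "resourceType": "Patient"}',
]
print(duplicate_ids(lines))  # {'pt-1'}
```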
- The file-ordering parameter defaults to `optimal`. It establishes the order in which the files are processed; the `optimal` value provides the best performance. `by-last-modified` should be used with `thread-num = 1` to guarantee a stable order of file processing.
- `prefixes` an array of prefixes that specifies which files should be processed. Example: with the value `["fhir/1/", "fhir/2/Patient"]`, only the files from the `fhir/1` directory and the `Patient` files from the `fhir/2` directory will be processed.
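Assuming the selection is a plain prefix test on object keys (an assumption inferred from the example, not stated explicitly in this doc), the example above behaves like this:

```python
def selected(keys, prefixes):
    """Keep only the object keys that start with one of the given prefixes."""
    return [k for k in keys if any(k.startswith(p) for p in prefixes)]

keys = [
    "fhir/1/Patient.ndjson.gz",      # matches "fhir/1/"
    "fhir/1/Observation.ndjson.gz",  # matches "fhir/1/"
    "fhir/2/Patient.ndjson.gz",      # matches "fhir/2/Patient"
    "fhir/2/Observation.ndjson.gz",  # matches no prefix; skipped
]
prefixes = ["fhir/1/", "fhir/2/Patient"]
print(selected(keys, prefixes))
```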
- `connect-timeout` the default is `0`. Specifies the number of milliseconds after which the file is considered failed if a connection to the resource could not be established (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
- `read-timeout` the default is `0`. Specifies the number of milliseconds after which the file is considered failed if there is no data available to read (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
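Putting the parameters together, a request body can be assembled as below. This sketch only builds the payload, it does not send it; the `method`/`params` envelope follows Aidbox's RPC convention, but verify the exact transport against your Aidbox version, and the credential placeholders are hypothetical:

```python
# Parameters for aidbox.bulk/load-from-bucket, as documented above.
params = {
    "bucket": "s3://your-bucket-id",
    "thread-num": 4,
    "access-key-id": "<AWS_KEY_ID>",          # placeholder, not a real key
    "secret-access-key": "<AWS_SECRET_KEY>",  # placeholder, not a real key
    "region": "us-east-1",
    "upsert?": False,
    "prefixes": ["fhir/1/", "fhir/2/Patient"],
    "connect-timeout": 0,
    "read-timeout": 0,
}

request_body = {
    "method": "aidbox.bulk/load-from-bucket",
    "params": params,
}
print(request_body["method"])
```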
On success, returns the string "Upload started", for example:
`message: Upload from bucket <s3://your-bucket-id> started. 6 new files added.`
On failure, returns an error message.
For each file being imported via the `load-from-bucket` method, Aidbox creates a `LoaderFile` resource. To find out how many resources were imported from a file, check its `LoaderFile` resource.
A PostgreSQL duplicate-key error, for example, looks like this:
`"message": "23505: ERROR: duplicate key value violates unique constraint \"patient_pkey\""`
Sources of Error
This request has the following sources of error:
- AWS Error
- PostgreSQL Error
- Aidbox Error
Any errors other than the above are caught as Aidbox errors. An error message is provided if available.
`aidbox.bulk/load-from-bucket` checks whether files from the bucket were already planned for import and decides what to do:
- If an `.ndjson.gz` file has a related `LoaderFile` resource, the loader skips this file.
- If there is no related `LoaderFile` resource, Aidbox puts the file into the queue, creating a `LoaderFile` resource for it.
To import a file one more time, delete the related `LoaderFile` resource and relaunch the import.
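The re-import steps can be expressed as a plan of HTTP calls. This is a sketch only: the `/LoaderFile/<id>` REST path is an assumption based on Aidbox's standard per-resource API, so verify it in your setup before use.

```python
def reimport_plan(loader_file_id, params):
    """Build the two calls needed to re-import a file:
    delete its LoaderFile resource, then relaunch the import.
    The /LoaderFile path is assumed, not confirmed by the docs."""
    return [
        ("DELETE", f"/LoaderFile/{loader_file_id}", None),
        ("POST", "/rpc", {"method": "aidbox.bulk/load-from-bucket",
                          "params": params}),
    ]

plan = reimport_plan("file-42", {"bucket": "s3://your-bucket-id"})
for verb, path, _body in plan:
    print(verb, path)
```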
Files are processed completely. The loader doesn't support partial re-import.
Returns the status and progress of the import for the specified bucket. Possible states include:
`interrupted` means that Aidbox was restarted during the loading process. If you run the `aidbox.bulk/load-from-bucket` operation again on the same bucket, the import will continue.