DAX - the next generation: towards one million processes on commodity hardware

Large scale image processing demands a standardized way of not only storage but also a method for job distribution and scheduling. The eXtensible Neuroimaging Archive Toolkit (XNAT) is one of several platforms that seeks to solve the storage issues. Distributed Automation for XNAT (DAX) is a job control and distribution manager. Recent massive data projects have revealed several bottlenecks for projects with <100,000 assessors (i.e., data processing pipelines in XNAT). In order to address these concerns, we have developed a new API, which exposes a direct connection to the database rather than REST API calls to accomplish the generation of assessors. This method, consistent with XNAT, keeps a full history for auditing purposes. Additionally, we have optimized DAX to keep track of processing status on disk (called DISKQ) rather than on XNAT, which greatly reduces load on XNAT by vastly dropping the number of API calls. Finally, we have integrated DAX into a Docker container with the idea of using it as a Docker controller to launch Docker containers of image processing pipelines. Using our new API, we reduced the time to create 1,000 assessors (a sub-cohort of our case project) from 65040 seconds to 229 seconds (a decrease of over 270 fold). DISKQ, using pyXnat, allows launching of 400 jobs in under 10 seconds which previously took 2,000 seconds. Together these updates position DAX to support projects with hundreds of thousands of scans and to run them in a time-efficient manner.