Tutorial: Workflow Parallelization¶
Warning
Workflows should be optimized on a test set of images before being run over a whole dataset. See the VIS workflow tutorial or VIS/NIR tutorial. Our download tool, which talks to a LemnaTec database system, expects a specific file structure that may differ from yours unless you are using our tool; we also provide instructions for running PlantCV over a flat file directory (keep this in mind).
Running PlantCV workflows over a dataset¶
We normally execute workflows in a shell script or in a cluster scheduler job file. The parallelization tool `plantcv-run-workflow` has many configuration parameters. To make the input parameters easier to manage, they can be edited in, and read from, a configuration file.
Configuration-based method¶
To create a configuration file, run the following:
plantcv-run-workflow --template my_config.txt
The code above saves a text configuration file in JSON format using the built-in defaults for the parameters. The parameters can be modified directly in Python as demonstrated in the WorkflowConfig documentation, and a configuration can be saved at any time with the `save_config` method for later use. Alternatively, open the saved config file with your favorite text editor and adjust the parameters as needed.
Some notes on JSON format:
- Like Python, string values (e.g. "VIS") need to be in quotes, but they must be double quotes (`"`).
- Unlike Python, `true` and `false` in JSON are lowercase.
- `None` in Python translates to `null` in JSON.
- `\` characters need to be escaped in JSON, e.g. `\d` in Python becomes `\\d` in JSON.
- There are no comments in JSON.

These differences between JSON and Python are converted automatically if you make changes to the config in Python and then use `save_config`.
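The conversions above can be seen directly with Python's built-in `json` module, which handles them automatically (this is an illustration of the JSON rules, not PlantCV code):

```python
import json

# Python values are serialized to their JSON equivalents:
config_fragment = {
    "imgtype": "VIS",    # strings keep double quotes in JSON
    "writeimg": True,    # True  -> true
    "tmp_dir": None,     # None  -> null
    "delimiter": "\\d",  # a literal backslash is written as \\ in JSON
}

print(json.dumps(config_fragment, indent=4))
```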
Once configured, a workflow can be run in parallel over a dataset using the command:
plantcv-run-workflow --config config.json
As noted on the WorkflowConfig page, `plantcv-run-workflow` can be configured to run PlantCV workflows locally or to distribute them to a cluster through a scheduler service (e.g. HTCondor, SLURM, etc.).
Running PlantCV workflows over a flat directory of images¶
Note
PlantCV can analyze images in parallel that are stored in a directory (including subdirectories). Our aim is to make this process as flexible as possible but consistency in naming images is key. Ideally image filenames are constructed of metadata information separated by a consistent delimiter (though we provide a regular expression-based parser if needed). Please follow the instructions below carefully.
In order for PlantCV to extract all of the necessary metadata from the image files, the files need to be named in a consistent way. An image name might include:
- Plant ID
- Timestamp
- Measurement/Experiment Label
- Image Type
- Camera Label
- Zoom
Example Name:
AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500.png
- Plant ID = AABA002948
- Timestamp = 2014-03-14 03-29-45
- Measurement Label = Pilot-031014
- Image Type = VIS
- Camera Label = TV
- Zoom = z3500
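As an illustration of how delimiter-based parsing works (not PlantCV's internal code), splitting the example filename on the `_` delimiter recovers each metadata field in order:

```python
import os

filename = "AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500.png"

# Drop the extension, then split on the "_" delimiter
name, _ext = os.path.splitext(filename)
plant_id, timestamp, label, imgtype, camera, zoom = name.split("_")

print(plant_id)   # AABA002948
print(timestamp)  # 2014-03-14 03-29-45
print(zoom)       # z3500
```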
Valid Metadata
Valid metadata that can be collected from filenames are `camera`, `imgtype`, `zoom`, `exposure`, `gain`, `frame`, `rotation`, `lifter`, `timestamp`, `id`, `barcode`, `treatment`, `cartag`, `measurementlabel`, and `other`.
To correctly process timestamps, you need to specify the timestamp format (the `timestampformat` configuration parameter) using the format codes of the `strptime` C library function. For the example above you would use `"timestampformat": "%Y-%m-%d %H-%M-%S"`.
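The format string can be checked with Python's standard `datetime.strptime`, which uses the same format codes:

```python
from datetime import datetime

# Same codes as the "timestampformat" configuration parameter
ts = datetime.strptime("2014-03-14 03-29-45", "%Y-%m-%d %H-%M-%S")
print(ts.year, ts.month, ts.day)  # 2014 3 14
```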
Example configuration¶
Sample image filename: cam1_16-08-06-16:45_el1100s1_p19.jpg
```json
{
    "input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
    "json": "edger-round1-brassica.json",
    "filename_metadata": ["camera", "timestamp", "id", "other"],
    "workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
    "img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
    "tmp_dir": null,
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "_",
    "metadata_filters": {"camera": "cam1"},
    "timestampformat": "%y-%m-%d-%H:%M",
    "writeimg": true,
    "other_args": {},
    "groupby": ["filepath"],
    "group_name": "auto",
    "cleanup": true,
    "append": true,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
        ...
    }
}
```
Running `plantcv-run-workflow --config config.json` with the example configuration above will run the PlantCV workflow script `2016-08_pat-edger_brassica-cam1-splitimg.py` on the images in the input directory, using an HTCondor compute cluster with up to 16 worker jobs.
Using a pattern matching-based filename metadata parser¶
If image filenames do not use a consistent delimiter (e.g. rgb_plant-1_2019-01-01 10_00_00.png) throughout,
then using the delimiter
parameter with a single separator character will not split the filename properly
into the component metadata parts. An advanced option to extract metadata in this situation is to use pattern
matching, or regular expressions. The delimiter
parameter
will accept a regular expression in place of a delimiter character. Example:
Example filename: `rgb_plant-1_2019-01-01 10_00_00.png`
Metadata components: `imgtype`, `plantbarcode`, `timestamp`
`"delimiter": "_"` will not work because the timestamp contains `_` characters.
Regular expression: `'(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})'`
Interpreting the example pattern
A key part of the pattern is the use of parentheses: in regular expression syntax these mark the start and end of a group that is returned from a match (in other words, parsed for our purposes). Regular expression patterns can be as general or specific as needed. The pattern above reads as:
- Group 1 (camera): any 3 characters
- Underscore
- Group 2 (plantbarcode): 1 or more of any character
- Underscore
- Group 3 (timestamp): 4 digits, dash, 2 digits, dash, 2 digits, space, 2 digits, underscore, 2 digits, underscore, 2 digits
Note that the number of groups returned by the regular expression must match the number of metadata terms provided in the list given to the `filename_metadata` parameter.
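The pattern can be tested with Python's standard `re` module before it goes into the config; each group lines up with one metadata term, in order:

```python
import re

pattern = r'(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})'
filename = "rgb_plant-1_2019-01-01 10_00_00.png"

match = re.match(pattern, filename)
# One group per metadata term, in the same order as filename_metadata
camera, plantbarcode, timestamp = match.groups()
print(camera, plantbarcode, timestamp)  # rgb plant-1 2019-01-01 10_00_00
```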
Example configuration:
```json
{
    "input_dir": "input_directory",
    "json": "output.json",
    "filename_metadata": ["camera", "plantbarcode", "timestamp"],
    "workflow": "user-workflow.py",
    "img_outdir": "output_directory",
    "tmp_dir": null,
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "(.{3})_(.+)_(\\d{4}-\\d{2}-\\d{2} \\d{2}_\\d{2}_\\d{2})",
    "metadata_filters": {},
    "timestampformat": "%Y-%m-%d %H_%M_%S",
    "writeimg": true,
    "other_args": {},
    "groupby": ["filepath"],
    "group_name": "auto",
    "cleanup": true,
    "append": true,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
        ...
    }
}
```
Note that in JSON the regular expression must use double quotes and escaped backslashes (`\\d` instead of `\d`).
If you need help building a regular expression, https://regexr.com/ is a useful site to help build and interpret patterns. Also feel free to post an issue.
Grouping images for multi-image workflows¶
Advanced PlantCV workflows can co-analyze multiple images. For example, a dataset containing an RGB and grayscale near-infrared image could be co-analyzed in a single workflow.
Sample image filenames: `rgb_16-08-06-16:45_el1100s1_p19.jpg` and `nir_16-08-06-16:45_el1100s1_p19.jpg`
Note in the example above, the two filenames are the same other than the indicated image type (rgb or nir).
In the example configuration below, we can group these images by timestamp
because they share this metadata.
To identify each image within our workflow, we will name them based on the imgtype
metadata values (rgb and nir).
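The grouping idea can be sketched in plain Python (illustrative only, not PlantCV internals): bucket files that share a timestamp, then label each member of a group by its image type. The second timestamp below is a hypothetical addition to show multiple groups:

```python
from collections import defaultdict

filenames = [
    "rgb_16-08-06-16:45_el1100s1_p19.jpg",
    "nir_16-08-06-16:45_el1100s1_p19.jpg",
    "rgb_16-08-06-17:00_el1100s1_p19.jpg",  # hypothetical later capture
    "nir_16-08-06-17:00_el1100s1_p19.jpg",
]

# "groupby": ["timestamp"] -> bucket images by the timestamp field;
# "group_name": "imgtype"  -> name each image in a group by its type
groups = defaultdict(dict)
for name in filenames:
    imgtype, timestamp = name.split("_")[:2]
    groups[timestamp][imgtype] = name

for timestamp, members in groups.items():
    print(timestamp, sorted(members))  # each group holds an 'nir' and an 'rgb' image
```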
```json
{
    "input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
    "json": "edger-round1-brassica.json",
    "filename_metadata": ["imgtype", "timestamp", "id", "other"],
    "workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
    "img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
    "tmp_dir": null,
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "_",
    "metadata_filters": {},
    "timestampformat": "%y-%m-%d-%H:%M",
    "writeimg": true,
    "other_args": {},
    "groupby": ["timestamp"],
    "group_name": "imgtype",
    "cleanup": true,
    "append": true,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
        ...
    }
}
```
Convert the output JSON file into CSV tables¶
```bash
plantcv-utils json2csv -j output.json -c result-table
```
See Accessory Tools for more information.