Catalyst.earth Logo
Catalyst.earth Logo

Tile Processing

Aerial image of an illicit crop from above.

Processing rasters by tiles is a crucial concept when creating custom processes that can be used on large files. This prevents hardcoding the tile dimensions, as that can degrade the performance or overwhelm memory resources. In some workflows, it is beneficial, and sometimes required, to process the image pieces in batches. This is achieved by accessing a particular subset of the raster, manipulating it, writing it, and then proceeding to the next subset until all the image pixels have been processed.

The scripts described in this tutorial can be found here - Tile Processing Scripts

Benefits

The following are reasons and use cases for which tile processing would be advantageous.

  1. Memory Constraints
    • Larger raster images take up a lot of space and may not fit into memory when loaded at once. Tile processing enables smaller portions to be processed at a time, which is beneficial for machines with limited RAM
  2. Performance and Cache Efficiency
    • Processing smaller contiguous blocks of imagery improves CPU cache usage as the tiles fit better into the cache and help reduce cache misses.
  3. Streaming and Incremental Processing
    • This is important for workflows that need progressive results, since the tiles are processed independently
  4. Selective Processing
    • Processing images by tiles allows users to select parts of the image that can be processed. This is helpful for mask-based operations and for performing spatial queries. It also helps save on time as the entire imagery does not need to be processed.

Reading, Processing and Writing by Tiles

This section shows how to process rasters by iterating over tiles/subsets of the input raster using Landsat-7 MS imagery. It will demonstrate how to read the blue and green channels using 256 x 256 tiles and output their sum to a file.

Full Script

import os 
import numpy as np 
from pci.api import gobs, datasource as ds 
 
# Set input and output files for workflow 
working_dir = r'D:\Data\Tutorial\Python\Process_Tiles\proc_rasters_by_tiles'  # Set this path to the dir where you extracted the tutorial data 
input_l7 = os.path.join(working_dir, 'l7_ms_tiled256.pix') 
out_file = os.path.join(working_dir, 'landsat7_sum.pix') 
 
# Open two datasets using the with-statement, one to read from and one to write to 
with ds.open_dataset(input_l7) as read_dataset, ds.new_dataset(out_file, 'PCIDSK', '') as write_dataset: 
    reader_l7 = ds.BasicReader(read_dataset, [1, 2]) 
    in_coord_sys = reader_l7.crs 
    in_geocode = reader_l7.geocoding 
 
    # Set tile size for reading 
    reader_l7.tile_size = (256, 256)   # set to the desired tile size 
 
    # Create the writer object 
    writer = ds.BasicWriter(write_dataset) 
 
    # Create output file 
    writer_info = gobs.RasterInfo(reader_l7.width, reader_l7.height, 1, gobs.DT_16U) # Set bit-depth to accommodate pixel addition
    writer.create(writer_info)  

    # Iterate through each tile, add the blue channel pixel values with the green channel and write to output 
    for i, tile in enumerate(reader_l7, start=1): 
        print('creating tile: {0} with dimensions: {1}'.format(i, writer.tile_size))
 
        # Store raster channels as numpy arrays and convert to 8bit Unsigned Integers 
        blue_chan = tile.data[:, :, 0] 
        green_chan = tile.data[:, :, 1] 
 
        # Perform addition between blue and green channels and store results in sum_array 
        sum_array = blue_chan + green_chan
        sum_array_16u = sum_array.astype(np.uint16)   # convert to 16-bit for output raster 
 
        # Wrap array in a raster object (this doesn't make a copy of the data) 
        sum_raster = gobs.array_to_raster(sum_array_16u, tile) 
 
        # Write the pixels from sum_raster 
        writer.write_raster(sum_raster) 
        writer.crs = in_coord_sys 
        writer.geocoding = in_geocode 

1. Module Imports and Input/Output Setup

1   import os 
2   import numpy as np 
3   from pci.api import gobs, datasource as ds 
4 
5   # Set input and output files for workflow 
6   working_dir = r'D:\Data\Tutorial\Python\Process_Tiles\proc_rasters_by_tiles'  # Set this path to the dir where you extracted the tutorial data 
7   input_l7 = os.path.join(working_dir, 'l7_ms_tiled256.pix') 
8   out_file = os.path.join(working_dir, 'landsat7_sum.pix') 
9

The first step is to import the CATALYST Python API's modules, which will be called, and set up the variables that will point to the input and output file(s). In the code block above, line 2 is used to import numpy, which is an open-source library for storing and performing computations on large, complex arrays. The custom calculations are performed using this module.

Line 3 imports the gobs and datasource modules from the CATALYST Python API. These modules are used to read in and write out the raster pixels, geocoding information and coordinate system. 

On line 6, working_dir is a variable that contains the path to the directory where the input data is stored.

2. Opening Input and Output Datasets for Reading and Writing

10  # Open two datasets using the with-statement, one to read from and one to write to 
11  with ds.open_dataset(input_l7) as read_dataset, ds.new_dataset(out_file, 'PCIDSK', '') as write_dataset: 
12      reader_l7 = ds.BasicReader(read_dataset, [1, 2]) 
13      in_coord_sys = reader_l7.crs 
14      in_geocode = reader_l7.geocoding 
15 
16      # Set tile size for reading 
17      reader_l7.tile_size = (256, 256)   # set to the desired tile size
18

In this step, the Python with statement is used to open the input and create the output datasets, read in the first two channels, coordinate system and geocoding information, and define the tile size for processing. 

On line 11,  the with statement is used to call the ds.open_dataset() function and pass the input_l7 variable as its only argument, and use the as statement to refer to it as read_dataset in the following block. On the same line, the ds.new_dataset() function is also called (referred to as write_dataset in the following block), with the following arguments passed: 

  • out_file as its first argument -> variable holding the path to the output file
  • ‘PCIDSK’ string as the second argument -> defines the file format 
  • ‘’ an empty string as the final argument -> defines file format options 

On line 12, an object called reader_l7 is created by calling ds.BasicReader() class initializer where the read_dataset variable is passed as the first argument , and the image channels to be read in, [1, 2] which represent the Blue and Green channels respectively, as the second argument. Lines 13 and 14 are used to get the coordinate system and geocoding information of the input file, which are assigned to in_coord_sys and in_geocode, respectively.

On line 17, the tile size is defined. This is used to read pixels from the input file and process them by defining the reader_l7.tile_size attribute. In this case, it's set to 256 x 256 pixels. 

3. Creating The Writer

19  # Create the writer object 
20      writer = ds.BasicWriter(write_dataset) 
21 
22  # Create output file 
23      writer_info = gobs.RasterInfo(reader_l7.width, reader_l7.height, 1, gobs.DT_16U) # Set bit-depth to accommodate pixel addition
24      writer.create(writer_info)
25

On line 20, an object called writer is created using the ds.BasicWriter() initializer with write_dataset passed as the only argument. On line 23, the necessary information for creating the output file is defined by creating an object called writer_info by instantiating the class gobs.RasterInfo(). The arguments are as follows: 

  • width defines the width of the file in pixels
  • height defines the height of the file in pixels 
  • 1 is the number of channels in the output file 
  • DT_16U is the data type for the output channels 

Line 24 creates the output file by calling writer.create() with the output file information held in the writer_info object as its only argument. 

4. Iteratively Processing Each Tile

26      # iterate through each tile, add the blue channel pixel values with the green channel and write to output 
27      for i, tile in enumerate(reader_l7, start=1): 
28          print('creating tile: {0} with dimensions: {1}'.format(i, writer.tile_size))
29   
30          # store raster channels as numpy arrays and convert to 8bit Unsigned Integers 
31          blue_chan = tile.data[:, :, 0] 
32          green_chan = tile.data[:, :, 1] 
33 
34          # Perform addition between blue and green channels and store results in sum_array 
35          sum_array = blue_chan + green_chan 
36          sum_array_16u = sum_array.astype(np.uint16)   # convert to 16-bit for output raster
37
38          # wrap array in a raster object (this doesn't make a copy of the data) 
39          sum_raster = gobs.array_to_raster(sum_array_16u, tile)
40

The iterative process is performed within a for loop. The for-loop is created on line 27, where each tile in reader_l7 is processed. Line 27 also creates a variable i and sets it to start at 1. This is used for illustrative purposes only to keep track of each iteration, as it will increment by 1 for each tile. It is not needed for the actual code.

On line 28, a print-statement prints the tile number and dimensions of the tile being processed. This is solely done to help illustrate the concept and is not required in the code. 

The first element in tile is accessed on line 31 to represent the blue channel and is assigned to the blue_chan variable. The next line (line 32) does the same thing for the green channel.

Note: Instead of using astype(), consider using vectorize() to optimize performance. 

Line 35 performs the addition calculation and assigns the resulting array to the sum_array variable. This variable is then converted from a numpy array to a 16-bit unsigned integer using astype() on line 36. This is according to the tutorial data type and should be changed according to the working dataset. That is, the tutorial adds two 8-bit unsigned raster channels, so the output channel should be at least 16-bit to avoid overflow and preserve full precision. This array is then wrapped into a raster object (sum_raster) using the gobs.array_to_raster() method. This method also has a RasterContext parameter, which represents information about the margin and position of a raster. This is important to use when processing tiles, as each tile has different margin and offset information. The loop variable, tile, can be used for this parameter as it's a Raster object which includes the raster info as well as the raster context.

The array is now formatted to be written to an output file by the CATALYST Python API.

Note: array_to_raster() does not copy the values and is optimized for performance. 

5. Iteratively Writing Each Tile To Output File

41  # Write the pixels from sum_raster 
42          writer.write_raster(sum_raster) 
43          writer.crs = in_coord_sys 
44          writer.geocoding = in_geocode 
45 

This step works to write the results of the addition to the output file, one tile at a time.

Line 42 calls the write_raster() method to write the pixel values from sum_raster to the output file held in the writer object. On lines 43 and 44, the coordinate system and geocoding information are obtained from the input file (held in the in_coord_sys and in_geocode variables from earlier) are written into the writer's crs and geocoding attributes.

Once the code is done executing, the result will be a single output file called landsat7_sum.pix with the output results stored in channel 1.

Optimized Image Processing By Tiles

This section of the tutorial utilizes raw striped Landsat-8 imagery to demonstrate key concepts for optimizing throughput when processing with tiles.

The following concepts are essential to understand when processing rasters by tiles:

  1. Conditions for when files should be processed by tiles.
  2. Determining optimal tile dimensions based on input and output file layouts
    • Avoid creating tiles in the X or Y dimension that are smaller than the tile dimensions in your input and output files.
    • If the processing tile dimensions are smaller than the input/output tile dimensions, multiple reads and/or writes to a specific tile will be required, which may significantly degrade performance.
      • For example, if the input file tile dimensions are 1024 x 1 (striped), and the output file is 512 x 512, the tile dimensions for processing shouldn't be smaller than 1024 x 512.
  3. Creating a cap for how much memory should be used
    • The concept above regarding dimensions should only be broken if the optimal tile dimension would require too much memory
      • For example, if the input is a large striped file with dimensions 1,000,000 x 1, the output file is 1024 x 1024, and there are 4 channels at 16 bits per pixel, the optimal tile size of 1,000,000 x 1024 would require approximately 7.63 GB of memory. This may be too high for some systems
      • memory(bytes)=optimaldimensions∗numchan∗numbytesmemory(bytes) = optimaldimensions * numchan * numbytes
        • 1,000,000 x 1024 x 4 x 2 = 8,192,000,000 bytes = 8,000,000 KB = 7812.5 MB = 7.63 GB
    • Processing by tiles will generally degrade performance; therefore, the largest tile size possible should be used, based on an appropriate memory limit.

The code is divided into two main functions: the main() function, derived from the above section, is responsible for executing the primary program, and the get_tile_size() function calculates the optimal tile size for processing. Although this section of the tutorial focuses on the get_tile_size() function, the main() function will be referenced as it relates to it.

Note: The order of the script is import statements > get_tile_size() > main() > if __name__ == '__main__' statement

1. Module Import

1    import os 
2    import calendar 
3    import time 
4    import numpy as np 
5    from pci.api import gobs, datasource as ds 
6

This updated script adds Python's calendar and time packages (on lines 2 and 3) to derive the processing time and help illustrate the difference in time efficiency between using optimized and non-optimized tile dimensions.

2. Input/Output File and Processing Parameter Setup

111  if __name__ == '__main__': 
112      # set input output files and processing parameters for workflow 
113      working_dir = r'D:\tutorials\optimized_tile_processing'  # set this path to the dir with the input data 
114      input_file = os.path.join(working_dir, 
115                                'LC81950282015242LGN00\LC81950282015242LGN00_MTL.txt-MS') 
116      output_file = os.path.join(working_dir, 'landsat8_sum.tif') 
117      output_tile_size = 'TILED512' 
118      memory_limit = 32   # in MegaBytes (MB) 
119      overwrite = 'yes' 
120
121      if overwrite == 'yes': 
122          try: 
123              os.remove(output_file) 
124
125          except OSError: 
126              pass 
127
128      main(input_file, output_file, output_tile_size, memory_limit)  # run the main process 

This block of code will live at the very bottom of the script and is used to define the input and output files, as well as the processing parameters. If the script is run as the primary process, that is, it's not imported by another file, this block will be initialized.

On line 111, if __name__ == ‘__main__’: tells Python to only run the block of code below it if this Python file is run as the main process. 

Line 113 creates a string variable called working_dir, and sets it to the path where the raw input data is. On lines 114 and 116, the input file and the output file are set to the Landsat-8 key file and the output TIF file, respectively. Line 117 sets the tiling scheme that will be used when creating the output TIF file, which is set to 515 x 512 pixels. On line 118, the memory limit is defined as 32MB, which is quite small, but sufficient for this tutorial. Lines 121 to 126 create a simple rule for whether or not to overwrite the output file. Finally, online 128, the main() function is called and passes the input and output files, and processing variables as arguments. 

3. The Primary Process - main() Function

50   def main(in_file, out_file, out_tile_size, mem_lim): 
51       """ 
52       This function contains the main process for this workflow, which will process an image
53       by tiles, using optimal tile dimensions, computed by the get_tile_dimensions() 
54       function.  
55       :param in_file: path to input file for processing 
56       :param out_file: path and name of output file for writing 
57       :param out_tile_size: tiling method 
58       :param mem_lim: user defined memory limit for determining optimal tile size 
59       :return: none 
60       """ 
61
62       # open two datasets using the with-statement, one to read from and one to write to 
63       with ds.open_dataset(in_file) as read_dataset, ds.new_dataset(out_file, 'TIFF', out_tile_size) as write_dataset: 
64           in_chans = [2, 3] 
65           reader_l8 = ds.BasicReader(read_dataset, in_chans) 
66           in_coord_sys = reader_l8.crs 
67           in_geocode = reader_l8.geocoding 
68           in_raster = reader_l8.read_raster(0, 0, reader_l8.width, reader_l8.height) 
69
70           # create the writer object 
71           writer = ds.BasicWriter(write_dataset) 
72
73           # create output file 
74           out_chan = 1 
75           writer_info = gobs.RasterInfo(reader_l8.width, reader_l8.height, 
76                                         out_chan, gobs.DT_32R) # Set bit-depth to accommodate pixel addition  
77           writer.create(writer_info) 
78
79           # get and set the optimal tile size 
80           tile_sz_x, tile_sz_y = get_tile_size(reader_l8, writer, mem_lim, 
81                                                len(in_chans), out_chan) 
82           reader_l8.tile_size = (tile_sz_x, tile_sz_y) 
83           
84           start_time = calendar.timegm(time.gmtime()) 
84
86           # iterate through each tile, add the blue channel pixel values with the green channel and write to output 
87           for i, tile in enumerate(reader_l8, start=1): 
88               print('creating tile: {0} with dimensions: {1}'.format(i,reader_l8.tile_size))
89 
90               # store raster channels as numpy arrays and convert to 32-bit real 
91               blue_chan = in_raster.data[:, :, 0] 
92               green_chan = in_raster.data[:, :, 1] 
93 
94               # Perform addition between blue and green channels and store results in sum_array 
95               sum_array = blue_chan + green_chan
96               # convert to 32-bit for output raster
97               sum_array_32r = sum_array.astype(np.float32)   
98 
99               # wrap array in a raster object (this doesn't make a copy of the data) 
100               sum_raster = gobs.array_to_raster(sum_array_32r) 
101            
102              # write the pixels from sum_raster, the input coordinate system and geocoding information to output file 
103              writer.write_raster(sum_raster) 
104              writer.crs = in_coord_sys 
105              writer.geocoding = in_geocode 
106
107          # Print out the processing time
108          proc_time = calendar.timegm(time.gmtime()) - start_time 
109          print('\nProcessing took {0} seconds!'.format(proc_time))
110

The main() function holds the primary code for processing by tiles derived from the original script in the first part of this tutorial. Although it contains a few tweaks to accommodate Landsat-8 imagery, the general premise is the same, with the inclusion of a call to the get_tile_size() function to set the optimal tile size. These two functions will live between the import statements and the if __name__ == '__main__': sections.

The main() function takes in 4 arguments:

  • in_file - the input file
  • out_file - the output file
  • out_tile_size - the output tile size
  • mem_lim - the memory limit

Although the get_tile_size() function has not been defined yet, it is called on line 80 and returns the optimal x and y tile sizes (tile_sz_x and tile_sz_y) based on the tiling schemes of the input and output data, as well as the user-defined memory limit.

4. Defining get_tile_size()

7    def get_tile_size(in_reader, out_writer, mem_lim, num_in_chans, num_out_chans): 
8        """ 
9        This function determines the optimal tile dimensions for processing a file based 
10       on the tile dimensions of the input file and output file  
11       :param in_reader: the reader object that contains the tiling information for the input file 
12       :param out_writer: the writer object that contains the tiling information for the output file 
13       :param mem_lim: user defined memory limit in MB 
14       :param num_in_chans: number of input channels that are being read in by the in_reader object 
15       :param num_out_chans: number of output channels that are created by the out_writer object 
16       :return: tile_size_x, tile_size_y 
17       """ 
18
19       in_reader.set_optimal_tile_size() 
20       out_writer.set_optimal_tile_size() 
21
22       max_tile_x = max([in_reader.tile_size[0], out_writer.tile_size[0]]) 
23       max_tile_y = max([in_reader.tile_size[1], out_writer.tile_size[1]]) 
24       in_bytes = in_reader.datatype.bits_per_pixel / 8 
25       out_bytes = out_writer.datatype.bits_per_pixel / 8 
26       

Before the main() function and below the import statements, the get_tile_size() function is created to compute the optimal tile size, in the x and y directions, based on the tiling scheme of the input and output files. The parameters passed to this function are as follows:

  • in_reader: the input reader object that contains tiling information about the input file to be processed.
  • out_writer: the output writer object that contains tiling information about the output file that will be created.
  • mem_lim: the user-defined memory limit. This sets the upper limit of how much memory can be used, which will define the maximum tile size.
  • num_in_chans: the number of channels that will be read in during processing. This impacts memory usage.
  • num_out_chans: the number of channels that will be written to the output file. This impacts memory usage.

Lines 19 and 20 obtain the tile dimensions in the input and output files by calling the in_reader.set_optimal_tile_size() and out_writer.set_optimal_tile_size() methods. Lines 22 and 23 then get the largest tile size in each direction between the files. This is essential as the processing tile size should be equal to or larger than the largest size in each direction, as long as it meets the memory limit.

On lines 24 and 25, the bit-depth of the input and output channels is obtained (in bytes) through the  in_reader.datatype.bits_per_pixel and out_writer.datatype.bits_per_pixel properties. This information is stored in the in_bytes and out_bytes variables and is used to calculate the memory requirements for the optimal tile size.

Figure 1 - Illustration of the optimal tile size

5. Calculating Memory Necessary For Optimal Tile Size

27       mem_req_bytes = max_tile_x * max_tile_y * ((num_in_chans * in_bytes) + 
28                                                  (num_out_chans * out_bytes)) 
29       mem_req_megabytes = mem_req_bytes / 1048576.0 
30

Using the information about the input/output tile size in each direction, the number of channels and the bit depth of each channel, the amount of memory required for processing the tiles can be calculated. This is done on lines 27 & 28, where the result is in bytes. Line 29 converts the required memory from bytes to megabytes (MB) so it can be compared to the user-defined memory limit.

6. Checking The Memory Requirement

31   if mem_lim > mem_req_megabytes: 
32       print('\n$$$$$$$$$ Optimal tile size meets memory limit $$$$$$$$$')
33       print('The memory limit of: {0} is greater than the memory required' \ 
34             'for optimal tiling of: {1}\n'.format(mem_lim, mem_req_megabytes))
35
36       tile_size_x = max_tile_x 
37       tile_size_y = max_tile_y 
38
39   else: 
40       # you can be more creative in creating a dynamic solution to lowering the tile size to meet your memory limit 
41       print('\n$$$$$$$$$ Optimal tile size exceeds memory limit $$$$$$$$$')
42       print('The memory limit of: {0} is less than the memory required' \ 
43             'for optimal tiling of: {1}\n'.format(mem_lim, mem_req_megabytes))
44
45       tile_size_x = 512 
46       tile_size_y = 512 
47
48   return tile_size_x, tile_size_y 
49

On line 31, the memory limit (mem_lim) is checked to see if it's greater than the required memory (mem_req_megabytes). If it is true, then the x and y tile sizes (tile_size_x and tile_size_y) are set to the max tile sizes (max_tile_x and max_tile_y) - the optimized tile sizes (lines 36 and 37). If the memory limit is much bigger than the required memory, rules to further increase the tile size should be created. As a general rule, the largest tile size possible should be used.

Note: Further optimizations should be done here.

The else-condition on line 39 is used to handle the case where the optimal tile size would require more memory than the user-defined limit. In the case of this tutorial, the tile size is set to 512 pixels.

Note: Hardcoding a tile size is not recommended, but was done to keep this tutorial relatively simple.

Finally, on line 48, the tile size variables, tile_size_x and tile_size_y, are returned to be used in the main() function for processing. 

Results

As explained earlier, the function of the script is to illustrate iterative raster tile processing by calculating the addition of the Blue and Green channels, and writing the pixel sums to a new file.

Figure 3 - Landsat-8 Blue Channel (pixel value = 9285)
Figure 4 - Landsat-8 Green Channel (pixel value = 8858)
Figure 5 - Landsat-8 Sum Output Raster (pixel value = 18143)
Figure 6 - When using tile dimensions that are not optimized for the data, considerably more time is required for processing  
envelopephone-handsetcross