3.3.7. pci.api.path module

This module contains functions to find files from various input sources. The files can be filtered by using subclasses of AbstractFilenameFilter.

3.3.7.1. Examples

The following example shows how to search for files in the folder ‘/data’:

from pci.api.path import find_files
from pci.api.path import find_files_by_masks
from pci.api.path import find_files_by_exclusive_masks

# find files that contain '_RAW_' in the file name
print(find_files('_RAW_', ['/data']))

# find files that contain '_PAN' in the file name
print(find_files('_PAN', ['/data']))

# find files that start with 'dem'
print(find_files_by_masks('/data', 'dem*'))

# find files that have a 'txt' or 'tif' file name extension
print(find_files_by_masks('/data', ['*.txt', '*.tif']))

# find files that have a 'txt' or 'tif' file name extension, and do not search recursively
print(find_files_by_masks('/data', ['*.txt', '*.tif'], False))

# find files that do not have the file name extension 'pix'
print(find_files_by_exclusive_masks('/data', ['*.pix']))

The preceding is based on a folder ‘/data’ with the following structure:

  • Scene_001_RAW_MS.pix

  • Scene_001_RAW_PAN.pix

  • image1.tif

  • Scene_002_RAW_MS.pix

  • Scene_001_ORTHO_PAN.pix

  • readme.txt

  • auxiliary

    • readme.txt

    • dem.pix

Note: The ‘auxiliary’ folder contains two files.

The output will be as follows:

['/data/Scene_001_RAW_MS.pix', '/data/Scene_001_RAW_PAN.pix', '/data/Scene_002_RAW_MS.pix']
['/data/Scene_001_RAW_PAN.pix', '/data/Scene_001_ORTHO_PAN.pix']
['/data/auxiliary/dem.pix']
['/data/image1.tif', '/data/readme.txt', '/data/auxiliary/readme.txt']
['/data/image1.tif', '/data/readme.txt']
['/data/image1.tif', '/data/readme.txt', '/data/auxiliary/readme.txt']

New in version 2018.

3.3.7.2. Finding files

pci.api.path.file_exists_in_dir_case_insensitive(directory, basename)

Return True if a file named basename exists in directory. This function uses find_file_case_insensitive() to do the search.

pci.api.path.find_data_files(directory)

Return a list of valid datasets from directory that can be opened using pci.api.datasource.open_dataset().

pci.api.path.find_direct_files(directory)

Return a list of paths to the files in the directory.

pci.api.path.find_file_case_insensitive(directory, basename)

Find the filename of the specified basename in the directory. The comparison is case-insensitive. This function may be expensive because it iterates over all files in the directory and does a case-insensitive comparison on each. If a match is found on disk, it is returned; otherwise, None is returned. The returned filename matches the case used in the file system.
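
The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the PCI implementation: the helper name find_name_ignoring_case is hypothetical, and it assumes the bare file name (rather than a full path) is what gets returned.

```python
import os
import tempfile


def find_name_ignoring_case(directory, basename):
    """Return the on-disk name that matches basename case-insensitively, or None.

    Hypothetical stand-in for illustration only.
    """
    wanted = basename.casefold()
    # Every directory entry is inspected, which is why this can be slow
    # for large directories.
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            return entry  # spelled exactly as it is in the file system
    return None


with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "Test_MS.pix"), "w").close()
    print(find_name_ignoring_case(demo_dir, "test_ms.PIX"))  # Test_MS.pix
    print(find_name_ignoring_case(demo_dir, "missing.pix"))  # None
```

The cost grows with the number of entries in the directory, so caching the result is worth considering when the same directory is queried repeatedly.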

pci.api.path.find_files(search_pattern, search_dirs, ignore_case=True)

Recursively search the directories listed in search_dirs for files whose names contain search_pattern. If ignore_case is True (the default), patterns are matched case-insensitively. A list containing all matched files is returned.

search_pattern can be any part of a file name, such as ‘_raw__’ or ‘.PIX’.
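
To make the matching rule concrete, here is a rough standard-library sketch of a recursive substring search. It is not the PCI code: find_by_substring is a hypothetical name, and plain substring matching on the file name is an assumption drawn from the description above.

```python
import os
import tempfile


def find_by_substring(search_pattern, search_dirs, ignore_case=True):
    """Collect files whose name contains search_pattern (hypothetical helper)."""
    needle = search_pattern.lower() if ignore_case else search_pattern
    matches = []
    for top in search_dirs:
        # os.walk visits top and every directory below it.
        for root, _dirs, files in os.walk(top):
            for name in sorted(files):
                candidate = name.lower() if ignore_case else name
                if needle in candidate:
                    matches.append(os.path.join(root, name))
    return matches


with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "auxiliary"))
    for rel in ("Scene_001_RAW_MS.pix", "image1.tif",
                os.path.join("auxiliary", "dem.pix")):
        open(os.path.join(top, rel), "w").close()
    hits = find_by_substring("_raw_", [top])
    print([os.path.relpath(p, top) for p in hits])  # ['Scene_001_RAW_MS.pix']
```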

pci.api.path.find_files_by_masks(input_source, masks, recursive=True, quiet=False)

Find files that match the list of masks (see pci.api.finder.HandlingFilteredFilenameFinder) in input_source (a path or a pci.api.inputsource.InputSourceCollection). If recursive is True, input_source is searched recursively. If quiet is set, the log messages from the filter are suppressed. A list of file paths is returned on success. An Exception is raised if the input_source argument is invalid.
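
The effect of the masks and recursive arguments can be approximated as follows. This sketch assumes shell-style wildcards as implemented by fnmatch and a plain directory path as input; files_matching_masks is a hypothetical name, and the real function also handles InputSourceCollection objects and logging, which are not modelled here.

```python
import fnmatch
import os
import tempfile


def files_matching_masks(directory, masks, recursive=True):
    """Return files under directory whose names match any mask (hypothetical)."""
    if isinstance(masks, str):
        masks = [masks]  # accept a single mask as well as a list
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if any(fnmatch.fnmatch(name, mask) for mask in masks):
                found.append(os.path.join(root, name))
        if not recursive:
            break  # only the top-level directory is inspected
    return found


with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "auxiliary"))
    for rel in ("image1.tif", "readme.txt", "scene.pix",
                os.path.join("auxiliary", "readme.txt")):
        open(os.path.join(top, rel), "w").close()
    print(len(files_matching_masks(top, ["*.txt", "*.tif"])))         # 3
    print(len(files_matching_masks(top, ["*.txt", "*.tif"], False)))  # 2
```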

pci.api.path.is_local_file(filename) → bool

Determine if the given file is on a local drive or on a remote drive. Relative paths are all considered local.

On Windows, paths that start with a drive letter are treated as local. This is not always accurate, because a drive letter can be mapped to a network location.

On Linux, local paths are checked with a call to os.major(fileStat.st_dev), and the major value is compared against a hard-wired set of expected values, ValidLocalDeviceMajorLinux. The device list that these values were taken from is no longer online. The expected values have worked for the disk configurations that PCI has used, but the newer buildcentos8 VM uses pass-through to get better performance, which exposes the actual physical drive.

This raises the question of whether isLocalFile() has been returning the correct result on customers’ installations. Possibly not, but that would be hard to determine. isLocalFile() appears to be used to decide whether to copy files to a temporary directory, so an incorrect result might show up only as decreased performance.

If the goal is for isLocalFile() to return correct results, one solution would be to use the blkinfo package, which returns information based on the lsblk command and could supply a set of block-device major numbers. However, testing on centos7 and centos8 indicated that these values may not correspond to the values from os.major(). In the same testing, os.major() returned 0 for a file on a network drive, but it is unclear whether that result is portable.

Another approach would be to use the GNU df command with the -l option. For a file on a network drive, df prints a warning instead of the usual output for a local file. However, the -l option is not standard.

A third approach would be to let users update ValidLocalDeviceMajorLinux for their particular installation, with a utility provided to determine the valid values for that installation. It is an open question whether each node would need to be configured separately, and whether this is still appropriate in the cloud.

In the long run, it may not matter if sufficient file caching is done.

With no clear path forward at this time, and with the only immediate requirement being that all unit tests pass on centos8, the fix for now is simply to extend the ValidLocalDeviceMajorLinux set. These comments are recorded here rather than in a new card.

filename is the name of the file to check. Returns True if the file is local, or False otherwise.

New in version CATALYST: 3.1
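
The Linux check described above (a device major number compared against a fixed set) can be sketched as below. This is only an illustration and runs on POSIX systems only: the helper names and the EXAMPLE_LOCAL_MAJORS values are hypothetical and are not the contents of ValidLocalDeviceMajorLinux.

```python
import os
import tempfile

# Hypothetical allow-list for illustration only; suitable values depend on
# the installation and are not taken from the PCI source.
EXAMPLE_LOCAL_MAJORS = {8, 259}


def device_major(path):
    """Return the major device number of the device that holds path."""
    return os.major(os.stat(path).st_dev)


def looks_local(path, local_majors=EXAMPLE_LOCAL_MAJORS):
    """Return True if the device major for path is in the allow-list."""
    return device_major(path) in local_majors


probe = tempfile.gettempdir()
print(device_major(probe))  # value depends on the machine
print(looks_local(probe))   # result depends on the allow-list above
```

Printing device_major() for a few known local and network paths is a quick way to see which values a particular installation actually reports.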

pci.api.path.locate_file(input_file, source_dir)

Locate the file input_file in the directory source_dir or any of its subdirectories. This function returns the full path to input_file; if the file is not found, an empty string is returned.
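
A rough equivalent using os.walk is shown below to illustrate the return convention (full path on success, empty string otherwise). The name locate is hypothetical, and exact, case-sensitive name matching with the first hit in walk order is an assumption.

```python
import os
import tempfile


def locate(input_file, source_dir):
    """Return the full path of input_file under source_dir, or '' if absent."""
    for root, _dirs, files in os.walk(source_dir):
        if input_file in files:
            return os.path.join(root, input_file)
    return ""  # an empty string, not None, signals "not found"


with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "auxiliary"))
    open(os.path.join(top, "auxiliary", "dem.pix"), "w").close()
    print(os.path.relpath(locate("dem.pix", top), top))
    print(repr(locate("missing.pix", top)))  # ''
```

Because the "not found" value is an empty string, a caller can test the result with a plain truthiness check.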

pci.api.path.search_directory(target, directory, ignore_case=True, search_type=SearchType.SEARCH_BOTH_FILE_AND_DIRECTORY)

Search directory for a file or directory named target. If ignore_case is True, file and directory names are matched case-insensitively. Depending on the value of search_type, this function searches for files only (pci.api.inputsource.SearchType.SEARCH_FILE_ONLY), directories only (pci.api.inputsource.SearchType.SEARCH_DIRECTORY_ONLY), or both (pci.api.inputsource.SearchType.SEARCH_BOTH_FILE_AND_DIRECTORY). The first match is returned; if there is no match, None is returned.

For example:

If there is a match:

>>> from pci.api import path
>>> print(path.search_directory('Test_ms.pix', '/data/'))
/data/Test_MS.pix
>>> print(path.search_directory('test_ms.pix', '/data/'))
/data/Test_MS.pix

No match:

>>> from pci.api import path
>>> print(path.search_directory('Test2.tiff', '/data/'))
None

3.3.7.3. Conditional functions

pci.api.path.find_files_by_exclusive_masks(input_source, masks)

Find files in input_source (a path to a directory or any iterable object) that do not match any mask in the list masks. A list of file paths is returned on success. An Exception is raised if the input_source argument is invalid. This function uses fnmatch.fnmatch() to do the matching.
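
Since the matching is done with fnmatch.fnmatch(), the exclusion rule can be illustrated with a short sketch. The name files_not_matching is hypothetical, only directory paths are handled, and the recursive walk is an assumption based on the example output in section 3.3.7.1.

```python
import fnmatch
import os
import tempfile


def files_not_matching(directory, masks):
    """Return files under directory whose names match none of the masks."""
    kept = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            # A file is kept only if every mask fails to match its name.
            if not any(fnmatch.fnmatch(name, mask) for mask in masks):
                kept.append(os.path.join(root, name))
    return kept


with tempfile.TemporaryDirectory() as top:
    for name in ("scene.pix", "image1.tif", "readme.txt"):
        open(os.path.join(top, name), "w").close()
    kept = files_not_matching(top, ["*.pix"])
    print(sorted(os.path.basename(p) for p in kept))  # ['image1.tif', 'readme.txt']
```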

pci.api.path.is_file_exist_case_insensitive(filename)

Return True if the file filename exists. This function uses file_exists_in_dir_case_insensitive() to do the test.

pci.api.path.is_potential_child(potential_parent, potential_child)

Return True if potential_child is a subdirectory of potential_parent, or if potential_parent == potential_child.
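
One way to express this relationship with the standard library is through os.path.commonpath. The sketch below is an assumption about the semantics, not the PCI implementation: is_same_or_inside is a hypothetical name, and the paths are compared purely as strings after normalization, without checking that they exist.

```python
import os


def is_same_or_inside(potential_parent, potential_child):
    """Return True if potential_child equals or lies below potential_parent."""
    parent = os.path.abspath(potential_parent)
    child = os.path.abspath(potential_child)
    # commonpath compares whole path components, so '/data' is not
    # mistaken for a parent of '/database'.
    return os.path.commonpath([parent, child]) == parent


print(is_same_or_inside("/data", "/data/auxiliary"))  # True
print(is_same_or_inside("/data", "/data"))            # True
print(is_same_or_inside("/data", "/database"))        # False
```

Comparing path components rather than string prefixes avoids the false positive that a naive startswith() check would give for the last case.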