reference

This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the file-clerk project code.

A collection of functions for dealing with files and file content.

I created this library for earlier projects that work with files and file paths: reading code from files, listing the files in a project folder, getting file extensions, and collecting only the files of a particular type. I developed those projects on Windows, so these functions were designed to work with file paths on Windows, Mac, and Linux (Windows is the one with backslashes - wacky, I know).

Typical usage example:

extension = get_file_type("path/to/file.js")

code_string = file_to_string("path/to/file.html")

project_path = "path/to/project"
all_project_files = get_all_project_files(project_path)
just_css_files = get_all_files_of_type(project_path, "css")

`clear_extra_text(my_text)`

Removes line returns and extra spaces from my_text.

Parameters:

Name	Type	Description	Default
`my_text`	`str`	text which may include line returns or extra space.	required

Returns:

Name	Type	Description
`stripped_text`	`str`	text without any line returns or additional spaces.

Source code in file_clerk/clerk.py

def clear_extra_text(my_text: str) -> str:
    """Removes line returns and extra spaces from my_text.

    Args:
        my_text (str): text which may include line returns or extra space.

    Returns:
        stripped_text (str): text without any line returns or additional
            spaces.
    """
    my_text = my_text.replace("\n", "")
    my_text = re.sub(r"\s+", " ", my_text)
    stripped_text = my_text.strip()
    return stripped_text
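
A short sketch of what the function does to typical input. The function body repeats the logic of the listing above so the example runs on its own; the sample strings are illustrative. Note that line returns are removed outright rather than replaced with a space, so words on either side of a line break are joined.

```python
import re


def clear_extra_text(my_text: str) -> str:
    # Same logic as the listing above.
    my_text = my_text.replace("\n", "")
    my_text = re.sub(r"\s+", " ", my_text)
    return my_text.strip()


# Runs of spaces collapse to one space; leading/trailing space is trimmed.
tidy = clear_extra_text("  Hello,\n   world!  ")
print(tidy)    # Hello, world!

# No space surrounds the line return here, so the two words are joined.
joined = clear_extra_text("first line\nsecond line")
print(joined)  # first linesecond line
```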

`delete_file(filepath)`

Deletes the file at the given path, but only if it exists.

Parameters:

Name	Type	Description	Default
`filepath`	`str`	The file location	required

Source code in file_clerk/clerk.py

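def delete_file(filepath: str):
    """Deletes the file at filepath, but only if it exists.

    Args:
        filepath (str): The file location
    """
    data_file = Path(filepath)
    try:
        # missing_ok=True (Python 3.8+) makes a missing file a no-op
        # instead of raising FileNotFoundError.
        data_file.unlink(missing_ok=True)
    except IsADirectoryError as e:
        print(f"Error: {data_file} : {e.strerror}")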
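
For reference, a minimal sketch of how `Path.unlink` treats a path that does not exist, which is the case the description is concerned with. It assumes Python 3.8 or later (for `missing_ok`); the temporary directory and file name are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.txt"  # illustrative file name
    target.write_text("temporary", encoding="utf-8")

    # The file exists, so it is removed.
    target.unlink()
    removed = not target.exists()

    # The file is gone now: a plain unlink() raises FileNotFoundError.
    try:
        target.unlink()
        raised = False
    except FileNotFoundError:
        raised = True

    # With missing_ok=True the same call is a silent no-op.
    target.unlink(missing_ok=True)

print(removed, raised)
```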

`file_exists(file_path)`

Returns True or False: whether file in path exists.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	The file location	required

Returns:

Name	Type	Description
`bool`	`bool`	True if the file exists, False if not

Source code in file_clerk/clerk.py

def file_exists(file_path: str) -> bool:
    """Returns True or False: whether file in path exists.

    Args:
        file_path (str): The file location

    Returns:
        bool: True if the file exists, False if not
    """
    filename = Path(file_path)
    return filename.exists()
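
One detail worth knowing: `Path.exists()` is also True for directories, so the function reports any existing path, not only regular files. The sketch below shows the difference from `Path.is_file()`; the file names are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    page = folder / "index.html"  # illustrative file name
    page.write_text("<p>hi</p>", encoding="utf-8")

    folder_exists = folder.exists()     # a directory counts as existing
    folder_is_file = folder.is_file()   # but it is not a regular file
    page_is_file = page.is_file()
    missing_exists = (folder / "missing.css").exists()

print(folder_exists, folder_is_file, page_is_file, missing_exists)
```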

`file_to_string(path)`

Returns contents of file as a string.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to a file using Posix format (forward slashes e.g. path/to/file.ext)	required

Returns:

Name	Type	Description
`file_text`	`str`	The contents of the file in utf-8 string format.

Source code in file_clerk/clerk.py

def file_to_string(path: str) -> str:
    """Returns contents of file as a string.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        file_text (str): The contents of the file in utf-8 string
            format.
    """
    my_file = get_full_path_string(path)
    file_text = my_file.read_text(encoding="utf-8")
    return file_text

`get_all_files_of_type(dir_path, filetype)`

Returns all files of a particular type from a directory.

Parameters:

Name	Type	Description	Default
`dir_path`	`str`	The path to a directory using Posix format (forward slashes, e.g. path/to/project)	required
`filetype`	`str`	An extension as a string, without the dot (e.g. html, css, js). Any extension that begins with this string also matches, because the search pattern is `*.<filetype>*`.	required

Returns:

Name	Type	Description
`files`	`list`	A list of Posix-style path strings, one for each matching file found in the directory or its subdirectories

Source code in file_clerk/clerk.py

def get_all_files_of_type(dir_path: str, filetype: str) -> list:
    """Returns all files of a particular type from a directory.

    Args:
        dir_path (str): The path to a directory using Posix format
            (forward slashes, e.g. path/to/project)
        filetype (str): An extension as a string, without the dot
            (e.g. html, css, js). Any extension that begins with this
            string also matches.

    Returns:
        files (list): A list of Posix-style path strings, one for each
            matching file found in the directory or its subdirectories
    """
    pattern = "*." + filetype + "*"
    output = []
    files = collections.Counter(
        str(f.as_posix()) for f in Path(dir_path).rglob(pattern)
    )
    output += files.keys()
    return output
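
Because the search pattern is built as `*.` + filetype + `*`, asking for `js` also picks up files such as `package.json`. The sketch below reproduces the listed function in a throwaway project folder to show this, next to a stricter variant. `files_with_exact_suffix` is a hypothetical helper written for this example, not part of file-clerk, and the file names are illustrative.

```python
import collections
import tempfile
from pathlib import Path


def get_all_files_of_type(dir_path: str, filetype: str) -> list:
    # Same logic as the listing above.
    pattern = "*." + filetype + "*"
    output = []
    files = collections.Counter(
        str(f.as_posix()) for f in Path(dir_path).rglob(pattern)
    )
    output += files.keys()
    return output


def files_with_exact_suffix(dir_path: str, filetype: str) -> list:
    # Hypothetical stricter variant: compares the suffix exactly.
    wanted = "." + filetype.lower()
    return sorted(
        f.as_posix()
        for f in Path(dir_path).rglob("*")
        if f.is_file() and f.suffix.lower() == wanted
    )


with tempfile.TemporaryDirectory() as project:
    for name in ("app.js", "package.json", "styles/main.css"):
        target = Path(project) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("", encoding="utf-8")

    loose = sorted(Path(p).name for p in get_all_files_of_type(project, "js"))
    strict = [Path(p).name for p in files_with_exact_suffix(project, "js")]

print(loose)   # ['app.js', 'package.json']
print(strict)  # ['app.js']
```

If the wider match is what you want (for example `htm` and `html` together), the listed function already does that; the stricter variant is only useful when an exact extension matters.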

`get_all_project_files(dir_path)`

Returns all HTML, CSS, and JavaScript files from a directory.

Parameters:

Name	Type	Description	Default
`dir_path`	`str`	The path to a directory using Posix format (forward slashes, e.g. path/to/project)	required

Returns:

Name	Type	Description
`files`	`list`	A list of all html, css, and javascript files

Source code in file_clerk/clerk.py

def get_all_project_files(dir_path: str) -> list:
    """Returns all HTML, CSS, and JavaScript files from a directory.

    Args:
        dir_path (str): The path to a directory using Posix format
            (forward slashes, e.g. path/to/project)

    Returns:
        files (list): A list of all html, css, and javascript files
    """
    files = []
    files += get_all_files_of_type(dir_path, "html")
    files += get_all_files_of_type(dir_path, "css")
    files += get_all_files_of_type(dir_path, "js")
    return files

`get_file_name(path)`

Returns the name of the file in the path.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to a file using Posix format (forward slashes e.g. path/to/file.ext)	required

Returns:

Name	Type	Description
`filename`	`str`	The name of the file (with extension)

Source code in file_clerk/clerk.py

def get_file_name(path: str) -> str:
    """Returns the name of the file in the path.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        filename (str): The name of the file (with extension)
    """
    filename = Path(path).name
    return filename

`get_file_type(path)`

Returns the extension of the file in the path.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to a file using Posix format (forward slashes e.g. path/to/file.ext)	required

Returns:

Name	Type	Description
`extension`	`str`	The extension of the file, without the dot (e.g. html, js, css, pdx, png)

Source code in file_clerk/clerk.py

def get_file_type(path: str) -> str:
    """Returns the extension of the file in the path.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        extension (str): The extension of the file, without the dot
            (e.g. html, js, css, pdx, png)
    """
    my_file = get_full_path_string(path)
    suffix = my_file.suffix
    extension = suffix[1:]
    return extension
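
The extension comes from pathlib's `suffix`, which returns only the last dotted part of the file name. A small sketch of what that means in practice; `extension_of` is a hypothetical stand-in that skips the working-directory lookup but applies the same suffix logic.

```python
from pathlib import PurePosixPath


def extension_of(path: str) -> str:
    # Hypothetical stand-in for get_file_type: same suffix logic,
    # without resolving the path against the working directory.
    return PurePosixPath(path).suffix[1:]


print(extension_of("path/to/file.js"))         # js
print(extension_of("path/to/archive.tar.gz"))  # gz (last suffix only)
print(repr(extension_of("path/to/README")))    # '' (no extension)
```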

`get_full_path_string(path)`

Returns the absolute path to the file at the given relative path. Despite the function's name, the return value is a Path object, not a string.

Parameters:

Name	Type	Description	Default
`path`	`str`	The file location using the Posix format (forward/slashes)	required

Returns:

Name	Type	Description
`full_path`	`Path Object`	will be a WindowsPath (if Windows) or PosixPath (if Mac or Linux)

Source code in file_clerk/clerk.py

def get_full_path_string(path: str):
    """Returns the absolute path to the file at the given relative path.

    Args:
        path (str): The file location using the Posix format
            (forward/slashes)

    Returns:
        full_path (Path Object): will be a WindowsPath (if Windows) or
            PosixPath (if Mac or Linux)
    """
    full_path = working_dir
    p_list = get_path_list(path)
    for i in p_list:
        full_path = full_path / i
    return full_path
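
A minimal sketch of how the pieces fit together. The listing does not show where `working_dir` is defined, so the example assumes it is the current working directory; that is an assumption made for illustration, not something confirmed by the source.

```python
from pathlib import Path

# Assumption: working_dir is not shown in the listing; the current
# working directory is used here for illustration.
working_dir = Path.cwd()


def get_path_list(path: str) -> list:
    return path.split("/")


def get_full_path_string(path: str) -> Path:
    # Join each part with the / operator so pathlib picks the right
    # separator for the operating system.
    full_path = working_dir
    for part in get_path_list(path):
        full_path = full_path / part
    return full_path


full_path = get_full_path_string("path/to/file.ext")
print(type(full_path).__name__)  # WindowsPath or PosixPath, depending on the OS
print(full_path.is_absolute())
```

Call `str(full_path)` or `full_path.as_posix()` when a plain string is needed.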

`get_linked_css(contents_str)`

Returns a list of linked CSS files.

Parameters:

Name	Type	Description	Default
`contents_str`	`str`	HTML code from a file in string format.	required

Returns:

Name	Type	Description
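`filenames`	`list`	A list of the href values extracted from the link tags. Note: external stylesheets (http:// or https:// URLs) are excluded; only local files are listed. The function returns None instead of a list when the HTML has no link tags, or when its only link tag points to an external URL.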

Source code in file_clerk/clerk.py

def get_linked_css(contents_str: str) -> list:
    """Returns a list of linked CSS files.

    Args:
        contents_str (str): HTML code from a file in string format.

    Returns:
        filenames (list): A list of the href values extracted from the
            link tags.
            Note: external stylesheets are excluded (only local
            files). Returns None when the HTML has no link tags, or
            when its only link tag points to an external URL.
    """
    filenames = []
    soup = BeautifulSoup(contents_str, "html.parser")
    linked_files = soup.find_all("link")

    if len(linked_files) > 1:
        for file in linked_files:
            linked_file = file.attrs.get("href")
            if "https://" in linked_file or "http://" in linked_file:
                continue
            filenames.append(linked_file)
    elif len(linked_files) == 1:
        filename = linked_files[0].attrs.get("href")
        if "https://" in filename or "http://" in filename:
            return None
        filenames.append(filename)
    else:
        return None
    return filenames
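
The listed function reads every link tag, whatever its `rel` value, and assumes each one has an `href`. As an illustration of a stricter approach, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup, keeps only `rel="stylesheet"` links, skips tags without an `href`, and always returns a list. `LinkCollector` and `local_stylesheets` are hypothetical names for this example and are not part of file-clerk.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <link rel="stylesheet"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        rel = (attr_map.get("rel") or "").lower().split()
        # Skip links with no href and links that are not stylesheets.
        if href and "stylesheet" in rel:
            self.hrefs.append(href)


def local_stylesheets(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    external = ("http://", "https://", "//")
    return [h for h in collector.hrefs if not h.startswith(external)]


sample = (
    '<head>'
    '<link rel="stylesheet" href="css/style.css">'
    '<link rel="stylesheet" href="https://cdn.example.com/theme.css">'
    '<link rel="icon" href="favicon.ico">'
    '<link rel="stylesheet">'
    '</head>'
)
print(local_stylesheets(sample))             # ['css/style.css']
print(local_stylesheets("<p>no links</p>"))  # []
```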

`get_path_list(path)`

Returns a list of each path part using slash as separator.

Parameters:

Name	Type	Description	Default
`path`	`str`	The file location using the Posix format (forward/slashes)	required

Returns:

Name	Type	Description
`path_list`	`list`	A list of each folder in the path, with the file at the end. Example: path/to/file.ext will be ["path", "to", "file.ext"]

Source code in file_clerk/clerk.py

def get_path_list(path: str) -> list:
    """Returns a list of each path part using slash as separator.

    Args:
        path (str): The file location using the Posix format
            (forward/slashes)

    Returns:
        path_list (list): A list of each folder in the path, with the
            file at the end.
            Example: path/to/file.ext will be
            ["path", "to", "file.ext"]
    """
    path_list = path.split("/")
    return path_list
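
Since the function is a plain `str.split("/")`, a leading or trailing slash produces an empty string in the result. The sketch below shows those edge cases next to `PurePosixPath.parts`, which handles them differently; the sample paths are illustrative.

```python
from pathlib import PurePosixPath

relative = "path/to/file.ext".split("/")
absolute = "/var/www/index.html".split("/")
trailing = "path/to/folder/".split("/")

print(relative)  # ['path', 'to', 'file.ext']
print(absolute)  # ['', 'var', 'www', 'index.html']
print(trailing)  # ['path', 'to', 'folder', '']

# pathlib keeps the root as its own part and drops the empty pieces.
parts = PurePosixPath("/var/www/index.html").parts
print(parts)     # ('/', 'var', 'www', 'index.html')
```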

`remove_tags(element)`

Removes all HTML tags from another tag's contents

Parameters:

Name	Type	Description	Default
`element`	`str`	the contents of a tag in string form, which may or may not contain extra tags (in particular inline tags, such as `<em>` or `<a>`)	required

Returns:

Name	Type	Description
`tagless_content`	`str`	the contents of the tag minus any inner tags.

Source code in file_clerk/clerk.py

def remove_tags(element: str) -> str:
    """Removes all HTML tags from another tag's contents

    Args:
        element (str): the contents of a tag in string form, which may
            or may not contain extra tags (in particular inline tags,
            such as `<em>` or `<a>`)

    Returns:
        tagless_content (str): the contents of the tag minus any inner tags.
    """
    tagless_content = TAG_RE.sub("", element)
    return tagless_content
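
`TAG_RE` is defined elsewhere in the module and is not shown in the listing. The sketch below assumes a common pattern that matches anything between `<` and `>`; the real pattern in file-clerk may differ.

```python
import re

# Assumption: TAG_RE is not shown in the listing; this pattern is a
# common choice that matches anything between "<" and ">".
TAG_RE = re.compile(r"<[^>]+>")


def remove_tags(element: str) -> str:
    return TAG_RE.sub("", element)


text = remove_tags('Read the <em>full</em> <a href="guide.html">guide</a>.')
print(text)  # Read the full guide.
```

Only the tags are removed; the text between them, including its spacing, is left as it was.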

`split_into_sentences(contents)`

Returns a list of each sentence from the text.

Parameters:

Name	Type	Description	Default
`contents`	`str`	A string of text (typically from a tag) that most likely has punctuation.	required

Returns:

Name	Type	Description
`sentences`	`list`	A list of each sentence from the text each in string format

Source code in file_clerk/clerk.py

def split_into_sentences(contents: str) -> list:
    """Returns a list of each sentence from the text.

    Args:
        contents (str): A string of text (typically from a tag) that
            most likely has punctuation.

    Returns:
        sentences (list): A list of each sentence from the text
            each in string format
    """

    # contents = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", contents)
    try:
        sentence_detector = PunktTokenizer()
        sentences = sentence_detector.sentences_from_text(contents)
    except LookupError as err:
        print(err)
        # This was a fall-back after errors
        sentence_endings = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s"
        sentences = re.split(sentence_endings, contents)
    return sentences
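
The regular expression in the fallback branch can be tried without NLTK. It splits on whitespace that follows `.`, `?`, or `!`, unless the preceding text looks like a short title such as `Dr.` or an initialism. `split_with_fallback` is a hypothetical helper that runs only that branch; the sample sentences are illustrative.

```python
import re

# Fallback pattern taken from the listing above.
SENTENCE_ENDINGS = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s"


def split_with_fallback(contents: str) -> list:
    # Hypothetical helper: runs only the regex branch, without NLTK.
    return re.split(SENTENCE_ENDINGS, contents)


plain = split_with_fallback("Hello there. How are you? Fine!")
print(plain)   # ['Hello there.', 'How are you?', 'Fine!']

# "Dr." matches the [A-Z][a-z]\. lookbehind, so no split happens there.
titled = split_with_fallback("Dr. Smith arrived. He sat down.")
print(titled)  # ['Dr. Smith arrived.', 'He sat down.']
```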

`write_csv_file(filepath, data_list)`

Create a CSV file using a 2-D list.

This function will create a CSV file using the data_list (the contents of the file) using the filepath relative to the directory you set (most likely your project directory).

Parameters:

Name	Type	Description	Default
`filepath`	`str`	the path to your CSV file, relative to the project folder. Example: `path/to/my_csv_file.csv`	required
`data_list`	`list`	a 2D list that will be your CSV file contents. NOTE: the first row will be your headers.	required

Source code in file_clerk/clerk.py

def write_csv_file(filepath: str, data_list: list):
    """Create a CSV file using a 2-D list.

    This function will create a CSV file using the data_list (the
    contents of the file) using the filepath relative to the directory
    you set (most likely your project directory).

    Args:
        filepath: the path to your CSV file, relative to the project
            folder.
            Example: `path/to/my_csv_file.csv`
        data_list: a 2D list that will be your CSV file contents.
            NOTE: the first row will be your headers.
    """
    try:
        with open(filepath, "w", newline="") as output:
            writer = csv.writer(output)
            writer.writerows(data_list)
    except Exception:
        with open(filepath, "w", newline="", encoding="utf-8") as csvFile:
            writer = csv.writer(csvFile)
            writer.writerows(data_list)
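
A short round-trip sketch of the same `csv.writer` approach, written against a temporary folder so it can be run anywhere. The data and file name are illustrative, and the example passes `encoding="utf-8"` up front rather than retrying, which is a choice made for this sketch.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    ["name", "extension"],  # first row: headers
    ["index", "html"],
    ["styles", "css"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "files.csv"  # illustrative file name

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as output:
        csv.writer(output).writerows(rows)

    with open(csv_path, newline="", encoding="utf-8") as source:
        read_back = list(csv.reader(source))

print(read_back == rows)  # True
```

Values come back from `csv.reader` as strings, so numbers written to the file need converting again after reading.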