Reference

This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the file-clerk project code.

A collection of functions for dealing with files and file content.

I created this library for earlier projects that work with files and file paths: reading code from files, listing the files in a project folder, getting file extensions, and collecting only the files of a particular type. I developed those projects on Windows, so these functions were designed to work with file paths on Windows, Mac, and Linux (Windows is the one with backslashes. Wacky, I know).

Typical usage example:

extension = get_file_type("path/to/file.js")

code_string = file_to_string("path/to/file.html")

project_path = "path/to/project"
all_project_files = get_all_project_files(project_path)
just_css_files = get_all_files_of_type(project_path, "css")

clear_extra_text(my_text)

Removes line returns and extra spaces from my_text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| my_text | str | Text which may include line returns or extra spaces. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| stripped_text | str | Text without any line returns or additional spaces. |

Source code in file_clerk/clerk.py
def clear_extra_text(my_text: str) -> str:
    """Removes line returns and extra spaces from my_text.

    Args:
        my_text (str): text which may include line returns or extra space.

    Returns:
        stripped_text (str): text without any line returns or additional spaces.
    """
    my_text = my_text.replace("\n", "")
    my_text = re.sub(r"\s+", " ", my_text)
    stripped_text = my_text.strip()
    return stripped_text
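
The effect of the three steps is easiest to see on sample input. The sketch below copies the function body from the listing above so that it runs on its own; the sample strings are made up for illustration. Because line returns are removed outright before whitespace is collapsed, words separated only by a line return end up joined.

```python
import re


def clear_extra_text(my_text: str) -> str:
    # Copied from the listing above so the example is self-contained.
    my_text = my_text.replace("\n", "")
    my_text = re.sub(r"\s+", " ", my_text)
    return my_text.strip()


# Runs of spaces collapse to one space; leading and trailing space is removed.
spaced = clear_extra_text("  Hello,   world!  \n")
print(spaced)  # Hello, world!

# A line return with no space next to it is deleted, so the words are joined.
joined = clear_extra_text("first line\nsecond line")
print(joined)  # first linesecond line
```

If the text may contain words separated only by line returns, replacing "\n" with a space before calling the function keeps them apart.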

delete_file(filepath)

Deletes the file at filepath, but only if it exists.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filepath | str | The file location. | required |

Source code in file_clerk/clerk.py
def delete_file(filepath: str):
    """deletes file in path but only if it exists

    Args:
        filepath (str): The file location
    """
    data_file = Path(filepath)
    try:
        data_file.unlink()
    except FileNotFoundError:
        # The file is already gone, so there is nothing to delete.
        pass
    except IsADirectoryError as e:
        print(f"Error: {data_file} : {e.strerror}")
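
Path.unlink() raises FileNotFoundError for a missing path unless missing_ok=True is passed (available from Python 3.8). The following is a minimal sketch of a "delete only if it exists" helper built on that flag; delete_if_present and the temporary file are hypothetical names used for illustration and are not part of file-clerk.

```python
import tempfile
from pathlib import Path


def delete_if_present(filepath: str) -> bool:
    """Hypothetical helper: delete filepath and report whether it existed."""
    target = Path(filepath)
    existed = target.is_file()
    # missing_ok=True turns a missing file into a no-op instead of an error.
    target.unlink(missing_ok=True)
    return existed


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("temporary", encoding="utf-8")

    first = delete_if_present(str(sample))   # the file exists and is removed
    second = delete_if_present(str(sample))  # the file is gone; no exception
    still_there = sample.exists()

print(first, second, still_there)  # True False False
```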

file_exists(file_path)

Returns True if the file at file_path exists, False if not.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | The file location. | required |

Returns:

| Type | Description |
| --- | --- |
| bool | True if the file exists, False if not. |

Source code in file_clerk/clerk.py
def file_exists(file_path: str) -> bool:
    """Returns True or False: whether file in path exists.

    Args:
        file_path (str): The file location

    Returns:
        bool: True or False: True if file exists False if not
    """
    filename = Path(file_path)
    return filename.exists()

file_to_string(path)

Returns contents of file as a string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The path to a file in POSIX format (forward slashes, e.g. path/to/file.ext). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| file_text | str | The contents of the file, decoded as UTF-8. |

Source code in file_clerk/clerk.py
def file_to_string(path: str) -> str:
    """Returns contents of file as a string.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        file_text (str): The contents of the file in utf-8 string
            format.
    """
    my_file = get_full_path_string(path)
    file_text = my_file.read_text(encoding="utf-8")
    return file_text

get_all_files_of_type(dir_path, filetype)

Returns all files of a particular type from a directory, including its subdirectories.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dir_path | str | The path to a directory in POSIX format (forward slashes, e.g. path/to/project). | required |
| filetype | str | An extension as a string, without the dot (e.g. html, css, js). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| files | list | A list of the paths of all matching files, as POSIX-style strings. |

Source code in file_clerk/clerk.py
def get_all_files_of_type(dir_path: str, filetype: str) -> list:
    """Returns all files of a particular type from a directory.

    Args:
        dir_path (str): The path to a directory in POSIX format
            (forward slashes, e.g. path/to/project)
        filetype (str): An extension in the form of a string, without
            the dot (e.g. html, css, js, etc.)

    Returns:
        files (list): A list of the paths of all matching files, as
            POSIX-style strings
    """
    # NOTE: the trailing "*" also matches longer extensions, so "js"
    # matches both app.js and data.json.
    pattern = "*." + filetype + "*"
    output = []
    files = collections.Counter(
        str(f.as_posix()) for f in Path(dir_path).rglob(pattern)
    )
    output += files.keys()
    return output
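
The pattern built above ends in a wildcard, so it matches any file name containing the extension text after a dot, not only exact extensions. The sketch below shows the difference on a throwaway directory; files_matching and the file names are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path


def files_matching(dir_path: str, pattern: str) -> list:
    """Hypothetical helper: sorted POSIX paths under dir_path matching pattern."""
    root = Path(dir_path)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scripts").mkdir()
    for name in ("index.html", "scripts/app.js", "scripts/data.json"):
        (root / name).write_text("", encoding="utf-8")

    loose = files_matching(tmp, "*.js*")  # same shape as "*." + filetype + "*"
    strict = files_matching(tmp, "*.js")  # exact extension only

print(loose)   # ['scripts/app.js', 'scripts/data.json']
print(strict)  # ['scripts/app.js']
```

Callers who need only exact matches can filter the returned list on the file suffix.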

get_all_project_files(dir_path)

Returns a list of all HTML, CSS, and JavaScript files from the directory in the path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dir_path | str | The path to a directory in POSIX format (forward slashes, e.g. path/to/project). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| files | list | A list of all html, css, and javascript files. |

Source code in file_clerk/clerk.py
def get_all_project_files(dir_path: str) -> list:
    """Returns a list of all HTML, CSS, and JavaScript files in a directory.

    Args:
        dir_path (str): The path to a directory in POSIX format
            (forward slashes, e.g. path/to/project)

    Returns:
        files (list): A list of all html, css, and javascript files
    """
    files = []
    files += get_all_files_of_type(dir_path, "html")
    files += get_all_files_of_type(dir_path, "css")
    files += get_all_files_of_type(dir_path, "js")
    return files

get_file_name(path)

Returns the name of the file in the path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The path to a file in POSIX format (forward slashes, e.g. path/to/file.ext). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| filename | str | The name of the file (with extension). |

Source code in file_clerk/clerk.py
def get_file_name(path: str) -> str:
    """returns the name of the file in the path.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        filename (str): The name of the file (with extension)
    """
    filename = Path(path).name
    return filename

get_file_type(path)

Returns the extension of the file in the path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The path to a file in POSIX format (forward slashes, e.g. path/to/file.ext). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| extension | str | The extension of the file type, without the dot (e.g. html, js, css, pdx, png). |

Source code in file_clerk/clerk.py
def get_file_type(path: str) -> str:
    """returns the extension of the file in the path.

    Args:
        path (str): The path to a file using Posix format (forward
            slashes e.g. path/to/file.ext)

    Returns:
        extension (str): The extension of the file type, without
            the dot (e.g. html, js, css, pdx, png)
    """
    my_file = get_full_path_string(path)
    suffix = my_file.suffix
    extension = suffix[1:]
    return extension
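
Both get_file_name and get_file_type rest on pathlib's name and suffix attributes. The sketch below shows what those attributes return for a few made-up paths; extension_of is a hypothetical stand-in that skips the path resolution step, and PurePosixPath is used so the output is the same on every platform.

```python
from pathlib import PurePosixPath


def extension_of(path: str) -> str:
    """Hypothetical stand-in for get_file_type without path resolution."""
    # suffix includes the leading dot, so drop the first character.
    return PurePosixPath(path).suffix[1:]


file_name = PurePosixPath("path/to/file.js").name
plain = extension_of("path/to/file.js")
double = extension_of("backup/site.tar.gz")  # only the last suffix counts
missing = extension_of("path/to/README")     # no suffix gives an empty string

print(file_name)      # file.js
print(plain)          # js
print(double)         # gz
print(repr(missing))  # ''
```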

get_full_path_string(path)

Returns the full path to the file named by a relative path, built on the module's working_dir.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The file location in POSIX format (forward slashes). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| full_path | Path object | A WindowsPath (on Windows) or a PosixPath (on Mac or Linux). |

Source code in file_clerk/clerk.py
def get_full_path_string(path: str):
    """returns absolute path to file in relative path.

    Args:
        path (str): The file location using the Posix format
            (forward/slashes)

    Returns:
        full_path (Path Object): will be a WindowsPath (if Windows) or
            PosixPath (if Mac or Linux)
    """
    full_path = working_dir
    p_list = get_path_list(path)
    for i in p_list:
        full_path = full_path / i
    return full_path
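
The loop above appends each path part to a base path with the / operator. The sketch below repeats that idea with an explicit base argument, since working_dir is defined elsewhere in clerk.py and is not shown here; join_parts and the sample base directory are hypothetical, and the sketch assumes the base is a pathlib path object.

```python
from pathlib import PurePosixPath


def join_parts(base: PurePosixPath, path: str) -> PurePosixPath:
    """Hypothetical variant of get_full_path_string with an explicit base."""
    full_path = base
    # Same approach as the listing: add one path part at a time.
    for part in path.split("/"):
        full_path = full_path / part
    return full_path


result = join_parts(PurePosixPath("/home/user/project"), "css/styles.css")
print(result)  # /home/user/project/css/styles.css
```

With a concrete Path in place of PurePosixPath, the same loop yields a WindowsPath or PosixPath depending on the platform.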

get_linked_css(contents_str)

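Returns a list of linked CSS files.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| contents_str | str | HTML code from a file in string format. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| filenames | list | A list of all filenames extracted from CSS link tags. Note: no external stylesheets will be included (only local files). None is returned when the HTML has no link tags, or when its only link tag is external. |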

Source code in file_clerk/clerk.py
def get_linked_css(contents_str: str) -> list:
    """returns a list of linked CSS files.

    Args:
        contents_str (str): HTML code from a file in string format.

    Returns:
        filenames (list): A list of all filenames extracted from CSS
            link tags.
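            Note: no external stylesheets will be included (only
            local files). None is returned if there are no link tags,
            or if the only link tag is external or has no href.
    """
    filenames = []
    soup = BeautifulSoup(contents_str, "html.parser")
    # Collects every <link> tag, whatever its rel attribute.
    linked_files = soup.find_all("link")

    if len(linked_files) > 1:
        for file in linked_files:
            linked_file = file.attrs.get("href")
            # A tag without an href gives None, which cannot be searched.
            if not linked_file:
                continue
            if "https://" in linked_file or "http://" in linked_file:
                continue
            filenames.append(linked_file)
    elif len(linked_files) == 1:
        filename = linked_files[0].attrs.get("href")
        if not filename:
            return None
        if "https://" in filename or "http://" in filename:
            return None
        filenames.append(filename)
    else:
        return None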
    return filenames
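
For comparison, here is a rough sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup, so it runs without third-party packages. It is not the library's implementation: LocalStylesheetParser is a hypothetical class, and unlike the listing above it also checks rel="stylesheet" and always produces a list.

```python
from html.parser import HTMLParser


class LocalStylesheetParser(HTMLParser):
    """Hypothetical parser that collects hrefs of local stylesheet links."""

    def __init__(self):
        super().__init__()
        self.filenames = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = (attributes.get("rel") or "").lower().split()
        # Ignore links with no href and links that are not stylesheets.
        if not href or "stylesheet" not in rel:
            return
        # Ignore external stylesheets.
        if href.startswith(("http://", "https://", "//")):
            return
        self.filenames.append(href)


html = """
<head>
  <link rel="stylesheet" href="css/styles.css">
  <link rel="stylesheet" href="https://cdn.example.com/reset.css">
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet">
</head>
"""

parser = LocalStylesheetParser()
parser.feed(html)
print(parser.filenames)  # ['css/styles.css']
```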

get_path_list(path)

Returns a list of each path part using slash as separator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The file location in POSIX format (forward slashes). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| path_list | list | A list of each folder in the path, with the file at the end. Example: path/to/file.ext will be ["path", "to", "file.ext"]. |

Source code in file_clerk/clerk.py
def get_path_list(path: str) -> list:
    """Returns a list of each path part using slash as separator.

    Args:
        path (str): The file location using the Posix format
            (forward/slashes)

    Returns:
        path_list (list): A list of each folder in the path, with the
            file at the end.
            Example: path/to/file.ext will be
            ["path", "to", "file.ext"]
    """
    path_list = path.split("/")
    return path_list
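
Because the function is a plain str.split("/"), doubled or trailing slashes produce empty strings in the result. The short sketch below contrasts that with pathlib's parts attribute, which normalizes them; the sample paths are made up for illustration.

```python
from pathlib import PurePosixPath

# A clean relative path splits into folders followed by the file.
simple = "path/to/file.ext".split("/")
print(simple)  # ['path', 'to', 'file.ext']

# Doubled and trailing slashes leave empty strings in the list.
doubled = "path//to/".split("/")
print(doubled)  # ['path', '', 'to', '']

# pathlib collapses the extra slashes when it parses the path.
parts = list(PurePosixPath("path//to/").parts)
print(parts)  # ['path', 'to']
```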

remove_tags(element)

Removes all HTML tags from another tag's contents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| element | str | The contents of a tag in string form, which may or may not contain extra tags (in particular inline tags such as `<em>` or `<a>`). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| tagless_content | str | The contents of the tag minus any inner tags. |

Source code in file_clerk/clerk.py
def remove_tags(element: str) -> str:
    """Removes all HTML tags from another tag's contents

    Args:
        element (str): the contents of a tag in string form, which may or
            may not have extra tags (in particular inline tags, such as :code:`<em>`
            or :code:`<a>`, etc.)

    Returns:
        tagless_content (str): the contents of the tag minus any inner tags.
    """
    tagless_content = TAG_RE.sub("", element)
    return tagless_content
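
TAG_RE is defined elsewhere in clerk.py and is not shown in this listing. The sketch below assumes a simple pattern that matches anything between angle brackets; the actual pattern in the library may differ.

```python
import re

# Assumed pattern: "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def remove_tags(element: str) -> str:
    # Same substitution as the listing above, using the assumed pattern.
    return TAG_RE.sub("", element)


text = remove_tags('Read the <em>full</em> <a href="docs.html">docs</a>.')
print(text)  # Read the full docs.
```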

split_into_sentences(contents)

Returns a list of each sentence from the text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| contents | str | A string of text (typically from a tag) that most likely has punctuation. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| sentences | list | A list of the sentences in the text, each as a string. |

Source code in file_clerk/clerk.py
def split_into_sentences(contents: str) -> list:
    """Returns a list of each sentence from the text.

    Args:
        contents (str): A string of text (typically from a tag) that
            most likely has punctuation.

    Returns:
        sentences (list): A list of each sentence from the text
            each in string format
    """
    contents = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", contents)
    sentences = nltk.tokenize.sent_tokenize(contents)
    return sentences
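
The re.sub call runs before tokenizing and inserts a space wherever a lowercase letter, a period, and an uppercase letter sit directly together. The sketch below shows only that preprocessing step on a made-up string, without NLTK.

```python
import re

text = "First sentence.Second sentence. Third one."

# Insert the missing space after a period squeezed between a lowercase
# and an uppercase letter; correctly spaced periods are left alone.
fixed = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", text)
print(fixed)  # First sentence. Second sentence. Third one.
```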

write_csv_file(filepath, data_list)

Create a CSV file using a 2-D list.

This function creates a CSV file from data_list (the contents of the file) at filepath, which is resolved relative to the current working directory (most likely your project directory).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filepath | str | The path to your file, relative to the project folder. Example: path/to/my_csv_file.csv | required |
| data_list | list | A 2D list that will be your CSV file contents. NOTE: the first row will be your headers. | required |

Source code in file_clerk/clerk.py
def write_csv_file(filepath: str, data_list: list):
    """Create a CSV file using a 2-D list.

    This function will create a CSV file using the data_list (the
    contents of the file) using the filepath relative to the directory
    you set (most likely your project directory).

    Args:
        filepath: name of the full path to your file name in
            relation to the project folder.
            Example: `path/to/my_csv_file.csv`
        data_list: a 2D list that will be your CSV file contents.
            NOTE: the first row will be your headers.
    """
    try:
        with open(filepath, "w", newline="") as output:
            writer = csv.writer(output)
            writer.writerows(data_list)
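    except UnicodeEncodeError:
        # The platform's default encoding could not represent the data,
        # so write the file again as UTF-8.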
        with open(filepath, "w", newline="", encoding="utf-8") as csvFile:
            writer = csv.writer(csvFile)
            writer.writerows(data_list)
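
A short round trip makes the expected shape of data_list concrete: a list of rows, with the header row first. The sketch below writes a made-up 2D list with csv.writer and reads it back; the file name and sample data are hypothetical, and UTF-8 is used from the start so that non-ASCII text is written the same way on every platform.

```python
import csv
import tempfile
from pathlib import Path

# The first row holds the headers; each later row is one record.
data_list = [
    ["name", "language"],
    ["café", "français"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "report.csv"

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as output:
        csv.writer(output).writerows(data_list)

    with open(csv_path, newline="", encoding="utf-8") as source:
        rows = list(csv.reader(source))

print(rows == data_list)  # True
```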