html_tools.py
A collection of functions for getting HTML code and contents.
This is a library I created to help me autograde student web design projects. For example, in a web design assignment, I might ask my students to include at least two bullet lists or five links.
This tool lets you find out which tags are present in a project, get the contents of elements, and count how many times a particular element appears.
Typical usage example:
``
add_if_not_in(my_list, item)
Inserts an item into a list, but only if it is not already in that list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
my_list | list | the list in question. | required |
item | str | the string in the list | required |

Source code in webcode_tk/html_tools.py (lines 403-410)
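The behavior described for add_if_not_in can be sketched as follows. This is a minimal illustration, not the library's source: it assumes the function changes the list in place and returns nothing, which the documentation above does not confirm.

```python
def add_if_not_in(my_list: list, item: str) -> None:
    """Append item to my_list only when it is not already present.

    Assumption: the list is modified in place and nothing is returned.
    """
    if item not in my_list:
        my_list.append(item)


tags = ["p", "ul"]
add_if_not_in(tags, "div")  # new item, so it is appended
add_if_not_in(tags, "p")    # already present, so the list is unchanged
print(tags)
```

A membership test on a list is a linear scan, which is fine for the short lists of tag names or selectors an autograder typically collects.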
get_all_html_files(dir_path)
Returns a list of all HTML files in the dir_path folder.
This function takes a path to a directory and returns a list of all html documents in that folder as full paths (including the path to the directory).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path | str | a string of a path to a folder (directory). This path should be a relative path starting at the root directory of your python project. | required |

Returns:
Name | Type | Description |
---|---|---|
html_files | list | a list of full paths to all HTML documents in the dir_path folder. |

Source code in webcode_tk/html_tools.py (lines 32-50)
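Collecting the HTML documents in a folder can be sketched with the standard library alone. The function name list_html_files is hypothetical, and the sketch assumes only files ending in .html that sit directly inside the folder are wanted; whether the real function also searches subfolders or accepts other extensions is not stated above.

```python
import os
import tempfile


def list_html_files(dir_path: str) -> list:
    """Return the paths (dir_path joined with the file name) of .html files in dir_path.

    Assumption: the search is not recursive and only matches ".html".
    """
    html_files = []
    for name in sorted(os.listdir(dir_path)):
        if name.lower().endswith(".html"):
            html_files.append(os.path.join(dir_path, name))
    return html_files


# Build a throwaway project folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as project:
    for name in ("index.html", "about.html", "styles.css"):
        with open(os.path.join(project, name), "w", encoding="utf-8") as f:
            f.write("")
    found = [os.path.basename(path) for path in list_html_files(project)]

print(found)
```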
get_element_content(el)
Gets the content of an element (el) as a string.
This function can accept a Tag (a BeautifulSoup object) or a string and returns the contents of the tag as a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
el | Union[Tag, str] | the element can either be a Tag (preferred) or a string. | required |

Returns:
Name | Type | Description |
---|---|---|
content | str | the contents of the tag as a string. This is like the .innerHTML property in JavaScript: it will include nested markup. |

Source code in webcode_tk/html_tools.py (lines 184-203)
get_elements(el, file_path)
Returns a list of all Tag objects of type el from file path.
Extracts all tags of type (el) from the file (file_path) as a list of BeautifulSoup Tag objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
el | str | the name of a tag, but not in tag form (for example: p, ul, or div) | required |
file_path | str | relative path to an html document (relative to the project folder) | required |

Returns:
Name | Type | Description |
---|---|---|
num | list | a list of all Tag objects of type el found in the document |

Raises:
Type | Description |
---|---|
FileNotFound | the file path did not exist. |

Source code in webcode_tk/html_tools.py (lines 160-181)
get_html(file_path)
Returns an html document (from file_path) as a BeautifulSoup object
This function takes advantage of the bs4 library's BeautifulSoup datatype, also known as simply a soup object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | the file location (and filename as a relative link). | required |

Returns:
Name | Type | Description |
---|---|---|
soup | BeautifulSoup | this is a BeautifulSoup object that represents an HTML tree or NoneType if there is a failure. |

Raises:
Type | Description |
---|---|
FileNotFound | the file path did not exist. |

Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
Source code in webcode_tk/html_tools.py (lines 53-78)
get_num_elements_in_file(el, file_path)
Returns the number of HTML elements in a web page (file)
This function takes the name of an element in string form and the relative path to the HTML document, and it returns the number of occurrences of that tag in the document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
el | str | the name of a tag, but not in tag form (for example: p, ul, or div) | required |
file_path | str | relative path to an html document (relative to the project folder) | required |

Returns:
Name | Type | Description |
---|---|---|
num | int | the number of elements found in the document in integer form |

Raises:
Type | Description |
---|---|
FileNotFound | the file path did not exist. |

Source code in webcode_tk/html_tools.py (lines 81-125)
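Counting occurrences of a tag can be illustrated without the library. The sketch below is not the webcode_tk implementation: it swaps BeautifulSoup for the standard library's html.parser, takes markup as a string rather than a file path, and the names TagCounter and count_elements are hypothetical.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags whose name matches tag_name."""

    def __init__(self, tag_name: str):
        super().__init__()
        self.tag_name = tag_name.lower()  # the parser reports tag names in lowercase
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.count += 1


def count_elements(el: str, markup: str) -> int:
    """Return how many times the element named el opens in markup."""
    counter = TagCounter(el)
    counter.feed(markup)
    counter.close()
    return counter.count


page = "<ul><li>One</li><li>Two</li></ul><p>Done</p>"
print(count_elements("li", page))
print(count_elements("div", page))
```

An assignment check such as "at least two list items" then reduces to comparing the returned count with the required minimum.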
get_num_elements_in_folder(el, dir_path)
Returns the total number of a specific element in all files of a project.
Checks to make sure the folder exists, then goes through all html files in the directory to see how many occurrences there are among all the files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
el | str | the name of a tag, but not in tag form (for example: p, ul, or div) | required |
dir_path | str | relative path to a folder of html documents (relative to the project folder). | required |

Returns:
Name | Type | Description |
---|---|---|
num | int | the total number of elements found across all html documents in the folder, in integer form |

Raises:
Type | Description |
---|---|
FileNotFound | the folder path did not exist. |

Source code in webcode_tk/html_tools.py (lines 128-157)
get_number_of_elements_per_file(project_dir, element_data)
Returns a list of the number of elements per file in a project folder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
project_dir | str | the folder we want to check | required |
element_data | list | a list of tuples of element (str) and required number | required |

Returns:
Name | Type | Description |
---|---|---|
elements_per_file | list | a list of tuples that includes the file, the element, and the number of occurrences. |

Source code in webcode_tk/html_tools.py (lines 413-434)
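The shape of the input and output described above can be sketched as follows. This is an illustration under assumptions, not the library's code: it uses the standard library's html.parser in place of BeautifulSoup, reports the bare file name rather than a full path, ignores the required number when counting, and uses the hypothetical names TagCounter and elements_per_file.

```python
import os
import tempfile
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags whose name matches tag_name."""

    def __init__(self, tag_name: str):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.count += 1


def elements_per_file(project_dir: str, element_data: list) -> list:
    """Return (file name, element, count) for each HTML file and requested element."""
    results = []
    for name in sorted(os.listdir(project_dir)):
        if not name.lower().endswith(".html"):
            continue
        with open(os.path.join(project_dir, name), encoding="utf-8") as f:
            markup = f.read()
        # The required number is part of the input tuples but is not needed to count.
        for element, _required in element_data:
            counter = TagCounter(element)
            counter.feed(markup)
            counter.close()
            results.append((name, element, counter.count))
    return results


# Build a throwaway project so the example runs anywhere.
with tempfile.TemporaryDirectory() as project:
    pages = {
        "index.html": "<ul><li>a</li></ul><a href='#'>x</a>",
        "about.html": "<ul></ul><ul></ul>",
    }
    for name, markup in pages.items():
        with open(os.path.join(project, name), "w", encoding="utf-8") as f:
            f.write(markup)
    report = elements_per_file(project, [("ul", 2), ("a", 5)])

print(report)
```

A grader could compare each count in the report against the required number from element_data to decide which files fall short.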
get_possible_selectors_by_tag(file_path, tag)
Returns all possible selectors for a particular tag.
Gets all tag ids and selectors for a given tag in an html document, and returns a list of all CSS selector permutations.
A possible future version may also include descendant selectors, in all possible permutations, by looking at a tag's ancestors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | path to html document. | required |
tag | str | string version of the tag. | required |

Returns:
Name | Type | Description |
---|---|---|
all_selectors | list | a list of strings for selectors that could target the tag (potential selectors, not necessarily ones actually used). |

Source code in webcode_tk/html_tools.py (lines 336-400)
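One way to picture "possible selectors" is to build them from each matching tag's id and class attributes. The sketch below is a guess at the idea, not the library's algorithm: it assumes the candidates are the type selector, the id selector, each class selector, and the tag combined with each of those. It uses the standard library's html.parser on a markup string instead of BeautifulSoup on a file, and SelectorCollector and possible_selectors are hypothetical names.

```python
from html.parser import HTMLParser


class SelectorCollector(HTMLParser):
    """Collects candidate CSS selectors for every occurrence of one tag."""

    def __init__(self, tag_name: str):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.selectors = []

    def _add(self, selector: str) -> None:
        # Keep the first occurrence only, preserving discovery order.
        if selector not in self.selectors:
            self.selectors.append(selector)

    def handle_starttag(self, tag, attrs):
        if tag != self.tag_name:
            return
        attributes = dict(attrs)
        self._add(tag)
        tag_id = attributes.get("id")
        if tag_id:
            self._add("#" + tag_id)
            self._add(tag + "#" + tag_id)
        # A valueless class attribute is reported as None, hence the "or".
        for class_name in (attributes.get("class") or "").split():
            self._add("." + class_name)
            self._add(tag + "." + class_name)


def possible_selectors(markup: str, tag: str) -> list:
    """Return candidate selectors that could target tag in markup."""
    collector = SelectorCollector(tag)
    collector.feed(markup)
    collector.close()
    return collector.selectors


markup = '<p id="intro" class="lead big">Hi</p><p class="lead">Again</p>'
selectors = possible_selectors(markup, "p")
print(selectors)
```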
get_style_attribute_data(file)
Returns a list of all tags that contain a style attribute.
Checks to make sure the file path goes to an HTML doc and raises an exception if not. It will return a list of tuples. The tuples each contain filename, tag, and style attribute value (for reference).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | str | path to an html document. | required |

Returns:
Name | Type | Description |
---|---|---|
data | list | a list of style attribute data (filename, tag, and attribute value) or st |

Raises:
Type | Description |
---|---|
ValueError | to get a list of attribute values, the file must be an HTML document |

Source code in webcode_tk/html_tools.py (lines 262-308)
has_style_attribute_data(file)
Returns whether a file has style attributes or not.
Makes use of get_style_attribute_data(), and if there is any data, returns True; otherwise, False.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | str | path to an html document. | required |

Returns:
Name | Type | Description |
---|---|---|
has_style_attribute | bool | True if the file has at least one style attribute; otherwise, False. |

Raises:
Type | Description |
---|---|
ValueError | to get a list of attribute values, the file must be an HTML document |

Source code in webcode_tk/html_tools.py (lines 311-333)
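The relationship between the two style attribute functions (collect the data, then report whether any exists) can be sketched as follows. This is a stand-in, not the library's code: it uses the standard library's html.parser rather than BeautifulSoup, assumes the HTML check is a simple .html extension test, and uses the hypothetical names StyleAttributeCollector, style_attribute_data, and has_style_attributes.

```python
import os
import tempfile
from html.parser import HTMLParser


class StyleAttributeCollector(HTMLParser):
    """Records (filename, tag, style value) for every tag with a style attribute."""

    def __init__(self, filename: str):
        super().__init__()
        self.filename = filename
        self.data = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style":
                self.data.append((self.filename, tag, value))


def style_attribute_data(file: str) -> list:
    """Return a list of (filename, tag, style value) tuples for an HTML file."""
    # Assumption: "is an HTML document" means the path ends in .html.
    if not file.lower().endswith(".html"):
        raise ValueError("the file must be an HTML document")
    with open(file, encoding="utf-8") as f:
        markup = f.read()
    collector = StyleAttributeCollector(os.path.basename(file))
    collector.feed(markup)
    collector.close()
    return collector.data


def has_style_attributes(file: str) -> bool:
    """True when the file contains at least one style attribute."""
    return bool(style_attribute_data(file))


with tempfile.TemporaryDirectory() as project:
    path = os.path.join(project, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<h1 style="color: red;">Hi</h1><p>No inline style</p>')
    data = style_attribute_data(path)
    flagged = has_style_attributes(path)

print(data, flagged)
```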
has_text_content(el)
Returns whether an element contains text content.
The test is to see if there is raw text inside the element as its content. It must be direct text: text that sits inside the element itself and not inside any nested tag.
Source code in webcode_tk/html_tools.py (lines 437-444)
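The distinction between direct text and text inside a nested tag can be illustrated by tracking nesting depth while parsing. The sketch below is an assumption-laden stand-in for the idea, not the library's code: it uses the standard library's html.parser on a markup string that holds a single root element, treats whitespace-only text as no text, and the names DirectTextChecker and has_direct_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DirectTextChecker(HTMLParser):
    """Flags non-whitespace text that sits directly inside the root element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.has_direct_text = False

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # Depth 1 means "inside the root element but not inside a child".
        if self.depth == 1 and data.strip():
            self.has_direct_text = True


def has_direct_text(markup: str) -> bool:
    """Return True if the single root element in markup has direct text."""
    checker = DirectTextChecker()
    checker.feed(markup)
    checker.close()
    return checker.has_direct_text


print(has_direct_text("<p>Hello <em>world</em></p>"))
print(has_direct_text("<p><em>world</em></p>"))
```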
string_to_tag(el)
Takes html markup as a string and returns a bs4 Tag object
Parameters:
Name | Type | Description | Default |
---|---|---|---|
el | str | HTML code in the form of a string. (example: ' My Header') | required |

Returns:
Name | Type | Description |
---|---|---|
tag | Tag | A BeautifulSoup 4 Tag object |

Raises:
Type | Description |
---|---|
ValueError | to get a tag object, the BeautifulSoup object must start with an opening tag. Without an opening tag, soup.find() will return a None type object. |

BeautifulSoup Tag Object: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag
Source code in webcode_tk/html_tools.py (lines 206-242)
uses_inline_styles(markup)
Determines whether the markup uses inline styles and returns the result as a boolean.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
markup | Union[Tag, str] | the code in string or Tag form. | required |

Returns:
Name | Type | Description |
---|---|---|
has_inline_styles | bool | True if the markup contains a style attribute; False if it does not. |

Source code in webcode_tk/html_tools.py (lines 245-259)