NotADirectoryError: [Errno 20] Not a directory


Youssef Abdelmohsen wrote:

> Note: Beginner
> 
> I'm trying to create an html parser that will go through a folder and all
> its subfolders and export all html files without any html tags, in file
> formats CSV and TXT with each html labeled with the title of the web page
> in a new CSV and TXT.
> 
> However I keep getting an error saying:
> 
> Traceback (most recent call last):
>   File "/Users/username/Documents/htmlparser/parser10.py", line 59, in <module>
>     for subentry in os.scandir(entry.path):
> NotADirectoryError: [Errno 20] Not a directory: '/Users/username/site/.DS_Store'
> 
> Here's what I've done so far (I have bolded line 59):

The error message says it: in the outer loop you encounter a *file* called 
".DS_Store" that doesn't match your regex. You then pass it to the inner 
loop, i.e. entry.path below is a file:

> for subentry in os.scandir(entry.path):

However os.scandir expects a *directory* rather than a file.
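
The failure is easy to reproduce in isolation. Below is a minimal sketch, 
assuming a throwaway temporary directory that stands in for your site 
folder; can_scan() is a made-up helper name, not something from your script:

```python
import os
import tempfile


def can_scan(path):
    """Return True if os.scandir() accepts path, False if it is not a directory."""
    try:
        # Opening (and immediately closing) the iterator is enough to
        # trigger the error for a plain file.
        with os.scandir(path):
            return True
    except NotADirectoryError:
        return False


# Hypothetical stand-in for the site folder: one hidden file, one subfolder.
with tempfile.TemporaryDirectory() as site_directory:
    hidden_file = os.path.join(site_directory, ".DS_Store")
    open(hidden_file, "w").close()
    subfolder = os.path.join(site_directory, "pages")
    os.mkdir(subfolder)

    print("file:  ", can_scan(hidden_file))
    print("folder:", can_scan(subfolder))
```

The plain file is rejected and the folder is accepted, which is the same 
situation your outer loop runs into with ".DS_Store".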

To fix the immediate problem you can ensure that entry is a directory:

if entry.is_dir():
    for subentry in os.scandir(entry.path):
        ...

but wait a moment. I note that you copied the code to process the html file 
twice. This is bad practice as it's hard to keep the code in sync when you 
apply changes (you already have a bug because you refer to the `entry` 
variable of the outer loop in the inner loop, too).

Instead, use helper functions like

def is_html_file(filename):
    return bool(re.match(...))

def process_html_file(filename):
    ...
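
For illustration, here is one way the two helpers might be filled in. This 
is only a sketch under assumptions: I don't know your actual regex, so the 
pattern below (names ending in .html or .htm) is a guess, and the tag 
stripping uses the standard library's html.parser rather than whatever your 
script does. It returns the title and text instead of writing CSV/TXT files:

```python
import re
from html.parser import HTMLParser

# Assumed pattern: any name ending in .html or .htm, case-insensitive.
HTML_NAME = re.compile(r".*\.html?$", re.IGNORECASE)


def is_html_file(filename):
    return bool(HTML_NAME.match(filename))


class TextExtractor(HTMLParser):
    """Collect the <title> text and all other text, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            # Note: text inside <script> and <style> is not filtered out here.
            self.chunks.append(data.strip())


def process_html_file(filename):
    """Return (title, text) for one html file."""
    with open(filename, encoding="utf-8", errors="replace") as f:
        parser = TextExtractor()
        parser.feed(f.read())
        parser.close()
    return parser.title.strip(), " ".join(parser.chunks)
```

With both helpers in one place, a change to the matching rule or to the 
tag stripping only has to be made once.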

The loops then become

for entry in os.scandir(site_directory):
    if entry.is_dir():
        for subentry in os.scandir(entry.path):
            if subentry.is_dir():
                for file in os.scandir(subentry.path):
                    if is_html_file(file.path):
                        process_html_file(file.path)
    elif is_html_file(entry.path):
        process_html_file(entry.path)

Hm, that still looks messy; there may be bugs. 

Do you really want to exclude the html files from the intermediate level? 

I'd suggest that instead you scan the whole tree. Enter os.walk():

for path, folders, files in os.walk(site_directory):
    for name in files:
        filename = os.path.join(path, name)
        if is_html_file(filename):
            process_html_file(filename)
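
To see what the os.walk() loop visits, here is a small self-contained 
sketch. It assumes a temporary directory tree invented for the demo and a 
simple suffix check in place of your regex; find_html_files() is a 
hypothetical name and only collects the paths instead of processing them:

```python
import os
import tempfile


def find_html_files(site_directory):
    """Return the paths of all html files anywhere below site_directory."""
    matches = []
    for path, folders, files in os.walk(site_directory):
        for name in files:
            # Assumed rule: a file counts as html if its name ends in
            # .html or .htm.
            if name.lower().endswith((".html", ".htm")):
                matches.append(os.path.join(path, name))
    return sorted(matches)


# Invented layout: html files at three different depths plus a stray file.
with tempfile.TemporaryDirectory() as site_directory:
    os.makedirs(os.path.join(site_directory, "a", "b"))
    for relative in (".DS_Store", "index.html",
                     os.path.join("a", "mid.html"),
                     os.path.join("a", "b", "deep.html")):
        open(os.path.join(site_directory, relative), "w").close()

    for filename in find_html_files(site_directory):
        print(os.path.relpath(filename, site_directory))
```

os.walk() only ever hands you file names in the files list, so ".DS_Store" 
never reaches os.scandir() and the html files at every level are found.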

While this doesn't do exactly the same thing it should be much clearer what 
it does ;)