
I've got a bunch of XML files under a directory tree which I would like to move into folders of the same name (minus the .xml extension) within that same directory tree.

Here is a sample structure (created in the shell):

touch foo.xml bar.xml "[ foo ].xml" "( bar ).xml"
mkdir -p foo bar "foo/[ foo ]" "bar/( bar )"

So my approach here is:

find . -name "*.xml" -exec sh -c '
  DST=$(
    find . -type d -name "$(basename "{}" .xml)" -print -quit
  )
  [ -d "$DST" ] && mv -v "{}" "$DST/"' ';'

which gives the following output:

‘./( bar ).xml’ -> ‘./bar/( bar )/( bar ).xml’
mv: ‘./bar/( bar )/( bar ).xml’ and ‘./bar/( bar )/( bar ).xml’ are the same file
‘./bar.xml’ -> ‘./bar/bar.xml’
‘./foo.xml’ -> ‘./foo/foo.xml’

But the file with square brackets ([ foo ].xml) hasn't been moved, as if it had been ignored.

I've checked, and basename (e.g. basename "[ foo ].xml" ".xml") produces the name correctly; however, find has problems with the brackets. For example:

find . -name '[ foo ].xml'

won't find the file. When the brackets are escaped ('\[ foo \].xml') it works fine, but that doesn't solve the problem, because the name is generated inside the script and I don't know in advance which files contain those special (shell?) characters. Tested with both BSD and GNU find.

Is there a universal way of escaping filenames for use with find's -name parameter, so that I can correct my command to support files containing these metacharacters?

  • For this situation, I would change direction, using find only to locate the files to handle, making it call a script (e.g., in Perl) that could handle the filenames without contaminating them. (Feb 13, 2016 at 16:18)

4 Answers


It's so much easier with zsh globs here:

for f (**/*.xml(.)) (mv -v -- $f **/$f:r:t(/[1]))

Or if you want to include hidden xml files and look inside hidden directories like find would:

for f (**/*.xml(.D)) (mv -v -- $f **/$f:r:t(D/[1]))

But beware that files called .xml, ..xml or ...xml would become a problem, so you may want to exclude them:

setopt extendedglob
for f (**/(^(|.|..)).xml(.D)) (mv -v -- $f **/$f:r:t(D/[1]))
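For readers less used to zsh, here is my reading of those globs: the (.) qualifier restricts **/*.xml to regular files (D additionally includes dot files), $f:r strips the .xml extension and :t takes the tail (the basename), and the (/[1]) qualifier on the destination glob keeps only the first directory of that name. Crucially, zsh does not re-interpret the expanded $f:r:t as a pattern, so brackets and other wildcard characters in the name are matched literally. A small illustration with one of the names from the question:

f='[ foo ].xml'
print -r -- **/$f:r:t(/[1])    # prints foo/[ foo ] in the sample tree; reports an error if no such directory exists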

With GNU tools, another approach that avoids having to scan the whole directory tree for each file would be to scan it once for all directories and xml files, record where they are, and do the moving at the end:

(export LC_ALL=C
find . -mindepth 1 -name '*.xml' ! -name .xml ! \
  -name ..xml ! -name ...xml -type f -printf 'F/%P\0' -o \
  -type d -printf 'D/%P\0' | awk -v RS='\0' -F / '
  {
    if ($1 == "F") {
      root = $NF
      sub(/\.xml$/, "", root)
      F[root] = substr($0, 3)
    } else D[$NF] = substr($0, 3)
  }
  END {
    for (f in F)
      if (f in D) 
        printf "%s\0%s\0", F[f], D[f]
  }' | xargs -r0n2 mv -v --
)

Your approach has a number of problems if you want to allow any arbitrary file name:

  • embedding {} in the shell code is always wrong. What if there's a file called $(rm -rf "$HOME").xml for instance? The correct way is to pass those {} as arguments to the in-line shell script (-exec sh -c 'use as "$1"...' sh {} \;).
  • With GNU find (implied here since you're using -quit), *.xml only matches files whose name consists of a sequence of valid characters followed by .xml, which excludes file names containing characters that are invalid in the current locale (for instance file names in the wrong charset). The fix for that is to set the locale to C, where every byte is a valid character (that means error messages will be displayed in English, though).
  • If any of those xml files are of type directory or symlink, that would cause problems (affect the scanning of directories, or break symlinks when moved). You may want to add a -type f to only move regular files.
  • Command substitution ($(...)) strips all trailing newline characters. That would cause problems with a file called foo␤.xml for instance. Working around that is possible but a pain: base=$(basename "$1" .xml; echo .); base=${base%??}. You can at least replace basename with the ${var#pattern} operators. And avoid command substitution if possible.
  • Your problem with file names containing wildcard characters (?, [, * and backslash): they are not special to the shell here, but to the pattern matching (fnmatch()) done by find, which happens to be very similar to shell pattern matching. You'd need to escape them with a backslash.
  • the problem with .xml, ..xml, ...xml mentioned above.

So, if we address all of the above, we end up with something like:

LC_ALL=C find . -type f -name '*.xml' ! -name .xml ! -name ..xml \
  ! -name ...xml -exec sh -c '
  for file do
    base=${file##*/}
    base=${base%.xml}
    escaped_base=$(printf "%s\n" "$base" |
      sed "s/[[*?\\\\]/\\\\&/g"; echo .)
    escaped_base=${escaped_base%??}
    find . -name "$escaped_base" -type d -exec mv -v "$file" {\} \; -quit
  done' sh {} +

Phew...

Now, that's not all. With -exec ... {} +, we run as few sh invocations as possible. If we're lucky, we'll run only one; but if not, after the first sh invocation we'll have moved a number of xml files around, and then find will continue looking for more and may very well find the files we moved in the first round again (and most probably try to move them to where they already are).

Other than that, it's basically the same approach as the zsh ones. A few other notable differences:

  • with the zsh one, the file list is sorted (by directory name and file name), so the destination directory is more or less consistent and predictable. With find, it's based on the raw order of files in directories.
  • with zsh, you'll get an error message if no matching directory to move the file to is found, not with the find approach above.
  • With find, you'll get error messages if some directories cannot be traversed, not with the zsh one.

A last note of warning: if the reason you get some files with dodgy file names is that the directory tree is writable by an adversary, then beware that none of the solutions above is safe if the adversary can rename files under the feet of that command.

For instance, if you're using LXDE, the attacker could make a malicious foo/lxde-rc.xml, create a lxde-rc folder, detect when you're running your command and replace that lxde-rc with a symlink to your ~/.config/openbox/ during the race window (which can be made as large as necessary in many ways) between find finding that lxde-rc and mv doing the rename("foo/lxde-rc.xml", "lxde-rc/lxde-rc.xml") (foo could also be changed to that symlink making you move your lxde-rc.xml elsewhere).

Working around that is probably impossible using standard or even GNU utilities; you'd need to write it in a proper programming language, doing some safe directory traversal and using renameat() system calls.

All the solutions above will also fail if the directory tree is deep enough that the limit on the length of the paths given to the rename() system call done by mv is reached (causing rename() to fail with ENAMETOOLONG). A solution using renameat() would also work around the problem.

  • A highly complex solution that could be avoided if my proposal based on switching off globbing was used. – schily (Feb 14, 2016 at 12:12)
  • @schily, set -f is irrelevant here. There's no shell globbing in the OP's question. The problem here is the fnmatch() pattern matching performed by find -name when we would like it to do a byte-to-byte comparison, and set -f won't affect find's behaviour. (Feb 14, 2016 at 12:17)
  • Did you read the text from the OP? He mentions that there is no problem with escaping the -name argument but that there are problems with processing the find results using the shell. – schily (Feb 14, 2016 at 12:38)
  • @schily: You're confused. What the OP asks for is escaping the filenames when used with find's -name parameter. The problem is the use of -name "$(basename "{}" .xml)" inside the inline script. There, basename returns something like [ foo ], which contains pattern-matching special characters, which the OP wants to escape. It has nothing to do with shell globbing. – cuonglm (Feb 14, 2016 at 16:32)
  • So you confirm that I am not confused and that the OP has problems with the shell part that could be fixed via set -f. – schily (Feb 14, 2016 at 21:56)

When you use an inline script with find ... -exec sh -c ..., you should pass the find results to the shell as positional parameters; then you don't have to embed {} everywhere in your inline script.

If you have bash or zsh, you can pass the basename output through printf '%q':

find . -name "*.xml" -exec bash -c '
  for f do
    BASENAME="$(printf "%q" "$(basename -- "$f" .xml)")"
    DST=$(find . -type d -name "$BASENAME" -print -quit)
    [ -d "$DST" ] && mv -v -- "$f" "$DST/"
  done
' bash {} +

With bash, you can use printf -v BASENAME instead of the command substitution. Note that this approach won't work properly if the file name contains control characters or non-ASCII characters.

If you want it to work properly, you need to write a shell function to escape only [, *, ? and backslash.
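A minimal sketch of such a function, assuming a POSIX sed (the name escape_name is made up for illustration), could look like this; it shares the trailing-newline caveat of command substitution discussed in the first answer:

escape_name() {
  # Escape only the fnmatch() metacharacters that find -name treats specially:
  # backslash, [, * and ?; every other byte is passed through unchanged.
  printf '%s\n' "$1" | sed 's/[[*?\\]/\\&/g'
}

It could then replace the printf '%q' line above, e.g. BASENAME=$(escape_name "$(basename -- "$f" .xml)").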

  • printf '%q' will not escape all characters in the proper way for find's -name, for example control characters or some non-ASCII characters. (Feb 13, 2016 at 17:00)
  • @StéphaneChazelas: Why do we need to escape other characters? I think [, * and ? are enough. – cuonglm (Feb 13, 2016 at 17:08)
  • printf '%q' is a ksh93 enhancement that is not part of POSIX. It will not work in portable scripts. set -f (see my answer) is portable. – schily (Feb 13, 2016 at 17:27)
  • @schily: Yes, I mention only bash and zsh there; printf '%q' in ksh93 won't work in this case. How can you make find aware of set -f? – cuonglm (Feb 13, 2016 at 17:31)
  • Yes, that's the problem: you only want to escape [, * and ?, while printf %q will escape others; for example it will change a newline character to $'\n', which for find -name is $'n'. (Feb 13, 2016 at 17:36)

The good news:

find . -name '[ foo ].xml'

is not interpreted by the shell; it is passed as-is to the find program. find, however, interprets the argument to -name as a glob pattern, and this needs to be taken into account.

If you like to call find -exec ... \; or, better, find -exec ... +, there is no shell involved.

If you like to process the find output with the shell, I recommend simply disabling file name globbing in the shell by calling set -f before the code in question, and switching it back on with set +f afterwards.
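For what it's worth, a minimal sketch of that pattern (my reading of this suggestion; it assumes file names contain no newline characters) might be:

set -f       # turn off pathname expansion so *, ? and [ ] in the names are left alone
IFS='
'            # split the command substitution on newlines only
for f in $(find . -name '*.xml'); do
  printf 'found: %s\n' "$f"
done
set +f       # restore globbing
unset IFS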

  • set -f doesn't make any difference, as find still can't find the file (set -f; find . -name '[ foo ].xml'), if I understood the answer correctly. – kenorb (Feb 13, 2016 at 18:01)
  • Well, I mentioned that you should use set -f for shell commands that process the filenames. This is not the find command, as find just produces the filenames. – schily (Feb 13, 2016 at 18:59)
  • I think you're missing the point of the question here. The OP is calling find within find, with the argument to the -name of the second find being provided by the first find and transformed by a shell. (Feb 13, 2016 at 22:19)
  • The OP is calling the shell with the names found by find, and if the OP used set -f as the first instruction in those shell commands, there would be no globbing and thus no resulting problems. – schily (Feb 14, 2016 at 12:06)
  • @schily, you're being confused; there's no shell globbing here. (Feb 14, 2016 at 12:18)

The following is a relatively straightforward, POSIX-compliant pipeline. It scans the hierarchy twice, first for directories and then for *.xml regular files. A blank line between the two scans signals the transition to AWK.

The AWK component maps basenames to destination directories (if multiple directories share the same basename, only the first one encountered is remembered). For each *.xml file, it prints a tab-delimited line with two fields: 1) the file's path and 2) its corresponding destination directory.

{
    find . -type d
    echo
    find . -type f -name \*.xml
} |
awk -F/ '
    !NF { ++i; next }
    !i && !($NF".xml" in d) { d[$NF".xml"] = $0 }
    i { print $0 "\t" d[$NF] }
' |
while IFS='     ' read -r f d; do
    mv -- "$f" "$d"
done

The value assigned to IFS just before the read is a literal tab character, not a space.

Here's a transcript using the original question's touch/mkdir skeleton:

$ touch foo.xml bar.xml "[ foo ].xml" "( bar ).xml"
$ mkdir -p foo bar "foo/[ foo ]" "bar/( bar )"
$ find .
.
./foo
./foo/[ foo ]
./bar.xml
./foo.xml
./bar
./bar/( bar )
./[ foo ].xml
./( bar ).xml
$ ../mv-xml.sh
$ find .
.
./foo
./foo/[ foo ]
./foo/[ foo ]/[ foo ].xml
./foo/foo.xml
./bar
./bar/( bar )
./bar/( bar )/( bar ).xml
./bar/bar.xml
  • Nice, but it still doesn't allow arbitrary file names, as it won't work if file paths contain TAB or newline characters (and potentially some other pathological cases as mentioned in my answer). (Feb 14, 2016 at 22:48)
