csplit multiple files into multiple files

Question

folks-

I'm a bit stumped on this one. I'm trying to write a bash script that uses csplit to take multiple input files and split them according to the same pattern. (For context: I have multiple TeX files with questions in them, separated by the \question command. I want to extract each question into its own file.)

The code I have so far:

#!/bin/bash
# This script uses csplit to run through an input TeX file (or list of TeX files) to separate out all the questions into their own files.
# This line is for the user to input the name of the file they need questions split from.

read -ep "Type the directory and/or name of the file needed to split. If there is more than one file, enter the files separated by a space. " files

read -ep "Type the directory where you would like to save the split files: " save

read -ep "What unit do these questions belong to?" unit

# This is a check for the user to confirm the file list, and proceed if true:

echo "The file(s) being split is/are $files. Please confirm that you wish to split this file, or cancel."
select ynf in "Yes" "No"; do
    case $ynf in 
        No ) exit;;
        Yes ) echo "The split files will be saved to $save. Please confirm that you wish to save the files here."
            select ynd in "Yes" "No"; do
            case $ynd in
                Yes )
#                   This line will create a loop to conduct the script over all the files in the list.
                    for i in ${files[@]}
                    do
#                   Mass re-naming is formatted to give "question###.tex" to enable processing a large number of questions quickly.
#                   csplit is the utility used here; run "man csplit" to learn more of its functionality.
#                   the structure is "csplit [name of file] [output options] [search filter] [separator(s)]".
#                   this script calls csplit, will accept the name of the file in the argument, searches the files for calls of "question", splits the file everywhere it finds a line with "question", and renames it according to the scheme [prefix]#[suffix] (the %03d in the suffix-format is what increments the numbering automatically).
#                   the '\\question' allows searching for \question, which eliminates the split for \end{questions}; eliminating the \begin{questions} split has not yet been understood.
                        csplit $i --prefix=$save'/'$unit'q' --suffix-format='%03d.tex' /'\\question'/ '{*}'
                    done; exit;;
                No ) exit;;
            esac
        done
    esac
done

exit

I can confirm the loop runs over my input files as intended. However, csplit splits the first file into "q1.tex q2.tex q3.tex" as expected, and then, when it moves on to the next file in the list, it splits the questions and overwrites the old files; the third file's splits overwrite the second file's, and so on. What I would like to happen is that, say, if File1 has 3 questions, it will output:

q1.tex
q2.tex
q3.tex

And then if File2 has 4 questions, it will then continue incrementing to:

q4.tex
q5.tex
q6.tex
q7.tex

Is there a way for csplit to detect the numbering that has already been done in this loop, and increment appropriately?

Thanks for any help you folks can offer!

Chris Davies · Accepted Answer · 2020-01-03 13:57:31Z

The csplit command has no saved context (and nor should it), so every run restarts its numbering at 0 (with your %03d suffix format, the first piece is always ...000.tex). csplit has no option to change the starting number, but you could maintain your own counted value that you interpolate into the prefix string.
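As a rough sketch of that first idea, the loop below keeps a per-file counter and builds it into the prefix, so pieces from different input files cannot overwrite each other. The function name split_each and the "-fNN-q" naming scheme are made up for illustration, and the options assume GNU csplit. Note that this numbers questions per file rather than continuously.

```shell
# Hypothetical helper: split_each SAVE_DIR UNIT FILE...
# Assumes GNU csplit and that SAVE_DIR already exists.
split_each() {
    local save="$1" unit="$2"
    shift 2
    local n=0 f
    for f in "$@"; do
        n=$((n + 1))
        # Our own counter goes into the prefix, e.g. SAVE_DIR/UNIT-f01-q000.tex
        csplit --quiet \
            --prefix="$(printf '%s/%s-f%02d-q' "$save" "$unit" "$n")" \
            --suffix-format='%03d.tex' \
            "$f" '/\\question/' '{*}'
    done
}
```

It would be called as, say, split_each "$save" "$unit" "${files[@]}" once files is a real array.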

Alternatively, try replacing

read -ep "Type the directory and/or name of the file needed to split. If there is more than one file, enter the files separated by a space. " files

...

for i in ${files[@]}
do
    csplit $i --prefix=$save'/'$unit'q' --suffix-format='%03d.tex' /'\\question'/ '{*}'
done

with

read -a files -ep 'Type the directory and/or name of the file needed to split. If there is more than one file, enter the files separated by a space. '

...

cat "${files[@]}" | csplit - --prefix="$save/${unit}q" --suffix-format='%03d.tex' '/\\question/' '{*}'

This is one of the relatively rare instances where one really does need to use cat {file} | ... as csplit takes only a single file argument (or - for stdin).

I've changed your read action to use an array variable since that's what you are (correctly) trying to use in your for ... do csplit ... loop.

Regardless of what you finally decide to do, I'd strongly recommend you double-quote all your variables where you use them, particularly any further use of an array list such as "${files[@]}".
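To illustrate why the quoting matters, here is a small bash sketch with made-up file names, one of which contains a space. The helper count_args is hypothetical and only reports how many arguments it received.

```shell
#!/bin/bash
# Hypothetical file names; the first one contains a space.
files=("unit 1.tex" "unit2.tex")

# Print the number of arguments received.
count_args() { echo "$#"; }

count_args ${files[@]}      # unquoted: "unit 1.tex" is split into two words
count_args "${files[@]}"    # quoted: each array element stays one argument
```

With the default IFS, the unquoted call reports 3 arguments and the quoted call reports 2, so a command such as cat or csplit would be handed file names that do not exist in the unquoted case.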

Absolutely brilliant---I should really learn all the tools like grep, cat, etc much better than I know them, now. I ran across csplit, first, because I was searching for "how to split file," and was extending from there, but should have gone in first with the main question of "what am I designing this to do, first?" and understood the tools out there, to begin with. Many thanks! — Wayne, Commented Jan 3, 2020 at 14:55

JJoao · Accepted Answer · 2020-01-05 19:06:12Z


With Awk you could run something along the lines of:

awk '/\\question/ {i++} ; {print > ("q" i ".tex")}'  exam*.tex

If you want to define the output directory (d) and topic (t), and control the width of the number:

awk '/\\question/ {f=sprintf("%s/%s-q%03d.tex", d, t, i++)} {print>f}' d=d1 t=t1 ex*

To skip the TeX preamble, we can "print" only once "f" is defined:

awk '/\\question/ {f=sprintf("%s/%s-q%03d.tex", d, t, ++i)} 
     f            {print>f}' d=d1 t=t1 ex*
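One possible way to wrap that last variant in a shell function is sketched below. The function name split_questions is hypothetical, and the close(f) call is an addition that is not in the one-liner above: some awk implementations limit how many files can be open at once, so closing each finished piece is a cautious choice for inputs with many questions.

```shell
# Hypothetical helper: split_questions OUT_DIR TOPIC FILE...
# Assumes OUT_DIR already exists.
split_questions() {
    local outdir="$1" topic="$2"
    shift 2
    awk -v d="$outdir" -v t="$topic" '
        /\\question/ {
            if (f) close(f)                      # finish the previous piece
            f = sprintf("%s/%s-q%03d.tex", d, t, ++i)
        }
        f { print > f }                          # skip lines before the first \question
    ' "$@"
}
```

Because i is never reset between input files, the numbering continues across them, for example split_questions d1 t1 ex*.tex. Any text that follows the last question of one file, including the next file's preamble, ends up in that last question's piece.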

edited Jan 5, 2020 at 19:06

answered Jan 3, 2020 at 16:41

JJoao


I think this should work, but it may run afoul of the problem I had with durmus's response, which is that I need more from the files than just the lines where the questions are stored. Each question also has lines that may involve diagrams, solutions, etc., that are not marked by the \question command. So while this would work for pulling out the question lines only, I think roaima's response with the cat command combined with csplit gives the output I need. That said, many thanks for the working solution!
– Wayne
Commented Jan 3, 2020 at 18:40
@Wayne, I think my solution should produce a result similar to Csplit. The only difference should be text before the first question. Could you please present any unexpected output?
– JJoao
Commented Jan 3, 2020 at 18:58

JJoao, I should've just tried it---this does work, as intended! There is one wrinkle, which is that the awk command (as written) outputs one file with the preamble of my TeX file. The cat/csplit operation that @roaima described also did this, but I figured out how to use delimiters with csplit that skip outputting the preamble, so that I only get the question files. Not sure if there's a way to do this with awk, but I have upvoted this as another acceptable solution, because it does actually work as intended, and not as I had initially suspected. Thanks very much!
– Wayne
Commented Jan 4, 2020 at 19:24
@wayne, please see my variant.
– JJoao
Commented Jan 5, 2020 at 19:08
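The thread does not show which delimiters Wayne ended up using. As one possible sketch, assuming GNU csplit, a leading %REGEXP% pattern skips input up to the first match without writing a file, so only question pieces are produced. The function name split_skip_preamble is hypothetical.

```shell
# Hypothetical helper: split_skip_preamble SAVE_DIR UNIT FILE...
# Assumes GNU csplit and that SAVE_DIR already exists.
split_skip_preamble() {
    local save="$1" unit="$2"
    shift 2
    cat "$@" |
        csplit --quiet \
            --prefix="$save/${unit}q" \
            --suffix-format='%03d.tex' \
            - '%\\question%' '/\\question/' '{*}'
}
```

Only the preamble before the very first \question is skipped. Because the inputs are concatenated, the preamble of each later file is still appended to the last question of the file before it.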


durmus yılmaz · Answer · 2020-01-03 13:47 (edited by Chris Davies, 2020-01-03 13:53:42Z)


You can use this command:

grep -o -P '(parameter).*(parameter)' your_teX_file.teX > questions.txt

This gives you a questions.txt file containing all the questions, which you can then split:

split -l 1 questions.txt

edited Jan 3, 2020 at 13:53

Chris Davies


answered Jan 3, 2020 at 13:47

durmus yılmaz


Thanks for the response, though unfortunately this solution doesn't fully solve what I'm trying to do. The reason is that in the file, say, "test.tex," there will be questions formatted like: "\question some text (then on a number of new lines, some solutions, multiple choice answers, etc)". Using this will only give me the questions, whereas for the project I'm working on, I'm trying to collect the questions and their solutions, as well. Thanks for the help, though!
– Wayne
Commented Jan 3, 2020 at 14:49
