C++: Data sorting taking a long time

Question

I wrote a simple program that takes the enable1.txt word list of scrabble/words with friends and searches it for an input. In order to search the massive list of 172,820 words I figured sorting it then using a binary search would be a good idea.

The sorting algorithm I used was std::stable_sort as I wanted similar words stay in their locations. After profiling, the std::stable_sort is taking 37% of the run time.

Should I have used std::stable_sort? Would std::sort have been better? Would a different sort entirely had been a better idea? Is there a faster sort other than rolling my own?

Code:

#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <algorithm>
#include <ctime>

double Search(const std::vector<std::string>& obj);
double BSearch(const std::vector<std::string>& obj);

int main() {

    std::cout << "Loading database...";
    std::ifstream db;
    db.open("enable1.txt", std::ios::in);
    unsigned long number_of_words = 0;
    const unsigned short max_length = 50;
    while(db.fail() == false) {
        char dump[max_length];
        db.getline(&dump[0], max_length);
        ++number_of_words;
    }
    db.close();
    db.clear();
    std::cout << "done" << std::endl;

    std::vector<std::string> words(number_of_words);
    words.reserve(number_of_words);

    db.open("enable1.txt", std::ios::in);

    unsigned long i = 0;
    while(std::getline(db, words[i], '\n')) {
        ++i;
    }
    db.close();
    db.clear();

    std::cout << "Sorting database...";
    std::stable_sort(words.begin(), words.end());
    std::cout << "done" << std::endl << std::endl;
    double seconds = Search(words);
    std::cout << "Time for Search: " << seconds << std::endl;
    double b_seconds = BSearch(words);
    std::cout << "Time for BSearch: " << b_seconds << std::endl;
    std::cout << "Cleaning up resources, please wait..." << std::endl;
    return 0;
}

double Search(const std::vector<std::string>& obj) {

    std::cout << "Enter word: ";
    std::string word;
    std::getline(std::cin, word);

    std::cout << "Searching database..." << std::endl;
    std::vector<std::string, std::allocator<std::string> >::size_type s = obj.size();
    unsigned long mid = s / 2;
    unsigned long first = 0;
    unsigned long last = s - 1;

    std::clock_t end_time = 0;
    std::clock_t start_time = clock();
    while(first <= last) {
        std::cout << "Checking: " << word << " with " << obj[mid] << std::endl;
        int result = word.compare(obj[mid]);
        if(result == 0) {
            end_time = clock();
            std::cout << "Valid word." << std::endl;
            return std::difftime(end_time, start_time);
        } else if(result < 0) {
            last = mid - 1;
            mid = ((last - first) / 2 + first);
        } else {
            first = mid + 1;
            mid = ((last - first) / 2) + first;
        }
    }
    end_time = clock();
    std::cout << word << " is not a valid word." << std::endl;
    return std::difftime(end_time, start_time);
}

double BSearch(const std::vector<std::string>& obj) {

    std::cout << "Enter word: ";
    std::string word;
    std::getline(std::cin, word);

    std::cout << "Searching database..." << std::endl;
    std::clock_t end_time = 0;
    std::clock_t start_time = clock();
    if(std::binary_search(obj.begin(), obj.end(), word)) {
        end_time = clock();
        std::cout << "Valid word." << std::endl;
        return std::difftime(end_time, start_time);
    }
    end_time = clock();
    std::cout << word << " is not a valid word." << std::endl;
    return std::difftime(end_time, start_time);
}

The Search and BSearch methods are the same, one uses a hand-coded binary search and the other uses the STL version. (I was verifying their speed differences...unsurprisingly, there isn't one.)

As an after thought: Is there a better way to count the number of lines in a file? Maybe without having to open and close the file twice?

P.S. If you're wondering about the message at the end of the program, that's due to the vector of 170,000+ strings cleaning up after going out of scope. It takes a while.

Community · Accepted Answer · 2020-06-10 13:24:26Z

Answer to generic questions

In order to search the massive list of 172,820 words

That's relatively small (OK, small to medium).

I figured sorting it then using a binary search would be a good idea.

Yes that's a good idea.

The sorting algorithm I used was std::stable_sort as I wanted similar words stay in their locations.

Why? A stable sort means that if two words compare equal they keep their relative order. Since you are searching for a single value (not a group of values), should you not be de-duplicating your input anyway? Even if you want to keep multiple entries of the same word, is their position in the input file significant in any way?

Should I have used std::stable_sort?

No.

Would std::sort have been better?

Maybe.
I would consider using a sorted container that does the work of de-duplicating for you.

Would a different sort entirely had been a better idea?

The only way to know is to actually do it and test the difference. But std::sort provides a complexity of O(n log n) (guaranteed since C++11, average-case before that), which is hard to beat unless you know something about your input set.
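As a rough sketch of "do it and test the difference", something like the following could time any sorter on a copy of the word list. The name time_sort_ms is made up for this illustration, and std::chrono::steady_clock is my choice of clock, not something from the original program.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <vector>

// Hypothetical helper: sorts a *copy* of `words` with the supplied sorter
// and returns the elapsed wall-clock time in milliseconds.
// `words` is taken by value so every run starts from the same unsorted input.
template <typename Sorter>
double time_sort_ms(std::vector<std::string> words, Sorter sorter) {
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    sorter(words.begin(), words.end());
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would call it once with a lambda that wraps std::sort and once with one that wraps std::stable_sort, ideally several times each, and compare the numbers on your own machine rather than trusting a general rule.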

As an after thought: Is there a better way to count the number of lines in a file? Maybe without having to open and close the file twice?

You can rewind the file to the beginning with seekg() on the input stream (clear the stream's error state first).

The Search and BSearch methods are the same, one uses a hand-coded binary search and the other uses the STL version. (I was verifying their speed differences...unsurprisingly, there isn't one.)

Your timing is invalid. You are printing to a stream in the middle of the timed section. This will be the most significant cost in your search and will outweigh the cost of the search itself by an order of magnitude. Remove the std::cout << message; prints from the timed section and re-time.
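A minimal sketch of keeping I/O out of the timed region might look like this. TimedResult and timed_lookup are hypothetical names for illustration, and I am assuming std::chrono is available (C++11) instead of clock().

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <vector>

// Hypothetical result type: whether the word was found and how long
// the lookup alone took, in microseconds.
struct TimedResult {
    bool found;
    double microseconds;
};

// Times only the search itself. Nothing is printed between the two clock
// reads; the caller reports the result after the function returns.
TimedResult timed_lookup(const std::vector<std::string>& sorted_words,
                         const std::string& word) {
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    const bool found = std::binary_search(sorted_words.begin(), sorted_words.end(), word);
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    TimedResult result = {found, std::chrono::duration<double, std::micro>(stop - start).count()};
    return result;
}
```

A single lookup in a list this size is likely too fast to measure reliably, so in practice you would repeat the search many times and average.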

P.S. If you're wondering about the message at the end of the program, that's due to the vector of 170,000+ strings cleaning up after going out of scope. It takes a while.

Which message are you referring to? And define "a while". I would not expect the cleanup of the strings to be significantly slow (though there is a cost).

Comments on Code

Your code seems very dense. White space is your friend when writing readable code.

There is no need to open, close, and clear the file twice. Calling clear() after close() is redundant here, because a subsequent successful open() resets the stream state anyway (guaranteed since C++11). When reading a file I see little point in explicitly opening and closing it (let the constructor/destructor do that). If you need to read the file twice, rewind it instead of reopening it. See https://codereview.stackexchange.com/a/544/507

std::ifstream db("enable1.txt");

// Do Stuff

db.clear();                 // Clear the EOF flag
db.seekg(0, std::ios_base::beg); // rewind to beginning (offset first, then direction)

// Do more stuff

There is an easier way to count the number of words. Note there is also a safer version of getline() that uses strings and thus can't overflow.

std::string line;
std::getline(db, line);  // Reads one line.

Given that you are counting lines but calling them words, I assume the file has one word per line. Also, testing the state of the stream before reading from it is an anti-pattern and nearly always wrong.

while(db.fail() == false) {

This will result in an over-count of 1. The last successful read reads up to but not past the EOF, so no error flag is set and you re-enter the loop. You then try to read the next word (which is not there, so the read fails and the stream sets its EOF and fail flags), but you increment the word count anyway. If you do it this way then you need to check the state of the stream after the read.

while( db.SomeActionToReadFromIt()) {

Thus in all common languages you do the read as part of the while condition. The result of the read indicates whether the loop should be entered (if the read worked then enter the loop and process the value you just read).

The operator >> when used on a string will read one white-space separated word. So to count the number of words in a file a trivial implementation would be:

std::string line;
while(std::getline(db, line))
{
    ++number_of_words;
}

// Or alternatively

std::string word;
while(db >> word)
{
    ++number_of_words;
}

The second version matters here, because you can use stream iterators and some standard functions to achieve the same result. Note: stream iterators use the operator >> to read their target.

std::size_t size = std::distance(std::istream_iterator<std::string>(db),
                                 std::istream_iterator<std::string>());

If you want absolute speed then the C interface could be used (though I would not recommend it).

There is no need to set the size of the vector and then reserve the same size.

std::vector<std::string> words(number_of_words);
words.reserve(number_of_words);

Personally I would just use reserve(). That way you do not need to prematurely construct 170,000 empty strings. But then you would need to use push_back rather than explicit read into an element (so swings and roundabouts). An alternative to your loop is to use stream iterator again to copy the file into the vector:

unsigned long i = 0;
while(std::getline(db, words[i], '\n')) { // no need for the '\n' here!
    ++i;
}

// Alternatively you can do this:
std::copy(std::istream_iterator<std::string>(db), std::istream_iterator<std::string>(),
          std::back_inserter(words)
         );

Now you sort the container. Alternatively you can use a sorted container. I would consider using std::set. Then you can just insert all the words. std::set has a neat find() method that searches the now sorted container:

std::set<std::string>   words;
std::copy(std::istream_iterator<std::string>(db), std::istream_iterator<std::string>(),
          std::inserter(words, words.end())
         );

The container is automatically sorted and de-duplicated. And you can just use find on it:

if (words.find("Loki") != words.end())
{
    // We have found it.
}
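If you would rather stay with std::vector, the same sorted and de-duplicated state can be reached by hand. This is a different technique from the std::set approach above (std::sort followed by std::unique), and the two function names are invented for the sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts the vector and removes adjacent duplicates,
// leaving it in the same state a std::set would give you.
void sort_and_dedupe(std::vector<std::string>& words) {
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
}

// Lookup on the sorted vector. Only valid after sort_and_dedupe
// (or any other full sort) has been applied.
bool contains(const std::vector<std::string>& sorted_words, const std::string& word) {
    return std::binary_search(sorted_words.begin(), sorted_words.end(), word);
}
```

Which of the two is faster for lookups on this word list is something to measure rather than assume.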

Unless you are doing something really clever then let the compiler default the template arguments you are not specifying:

std::vector<std::string, std::allocator<std::string> >::size_type s = obj.size();

// Rather:

std::vector<std::string>::size_type s = obj.size();

You already know the type. Why are you changing type in mid function?

unsigned long mid = s / 2;
unsigned long first = 0;
unsigned long last = s - 1;

Use the same type you use for s. If that is too much to type then typedef it to something easier. But C++ is all about type and safety. Keep your types consistent.

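I believe there is a bug in your code. The indices are unsigned, so if the word you are looking for sorts before every entry, last = mid - 1 wraps around to a huge value when mid is 0. The condition first <= last then stays true and obj[mid] is read out of bounds, so instead of reporting "not found" the loop can lock up or crash. The same wrap happens in last = s - 1 if the vector is empty.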
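One way to sidestep unsigned wrap-around in a hand-written binary search is to use a half-open range [first, last), so that nothing is ever decremented below zero. This is a sketch with a made-up function name (contains_word), not a drop-in replacement for your Search function, and it keeps every index in the vector's own size_type.

```cpp
#include <string>
#include <vector>

// Hypothetical rewrite of the hand-coded search using a half-open range.
// `sorted_words` must already be sorted in ascending order.
bool contains_word(const std::vector<std::string>& sorted_words, const std::string& word) {
    typedef std::vector<std::string>::size_type size_type;
    size_type first = 0;
    size_type last = sorted_words.size();  // one past the end, so an empty vector is safe
    while (first < last) {
        const size_type mid = first + (last - first) / 2;
        const int result = word.compare(sorted_words[mid]);
        if (result == 0) {
            return true;
        } else if (result < 0) {
            last = mid;       // never mid - 1, so no wrap below zero
        } else {
            first = mid + 1;
        }
    }
    return false;
}
```

The loop ends as soon as the range is empty, whether the word sorts before the first entry, after the last one, or somewhere in between.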

+1, although I'd note that set will probably be less efficient than a sorted vector for searching.
– Winston Ewert, Commented Feb 9, 2012 at 15:31
Thanks for the due diligence. One thing to note: the reason I used a vector instead of a non-duplicate container is that there ARE no duplicates in the word list and I needed random-access speed for the binary search to work.
– Casey, Commented Feb 10, 2012 at 0:50
@Casey: Yes, you used std::vector so you can use binary search. If you use std::set the binary search becomes redundant, as the set is implemented to have the same characteristics as a binary tree. Thus the find on the tree is already O(log(n)). Alternatively, if you had used an unordered set the complexity becomes O(1) on average.
– Loki Astari, Commented Feb 10, 2012 at 5:33
