I wrote a simple program that takes the enable1.txt word list of scrabble/words with friends and searches it for an input. In order to search the massive list of 172,820 words I figured sorting it then using a binary search would be a good idea.
The sorting algorithm I used was std::stable_sort as I wanted similar words stay in their locations. After profiling, the std::stable_sort is taking 37% of the run time.
Should I have used std::stable_sort? Would std::sort have been better? Would a different sort entirely had been a better idea? Is there a faster sort other than rolling my own?
Code:
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <algorithm>
#include <ctime>
double Search(const std::vector<std::string>& obj);
double BSearch(const std::vector<std::string>& obj);
int main() {
std::cout << "Loading database...";
std::ifstream db;
db.open("enable1.txt", std::ios::in);
unsigned long number_of_words = 0;
const unsigned short max_length = 50;
while(db.fail() == false) {
char dump[max_length];
db.getline(&dump[0], max_length);
++number_of_words;
}
db.close();
db.clear();
std::cout << "done" << std::endl;
std::vector<std::string> words(number_of_words);
words.reserve(number_of_words);
db.open("enable1.txt", std::ios::in);
unsigned long i = 0;
while(std::getline(db, words[i], '\n')) {
++i;
}
db.close();
db.clear();
std::cout << "Sorting database...";
std::stable_sort(words.begin(), words.end());
std::cout << "done" << std::endl << std::endl;
double seconds = Search(words);
std::cout << "Time for Search: " << seconds << std::endl;
double b_seconds = BSearch(words);
std::cout << "Time for BSearch: " << b_seconds << std::endl;
std::cout << "Cleaning up resources, please wait..." << std::endl;
return 0;
}
double Search(const std::vector<std::string>& obj) {
std::cout << "Enter word: ";
std::string word;
std::getline(std::cin, word);
std::cout << "Searching database..." << std::endl;
std::vector<std::string, std::allocator<std::string> >::size_type s = obj.size();
unsigned long mid = s / 2;
unsigned long first = 0;
unsigned long last = s - 1;
std::clock_t end_time = 0;
std::clock_t start_time = clock();
while(first <= last) {
std::cout << "Checking: " << word << " with " << obj[mid] << std::endl;
int result = word.compare(obj[mid]);
if(result == 0) {
end_time = clock();
std::cout << "Valid word." << std::endl;
return std::difftime(end_time, start_time);
} else if(result < 0) {
last = mid - 1;
mid = ((last - first) / 2 + first);
} else {
first = mid + 1;
mid = ((last - first) / 2) + first;
}
}
end_time = clock();
std::cout << word << " is not a valid word." << std::endl;
return std::difftime(end_time, start_time);
}
double BSearch(const std::vector<std::string>& obj) {
std::cout << "Enter word: ";
std::string word;
std::getline(std::cin, word);
std::cout << "Searching database..." << std::endl;
std::clock_t end_time = 0;
std::clock_t start_time = clock();
if(std::binary_search(obj.begin(), obj.end(), word)) {
end_time = clock();
std::cout << "Valid word." << std::endl;
return std::difftime(end_time, start_time);
}
end_time = clock();
std::cout << word << " is not a valid word." << std::endl;
return std::difftime(end_time, start_time);
}
The Search and BSearch methods are the same, one uses a hand-coded binary search and the other uses the STL version. (I was verifying their speed differences...unsurprisingly, there isn't one.)
As an after thought: Is there a better way to count the number of lines in a file? Maybe without having to open and close the file twice?
P.S. If you're wondering about the message at the end of the program, that's due to the vector of 170,000+ strings cleaning up after going out of scope. It takes a while.