Intro
I am trying to populate a multi-column combo box with a large number of records. Depending on the user's selection, there can be anywhere from 1 to 50,000 items in the combo box.
I am using Entity Framework to get a list of objects, and I then create an anonymous type with a few columns from the database. I also add a calculated column, which hurts performance. This calculated column is an implementation of the Levenshtein distance that compares some user input to a column in the entity object.
Doing this on a single thread can take around 120 seconds, so I decided to use the Task Parallel Library to split the data into chunks, perform the calculation on each chunk, and then join the results together.
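For context on why the calculated column is expensive, here is a minimal sketch of a standard two-row Levenshtein distance. This is not the actual DMS.Matching.Levenshtein.Compare code; the class and method names below are made up for illustration. The point is that every comparison does work proportional to the product of the two string lengths, and that work is repeated for every row.

```csharp
using System;

public static class LevenshteinSketch
{
    // Hypothetical stand-in for the real Compare method.
    // Classic dynamic programming edit distance, keeping only two rows in memory.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;
        if (source.Length == 0) return target.Length;
        if (target.Length == 0) return source.Length;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Distance from the empty prefix of source to each prefix of target
        for (int j = 0; j <= target.Length; j++) previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                       // substitution or match
            }
            // Swap the rows so "previous" always holds the last completed row
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.Length];
    }
}
```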
The Code
    using (Models.CmdpEntities entity = new Models.CmdpEntities())
    {
        IOrderedQueryable<Models.Establishment> establishments = entity.Establishments
            .Where(establishment =>
                establishment.country == "IT" &&
                establishment.source == "ITRACK").OrderBy(e => e.Estab_key);

        var finalList = establishments.Take(1).ToList().Select(
            establishment =>
            new
            {
                establishment.Estab_key,
                establishment.iTrackId,
                establishment.Institution,
                establishment.address_1,
                establishment.city,
                establishment.postcode,
                establishment.country,
                establishment.province,
                Levenshtein = DMS.Matching.Levenshtein.Compare(establishment.Institution, "Test institution").ToString()
            }
        ).ToList();
        finalList.Clear();

        int count = establishments.Count();
        int divided = count / 10;
        int remainder = count % 10;

        Task[] tasks = new Task[10];
        for (int i = 0; i < 10; i++)
        {
            int numberToSkip = i * divided;
            int numberToTake = divided;
            if (i == 9)
                numberToTake += remainder;

            tasks[i] = Task.Factory.StartNew(() =>
            {
                finalList.AddRange(
                    establishments.Skip(numberToSkip).Take(numberToTake).ToList().Select(establishment =>
                    new
                    {
                        establishment.Estab_key,
                        establishment.iTrackId,
                        establishment.Institution,
                        establishment.address_1,
                        establishment.city,
                        establishment.postcode,
                        establishment.country,
                        establishment.province,
                        Levenshtein = DMS.Matching.Levenshtein.Compare(establishment.Institution, "Test institution").ToString()
                    }).ToList());
            });
        }
        Task.WaitAll(tasks);
    }
Code walkthrough
I wrap everything in a using statement for my entity context. I then create the "establishments" object, which returns an ordered result of my entity based on the Where condition. This query covers all the records I need, but they are not yet projected into the shape I want, because I split the results into batches and let each Task call "Select" on its own batch.
Next I create the "finalList" variable and set it to a list of my anonymous type. I only take 1 record because this is just a declaration of my anonymous type, so that I can add the results from each of my tasks to it. I am not sure if this is a good way to declare a list for anonymous types, but it was the easiest I could think of. Are there better ways?
I then call "Clear" on the "finalList" as I only inserted data in there to begin with so it would recognize the variable as a List of my anonymous type.
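One alternative I have seen for declaring an empty list of an anonymous type is to let generic type inference do the work, which avoids the extra database query and the "Clear" call. This is only a sketch: "CreateListLike" is a hypothetical helper name, the prototype lists just three of my columns, and I am assuming Estab_key is an int. It relies on the C# rule that anonymous types in the same assembly with the same property names and types, in the same order, are the same type.

```csharp
using System;
using System.Collections.Generic;

public static class AnonymousListSketch
{
    // Hypothetical helper: the argument is never stored,
    // it only lets the compiler infer T from an anonymous "prototype" object.
    public static List<T> CreateListLike<T>(T prototype)
    {
        return new List<T>();
    }

    public static int Demo()
    {
        // Prototype with the same property names, types and order as the real projection
        // (assumption: Estab_key is an int)
        var finalList = CreateListLike(new { Estab_key = 0, Institution = "", Levenshtein = "" });

        // Items of the same anonymous shape can now be added
        finalList.Add(new { Estab_key = 1, Institution = "Test institution", Levenshtein = "0" });
        return finalList.Count;
    }
}
```

A small named class for the combo-box row would avoid the trick altogether, at the cost of declaring one more type.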
I then calculate the number of items each task will process (the "divided" variable) and the remainder, so I know how many extra items the last task should process.
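To make the arithmetic concrete, here is the same split pulled out into a small standalone sketch. "ChunkSketch" and "Split" are names I made up for illustration, and the count of 45,003 is a hypothetical figure. With 10 chunks, "divided" is 4500 and "remainder" is 3, so the last chunk skips 40,500 rows and takes 4503.

```csharp
using System;

public static class ChunkSketch
{
    // Mirrors the skip/take arithmetic from the loop:
    // every chunk takes count / chunks items, and the last one also takes the remainder.
    public static (int Skip, int Take)[] Split(int count, int chunks)
    {
        int divided = count / chunks;
        int remainder = count % chunks;

        var ranges = new (int Skip, int Take)[chunks];
        for (int i = 0; i < chunks; i++)
        {
            int take = divided;
            if (i == chunks - 1)
                take += remainder;
            ranges[i] = (i * divided, take);
        }
        return ranges;
    }
}
```

One thing this makes visible: when there are fewer records than chunks, "divided" is 0, so the first nine chunks take nothing and the last one does all the work.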
Next I have my for loop, which creates a Task object for each batch of the data I need to process. Each task makes the selection, including a call to my "Compare" method, which uses the Levenshtein distance to return a value. I am splitting the work into 10 tasks, but I am unsure whether this is an efficient number.
I have to call "ToList" before making the selection because otherwise my Compare method will fail, as it is not recognized by LINQ to Entities.
Finally, I wait for all tasks to finish. A few tests showed a good improvement (down to around 30 seconds to process 45 thousand records), but I'm not sure if there is anything I can do to improve it further.
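Instead of hard-coding 10, one option I am aware of is to cap the parallelism explicitly, for example at Environment.ProcessorCount. Below is a minimal sketch using Parallel.For with ParallelOptions.MaxDegreeOfParallelism. The names "DegreeSketch" and "ComputeLengths" are hypothetical, and the string length is only a stand-in for the real comparison. Each iteration writes to its own slot of a pre-sized array, so no shared list is being modified.

```csharp
using System;
using System.Threading.Tasks;

public static class DegreeSketch
{
    // Runs a per-item calculation in parallel with an explicit upper bound on workers.
    public static int[] ComputeLengths(string[] names, int maxWorkers)
    {
        int[] results = new int[names.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        Parallel.For(0, names.Length, options, i =>
        {
            // Stand-in for the expensive calculation (assumption, not my real Compare call)
            results[i] = names[i].Length;
        });
        return results;
    }

    public static int[] Demo()
    {
        // One worker per logical core is a common starting point for CPU-bound work
        return ComputeLengths(new[] { "a", "bcd" }, Environment.ProcessorCount);
    }
}
```

Whether the core count is actually the best number for my data is something I would still have to measure.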
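Two things worry me about the approach above: List<T> is not documented as safe for concurrent writers, and an Entity Framework context is not meant to be used from several threads at once. An alternative I am considering is to load the rows with a single query and then parallelize only the in-memory calculation with PLINQ. The sketch below uses plain strings instead of my entities so that it is self-contained; "PlinqSketch", "Score" and "Project" are hypothetical names, and "Score" is only a cheap stand-in for the real comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlinqSketch
{
    // Cheap stand-in for the expensive per-row calculation (hypothetical)
    public static int Score(string institution, string input)
    {
        return Math.Abs(institution.Length - input.Length);
    }

    // The rows are assumed to be fully materialized already (one ToList() on the query),
    // so only the CPU-bound projection runs in parallel.
    public static List<string> Project(IEnumerable<string> institutions, string input)
    {
        return institutions
            .AsParallel()
            .AsOrdered() // keep the original ordering in the output
            .Select(name => name + ":" + Score(name, input))
            .ToList();   // PLINQ merges the results, no shared list is written to
    }
}
```

With this shape there is no manual skip/take arithmetic and no list shared between tasks, but I have not measured it against my 45 thousand records.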
Conclusion
I have a few concerns about whether I am handling anonymous types and using the TPL (Task Parallel Library) efficiently and with best practices in mind. For example, am I using the right number of Tasks?
I am not sure if I am going about this the right way, so any feedback would be much appreciated.