Stack overflow
[info]dezorithms
What naive quicksort implementation would be complete without a stack overflow in the initial tests? I'm guessing the partition algorithm is somehow a bit busted so I'm going to examine my partition testcases more closely.

More progress on partition
[info]dezorithms

I've been stuck on partition for a bit, so I made my test for the ordinary STL style partition successfully. There was a problem with binding the predicate to the strict weak ordering, but I managed to get around that with a custom binder. Unfortunately I think I may have done it a bit inelegantly, but I'll revisit that later; Probably after I start doing actual STL work.

In the progress of writing the binder I realized that my custom less functor was necissary for all the flavors of const:

template

<class _Tp>
struct less : public binary_function<_Tp, _Tp, bool>
{
  bool operator()(const _Tp& _Left, const _Tp& _Right) const { return (_Left < _Right);}
 
bool operator()(_Tp& _Left, const _Tp& _Right) const { return (_Left < _Right);}
 
bool operator()(const _Tp& _Left, _Tp& _Right) const { return (_Left < _Right);}
 
bool operator()(_Tp& _Left, _Tp& _Right) const{return (_Left < _Right);}
};

Where the std::less functor only has the first one. I dont think I need to return a reference, but my knowledge of the intricasies of C++ is limited. Where I'm writing a sort, theres the one that uses the < operation and one that uses the comp functor passed in. Now you can use std::less some of the time but I couldn't do that when my less functor mutates itself, as the test measurement function for instance that keeps track of how often you do a compare. So if I want a comprehensive less functor that models the builtin operator the same I think the example above is appropriate, but I do need to consult someone during an outside review to make sure. I sure hope so because I dont want to have to have two copies of the exact same algorithm for every piece of code that requires a strict weak ordering. So far my testing has me using 3 of the 4 flavors of the less functor and I think thats sufficient.

Going forward, I'm going to think about how to specialize for custom swap functions so I avoid assignment based rotates when swapping is much cheaper. I'm realizing that my full sort library is going to have a lot of partial specialization.

I'm also thinking about containers, both for the STL and the generic boost library extention. For instance I'm going to want to make a std::set and std::map based on red black trees, but in the boost extention I'll want to be able to use the same structure to plug in splay trees and other types of structures that have different cost/benifits.

Making a test suite for partitioning algorithms
[info]dezorithms
I'm going to have to make a test suite for my partitioning algorithms to fully test these implementations because the algorithms seem somewhat nontrivial after an initial inspection. So as part of that I'm going to make inspection algorithms similar to is_sorted, like is_partitioned and is_fatpartitioned, which I'm going to include in the main library I think. Another issue I'm going to have to investigate is a more compartmentalized random data generator which currently I'm copying around from testsuite to test suite. For now I'm going to put that off for another project.

So hopefully I'll have the test suite done tomorrow with some representation of several partitioning algorithms. I'm going to mess around with the subversion repository this weekend so I actually start making regular checkins. I'll possibly make a public repository on sourceforge or google code next week.

Quicksort partitioning algorithms
[info]dezorithms

Currently I'm trying to figure out a good quicksort partitioning algorithm, and its a bit tricky considering how my code works. All of them assume that you do sampling to pick a good median of n elements. But if I put the pivot on the edge and swap it out afterwords I destroy some of the ordering that might be present without doing a costly rotate. If I leave it in the middle and take the value, then I end up doing a wasteful self-compare... which might not be a big deal but I know that that sometimes causes big performace costs, like when you're sorting cyclic shifts of a string as in BWT. I want to avoid self comparison if possible. I can use a fat (fit) pivot as described here:

http://www.ddj.com/cpp/184403848

But the problem with that is its a relatively complicated algorithm from what I can tell that has some extra overhead doing checks on subarray boundaries. In addition, I feel its likely unnecessary given the adaptive check after the partition, but I may investigate it. There is no merging required on the pivot after all.

Given STLPort and the STL doesn't seem to have any prohibitions on self-comparison, maybe I'll just deal with that because in most cases you can shove the pivot into a register and it runs much faster. I'm guessing I'll use the fit pivot for the general case for avoiding self-comparisons and specialize for builtin types.

Right now I'll just have to implement several types of partitioners and play with them. I'm leaning towards a fit pivot currently.

Short term goals
[info]dezorithms
I now know what I need to do with this project before moving onto anything else for the largest impact. I'm going to have to write my own STL. I'm not going to go the way of STLPort. I'm ignoring old compilers. I'm going to include all the syntactic sugar I can use to give good error information and readability of the source, along with powerful updated algorithms. I'm realizing that most of my algorithms that I'm playing with now belong in an STL, and thats where they can have the biggest effect on improving speed. I'll do full partial specialization wherever I can. Its a large project, but for me its entirely tractable since I know how to do everything in it.

Then perhaps I'm going to make a C runtime library written in C++ using the STL that I'll have finished to leverage the work I'll have done for the rest of the world that still has libraries in C. qsort should get a nice speedup if nothing else. I'd likely use boost for quite a bit of it. The downside to this is the CRT is on many systems real OS code, and so theres a lot of platform specific stuff wedged in there. But I should be able to make one for windows at least since its not an OS file and just do a lot of forwarding to kernel32.

After that, many of the algorithms that wont fit into the STL will go into boost, such as parallel sorting, sorting on forward or bidirectional iterators, direct implementations of shellsort, exposure of swapping functions and so on. The one problem I'm trying to figure out is how to maintain the code that will go into boost with the STL code that will largely be identical but in different namespaces with different exposure levels.

Probably the hardest part for me to implement will be all the data structures. I'll be thinking on this over the next several weeks, but I still have to finish my work on my basic sort.

Long term dream goals.
[info]dezorithms
I've been thinking about where I want to go with this project. In the short term:

Finish up the generic sort library. I'm in the middle of the first part of the first stage. This is making the big generic sort algorithm I'm working on now that is faster and more adaptive than the ordinary standard library sort, along with making versions that work for forward and bidirectional iterators and making stable sorts. After that there are parallel sorts I want to finish, probably with the boost::threads infrastructure. Then add functionality of sorting networks for exploiting hardware and hardware compilers. In the process of this probably add sorting like algorithms that will inevitably be subfunctions of some of these algorithms like merge, inplace merge, partition, selection, etcetera.

After that some string sorting algorithms that for instance to suffix sorting for BWT.

Then I'll probably do some data structure work with containers or something that do string structures like suffix trees. I might use the boost graph library for this but I'm not entirely sure yet. The last time I messed with that it was somewhat of an abstraction tarpit. Very elegant but I didn't really know where to start or how to get where I wanted to go.

Then I want to do some work on lossless data compression using the infrastructure I've built, mostly focusing on BWT.

But some things I'd like to do if I have several years or decades would be:

A C++ compiler built using boost spirit like framework and llvm type design. I probably wont get around to this because its such a huge project and there isn't much that it serves. clang and llvm should get there, but there is a different design philosophy going into that project that is somewhat in opposition of boost style coding. Another reason to do this project is to add language constructs for nice tools like automatic formal verification and model checking.

A generalized computer algebra system for C++. Theres Ginac, but first its GPL and second its not abstract enough. The problem I have with writing computer algebra systems is I sink into abstraction tarpits and end up trying to write a theorem prover that trys to do everything.

On to the next part: quicksort
[info]dezorithms
I think I'm satisfied with shellsort and insertion sort for now. I'll be moving on to the quicksort algorithm and the partitioning algorithms now. I've got some data on gap sequences that is interesting enough to put together a gap generator when I get to that part, but I think I need to actually play with it inside a quicksort for the data to be meaningful.

From the looks of it, shellsort should provide a significant boost to quicksort performance compared to insertion sort for small subfiles, so I'm fairly happy with that, but I cant tell until I have some more data to play with.

Gapfind program complete.
[info]dezorithms
I finished my program to search for optimal shellsort gap sequences. Now the problem is I have a whol lot of data that I have to take apart that I'm not sure I understand fully yet. I'm happy I got this done, so tomarrow I have some work to do trying to figure out the meaning behind all the data and weather I'll implement my gap sequence as a giant switch statement or something else.

Shellsort more fleshed out
[info]dezorithms
Shellsort has been mostly fleshed out. Now I just need to write my search program to find optimal gap sequences for set length sets. I made a gap generator class for handling the gaps, and will use that both in my experiments and in the final code, likely with a switch statement on the subfile size.

Got shellsort working.
[info]dezorithms
I got a simple version of shellsort working after having some problems with the less functor. Currently it just does simple shell increments and isn't completely iterator safe. I'm trying to figure out how to avoid iterator overrun problems that are present in the naive version of shellsort implementation and remaking it so I can pass in specified gap sequences so I can do my specialized search for gap sequences.

Found out one of my optimizations is likely pointless
[info]dezorithms
One of the optimizations I thought I was doing is likely pointless, and negates the purpose of doing a reverse insertion sort. Its still useful for partially ordered subfiles from a quicksort partition so its not an entire waste, but I found out that the extra increment in the forward insertion sort is done in a rotate along with an extra compare. Well, live and learn.

So my gapped insertion sorter will likely be forward rather than backward.

Finished reverse insertion sort
[info]dezorithms
I fixed my bug and made the reverse insertion sort with the fixed decision trees as well as adding a swapper to avoid the rotate loop. Now I'm trying to figure out how to do the binary insertion sort. It seems to only make sense when the sorted subfile is bigger than 5 or 6, which is just barely too big to use a strait decision tree for sorting (120 leaves is just too much code bloat.) I'm also trying to make sure I can integrate this into a shell sort.

I keep looking at the code and replaying all the different trade offs. I think I'm going to use my regular reverse insertion sort for up to 7 elements, then use a binary insertion sort. How this works with shell is unclear now, because I'm fairly sure it will change how the gap sequence works and the total number of assignments. I think it will reduce the number of comparisons and increase the number of assignments. And it adds another variable besides the gap sequence: the insertion search offset sequence. I expect there are different offsets for each gap, and it makes for an interesting problem to find an optimal one. I'm not entirely sure what I'm going to do for it yet, so I may just have to guess, and measure the problem with some rule like 1 compare = 5 assignments or something.

Not wasting comparisons: Another optimization for adaptive quicksort
[info]dezorithms
In my adaptive quicksort implementation I originally was going to test for sorted subpartitions and repartition them if the number of sorted elements was smaller than whats efficient for an inplace merge. I realized today that I want to save the number of sorted elements sorted even if they are too small for an inplace merge. When you go to partition the elements, you have a list of elements that are allready ordered from the is sorted test at the end of the last partition which is useful for finding the median, and then after that for doing a binary search for finding which elements need to be swapped because they're on the wrong side of the pivot.

Unfortunately I wont get to this for at least a month.

Working on reverse insertion sort
[info]dezorithms
Currently I'm trying to reverse my insertion sort so when it sorts [first, last) it starts with last-1 to first, instead of the other way around. When I wrote my insertion sort originally I did it from first to last, but it turns out there is an extra operation that can be removed if I proceed from last to first in finding the insertion point.

I can get rid of that operation if I proceed from first to last also, but only if I assume first can be decremented, not something that I can assume with a truly generic algorithm. Now I'll need the forward version as well for my final sort algorithm, but that's already written. I'm thinking my shell sort will be based on my reverse insertion sort. Currently I'm running into a bug that I'm trying to take apart, but I should have that done shortly.

Pure decision trees a waste of time for anything more than 4 elements
[info]dezorithms
The number of permutations seems to make combinatorial explosion just too large. I'm estimating the code for ford-johnson style median selection is 1/5th the size of my pure decision tree. which is maybe 10% faster absent caching concerns. When caching is taken into account it seems that it is likely slower. There may be some places where my pure decision tree is faster, but I'm not messing around with anything larger than 4 elements anymore, and likely only 3.

 But it was an interesting exercise. Maybe I'll find a use for it now that its done at least. Now I just need to finish the ford-johnson median selector for 5 elements and move on to the insertion/shell problem.

Small security concern regarding find insertion point optimization
[info]dezorithms
I realize that there is a very small security concern regarding the find insertion sort optimization that relys on the array order to find the beginning of the array. If someone has access to an operator overload of less that mutates the elements, it can scan back before the start of the array. Now, if they have access to the code to overload an operator, they can trash all over the memory with something a little more straitforward, but I should probably make note of that in the code and provide a safe version of find insertion point for the truely paranoid.

Discovered something while doing the select 3 of 4 decision tree
[info]dezorithms
My naive method of selecting 2 or 3 from 4 required 104 comparisons out of 24 permutations. While doing the 3 of 4 algorithm I realized I could do it in 100 comparisons. One problem I'm running into however is that the decision trees aren't obvious for balancing in an adaptive fasion. It looks like I'll have to play with balancing decision trees sooner than I expected.

Its tedious work. Now I remember why once I wrote a code generator for decision trees. The problem with that is thats a lot of work in itself.

Open problem: balancing decision trees
[info]dezorithms


One of the problems I'm going to have to work on is how to balance the decision trees of sorting and selection algorithms. Obviously they need to be adaptive for sorted arrays as the most likely, but beyond that I need to rely on quicksort partitioning affecting the probabilities of the subarrays. I've done the work for the threesort already, but foursort hasn't been balanced to the most probable permutations having the least number of comparisons yet. While this isn't an immediate concern, its something to keep in mind for later if I keep optimizing my sort.

It also is important to consider when designing my selection algorithms. My median of five selection algorithms can do anywhere from 4 to 8 comparisons, and my 2nd and 3rd of 4 can do anywhere from 3 to 5. I dont have the data yet for optimizing the decision trees to make the least comparisons on the most probable permutations yet (from random data after quicksort partitioning) so I'm going to have to consider this later. This is further complicated by trying to predict the most probable permutations that shellsort is going to see after a quicksort partition, so I'll have a number of interesting avenues to investigate if I really want to optimize the sort. Most likely I wont get around to it but I think its good to make a note of it now if I decide to really optimize the decision trees.

Another problem I'm encountering now that I'm not sure I'll have to deal with later is ensuring stability in selection algorithms. When I attempt to do my inplace stable sort I'm going to want to have my median finders be stable, and I'm not sure if thats going to introduce extra comparisons yet. Again, thats further down the road.


Working on median of five algorithm
[info]dezorithms
I'm taking a detour from insertion sort now to work on my median of five algorithm. I've seen the ford-johnson style median of five selection algorithm that does it in exactly 6 comparisons and I think I can do it in slightly less average comparisons. Even if I can only get it to an average of 6, it should be more adaptive to data thats likely to be present in real world data, and thus more useful for quicksort and quickselect. I also think I might be able to generalize it and make quickselect faster.

The way I'm doing it now I need to make custom selects for least of three and second of four. If this works I'll see if I can do median of seven based on this and generalize it. If not, I'll try to at least go to median of nine for quicksort.

Making progress on insertion sort
[info]dezorithms
I added the optmization that removes the extra check in insertion sort successfully after running into some problems with it and am moving on to more loop unrolling. Removing the extra check the way its coded right now makes it very slightly more expensive in terms of comparisons.

I did find out how to do a pseudo-binary insertion sort without any divide or bitshift by unrolling the loop. I added foursort for smart insertion sort but haven't decided weather it's worth the effort. I'll probably find out when I actually finish the timing test class, but thats a project for a couple weeks down the road.

The binary insertion sort should make it significantly faster, even for shell sort, and I'm fairly proud of that.

In several days I'll try to get the basics of this done, make a check in and then I'm going to figure out how to set up other build environments. I want to use a gcc environment either on my linux VM, cygwin, or minwin... probably try to set up all three, so I can use gcov for algorithm analysis going forward.

Home