The merge operations of section S work within memory, which may not be large enough to hold a file of interest. A common way to handle large files is to write them in segments of sorted values, called runs. Pairs of runs are merged to form larger runs--the essential idea of MergeSort. The access to external storage is expensive, however, and so larger runs and fewer phases increase efficiency.
One way to form the initial runs for a file is to simply make one pass through the file, writing values found to be in order onto a run file and the others onto a reject file for a second pass. Consider the integer file:
2 1 6 3 8 9 4 4 10 5 2 8
The first pass would produce the following run and reject:
run 2 6 8 9 10
reject 1 3 4 4 5 2 8
The straightforward version of this scheme, acting on a source file, whose termination can be recognized by the flag EOF(Source), is:
Read(Source, Cutoff)
while NOT EOF(Source) do
Read(Source,x)
if x Cutoff
then Write(RunFile,x)
Cutoff x
else Write(Reject,x)
endif
endwhile
Unfortunately, if the largest item is first, the run will be only one item long. A second approach is to input M items, sort them internally, and write them out as a run. It is this value M that can be stretched with the help of a priority queue, specifically a heap, although other forms of priority queues would serve.
Suppose the items from Source are tagged with a run number R, and placed in a heap, PQ, with room for M elements. PQ is taken to be an array of records of the form:
RECORD
Run : INTEGER
Key : {comparable type
END
The heap is maintained in lexicographic order:
PQ[i] < PQ[j] iff:
PQ[i].Run < PQ[j].Run, or
PQ[i].Run = PQ[j].Run and PQ[i].Key < PQ[j].Key
For consistency with section S, the highest priority in the heap is the smallest value. The heap is formed with:
procedure BuildPQ
for i = 1 to M do
Read(Source,x)
PQ[i].Run 1
PQ[i].Key x
UpHeap(i)
next i
HN m
end {BuildPQ
The stretched run is then created with:
procedure PQruns
BuildPQ
R 1
while NOT Empty(PQ) do
Top PQ[1]
if Top.Run = R
then Write(RunFile(R),Top.Key)
Swap(PQ[1],PQ[HN])
HN HN - 1
DownHeap(1)
else R R + 1
endif
if NOT EOF(Source)
then Read(Source,x)
if x Top.Key
then xRun R
else xRun R + 1
endif
HN HN + 1
PQ[HN].Run xRun
PQ[HN].Key x
UpHeap(HN)
endif
endwhile
end {PQruns
It is possible to derive additional enhancement by replacing R R + 1 by a procedure DumpPQ that stores the queue on external storage so that the pass through the source file can continue. The dumps are retrieved for the next run.
Program for trying it yourself
PGT.1 Implement PQruns.