

  • Subject: Re: Sorting large file (Why?)
  • From: "Leif Svalgaard" <leif@xxxxxxxx>
  • Date: Wed, 4 Oct 2000 20:06:38 -0500

Thanks everybody for the variety of suggestions.
Just in case you wonder where the file came from,
the real problem was this:

Imagine you have a large text (say all the programs
of a large application) of about 1,000,000 lines.

You want to make a KWIC index. In case you have never
heard of a KWIC index, it is constructed in this way:

For each line in the text:
    shift the line left or right until the first word starts in (say)
    position 40, output the line, shift again until the second word
    starts in position 40, output, and so on until all words on the
    line have been treated.

The output file will contain about 10 times as many records
as the input, assuming about 10 words per line.

Sort the output on position 40. Now you have a KWIC
index.
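
The steps above can be sketched in a few lines of Python. This is only
an illustration, not the program I actually ran: the names kwic_records
and kwic_index are made up for the example, a "word" is assumed to be
any run of non-blank characters, the sort is assumed to be
case-insensitive, and the left context is simply cut off when it does
not fit in front of position 40.

```python
import re

KEY_COLUMN = 40  # 1-based column where each keyword should start (assumed > 1)

def kwic_records(line, key_column=KEY_COLUMN):
    """Yield one shifted copy of `line` per word, keyword at key_column."""
    width = key_column - 1  # number of characters available for left context
    for match in re.finditer(r"\S+", line):
        start = match.start()
        # Keep at most `width` characters of what precedes the keyword,
        # then pad on the left so the keyword lands in the key column.
        left = line[max(0, start - width):start]
        yield left.rjust(width) + line[start:]

def kwic_index(lines, key_column=KEY_COLUMN):
    """Build all shifted records, then sort them on the key column."""
    records = [rec
               for line in lines
               for rec in kwic_records(line.rstrip("\n"), key_column)]
    # Sort on everything from the key column onward (case-insensitive here).
    records.sort(key=lambda rec: rec[key_column - 1:].lower())
    return records

if __name__ == "__main__":
    for record in kwic_index(["the quick fox", "a lazy dog"]):
        print(record)
```

For a 1,000,000 line input the list built here would hold roughly
10,000,000 records, so in practice the records would be written to a
file and handed to an external sort rather than sorted in memory.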

There are other ways of doing this. The above is very
simple and easy to get to work if you have a fast sort.

An improvement is to look each word up in a little dictionary
of common words like "the", "and", etc. and not output them,
but that is not the issue.
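
For completeness, a minimal sketch of that lookup in Python. The word
list and the name indexable_words are hypothetical, chosen only for the
example; a real list of common words would be longer and would depend on
the text being indexed.

```python
import re
import string

# Hypothetical sample of common words that should not get an index entry.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

def indexable_words(line, stop_words=STOP_WORDS):
    """Return (offset, word) pairs for the words worth indexing."""
    result = []
    for match in re.finditer(r"\S+", line):
        # Compare without surrounding punctuation and without regard to case.
        bare = match.group().strip(string.punctuation).lower()
        if bare not in stop_words:
            result.append((match.start(), match.group()))
    return result

if __name__ == "__main__":
    print(indexable_words("the cat and the hat"))
```

Only the offsets returned here would then be shifted to position 40,
which cuts the number of records going into the sort.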

I had some discussion with somebody on the running time
of this on various platforms. It is clear that the sort dominates.




