

  • Subject: Re: Sorting large file (Why?)
  • From: "James W. Kilgore" <eMail@xxxxxxxxxxxxxxxxxxx>
  • Date: Wed, 04 Oct 2000 20:31:39 -0700
  • Organization: Progressive Data Systems, Inc.

Leif,

The S/34 had an inquiry utility that worked on the same principle,
and I use that approach for keyword-in-content searching of
multi-million-record files.

You construct an indexed file keyed on the word and the record number, then
process the index to point you to likely suspects in the text file.  You
only have to build the index once; add a trigger to the file in case
someone edits or adds a line.  We build the file with a limit of 8
characters per word and simply truncate longer words.  This does mean that
an index build for the words abcdefgh and abcdefghi creates the same key,
so you limit the size of the index file but must add extra logic when
interrogating the suspects.
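
A minimal sketch of the idea in Python, not the actual S/34 utility or our
production code.  The names (build_index, search, KEY_LEN) are made up for
illustration, and the case folding and the definition of a "word" as a run
of letters and digits are assumptions.  It keeps the index in memory; on the
real system this would be a keyed file maintained by the trigger.

```python
import re
from collections import defaultdict

KEY_LEN = 8  # index keys are truncated to 8 characters, as described above
WORD_RE = re.compile(r"[A-Za-z0-9]+")  # assumed definition of a "word"


def index_key(word):
    """Fold case (an assumption) and truncate the word to the key length."""
    return word.lower()[:KEY_LEN]


def build_index(lines):
    """Map each truncated word to the set of record numbers containing it."""
    index = defaultdict(set)
    for recno, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line):
            index[index_key(word)].add(recno)
    return index


def search(index, lines, word):
    """Look up the suspects, then re-check each one against the full word."""
    suspects = index.get(index_key(word), set())
    wanted = word.lower()
    hits = []
    for recno in sorted(suspects):
        # The extra logic: truncated keys collide, so verify the real text.
        words = [w.lower() for w in WORD_RE.findall(lines[recno - 1])]
        if wanted in words:
            hits.append(recno)
    return hits
```

With this sketch, abcdefgh and abcdefghi share one key, so both records come
back as suspects and the re-check in search() separates them.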

Leif Svalgaard wrote:
> 
> Thanks everybody for the variety of suggestions.
> Just in case you wonder where the file came from,
> the real problem was this:
> 
> Imagine you have a large text (say all the programs
> of a large application) of about 1,000,000 lines.
> 
> You want to make a KWIC index. In case you have never
> heard of a KWIC index, it is constructed in this way:
> 
> For each line in the text:
>     shift the line left or right until the first word starts in (say)
>     position 40, output the line, shift again until the second word
>     is in position 40, output, etc until all words on the line
>     have been treated.
> 
> The output file will contain about 10 times as many records
> as the input, assuming about 10 words per line.
> 
> Sort the output on position 40. Now you have a KWIC
> index.
> 
> There are other ways of doing this. The above is very
> simple and easy to get to work if you have a fast sort.
> 
> An improvement is to look each word up in a little dictionary
> of common words like "the", "and", etc. and not output them,
> but that is not the issue.
> 
> I had some discussion with somebody on the running time
> of this on various platforms. It is clear that the sort dominates.
> 
> +---
> | This is the Midrange System Mailing List!
> | To submit a new message, send your mail to MIDRANGE-L@midrange.com.
> | To subscribe to this list send email to MIDRANGE-L-SUB@midrange.com.
> | To unsubscribe from this list send email to MIDRANGE-L-UNSUB@midrange.com.
> | Questions should be directed to the list owner/operator: david@midrange.com
> +---
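
The KWIC construction Leif describes in the quoted message can be sketched
in Python roughly as follows.  This is an illustration under assumptions,
not his program: the function names and the stop-word list are hypothetical,
words are taken to be runs of non-blank characters, the sort key is the
case-folded text from position 40 onward, and everything is held in memory.
At 1,000,000 input lines the sort step would normally be handed to an
external sort utility.

```python
import re

KEY_POS = 40  # 1-based column where each keyword should start
STOP_WORDS = {"the", "and", "a", "of", "to"}  # illustrative "little dictionary"


def kwic_records(lines, key_pos=KEY_POS, stop_words=STOP_WORDS):
    """Emit one shifted copy of the line per word, keyword at key_pos."""
    records = []
    for line in lines:
        for match in re.finditer(r"\S+", line):
            if match.group().lower() in stop_words:
                continue  # common words are not output
            shift = (key_pos - 1) - match.start()
            if shift >= 0:
                shifted = " " * shift + line  # shift right by padding
            else:
                shifted = line[-shift:]  # shift left; text before column 1 is lost
            records.append(shifted)
    return records


def kwic_index(lines, key_pos=KEY_POS, stop_words=STOP_WORDS):
    """Sort the shifted records on the text starting at key_pos."""
    records = kwic_records(lines, key_pos, stop_words)
    return sorted(records, key=lambda r: r[key_pos - 1:].lower())
```

The output has roughly one record per word of input, which is why the sort
dominates the running time.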


This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
