一个强大的排序工具

有人want to sort a 20GB binary file that contains 30-byte key and 20 byte value placed in a contiguous manner,
stackexchange上的最高赞回答是hexdump, sort再xxd转回来〔链接〕。

有人指出这样需要越70G的临时文件。Techincally, “越”不一定错,“约”更严谨。

  • Unix下可以cat file | sort -u
  • uniq - report or omit repeated lines

在下要对字的bigram排序,比如 三亚<浮点数>,key固定为6字节(UTF-8),value固定为4或8字节,记录长度为10或12.

An efficient C++ multi-threaded sorting function〕这不还得自己写程序么,而且去测他的程序有无bug?

有个命令行工具bmsort就好了啊。binary文件由定长的记录构成。-rl指定 record len,

  • -kb和-ke指定key的位置 [begin, end),-j指定线程数,mmap进内存,多线程分路排序后归并。
  • bmsort is not for sorting female pubic hair.

However,

  • 分路归并需要等大缓冲区。NVME大时代,mmap和就地排序都可以不要
  • 我手潮;我也不想测试我的程序

GNU sort - sort lines of text files

  • --parallel=N change the number of sorts run concurrently to N
    • 调了pthread_create
  • -t, --field-separator=SEP use SEP instead of non-blank to blank transition
  • -k, --key=KEYDEF sort via a key; KEYDEF gives location and type

它是以line为记录的。

apt source coreutils; configure; make 刷刷地出来了。

在src/sort.c里:

/* Fill BUF reading from FP, moving buf->left bytes from the end
   of buf->buf to the beginning first.  If EOF is reached and the
   file wasn't terminated by a newline, supply one.  Set up BUF's line                                                  
   table too.  FILE is the name of the file corresponding to FP.    
   Return true if some input was read.  */
                                                                                                                        
static bool                                                                                                             
fillbuf (struct buffer *buf, FILE *fp, char const *file)

可惜不能fgets换fread. 如果能把这个函数改对…后面的代码是久经考验的。

$ lc 2
2874758 2

$ time sort -k1,1 --parallel=4 2 >2.sorted
real	0m3.040s
user	0m8.825s
sys	0m0.084s

$ $ head -n 3 2*
==> 2 <==
8d858d85 -2.6364934
9ad88d85 -2.9231086
51b28d85 -3.881363

==> 2.sorted <==
4e004e00 -2.2267218
4e004e01 -3.4738135
4e004e03 -2.4385815

WHAT?!!

$ time sort -k1,1  2 >2.sorted.2
real	0m2.982s
user	0m8.878s
sys	0m0.093s

负优化后,我明白了(sort自动多核啊):

$ time sort -k1,1 --parallel=1 2 >2.sorted.3
real	0m7.186s
user	0m7.097s
sys	0m0.088s

 

posted @ 2025-11-11 19:32  华容道专家  阅读(9)  评论(0)    收藏  举报