一个强大的排序工具
有人want to sort a 20GB binary file that contains 30-byte key and 20 byte value placed in a contiguous manner,
stackexchange上的最高赞回答是hexdump, sort再xxd转回来〔链接〕。
有人指出这样需要越70G的临时文件。Techincally, “越”不一定错,“约”更严谨。
- Unix下可以cat file | sort -u
- uniq - report or omit repeated lines
在下要对字的bigram排序,比如 三亚<浮点数>,key固定为6字节(UTF-8),value固定为4或8字节,记录长度为10或12.
〔An efficient C++ multi-threaded sorting function〕这不还得自己写程序么,而且去测他的程序有无bug?
有个命令行工具bmsort就好了啊。binary文件由定长的记录构成。-rl指定 record len,
- -kb和-ke指定key的位置 [begin, end),-j指定线程数,mmap进内存,多线程分路排序后归并。
- bmsort is not for sorting female pubic hair.
However,
- 分路归并需要等大缓冲区。NVME大时代,mmap和就地排序都可以不要
- 我手潮;我也不想测试我的程序
GNU sort - sort lines of text files
- --parallel=N change the number of sorts run concurrently to N
- 调了pthread_create
- -t, --field-separator=SEP use SEP instead of non-blank to blank transition
- -k, --key=KEYDEF sort via a key; KEYDEF gives location and type
它是以line为记录的。
apt source coreutils; configure; make 刷刷地出来了。
在src/sort.c里:
/* Fill BUF reading from FP, moving buf->left bytes from the end of buf->buf to the beginning first. If EOF is reached and the file wasn't terminated by a newline, supply one. Set up BUF's line table too. FILE is the name of the file corresponding to FP. Return true if some input was read. */ static bool fillbuf (struct buffer *buf, FILE *fp, char const *file)
可惜不能fgets换fread. 如果能把这个函数改对…后面的代码是久经考验的。
$ lc 2 2874758 2 $ time sort -k1,1 --parallel=4 2 >2.sorted real 0m3.040s user 0m8.825s sys 0m0.084s $ $ head -n 3 2* ==> 2 <== 8d858d85 -2.6364934 9ad88d85 -2.9231086 51b28d85 -3.881363 ==> 2.sorted <== 4e004e00 -2.2267218 4e004e01 -3.4738135 4e004e03 -2.4385815
WHAT?!!
$ time sort -k1,1 2 >2.sorted.2 real 0m2.982s user 0m8.878s sys 0m0.093s
负优化后,我明白了(sort自动多核啊):
$ time sort -k1,1 --parallel=1 2 >2.sorted.3 real 0m7.186s user 0m7.097s sys 0m0.088s

浙公网安备 33010602011771号