file - Splitting a 7 million row CSV by a specific column

2014-07
  • BrandonMXB

    How would I go about splitting a very large (7 million row) CSV file into several different sheets/files by the value of a specific numerical column? It should split into about 10 different files.

  • Answers
  • Cristian Ciupitu

    Use this Python 3 program:

    #!/usr/bin/env python3
    import binascii  # needed only by the hex-encoding fallback mentioned at the bottom
    import csv
    import os.path
    import sys
    from tkinter.filedialog import askopenfilename, askdirectory
    from tkinter.simpledialog import askinteger
    
    def split_csv_file(f, dst_dir, keyfunc):
        """Split the open CSV file f into one file per key under dst_dir."""
        csv_reader = csv.reader(f)
        csv_writers = {}
        open_files = []
        try:
            for row in csv_reader:
                # keyfunc maps a row to the name of its output file
                k = keyfunc(row)
                if k not in csv_writers:
                    out = open(os.path.join(dst_dir, k), mode='w', newline='')
                    open_files.append(out)
                    csv_writers[k] = csv.writer(out)
                csv_writers[k].writerow(row)
        finally:
            # close (and flush) every output file
            for out in open_files:
                out.close()
    
    def get_args_from_cli():
        input_filename = sys.argv[1]
        column = int(sys.argv[2])
        dst_dir = sys.argv[3]
        return (input_filename, column, dst_dir)
    
    def get_args_from_gui():
        input_filename = askopenfilename(
            filetypes=(('CSV', '.csv'),),
            title='Select CSV Input File')
        column = askinteger('Choose Table Column', 'Table column')
        dst_dir = askdirectory(title='Select Destination Directory')
        return (input_filename, column, dst_dir)
    
    if __name__ == '__main__':
        if len(sys.argv) == 1:
            input_filename, column, dst_dir = get_args_from_gui()
        elif len(sys.argv) == 4:
            input_filename, column, dst_dir = get_args_from_cli()
        else:
            raise Exception("Invalid number of arguments")
        with open(input_filename, mode='r', newline='') as f:
            split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
            # if the column has funky values resulting in invalid filenames
            # replace the line from above with:
            # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')
    

    Save it as split-csv.py and run it from Explorer or from the command line.

    For example, to split su.csv based on column 1 and write the output files under dstdir, use:

    python split-csv.py su.csv 1 dstdir
    

    If you run it without arguments, a Tkinter-based GUI will prompt you for the input file, the column (1-based index), and the destination directory.
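
    The hex-encoding fallback mentioned in the final comment turns any column value into a safe filename. A quick illustration of what it produces (the value 'foo/bar' is hypothetical):

    import binascii
    
    # 'foo/bar' is not a valid filename component on most systems;
    # hex-encoding it yields a safe, reversible name
    key = binascii.b2a_hex('foo/bar'.encode('utf-8')).decode('utf-8') + '.csv'
    print(key)  # prints: 666f6f2f626172.csv

    Note that the program keeps one output file open per distinct key, which is fine for the ~10 files expected here, but a high-cardinality column could bump into the operating system's open-file limit.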

  • Cristian Ciupitu

    This could be as simple as this one-liner with awk:

    awk -F ',' '{ print > ("split-" $1 ".csv") }' 7mil.csv
    
    • The input file here is 7mil.csv
    • The deciding column number is indicated with the dollar sign. If it was the third column, it would be $3 instead of $1
    • The column value is used to generate the resulting file name. So for example, every line with the value 42 will be in a file named split-42.csv
    • The field separator is the comma
      • which works because the value is numeric and has no quotes that need to be stripped
      • but it also requires that there are no commas inside any quoted strings in the file (at least not before the numeric column); see the example after this list
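
    For instance, a hypothetical row like the one below would defeat the naive comma split: with -F ',' awk sees the quoted name as two fields, so every field after it shifts by one.

    "Doe, John",42,some other value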

    So this just reads every line and prints it to the file that corresponds to the value. Note that it appends to each file, so if you run it twice, all the data will be duplicated; make sure there are no files matching that naming pattern before you start: del split-*.csv

    The hard part of trying this is installing awk on Windows. There's gawk for Windows, plus a few tips for running it, here.


  • Related Question

    text editing - Application to open / edit a very large CSV file (500 MB, 4 million records)?
  • Giorgi

    Possible Duplicates:
    Text Editor for very big file - Windows
    What editor/viewer to use to inspect large text based files?

    I have a CSV file which has about 4 million rows and is about 500 MB in size. Can you recommend any editor that can open the file without making the system crawl? I tried EmEditor but it is complaining that there are too many characters in a single line.


  • Related Answers
  • Herbert Sitz

    Vim should work pretty well if you change a couple settings to turn off swap space and undo functionality: http://vim.wikia.com/wiki/VimTip611
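
    The relevant settings amount to something like this (a minimal sketch; the linked tip covers the details):

    " turn off the swap file, undo history, and syntax highlighting,
    " which are the main costs when loading a huge file
    :set noswapfile
    :set undolevels=-1
    :syntax off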

  • Andy Lester

    Define "edit". How do you want to edit the file? Are you just viewing it, or do you want to make changes? Are they changes that you'd be making programmatically? If so, you may want to look at using awk, sed or Perl to do it.

    As to "it is complaining that there are too many characters in a single line," the file probably has line endings the editor can't handle. It would help if you'd tell us what platform you're on.
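
    One quick way to check (a sketch; big.csv stands in for the real filename) is to count the usual line-ending bytes in a sample of the file:

    # read a raw sample and count each line-ending style
    with open('big.csv', 'rb') as f:
        sample = f.read(1 << 20)
    crlf = sample.count(b'\r\n')
    print('CRLF:', crlf)
    print('LF:  ', sample.count(b'\n') - crlf)
    print('CR:  ', sample.count(b'\r') - crlf)

    If all three counts come back zero, the file uses some other record separator entirely, which is what turned out to be the case here (see the update at the end).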

  • Chris_K

    I've used JujuEdit (free) to work with 2GB CSVs in the past. Handles 'em pretty well. What are you doing with the contents? Another option might be to create an MS Access linked table pointing at the file if you need to do some querying/sorting/filtering.

  • afrazier

    You could give CSVed a try. It claims to load parts of the file on demand, so it might work for you.

  • Blackbeagle

    I believe that UltraEdit probably could. I've used it on superhuge Gig+ files and it didn't crash out, but really, locating whatever data you are looking for is likely to be a pain. Can you do whatever you need to do programmatically? Since you said CSV, I assume it is text data, so maybe Perl would be a good choice due to its string handling and decent regex support.

  • Mike Fitzpatrick

    Have you tried Excel? It handles CSV files.

    I'm not sure of Excel 2007's file size limit, but 64-bit Excel 2010 now supports spreadsheets up to 4 GB in size.

  • Giorgi

    I tested all the editors suggested there and the only one that managed to open it (in a reasonable time) was UltraEdit.


    Update

    It turned out that the file did not have any line breaks in it (it used some other character for them), which is why it was so difficult to open. I wrote a program that replaced all those characters with line breaks, and EmEditor then opened it in seconds.
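
    Such a replacement can be streamed so the whole 500 MB file never has to sit in memory. A minimal sketch in Python, assuming (hypothetically) that the odd separator is a single NUL byte; the filenames are placeholders:

    CHUNK = 1 << 20  # process the file 1 MiB at a time

    with open('big.csv', 'rb') as src, open('big-fixed.csv', 'wb') as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            # a single-byte separator can't straddle a chunk boundary,
            # so a plain replace per chunk is safe
            dst.write(chunk.replace(b'\x00', b'\r\n'))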