Important: submitting your corpus


I would like your corpus, or some subset thereof, in order to test your homework #2 and future homeworks this semester. This is potentially problematic because of the sizes involved. So please follow these instructions:

  1. Find out how big your corpus is. On Linux, you can type “ls -lh nameOfYourCorpusFile” and look near the middle of the line for the size of the file. It should end in a “K” (for kilobytes), an “M” (for megabytes), or a “G” (for gigabytes). For example, the Bob Dylan lyrics corpus is 930 KB:
    $ ls -lh dylan.txt
    -rw-r--r-- 1 stephen stephen 930K Sep 26 10:38 dylan.txt


    On Mac or Windows, you can usually right-click the file and choose Get Info (Mac) or Properties (Windows) to see its size. Google if you need to.

  2. If your corpus is less than 10 MB, please email it to me as an attachment with subject line “DATA 470 corpus turnin”.
  3. If your corpus is between 10 MB and 100 MB, please send me an email with subject line “DATA 470 github repo request”. In the body of the email, include your GitHub username. (If you don’t have a GitHub account, create one. It is free.) I will then add you to a special repo and give you further instructions for uploading your corpus to it.
  4. If your corpus is over 100 MB, please create a copy of it with just the first 100 MB. You can do this on Linux with the command:
    $ head -c 100M nameOfYourCorpusFile > stephensShorterCorpus
    

    This will create a new file called stephensShorterCorpus with only the first 100 MB of your corpus. (If you’re on Windows or Mac, google how to do a similar operation.) Then, follow the instructions in step 3.
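
If you would rather not google it, here is a minimal Python sketch of the size check from step 1 and the 100 MB cut from step 4. It should behave the same on Linux, Mac, and Windows. It is an illustration only, not part of the official instructions: the function names (corpus_size_bytes, which_step, truncate_copy) are made up for this example, and it assumes 1 MB means 1024 × 1024 bytes, which is how GNU “head -c 100M” interprets the “M” suffix.

```python
import os

# Assumption: 1 MB = 1024 * 1024 bytes, matching the "M" suffix of GNU head.
MB = 1024 * 1024


def corpus_size_bytes(path):
    """Return the size of the file at `path` in bytes."""
    return os.path.getsize(path)


def which_step(size_bytes):
    """Map a corpus size to the step number (2, 3, or 4) in the list above."""
    if size_bytes < 10 * MB:
        return 2  # small enough to email as an attachment
    if size_bytes <= 100 * MB:
        return 3  # goes in the GitHub repo as-is
    return 4      # make a 100 MB copy first, then follow step 3


def truncate_copy(src, dst, limit=100 * MB, chunk=MB):
    """Copy at most the first `limit` bytes of `src` into `dst`.

    Reads in chunks so a multi-gigabyte corpus is never loaded into
    memory all at once. Returns the number of bytes written.
    """
    written = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while written < limit:
            data = fin.read(min(chunk, limit - written))
            if not data:  # reached the end of a file shorter than `limit`
                break
            fout.write(data)
            written += len(data)
    return written
```

For example, which_step(corpus_size_bytes("nameOfYourCorpusFile")) tells you which step applies, and truncate_copy("nameOfYourCorpusFile", "stephensShorterCorpus") writes the shortened copy. Like “head -c”, this cuts at a byte boundary, so the very last character of a UTF-8 file may end up split.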

You’ll get +5XP if you do this before Sunday, Sept 28 at midnight, or 0XP if you do it after that.


DATA 470D3 – Natural Language Processing

stephendavies.org