{"id":380,"date":"2025-09-26T10:52:05","date_gmt":"2025-09-26T14:52:05","guid":{"rendered":"http:\/\/stephendavies.org\/nlp\/?p=380"},"modified":"2025-09-26T10:52:29","modified_gmt":"2025-09-26T14:52:29","slug":"important-submitting-your-corpus","status":"publish","type":"post","link":"http:\/\/stephendavies.org\/nlp\/index.php\/2025\/09\/26\/important-submitting-your-corpus\/","title":{"rendered":"Important: submitting your corpus"},"content":{"rendered":"<style type=\"text\/css\">\n.smallpts { font-weight:bold; color:darkred; }\nli { margin-bottom:30px; }\npre, tt { font-size:medium; }\n<\/style>\n<p>I would like your corpus, or some subset thereof, in order to test your homework #2 and future homeworks this semester. This is potentially problematic because of the sizes involved. So please follow these instructions:<\/p>\n<ol>\n<li>Find out how big your corpus is. (On Linux, you can type &#8220;<tt>ls -lh nameOfYourCorpusFile<\/tt>&#8221; and look near the middle of the line for the size of the file. It should end in a &#8220;<tt>K<\/tt>&#8221; (for kilobytes), an &#8220;<tt>M<\/tt>&#8221; (for megabytes), or a &#8220;<tt>G<\/tt>&#8221; (for gigabytes). For example, the Bob Dylan lyrics corpus is 930 KB:\n<pre>\r\n$ ls -lh dylan.txt\r\n-rw-r--r-- 1 stephen stephen 930K Sep 26 10:38 dylan.txt\r\n<\/pre>\n<p>On Mac or Windows there&#8217;s probably some way to right-click\/properties and see the file&#8217;s size. Google if you need to.)<\/li>\n<li>If your corpus is less than 10 MB, please email it to me as an attachment with subject line &#8220;<tt>DATA 470 corpus turnin<\/tt>&#8221;.<\/li>\n<li>If your corpus is between 10 MB and 100 MB, please send me an email with subject line &#8220;<tt>DATA 470 github repo request<\/tt>&#8221;. In the body of the email, include your GitHub username. (If you don&#8217;t have a GitHub account, create one. It is free.) 
I will then add you to a special repo and give you further instructions for uploading your corpus to it.<\/li>\n<li>If your corpus is over 100 MB, please create a copy of it with just the first 100 MB. You can do this in Linux with the command:\n<pre>\r\n$ head -c 100M nameOfYourCorpusFile > stephensShorterCorpus\r\n<\/pre>\n<p>This will create a new file called <tt>stephensShorterCorpus<\/tt> with only the first 100 MB of your corpus. (If you&#8217;re on Windows or Mac, google how to do a similar operation.) Then, follow the instructions in step 3.<\/li>\n<\/ol>\n<p>You&#8217;ll get <span class=\"smallpts\">+5XP<\/span> if you do this before Sunday, Sept 28 at midnight, or <span class=\"smallpts\">0XP<\/span> if you do it after that.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I would like your corpus, or some subset thereof, in order to test your homework #2 and future homeworks this semester. This is potentially problematic because of the sizes involved. So please follow these instructions: Find out how big your corpus is. 
(On Linux, you can type &#8220;ls -lh nameOfYourCorpusFile&#8221; and look near the middle [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[1],"tags":[],"class_list":["post-380","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/posts\/380","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/comments?post=380"}],"version-history":[{"count":7,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/posts\/380\/revisions"}],"predecessor-version":[{"id":387,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/posts\/380\/revisions\/387"}],"wp:attachment":[{"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/media?parent=380"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/categories?post=380"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/stephendavies.org\/nlp\/index.php\/wp-json\/wp\/v2\/tags?post=380"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}