<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.3.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Using Python to detect the most frequent words in a file</title>
	<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/</link>
	<description>By Antonio Cangiano, Software Engineer &#38; Technical Evangelist at IBM</description>
	<pubDate>Sat, 17 May 2008 10:43:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.3</generator>
		<item>
		<title>By: david</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2874</link>
		<dc:creator>david</dc:creator>
		<pubDate>Fri, 28 Mar 2008 18:14:30 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2874</guid>
		<description>count = {}; open("somefile").each_line { &#124;line&#124; line.split(/\b/).each { &#124;word&#124; count[word] &#124;&#124;= 0; count[word] += 1 } }</description>
		<content:encoded><![CDATA[<p>count = {}; open("somefile").each_line { |line| line.split(/\b/).each { |word| count[word] ||= 0; count[word] += 1 } }</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2838</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Wed, 26 Mar 2008 03:00:19 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2838</guid>
		<description>Somehow, my sort line above got mangled. This one's an improvement:

top_words = count.sort_by { &#124;w&#124; w[1] }</description>
		<content:encoded><![CDATA[<p>Somehow, my sort line above got mangled. This one&#8217;s an improvement:</p>
<p>top_words = count.sort_by { |w| w[1] }</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Chang</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2812</link>
		<dc:creator>William Chang</dc:creator>
		<pubDate>Mon, 24 Mar 2008 07:03:28 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2812</guid>
		<description>This of course depends on what you are counting words for, but I would recommend translating all non-letters to spaces and then splitting on whitespace. For the natural language tasks that I do, this is pretty appropriate. It really depends on what you want to happen when you hit stuff like "Bob's", "hyper-active", "http://www.google.com", "bob@gmail.com", "2342sdf", etc... I also like putting the rule that I use to split into words into its own function, which I call tokenize() here.

import re
from collections import defaultdict

N = 10
words = {}

def tokenize(line):
    line = re.sub(r"[^a-z]", " ", line.lower())
    return line.split()

words = defaultdict(int)
for line in open("test.txt"):
   for token in tokenize(line):
       words[token] +=1

top_words = sorted(words.iteritems(),
                   key=lambda(word, count): (-count, word))[:N] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)</description>
		<content:encoded><![CDATA[<p>This of course depends on what you are counting words for, but I would recommend translating all non-letters to spaces and then splitting on whitespace. For the natural language tasks that I do, this is pretty appropriate. It really depends on what you want to happen when you hit stuff like &#8220;Bob&#8217;s&#8221;, &#8220;hyper-active&#8221;, &#8220;http://www.google.com&#8221;, &#8220;bob@gmail.com&#8221;, &#8220;2342sdf&#8221;, etc&#8230; I also like putting the rule that I use to split into words into its own function, which I call tokenize() here.</p>
<p>import re<br />
from collections import defaultdict</p>
<p>N = 10<br />
words = {}</p>
<p>def tokenize(line):<br />
    line = re.sub(r"[^a-z]", " ", line.lower())<br />
    return line.split()</p>
<p>words = defaultdict(int)<br />
for line in open("test.txt"):<br />
   for token in tokenize(line):<br />
       words[token] +=1</p>
<p>top_words = sorted(words.iteritems(),<br />
                   key=lambda(word, count): (-count, word))[:N] </p>
<p>for word, frequency in top_words:<br />
    print "%s: %d" % (word, frequency)</p>
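<p>To make the tokenizing rule concrete, here is a minimal sketch of what it does to the awkward inputs listed above. It assumes Python 3 (so it uses print() and Counter rather than the Python 2 idioms in the comment), and the sample sentence is made up for illustration.</p>

```python
import re
from collections import Counter

def tokenize(line):
    # Lowercase, turn every character outside a-z into a space, then split.
    return re.sub(r"[^a-z]", " ", line.lower()).split()

# The tricky inputs mentioned in the comment.
samples = ["Bob's", "hyper-active", "http://www.google.com",
           "bob@gmail.com", "2342sdf"]
for text in samples:
    print(text, "->", tokenize(text))

# Hypothetical sample sentence, counted with the same rule.
counts = Counter(tokenize("Bob's dog saw Bob. Bob's dog ran."))
# Sort by descending count, breaking ties alphabetically.
top_words = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_words)
```

<p>With this rule the possessive in "Bob's" becomes a separate token "s", and digits disappear entirely, which may or may not be what a given task wants.</p>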
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2795</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Sun, 23 Mar 2008 01:41:48 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2795</guid>
		<description>I’m learning Ruby, so I thought I’d put together a Ruby version:

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Hash&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="no"&gt;File&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;the_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;each_line&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;&#124;&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;&#124;&lt;/span&gt;
  &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;downcase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/\w+/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;&#124;&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;&#124;&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;top_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&#124;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;&#124;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;top_words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;&#124;&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;&#124;&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;&#34;%s: %d&#34;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</description>
		<content:encoded><![CDATA[<p>I’m learning Ruby, so I thought I’d put together a Ruby version:</p>
<div class="highlight">
<pre><span class="n">N</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">count</span> <span class="o">=</span> <span class="no">Hash</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="no">File</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">the_file</span><span class="p">)</span><span class="o">.</span><span class="n">each_line</span> <span class="k">do</span> <span class="o">|</span><span class="n">line</span><span class="o">|</span>
  <span class="n">line</span><span class="o">.</span><span class="n">downcase</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="sr">/\w+/</span><span class="p">)</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">word</span><span class="o">|</span>
    <span class="n">count</span><span class="o">[</span><span class="n">word</span><span class="o">]</span> <span class="o">+=</span> <span class="mi">1</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="n">top_words</span> <span class="o">=</span> <span class="n">count</span><span class="o">.</span><span class="n">sort</span><span class="p">{</span><span class="o">|</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="o">|</span> <span class="n">a</span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="n">b</span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="p">}</span>

<span class="n">top_words</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">top</span><span class="o">|</span>
  <span class="nb">puts</span> <span class="s2">&quot;%s: %d&quot;</span> <span class="o">%</span> <span class="n">top</span>
<span class="k">end</span>
</pre>
</div>
]]></content:encoded>
	</item>
	<item>
		<title>By: Antonio Cangiano</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2745</link>
		<dc:creator>Antonio Cangiano</dc:creator>
		<pubDate>Wed, 19 Mar 2008 12:34:56 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2745</guid>
		<description>Hi Paddy,

thanks for your comment. I've slightly changed the wording to point out that in the "good" solution at the end, we are using key rather than cmp. Using defaultdict would work (2.5 only) and also be more efficient. Here is the solution that incorporates your suggestion:

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;string&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;punctuation&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;words_gen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;punctuation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&#34;test.txt&#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                             &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words_gen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;top_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                   &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;))[:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frequency&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="s"&gt;&#34;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s"&gt;&#34;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</description>
		<content:encoded><![CDATA[<p>Hi Paddy,</p>
<p>thanks for your comment. I&#8217;ve slightly changed the wording to point out that in the &#8220;good&#8221; solution at the end, we are using key rather than cmp. Using defaultdict would work (2.5 only) and also be more efficient. Here is the solution that incorporates your suggestion:</p>
<div class="highlight">
<pre><span class="k">from</span> <span class="nn">string</span> <span class="k">import</span> <span class="n">punctuation</span>
<span class="k">from</span> <span class="nn">collections</span> <span class="k">import</span> <span class="n">defaultdict</span>

<span class="n">N</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">{}</span>

<span class="n">words_gen</span> <span class="o">=</span> <span class="p">(</span><span class="n">word</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="n">punctuation</span><span class="p">)</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="s">&quot;test.txt&quot;</span><span class="p">)</span>
                                             <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>

<span class="n">words</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words_gen</span><span class="p">:</span>
    <span class="n">words</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="o">+=</span><span class="mi">1</span>

<span class="n">top_words</span> <span class="o">=</span> <span class="n">sorted</span><span class="p">(</span><span class="n">words</span><span class="o">.</span><span class="n">iteritems</span><span class="p">(),</span>
                   <span class="n">key</span><span class="o">=</span><span class="k">lambda</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">count</span><span class="p">):</span> <span class="p">(</span><span class="o">-</span><span class="n">count</span><span class="p">,</span> <span class="n">word</span><span class="p">))[:</span><span class="n">N</span><span class="p">]</span> 

<span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">frequency</span> <span class="ow">in</span> <span class="n">top_words</span><span class="p">:</span>
    <span class="k">print</span> <span class="s">&quot;</span><span class="si">%s</span><span class="s">: </span><span class="si">%d</span><span class="s">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">frequency</span><span class="p">)</span>
</pre>
</div>
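<p>As a rough check that the two counting styles agree, the sketch below runs both on the same tokens. It assumes Python 3, where iteritems() and the tuple-unpacking lambda used above are no longer available, so the sort key indexes into each (word, count) pair instead. The function names and the sample sentence are hypothetical.</p>

```python
from collections import defaultdict

def count_with_get(tokens):
    # Plain dict: get() supplies 0 for words not seen yet.
    words = {}
    for word in tokens:
        words[word] = words.get(word, 0) + 1
    return words

def count_with_defaultdict(tokens):
    # defaultdict(int) creates a missing entry as 0 on first access.
    words = defaultdict(int)
    for word in tokens:
        words[word] += 1
    return words

tokens = "the cat and the hat and the bat".split()
same = count_with_get(tokens) == dict(count_with_defaultdict(tokens))
print(same)

# Descending by count, then alphabetical; keep the top 2.
top_words = sorted(count_with_defaultdict(tokens).items(),
                   key=lambda item: (-item[1], item[0]))[:2]
print(top_words)
```

<p>Both functions build the same mapping; which one is faster depends on the interpreter and the data, so it is worth timing on a real file before relying on either.</p>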
]]></content:encoded>
	</item>
	<item>
		<title>By: Paddy3118</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2744</link>
		<dc:creator>Paddy3118</dc:creator>
		<pubDate>Wed, 19 Mar 2008 05:54:47 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2744</guid>
		<description>Hi Antonio,
Just before the last code section, you introduce it as "getting rid of reverse=True", but you fail to mention that you also change from using cmp to using key. cmp is called for every comparison, whereas key is called once for each item in the list, which is usually faster.
&lt;br/&gt;
I also wonder if this code:
&lt;br/&gt;
&lt;pre&gt;
for word in words_gen:
    words[word] = words.get(word, 0) + 1
&lt;/pre&gt;
&lt;br/&gt;
Might be replaced by (untested):
&lt;br/&gt;
&lt;pre&gt;
words = defaultdict(int)
for word in words_gen:
    words[word] +=1
&lt;/pre&gt;
&lt;br/&gt;
Which would look up word in words only once?
&lt;br/&gt;
I enjoyed your post. 
&lt;br/&gt;
Thanks,  Paddy.</description>
		<content:encoded><![CDATA[<p>Hi Antonio,<br />
Just before the last code section, you introduce it as &#8220;getting rid of reverse=True&#8221;, but you fail to mention that you also change from using cmp to using key. cmp is called for every comparison, whereas key is called once for each item in the list, which is usually faster.<br />
<br />
I also wonder if this code:<br />
</p>
<pre>
for word in words_gen:
    words[word] = words.get(word, 0) + 1
</pre>
<p>
Might be replaced by (untested):<br />
</p>
<pre>
words = defaultdict(int)
for word in words_gen:
    words[word] +=1
</pre>
<p>
Which would look up word in words only once?<br />
<br />
I enjoyed your post.<br />
<br />
Thanks,  Paddy.</p>
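<p>The cmp versus key point can be illustrated by counting calls. This is only a sketch, assuming Python 3, where sorted() no longer accepts cmp and functools.cmp_to_key stands in for it. The item list and the helper names are made up for the example.</p>

```python
from functools import cmp_to_key

# Hypothetical (word, count) pairs.
items = [("the", 3), ("and", 2), ("cat", 1),
         ("hat", 1), ("bat", 1), ("of", 2)]
cmp_calls = 0
key_calls = 0

def by_count_cmp(a, b):
    # Old-style comparison: negative if a sorts first, positive if b does.
    global cmp_calls
    cmp_calls += 1
    return (b[1] > a[1]) - (a[1] > b[1])  # descending by count

def by_count_key(item):
    # Key function: computed once per item, then reused for comparisons.
    global key_calls
    key_calls += 1
    return -item[1]

sorted_cmp = sorted(items, key=cmp_to_key(by_count_cmp))
sorted_key = sorted(items, key=by_count_key)

print(sorted_cmp == sorted_key)
print("cmp calls:", cmp_calls, "key calls:", key_calls)
```

<p>The key function runs exactly once per item, while the comparison function runs once per comparison the sort performs, so the gap widens as the list grows.</p>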
]]></content:encoded>
	</item>
	<item>
		<title>By: Antonio Cangiano</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2739</link>
		<dc:creator>Antonio Cangiano</dc:creator>
		<pubDate>Tue, 18 Mar 2008 13:13:46 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2739</guid>
		<description>Nice one Tom. :)</description>
		<content:encoded><![CDATA[<p>Nice one Tom. <img src='http://antoniocangiano.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: tlrobinson.net / blog &#187; Blog Archive &#187; Using command line tools to detect the most frequent words in a file</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2738</link>
		<dc:creator>tlrobinson.net / blog &#187; Blog Archive &#187; Using command line tools to detect the most frequent words in a file</dc:creator>
		<pubDate>Tue, 18 Mar 2008 08:30:44 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2738</guid>
		<description>[...] Cangiano wrote a post about &#8220;Using Python to detect the most frequent words in a file&#8220;. It&#8217;s a nice summary of how to do it in Python, but (nearly) the same thing can be [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Cangiano wrote a post about &#8220;Using Python to detect the most frequent words in a file&#8220;. It&#8217;s a nice summary of how to do it in Python, but (nearly) the same thing can be [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Robinson</title>
		<link>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2737</link>
		<dc:creator>Tom Robinson</dc:creator>
		<pubDate>Tue, 18 Mar 2008 07:37:58 +0000</pubDate>
		<guid>http://antoniocangiano.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/#comment-2737</guid>
		<description>Of course, this is exactly what command line filter programs are good at...

cat test.txt &#124; tr -s '[:space:]' '\n' &#124; tr '[:upper:]' '[:lower:]' &#124; sort &#124; uniq -c &#124; sort -n &#124; tail -10</description>
		<content:encoded><![CDATA[<p>Of course, this is exactly what command line filter programs are good at&#8230;</p>
<p>cat test.txt | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | tail -10</p>
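<p>For comparison, the stages of that pipeline map fairly directly onto Python. The sketch below is an approximation under stated assumptions: it uses Python 3, the function name top_words_like_pipeline is hypothetical, and ties are broken alphabetically, which may not match how sort -n orders equal counts in every locale.</p>

```python
from collections import Counter

def top_words_like_pipeline(text, n=10):
    # tr -s '[:space:]' '\n' and tr '[:upper:]' '[:lower:]':
    # lowercase the text and split on runs of whitespace.
    tokens = text.lower().split()
    # sort | uniq -c: count identical tokens.
    counts = Counter(tokens)
    # sort -n | tail -10: ascending by count, keep the last n entries.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ordered[-n:]

# Made-up input; note that punctuation stays attached to words.
sample = "The cat. The hat\nthe cat"
result = top_words_like_pipeline(sample, n=2)
print(result)
```

<p>Like the shell version, this does not strip punctuation, so "cat." and "cat" are counted as different words, and the most frequent word comes last.</p>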
]]></content:encoded>
	</item>
</channel>
</rss>
