[Python] Showing word counts in an article with a histogram

This exercise counts how often each word appears in an English article and shows the ten most frequent words as a histogram.

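A few points caused trouble:

The article contains punctuation such as , . ‘ “ \ - and so on. If these marks are not removed before counting, the results change: for example, "woman" and "woman." are counted as two different words.
The second problem is that the text contains UTF-8 encoded characters, so encoding="utf8" has to be specified when opening the file.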
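As a small illustration of the encoding point, here is a minimal sketch that writes a line containing curly quotes and an ellipsis to a temporary file and reads it back with an explicit encoding. The sample text and the temporary file are made up for this example and are not part of the exercise.

```python
import os
import tempfile

# Hypothetical sample line with non-ASCII characters (curly quotes, ellipsis).
sample = "\u201cA woman,\u201d he said\u2026\n"

# Write the sample as UTF-8 so the example does not depend on an existing file.
with tempfile.NamedTemporaryFile("w", encoding="utf8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# An explicit encoding makes decoding independent of the platform default.
with open(path, "r", encoding="utf8") as file:
    text = file.read()

os.remove(path)
print(text == sample)
```

Without the encoding argument, open() falls back to a platform-dependent default, which is where decoding errors or garbled characters can come from.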

Plotting uses the matplotlib library. If it is not installed yet, install it first by running pip install matplotlib in cmd.

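To count how often each word appears, I read the article line by line and first strip the unwanted marks from each line with replace(). After that, only English words and spaces are left.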
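The stripping step can be sketched on its own as follows. The helper name strip_marks and the sample sentence are assumptions for illustration; the set of marks is the one used in the full listing further down.

```python
# Marks to remove; the space is left out on purpose so words stay separated.
MARKS = "\"',.?!*[]#:-"

def strip_marks(line, marks=MARKS):
    """Return `line` with every character in `marks` removed."""
    for mark in marks:
        line = line.replace(mark, "")
    return line

cleaned = strip_marks('"Woman," she said. A woman!')
print(cleaned)  # Woman she said A woman
```

Note that removing "-" also joins hyphenated words together, which may or may not be what you want for a given text.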

Then split() cuts the line on whitespace into a list of words, and a dict keeps track of how many times each word has appeared:

count[i] = count.get(i,0) + 1 # if the key i already exists, take its value and add 1; otherwise start from 0
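Put together, the split-and-count step might look like the sketch below. The function name count_words and the two sample lines are hypothetical; the counting line itself is the dict.get() idiom shown above.

```python
def count_words(lines):
    """Count lowercase words across an iterable of already-cleaned lines."""
    counts = {}
    for line in lines:
        # split() with no argument splits on any run of whitespace.
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words(["the cat sat", "The cat ran"])
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```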

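At this point we have the count for every word. Next comes sorting: a lambda sets the sort key to the dict value, and because the order should go from largest to smallest, reverse = True: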

key = lambda d:d[1] # d is a (key, value) pair from the dict: d[0] is the key, d[1] is the value

Then print the top ten. sorted() returns a list of tuples, so a loop over the first ten entries is enough to get each key and value.
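Here is a minimal sketch of the sorting step on a made-up dict, taking the top three instead of the top ten to keep the output short. As an alternative that this post does not use, collections.Counter from the standard library offers most_common(), which gives the same result here.

```python
from collections import Counter

# Hypothetical counts for illustration.
counts = {"the": 5, "cat": 2, "sat": 1, "a": 3}

# Sort (word, count) pairs by count, largest first.
sorted_count = sorted(counts.items(), key=lambda d: d[1], reverse=True)

# Slicing avoids an IndexError when there are fewer entries than requested.
for word, freq in sorted_count[:3]:
    print(word, freq)

# Alternative: Counter.most_common(n) returns the n highest counts directly.
top_three = Counter(counts).most_common(3)
```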

A quick note on some of the plotting functions:

plt.title() # sets the chart title
plt.xlabel() / plt.ylabel() # sets the x / y axis labels
plt.bar() # draws the bars from the given values
plt.show() # displays the chart
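A minimal sketch of how these calls fit together, assuming matplotlib is installed. The function name plot_word_counts and the sample data are made up for this example.

```python
import matplotlib.pyplot as plt

def plot_word_counts(pairs):
    """Draw a bar chart from a list of (word, count) tuples and return the Axes."""
    words = [word for word, _ in pairs]
    freqs = [freq for _, freq in pairs]
    plt.title("Word Count")
    plt.xlabel("Word")
    plt.ylabel("Count")
    plt.bar(words, freqs)
    return plt.gca()

ax = plot_word_counts([("the", 5), ("a", 3), ("cat", 2)])
# Call plt.show() afterwards to open the chart window.
```

Returning the Axes is only a convenience for inspecting the chart; calling the four plt functions directly works just as well.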

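Also, matplotlib currently seems to have a bug: the bars are drawn in alphabetical order of their labels instead of the order of the list that was passed in. For details, see this post: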

https://stackoverflow.com/questions/47373762/pyplot-sorting-y-values-automatically

import matplotlib.pyplot as plt

# Punctuation marks to strip before counting (the space is deliberately left out)
mark_set = "\"',.?!*[]#:-"

if __name__ == '__main__':
    count = dict()
    filename = input("Please input the filename (must be in the same workspace): ")
    try:
        with open(filename + ".txt", "r", encoding="utf8") as file:
            for each_row in file:
                for mark in mark_set:
                    each_row = each_row.replace(mark, '')
                each_row = each_row.lower()  # lowercase so "The" and "the" match

                # split() with no argument splits on any whitespace and drops the trailing newline
                for word in each_row.split():
                    count[word] = count.get(word, 0) + 1  # add this row's words to the running totals
    except (OSError, UnicodeDecodeError) as err:
        # The file is missing, unreadable, or not valid UTF-8
        print("Oops! Some error happened:", err)
        print("Closing..")
        raise SystemExit(1)

    # Sort the (word, count) pairs by count, largest first
    sorted_count = sorted(count.items(), key=lambda d: d[1], reverse=True)
    top_ten = sorted_count[:10]  # slicing is safe even with fewer than ten distinct words
    for word, freq in top_ten:
        print(word, freq)

    # Plot
    plt.title("Word Count", color="r")
    plt.xlabel("Word", color="r")
    plt.ylabel("Count", color="r")
    # With the matplotlib version used here, string categories were drawn in alphabetical order, which looks like a bug
    plt.bar([word for word, _ in top_ten], [freq for _, freq in top_ten], color='r')

    plt.show()

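Update: although the plotting bug is still there, the intended order can be kept by using numbers for the x axis first and then turning them into strings with xticks(). The code looks like this: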

...
 
top_ten = sorted_count[:10]
plt.bar(range(len(top_ten)), [freq for _, freq in top_ten],color = 'r')
plt.xticks(range(len(top_ten)), [word for word, _ in top_ten])
 
...

The result looks like this:

[Figure: the resulting bar chart]
