Introducing Julia / Working with text files


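Reading from files

The standard way to get information from a text file is to use the open(), read(), and close() functions.

Open

To read text from a file, first obtain a file handle: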

f = open("sherlock-holmes.txt")

f is now Julia's connection to the file on disk. When you have finished with the file, close the connection with:

close(f)

In general, the recommended way to work with files in Julia is to wrap any file-processing functions in a do block:

open("sherlock-holmes.txt") do file
    # do stuff with the open file
end

When the block finishes, the open file is closed automatically. For more about do blocks, see Controlling the flow.
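As a rough illustration of what "closed automatically" means, the sketch below compares the do-block form with an explicit try/finally version. The temporary file and the use of countlines() are assumptions made only so that the example is self-contained; they are not part of the original text.

```julia
# Write a small sample file so the example is self-contained.
path = joinpath(mktempdir(), "sample.txt")   # hypothetical file name
write(path, "first line\nsecond line\n")

# The do-block form ...
n1 = open(path) do file
    countlines(file)
end

# ... behaves roughly like this explicit try/finally version.
file = open(path)
n2 = try
    countlines(file)
finally
    close(file)   # runs even if the body throws an error
end

println(n1, " ", n2)
println(isopen(file))   # the handle is closed once the block is done
```

The do-block form is usually preferable because there is no handle left around to forget about.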

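Because local variables are scoped to the block, you may want to return some of the processed information so that you can keep it: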

totaltime, totallines = open("sherlock-holmes.txt") do f
    linecounter = 0
    timetaken = @elapsed for l in eachline(f)
        linecounter += 1
    end
    (timetaken, linecounter)
end

julia> totaltime, totallines
(0.004484679, 76803)

Slurp - reading a whole file at once

You can read the entire contents of an open file at once with read():

julia> s = read(f, String)

This stores the entire contents of the file in s. The same thing in do-block form:

s = open("sherlock-holmes.txt") do file
    read(file, String)
end

You can use readlines() to read the whole file into an array, with each line as one element:

julia> f = open("sherlock-holmes.txt");

julia> lines = readlines(f)
76803-element Array{String,1}:
"THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE"
""
"   I. A Scandal in Bohemia"
"  II. The Red-headed League"
...
"Holmes, rather to my disappointment, manifested no further"
"interest in her when once she had ceased to be the centre of one"
"of his problems, and she is now the head of a private school at"
"Walsall, where I believe that she has met with considerable success."
julia> close(f)

Now you can step through the lines:

counter = 1
for l in lines
   println("$counter $l")
   counter += 1
end

1 THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2
3    I. A Scandal in Bohemia
4   II. The Red-headed League
5  III. A Case of Identity
6   IV. The Boscombe Valley Mystery
...
12638 interest in her when once she had ceased to be the centre of one
12639 of his problems, and she is now the head of a private school at
12640 Walsall, where I believe that she has met with considerable success.

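There is a better way to do this - see enumerate() below.

You might find the chomp() function useful - it removes a trailing newline from a string. In current Julia versions readlines() and eachline() already drop line endings by default, so chomp() matters mainly when you read with keep=true or with read().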
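The following sketch shows how chomp() relates to the keep keyword of readlines(). It reads from an in-memory IOBuffer rather than a real file, which is an assumption made purely to keep the example self-contained.

```julia
io = IOBuffer("alpha\r\nbeta\n")    # in-memory stand-in for an open file

kept = readlines(io, keep=true)     # line endings preserved
seekstart(io)                       # rewind so the data can be read again
stripped = readlines(io)            # default: line endings removed

println(repr(kept))
println(repr(stripped))
println(chomp.(kept) == stripped)   # chomp removes the trailing "\n" or "\r\n"
```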

Line by line

The eachline() function turns a source into an iterator. This lets you process a file one line at a time:

open("sherlock-holmes.txt") do file
    for ln in eachline(file)
        println("$(length(ln)), $(ln)")
    end
end

59, THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
0,
26,    I. A Scandal in Bohemia
27,   II. The Red-headed League
24,  III. A Case of Identity
33,   IV. The Boscombe Valley Mystery
…
60, the island of Mauritius. As to Miss Violet Hunter, my friend
58, Holmes, rather to my disappointment, manifested no further
64, interest in her when once she had ceased to be the centre of one
63, of his problems, and she is now the head of a private school at
68, Walsall, where I believe that she has met with considerable success.

Another approach is to read until you reach the end of the file. You might want to keep track of which line you are on:

open("sherlock-holmes.txt") do f
    line = 1
    while !eof(f)
        x = readline(f)
        println("$line $x")
        line += 1
    end
end

A better approach is to use enumerate() on an iterable - you get the line numbers "for free":

open("sherlock-holmes.txt") do f
    for (i, line) in enumerate(eachline(f))
        println(i, ": ", line)
    end
end

If you have a specific function that you want to call on a file, you can use this alternative syntax:

function shout(f::IOStream)
    return uppercase(read(f, String))
end

julia> shoutversion = open(shout, "sherlock-holmes.txt");

julia> shoutversion[30237:30400]
"ELEMENTARY PROBLEMS. LET HIM, ON MEETING A\nFELLOW-MORTAL, LEARN AT A GLANCE TO DISTINGUISH THE HISTORY OF THE\nMAN, AND THE TRADE OR  PROFESSION TO WHICH HE BELONGS. "

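This opens the file, runs the shout() function on it, then closes it again, assigning the processed contents to the variable.

You can use CSV.jl to read and write comma-separated value (.csv) files. It is recommended over the DelimitedFiles.readdlm() function for reading lines delimited by particular characters, such as data files, arrays stored as text files, and tables, because it handles more edge cases and is faster, especially on large files. If you use the DataFrames package, CSV.read(path, DataFrame) reads the data straight into a table; the readtable() function that older DataFrames releases provided for this has since been removed.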

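Working with paths and filenames

These functions are useful for working with filenames:

cd(path) - changes the current directory.
pwd() - gets the current working directory.
readdir(path) - returns a list of the contents of the named directory, or of the current directory.
abspath(path) - adds the current directory's path to a filename to make an absolute pathname.
joinpath(str, str, ...) - assembles a pathname from pieces.
isdir(path) - tells you whether the path is a directory.
splitdir(path) - splits a path into a tuple of the directory name and the file name.
splitdrive(path) - on Windows, splits a path into the drive letter part and the path part. On Unix systems, the first component is always the empty string.
splitext(path) - if the last component of a path contains a dot, splits the path into everything before the dot and everything from the dot onwards. Otherwise, returns a tuple of the unmodified argument and the empty string.
expanduser(path) - replaces a tilde at the start of a path with the current user's home directory.
normpath(path) - normalizes a path, removing "." and ".." entries.
realpath(path) - canonicalizes a path by expanding symbolic links and removing "." and ".." entries.
homedir() - gets the current user's home directory.
dirname(path) - gets the directory part of a path.
basename(path) - gets the file name part of a path.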
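To make a few of these concrete, here is a small sketch that takes a path apart and puts it back together. The path itself is hypothetical and does not need to exist on disk, because these functions only manipulate strings.

```julia
path = joinpath("projects", "julia", "notes.txt")   # hypothetical path

dir, file = splitdir(path)      # directory part and file name part
stem, ext = splitext(file)      # name without extension, and the extension

println(dir)
println(file)
println(stem, " + ", ext)
println(isabspath(abspath(path)))   # abspath always yields an absolute path
println(normpath(joinpath("a", "b", "..", "c")) == joinpath("a", "c"))
```

Using joinpath() rather than concatenating strings with "/" keeps the code portable between Unix and Windows.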

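To work on a restricted selection of files in a directory, use filter() and an anonymous function to filter the file names and keep just the ones you want. (filter() is more like a fishing net or a sieve than a coffee filter, in that it catches what you want to keep.)

for f in filter(x -> endswith(x, ".jl"), readdir())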
    println(f)
end

Astro.jl
calendar.jl
constants.jl
coordinates.jl
...
pseudoscience.jl
riseset.jl
sidereal.jl
sun.jl
utils.jl
vsop87d.jl

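If you want to match a group of files using a regular expression, use occursin(). Let's look for files with the suffix ".jpg" or ".png" (remembering to escape the "." and to anchor the match at the end of the name with "$").

for f in filter(x -> occursin(r"(?i)\.(jpg|png)$", x), readdir())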
    println(f)
end

034571172750.jpg
034571172750.png
51ZN2sCNfVL._SS400_.jpg
51bU7lucOJL._SL500_AA300_.jpg
Voronoy.jpg
kblue.png
korange.png
penrose.jpg
r-home-id-r4.png
wave.jpg

To examine a file hierarchy, use walkdir(), which lets you walk through a directory tree and examine the files in each directory in turn.
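A minimal sketch of walkdir() in use is shown below. It builds a throwaway directory tree first so that it can run anywhere; the directory and file names are made up for the illustration.

```julia
# Build a small throwaway directory tree (all names are hypothetical).
root = mktempdir()
mkpath(joinpath(root, "src", "utils"))
touch(joinpath(root, "README.md"))
touch(joinpath(root, "src", "main.jl"))
touch(joinpath(root, "src", "utils", "helpers.jl"))

# walkdir yields one (directory, subdirectory names, file names) tuple per directory.
jlfiles = String[]
for (dir, subdirs, files) in walkdir(root)
    for file in files
        endswith(file, ".jl") && push!(jlfiles, joinpath(dir, file))
    end
end

foreach(println, sort(jlfiles))
```

Because walkdir() returns bare file names, joinpath(dir, file) is needed to get a usable path to each file.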

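File information

If you want information about a specific file, use stat("pathname") and then read one of its fields. Here is how to get all the information about the file "i", listed together with the field names: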

for n in fieldnames(typeof(stat("i")))
    println(n, ": ", getfield(stat("i"), n))
end

device: 16777219
inode: 2955324
mode: 16877
nlink: 943
uid: 502
gid: 20
rdev: 0
size: 32062
blksize: 4096
blocks: 0
mtime: 1.409769933e9
ctime: 1.409769933e9

You can access these fields through the "stat" structure:

julia> s = stat("Untitled1.ipynb")
StatStruct(mode=100644, size=64424)

julia> s.ctime
1.446649269e9

You can also use some of them directly as functions:

julia> ctime("Untitled2.ipynb")
1.446649269e9

Not size, though: size() does not accept a file name, so read the field instead:

julia> s.size
64424
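For everyday checks, Base also has shortcut functions such as filesize(), isfile(), and mtime() that take a path directly. The sketch below tries them on a temporary file; the file name and its contents are assumptions made for the example.

```julia
using Dates

path = joinpath(mktempdir(), "sample.txt")   # hypothetical file
write(path, "hello")                         # five bytes of content

println(filesize(path))                # same value as stat(path).size
println(isfile(path))
println(unix2datetime(mtime(path)))    # modification time as a DateTime (UTC)
```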

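To work on specific files that meet certain conditions - for example, all Jupyter files (files with the extension "ipynb") modified after a certain date - you could use something like this: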

using Dates
function output_file(path)
    println(stat(path).size, ": ", path)
end 

for afile in filter!(f -> endswith(f, "ipynb") && (mtime(f) > Dates.datetime2unix(DateTime("2015-11-03T09:00"))),
    readdir())
    output_file(realpath(afile))
end

Interacting with the file system

The cp(), mv(), rm(), and touch() functions have the same names and do the same jobs as their Unix shell counterparts.
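As a quick sketch of these four functions working together, the example below creates, copies, renames, and deletes files inside a scratch directory. The file names are hypothetical, and mktempdir() is used so that nothing outside the scratch directory is touched.

```julia
dir = mktempdir()              # scratch directory
a = joinpath(dir, "a.txt")     # hypothetical file names
b = joinpath(dir, "b.txt")
c = joinpath(dir, "c.txt")

touch(a)     # create an empty file
cp(a, b)     # copy a.txt to b.txt
mv(b, c)     # rename b.txt to c.txt
rm(a)        # delete a.txt

println(readdir(dir))
```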

To convert a filename to a pathname, use abspath(). You can map it over a list of the files in a directory:

julia> map(abspath, readdir())
67-element Array{String,1}:
"/Users/me/.CFUserTextEncoding"
"/Users/me/.DS_Store"
"/Users/me/.Trash"
"/Users/me/.Xauthority"
"/Users/me/.ahbbighrc"
"/Users/me/.apdisk"
"/Users/me/.atom"
...

To restrict the list to filenames that contain a particular substring, use an anonymous function inside filter() - something like this:

julia> filter(x -> occursin("re", x), map(abspath, readdir()))
4-element Array{String,1}:
"/Users/me/.DS_Store"
"/Users/me/.gitignore"
"/Users/me/.hgignore_global"
"/Users/me/Pictures"
...

To restrict the list to regular expression matches, try this:

julia> filter(x -> occursin(r"recur.*\.jl", x), map(abspath, readdir()))
2-element Array{String,1}:
 "/Users/me/julia/recursive-directory-scan.jl"
 "/Users/me/julia/recursive-text.jl"

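Writing to files

To write to a text file, open it with the "w" flag, and make sure that you have permission to create the file in the specified directory: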

open("/tmp/t.txt", "w") do f
    write(f, "A, B, C, D\n")
end
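Note that "w" truncates an existing file. The sketch below contrasts it with the "a" (append) mode, which is not covered in the text above and is shown here only as a related option; the temporary path is an assumption so that the example can run anywhere.

```julia
path = joinpath(mktempdir(), "t.txt")   # hypothetical file name

open(path, "w") do f                    # "w" creates or truncates the file
    write(f, "A, B, C, D\n")
end

open(path, "a") do f                    # "a" appends to the existing contents
    println(f, "1, 2, 3, 4")            # println adds the newline itself
end

print(read(path, String))
```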

Here is how to write 20 lines, each containing 4 random numbers between 1 and 10, separated by commas:

function fourrandom()
    return rand(1:10,4)
end

open("/tmp/t.txt", "w") do f
    for i in 1:20
        n1, n2, n3, n4 = fourrandom()
        write(f, "$n1, $n2, $n3, $n4 \n")
    end
end

A quicker alternative is the DelimitedFiles.writedlm() function, described in the next section:

using DelimitedFiles
writedlm("/tmp/test.txt", rand(1:10, 20, 4), ", ")

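Writing arrays to files and reading arrays from files

The DelimitedFiles package has two convenient functions, writedlm() and readdlm(). They let you write an array or collection to a file and read one back from a file.

writedlm() writes the contents of an object to a text file, and readdlm() reads the data from a file into an array: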

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.913583  0.312291  0.0855798  0.0592331  0.371789
0.13747   0.422435  0.295057   0.736044   0.763928
0.360894  0.434373  0.870768   0.469624   0.268495
0.620462  0.456771  0.258094   0.646355   0.275826
0.497492  0.854383  0.171938   0.870345   0.783558

julia> writedlm("/tmp/test.txt", numbers)

You can look at the file using the shell (type a semicolon ";" to switch):

shell> cat "/tmp/test.txt"
0.9135833328830523	0.3122905420350348	0.08557977218948465	0.0592330821115965	0.3717889559226475
0.13747015238054083	0.42243494637594203	0.29505701073304524	0.7360443978397753	0.7639280496847236
0.36089432672073607	0.43437288984307787	0.870767989032692	0.4696243851552686	0.26849468736154325
0.6204624598015906	0.4567706404666232	0.25809436255988105	0.6463554854347682	0.27582613759302377
0.4974916625466639	0.8543829989347014	0.17193814498701587	0.8703447748713236	0.783557793485824

Elements are separated by tabs unless you specify another delimiter. Here, a colon is used to separate the numbers:

julia> writedlm("/tmp/test.txt", rand(1:6, 10, 10), ":")

shell> cat "/tmp/test.txt"
3:3:3:2:3:2:6:2:3:5
3:1:2:1:5:6:6:1:3:6
5:2:3:1:4:4:4:3:4:1
3:2:1:3:3:1:1:1:5:6
4:2:4:4:4:2:3:5:1:6
6:6:4:1:6:6:3:4:5:4
2:1:3:1:4:1:5:4:6:6
4:4:6:4:6:6:1:4:2:3
1:4:4:1:1:1:5:6:5:6
2:4:4:3:6:6:1:1:5:5

To read data from a text file, you can use readdlm().

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988

julia> writedlm("/tmp/test.txt", numbers)

julia> numbers = readdlm("/tmp/test.txt")
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988
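When the file uses a delimiter other than whitespace, pass the same delimiter to readdlm(), and optionally an element type. The sketch below round-trips a colon-delimited integer matrix; the temporary file name is an assumption for the example.

```julia
using DelimitedFiles

path = joinpath(mktempdir(), "dice.txt")   # hypothetical file name
rolls = rand(1:6, 10, 10)

writedlm(path, rolls, ':')                 # colon-delimited, as in the earlier example
restored = readdlm(path, ':', Int)         # parse every field as an Int

println(restored == rolls)
println(eltype(restored))
```

Without the type argument, readdlm() works out an element type from the data, typically Float64 for purely numeric files.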

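There are also a number of Julia packages designed specifically for reading and writing file data, including DataFrames.jl and CSV.jl. You can search for these and other packages at JuliaHub or JuliaPackages. Many of them are maintained under the JuliaData organization.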