從惰性 IO 說起_Haskell 筆記 6

一。惰性 I/O 與 buffer

Haskell 中，I/O 也是惰性的，例如：

readThisFile = withFile "./data/lines.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents
  )

從硬碟讀文件時並不會一次性全讀入內存，而是一點一點的流式讀取。文本文件的話，默認 buffer 是 line-buffering，即一次讀一行，二進制文件的話，默認 buffer 是 block-buffering，一次讀一個 chunk，其具體大小取決於操作系統

line-buffering 和 block-buffering 用 BufferMode 值來表示：

data BufferMode
  = NoBuffering | LineBuffering | BlockBuffering (Maybe Int)
    -- Defined in 'GHC.IO.Handle.Types'

BufferMode 類型下有三個值，NoBuffering，LineBuffering，BlockBuffering (Maybe Int) 分別表示不用 buffer，用 line-buffering，以及用 block-buffering 模式。其中 Maybe Int 表示每個 chunk 有幾個字節（byte），給 Nothing 的話用系統默認的 chunk 大小，NoBuffering 表示一次讀一個字符（character），會瘋狂（高頻）訪問硬碟，一般不用

可以手動設置 BufferMode，例如：

readThisFileInBlockMode = withFile "./data/lines.txt" ReadMode (\handle -> do
    hSetBuffering handle $ BlockBuffering (Just 1024)
    contents <- hGetContents handle
    putStr contents
  )

每次讀 1024B（即 1KB），其中 hSetBuffering 的類型為：

hSetBuffering :: Handle -> BufferMode -> IO ()

接受一個文件指針和 BufferMode 值，返回個空的 I/O Action

既然有 buffer，就需要 flush buffer，所以還有個 hFlush：

hFlush :: Handle -> IO ()

用來清理 buffer，不用等 buffer 塞滿或者其它自動 flush 機制（如 line-buffering 遇到換行符就 flush）

P.S.有個很形象但不太優雅的比喻：

你的馬桶會在水箱有一加侖的水的時候自動沖水。所以你不斷灌水進去直到一加侖，馬桶就會自動沖水，在水裡面的數據也會被看到。但你也可以手動地按下沖水鈕來沖水。他會讓現有的水被沖走。沖水這個動作就是 hFlush 這個名字的含意。

二.Data.ByteString

既然從系統讀取文件需要考慮性能採用 Buffer，那讀入內存之後呢？又該如何存儲，如何操作？

ByteString 看著像個新的數據類型，但我們不是已經有 String 了嗎？

惰性的 List

String 是 Char List 的別名，而 List 是惰性的，所以：

str = "abc"
charList = ['a', 'b', 'c']
charList' = 'a' : 'b' : 'c' : []

> str == charList && charList == charList'
True

聲明字符串 "abc" 只是承諾，我們將會擁有一個 Char List，那麼什麼時候才真正擁有（或者創造）這個 List 呢？

在不得不計算（求值）的時候，比如上例中 == 判斷的時候：

instance (Eq a) => Eq [a] where
  {-# SPECIALISE instance Eq [Char] #-}
  []     == []     = True
  (x:xs) == (y:ys) = x == y && xs == ys
  _xs    == _ys    = False

（摘自 GHC.Classes）

通過模式匹配從左向右遍歷對比元素是否相等，每次取 List 首元，此時才真正需要List，才被"創造"出來

用非惰性的JS 來描述就像這樣：

function unshift(x, xs) {
  return [x].concat(xs);
}
const str = 'abc';
charList = unshift('a', unshift('b', unshift('c', [])));
function eq(s, a) {
  if (!s.length && !a.length) return true;
  return s[0] == a[0] && eq(s.slice(1), a.slice(1));
}

// test
eq(str, charList);

但與立即求值的 JS 不同，Haskell 是惰性的，所以，實際情況類似於：

const EMPTY_LIST = {
  value: Symbol.for('_EMPTY_LIST_'),
  tail: () => EMPTY_LIST
};
function unshift(x, xs) {
  return { value: x, tail: () => xs };
}
function sugar(str) {
  return str.split('')
    .reduceRight((a, v) => a.concat([v]), [])
    .reduce((a, v) => unshift(v, a), EMPTY_LIST);
}
const str = sugar('abc');
const charList = unshift('a', unshift('b', unshift('c', EMPTY_LIST)));
function eq(s, a) {
  if (s === EMPTY_LIST && a === EMPTY_LIST) return true;
  return s.value == a.value && eq(s.tail(), a.tail());
}

// test
eq(str, charList);

用"懶"鏈表來模擬只在真正需要的時候才去創造的 List，就像 'a' : 'b' : 'c' : []"承諾" 會有一個 'a' 開頭的 List，這個 List 有多長，佔多少空��，在真正需要求值之前都是未知的（也沒必要知道，所以允許存在無限長的 List，而不用擔心如何存儲的問題）

但這種惰性並非十全十美，帶來的直接問題就是效率不高，尤其是在巨長 List 的場景（比如讀文件），處理一個"承諾"（模擬場景裡的 tail()）的成本可能不高，但如果積攢了一大堆的"承諾"，處理這些"承諾" 的成本就會凸顯出來，實際效率自然會下降。所以，為了解決這個問題，就像引入 foldl 的嚴格版本（非惰性版本）foldl' 一樣，我們引入了 ByteString

P.S.上面提到的"承諾"，其實在 Haskell 有個對應的術語叫 thunk

ByteString

Bytestring 的每個元素都是一個字節（8 個 bit），分惰性與嚴格（非惰性）兩種：

惰性：Data.ByteString.Lazy，同樣具有惰性，但比 List 稍微勤快一些，不是逐元素的 thunk，而是逐 chunk 的（64K 一個 chunk），一定程度上減少了所產生 thunk 的數量
嚴格：位於 Data.ByteString 模塊，不會產生任何 thunk，表示一連串的字節，所以不存在無限長的 strict bytestring，也沒有惰性 List 的內存優勢

lazy bytestring 就像 chunk List（List 中每個元素都是 64K 大小的 strict bytestring），既減少了惰性帶來的效率影響，又具有惰性的內存優勢，所以大多數時候用 lazy 版本

P.S.64K 這個大小是有講究的：

64K 有很高的可能性能夠裝進你 CPU 的 L2 Cache

常用函數

ByteString 相當於另一種 List，所以 List 的大多數方法在 ByteString 都有同名的對應實現，例如：

head, tail, init, null, length, map, reverse, foldl, foldr, concat, takeWhile, filter

所以先要避免命名衝突：

-- 惰性 ByteString
import Data.ByteString.Lazy as B
-- 嚴格 ByteString
import Data.ByteString as S

創建一個 ByteString：

-- Word8 List 轉 ByteString
B.pack :: [GHC.Word.Word8] -> ByteString
-- 嚴格 ByteString 轉惰性 ByteString
B.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString

其中 Word8 相當於範圍更小的 Int（0 ~ 255 之間，和 Int 一樣都屬於 Num 類），例如：

> B.pack [65, 66, 67]
"ABC"
> B.fromChunks [S.pack [65, 66, 67], S.pack [97, 98, 99]]
"ABCabc"

注意，fromChunks 會把給定的一組 strict bytestring 串起來變成 chunk List，而不是先拼接起來再塞進一個個 64K 空間，如果有一堆碎的 strict bytestring 而又不像拼接起來佔著內存，可以用這種方式把它們串起來

插入元素：

B.cons :: GHC.Word.Word8 -> B.ByteString -> B.ByteString
B.cons' :: GHC.Word.Word8 -> B.ByteString -> B.ByteString

cons 就是 List 的 :，用於在左側插入元素，同樣是惰性的（即便第一個 chunk 足夠容納新元素，也插入一個 chunk），而 cons' 是其嚴格版本，會優先填充第一個 chunk 的剩餘空間，區別類似於：

> Prelude.foldr B.cons B.empty [50..60]
Chunk "2" (Chunk "3" (Chunk "4" (Chunk "5" (Chunk "6" (Chunk "7" (Chunk "8" (Chunk "9" (Chunk ":" (Chunk ";" (Chunk "<"
Empty))))))))))
> Prelude.foldr B.cons' B.empty [50..60]
Chunk "23456789:;<" Empty

P.S.舊版本 GHC 會 show 出類似於上面的差異，0.10.0.1 之後的 Show 實現改成了類似於字符串字面量的形式，看不出來差異了，具體見 Haskell: Does ghci show "Chunk .. Empty"?

文件讀寫：

-- 按 chunk 讀
S.readFile :: FilePath -> IO S.ByteString
-- 全讀進來
B.readFile :: FilePath -> IO B.ByteString
-- 逐 chunk 寫
S.writeFile :: FilePath -> S.ByteString -> IO ()
-- 一次寫完
B.writeFile :: FilePath -> B.ByteString -> IO ()

實際上，ByteString 與 String 類型在大多數場景可以很容易地互相轉換，所以可以先用 String 實現，在性能不好的場景再改成 ByteString

P.S.更多 ByteString 相關函數，見 Data.ByteString

三。命令行參數

除交互輸入和讀文件外，命令行參數是另一種獲取用戶輸入的重要方式：

-- readWhat.hs
import System.Environment
import System.IO

main = do
  args <- getArgs
  contents <- readFile (args !! 0)
  putStr contents

試玩一下：

$ ghc --make ./readWhat.hs
[1 of 1] Compiling Main             ( readWhat.hs, readWhat.o )
Linking readWhat ...
$  ./readWhat ./data/lines.txt
hoho, this is xx.
who's that ?
$ ./readWhat ./data/that.txt
contents in that file
another line
last line

這就有了 cat 的基本功能。其中 getArgs 的類型是：

getArgs :: IO [String]

位於 System.Environment 模塊，以為 I/O Action 形式返回命令行參數組成的 String 數組，類似的還有：

-- 獲取程序名（可執行文件的名字）
getProgName :: IO String
-- 獲取當前絕對路徑
getExecutablePath :: IO FilePath
-- 設置環境變量
setEnv :: String -> String -> IO ()
-- 獲取環境變量
getEnv :: String -> IO String

P.S.更多環境相關函數，見 System.Environment

例如：

import System.IO
import System.Environment

main = do
  progName <- getProgName
  args <- getArgs
  pwd <- getExecutablePath
  setEnv "NODE_ENV" "production"
  nodeEnv <- getEnv "NODE_ENV"
  putStrLn pwd
  putStrLn ("NODE_ENV " ++ nodeEnv)
  putStrLn (progName ++ (foldl (++) "" $ map (" " ++) args))

試玩：

$ ghc --make ./testArgs
[1 of 1] Compiling Main             ( testArgs.hs, testArgs.o )
Linking testArgs ...
$ ./testArgs -a --p path
/absolute/path/to/testArgs
NODE_ENV production
testArgs -a --p path

P.S.除 ghc --make sourceFile 編譯執行外，還有一種直接 run 源碼的方式：

$ runhaskell testArgs.hs -b -c
/absolute/path/to/ghc-8.0.1/bin/ghc
NODE_ENV production
testArgs.hs -b -c

此時 getExecutablePath 返回的是 ghc（可執行文件）的絕對路徑

四。隨機數

除了 I/O，另一個鐵定不純的場景就是隨機數了。那麼，純函數能造出來隨機數嗎？

造偽隨機數還是有點可能的。做法類似於 C 語言，要給個"種子"：

random :: (Random a, RandomGen g) => g -> (a, g)

其中 Random 和 RandomGen 種子的類型分別為：

instance Random Word -- Defined in 'System.Random'
instance Random Integer -- Defined in 'System.Random'
instance Random Int -- Defined in 'System.Random'
instance Random Float -- Defined in 'System.Random'
instance Random Double -- Defined in 'System.Random'
instance Random Char -- Defined in 'System.Random'
instance Random Bool -- Defined in 'System.Random'

instance RandomGen StdGen -- Defined in 'System.Random'
data StdGen
  = System.Random.StdGen {-# UNPACK #-}GHC.Int.Int32
                        {-# UNPACK #-}GHC.Int.Int32
    -- Defined in 'System.Random'

P.S.其中 Word 指的是可以指定寬度的無符號整型，具體見 Int vs Word in common use?

數值、字符、布爾類型等都可以有隨機值，種子則需要通過特殊的 mkStdGen :: Int -> StdGen 函數生成，例如：

> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)
> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)

果然是純函數，所以兩次調用結果完全一樣（並不是因為連續調用，過十天半個月調用還是這個結果）。通過類型聲明來告知 random 函數期望返回的隨機值類型，不妨換個別的：

> random (mkStdGen 7) :: (Bool, StdGen)
(True,320112 40692)
> random (mkStdGen 7) :: (Float, StdGen)
(0.34564054,2071543753 1655838864)
> random (mkStdGen 7) :: (Char, StdGen)
('\279419',320112 40692)

random 函數每次都會生成下一個種子，所以可以這樣做：

import System.Random

random3 i = collectNext $ collectNext $ [random $ mkStdGen i]
  where collectNext xs @((i, g):_) = xs ++ [random g]

試玩一下：

> random3 100
[(-3633736515773289454,693699796 2103410263),(-1610541887407225575,136012003 1780294415),(-1610541887407225575,136012003 1780294415)]
> (random3 100) :: [(Bool, StdGen)]
[(True,4041414 40692),(False,651872571 1655838864),(False,651872571 1655838864)]
> [b | (b, g) <- (random3 100) :: [(Bool, StdGen)]]
[True,False,False]

P.S.注意 (random3 100) :: [(Bool, StdGen)] 只限定了 random3 的返回類型，編譯器能夠推斷出 random $ mkStdGen i 所需類型是 (Bool, StdGen)

這下有點（偽）隨机的意思了，因為 random 是個純函數，所以只能通過換種子參數來得到不同的返回值

實際上有很簡單的方式：

random3' i = take 3 $ randoms $ mkStdGen i
> random3' 100 :: [Bool]
[True,False,False]

其中 randoms :: (Random a, RandomGen g) => g -> [a] 函數接受一個 RandomGen 參數，返回 Random 無窮序列

此外，常用的還有：

-- 返回 [min, max] 範圍的隨機數
randomR :: (Random a, RandomGen g) => (a, a) -> g -> (a, g)
-- 類似於 randomR，返回無限序列
randomRs :: (Random a, RandomGen g) => (a, a) -> g -> [a]

例如：

> randomR ('a', 'z') (mkStdGen 1)
('x',80028 40692)
> take 24 $ randomRs (1, 6) (mkStdGen 1)
[6,5,2,6,5,2,3,2,5,5,4,2,1,2,5,6,3,3,5,5,1,4,3,3]

P.S.更多隨機數相關函數，見 System.Random

動態種子

寫死的種子每次都返回同一串隨機數，沒什麼意義，所以需要一個動態的種子（如系統時間等）：

getStdGen :: IO StdGen

getStdGen 在程序運行時會向系統要一個隨機數生成器（random generator），並存成全局生成器（global generator）

例如：

main = do
  g <- getStdGen
  print $ take 10 (randoms g :: [Bool])

試玩一下：

$ ghc --make rand.hs
[1 of 1] Compiling Main             ( rand.hs, rand.o )
Linking rand ...
$ ./rand
[False,False,True,False,False,True,False,True,False,False]
$ ./rand
[True,False,False,False,True,False,False,False,True,True]
$ ./rand
[True,True,True,False,False,True,True,False,False,True]

注意，在 GHCIi 環境調用 getStdGen 得到的總是同一個種子，類似於程序連續調用 getStdGen 的效果，所以總是返回同一串隨機值序列：

> getStdGen
1661435168 1
> getStdGen
1661435168 1
> main
[False,False,False,False,True,False,False,False,True,True]
> main
[False,False,False,False,True,False,False,False,True,True]

可以手動控制取無限序列後面的部分，或者使用 newStdGen :: IO StdGen 函數：

> newStdGen
1018152561 2147483398
> newStdGen
1018192575 40691

newStdGen 能夠把現有的 global generator 分成兩個 random generator，把其中一個設置成 global generator，返回另一個。所以：

> getStdGen
1661435170 1655838864
> getStdGen
1661435170 1655838864
> newStdGen
1018232589 1655838863
> getStdGen
1661435171 2103410263

如上面示例，newStdGen 不僅返回新的 random generator，還會重置 global generator

五。異常處理

直到此刻，我們見過許多異常了（模式匹配遺漏、缺少類型聲明、空數組取首元、除零異常等），知道一旦發生異常，程序就會立刻報錯退出，但一直沒有嘗試過捕獲異常

實際上，與其它主流語言一樣，Haskell 也有完整的異常處理機制

I/O 異常

I/O 相關的場景需要更嚴謹的異常處理，因為與內部邏輯相比，外部環境顯得更加不可控，不可信賴：

像是打開文件，文件有可能被 lock 起來，也有可能文件被移除了，或是整個硬碟都被拔掉

此時需要拋出異常，告知程序某些事情發生了錯誤，沒有按照預期正常運行

I/O 異常可以通過 catchIOError 來捕獲，例如：

import System.IO.Error
catchIOError :: IO a -> (IOError -> IO a) -> IO a

傳入 I/O Action 和對應的異常處理函數，返回同類型的 I/O Action。機制類似於 try-catch，I/O Action 拋出異常才執行異常處理函數，並返回其返回值，例如：

import System.IO
import System.IO.Error
import Control.Monad
import System.Environment

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) (\err -> do return "Failed to read this file!")
    putStr contents
    )

在找不到文件，或者其他原因導致 readFile 異常時，會輸出提示信息：

$ runhaskell ioException.hs ./xx
Failed to read this file!

這裡只是簡單粗暴的吃掉了所有異常，最好區別對待：

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) errorHandler
    putStr contents
    )
  where errorHandler err
          | isDoesNotExistError err = do return "File not found!"
          | otherwise = ioError err

其中 isDoesNotExistError 和 ioError 如下：

isDoesNotExistError :: IOError -> Bool
ioError :: IOError -> IO a

前者是個 predicate，用來判定傳入的 IOError 是不是目標（文件）不存在引起的，後者相當於 JS 的 throw，把這個異常再度丟出去

IOError 的其它 predicate 還有：

isAlreadyExistsError
isAlreadyInUseError
isFullError
isEOFError
isIllegalOperation
isPermissionError
isUserError

其中 isUserError 用來判定通過 userError :: String -> IOError 函數手動製造的異常

獲取錯誤信息

想要輸出引發異常的用戶輸入的話，可能會這麼做：

exists = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err -> do
      return ("File " ++ file ++ " not found!\n")
      )
    putStr contents
    )

試玩一下：

> exists
./xx
File ./xx not found!
> exists
./io.hs
main = print "hoho"

符合預期，這裡用了 lambda 函數，能夠訪問外部的 file 變量，如果異常處理函數相當龐大，就不太容易了，例如：

exists' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (errorHandler file)
    putStr contents
    )
  where errorHandler file = \err -> do (return ("File " ++ file ++ " not found!\n"))

為了把 file 變量傳入 errorHandler，我們多包了一層，看起來蠢蠢的，而且能保留的現場信息很有限

所以，像其他語言一樣，我們能夠從異常對象身上取出一些錯誤信息，例如：

exists'' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err ->
      case ioeGetFileName err of Just path -> return ("File at " ++ path ++ " not found!\n")
                                 Nothing -> return ("File at somewhere not found!\n")
      )
    putStr contents
    )

其中 ioeGetFileName 用來從 IOError 中取出文件路徑（這些工具函數都以 ioe 開頭）：

ioeGetFileName :: IOError -> Maybe FilePath

P.S.更多類似函數，見 Attributes of I/O errors

純函數異常

異常並不是 I/O 場景特有的，例如：

> 1 `div` 0
*** Exception: divide by zero
> head []
*** Exception: Prelude.head: empty list

純函數也會引發異常，比如上面的除零異常和空數組取首元異常，有兩種處理方式：

使用 Maybe 或 Either
使用 try :: Exception e => IO a -> IO (Either e a)（位於 Control.Exception 模塊）

例如：

import Data.Maybe
> case listToMaybe [] of Nothing -> ""; Just first -> first
""
> case listToMaybe ["a", "b"] of Nothing -> ""; Just first -> first
"a"

其中 listToMaybe :: [a] -> Maybe a 用於取 List 首元，並包裝成 Maybe 類型（空 List 就是 Nothing）

除零異常要么手動檢查除數不為 0，要么用 evaluate 塞進 I/O 場景，通過 try 來捕獲：

> import Control.Exception
> first <- try $ evaluate $ 1 `div` 0 :: IO (Either ArithException Integer)
> first
Left divide by zero

實際上，除零異常的具體類型是 DivideByZero，位於 Control.Exception 模塊：

data ArithException
  = Overflow
  | Underflow
  | LossOfPrecision
  | DivideByZero
  | Denormal
  | RatioZeroDenominator
    -- Defined in 'GHC.Exception'

如果不清楚具體異常類別（這個是確實不清楚異常類型，查源碼都猜不出來），或者希望捕獲所有類型的異常，可以用 SomeException：

> first <- try $ evaluate $ head [] :: IO (Either SomeException ())
> first
Left Prelude.head: empty list

P.S.關於 4 種異常處理方案的更多信息，見 Handling errors in Haskell

一。惰性 I/O 與 buffer

二.Data.ByteString

惰性的 List

ByteString

常用函數

三。命令行參數

四。隨機數

動態種子

五。異常處理

I/O 異常

獲取錯誤信息

純函數異常

參考資料

評論

提交評論