Skip to main content

Starting from Lazy I/O_Haskell Notes 6

Free2018-05-26#Functional_Programming#Haskell随机数#Haskell异常处理#Haskell获取命令行参数#Haskell ByteString#Haskell Buffer

The story of buffer, chunk and thunk

I. Lazy I/O and buffer

In Haskell, I/O is also lazy, for example:

readThisFile = withFile "./data/lines.txt" ReadMode (\handle -> do
    contents <- hGetContents handle
    putStr contents
  )

When reading a file from disk, it doesn't read everything into memory at once, but reads in a stream bit by bit. For text files, the default buffer is line-buffering, meaning reading one line at a time. For binary files, the default buffer is block-buffering, reading one chunk at a time, with the specific size depending on the operating system

line-buffering and block-buffering are represented using BufferMode values:

data BufferMode
  = NoBuffering | LineBuffering | BlockBuffering (Maybe Int)
    -- Defined in 'GHC.IO.Handle.Types'

The BufferMode type has three values: NoBuffering, LineBuffering, BlockBuffering (Maybe Int) representing no buffer, using line-buffering, and using block-buffering mode respectively.其中 Maybe Int indicates how many bytes each chunk has (byte), giving Nothing uses the system's default chunk size, NoBuffering means reading one character at a time (character), which will access the disk crazily (high frequency), generally not used

You can manually set BufferMode, for example:

readThisFileInBlockMode = withFile "./data/lines.txt" ReadMode (\handle -> do
    hSetBuffering handle $ BlockBuffering (Just 1024)
    contents <- hGetContents handle
    putStr contents
  )

Reading 1024B (i.e. 1KB) each time, where hSetBuffering has the type:

hSetBuffering :: Handle -> BufferMode -> IO ()

Accepting a file handle and BufferMode value, returning an empty I/O Action

Since there's a buffer, you need to flush the buffer, so there's also hFlush:

hFlush :: Handle -> IO ()

Used to clear the buffer, no need to wait for the buffer to fill up or other automatic flush mechanisms (like line-buffering flushing when encountering a newline character)

P.S. There's a very vivid but not elegant metaphor:

Your toilet will automatically flush when there's a gallon of water in the tank. So you keep pouring water in until a gallon, the toilet will automatically flush, and the data in the water will be seen. But you can also manually press the flush button to flush. It will flush away the existing water. The flush action is the meaning of the name hFlush.

II. Data.ByteString

Since reading files from the system requires considering performance and using Buffer, what about after reading into memory? How to store and operate?

ByteString looks like a new data type, but don't we already have String?

Lazy List

String is an alias for Char List, and List is lazy, so:

str = "abc"
charList = ['a', 'b', 'c']
charList' = 'a' : 'b' : 'c' : []

> str == charList && charList == charList'
True

Declaring the string "abc" is just a promise that we will have a Char List, so when do we truly own (or create) this List?

When we have to compute (evaluate), such as in the == judgment in the example above:

instance (Eq a) => Eq [a] where
  {-# SPECIALISE instance Eq [Char] #-}
  []     == []     = True
  (x:xs) == (y:ys) = x == y && xs == ys
  _xs    == _ys    = False

(Excerpted from GHC.Classes)

Traversing and comparing whether elements are equal from left to right through pattern matching, taking the List head each time, only at this point is the List truly needed, and thus "created"

Using non-lazy JS to describe it would be like this:

function unshift(x, xs) {
  return [x].concat(xs);
}
const str = 'abc';
charList = unshift('a', unshift('b', unshift('c', [])));
function eq(s, a) {
  if (!s.length && !a.length) return true;
  return s[0] == a[0] && eq(s.slice(1), a.slice(1));
}

// test
eq(str, charList);

But unlike immediately evaluated JS, Haskell is lazy, so the actual situation is similar to:

const EMPTY_LIST = {
  value: Symbol.for('_EMPTY_LIST_'),
  tail: () => EMPTY_LIST
};
function unshift(x, xs) {
  return { value: x, tail: () => xs };
}
function sugar(str) {
  return str.split('')
    .reduceRight((a, v) => a.concat([v]), [])
    .reduce((a, v) => unshift(v, a), EMPTY_LIST);
}
const str = sugar('abc');
const charList = unshift('a', unshift('b', unshift('c', EMPTY_LIST)));
function eq(s, a) {
  if (s === EMPTY_LIST && a === EMPTY_LIST) return true;
  return s.value == a.value && eq(s.tail(), a.tail());
}

// test
eq(str, charList);

Using a "lazy" linked list to simulate a List that is only created when truly needed, like 'a' : 'b' : 'c' : [] "promises" there will be a List starting with 'a', how long this List is, how much space it occupies, is unknown before truly needing evaluation (and there's no need to know, so infinite Lists are allowed, without worrying about how to store them)

But this laziness isn't perfect, the direct problem it brings is low efficiency, especially in scenarios with huge Lists (like reading files), the cost of processing a "promise" (tail() in the simulation scenario) may not be high, but if a bunch of "promises" accumulate, the cost of processing these "promises" will become apparent, and actual efficiency will naturally decrease. So, to solve this problem, just like introducing the strict version (non-lazy version) foldl' of foldl, we introduced ByteString

P.S. The "promise" mentioned above actually has a corresponding term in Haskell called thunk

ByteString

Each element of ByteString is a byte (8 bits), divided into lazy and strict (non-lazy) versions:

  • Lazy: Data.ByteString.Lazy, also lazy, but slightly more diligent than List, not element-by-element thunk, but chunk-by-chunk (64K per chunk), reducing the number of thunks generated to some extent

  • Strict: Located in Data.ByteString module, doesn't generate any thunks, representing a continuous string of bytes, so there's no infinite length strict bytestring, nor the memory advantage of lazy List

lazy bytestring is like a chunk List (each element in the List is a 64K size strict bytestring), reducing the efficiency impact brought by laziness, while having the memory advantage of laziness, so most of the time use the lazy version

P.S. The size 64K is deliberate:

64K has a high probability of fitting into your CPU's L2 Cache

Common Functions

ByteString is equivalent to another List, so most List methods have corresponding implementations with the same name in ByteString, for example:

head, tail, init, null, length, map, reverse, foldl, foldr, concat, takeWhile, filter

So avoid naming conflicts first:

-- Lazy ByteString
import Data.ByteString.Lazy as B
-- Strict ByteString
import Data.ByteString as S

Creating a ByteString:

-- Word8 List to ByteString
B.pack :: [GHC.Word.Word8] -> ByteString
-- Strict ByteString to Lazy ByteString
B.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString

Where Word8 is equivalent to a smaller range Int (between 0 ~ 255, and like Int belongs to the Num class), for example:

> B.pack [65, 66, 67]
"ABC"
> B.fromChunks [S.pack [65, 66, 67], S.pack [97, 98, 99]]
"ABCabc"

Note, fromChunks will string together a given set of strict bytestrings into a chunk List, not concatenate first then stuff into 64K spaces one by one, if you have a bunch of fragmented strict bytestring and don't want to concatenate them occupying memory, you can use this way to string them together

Inserting elements:

B.cons :: GHC.Word.Word8 -> B.ByteString -> B.ByteString
B.cons' :: GHC.Word.Word8 -> B.ByteString -> B.ByteString

cons is like List's :, used to insert elements on the left, also lazy (even if the first chunk is large enough to accommodate the new element, it still inserts a chunk), while cons' is its strict version, will preferentially fill the remaining space of the first chunk, the difference is similar to:

> Prelude.foldr B.cons B.empty [50..60]
Chunk "2" (Chunk "3" (Chunk "4" (Chunk "5" (Chunk "6" (Chunk "7" (Chunk "8" (Chunk "9" (Chunk ":" (Chunk ";" (Chunk "<"
Empty))))))))))
> Prelude.foldr B.cons' B.empty [50..60]
Chunk "23456789:;<" Empty

P.S. Old version GHC will show differences similar to above, Show implementation after 0.10.0.1 changed to a form similar to string literals, can't see the difference anymore, specifics see Haskell: Does ghci show "Chunk .. Empty"?

File read/write:

-- Read by chunk
S.readFile :: FilePath -> IO S.ByteString
-- Read all in
B.readFile :: FilePath -> IO B.ByteString
-- Write by chunk
S.writeFile :: FilePath -> S.ByteString -> IO ()
-- Write all at once
B.writeFile :: FilePath -> B.ByteString -> IO ()

Actually, ByteString and String types can easily convert to each other in most scenarios, so you can implement using String first, then change to ByteString in scenarios with poor performance

P.S. More ByteString related functions, see Data.ByteString

III. Command Line Arguments

Besides interactive input and reading files, command line arguments are another important way to get user input:

-- readWhat.hs
import System.Environment
import System.IO

main = do
  args <- getArgs
  contents <- readFile (args !! 0)
  putStr contents

Try it out:

$ ghc --make ./readWhat.hs
[1 of 1] Compiling Main             ( readWhat.hs, readWhat.o )
Linking readWhat ...
$  ./readWhat ./data/lines.txt
hoho, this is xx.
who's that ?
$ ./readWhat ./data/that.txt
contents in that file
another line
last line

This has the basic functionality of cat. Where getArgs has the type:

getArgs :: IO [String]

Located in System.Environment module, returns a String array composed of command line arguments in I/O Action form, similar ones include:

-- Get program name (executable file name)
getProgName :: IO String
-- Get current absolute path
getExecutablePath :: IO FilePath
-- Set environment variable
setEnv :: String -> String -> IO ()
-- Get environment variable
getEnv :: String -> IO String

P.S. More environment related functions, see System.Environment

For example:

import System.IO
import System.Environment

main = do
  progName <- getProgName
  args <- getArgs
  pwd <- getExecutablePath
  setEnv "NODE_ENV" "production"
  nodeEnv <- getEnv "NODE_ENV"
  putStrLn pwd
  putStrLn ("NODE_ENV " ++ nodeEnv)
  putStrLn (progName ++ (foldl (++) "" $ map (" " ++) args))

Try it out:

$ ghc --make ./testArgs
[1 of 1] Compiling Main             ( testArgs.hs, testArgs.o )
Linking testArgs ...
$ ./testArgs -a --p path
/absolute/path/to/testArgs
NODE_ENV production
testArgs -a --p path

P.S. Besides ghc --make sourceFile compilation execution, there's also a way to run source code directly:

$ runhaskell testArgs.hs -b -c
/absolute/path/to/ghc-8.0.1/bin/ghc
NODE_ENV production
testArgs.hs -b -c

At this time getExecutablePath returns the absolute path of ghc (executable file)

IV. Random Numbers

Besides I/O, another definitely impure scenario is random numbers. So, can pure functions create random numbers?

Creating pseudo-random numbers is somewhat possible. The approach is similar to C language, need to give a "seed":

random :: (Random a, RandomGen g) => g -> (a, g)

Where Random and RandomGen seed types are respectively:

instance Random Word -- Defined in 'System.Random'
instance Random Integer -- Defined in 'System.Random'
instance Random Int -- Defined in 'System.Random'
instance Random Float -- Defined in 'System.Random'
instance Random Double -- Defined in 'System.Random'
instance Random Char -- Defined in 'System.Random'
instance Random Bool -- Defined in 'System.Random'

instance RandomGen StdGen -- Defined in 'System.Random'
data StdGen
  = System.Random.StdGen {-# UNPACK #-}GHC.Int.Int32
                        {-# UNPACK #-}GHC.Int.Int32
    -- Defined in 'System.Random'

P.S. Where Word refers to unsigned integers with specifiable width, specifics see Int vs Word in common use?

Numbers, characters, boolean types etc. can all have random values, seeds need to be generated through the special mkStdGen :: Int -> StdGen function, for example:

> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)
> random (mkStdGen 7) :: (Int, StdGen)
(5401197224043011423,33684305 2103410263)

Sure enough it's a pure function, so two calls return exactly the same result (not because of consecutive calls, calling again in ten days or half a month will still be this result). Let's try changing to something else by declaring the type to inform the random function of the expected random value type:

> random (mkStdGen 7) :: (Bool, StdGen)
(True,320112 40692)
> random (mkStdGen 7) :: (Float, StdGen)
(0.34564054,2071543753 1655838864)
> random (mkStdGen 7) :: (Char, StdGen)
('\279419',320112 40692)

The random function generates the next seed each time, so you can do this:

import System.Random

random3 i = collectNext $ collectNext $ [random $ mkStdGen i]
  where collectNext xs @((i, g):_) = xs ++ [random g]

Try it out:

> random3 100
[(-3633736515773289454,693699796 2103410263),(-1610541887407225575,136012003 1780294415),(-1610541887407225575,136012003 1780294415)]
> (random3 100) :: [(Bool, StdGen)]
[(True,4041414 40692),(False,651872571 1655838864),(False,651872571 1655838864)]
> [b | (b, g) <- (random3 100) :: [(Bool, StdGen)]]
[True,False,False]

P.S. Note (random3 100) :: [(Bool, StdGen)] only limits the return type of random3, the compiler can infer that the type needed for random $ mkStdGen i is (Bool, StdGen)

Now it has a bit of (pseudo) random meaning, because random is a pure function, so can only get different return values by changing the seed parameter

Actually there's a simpler way:

random3' i = take 3 $ randoms $ mkStdGen i
> random3' 100 :: [Bool]
[True,False,False]

Where randoms :: (Random a, RandomGen g) => g -> [a] function accepts a RandomGen parameter, returns an infinite Random sequence

Additionally, commonly used ones include:

-- Return random number in [min, max] range
randomR :: (Random a, RandomGen g) => (a, a) -> g -> (a, g)
-- Similar to randomR, returns infinite sequence
randomRs :: (Random a, RandomGen g) => (a, a) -> g -> [a]

For example:

> randomR ('a', 'z') (mkStdGen 1)
('x',80028 40692)
> take 24 $ randomRs (1, 6) (mkStdGen 1)
[6,5,2,6,5,2,3,2,5,5,4,2,1,2,5,6,3,3,5,5,1,4,3,3]

P.S. More random number related functions, see System.Random

Dynamic Seeds

Hard-coded seeds always return the same string of random numbers, meaningless, so need a dynamic seed (like system time etc.):

getStdGen :: IO StdGen

getStdGen requests a random generator from the system when the program runs, and stores it as a global generator

For example:

main = do
  g <- getStdGen
  print $ take 10 (randoms g :: [Bool])

Try it out:

$ ghc --make rand.hs
[1 of 1] Compiling Main             ( rand.hs, rand.o )
Linking rand ...
$ ./rand
[False,False,True,False,False,True,False,True,False,False]
$ ./rand
[True,False,False,False,True,False,False,False,True,True]
$ ./rand
[True,True,True,False,False,True,True,False,False,True]

Note, calling getStdGen in GHCI environment always gets the same seed, similar to the effect of a program continuously calling getStdGen, so always returns the same string of random value sequences:

> getStdGen
1661435168 1
> getStdGen
1661435168 1
> main
[False,False,False,False,True,False,False,False,True,True]
> main
[False,False,False,False,True,False,False,False,True,True]

You can manually control taking the latter part of the infinite sequence, or use the newStdGen :: IO StdGen function:

> newStdGen
1018152561 2147483398
> newStdGen
1018192575 40691

newStdGen can split the existing global generator into two random generators, set one as the global generator, and return the other. So:

> getStdGen
1661435170 1655838864
> getStdGen
1661435170 1655838864
> newStdGen
1018232589 1655838863
> getStdGen
1661435171 2103410263

As in the example above, newStdGen not only returns a new random generator, it also resets the global generator

V. Exception Handling

Until this moment, we've seen many exceptions (pattern matching omissions, missing type declarations, empty array head extraction, division by zero exceptions etc.), know that once an exception occurs, the program will immediately error and exit, but haven't tried catching exceptions yet

Actually, like other mainstream languages, Haskell also has a complete exception handling mechanism

I/O Exceptions

I/O related scenarios need more rigorous exception handling, because compared to internal logic, the external environment appears more uncontrollable, untrustworthy:

Like opening a file, the file might be locked, or the file might be removed, or the entire hard drive might be pulled out

At this time need to throw an exception, informing the program that something went wrong, didn't run normally as expected

I/O exceptions can be caught through catchIOError, for example:

import System.IO.Error
catchIOError :: IO a -> (IOError -> IO a) -> IO a

Passing in I/O Action and corresponding exception handling function, returning an I/O Action of the same type. The mechanism is similar to try-catch, the exception handling function is executed only when the I/O Action throws an exception, and returns its return value, for example:

import System.IO
import System.IO.Error
import Control.Monad
import System.Environment

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) (\err -> do return "Failed to read this file!")
    putStr contents
    )

When the file cannot be found, or other reasons cause readFile exception, it will output prompt information:

$ runhaskell ioException.hs ./xx
Failed to read this file!

Here it simply and crudely swallows all exceptions, better to treat them differently:

main = do
  args <- getArgs
  when (not . null $ args) (do
    contents <- catchIOError (readFile (head args)) errorHandler
    putStr contents
    )
  where errorHandler err
          | isDoesNotExistError err = do return "File not found!"
          | otherwise = ioError err

Where isDoesNotExistError and ioError are as follows:

isDoesNotExistError :: IOError -> Bool
ioError :: IOError -> IO a

The former is a predicate, used to determine whether the passed in IOError is caused by the target (file) not existing, the latter is like JS's throw, throwing this exception out again

IOError's other predicates include:

isAlreadyExistsError
isAlreadyInUseError
isFullError
isEOFError
isIllegalOperation
isPermissionError
isUserError

Where isUserError is used to determine exceptions manually created through the userError :: String -> IOError function

Getting Error Information

If you want to output the user input that caused the exception, you might do this:

exists = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err -> do
      return ("File " ++ file ++ " not found!\n")
      )
    putStr contents
    )

Try it out:

> exists
./xx
File ./xx not found!
> exists
./io.hs
main = print "hoho"

Meets expectations, here used a lambda function, can access the external file variable, if the exception handling function is quite large, it's not easy, for example:

exists' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (errorHandler file)
    putStr contents
    )
  where errorHandler file = \err -> do (return ("File " ++ file ++ " not found!\n"))

To pass the file variable into errorHandler, we wrapped an extra layer, looks stupid, and the scene information that can be retained is very limited

So, like other languages, we can extract some error information from the exception object, for example:

exists'' = do
  file <- getLine
  when (not . null $ file) (do
    contents <- catchIOError (readFile file) (\err ->
      case ioeGetFileName err of Just path -> return ("File at " ++ path ++ " not found!\n")
                                 Nothing -> return ("File at somewhere not found!\n")
      )
    putStr contents
    )

Where ioeGetFileName is used to extract the file path from IOError (these utility functions all start with ioe):

ioeGetFileName :: IOError -> Maybe FilePath

P.S. More similar functions, see Attributes of I/O errors

Pure Function Exceptions

Exceptions are not unique to I/O scenarios, for example:

> 1 `div` 0
*** Exception: divide by zero
> head []
*** Exception: Prelude.head: empty list

Pure functions can also raise exceptions, like the division by zero exception and empty array head extraction exception above, there are two handling methods:

  • Use Maybe or Either

  • Use try :: Exception e => IO a -> IO (Either e a) (located in Control.Exception module)

For example:

import Data.Maybe
> case listToMaybe [] of Nothing -> ""; Just first -> first
""
> case listToMaybe ["a", "b"] of Nothing -> ""; Just first -> first
"a"

Where listToMaybe :: [a] -> Maybe a is used to take the List head, and wrap it into Maybe type (empty List is Nothing)

Division by zero exception either manually check divisor is not 0, or use evaluate to stuff into I/O scenario, catch through try:

> import Control.Exception
> first <- try $ evaluate $ 1 `div` 0 :: IO (Either ArithException Integer)
> first
Left divide by zero

Actually, the specific type of division by zero exception is DivideByZero, located in Control.Exception module:

data ArithException
  = Overflow
  | Underflow
  | LossOfPrecision
  | DivideByZero
  | Denormal
  | RatioZeroDenominator
    -- Defined in 'GHC.Exception'

If unclear about specific exception category (this is truly unclear about exception type, even checking source code can't guess), or hope to catch all types of exceptions, can use SomeException:

> first <- try $ evaluate $ head [] :: IO (Either SomeException ())
> first
Left Prelude.head: empty list

P.S. For more information about 4 exception handling schemes, see Handling errors in Haskell

References

Comments

No comments yet. Be the first to share your thoughts.

Leave a comment