module PlainText

Utility methods for mainly line-based processing of String

This module contains methods useful in processing a String object of a text file, that is, a String that contains an entire or a multiple-line part of a text file. The methods include normalizing the line-break codes, removing extra spaces from each line, etc. Many of the methods work on tha basis of a line. For example, {#head} and {#tail} methods work like the respective UNIX-shell commands, returning a specified line at the head/tail parts of self.

Most methods in this module are meant to be included in String, except for a few module functions. It is however debatable whether it is a good practice to include a third-party module in the core class. This module contains a helper module function {PlainText.extend_this}, with which an object extends this module easily as Singleton if this module is not already included.

A few methods in this module assume that {PlainText::Split} is included in String, which in default is the case, as soon as this file is read (by Ruby's require).

@author Masa Sakano (Wise Babel Ltd)

Constants

DEF_HEADTAIL_N_LINES

Default number of lines to extract for {#head} and {#tail}

DEF_METHOD_OPTS

Default options for class/instance methods

DefLineBreaks

List of the default line breaks.

Public Class Methods

__call_inst_method__(method, instr, *rest, **k) click to toggle source

Call instance method as a Module function

The return String includes {PlainText} as Singleton.

@param method [Symbol] module method name @param instr [String] String that is examined. @return [#instr]

# File lib/plain_text.rb, line 64
def self.__call_inst_method__(method, instr, *rest, **k)
  newself = instr.clone
  PlainText.extend_this(newself)
  newself.public_send(method, *rest, **k)
end
clean_text( prt, preserve_paragraph: DEF_METHOD_OPTS[:clean_text][:preserve_paragraph], boundary_style: DEF_METHOD_OPTS[:clean_text][:boundary_style], lbs_style: DEF_METHOD_OPTS[:clean_text][:lbs_style], lb_is_space: DEF_METHOD_OPTS[:clean_text][:lb_is_space], sps_style: DEF_METHOD_OPTS[:clean_text][:sps_style], delete_asian_space: DEF_METHOD_OPTS[:clean_text][:delete_asian_space], linehead_style: DEF_METHOD_OPTS[:clean_text][:linehead_style], linetail_style: DEF_METHOD_OPTS[:clean_text][:linetail_style], firstlbs_style: DEF_METHOD_OPTS[:clean_text][:firstlbs_style], lastsps_style: DEF_METHOD_OPTS[:clean_text][:lastsps_style], lb: DEF_METHOD_OPTS[:clean_text][:lb], lb_out: DEF_METHOD_OPTS[:clean_text][:lb_out], is_debug: false ) click to toggle source

Cleans the text

Such as, removing extra spaces, normalising the linebreaks, etc.

In default,

  • Paragraphs (more than 2 \n) are taken into account (one \n between two): +preserve_paragraph=true+

  • Blank lines are truncated into one line with no white spaces: +boundary_style=lb_out*2(=$/*2)+

  • Consecutive white spaces are truncated into a single space: +sps_style=:truncate+

  • White spaces before or after a CJK character is deleted: +delete_asian_space=true+

  • Preceding white spaces in each line are preserved: +linehead_style=:none+

  • Trailing white spaces in each line are deleted: +linetail_style=:delete+

  • Line-breaks at the beginning of the entire input string are deleted: +firstlbs_style=:delete+

  • Trailing white spaces and line-breaks at the end of the entire input string are truncated into a single linebreak: +lastsps_style=:truncate+

For a String with predominantly CJK characters, the following setting is recommended:

  • +lbs_style: :delete+

  • +delete_asian_space: true+ (Default)

Note for the Symbols in optional arguments, the Symbol with the first character only is accepted, e.g., :d instead of :delete (nb., :t2 for :truncate2).

For more detail, see the description of each command-line options.

Note that for the case of traditional genko-yoshi-style Japanese texts with “jisage” for each new paragraph marking a new paragraph, probably the best way is to make your own Part instance to give to this method, where the rule for the Part should be something like:

/(\A[[:blank:]]+|\n[[:space:]]+)/

@param prt [PlainText:Part, String] {Part} or String to examine. @param preserve_paragraph: [Boolean] Paragraphs are taken into account if true (Def: False). In the input, paragraphs are defined to be separated with more than one lb with potentially some space characters in between. Their output style is specified with boundary_style. @param boundary_style: [String, Symbol] One of +(:truncate|:truncate2|:delete|:none)+ or String. If String, the boundaries between paragraphs are replaced with this String (Def: +lb_out*2+). If :truncate, consecutive linebreaks and spaces are truncated into 2 linebreaks. :truncate2 are similar, but they are not truncated beyond 3 linebreaks (ie., up to 2 blank lines between Paragraphs). If :none, nothing is done about them. Unless :none, all the white spaces between linebreaks are deleted. @param lbs_style: [Symbol] One of +(:truncate|:delete|:none)+ (Def: :truncate). If :delete, all the linebreaks within paragraphs are deleted. :truncate is meaningful only when +preserve_paragraph=false+ and consecutive linebreaks are truncated into 1 linebreak. @param sps_style: [Symbol] One of +(:truncate|:delete|:none)+ (Def: :truncate). If :truncate, the consecutive white spaces within paragraphs, except for those at the line-head or line-tail (which are controlled by linehead_style and linehead_style, respectively), are truncated into a single white space. If :delete, they are deleted. @param lb_is_space: [Boolean] If true, a line-break, except those for the boundaries (unless preserve_paragraph is false), is equivalent to a space (Def: False). @param delete_asian_space: [Boolean] Any spaces between, before, after Asian characters (but punctuation) are deleted, if true (Default). @param linehead_style: [Symbol] One of +(:truncate|:delete|:none)+ (Def: :none). Determine how to handle consecutive white spaces at the beggining of each line. @param linetail_style: [Symbol] One of +(:truncate|:delete|:markdown|:none)+ (Def: :delete). Determine how to handle consecutive white spaces at the end of each line. If +:markdown, 1 space is always deleted, and two or more spaces are truncated into two ASCII whitespaces if the last two spaces are ASCII whitespaces, or else untouched. @param firstlbs_style: [Symbol, String] One of +(:truncate|:delete|:none)+ or String (Def: :default). If :truncate, any linebreaks at the very beginning of self (and whitespaces in between), if exist, are truncated to a single linebreak. If String, they are, even if not exists, replaced with the specified String (such as a linebreak). If :delete, they are deleted. Note This option has nothing to do with the whitespaces at the beginning of the first significant line (hence the name of the option). Note if a (random) Part is given, this option only considers the first significant element of it. @param lastsps_style: [Symbol, String] One of +(:truncate|:delete|:none|:linebreak)+ or String (Def: :truncate). If :truncate, any of linebreaks AND white spaces at the tail of self, if exist, are truncated to a single linebreak. If :delete, they are deleted. If String, they are, even if not exists, replaced with the specified String (such as a linebreak, in which case lb_out is used as String, i.e., it guarantees only 1 linebreak to exist at the end of the String). Note if a (random) Part is given, this option only considers the last significant element of it. @param lb: [String] Linebreak character like \n etc (Default: $/). If this is one of the standard line-breaks, irregular line-breaks (for example, existence of CR when only LF should be there) are corrected. @param lb_out: [String] Linebreak used for output (Default: lb) @return same as prt

# File lib/plain_text.rb, line 147
  def self.clean_text(
        prt,
        preserve_paragraph: DEF_METHOD_OPTS[:clean_text][:preserve_paragraph],
        boundary_style:     DEF_METHOD_OPTS[:clean_text][:boundary_style], # If unspecified, will be replaced with lb_out * 2
        lbs_style:      DEF_METHOD_OPTS[:clean_text][:lbs_style],
        lb_is_space:    DEF_METHOD_OPTS[:clean_text][:lb_is_space],
        sps_style:      DEF_METHOD_OPTS[:clean_text][:sps_style],
        delete_asian_space: DEF_METHOD_OPTS[:clean_text][:delete_asian_space],
        linehead_style: DEF_METHOD_OPTS[:clean_text][:linehead_style],
        linetail_style: DEF_METHOD_OPTS[:clean_text][:linetail_style],
        firstlbs_style: DEF_METHOD_OPTS[:clean_text][:firstlbs_style],
        lastsps_style:  DEF_METHOD_OPTS[:clean_text][:lastsps_style],
        lb:     DEF_METHOD_OPTS[:clean_text][:lb],
        lb_out: DEF_METHOD_OPTS[:clean_text][:lb_out], # If unspecified, will be replaced with lb
        is_debug: false
      )

#isdebug = true if prt == "foo\n\n\nbar\n"
    lb_out ||= lb  # Output linebreak
    boundary_style = lb_out*2 if true       == boundary_style
    boundary_style = ""       if [:delete, :d].include? boundary_style
    lastsps_style  = lb_out   if :linebreak == lastsps_style

    if !prt.class.method_defined? :last_significant_element
      # Construct a Part instance from the given String.
      ret = ''
      begin
        prt = prt.unicode_normalize
      rescue ArgumentError  # (invalid byte sequence in UTF-8)
        warn "The given String in (#{self.name}\##{__method__}) seems wrong."
        raise
      end
      prt = normalize_lb(prt, "\n", lb_from: (DefLineBreaks.include?(lb) ? nil : lb)).dup
      kwd = (["\r\n", "\r", "\n"].include?(lb) ? {} : { rules: /#{Regexp.quote lb}{2,}/})
      prt = (preserve_paragraph ? Part.parse(prt, **kwd) : Part.new([prt]))
    else
      # If not preserve_paragraph, reconstructs it as a Part with a single Paragraph.
      # Also, deepcopy is needed, as this method is destructive.
      prt = (preserve_paragraph ? prt : Part.new([prt.join])).deepcopy
    end
    prt.squash_boundaryies!  # Boundaries are squashed.

    # Handles Boundary
    clean_text_boundary!(prt, boundary_style: boundary_style)

    # Handles linebreaks and spaces (within Paragraphs)
    clean_text_lbs_sps!( prt,
      lbs_style: lbs_style,
      lb_is_space: lb_is_space,
      sps_style: sps_style,
      delete_asian_space: delete_asian_space,
      is_debug: is_debug
    )
    # Handles the line head/tails.
    clean_text_line_head_tail!( prt,
      linehead_style: linehead_style,
      linetail_style: linetail_style
    )

    # Handles the file head/tail.
    clean_text_file_head_tail!( prt,
      firstlbs_style: firstlbs_style,
      lastsps_style:  lastsps_style,
      is_debug: is_debug
    )

    # Replaces the linebreaks to the specified one
    prt.map{ |i| i.gsub!(/\n/m, lb_out) }

    (ret ? prt.join : prt)  # prt.to_s may be different from prt.join
  end
count_char(instr, *rest, lbs_style: DEF_METHOD_OPTS[:count_char][:lbs_style], linehead_style: DEF_METHOD_OPTS[:count_char][:linehead_style], lastsps_style: DEF_METHOD_OPTS[:count_char][:lastsps_style], lb_out: DEF_METHOD_OPTS[:count_char][:lb_out], **k ) click to toggle source

Count the number of characters

See {PlainText#clean_text!} for the optional parameters. The defaults of a few of the optional parameters are different from it, such as the default for lb_out is +“n”+ (newline, so that a line-break is 1 byte in size). It is so that this method is more optimized for East-Asian (CJK) characters, given this method is most useful for CJK Strings, whereas, for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.

@param instr [String] String for which the number of chars is counted @param (see count_char) @return [Integer]

# File lib/plain_text.rb, line 90
def self.count_char(instr, *rest,
      lbs_style:      DEF_METHOD_OPTS[:count_char][:lbs_style],
      linehead_style: DEF_METHOD_OPTS[:count_char][:linehead_style],
      lastsps_style:  DEF_METHOD_OPTS[:count_char][:lastsps_style],
      lb_out:         DEF_METHOD_OPTS[:count_char][:lb_out],
      **k
    )
  clean_text(instr, *rest, lbs_style: lbs_style, linehead_style: linehead_style, lastsps_style: lastsps_style, lb_out: lb_out, **k).size
end
delete_spaces_bw_cjk_european(instr, *rest) click to toggle source

Module function of {#delete_spaces_bw_cjk_european}

@param (see delete_spaces_bw_cjk_european) @return as instr

# File lib/plain_text.rb, line 223
def self.delete_spaces_bw_cjk_european(instr, *rest)
  __call_inst_method__(:delete_spaces_bw_cjk_european, instr, *rest)
end
extend_this(obj) click to toggle source

If the class of the obj does not “include” this module, do so in the singular class.

@param obj [Object] Maybe String. For which a singular class def is run, if the condition is met. @return [TrueClass, NilClass] true if the singular class def is run. Else nil.

# File lib/plain_text.rb, line 74
def self.extend_this(obj)
  return nil if defined? obj.delete_spaces_bw_cjk_european!
  obj.extend(PlainText)
  true
end
head(instr, *rest, **k) click to toggle source

Module function of {#head}

The return String includes {PlainText} as Singleton.

@param instr [String] String that is examined. @param (see head) @return as instr

# File lib/plain_text.rb, line 235
def self.head(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head, instr, *rest, **k)
end
head_inverse(instr, *rest, **k) click to toggle source

Module function of {#head_inverse}

The return String includes {PlainText} as Singleton.

@param instr [String] String that is examined. @param (see head_inverse) @return as instr

# File lib/plain_text.rb, line 247
def self.head_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:head_inverse, instr, *rest, **k)
end
normalize_lb(instr, *rest, **k) click to toggle source

Module function of {#normalize_lb}

The return String includes {PlainText} as Singleton.

@param instr [String] String that is examined. @param (see normalize_lb) @return as instr

# File lib/plain_text.rb, line 259
def self.normalize_lb(instr, *rest, **k)
  return PlainText.__call_inst_method__(:normalize_lb, instr, *rest, **k)
end
tail(instr, *rest, **k) click to toggle source

Module function of {#tail}

The return String includes {PlainText} as Singleton.

@param instr [String] String that is examined. @param (see tail) @return as instr

# File lib/plain_text.rb, line 271
def self.tail(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail, instr, *rest, **k)
end
tail_inverse(instr, *rest, **k) click to toggle source

Module function of {#tail_inverse}

The return String includes {PlainText} as Singleton.

@param instr [String] String that is examined. @param (see tail_inverse) @return as instr

# File lib/plain_text.rb, line 283
def self.tail_inverse(instr, *rest, **k)
  return PlainText.__call_inst_method__(:tail_inverse, instr, *rest, **k)
end

Private Class Methods

clean_text_boundary!( prt, boundary_style: , is_debug: false ) click to toggle source

@param prt [PlainText:Part] (see PlainText.clean_text) @param boundary_style (see PlainText.clean_text) @return [void]

@see PlainText.clean_text

# File lib/plain_text.rb, line 297
def self.clean_text_boundary!( prt,
      boundary_style: ,
      is_debug: false
    )

  # Boundary
  case boundary_style
  when String
    prt.each_boundary_with_index{|ec, i| ((i == prt.size - 1) && ec.empty?) ? ec : ec.replace(boundary_style)}
  when :truncate,  :t
    prt.boundaries.each{|ec| ec.gsub!(/[[:blank:]]+/m, ""); ec.gsub!(/\n+{3,}/m, "\n\n")}
  when :truncate2, :t2
    prt.boundaries.each{|ec| ec.gsub!(/[[:blank:]]+/m, ""); ec.gsub!(/\n+{4,}/m, "\n\n\n")}
  when :none, :n
    # Do nothing
  else
    raise ArgumentError
  end
end
clean_text_file_head_tail!( prt, firstlbs_style: , lastsps_style: , is_debug: false ) click to toggle source

@param prt [PlainText:Part] (see PlainText.clean_text#prt) @param firstlbs_style [Symbol, String] (see PlainText.clean_text#firstlbs_style) @param lastsps_style [Symbol, String] (see PlainText.clean_text#lastsps_style) @return [void]

@see PlainText.clean_text

# File lib/plain_text.rb, line 415
def self.clean_text_file_head_tail!(
      prt,
      firstlbs_style: ,
      lastsps_style:  ,
      is_debug: false
    )

  # Handles the beginning of the given Part.
  obj = prt.first_significant_element || return
  # The first significant element is either Paragraph or Background.
  # obj may be nil.

  case firstlbs_style
  when String
    # This assumes the first Background is not
    #   (1) containing any non-space characters,
    #   (2) white-spaces only AND the first Paragraph starts from a linebreak.
    # You can assume it as long as String is the original input.
    # However, if the input is Part, anything can be possible, like
    # first multiple Backgrounds contain a linebreak for each, each of which
    # follows an empty Paragraph...
    #  The thing is, if String is always returned, it is much easier
    # to process after Part#join.  However, the method may return Part.
    # Therefore, you cannot do it!
    # I explain it in the document in {self.clean_text}.
    obj.sub!(/\A([[:space:]]*\n)?/m, firstlbs_style)
  when :truncate, :t
    # The initial blank lines, if exist, are truncated to a single "\n"
    obj.sub!(/\A[[:space:]]*\n/m, "\n")
  when :delete, :d
    # The initial blank lines are deleted.
    obj.sub!(/\A[[:space:]]*\n/m, "")
  when :none, :n
    # Do nothing
  else
    raise ArgumentError, "Invalid firstlbs_style (#{firstlbs_style.inspect}) is specified."
  end

  # Handles the end of the given Part.
  ind = prt.last_significant_index
  ind_para = (prt.index_para?(ind) ? ind : ind-1) # ind_para guaranteed to be for Paragraph
  obj = Part.new(prt[ind_para, 2]).join  # Handles as a String
  case lastsps_style
  when String
    # The trailing spaces and line-breaks, even if onot exist, are replaced with a specified String.
    changed = obj.sub!(/[[:space:]]*\z/m, lastsps_style)
  when :truncate, :t
    # The trailing spaces and line-breaks, if exist, are replaced with a single `linebreak_out`.
    changed = obj.sub!(/[[:space:]]+\z/m, "\n")
  when :delete, :d
    # The trailing spaces and line-breaks are deleted.
    changed = obj.sub!(/[[:space:]]+\z/m, "")
  when :none, :n
    # Do nothing
  else
    raise ArgumentError, "Invalid lastsps_style (#{lastsps_style.inspect}) is specified."
  end

  return nil if !changed
  ma = /^#{Regexp.quote prt[ind_para]}/.match obj
  if ma
  prt[ind_para].replace ma[0]
  prt[ind_para+1].replace ma.post_match
  else
  prt[ind_para].replace obj
  prt[ind_para+1].replace ""
  end
end
clean_text_lbs_sps!( prt, lbs_style: , lb_is_space: , sps_style: , delete_asian_space: , is_debug: false ) click to toggle source

@param prt [PlainText:Part] (see PlainText.clean_text) @param lbs_style (see PlainText.clean_text) @param sps_style (see PlainText.clean_text) @param lb_is_space (see PlainText.clean_text) @param delete_asian_space (see PlainText.clean_text) @return [void]

@see PlainText.clean_text

# File lib/plain_text.rb, line 326
def self.clean_text_lbs_sps!(
      prt,
      lbs_style:          ,
      lb_is_space:        ,
      sps_style:          ,
      delete_asian_space: ,
      is_debug: false
    )

  # Linebreaks and spaces
  case lbs_style
  when :truncate,   :t
    prt.paras.each{|ec| ec.gsub!(/\n{2,}/m, "\n")}
  when :delete,   :d
    prt.paras.each{|ec| ec.gsub!(/\n/m, "")}
  when :none, :n
    # Does nothing
  else
    raise ArgumentError
  end

  # Handles spaces in each line
  clean_text_sps!(prt, sps_style: sps_style, is_debug: is_debug)

  # Linebreaks become spaces
  if lb_is_space
    prt.paras.each{|ec| ec.gsub!(/\n/m, " ")}
    clean_text_sps!(prt, sps_style: sps_style, is_debug: is_debug) if sps_style == :truncate
  end

  # Ignore spaces between, before, and after Asian characters.
  if delete_asian_space
    prt.paras.each do |ea_p|
      PlainText.extend_this(ea_p)
      ea_p.delete_spaces_bw_cjk_european!  # Destructive change in prt.
    end
  end
end
clean_text_line_head_tail!( prt, linehead_style: , linetail_style: , is_debug: false ) click to toggle source

@param prt [PlainText:Part] (see PlainText.clean_text) @param linehead_style [Symbol, String] (see PlainText.clean_text) @param linetail_style [Symbol, String] (see PlainText.clean_text) @return [void]

@see PlainText.clean_text

# File lib/plain_text.rb, line 372
def self.clean_text_line_head_tail!(
      prt,
      linehead_style: ,
      linetail_style: ,
      is_debug: false
    )

  # Head of each line
  case linehead_style
  when :truncate, :t
    prt.paras.each{|ec| ec.gsub!(/^[[:blank:]]+/, " ")}
  when :delete, :d
    prt.paras.each{|ec| ec.gsub!(/^[[:blank:]]+/, "")}
  when :none, :n
    # Do nothing
  else
    raise ArgumentError, "Invalid linehead_style (#{linehead_style.inspect}) is specified."
  end

  # Tail of each line
  case linetail_style
  when :truncate, :t
    prt.paras.each{|ec| ec.gsub!(/[[:blank:]]+$/, " ")}
  when :delete, :d
    prt.paras.each{|ec| ec.gsub!(/[[:blank:]]+$/, "")}
  when :markdown, :m
    # Two spaces are preserved
    prt.paras.each{|ec| ec.gsub!(/(?:^|(?<![[:blank:]]))[[:blank:]]$/, "")}  # A single space is deleted.
    prt.paras.each{|ec| ec.gsub!(/[[:blank:]]+  $/, "  ")}  # 3 or more spaces are truncated into 2 spaces, only IF the last two spaces are the ASCII spaces.
  when :none, :n
    # Do nothing
  else
    raise ArgumentError, "Invalid linetail_style (#{linetail_style.inspect}) is specified."
  end
end
clean_text_sps!( prt, sps_style: , is_debug: false ) click to toggle source

Handles spaces within Paragraphs

uses Part to transform a Paragraph into a Part

@param prt [PlainText:Part] (see PlainText.clean_text) @param sps_style (see PlainText.clean_text) @return [void]

@see PlainText.clean_text

# File lib/plain_text.rb, line 495
def self.clean_text_sps!(
      prt,
      sps_style: ,
      is_debug: false
    )

  prt.paras.each do |e_pa|
    # Each line treated as a Paragraph, and [[:space:]]+ between them as a Boundary.
    # Then, to work on anything within a line except for line-head/tail is easy.
    prt_para = Part.parse(e_pa, rule: ParseRule::RuleEachLineStrip).map_para { |e_li|
      case sps_style
      when :truncate, :t
        e_li.gsub(/[[:blank:]]{2,}/m, " ")
      when :delete, :d
        e_li.gsub(/[[:blank:]]+/m, "")
      when :none, :n
        e_li
      else
        raise ArgumentError
      end
    } # map_para
    e_pa.replace prt_para.join
  end
end

Public Instance Methods

count_char(*rest, **k) click to toggle source

Count the number of characters

See {PlainText.count_char} and further {PlainText.clean_text!} for the optional parameters. The defaults of a few of the optional parameters are different from the latter, such as the default for lb_out is +“n”+ (newline, so that a line-break is 1 byte in size). It is so that this method is more optimized for East-Asian (CJK) characters, given this method is most useful for CJK Strings, whereas, for European alphabets, counting the number of words, rather than characters as in this method, would be more standard.

@param (see {PlainText.count_char}) @return [Integer]

# File lib/plain_text.rb, line 535
def count_char(*rest, **k)
  PlainText.public_send(__method__, self, *rest, **k)
end
delete_spaces_bw_cjk_european(*rest) click to toggle source

Non-destructive version of {#delete_spaces_bw_cjk_european!}

@param (see delete_spaces_bw_cjk_european!) @return same class as self

# File lib/plain_text.rb, line 558
def delete_spaces_bw_cjk_european(*rest)
  newself = clone
  newself.delete_spaces_bw_cjk_european!(*rest)
  newself
end
delete_spaces_bw_cjk_european!(repl="") click to toggle source

Delete all the spaces between CJK and European characters or numbers.

All the spaces between CJK and European characters, numbers or punctuations are deleted or converted into a specified replacement character. Or, in short, any spaces between, before, and after a CJK characters are deleted. If the return is non-nil, there is at least one match.

@param repl [String] Replacement character (Default: “”). @return [MatchData, NilClass] MatchData of (one of) the last match if there is a positive match, else nil.

# File lib/plain_text.rb, line 548
def delete_spaces_bw_cjk_european!(repl="")
  ret = gsub!(/(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})([[:blank:]]+)([[:upper:][:lower:][:digit:][:punct:]])/, '\1\3')
  ret ||= gsub!(/([[:upper:][:lower:][:digit:][:punct:]])([[:blank:]]+)(\p{Hiragana}|\p{Katakana}|[ー-]|[一-龠々]|\p{Han}|\p{Hangul})/, '\1\3')
end
head(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) click to toggle source

Returns the first num lines (or characters, bytes) or before the last n-th line.

If “byte” is specified as the return unit, the encoding is the same as self, though the encoding for the returned String may not be valid anymore. Note that it is probably the better practice to use +string[ 0..5 ]+ and +string#byteslice(0,5)+ instead of this method for the units of “char” and “byte”, respectively.

For num, a negative number means counting from the last (e.g., -1 (lines, if unit is :line) means everything but the last 1 line, and -5 means everything but the last 5 lines), whereas 0 is forbidden. If a too big negative number is given, such as -9 for String of 2 lines, a null string is returned.

If unit is :line, num can be Regexp, in which case the string of the lines up to the first line that matches the given Regexp is returned, where the process is based on the lines. For example, if num is /ABC/ (Regexp), String of the lines from the beginning up to the line that contains the character +“ABC”+ is returned.

@param num_in [Integer, Regexp] Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. @param unit: [Symbol, String] One of :line (or +“-n”+), :char, :byte (or +“-c”+) @param inclusive: [Boolean] read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result. @param linebreak: [String] \n etc (Default: +$/+), used when +unit==:line+ (Default) @return [String] as self

# File lib/plain_text.rb, line 593
def head(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/)
  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "Non-positive num (#{num_in}) is given in #{__method__}" if num.to_int < 1
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, "-n"
    # Regexp (for boundary)
    return head_regexp(re_in, inclusive: inclusive, padding: padding, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    ret = split(linebreak, -1)[0..(num-1)].join(linebreak)  # -1 is specified to preserve the last linebreak(s).
    return ret if size <= ret.size  # Specified line is larger than the original or the last NL is missing.
    return(ret << linebreak)  # NL is added to the tail as in the original.
  when :char
    return self[0..(num-1)]
  when :byte, "-c"
    return self.byteslice(0..(num-1))
  else
    raise ArgumentError, "Specified unit (#{unit}.inspect) is invalid in #{__method__}"
  end
end
head!(*rest, **key) click to toggle source

Destructive version of {#head}

@param (see head) @return [self]

# File lib/plain_text.rb, line 569
def head!(*rest, **key)
  replace(head(*rest, **key))
end
head_inverse(*rest, **key) click to toggle source

Inverse of head - returns the content except for the first num lines (or characters, bytes)

@param (see head) @return same as self

# File lib/plain_text.rb, line 634
def head_inverse(*rest, **key)
  s2 = head(*rest, **key)
  (s2.size >= size) ? self[0,0] : self[s2.size..-1]
end
head_inverse!(*rest, **key) click to toggle source

Destructive version of {#head_inverse}

@param (see head_inverse) @return [self]

# File lib/plain_text.rb, line 626
def head_inverse!(*rest, **key)
  replace(head_inverse(*rest, **key))
end
normalize_lb(*rest, **k) click to toggle source

Non-destructive version of {#normalize_lb!}

@param (see normalize_lb!) @return same class as self

# File lib/plain_text.rb, line 663
def normalize_lb(*rest, **k)
  newself = clone  # must be clone (not dup) so Singlton methods, which may include this method, must be included.
  newself.normalize_lb!(*rest, **k)
  newself
end
normalize_lb!(repl=$/, lb_from: nil) click to toggle source

Normalizes line-breaks

All the line-breaks of self are converted into a new character or n If the return is non-nil, self contains unexpected line-break characters for the OS.

@param repl [String] Replacement character (Default: +$/+ which is \n in UNIX). @param lb_from [String, Array, NilClass] Candidate line-break(s) (Defaut: +[CR+LF, CR, LF]+) @return [MatchData, NilClass] MatchData of the last match if there is non-$/ match, else nil.

# File lib/plain_text.rb, line 648
def normalize_lb!(repl=$/, lb_from: nil)
  ret = nil
  lb_from ||= DefLineBreaks
  lb_from = [lb_from].flatten
  lb_from.each do |ea_lb|
    gsub!(/#{ea_lb}/, repl) if ($/ != ea_lb) || ($/ == ea_lb && repl != ea_lb)
    ret = $~ if ($/ != ea_lb) && !ret
  end
  ret
end
strip_at_lines(*rest, **k) click to toggle source

Non-destructive version of {#strip_at_lines!}

@param (see strip_at_lines!) @return same class as self

# File lib/plain_text.rb, line 688
def strip_at_lines(*rest, **k)
  newself = clone  # must be clone (not dup) so Singlton methods, which may include this method, must be included.
  newself.strip_at_lines!(*rest, **k)
  newself
end
strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/) click to toggle source

String#strip! for each line

@param strip_head: [Boolean] if true (Default), spaces at each line head are removed. @param strip_tail: [Boolean] if true (Default), spaces at each line tail are removed (see markdown option). @param markdown: [Boolean] if true (Def: false), a double space at each tail remains and strip_head is forcibly false. @param linebreak: [String] \n etc (Default: +$/+) @return [self, NilClass] nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File lib/plain_text.rb, line 677
def strip_at_lines!(strip_head: true, strip_tail: true, markdown: false, linebreak: $/)
  strip_head = false if markdown
  r1 = strip_at_lines_head!(                    linebreak: linebreak) if strip_head
  r2 = strip_at_lines_tail!(markdown: markdown, linebreak: linebreak) if strip_tail
  (r1 || r2) ? self : nil
end
strip_at_lines_head(*rest, **k) click to toggle source

Non-destructive version of {#strip_at_lines_head!}

@param (see strip_at_lines_head!) @return same class as self

# File lib/plain_text.rb, line 708
def strip_at_lines_head(*rest, **k)
  newself = clone  # must be clone (not dup) so Singlton methods, which may include this method, must be included.
  newself.strip_at_lines_head!(*rest, **k)
  newself
end
strip_at_lines_head!(linebreak: $/) click to toggle source

String#strip! for each line but only for the head part (NOT tail part)

@param linebreak: [String] “n” etc (Default: $/) @return [self, NilClass] nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File lib/plain_text.rb, line 699
def strip_at_lines_head!(linebreak: $/)
  lb_quo = Regexp.quote linebreak
  gsub!(/(\A|#{lb_quo})[[:blank:]]+/m, '\1')
end
strip_at_lines_tail(*rest, **k) click to toggle source

Non-destructive version of {#strip_at_lines_tail!}

@param (see strip_at_lines_tail!) @return same class as self

# File lib/plain_text.rb, line 732
def strip_at_lines_tail(*rest, **k)
  newself = clone  # must be clone (not dup) so Singlton methods, which may include this method, must be included.
  newself.strip_at_lines_tail!(*rest, **k)
  newself
end
strip_at_lines_tail!(markdown: false, linebreak: $/) click to toggle source

String#strip! for each line but only for the tail part (NOT head part)

@param markdown: [Boolean] if true (Def: false), a double space at each tail remains. @param linebreak: [String] “n” etc (Default: $/) @return [self, NilClass] nil if gsub! does not match at all, i.e., there are no spaces to remove.

# File lib/plain_text.rb, line 719
def strip_at_lines_tail!(markdown: false, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  return gsub!(/(?<=^|[^[:blank:]])[[:blank:]]+(#{lb_quo}|\z)/m, '\1') if ! markdown

  r1 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]]{3,}(#{lb_quo}|\z)/m, '\1')
  r2 = gsub!(/(?<=^|[^[:blank:]])[[:blank:]](#{lb_quo}|\z)/m, '\1')
  (r1 || r2) ? self : nil
end
tail(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/) click to toggle source

Returns the last num lines (or characters, bytes) or of and after the first n-th line.

If “byte” is specified as the return unit, the encoding is the same as self, though the encoding for the returned String may not be valid anymore. Note that it is probably the better practice to use +string[ -5..-1 ]+ and +string#byteslice(-5,5)+ instead of this method for the units of “char” and “byte”, respectively.

For num, a negative number means counting from the first (e.g., -1 [lines, if unit is :line] means everything but the first 1 line, and -5 means everything but the first 5 lines), whereas 0 is forbidden. If a too big negative number is given, such as -9 for String of 2 lines, a null string is returned.

If unit is :line, num can be Regexp, in which case the string of the lines after the first line that matches the given Regexp is returned (not inclusive), where the process is based on the lines. For example, if num is /ABC/, String of the lines from the next line of the first line that contains the character “ABC” till the last one is returned. “The next line” means (1) the line immediately after the match if the matched string has the linebreak at the end, or (2) the line after the first linebreak after the matched string, where the trailing characters after the matched string to the linebreak (inclusive) is ignored.

Tips =

To specify the last line that matches the Regexp, consider prefixing +(?:.*)+ with the option m, e.g., +/(?:.*)ABC/m+

Note for developers =

The line that matches with Regexp has to be exclusive. Because otherwise to specify the last line that matches would be impossible in principle. For example, to specify the last line that matches ABC, the given regexp should be +/(?:.*)ABC/m+ (see the above Tips); in this case, if this matched line was inclusive, *all the lines from Line 1* would be included, which is most likely not what the caller wants.

@param num_in [Integer, Regexp] Number (positive or negative, but not 0) of :unit to extract (Def: 10), or Regexp, which is valid only if unit is :line. If positive, the last num_in lines are returned. If negative, the lines from the num-in-th line from the head are returned. In short, calling this method as +tail(3)+ and +tail(-3)+ is similar to the UNIX commands “tail -n 3” and “tail -n +3”, respectively. @param unit: [Symbol] One of :line (as in -n option), :char, :byte (-c option) @param inclusive: [Boolean] read only when unit is :line. If inclusive (Default), the (entire) line that matches is included in the result. @param linebreak: [String] \n etc (Default: +$/+), used when unit==:line (Default) @return [String] as self

# File lib/plain_text.rb, line 781
def tail(num_in=DEF_HEADTAIL_N_LINES, unit: :line, inclusive: true, padding: 0, linebreak: $/)

  if num_in.class.method_defined? :to_int
    num = num_in.to_int
    raise ArgumentError, "num of zero is given in #{__method__}" if num == 0
    num += 1 if num < 0
  elsif num_in.class.method_defined? :named_captures
    re_in = num_in
  else
    raise raise_typeerror(num_in, 'Integer or Range')
  end

  case unit
  when :line, '-n'
    # Regexp (for boundary)
    return tail_regexp(re_in, inclusive: inclusive, padding: padding, linebreak: linebreak) if re_in

    # Integer (a number of lines)
    return tail_linenum(num_in, num, linebreak: linebreak)
  when :char
    num = 0 if num >= size && num_in > 0
    return self[(-num)..-1]
  when :byte, '-c'
    num = 0 if num >= bytesize && num_in > 0
    return self.byteslice((-num)..-1)
  else
    raise ArgumentError, "Specified unit (#{unit}.inspect) is invalid in #{__method__}"
  end
end
tail!(*rest, **key) click to toggle source

Destructive version of {#tail}

@param (see tail) @return [self]

# File lib/plain_text.rb, line 743
def tail!(*rest, **key)
  replace(tail(*rest, **key))
end
tail_inverse(*rest, **key) click to toggle source

Inverse of tail - returns the content except for the first num lines (or characters, bytes)

@param (see tail) @return same as self

# File lib/plain_text.rb, line 823
def tail_inverse(*rest, **key)
  s2 = tail(*rest, **key)
  (s2.size >= size) ? self[0,0] : self[0..(size-s2.size-1)]
end
tail_inverse!(*rest, **key) click to toggle source

Destructive version of {#tail_inverse}

@param (see tail_inverse) @return [self]

# File lib/plain_text.rb, line 815
def tail_inverse!(*rest, **key)
  replace(tail_inverse(*rest, **key))
end

Private Instance Methods

_ilines_consecutive(str, linebreak: $/) click to toggle source

Returns line number of the three lines.

In the following order:

  1. Index of the last line of the argument String (number of lines - 1) (or nil if it becomes negative).

  2. the 1st one plus 1 IF the last line ends with a linebreak. Or, 0 if the first one is empty. Otherwise the same as the 1st.

@return [Array<Integer, nil>] nil if the value is invalid.

# File lib/plain_text.rb, line 952
def _ilines_consecutive(str, linebreak: $/)
  return [nil, 0] if str.empty?

  # Only if first ends with a linebreak, increment by 1 for the second.
  nmatched, with_linened = PlainText::Split.count_regexp(str, linebreak, with_if_end: true)
  first  = nmatched + (with_linened ? 0 : 1) - 1
  second = first    + (with_linened ? 1 : 0)
  first  = nil if first < 0
  [first, second]
end
_matched_line_indices(mat, strpre=nil, linebreak: $/) click to toggle source

Get the line index numbers of the first and last lines of the mathced string

It is the Ruby index number as used in the each_line method.

Matched String can span for multi-lines.

Note if matched string is empty, it still is treated as significant.

@param mat [MatchData, String] If String, it is User's (last) matched String. @param strpre [String, nil] Pre-match from the beginning of self to the mathced string, if mat is String. @param linebreak: [String] \n etc (Default: $/) @return [Hash<Integer, nil>] 4 keys: :last_prematch, :first_matched, :last_matched, :first_post_match

# File lib/plain_text.rb, line 922
def _matched_line_indices(mat, strpre=nil, linebreak: $/)
  if mat.class.method_defined? :post_match
    # mat is MatchData
    strmatched, strpre = mat[0], mat.pre_match
  else
    strmatched = mat.to_str  rescue raise_typeerror(mat, 'String')
  end

  hsret = {
    # last_prematch: nil,
    first_matched: nil,
    last_matched: nil,
    # first_post_match: nil
  }

  _, hsret[:first_matched] = _ilines_consecutive(strpre, linebreak: linebreak)
  hsret[:last_matched], _  = _ilines_consecutive(strpre+strmatched, linebreak: linebreak)

  hsret
end
head_regexp(re_in, inclusive: true, padding: 0, linebreak: $/) click to toggle source

head command with Regexp

@todo Improve the algorithm like {#tail_regexp}

@param re_in [Regexp] Regexp to determine the boundary. @param inclusive: [Boolean] If true (Default), the (entire) line that matches re_in is included in the result. Else the entire line is excluded. @param padding: [Integer] Add (postive/negative) the number of lines returned. @param linebreak: [String] \n etc (Default: $/). @return [String] as self @see head

# File lib/plain_text.rb, line 843
def head_regexp(re_in, inclusive: true, padding: 0, linebreak: $/)
  return self if empty?
  mat = re_in.match self
  return self if !mat
  if inclusive
    postmat = post_match_in_line(mat, linebreak: linebreak)
    strmain = mat.pre_match+mat[0]+(postmat ? postmat[0] : "")
  else
    strmain = pre_match_in_line(mat.pre_match, linebreak: linebreak).pre_match
  end
  return strmain if padding == 0

  ## Adds paddig

  lines_main_deli, nlines_main = split_lines_with_nlines(strmain, linebreak: linebreak)
  n_lines2ret = nlines_main + padding

  return self[0,0] if n_lines2ret <= 0  # padding is too negatively large and hence no lines left.
    # [0,0] instead of "" is used to preserve its class (in case it is a child class of String).
  return lines_main_deli[0..(n_lines2ret*2-1)].join if padding < 0

  # positive padding
  lines_self_deli, nlines_self = split_lines_with_nlines(linebreak: linebreak)
  return self if n_lines2ret > nlines_self
  return lines_self_deli[0..(n_lines2ret*2-1)].join
end
post_match_in_line(mat, strpost=nil, linebreak: $/) click to toggle source

Returns MatchData of the String after the MatchData to the linebreak (inclusive)

@param mat [MatchData, String] If String, it is User's (last) matched String. @param strpost [String, nil] Post-match, if mat is String. After User's last match. @param linebreak: [String] \n etc (Default: $/) @return [MatchData] m is the string after matched data and up to the next first linebreak (inclusive) (or empty string if the last character(s) of matched data is the linebreak) and m.post_match is all the lines after that. (maybe nil?? not sure…)

# File lib/plain_text.rb, line 970
def post_match_in_line(mat, strpost=nil, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  if mat.class.method_defined? :post_match
    # mat is MatchData
    strmatched, strpost = mat[0], mat.post_match
  else
    strmatched = mat.to_str rescue raise_typeerror(mat, 'String')
  end
  return(/\A/.match strpost) if /#{lb_quo}\z/ =~ strmatched
  return(/\A/.match strpost) if strpost.empty?
  /.*?#{lb_quo}/m.match strpost  # non-greedy match and m option are required, as linebreak can be any characters.
end
pre_match_in_line(strpre, linebreak: $/) click to toggle source

Returns MatchData of the String at and before the first linebreak before the MatchData (inclusive)

Basically this returns String after the last linebreak of the input

@example

pre_match_in_line("1\n2\n__abc")  # => #<MatchData "__abc"> pre_match=="1\n2\n"
pre_match_in_line("1\n2\n")       # => #<MatchData ""     > pre_match=="1\n2\n"
pre_match_in_line(      "__abc")  # => #<MatchData "__abc"> pre_match=="     "

@param strpre [String] String of prematch of the last MatchData @param linebreak: [String] \n etc (Default: $/) @return [MatchData] m is the string after the last linebreak before the matched data (exclusive) and m.pre_match is all the lines before that.

# File lib/plain_text.rb, line 903
def pre_match_in_line(strpre, linebreak: $/)
  lb_quo = Regexp.quote linebreak
  return /\z/.match(strpre) if /#{lb_quo}\z/ =~ strpre
  /(?:^|(?<=#{lb_quo}))[^#{lb_quo}]*?\z/m.match strpre  # non-greedy match and m option are required, as linebreak can be any characters.
end
split_lines_with_nlines(str=self, linebreak: $/) click to toggle source

Returns Array of [Line,NL,Line,NL, …] and number of lines

Wrapper of {PlainText::Split::split_with_delimiter}(str, linebreak)

See {PlainText::Split::count_lines} (and {PlainText::Split#count_lines})

@param str [String] @return [Array<Array, Integer>]

# File lib/plain_text.rb, line 879
def split_lines_with_nlines(str=self, linebreak: $/)
  arline = PlainText::Split::split_with_delimiter(str, linebreak)
  nlines = arline.size
  nlines += 1 if nlines.odd?  # There is no newline at the tail.

  # This is the NUMBER of the lines (NOT index)
  nlines = nlines.quo 2  # "/" MUST be fine...
  [arline, nlines]
end
tail_linenum(num_in, num, linebreak: $/) click to toggle source

tail command based on the number of lines

@param num_in [Integer] Original argument of the specified number of lines @param num [Integer] Converted integer for num_in @param linebreak: [String] \n etc (Default: $/). @return [String] as self @see tail

# File lib/plain_text.rb, line 1035
def tail_linenum(num_in, num, linebreak: $/)
  arret = split(linebreak, -1)  # -1 is specified to preserve the last linebreak(s).
  return self[0,0] if arret.empty?

  lb_quo = Regexp.quote linebreak
  if num_in > 0
    num += 1 if /#{lb_quo}\z/ =~ self
    num = 0  if num >= arret.size
  end
  ar = arret[(-num)..-1]
  (ar.nil? || ar.empty?) ? self[0,0] : ar.join(linebreak)
end
tail_regexp(re_in, inclusive: true, padding: 0, linebreak: $/) click to toggle source

tail command with Regexp

Algorithm

  1. Split the String with Regexp with {PlainText::Split#split_with_delimiter}

  2. Find the last matched String.

  3. Find the “line”-index-number of the matched String.

  4. Adjust the line index number depending inclusive/exclusive

  5. Add positive/negative padding number

  6. pass it to {#head_inverse} (after Line-1).

@param re_in [Regexp] Regexp to determine the boundary. @param inclusive: [Boolean] If true (Default), the (entire) line that matches re_in is included in the result. Else the entire line is excluded. @param linebreak: [String] \n etc (Default: $/). @return [String] as self @see tail

# File lib/plain_text.rb, line 1000
def tail_regexp(re_in, inclusive: true, padding: 0, linebreak: $/)
  return self if empty?

  # "split" with the given Regexp pattern (NOT with linebreak!)
  #   arst == (Array<String>) [PreMatch, UsersMatch1, Str, UsersMatch2,..., UsersMatchN [, PostMatch]]
  #   If the user's match comes at the very end, the last element does not exist.
  arst = split_with_delimiter re_in  # PlainText::Split#split_with_delimiter (included in this module (hence maybe String))

  # UsersMatch basically failed - no match.
  return self[0,0] if arst.size <= 1 

  arst.push "" if arst.size.even?
  # Now the last component is guarantee to be not the delimiter (= String of User's match)
  #   arst == [PreMatch, UsersMatch1, ..., UsersMatchN, PostMatch(maybe-empty)]
  # Minimum:
  #   arst == [PreMatch, UsersMatch1, PostMatch]  (<= maybe much more PreMatch-es)
  #   (Either/both PreMatch and PostMatch can be empty).

  hslinenum = _matched_line_indices(arst[-2], arst[0..-3].join, linebreak: $/)

  # Note: hslinenum[] is for indices, whereas the number of lines is
  # required here to pass to head_inverse().
  nlines_remove_to = (inclusive ? hslinenum[:first_matched] : hslinenum[:last_matched]+1) - padding
  return self if nlines_remove_to <= 0  # everything
  return head_inverse(nlines_remove_to)
end