[GitHub] mbeckerle commented on a change in pull request #61: Base64, gzip, and line-folding layering


mbeckerle commented on a change in pull request #61: Base64, gzip, and line-folding layering
URL: https://github.com/apache/incubator-daffodil/pull/61#discussion_r185883290
 
 

 ##########
 File path: daffodil-runtime1/src/main/scala/org/apache/daffodil/layers/LineFoldedTransformer.scala
 ##########
 @@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.layers
+
+import org.apache.daffodil.schema.annotation.props.gen.LayerLengthKind
+import org.apache.daffodil.schema.annotation.props.gen.LayerLengthUnits
+import org.apache.daffodil.util.Maybe
+import org.apache.daffodil.processors.TermRuntimeData
+import org.apache.daffodil.processors.LayerLengthInBytesEv
+import org.apache.daffodil.processors.LayerBoundaryMarkEv
+import org.apache.daffodil.processors.LayerCharsetEv
+import org.apache.daffodil.processors.parsers.PState
+import java.nio.charset.StandardCharsets
+import org.apache.daffodil.exceptions.Assert
+import org.apache.daffodil.processors.unparsers.UState
+import org.apache.daffodil.io.LayerBoundaryMarkInsertingJavaOutputStream
+import java.io.OutputStream
+import java.io.InputStream
+import org.apache.daffodil.exceptions.ThrowsSDE
+import org.apache.daffodil.schema.annotation.props.Enum
+import org.apache.daffodil.io.RegexLimitingStream
+
+/*
+ * This and related classes implement so-called "line folding" from
+ * IETF RFC 2822 Internet Message Format (IMF), and IETF RFC 5545 iCalendar.
+ *
+ * There are multiple varieties of line folding, and it is important to
+ * be specific about which algorithm.
+ *
+ * For IMF, unfolding simply removes CRLFs if they are followed by a space or tab.
+ * Folding is more complex, however, as CRLFs can only be inserted before
+ * a space/tab that appears in the data. If the data has no spaces, then no
+ * folding is possible.
+ * If there are spaces/tabs, the one closest to position 78 is used unless it is
+ * followed by punctuation, in which case a prior space/tab (if it exists) is used.
+ * (This preference for spaces not followed by punctuation is optional: the
+ * IMF RFC recommends it but does not require it.)
+ *
+ * Note: folding is done by some systems in a manner that does not respect
+ * character boundaries - i.e., in utf-8, a multi-byte character sequence may be
+ * broken in the middle by insertion of a CRLF. Hence, unfolding initially treats
+ * the text as iso-8859-1, i.e., just bytes, and removes CRLFs, then subsequently
+ * re-interprets the bytes as the expected charset such as utf-8.
+ *
+ * IMF is supposed to be US-ASCII, but implementations have gone to 8-bit characters
+ * being preserved, so the above problem can occur.
+ *
+ * IMF has a maximum line length of 998 characters per line excluding the CRLF.
+ * The layer will fail (cause a parse error) if a line longer than this is encountered
+ * or constructed after unfolding. When unparsing, if a line longer than 998 cannot be
+ * folded due to no spaces/tabs being present in it, then it is an unparse error.
+ *
+ * Note that i/vCalendar, vCard, and MIME payloads held by IMF do not run into
+ * the IMF line length issues, in that they have their own line length limits that
+ * are smaller than those of IMF, and which do not require accommodation by having
+ * pre-existing spaces/tabs in the data. So the lines of such data will *always*
+ * be short enough.
+ *
+ * For vCard, iCalendar, and vCalendar, the maximum is 75 bytes plus the CRLF, for
+ * a total of 77. Folding is done by inserting CRLF + a space or tab. The
+ * CRLF and the following space or tab are removed to unfold. If data happened to
+ * contain a CRLF followed by a space or tab initially, then that will be lost when
+ * the data is parsed.
+ *
+ * For MIME, the maximum line length is 76.
+ */
+sealed trait LineFoldMode extends LineFoldMode.Value
+object LineFoldMode extends Enum[LineFoldMode] {
+
+  case object IMF extends LineFoldMode; forceConstruction(IMF)
+  case object iCalendar extends LineFoldMode; forceConstruction(iCalendar)
+
+  override def apply(name: String, context: ThrowsSDE): LineFoldMode = stringToEnum("lineFoldMode", name, context)
+}
+
+/**
+ * For line folded, the notion of "delimited" means that the element is a "line"
+ * that ends with CRLF, except that if it is long, it will be folded, which involves
+ * inserting/removing CRLF+Space (or CRLF+TAB). A CRLF not followed by space or tab
+ * is ALWAYS the actual "delimiter". There's no means of supplying a specific delimiter.
+ */
+class LineFoldedTransformerDelimited(mode: LineFoldMode)
+  extends LayerTransformer {
+
+  override protected def wrapLimitingStream(jis: java.io.InputStream, state: PState) = {
+    // regex means CRLF not followed by space or tab.
+    // NOTE: this regex cannot contain ANY capturing groups (per scaladoc on RegexLimitingStream)
+    val s = new RegexLimitingStream(jis, "\\r\\n(?!(?:\\t|\\ ))", "\r\n", StandardCharsets.ISO_8859_1)
+    s
+  }
+
+  override protected def wrapLimitingStream(jos: java.io.OutputStream, state: UState): java.io.OutputStream = {
+    //
+    // Q: How do we insert a CRLF "not followed by tab or space" when we don't
+    // control what follows?
+    // A: We don't. This is the nature of the format. If what follows could begin
+    // with a space or tab, then the format can't use a line-folded layer.
+    //
+    val newJOS = new LayerBoundaryMarkInsertingJavaOutputStream(jos, "\r\n", StandardCharsets.ISO_8859_1)
+    newJOS
+  }
+
+  override protected def wrapLayerDecoder(jis: java.io.InputStream): java.io.InputStream = {
+    val s = new LineFoldedInputStream(mode, jis)
+    s
+  }
+  override protected def wrapLayerEncoder(jos: java.io.OutputStream): java.io.OutputStream = {
+    val s = new LineFoldedOutputStream(mode, jos)
+    s
+  }
+}
+
+/**
+ * For line folded, the 'implicit' length kind means that the region continues
+ * to end of data. At top level this would be the "whole file/stream" but this can
+ * also be used with a specified-length enclosing element. This code cannot tell
+ * the difference.
+ */
+class LineFoldedTransformerImplicit(mode: LineFoldMode)
+  extends LayerTransformer {
+
+  override protected def wrapLimitingStream(jis: java.io.InputStream, state: PState) = {
+    jis // no limiting - just pull input until EOF.
+  }
+
+  override protected def wrapLimitingStream(jos: java.io.OutputStream, state: UState): java.io.OutputStream = {
+    jos // no limiting - just write output until EOF.
+  }
+
+  override protected def wrapLayerDecoder(jis: java.io.InputStream): java.io.InputStream = {
+    val s = new LineFoldedInputStream(mode, jis)
+    s
+  }
+  override protected def wrapLayerEncoder(jos: java.io.OutputStream): java.io.OutputStream = {
+    val s = new LineFoldedOutputStream(mode, jos)
+    s
+  }
+}
+
+sealed abstract class LineFoldedTransformerFactory(mode: LineFoldMode, name: String)
+  extends LayerTransformerFactory(name) {
+
+  override def newInstance(maybeLayerCharsetEv: Maybe[LayerCharsetEv],
+    maybeLayerLengthKind: Maybe[LayerLengthKind],
+    maybeLayerLengthInBytesEv: Maybe[LayerLengthInBytesEv],
+    maybeLayerLengthUnits: Maybe[LayerLengthUnits],
+    maybeLayerBoundaryMarkEv: Maybe[LayerBoundaryMarkEv],
+    trd: TermRuntimeData): LayerTransformer = {
+
+    trd.schemaDefinitionUnless(maybeLayerLengthKind.isDefined,
+      "The property dfdl:layerLengthKind must be defined.")
+
+    val xformer =
+      maybeLayerLengthKind.get match {
+        case LayerLengthKind.BoundaryMark => {
+          new LineFoldedTransformerDelimited(mode)
+        }
+        case LayerLengthKind.Implicit => {
+          new LineFoldedTransformerImplicit(mode)
+        }
+        case x =>
+          trd.SDE("Property dfdl:layerLengthKind can only be 'implicit' or 'boundaryMark', but was '%s'",
+            x.toString)
+      }
+    xformer
+  }
+}
+
+object IMFLineFoldedTransformerFactory
+  extends LineFoldedTransformerFactory(LineFoldMode.IMF, "lineFolded_IMF")
+
+object ICalendarLineFoldedTransformerFactory
+  extends LineFoldedTransformerFactory(LineFoldMode.iCalendar, "lineFolded_iCalendar")
+
+/**
+ * Doesn't enforce 998 max line length limit.
+ *
+ * This is a state machine, so of course must be used only on a single thread.
+ */
+class LineFoldedInputStream(mode: LineFoldMode, jis: InputStream)
+  extends InputStream {
+
+  object State extends org.apache.daffodil.util.Enum {
+    abstract sealed trait Type extends EnumValueType
+
+    /**
+     * No state. Read a character, and if CR, go to state GotCR.
+     */
+    case object Start extends Type
+
+    /**
+     * Read another character and if LF go to state GotCRLF.
+     */
+    case object GotCR extends Type
+
+    /**
+     * Read another character and if SP/TAB then what we do depends on
+     * IMF or iCalendar mode.
+     *
+     * In iCalendar mode we just go to Start and iterate again,
+     * effectively absorbing the CR, the LF, and the sp/tab.
+     *
+     * In IMF mode we change state to Start, but we return the sp/tab so that
+     * we've effectively absorbed the CRLF, but not the space/tab character.
+     */
+    case object GotCRLF extends Type
+
+    /**
+     * We have a single saved character. Return it, go to Start state
+     */
+    case object Buf1 extends Type
+
+    /**
+     * We have 2 saved characters. They must be a LF, then the next character.
+     * Return the LF and go to state Buf1.
+     */
+    case object Buf2 extends Type
+
+    /**
+     * Done. Always return -1, stay in state Done
+     */
+    case object Done extends Type
+  }
+
+  private var c: Int = -2
+  private var state: State.Type = State.Start
+
+  /**
+   * Assumes an ascii-family encoding, but reads it byte at a time regardless
+   * of the encoding. This enables it to handle data where a CRLF was inserted
+   * to limit line length, and that insertion broke up a multi-byte character.
+   *
+   * Does not detect errors such as isolated \r or isolated \n. Leaves those
+   * alone. Does not care if lines are in fact less than any limit in length.
+   *
+   * All this does is remove \r\n[\ \t], replacing it with just the space or tab (IMF),
+   * or with nothing (iCalendar).
+   *
+   */
+  override def read(): Int = {
+    import State._
+    while (state != Done) {
+      state match {
+        case Start => {
+          c = jis.read()
+          c match {
+            case -1 => {
+              state = Done
+              return -1
+            }
+            case '\r' => {
+              state = GotCR
+            }
+            case _ => {
+              // state stays Start
+              return c
+            }
+          }
+        }
+        case GotCR => {
+          c = jis.read()
+          c match {
+            case -1 => {
+              state = Done
+              return '\r'
+            }
+            case '\n' => {
+              state = GotCRLF
+            }
+            case _ => {
+              state = Buf1
+              return '\r' // c stays saved; state Buf1 returns it on the next call
 
 Review comment:
   Good catch. 
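
For readers following the thread, here is a minimal sketch of the unfolding behavior that the comments in the diff describe. It is not the code from the pull request: the names `UnfoldSketch`, `Mode`, and `unfold` are hypothetical, and it works on a whole byte array rather than as a streaming state machine. It assumes that an isolated CR, an isolated LF, and a CRLF not followed by space or tab should all pass through unchanged.

```scala
object UnfoldSketch {
  // Hypothetical stand-ins for the two LineFoldMode values in the diff.
  sealed trait Mode
  case object IMF extends Mode
  case object ICalendar extends Mode

  private val CR: Byte = '\r'.toByte
  private val LF: Byte = '\n'.toByte
  private val SP: Byte = ' '.toByte
  private val TAB: Byte = '\t'.toByte

  /**
   * Removes CRLF+SP/TAB folds. Works on raw bytes, so a multi-byte
   * character that was split by a fold is rejoined before any decoding.
   */
  def unfold(mode: Mode, in: Array[Byte]): Array[Byte] = {
    val out = new java.io.ByteArrayOutputStream(in.length)
    var i = 0
    while (i < in.length) {
      val isFold =
        in(i) == CR &&
          i + 2 < in.length &&
          in(i + 1) == LF &&
          (in(i + 2) == SP || in(i + 2) == TAB)
      if (isFold) {
        // IMF keeps the whitespace character; iCalendar drops it as well.
        if (mode == IMF) out.write(in(i + 2))
        i += 3
      } else {
        // Everything else, including an isolated CR, is copied unchanged.
        out.write(in(i))
        i += 1
      }
    }
    out.toByteArray
  }
}
```

An isolated CR followed by an ordinary character must yield the CR first and then that character, which is the case the `GotCR` branch has to handle.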

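The boundary-mark regex passed to `RegexLimitingStream` in the diff can be tried in isolation with plain `java.util.regex`. The sketch below is only an illustration: `BoundaryRegexSketch` and `firstBoundary` are hypothetical names, and it matches against a complete string, whereas the real stream presumably has to look ahead in its input before deciding.

```scala
import java.util.regex.Pattern

object BoundaryRegexSketch {
  // Same pattern as in the diff: CRLF not followed by tab or space.
  // It contains no capturing groups, only a lookahead and a non-capturing group.
  val boundary: Pattern = Pattern.compile("\\r\\n(?!(?:\\t|\\ ))")

  /** Index of the first CRLF that really ends the line, or -1 if there is none. */
  def firstBoundary(s: String): Int = {
    val m = boundary.matcher(s)
    if (m.find()) m.start() else -1
  }
}
```

A CRLF followed by a space or tab is a fold and is skipped; the first CRLF followed by anything else (or by end of input) is reported as the delimiter.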